Database Systems - DDE, MDU, Rohtak

Database Systems
BCA-204
Directorate of Distance Education
Maharshi Dayanand University
ROHTAK – 124 001
Copyright © 2002, Maharshi Dayanand University, ROHTAK
All Rights Reserved. No part of this publication may be reproduced or stored in a retrieval system or
transmitted in any form or by any means; electronic, mechanical, photocopying, recording or otherwise,
without the written permission of the copyright holder.
Maharshi Dayanand University
ROHTAK – 124 001
Developed & Produced by EXCEL BOOKS, A-45 Naraina, Phase 1, New Delhi-110028
Contents

UNIT 1: DATA MODELLING FOR A DATABASE
Introduction
Database
Benefits of the Database Approach
Structure of DBMS
DBA
Records
Files
Abstraction
Integration

UNIT 2: DATABASE MANAGEMENT SYSTEM
Data Model
ER Analysis
Record Based Logical Model
Relational Model
Network Model
Hierarchical Model

UNIT 3: RELATIONAL DATA MANIPULATION
Relational Algebra
Relational Calculus
SQL

UNIT 4: RELATIONAL DATABASE DESIGN
Introduction
Functional Dependencies
Normalisation
First Normal Form
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
Fourth Normal Form
Fifth Normal Form

UNIT 5: QUERY PROCESSING
Introduction
Query Processor
General Strategies for Query Processing
Query Optimization
Concept of Security
Concurrency
Recovery

UNIT 6: DATABASE DESIGN PROJECT
Definition and Analysis of Existing Systems
Data Analysis
Preliminary and Final Design
Testing & Implementation
Maintenance
Operation and Tuning

UNIT 7: USE OF RELATIONAL DBMS PACKAGE FOR CLASS PROJECT
Implementation of SQL using Oracle RDBMS

APPENDIX
Introduction
Database
Benefits of the Database Approach
Structure of DBMS
Database Administrator
Records and Record Types
Files
Data Integrity Constraints
Data Abstraction
Data Modelling for a Database
Learning Objectives
After reading this unit you should appreciate the following:
Introduction
Database
Benefits of the Database Approach
Structure of DBMS
DBA
Records
Files
Abstraction
Integration
The present time is known as the information age because humans now deal constantly with data and information related to business and organizations. Since the beginning of civilization man has manipulated data, and the give and take of information has long been in practice, but it has been treated as an important discipline only for the last few decades. Today, data manipulation and information processing are major tasks of every organization, small or big, whether it is an educational institution, a government concern, a scientific or commercial body, or any other. Thus we can say that information is an essential requirement of any business or organization.
Data: It is the plural of the Latin word datum, which means any raw fact or figure, such as numbers, events, letters or transactions, on the basis of which we cannot reach any conclusion. Data becomes useful only after processing. For example, 78 by itself is simply a number (data), but "Physics: 78" is information; it tells us that somebody got distinction marks in Physics.
Information is processed data. The user can take decisions based on information.

Data -> Processing -> Information
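The distinction can be sketched in a few lines of Python; the marks shown and the distinction threshold of 75 are assumptions for illustration, not from the text:

```python
# Raw data: bare numbers tell us nothing on their own.
marks = [("Physics", 78), ("Chemistry", 52), ("Maths", 91)]

def process(records):
    """Turn raw (subject, marks) pairs into information a reader can act on."""
    return [f"{subject}: {score}" + (" (distinction)" if score >= 75 else "")
            for subject, score in records]

information = process(marks)
print(information)
```

The list of numbers is data; only after processing, when each number is tied to a subject and a judgement, does it become information.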
Information systems, through their central role in the information economy, bring about the following changes:
• Global exposure of the industry.
• Actively working people.
• Precedence of ideas and information over money.
• Growth in business size.
• Globalization and changing technologies.
• Integration among different components based on information flow.
• Need for optimum utilization of resources.
• Deciding loss/benefit of business.
• Future-oriented information.
• External interfaces.
[Figure: A corporate database shared by functional areas such as sales, planning, product control, accounting (accounts payable), materials, manufacturing, scheduling, production and purchasing.]
An organization can be viewed as a mechanism for processing information, so the traditional management of an organization can be viewed in the context of information and process. The manager may be considered a planning and decision centre. Established routes of information flow are used to determine the effectiveness of the organization in achieving its objectives. Thus, information is often described as the key to success in business.
Student Activity 1.1
Before reading the next section, answer the following questions:
1. Justify whether the numbers 90, 40 and 45 are data or information.
2. What is the difference between data and information?
3. Give examples of objects by which we can judge that they are information.
4. Make a list of data.
5. Make a list of information.
6. Data are the raw facts and…….
If your answers are correct, then proceed to the next section.
A system is a group of associated activities or functions with the following attributes:
• A common goal or purpose
• Identifiable objectives
• A set of procedures including inputs, outputs and feedback
• An environment in which the system exists
• Mutually dependent sub-systems that interact
It can be understood as follows:
[Figure: A business as a system. Inputs (labour, money, materials, machinery, methods) are transformed by the management processor (planning, organizing, staffing, controlling) into outputs (goods, services, productivity, taxes, employment), resulting in growth or downfall.]
It is evident from the above-mentioned job profile and process of management that effective management and effective decision-making ability depend directly on the individual's and the organization's ability to manage information.
Information can be collected as data, or as information which some other person has already processed. Data or information can be obtained from both internal and external organizational sources. Sources, whether internal or external, may be classified as formal or informal.
Formal systems are based on proper procedures for collecting data. The conversion of data into information, and the use of information, require proper procedures to be in place. A means of identifying such sources is to look for internal systems in which inputs and outputs are expressed in a consistent format. An informal internal source of information is one through which management receives information outside formal procedures. Many informal sources are verbal; therefore, the organization will require procedures through which such information can be collated for future use.
Informal and formal information may also be obtained from external sources such as newspapers, TV, etc. In addition to identifying potential sources, the organization will need to devise systems through which internal information can be collated for potential future use.
The basic quality of good information is that it has some value for the recipient. A measure of value can be found in its usefulness and reliability to the recipient. The value of the information will be decreased if the level of inaccuracy is unknown or the cost of compensating for the inaccuracies is greater than the potential benefits.
To have value, the information must be used. The value an organization gains from information, relates to
the decision making process and the ability of the management to improve profitability through use of
information.
DATA MODELLING FOR A DATABASE
5
Information should be provided at all levels. The objective of the provision of information is to enable managers to control those areas for which they have responsibility. Information will come from internal and external sources and has to be communicated on time to facilitate effective decision-making.
Management is another form of system, comprising elements and/or activities that relate to the achievement of a goal.
Management control: The means through which managers ensure that required resources are obtained and used effectively and efficiently to accomplish the objectives of the organization.
Operational control: Ensures that specific tasks are undertaken and completed effectively and efficiently. Operational control has become less important with automation because tasks are increasingly becoming subject to programmed control.
Strategic information: Enables directors and senior managers to plan the organization’s overall objectives
and strategies.
Tactical information: Used at all management levels, but mainly at middle management level, for tactical planning and management control functions.
Operational information: Used by managers at the operational level to ensure that routine tasks are correctly planned and controlled. These decisions are usually highly structured and repetitive [i.e. programmable], with the required information being presented on a regular basis.
Collecting data to provide information is a time-consuming exercise. The developer must check that the extra information gained is worthwhile, both in terms of cost and time. The assessment of what is valuable data is carried out before any data is collected. Clear objectives of the intended system should be used to determine data requirements. For example, a company may commission a market research survey to analyse customer buying habits; the range of investigation and the size of the survey sample would be controlled by the survey budget.
We expect information to be ‘reliable’ and ‘accurate’. These features can be measured by the degree of
completeness, precision and timeliness.
The user of information should receive all the details necessary to aid decision-making. It is important for
all information to be supplied before decisions are made. For example, new stock should not be ordered
until full details of current stock levels are known. This is a simple example, since we know what
information is required and where to obtain it. Difficulties begin when we are not sure of the completeness
of the information received. Business analysts and economic advisors are well aware of these problems
when devising strategies and fiscal plans.
Inaccurate information can be more damaging than incomplete information to a business. The degree of
accuracy required depends on the recipient’s position in the management hierarchy. In general terms, the
higher the position, the less accuracy required. Decisions made at the top management level are based on
annual summaries of items such as sales, purchases and capital spending. Middle managers would require a
greater degree of accuracy, perhaps weekly or monthly totals. Junior management requires the greatest
degree of accuracy to aid decision-making. Daily up-to-date information is often necessary, with accuracy
to the nearest percentage point or unit.
Timeliness is described as 'the provision of prepared information as soon as it is required'. We also need to consider the case where accurate information is produced but not used immediately, rendering it out-of-date. Some systems demand timely information and cannot operate without it. Airline reservation systems are one example: passengers and airline staff depend on timely information concerning flight times, reservations and hold-ups.
Student Activity 1.2
Before reading the next section, answer the following questions:
1. What is a source of information?
2. Define the types of sources.
3. What are the various qualities of good information?
4. Information will come from…………….. and …………………..
5. The basic quality of good information is that it has some value for the …………..
6. Give examples of sources.
If your answers are correct, then proceed to the next section.
Data processing is a traditional term used to describe the processing of function-related data within a business organization. Sales order processing is a typical example of data processing. Note that processing may be carried out manually or using a computer; some systems employ a combination of both manual and computerized processing techniques. In both cases the data processing function is essentially the same. The differences can be described in terms of:
Speed: Computers can process data much more quickly than any human. Hence, a computer system has a potentially higher level of productivity and is therefore cheaper for high-volume data processing. Speed allows more timely information to be generated.
Accuracy: Computers have a reputation for accuracy, assuming that correct data has been input and that procedures define the processing steps correctly. The errors in computer systems are thus human errors (software or input) or, less likely, machine errors (hardware failure).
Volume: As processing requirements increase, possibly due to business expansion, managers require more information processing. Human systems cannot cope with these demands. Banking is a prime example, where the dependency on computers is total.
There are, however, some tasks that computers cannot perform. These activities usually involve a high degree of non-procedural thinking in which the rules of processing are difficult to define; it would be extremely difficult to produce a set of 'rules' even for safely crossing a busy road. Many management posts still rely to a great degree on human decision-making. Top management decisions on policy and future business are still determined by a board of directors and not by a computer.
Having understood the basic concept and significance of information and database, let us now get into the
basics:
• Data: As described earlier, data are the 'raw' facts used for information processing. Data must be collected and then 'input' ready for processing.
  Each item of data must be clearly labelled, formatted and its size determined. For example, a customer account number may be labelled 'A/C', in numeric format, of size five digits.
  Data may enter a system in one form and then be changed as it is processed or calculated. Customer order data, for example, may be converted to electronic form by keying in the orders from specially prepared data entry forms. The order data may then be used to update both customer and stock files.
• Input: The transaction is the primary data input which leads to system action, e.g. the input of a customer order to the sales order processing system. The volume and frequency of transactions will often determine the structure of an organization.
  In addition to transaction data, a business system will also need to reference stored data, known as standing or fixed data. Within a sales order processing system we have standing data in the form of customer names and addresses, stock records and price lists. The transactions contain some standing data, for referencing, but mainly variable data, such as items and quantities ordered.
• Output: Output from a business system is often seen as planning or control information, or as input to another system. This can be understood if we consider a stock control system. Output will be stock level information (slow- and fast-moving items, for example) and stock orders for items whose quantities fall below their reorder level. Stock movement information would be used to plan stock levels and reorder levels, whilst stock order requirements would be used as input to the purchasing system.
• Files: A file is an ordered collection of data records, stored for retrieval or amendment, as required. When files are amended from transaction data, this is referred to as updating. In order to aid information flow, files may be shared between sub-systems. For example, a stock file may be shared between the sales function and the purchasing function.
• Processes: Data is converted to output or information by processing. Processing examples include sorting, calculating and extracting.
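The interplay of standing data, transactions, updating and extraction described above can be sketched in Python; the item codes, quantities and reorder levels are invented for illustration:

```python
# Standing (fixed) data: the stock "file", keyed by item code.
stock = {"A101": {"qty": 40, "reorder_level": 20},
         "B202": {"qty": 15, "reorder_level": 25}}

# Transaction (variable) data: customer orders referencing standing data.
orders = [("A101", 10), ("B202", 5)]

# Process: update the stock file from the transaction data...
for item, qty_ordered in orders:
    stock[item]["qty"] -= qty_ordered

# ...then extract and sort the output: items below their reorder level,
# which would become input to the purchasing system.
to_reorder = sorted(item for item, rec in stock.items()
                    if rec["qty"] < rec["reorder_level"])
print(to_reorder)
```

Here updating, extracting and sorting, the processing examples named above, each appear as one step of the run.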
Student Activity 1.3
Before reading the next section, answer the following questions:
1. Why do we process data?
2. What is input?
3. What is output?
4. Is data processing essential?
5. Speed, accuracy, volume ……… are used in favour of …….
6. Output from a business system is often seen as…………….
If your answers are correct, then proceed to the next section.
A database is a collection of related data, or operational data, extracted from any firm or organization. For example, consider the names, telephone numbers and addresses of people you know. You may have recorded this data in an indexed address book, or you may have stored it on a diskette using a personal computer and software such as Microsoft Access (part of MS Office), Oracle or SQL Server.
The common use of the term database is usually more restricted.
A database has the following implicit properties:
• A database represents some aspect of the real world, sometimes called the miniworld or the Universe of Discourse (UoD). Changes to the miniworld are reflected in the database.
• A database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot correctly be referred to as a database.
• A database is designed, built and populated with data for a specific purpose. It has an intended group of users and some preconceived applications in which these users are interested.
In other words, a database has some source from which data is derived, some degree of interaction with
events and an audience that is actively interested in the contents of the database. A database can be of any
size and of varying complexity. For example, the list of names and addresses referred to earlier may consist
of only a few hundred records, each with a simple structure. On the other hand, the card catalog of a large
library may contain half a million cards stored under different categories – by primary author’s last name,
by subject, by book titles – with each category organized in alphabetic order.
Here are several examples of databases.
1. Manufacturing company
2. Bank
3. Hospital
4. University
5. Government department
In general, a database is a collection of files (tables).
Entity: A person, place, thing or event about which information must be kept.
Attribute: Pieces of information describing a particular entity. These are mainly the characteristics about
the individual entity. Individual attributes help to identify and distinguish one entity from another.
DATA MODELLING FOR A DATABASE
9
The units of data are organized in a hierarchy:

Bit: a 0 or 1, the smallest unit of data.
Byte: a group of 8 bits, e.g. 10101011; a byte typically stores one character.
Field: a single attribute, such as Name, Age or Address.
Record: a collection of related fields (one row in a table).
File: a collection of records of the same type (a table).
Database: a collection of files (tables).

For example, a Student database may contain a Student table whose fields are Name, Age and Address; each row of that table is one student's record.
Handling of a small shop's database can be done manually, but if you have a large database and multiple users then you have to maintain a computerized database. The advantages of a database system over traditional, paper-based methods of record-keeping will perhaps be more readily apparent in these examples. Here are some of them:
• Compactness: No need for possibly voluminous paper files.
• Speed: The machine can retrieve and change data faster than a human can.
• Less drudgery: Much of the sheer tedium of maintaining files by hand is eliminated. Mechanical tasks are always better done by machines.
• Currency: Accurate, up-to-date information is available on demand at any time.
Benefits of the Database Approach
The benefits of the database approach are as follows:
Redundancy and duplication can be reduced. In the database approach, the views of different user groups are integrated during database design. For consistency, we should have a database design that stores each logical data item, such as a student's name or birth date, in only one place in the database. This prevents inconsistency, and it saves storage space. However, in some cases controlled redundancy may be useful for improving the performance of queries. For example, we may store Student Name and Course Number redundantly in a GRADE_REPORT file (figure below), because whenever we retrieve a GRADE_REPORT record we want to retrieve the student name and course number along with the grade, student number and section identifier. By placing all the data together, we do not have to search multiple files to collect this data.
[Figure: STUDENT and GRADE_REPORT files, with Student Name and Course Number stored redundantly in GRADE_REPORT.]
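The controlled redundancy described above can be tried out with SQLite from Python; the exact column names and the sample values are assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# GRADE_REPORT stores StudentName and CourseNumber redundantly, so a grade
# report can be produced without joining back to the STUDENT or COURSE files.
cur.execute("""CREATE TABLE GRADE_REPORT (
                   StudentNumber INTEGER,
                   StudentName   TEXT,     -- controlled redundancy
                   CourseNumber  TEXT,     -- controlled redundancy
                   SectionId     INTEGER,
                   Grade         TEXT)""")
cur.execute("INSERT INTO GRADE_REPORT VALUES (17, 'Smith', 'CS102', 1, 'B')")

# A single record retrieval yields everything the report needs.
row = cur.execute("SELECT StudentName, CourseNumber, Grade "
                  "FROM GRADE_REPORT WHERE StudentNumber = 17").fetchone()
print(row)
```

The price of this convenience is that the DBMS (or the applications) must keep the duplicated name and course number consistent with their master copies.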
Inconsistency can be avoided (to some extent). Suppose the fact that employee E4 works in department D5 is represented by two distinct entries in the stored database, and suppose also that the DBMS is not aware of this duplication (i.e. the redundancy is not controlled). Then there will necessarily be occasions on which the two entries do not agree, i.e. when one of the two has been updated and the other has not. At such times the database is said to be inconsistent.
The data can be shared. The same database can be used by a variety of users, for their different objectives, simultaneously.
Security restrictions can be applied. It is likely that some users will not be authorized to access all information in the database. For example, financial data is often considered confidential, and hence only authorized persons are allowed to access such data. In addition, some users may be permitted only to retrieve data, whereas others are allowed both to retrieve and to update.
Integrity can be maintained. The problem of integrity is the problem of ensuring that the data in the database is accurate. It means, for example, that if the data type of a field is number then we cannot insert any string text into it.
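As a sketch of the integrity point, SQLite can be made to reject string text in a numeric field via a CHECK constraint. The table and its values are invented for illustration, and the explicit constraint is needed because SQLite's columns are dynamically typed, unlike stricter systems:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The CHECK constraint lets the DBMS itself enforce that Age is numeric,
# so no application program can insert string text into the field.
cur.execute("""CREATE TABLE Employee (
                   EmpNo INTEGER PRIMARY KEY,
                   Name  TEXT,
                   Age   INTEGER CHECK (typeof(Age) = 'integer'))""")
cur.execute("INSERT INTO Employee VALUES (1, 'E4', 30)")   # accepted

try:
    cur.execute("INSERT INTO Employee VALUES (2, 'E5', 'thirty')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)   # the string 'thirty' violates the CHECK
```

Because the rule lives in the schema, it is enforced once, centrally, rather than duplicated in every program that writes to the table.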
Student Activity 1.4
Before reading the next section, answer the following questions:
1. What is a database?
2. What is a record?
3. What is a field?
4. Why is a database needed?
5. What is redundancy?
6. What are the benefits of a database?
If your answers are correct, then proceed to the next section.
Structure of DBMS
A DBMS is a sophisticated piece of software, which supports the creation, manipulation and administration
of database system. A database system comprises a database of operational data together with the
processing functionality required to access and manage that data. Typically, this means a computerized
record keeping system whose overall purpose is to maintain information and to make that information
available on demand.
[Figure: A simplified view of a database system: users issue requests through application programs to the DBMS, which accesses the stored database.]
This picture shows a greatly simplified view of a database system. The figure is intended to illustrate the
point that a database system involves four major components namely, data, hardware, software, and users.
The DBMS responds to a query by invoking the appropriate sub-programs, each of which performs its special function to interpret the query, or to locate the desired data in the database and present it in the desired order. Thus the DBMS shields database users from the tedious programming they would otherwise have to do to organize data for storage, or to gain access to it once it has been stored.
As already mentioned, a database consists of a group of related files of different record types and the
DBMS allows users to access data anywhere in the database, without the knowledge of how data are
actually organized on the storage device.
Student Activity 1.5
Before reading the next section, answer the following questions:
1. Define DBMS.
2. Why is a DBMS needed?
3. Users' requests are handled by…………………
4. A database consists of…………………..
5. A DBMS is an interface between the database and………………..
6. Give examples of several DBMSs.
If your answers are correct, then proceed to the next section.
The role of the DBMS as an intermediary between the users and the database is very much like the function
of a salesperson in a consumer distributor system. A consumer specifies desired items by filling out an
order form, which is submitted to a salesperson at the counter. The salesperson presents the specified items to the consumer after they have been retrieved from the storage room. Consumers who place orders have no
idea of where and how the items are stored; they simply select the desired items from an alphabetical list in
a catalogue. However, the logical order of goods in the catalogue bears no relationship to the actual
physical arrangement of the inventory in the storage room. Similarly, the database user needs to know only
what data he or she requires; the DBMS will take care of retrieving it.
Database Management Systems: A database management system (DBMS) is a software application system
that is used to create, maintain and provide controlled access to user databases. Database management
systems range in complexity from a PC-DBMS (such as Ashton Tate’s dBASE IV) costing a few hundred
dollars to a mainframe DBMS product (such as IBM’s DB2) costing several hundred thousand dollars. The
major components of a full-function DBMS are shown in the diagram given below:
[Figure: Major components of a full-function DBMS: the engine, the interface sub-system, the information repository dictionary sub-system, and the performance management, data integrity management, backup and recovery, application development and security management sub-systems.]

(1) The Engine
The engine is the central component of a DBMS. This module provides access to the repository and the database and coordinates all the other functional elements of the DBMS. The DBMS engine receives logical requests for data (and metadata) from human users and from applications, determines the secondary storage location of those data and issues physical input/output requests to the computer operating system. The engine provides services such as memory and buffer management, maintenance of indexes and lists, and secondary storage or disk management.
(2) The Interface Sub-system
The interface sub-system provides for users and applications to access the various components of the DBMS. Most DBMS products provide a range of languages and other interfaces, since the system will be used both by programmers (or other technical persons) and by users with little or no programming experience. Some of the typical interfaces to a DBMS are the following:
• A data definition language (or data sub-language), which is used to define database structures such as records, tables, files and views.
• An interactive query language (such as SQL), which is used to display data extracted from the database and to perform simple updates.
• A graphic interface (such as Query-by-Example), in which the system displays a skeleton table (or tables) and users pose requests by making suitable entries in the table.
• A forms interface, in which a screen-oriented form is presented to the user, who responds by filling in the blanks in the form.
• A DBMS programming language (such as the dBASE IV command language), which is a procedural language that allows programmers to develop sophisticated applications.
• An interface to standard third-generation programming languages such as BASIC and COBOL.
• A natural language interface that allows users to present requests in free-form English statements.
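The first two interfaces, a data definition language and an interactive query language, can be sketched with SQLite from Python; the Stock table and the LowStock view are assumptions for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Data definition language: define database structures (a table and a view).
cur.execute("CREATE TABLE Stock (Item TEXT, Qty INTEGER, ReorderLevel INTEGER)")
cur.execute("CREATE VIEW LowStock AS "
            "SELECT Item FROM Stock WHERE Qty < ReorderLevel")

# Interactive query language: display data extracted from the database.
cur.executemany("INSERT INTO Stock VALUES (?, ?, ?)",
                [("bolts", 5, 20), ("nuts", 50, 20)])
for (item,) in cur.execute("SELECT Item FROM LowStock"):
    print(item)
```

The same SQL statements could equally be typed at an interactive prompt, which is precisely the point of a query-language interface.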
(3) The Information Repository Dictionary Sub-system
The information repository dictionary sub-system (IRDS) is used to manage and control access to the repository. The IRDS is a component that is integrated within the DBMS. Notice that the IRDS uses the facilities of the database engine to manage the repository.
(4) The Performance Management Sub-system
The performance management sub-system provides facilities to optimize (or at least improve) DBMS performance. One of its important functions is query optimization: structuring SQL queries (or other forms of user queries) to minimize response times.
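As an illustration of what a query optimizer decides, SQLite's EXPLAIN QUERY PLAN shows the access path chosen for a query. The table and index here are invented, and the exact wording of the plan varies between SQLite versions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Student (RollNo INTEGER, Name TEXT)")
cur.execute("CREATE INDEX idx_name ON Student (Name)")

# The optimizer picks an access path; with an index on Name it can search
# the index instead of scanning the whole table.
plan = cur.execute("EXPLAIN QUERY PLAN "
                   "SELECT * FROM Student WHERE Name = 'Smith'").fetchall()
print(plan)
```

The reported plan mentions idx_name, showing that the optimizer chose an index search rather than a full-table scan to minimize response time.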
(5) The Data Integrity Management Sub-system
The data integrity management sub-system provides facilities for managing the integrity of data in the database and the integrity of metadata in the repository. It has three important functions:
1. Intra-record integrity: Enforcing constraints on data item values and types within each record in the database.
2. Referential integrity: Enforcing the validity of references between records in the database.
3. Concurrency control: Assuring the validity of database updates when multiple users access the database (discussed in a later section).
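Referential integrity can be sketched with SQLite. The employee E4 and department D5 follow the text's earlier example, while the schema itself is an assumption; note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
cur = con.cursor()

cur.execute("CREATE TABLE Dept (DeptNo TEXT PRIMARY KEY)")
cur.execute("""CREATE TABLE Emp (
                   EmpNo  TEXT PRIMARY KEY,
                   DeptNo TEXT REFERENCES Dept (DeptNo))""")
cur.execute("INSERT INTO Dept VALUES ('D5')")
cur.execute("INSERT INTO Emp VALUES ('E4', 'D5')")      # valid reference

try:
    cur.execute("INSERT INTO Emp VALUES ('E9', 'D99')")  # no such department
except sqlite3.IntegrityError as err:
    print("rejected:", err)   # reference to a non-existent Dept row
```

The DBMS, not the application, guarantees that every employee record points at a department that actually exists.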
(6) The Backup and Recovery Sub-system
The backup and recovery sub system provides facilities for logging transactions and database changes,
periodically making backup copies of the database and recovering the database in the event of some type of
failure.
(7) The Application Development Sub-system
The application development sub system provides facilities that allow end users and/or programmers to
develop complete database applications. It includes CASE tools as well as facilities such as screen
generators and report generators.
(8) The Security Management Sub-system
The security management sub system provides facilities to protect and control access to the database and
repository.
What has been described above is the manifestation of individual components of a typical DBMS; we
would again look at these components from another view later in this section. But first, it is relatively
important to focus on some of the other interfacing aspects of DBMS software.
What might be termed the third-generation approach to systems development involved the production of a suite of programs that together constituted an application system: a self-contained functional capability to do something useful. Within an application system, each program manipulates data in one or more files, and a particular file might be both read and written by several programs. An organization typically develops a separate application system for each information systems task perceived.
The DBMS (database approach) tries to overcome all of the shortcomings of the pre database approach as
follows:
• Data Validation Problems: If many programs manipulate a particular type of information, then validation of its correctness must be carried out by each of those programs to guard against the entry of illegal values. Consequently, program code may need to be duplicated and, if the validation conditions change, each program (at least) must be recompiled.
• Data Sharing Problems: Perhaps more seriously, if a file is used by several programs and there is a need to change its structure in some way, perhaps to add a new type of information object that is required by a new program, then each program will need to be recompiled, unless one maintains duplicate information in different structures, in which case there is a synchronization problem.
  A further dimension to this problem results from the fact that with conventional operating system facilities, if two or more programs write to the same file at the same time, unpredictable results will be obtained. Concurrent update must be avoided either by user-imposed synchronization (that is, manually controlling the usage of programs), or by a locking scheme that would have to be implemented by the application programs. In either case there are costs: management control or programming effort.
• Manipulation Problems: When writing a program using a conventional programming language and operating system facilities, a programmer uses record-level commands (i.e. reads and writes) on each file to perform the required functions; this is laborious and hence unproductive of the programmer's time.
• Data Redundancy: The same piece of information may be stored in two or more files. For example, the particulars of an individual who may be a customer and an employee may be stored in two or more files.
• Program/Data Dependency: In the traditional approach, if a data field is to be added to a master file, all such programmes that access the master file would have to be changed to allow for this new field which would have been added to the master record.
• Lack of Flexibility: In view of the strong coupling between the programs and the data, most information retrieval possibilities would be limited to well-anticipated and predetermined requests for data; the system would normally be capable of producing only the scheduled reports and queries which it had been programmed to create. In the fast-moving and competitive business environment of today, apart from such regularly scheduled reports there is a need for responding to unanticipated queries and for the kind of investigative analysis which cannot be envisaged in advance.
One could discuss other points (security problems would probably come next), but the above
should be sufficient to illustrate that the approach is fundamentally inadequate for the problem to which it
has been applied. Agreement in this matter has grown since the mid-1960s, and the database approach is
now well established as the basis for information system development and management in many application
areas.
All the above difficulties result from two shortcomings:
•
The lack of any definition of data objects independently of the way in which they are used by specific
application programs; and
•
The lack of control over data object manipulation beyond that imposed by existing application
programs.
The database approach has emerged in response. Fundamentally it rests on the following two
interrelated ideas:
The extraction of data object type descriptions from application programs into a single
repository called a database schema (the word schema can be taken to mean a description of
form): an application-independent description of the objects in the database; and
The introduction of a software component called a database management system (DBMS)
through which all data definition and manipulation (update and interrogation) occurs: a buffer
that controls data access and removes this function from the applications.
Together, these ideas have the effect of fixing a funnel over the top of the data used by application systems
and forcing all application programs' data manipulation through it. So let us now try to appreciate how
the DBMS solves some of the issues.
•
Data Validation: In principle, validation rules for data objects can be held in the schema and enforced
on entry by the DBMS. This reduces the amount of application code that is needed. Changes to these
rules need to be made exactly once, because they are not duplicated.
•
Data Sharing: Changes to the structures of data objects are registered by modifications
to the schema. Existing application programs need not be aware of any differences, because a
correspondence between their view of the data and that which is now supported
can also be held in the schema and interpreted by the DBMS. This concept is often referred
to as data independence: applications are independent of the actual representation of
their data.
Synchronization of concurrent access can be implemented by the DBMS because it oversees all database
access.
The record-level data manipulation concept of programming languages such as Cobol, PL/1, Fortran and so
on can be escaped by means of a higher-level (more problem-oriented than implementation-oriented) data
manipulation language that is embedded within application programs.
Furthermore, because the approach involves a central repository of data description, it is possible to
develop a mechanism that provides a general inquiry facility to data objects and their descriptions; such a
mechanism is normally called a query language.
It is interesting that the emergence of the database approach has brought about a new class of programming
language; this is symptomatic of the significant change that database thinking has brought to the
information systems development process.
Having described the database approach in terms of its impact on the development and management of
information systems, it is now appropriate to attempt some definitions.
Student Activity 1.6
Before reading the next section, answer the following questions:
1.
Make a diagram showing the structure of DBMS.
2.
What are the major components of DBMS?
3.
What is DBMS engine?
4.
Write brief notes on security management sub system.
5.
What are the functions of data integrity?
6.
What do you understand by data validation?
If your answers are correct, then proceed to the next section.
Database Administrator
One of the main reasons for using DBMS is to have central control of both the data and the processes that
access those data. The person who has such central control over the system is called the database
administrator (DBA). The functions of the DBA include the following:
•
Schema Definition: The DBA creates the original database schema by writing a set of definitions that is
translated by the DDL (Data Definition Language) compiler to a set of tables that is stored permanently in the
data dictionary.
•
Storage Structure and Access-Method Definition: The DBA creates appropriate storage structures and
access methods by writing a set of definitions, which is translated by the DDL compiler.
•
Schema and Physical-Organization Modification: Programmers accomplish the relatively rare
modifications either to the database schema or to the description of the physical storage organization
by writing a set of definitions that is used by either the DDL compiler or the data-storage and
data-definition-language compilers to generate modifications to the appropriate internal system tables
(for example, the data dictionary).
•
Granting of Authorizations for Data Access: The granting of different types of authorizations allows the
DBA to regulate which parts of the database various users can access.
•
Integrity-Constraint Specification: The data values stored in the database must satisfy certain
consistency constraints. For example, the number of hours an employee may work in one week may not
exceed a pre-specified limit (say, 80 hours).
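Such a constraint can be expressed as a simple check applied before an update is accepted. The sketch below is illustrative only; the function name and the constant are assumptions, not from any particular DBMS:

```c
#include <stdbool.h>

#define MAX_WEEKLY_HOURS 80  /* the pre-specified limit from the constraint */

/* Returns true when a proposed weekly-hours value satisfies the
   integrity constraint; an update violating it would be rejected. */
bool weekly_hours_ok(int hours)
{
    return hours >= 0 && hours <= MAX_WEEKLY_HOURS;
}
```

A DBMS that supports declarative constraints would run an equivalent test automatically; otherwise the check lives in every update program.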
The primary goal of a database system is to provide an environment for retrieving information from and
storing new information into the database. There are four different types of database system users,
differentiated by the way that they expect to interact with the system.
•
Application Programmers are computer professionals who interact with the system through DML (Data
Manipulation Language) calls, which are embedded in a program written in a host language (for
example, Cobol, PL/1, Pascal, C). These programs are commonly referred to as application programs.
For example, a banking system includes programs that generate payroll checks, debit accounts, credit
accounts, or transfer funds between accounts.
•
Sophisticated Users: Such users interact with the system without writing programs. Instead, they form
their requests in a database query language. Each such query is submitted to a query processor whose
function is to break down DML statements into instructions that the storage manager understands.
Analysts who submit queries to explore data in the database fall in this category.
•
Specialized Users: Such users are those who write specialized database applications that do not fit into
the traditional data-processing framework, e.g. computer-aided design systems, knowledge-base and
expert systems, and systems that store data with complex data types (for example, graphics data and
audio data).
•
Naive Users: These are unsophisticated users who interact with the system by invoking one of the
permanent application programs that have been written previously. For example, a bank teller who needs to
transfer $50 from account A to account B invokes a program called transfer.
This program asks the teller for the amount of money to be transferred, the account from which the
money is to be transferred, and the account to which the money is to be transferred.
"
18
DATABASE SYSTEMS
A database changes over time as information is inserted and deleted. The collection of information
stored in the database at a particular moment is called an instance of the database. The overall design of the
database is called the database schema. Schemas are changed infrequently, if at all.
Analogies to the concepts of data types, variables and values in programming languages are useful here.
Returning to the customer record type definition, note that in declaring the type customer, we have not
declared any variables. To declare such a variable in a Pascal-like language, we write
var customer1 : customer;
The variable customer1 now corresponds to an area of storage containing a customer-type record.
A database schema corresponds to the programming-language type definition. A variable of a given type
has a particular value at a given instant. Thus, the value of a variable in programming languages
corresponds to an instance of a database schema. In other words, "the description of a database is called the
database schema, which is specified during database design and is not expected to change frequently." A
displayed schema is called a schema diagram.
E.g. student-schema.
[Schema diagram not reproduced: it shows record types such as STUDENT, COURSE and SECTION, together with the names of their data items.]
A schema diagram displays only some aspects of a schema, such as the names of record types and data
items, and some types of constraints. Other aspects are not specified in the schema diagram: in the
above diagram, neither the data type of each data item nor the relationships among the various
files are shown.
Student Activity 1.7
Before reading the next section, answer the following questions:
1.
What do you understand by DBA? What important role does the DBA play?
2.
Describe the various types of database users.
3.
Differentiate between instances and schemas.
If your answers are correct, then proceed to the next section.
Records and Record Types
Data is usually stored in the form of records. Each record consists of a collection of related data values or
items where each value is formed of one or more bytes and corresponds to a particular field of the record.
Records usually describe entities and their attributes. For example, an EMPLOYEE record represents
an employee entity, and each field value in the record specifies some attribute of that employee, such as
NAME, BIRTHDATE, SALARY, or SUPERVISOR. A collection of field names and their corresponding
data types constitutes a record type or record format definition. A data type, associated with each field,
specifies the type of values a field can take.
DATA MODELLING FOR A DATABASE
19
The data type of a field is usually one of the standard data types used in programming. These include
numeric (integer, long integer, or floating point), string of characters (fixed-length or varying), Boolean
(having 0 and 1 or TRUE and FALSE values only), and sometimes specially coded date and time data
types. The number of bytes required for each data type is fixed for a given computer system. An integer
may require 4 bytes, a long integer 8 bytes, a real number 4 bytes, a Boolean 1 byte, a
date 10 bytes (assuming a format of YYYY-MM-DD), and a fixed-length string of k characters k bytes.
Variable-length strings may require as many bytes as there are characters in each field value. For example,
an EMPLOYEE record type may be defined, using the C programming language notation, as the following
structure:
struct employee {
char name [30];
char ssn[9];
int salary;
int jobcode;
char department[20];
};
In recent database applications, the need may arise for storing data items that consist of large unstructured
objects, which represent images, digitized video or audio streams, or free text. These are referred to as
BLOBs (Binary Large Objects). A BLOB data item is typically stored separately from its record in a pool
of disk blocks, and a pointer to the BLOB is included in the record.
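The arrangement can be sketched as a pair of C structures (the field names and sizes here are illustrative assumptions): the record keeps only a small fixed-size reference, while the BLOB bytes themselves live in their own pool of blocks.

```c
#include <stdint.h>

/* Descriptor kept inside the record: it points at the BLOB rather
   than containing it. */
struct blob_ref {
    uint32_t first_block;  /* first disk block of the BLOB's block pool */
    uint32_t length;       /* total size of the object in bytes */
};

/* An employee record extended with a photo: the record stays small
   and fixed-length even though the image itself may be very large. */
struct employee_rec {
    char name[30];
    char ssn[9];
    struct blob_ref photo; /* pointer to the BLOB, not the BLOB itself */
};
```

Keeping only the descriptor in the record preserves fixed-length record processing while the bulky object is fetched separately on demand.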
Files
A file is a sequence of records. In many cases, all records in a file are of the same record type. If every
record in the file has exactly the same size (in bytes), the file is said to be made up of fixed-length records.
If different records in the file have different sizes, the file is said to be made up of variable-length records.
A file may have variable-length records for several reasons:
The file records are of the same record type, but one or more of the fields are of varying size (variable-length
fields). For example, the NAME field of EMPLOYEE can be a variable-length field.
The file records are of the same record type, but one or more of the fields may have multiple values for
individual records; such a field is called a repeating field and a group of values for the field is often called a
repeating group.
The file records are of the same record type, but one or more of the fields are optional; that is, they may
have values for some but not all of the file records (optional fields).
The file contains records of different record types and hence of varying size (mixed file). This would occur
if related records of different types were clustered (placed together) on disk blocks; for example, the
GRADE_REPORT records of a particular student may be placed following that STUDENT’s record.
The fixed-length EMPLOYEE records in the Figure given below have a record size of 71 bytes. Every
record has the same fields, and field lengths are fixed, so the system can identify the starting byte position
of each field relative to the starting position of the record. This facilitates locating field values by programs
that access such files. Notice that it is possible to represent a file that logically should have variable-length
records as a fixed-length records file. For example, in the case of optional fields we could have each field in
every file record but store a special null value when no value exists for that field. For a repeating field, we
could allocate as many spaces in each record as the maximum number of values that the field can take. In
either case, space is wasted when certain records do not have values for all the physical spaces provided in
each record. We now consider other options for formatting records of a file of variable-length records.
[Figure not reproduced: example record storage formats, including a fixed-length EMPLOYEE record, a record with variable-length fields terminated by separator characters, and a record storing <field-type, field-value> pairs.]
For variable-length fields, each record has a value for each field, but we do not know the exact length of
some field values. To determine the bytes within a particular record that represent each field, we can use
special separator characters (such as ? or % or $) – which do not appear in any field value–to terminate
variable-length fields (See Figure), or we can store the length in bytes of the field in the record, preceding
the field value.
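The second of these options, storing the length in bytes before the field value, can be sketched in C. The single-byte prefix and the function names below are illustrative assumptions; real systems may use wider length prefixes:

```c
#include <stdint.h>
#include <string.h>

/* Write one variable-length field as <length byte><value bytes>;
   returns the number of buffer bytes used. Assumes values shorter
   than 256 bytes. */
size_t put_field(uint8_t *buf, const char *value)
{
    uint8_t len = (uint8_t)strlen(value);
    buf[0] = len;                 /* length prefix precedes the value */
    memcpy(buf + 1, value, len);
    return (size_t)len + 1;
}

/* Read the field back into out (NUL-terminated); returns bytes
   consumed, so successive calls walk a record field by field. */
size_t get_field(const uint8_t *buf, char *out)
{
    uint8_t len = buf[0];
    memcpy(out, buf + 1, len);
    out[len] = '\0';
    return (size_t)len + 1;
}
```

Unlike separator characters, a length prefix never clashes with bytes that happen to appear inside a field value.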
A file of records with optional fields can be formatted in different ways. If the total number of fields for the
record type is large but the number of fields that actually appear in a typical record is small, we can include
in each record a sequence of <field-name, field-value> pairs rather than just the field values. Separator
characters are then needed for two purposes: separating the field name from the field value, and separating
one field from the next field. A more practical option is to assign a short field type code, say, an integer
number, to each field and include in each record a sequence of <field-type, field-value> pairs rather than
<field-name, field-value> pairs.
A repeating field needs one separator character to separate the repeating values of the field and another
separator character to indicate termination of the field. Finally, for a file that includes records of different
types, each record is preceded by a record type indicator. Understandably, programs that process files of
variable-length records–which are usually part of the file system and hence hidden from the typical
programmers–need to be more complex than those for fixed-length records, where the starting position and
size of each field are known and fixed.
Data Integrity Constraints
Most database applications have certain integrity constraints that must hold for the data. A DBMS should
provide capabilities for defining and enforcing these constraints. The simplest type of integrity constraint
involves specifying a data type for each data item. For example, in Figure we may specify that the value of
the Class data item within each student record must be an integer between 1 and 5 and that the value of
Name must be a string of no more than 30 alphabetic characters. A more complex type of constraint that
occurs frequently involves specifying that a record in one file must be related to records in other files. For
example, in Figure given below, we can specify that “every section record must be related to a course
record.” Another type of constraint specifies uniqueness on data item values, such as “every course record
must have a unique value for Course Number.” These constraints are derived from the meaning or
semantics of the data and of the miniworld it represents. It is the database designer’s responsibility to
identify integrity constraints during database design. Some constraints can be specified to the DBMS and
automatically enforced. Other constraints may have to be checked by update programs or at the time of data
entry.
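The two example constraints on the student record (Class an integer between 1 and 5, Name a string of no more than 30 alphabetic characters) can be written as checks an update program might run. The function names are illustrative, and allowing spaces inside names is an extra assumption so that full names pass:

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Class must be an integer between 1 and 5. */
bool class_ok(int class_value)
{
    return class_value >= 1 && class_value <= 5;
}

/* Name must be at most 30 characters, all alphabetic
   (spaces permitted here as a simplifying assumption). */
bool name_ok(const char *name)
{
    size_t n = strlen(name);
    if (n == 0 || n > 30)
        return false;
    for (size_t i = 0; i < n; i++)
        if (!isalpha((unsigned char)name[i]) && name[i] != ' ')
            return false;
    return true;
}
```

Constraints that can be stated this mechanically are good candidates for declaring to the DBMS; the rest must be enforced by update programs or at data entry.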
A data item may be entered erroneously and still satisfy the specified integrity constraints. For example, if a
student receives a grade of A but a grade of C is entered in the database, the DBMS cannot discover this
error automatically, because C is a valid value for a grade. Since application programs are added to the
system in an ad hoc manner, it is difficult to enforce such security constraints.
[Figure not reproduced: sample database files, such as STUDENT, COURSE, SECTION and GRADE_REPORT, shown with example records and field values.]
Data Abstraction
For the system to be usable, it must retrieve data efficiently. This concern has led to the design of complex
data structures for the representation of data in the database. Since many database-system users are not
computer trained, developers hide the complexity from users through several levels of abstraction, to
simplify users’ interactions with the systems:
Physical level. The lowest level of abstraction describes how the data are actually stored. At the
physical level, complex low-level data structures are described in detail.
Logical level. The next higher level of abstraction describes what data are stored in the database,
and what relationships exist among those data. The entire database is thus described in terms of a
small number of relatively simple structures. Although implementation of the simple structures at
the logical level may involve complex physical-level structures, the user of the logical level does
not need to be aware of this complexity. Database administrators, who must decide what
information is to be kept in the database, use the logical level of abstraction.
View level. The highest level of abstraction describes only part of the entire database. Despite the
use of simpler structures at the logical level, some complexity remains, because of the large size of
the database. Many users of the database system will not be concerned with all this information.
Instead, such users need to access only a part of the database. So that their interaction with the
system is simplified, the view level of abstraction is defined. The system may provide many views
for the same database.
The interrelationship among these three levels of abstraction is illustrated in Figure given below.
DATA MODELLING FOR A DATABASE
23
[Figure not reproduced: the three levels of data abstraction (physical, logical and view) and their interrelationships.]
An analogy to the concept of data types in programming languages may clarify the distinction among levels
of abstraction. Most high-level programming languages support the notion of a record type. For example, in
a Pascal-like language, we may declare a record as follows:
type customer = record
customer-name : string;
social-security : string;
customer-street : string;
customer-city : string;
end
This code defines a new record called customer with four fields. Each field has a name and a type associated
with it. A banking enterprise may have several such record types, including
Account, with fields account-number and balance
Employee, with fields employee-name and salary
At the physical level, a customer, account, or employee record can be described as a block of consecutive
storage locations (for example, words or bytes). The language compiler hides this level of detail from
programmers. Similarly, the database system hides many of the lowest-level storage details from database
programmers. Database administrators may be aware of certain details of the physical organization of the
data.
At the logical level, each such record is described by a type definition, as illustrated in the previous code
segment, and the interrelationship among these record types is defined. Programmers using a programming
language work at this level of abstraction. Similarly, database administrators usually work at this level of
abstraction.
Finally, at the view level, computer users see a set of application programs that hide details of the data
types. Similarly, at the view level, several views of the database are defined, and database users see these
views. In addition to hiding details of the logical level of the database, the views also provide a security
mechanism to prevent users from accessing parts of the database. For example, tellers in a bank see only
that part of the database that has information on customer accounts; they cannot access information
concerning salaries of employees.
Student Activity 1.8
Answer the following questions:
1.
Write short notes on following:
a.
Records
b.
Files
c.
Record types
2.
What do you understand by Data Integrity Constraints?
3.
What do you understand by Data Abstraction?
•
Data is the plural of the Latin word datum, and means any raw facts or figures, like numbers, events,
letters, transactions, etc., from which, on their own, we cannot reach any conclusion. Data become useful
after processing.
•
Information is processed data. The user can take decision based on information.
•
A database is a collection of related data or operational data extracted from any firm or organization
•
A database is a logically coherent collection of data with some inherent meaning. A random assortment
of data cannot correctly be referred to as a database.
•
A database management system (DBMS) is a software application system that is used to create,
maintain and provide controlled access to user databases. Database management systems range in
complexity from a PC-DBMS (such as Ashton Tate’s dBASE IV) costing a few hundred dollars to a
mainframe DBMS product (such as IBM’s DB2) costing several hundred thousand dollars.
•
One of the main reasons for using DBMS is to have central control of both the data and the processes
that access those data. The person who has such central control over the system is called the database
administrator (DBA).
•
Data is usually stored in the form of records. Each record consists of a collection of related data values
or items
•
A file is a sequence of records.
•
To simplify users’ interactions with the systems, developers hide the complexity from users through
several levels of abstraction.
I.
True or False
1.
Data means any raw facts or numbers
2.
A system is not a group of associated activities or functions with the following attributes.
3.
A file is a sequence of fields.
II.
Fill in the Blanks
1.
Data is the plural of a Latin word _____________.
2.
The uses of formal systems are based on proper procedures of collecting _________.
3.
Data is converted to output or information by ____________.
4.
The primary goal of a database system is to provide an environment for retrieving
_______________.
5.
At the________________, each such record is described by a type definition.
Answers
I.
True or False
1.
True
2.
False
3.
False
II.
Fill in the Blanks
1.
datum
2.
data
3.
processing
4.
information
5.
logical level
"
I.
II.
True or False
1.
Today, data manipulation and information processing have become the major tasks of any
organization
2.
Information sources from internal and external sources may be classified only as formal.
3.
Most database applications have certain integrity constraints that must hold for the data.
4.
The person who has central control over the system is called the database user.
5.
Database administrators usually work at logical level of abstraction.
II.
Fill in the Blanks
1.
_________ and _________ information can be obtained from external sources.
2.
The user of information should receive all the details necessary to aid_________.
3.
_______ information can be more damaging than ______ information to a business.
4.
__________ can process data much quicker than human.
5.
__________ of concurrent access can be implemented by DBMS.
1.
Define a system.
2.
What is a database system?
3.
What is DBMS Engine?
4.
Discuss the roles of DBA.
5.
Describe different levels of abstraction.
Introduction
Object-Based Logical Models
ER Analysis
Record-Based Logical Models
Relational Model
Network Model
Hierarchical Data Model
Database Management System
Learning Objectives
After reading this unit you should appreciate the following:
•
Data Model
•
ER Analysis
•
Record Based Logical Model
•
Relational Model
•
Network Model
•
Hierarchical Model
Data Model
Underlying the structure of a database is the data model.
Data model: Data model is a collection of conceptual tools for describing data, data relationships, data
semantics and consistency constraints.
Constraint: A rule applied to the data or a column.
The various data models that have been proposed fall into three different groups.
1.
Object-Based Logical Models
2.
Record-Based Logical Models
3.
Physical Models.
Object-Based Logical Models
Object-based logical models are used in describing data at conceptual and external schemas. They provide
fairly flexible structuring capabilities and allow data constraints to be specified explicitly. Some of the
object-based models are:
The entity-relationship model
The object-oriented model
The semantic model
The functional data model
The entity-relationship model and the object-oriented model act as representatives of the class of
object-based logical models.
The entity-relationship (E-R) data model is based on a perception of a real world that consists of a
collection of basic objects, called entities, and of relationships among these objects.
As an example, consider two entities and the values of their attributes. The employee entity e1 has four
attributes: Name, Address, Age and Home phone; their values are “John Smith,” “2311 Kirby, Houston,
Texas 77001,” “55,” and “713-749-2630,” respectively. The company entity c1 has three attributes: Name,
Headquarters, and President; their values are “Sunco Oil,” “Houston,” and “John Smith,” respectively.
Like E-R model, the object-oriented model is based on a collection of objects. An object is a software
bundle of variables and related methods.
Objects that contain the same types of values and the same methods are grouped together into a class.
Example: Bank
Every bank offers the same kind of functionality, such as withdraw and deposit; withdraw and deposit are
the related methods of the class Bank. Individual banks, e.g. Canara Bank, State Bank of India and Andhra
Bank, all provide this same functionality, so these three banks are objects of the Bank class.
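C has no classes, but the idea can still be sketched: a structure carries the values every object of the class contains, and ordinary functions play the role of the related methods withdraw and deposit. The Account structure and the amounts below are illustrative assumptions:

```c
#include <stdbool.h>

/* The object's state: every Account carries the same types of values. */
typedef struct {
    long balance;
} Account;

/* The related methods shared by all objects of the class. */
void deposit(Account *a, long amount)
{
    a->balance += amount;
}

bool withdraw(Account *a, long amount)
{
    if (amount > a->balance)
        return false;       /* insufficient funds: reject the withdrawal */
    a->balance -= amount;
    return true;
}
```

Each bank account object then differs only in its values (its balance), while the methods are common to the whole class.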
ER Analysis
A method called entity-relationship analysis is used for obtaining the conceptual model of the data, which
will finally help us in obtaining our relational database. In order to carry out ER analysis it is necessary to
understand and use the following features:
1.
Entities: These are the real-world objects in an application. In an ER diagram, rectangles represent entity sets.
2.
Attributes: These are the properties of importance of the entities and the relationships. Ellipses
represent attributes.
3.
Relationships: These connect different entities and represent dependencies between them. Diamonds
represent relationships among entity sets.
4.
Lines: These link attributes to entity sets and entity sets to relationships.
In order to understand these terms let us take an example. If a supplier supplies an item to a company, then
the supplier is an entity. The item supplied is also an entity. The item and the supplier are related in the
sense that the supplier supplies an item. Supplying is thus the verb which specifies the relationship between
item and supplier. A collection of similar entities is known as an entity set. Each member of the entity set
is described by its attributes.
A supplier could be described by the following attributes:
SUPPLIER [supplier code, supplier name, address]
An item would have the following attributes:
ITEM [item code, item name, rate]
The entity-relationship diagram shown in Figure 2.3 represents entities by rectangles and relationships by a
diamond.
DATABASE MANAGEMENT SYSTEM
31
SUPPLIES [supplier code, item code, order no, quantity] is represented by a diamond, which connects the
two entity sets through the relationship.
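The three record types can be sketched as C structures (the field widths are arbitrary illustrative choices); note how the SUPPLIES relationship record carries the key attributes of both entities it connects:

```c
/* Entity record types from the ER analysis above. */
struct supplier {
    char supplier_code[6];
    char supplier_name[30];
    char address[60];
};

struct item {
    char item_code[6];
    char item_name[30];
    double rate;
};

/* The SUPPLIES relationship: its record holds the keys of both
   SUPPLIER and ITEM, plus the relationship's own attributes. */
struct supplies {
    char supplier_code[6];
    char item_code[6];
    int order_no;
    int quantity;
};
```

This is exactly the shape a relational design gives the relationship: a table whose rows pair a supplier key with an item key.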
Record-Based Logical Models
Record-based logical models are used in describing data at the logical and view levels. In contrast to
object-based data models, they are used both to specify the overall logical structure of the database and to
provide a higher-level description of the implementation.
Record-based models are so named because the database is structured in fixed-format records of several
types. Each record type defines a fixed number of fields, or attributes, and each field is usually of a fixed
length.
The three most widely accepted record-based data models are the
1.
Relational Model
2.
Network Model
3.
Hierarchical model
Relational Model
The relational model uses a collection of tables to represent both data and the relationships among those
data. Each table has multiple columns, and each column has a unique name.
Fig. 2.2 shows that customer Johnson with social-security number 321-12-3123 has two accounts: A-101,
with a balance of 500, and A-201, with a balance of 900. Note that the customers Johnson and Smith share
the same account number, which means that they may hold a joint account.
Network Model
Data in the network model are represented by collections of records, and relationship among data is
represented by links, which can be viewed as pointers. The records in the database are organized as a
collection of arbitrary graphs. Fig. 2.5 shows the same data as in the relational model, represented in the
network model.
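The idea of links as pointers can be sketched directly in C (the record layouts are illustrative assumptions): a customer record points at its accounts, and each account points at the owner's next account.

```c
#include <stddef.h>

/* Relationships are represented as pointers (links), not as shared
   values the way the relational model represents them. */
struct account;

struct customer {
    const char *name;
    struct account *first_account;  /* link to the customer's accounts */
};

struct account {
    const char *number;
    long balance;
    struct account *next;           /* next account of the same owner */
};
```

Following a relationship means chasing a pointer chain, which is why the records form an arbitrary graph rather than a flat table.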
Hierarchical Data Model
The hierarchical data model is similar to the network model in the sense that records and links represent
data and relationships among the data, respectively. It differs from the network model in that the records are
organized as collections of trees rather than arbitrary graphs. Fig 2.6 represents the same information as in
Fig 2.4 and Fig 2.5
! "#
$ %%
$
%
&
The relational model differs from the network and hierarchical models in that it does not use pointers or
links. Instead, the relational model relates records by the values that they contain. This freedom from the
use of pointers allows a formal mathematical foundation to be defined.
Student Activity 2.1
Answer the following questions.
1.
What do you understand by ER diagrams and ER analysis?
2.
Differentiate between Network and Hierarchical Data Model.
Summary
•
Data model is a collection of conceptual tools for describing data, data relationships, data semantics and
consistency constraints.
34
DATABASE SYSTEMS
•
A method called entity-relationship analysis is used for obtaining the conceptual model of the data,
which finally helps in obtaining a relational database.
•
Record-based logical models are used in describing data at the logical and view levels.
•
The relational model uses a collection of tables to represent both data and the relationships among those
data.
•
Data in the network model are represented by collections of records, and relationship among data is
represented by links, which can be viewed as pointers.
•
In hierarchical data model, data and relationships among the data are represented by records and links,
respectively.
I.
True or False
1.
Object-based logical models are used in describing data at conceptual and external schemas.
2.
The semantic data model is based on a perception of a real world that consists of a collection
of basic objects, called entities, and of relationships among these objects.
3.
Record-based logical models are used in describing data at the logical and view levels.
II.
Fill in the Blanks
1.
_______________models provide fairly flexible structuring capabilities and allow data
constraints to be specified explicitly.
2.
The object-oriented model is based on a collection of _________.
3.
________ connect different entities and represent dependencies between them.
4.
Object that contains the same types of values and the same methods are grouped together
into _______________.
5.
The records in the network model database are organized as a collection of
___________________.
Answers

True or False
1. True
2. False
3. True

Fill in the Blanks
1. Object-based
2. objects
3. Relationships
4. class
5. arbitrary graphs
True or False
1. The entity-relationship model and the object-oriented model act as representatives of the class of object-based logical models.
2. The various data models that have been proposed fall into three different groups.
3. Like the E-R model, the semantic model is based on a collection of objects. An object is a software bundle of variables and related methods.
4. Record-based logical models are used in describing data at the physical level.
Fill in the Blanks
a. _________ is a rule applied on the data or column.
b. _______ are the real-world objects in an application; rectangles represent entity sets.
c. _______ models are used in describing data at the logical and view levels.
d. The relational model differs from the network and hierarchical models in that it does not use ________ or ________.
Review Questions
1. Name different object-based data models.
2. What are entities, attributes and relationships?
3. Discuss the importance of the E-R model.
4. Differentiate between the Network and Hierarchical models.
5. Discuss the advantages of the relational model over other models.
Basic Relational Algebra Operations
A Complete Set of Relational Algebra Operations
Relational Calculus
SQL
Structured Query Language (SQL)
Relational Data Manipulation
Learning Objectives
After reading this unit you should appreciate the following:
•
Relation Algebra
•
Relational Calculus
•
SQL
In addition to defining the database structure and constraints, a data model must include a set of operations
to manipulate the data. Basic sets of relational model operations constitute the relational algebra. These
operations enable the user to specify basic retrieval requests. The result of retrieval is a new relation, which
may have been formed from one or more relations. The algebra operations thus produce new relations,
which can be further manipulated using operations of the same algebra. A sequence of relational algebra
operations forms a relational algebra expression, whose result will also be a relation.
The relational algebra operations are usually divided into two groups. One group includes set operations
from mathematical set theory; these are applicable because each relation is defined to be a set of tuples. Set
operations include UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN PRODUCT. The
other group consists of operations developed specifically for relational databases; these include SELECT,
PROJECT, and JOIN, among others. The SELECT and PROJECT operations are discussed first below,
because they are the simplest. Then we discuss set operations. Finally, we discuss JOIN and other
complex operations. The relational database shown in Figure 3.1 is used for our example.
Some common database requests cannot be performed with the basic relational algebra operations, so additional operations are needed to express these requests.
The SELECT Operation
The SELECT operation is used to select a subset of the tuples from a relation that satisfy a selection
condition. One can consider the SELECT operation to be a filter that keeps only those tuples that satisfy a
qualifying condition. For example, to select the EMPLOYEE tuples whose department is 4, or those whose
salary is greater than $30,000, we can individually specify each of these two conditions with a SELECT
operation as follows:
σDNO =4 (EMPLOYEE)
σSALARY>30000(EMPLOYEE)
In general, the SELECT operation is denoted by
σ<selection condition> ( R )
Where the symbol σ (sigma) is used to denote the SELECT operator, and the selection condition is a
Boolean expression specified on the attributes of relation R. Notice that R is generally a relational algebra
expression whose result is a relation; the simplest expression is just the name of a database relation. The
relation resulting from the SELECT operation has the same attributes as R. The Boolean expression
specified in <selection condition> is made up of a number of clauses of the form:
<attribute name> <comparison operator> <constant value>, or
<attribute name> <comparison operator> <attribute name>
where <attribute name> is the name of an attribute of R, <comparison operator> is normally one of the
operators {=, <, ≤, >, ≥, ≠} and <constant value> is a constant value from the attribute domain. Clauses
can be arbitrarily connected by the Boolean operators AND, OR, and NOT to form a general selection
condition. For example, to select the tuples for all employees who either work in department 4 and make
over $25,000 per year, or work in department 5 and make over $30,000, we can specify the following
SELECT operation:
σ(DNO=4 AND SALARY> 25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE)
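As an illustration only (not part of the book's notation), the SELECT operation can be sketched in Python by treating a relation as a list of dicts; the attribute names follow the chapter's EMPLOYEE examples, while the sample tuple values are invented:

```python
def select(relation, condition):
    """sigma_<condition>(R): keep only the tuples that satisfy the condition."""
    return [t for t in relation if condition(t)]

# Hypothetical EMPLOYEE tuples; only DNO and SALARY matter for the condition.
EMPLOYEE = [
    {"FNAME": "John",     "DNO": 5, "SALARY": 30000},
    {"FNAME": "Jennifer", "DNO": 4, "SALARY": 43000},
    {"FNAME": "Ahmad",    "DNO": 4, "SALARY": 25000},
]

# sigma_(DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE)
result = select(
    EMPLOYEE,
    lambda t: (t["DNO"] == 4 and t["SALARY"] > 25000)
              or (t["DNO"] == 5 and t["SALARY"] > 30000),
)
```

As the text notes, the result has the same attributes as the input relation and never more tuples.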
[Figure 3.1: A sample database state: instances of the EMPLOYEE, DEPARTMENT, DEPT_LOCATIONS, PROJECT, WORKS_ON and DEPENDENT relations]
[Figure 3.3: Results of SELECT and PROJECT operations. (a) σ(DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE). (b) πLNAME, FNAME, SALARY(EMPLOYEE). (c) πSEX, SALARY(EMPLOYEE)]
The result is shown in Figure 3.3(a). Notice that the comparison operators in the set {=, <, ≤, >, ≥, ≠} apply
to attributes whose domains are ordered values, such as numeric or date domains. Domains of strings of
characters are considered ordered, based on the collating sequence of the characters. If the domain of an
attribute is a set of unordered values, then only the comparison operators in the set {=, ≠} can be used.
An example of an unordered domain is the domain Color={red, blue, green, white, yellow, ….} where no
order is specified among the various colors. Some domains allow additional types of comparison operators;
for example, a domain of character strings may allow the comparison operator SUBSTRING_OF.
In general, the result of a SELECT operation can be determined as follows. The <selection condition> is
applied independently to each tuple t in R. This is done by substituting each occurrence of an attribute Ai in
the selection condition with its value in the tuple t[Ai]. If the condition evaluates to true, then tuple t is
selected. All the selected tuples appear in the result of the SELECT operation. The Boolean conditions
AND, OR and NOT have their normal interpretation as follows:
(cond1 AND cond2) is true if both (cond1) and (cond2) are true; otherwise, it is false.
(cond1 OR cond2) is true if either (cond1) or (cond2) are true; otherwise, it is false.
(NOT cond) is true if cond is false; otherwise, it is false.
The SELECT operator is unary; that is, it is applied to a single relation. Moreover, the selection operation
is applied to each tuple individually; hence, selection conditions cannot involve more than one tuple. The
degree of the relation resulting from a SELECT operation is the same as that of R. The number of tuples in
the resulting relation is always less than or equal to the number of tuples in R; that is, |σC( R )| ≤ |R| for
any condition C.
The fraction of tuples selected by a selection condition is referred to as the selectivity of the condition.
Notice that the SELECT operation is commutative; that is,
σ<cond1>(σ<cond2> ( R )) = σ<cond2>(σ<cond1>( R ))
Hence, a sequence of SELECT can be applied in any order. In addition, we can always combine a cascade
of SELECT operations into a single SELECT operation with a conjunctive (AND) condition; that is :
σ<cond1>(σ<cond2> (…(σ<condn>( R ))…)) = σ<cond1> AND <cond2> AND …AND <condn>( R )
If we think of a relation as a table, the SELECT operation selects some of the rows from the table while
discarding other rows. The PROJECT operation, on the other hand, selects certain columns from the table
and discards the other columns. If we are interested in only certain attributes of a relation, we use the
PROJECT operation to project the relation over these attributes only. For example, to list each employee’s
first and last name and salary, we can use the PROJECT operation as follows:
πLNAME, FNAME, SALARY (EMPLOYEE)
The resulting relation is shown in Figure 3.3(b). The general form of the PROJECT operation is
π<attribute list> ( R )
where π (pi) is the symbol used to represent the PROJECT operation and <attribute list> is a list of
attributes from the attributes of relation R. Again, notice that R is, in general, a relational algebra
expression whose result is a relation, which in the simplest case is just the name of a database relation.
The result of the PROJECT operation has only the attributes specified in <attribute list> and in the same
order as they appear in the list. Hence, its degree is equal to the number of attributes in <attribute list>.
If the attribute list includes only non-key attributes of R, duplicate tuples are likely to occur; the PROJECT
operation removes any duplicate tuples, so the result of the PROJECT operation is a set of tuples and hence
a valid relation. This is known as duplicate elimination.
For example, consider the following PROJECT operation:
πSEX, SALARY (EMPLOYEE)
The result is shown in Figure 3.3(c). Notice that the tuple <F, 25000> appears only once in Figure 3.3(c)
even though this combination of values appears twice in the EMPLOYEE relation.
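The PROJECT operation with duplicate elimination can be sketched the same way (an illustrative Python fragment; the tuple values are invented):

```python
def project(relation, attrs):
    """pi_<attrs>(R): keep only the listed columns, eliminating duplicates."""
    seen, result = set(), []
    for t in relation:
        row = tuple(t[a] for a in attrs)
        if row not in seen:  # duplicate elimination keeps the result a valid relation
            seen.add(row)
            result.append(dict(zip(attrs, row)))
    return result

# Two female employees share the same salary, so <F, 25000> occurs twice
# in EMPLOYEE but only once in the projection.
EMPLOYEE = [
    {"FNAME": "Alicia", "SEX": "F", "SALARY": 25000},
    {"FNAME": "Joyce",  "SEX": "F", "SALARY": 25000},
    {"FNAME": "John",   "SEX": "M", "SALARY": 30000},
]

result = project(EMPLOYEE, ["SEX", "SALARY"])   # pi_SEX,SALARY(EMPLOYEE)
```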
The number of tuples in a relation resulting from a PROJECT operation is always less than or equal to the
number of tuples in R. If the projection list is a superkey of R (that is, it includes some key of R), the result
has the same number of tuples as R. Moreover,
π<list1> (π<list2>( R )) = π<list1> ( R )
as long as <list2> contains the attributes in <list1>; otherwise, the left-hand side is an incorrect expression.
It is also noteworthy that commutativity does not hold on PROJECT.
Sequences of Operations and the RENAME Operation
The relation shown in Figure 3.4 does not have any names. In general, we may want to apply several
relational algebra operations one after the other. Either we can write the operations as a single relational
algebra expression by nesting the operations, or we can apply one operation at a time and create
intermediate result relations. In the latter case, we must name the relations that hold the intermediate
results. For example, to retrieve the first name, last name, and salary of all employees who work in
department number 5, we must apply a SELECT and a PROJECT operation. We can write a single
relational algebra expression as follows:
πFNAME, LNAME, SALARY(σDNO=5(EMPLOYEE))
Figure 3.4(a) shows the result of this relational algebra expression. Alternatively, we can explicitly show
the sequence of operations, giving a name to each intermediate relation:
DEP5_EMPS ← σDNO=5(EMPLOYEE)
RESULT ← πFNAME, LNAME, SALARY(DEP5_EMPS)
It is often simpler to break down a complex sequence of operations by specifying intermediate result
relations than to write a single relational algebra expression. We can also use this technique to rename the
attributes in the intermediate and result relations. This can be useful in connection with more complex
operations such as UNION and JOIN, as we shall see. To rename the attributes in a relation, we simply list
the new attribute names in parentheses, as in the following example:
TEMP ← σDNO=5(EMPLOYEE)
R(FIRSTNAME, LASTNAME, SALARY) ← πFNAME, LNAME, SALARY(TEMP)
The above two operations are illustrated in Figure 3.4(b). If no renaming is applied, the names of the
attributes in the resulting relation of SELECT operation are the same as those in the original relation and in
the same order. For a PROJECT operation with no renaming, the resulting relation has the same attribute
names as those in the projection list and in the same order in which they appear in the list.
We can also define a RENAME operation which can rename either the relation name, or the attribute
names, or both, in a manner similar to the way we defined SELECT and PROJECT.

[Figure 3.4: Results of a sequence of operations. (a) πFNAME, LNAME, SALARY(σDNO=5(EMPLOYEE)). (b) The same expression using intermediate relations TEMP and R, with renaming of attributes]

The general RENAME operation when applied to a relation R of degree n is denoted by
ρS(B1, B2, …, Bn)( R ) or ρS( R ) or ρ(B1, B2, …, Bn)( R )
Where the symbol ρ (rho) is used to denote the RENAME operator, S is the new relation name, and B1, B2,
…, Bn are the new attribute names. The first expression renames both the relation and its attributes; the
second renames the relation only; and the third renames the attributes only. If the attributes of R are (A1,
A2, …, An) in that order, then each Ai is renamed as Bi.
The next group of relational algebra operations is the standard mathematical operations on sets. For
example, to retrieve the social security numbers of all employees who either work in department 5 or
directly supervise an employee who works in department 5, we can use the UNION operation as follows:
DEP5_EMPS ← σDNO=5(EMPLOYEE)
RESULT1← πSSN(DEP5_EMPS)
RESULT2(SSN) ← πSUPERSSN(DEP5_EMPS)
RESULT←RESULT1 U RESULT2
The relation RESULT1 has the social security numbers of all employees who work in department 5,
whereas RESULT2 has the social security numbers of all employees who directly supervise an
employee who works in department 5. The UNION operation produces the tuples that are in either
RESULT1 or RESULT2 or both (see figure 3.5).
[Figure 3.5: Result of the UNION operation RESULT ← RESULT1 ∪ RESULT2]
Several set theoretic operations are used to merge the elements of two sets in various ways, including
UNION, INTERSECTION, and SET DIFFERENCE. These are binary operations; that is, each is applied
to two sets. When these operations are adapted to relational databases, the two relations on which any of
the above three operations are applied must have the same type of tuples; this condition is called union
compatibility. Two relations R(A1, A2, …, An) and S(B1, B2, …, Bn) are said to be union compatible if
they have the same degree n, and if dom(Ai) = dom(Bi) for 1 ≤ i ≤ n. This means that the two relations have
the same number of attributes and that each pair of corresponding attributes has the same domain.
We can define the three operations UNION, INTERSECTION, and SET DIFFERENCE on two union-compatible relations R and S as follows:
UNION: The result of this operation, denoted by R ∪ S, is a relation that includes all tuples that are either
in R or in S or in both R and S. Duplicate tuples are eliminated.
INTERSECTION: The result of this operation, denoted by R ∩ S , is a relation that includes all tuples that
are in both R and S.
SET DIFFERENCE: The result of this operation, denoted by R – S, is a relation that includes all tuples that
are in R but not in S.
We will adopt the convention that the resulting relation has the same attribute names as the first relation R.
Figure 3.6 illustrates the three operations. The relations STUDENT and INSTRUCTOR in Figure 3.6(a)
are union compatible, and their tuples represent the names of students and instructors, respectively. The
result of the UNION operation in Figure 3.6 (b) shows the names of all students and instructors. Note that
duplicate tuples appear only once in the result. The result of the INTERSECTION operation (Figure 3.6(c))
includes only those who are both students and instructors. Notice that UNION and INTERSECTION are
commutative operations; that is
R ∪ S = S ∪ R ,and R ∩ S = S ∩ R ,
Both union and intersection can be treated as n-ary operations applicable to any number of relations as
both are associative operations; that is
R ∪ (S ∪ T) = (R ∪ S) ∪ T, and (R ∩ S) ∩ T = R ∩ (S ∩ T).
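The three set operations on union-compatible relations can be sketched in illustrative Python (the names echo the STUDENT/INSTRUCTOR example of Figure 3.6; the tuples themselves are invented):

```python
def union(r, s):
    """R UNION S: tuples in either relation, duplicates eliminated."""
    return r + [t for t in s if t not in r]

def intersection(r, s):
    """R INTERSECTION S: tuples in both relations."""
    return [t for t in r if t in s]

def difference(r, s):
    """R MINUS S: tuples in R that are not in S."""
    return [t for t in r if t not in s]

# Union-compatible: same degree, corresponding attributes from the same domains.
STUDENT    = [("Susan", "Yao"), ("Johnny", "Kohler"), ("Barbara", "Jones")]
INSTRUCTOR = [("John", "Smith"), ("Susan", "Yao")]

everyone     = union(STUDENT, INSTRUCTOR)         # duplicate tuple kept once
both_roles   = intersection(STUDENT, INSTRUCTOR)  # students who also instruct
student_only = difference(STUDENT, INSTRUCTOR)    # not commutative
```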
[Figure 3.6: The set operations UNION, INTERSECTION, and DIFFERENCE. (a) Two union-compatible relations STUDENT and INSTRUCTOR. (b) STUDENT ∪ INSTRUCTOR. (c) STUDENT ∩ INSTRUCTOR. (d) STUDENT − INSTRUCTOR and INSTRUCTOR − STUDENT]
The DIFFERENCE operation is not commutative; that is, in general
R – S ≠ S – R
Next we discuss the CARTESIAN PRODUCT operation also known as CROSS PRODUCT or CROSS
JOIN—denoted by x, which is also a binary set operation, but the relations on which it is applied do not
have to be union compatible. This operation is used to combine tuples from two relations in a
combinatorial fashion. In general, the result of R(A1, A2, …, An) x S (B1, B2, …, Bm) is a relation Q with
n+m attributes Q(A1, A2, …, An, B1, B2, …, Bm), in that order. The resulting relation Q has one tuple for
each combination of tuples —one from R and one from S. Hence, if R has nR tuples and S has nS tuples,
then R x S will have nR * nS tuples. The operation applied by itself is generally meaningless. It is useful
when followed by a selection that matches values of attributes coming from the component relations. For
example, suppose that we want to retrieve for each female employee a list of the names of her dependents;
we can do this as follows:
FEMALE_EMPS ← σSEX='F'(EMPLOYEE)
EMPNAMES ← πFNAME, LNAME, SSN(FEMALE_EMPS)
EMP_DEPENDENTS ← EMPNAMES × DEPENDENT
ACTUAL_DEPENDENTS ← σSSN=ESSN(EMP_DEPENDENTS)
RESULT ← πFNAME, LNAME, DEPENDENT_NAME(ACTUAL_DEPENDENTS)
The resulting relations from the above sequence of operations are shown in Figure 3.7. The
EMP_DEPENDENTS relation is the result of applying the CARTESIAN PRODUCT operation to
EMPNAMES with DEPENDENT. In EMP_DEPENDENTS, every tuple from EMPNAMES is combined
with every tuple from DEPENDENT, giving a result that is not very meaningful. We only want to combine
a female employee tuple with her dependents, namely the DEPENDENT tuples whose ESSN values match
the SSN value of the EMPLOYEE tuple. The ACTUAL_DEPENDENTS relation accomplishes this.
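The CARTESIAN PRODUCT followed by a SELECT can be sketched as follows (illustrative Python; the two sample tuples are invented, not taken from the figure):

```python
def cartesian_product(r, s):
    """R x S: every tuple of R combined with every tuple of S (n_R * n_S tuples)."""
    return [{**tr, **ts} for tr in r for ts in s]

# Hypothetical EMPNAMES and DEPENDENT tuples.
EMPNAMES  = [{"LNAME": "Wallace", "SSN": "987654321"},
             {"LNAME": "Zelaya",  "SSN": "999887777"}]
DEPENDENT = [{"ESSN": "987654321", "DEPENDENT_NAME": "Abner"}]

emp_dependents = cartesian_product(EMPNAMES, DEPENDENT)   # 2 * 1 = 2 tuples
actual_dependents = [t for t in emp_dependents
                     if t["SSN"] == t["ESSN"]]            # keep related tuples only
```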
The CARTESIAN PRODUCT creates tuples with the combined attributes of two relations. We can then
SELECT only related tuples from the two relations by specifying an appropriate selection condition, as we
did in the preceding example. Because this sequence of CARTESIAN PRODUCT followed by SELECT is
used quite commonly to identify and select related tuples from two relations, a special operation, called
JOIN, was created to specify this sequence as a single operation. We discuss the JOIN operation next.
"
The JOIN operation, denoted by ⋈, is used to combine related tuples from two relations into single tuples.
This operation is very important for any relational database with more than a single relation, because it
allows us to process relationships among relations. To illustrate join, suppose that we want to retrieve the
name of the manager of each department. To get the manager’s name, we need to combine each
department tuple with the employee tuple whose SSN value matches the MGRSSN value in the department
tuple. We do this by using the JOIN operation, and then projecting the result over the necessary attributes,
as follows:
DEPT_MGR ← DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
RESULT ← πDNAME, LNAME, FNAME(DEPT_MGR)
The first operation is illustrated in Figure 3.7. Note that MGRSSN is a foreign key and that
the referential integrity constraint plays a role in having matching tuples in the referenced relation
EMPLOYEE. The example we gave earlier to illustrate the CARTESIAN PRODUCT operation can be
specified, using the JOIN operation, by replacing the two operations:
[Figure 3.7: Results of the CARTESIAN PRODUCT and JOIN operations: the FEMALE_EMPS, EMPNAMES, EMP_DEPENDENTS, ACTUAL_DEPENDENTS, RESULT and DEPT_MGR relations]
EMP_DEPENDENTS ← EMPNAMES × DEPENDENT
ACTUAL_DEPENDENTS ← σSSN=ESSN(EMP_DEPENDENTS)
with a single JOIN operation:
ACTUAL_DEPENDENTS ← EMPNAMES ⋈SSN=ESSN DEPENDENT
The general form of a JOIN operation on two relations R(A1, A2, …, An) and S(B1, B2, …, Bm) is:
R ⋈<join condition> S
The result of the JOIN is a relation Q with n + m attributes Q (A1, A2, …, An, B1, B2, …, Bm) in that order;
Q has one tuple for each combination of tuples—one from R and one from S—wherever the combination
satisfies the join condition. This is the main difference between Cartesian Product and JOIN: in JOIN, only
combinations of tuples satisfying the join condition appear in the result, whereas in the CARTESIAN
PRODUCT all combinations of tuples are included in the result. The join condition is specified on
attributes from the two relations R and S and is evaluated for each combination of tuples. Each tuple
combination for which the join condition evaluates to true is included in the resulting relation Q as a single
combined tuple.
A general join condition is of the form:
<condition> AND <condition> AND… AND <condition>
where each condition is of the form Ai θ Bi, Ai is an attribute of R, Bi is an attribute of S, Ai and Bi
have the same domain, and θ (theta) is one of the comparison operators {=, <, ≤, >, ≥, ≠}. A JOIN
operation with such a general join condition is called a THETA JOIN. Tuples whose join attributes are null
do not appear in the result. In that sense, the join operation does not necessarily preserve all of the
information in the participating relations.
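A THETA JOIN is thus a filtered Cartesian product, which a minimal Python sketch makes explicit (relation contents invented for illustration):

```python
def theta_join(r, s, condition):
    """R join_<condition> S: the tuples of R x S for which the join condition holds."""
    return [{**tr, **ts} for tr in r for ts in s if condition(tr, ts)]

# Hypothetical DEPARTMENT and EMPLOYEE tuples.
DEPARTMENT = [{"DNAME": "Research",       "MGRSSN": "333445555"},
              {"DNAME": "Administration", "MGRSSN": "987654321"}]
EMPLOYEE   = [{"LNAME": "Wong",    "SSN": "333445555"},
              {"LNAME": "Wallace", "SSN": "987654321"},
              {"LNAME": "Smith",   "SSN": "123456789"}]

# DEPT_MGR <- DEPARTMENT join_{MGRSSN=SSN} EMPLOYEE; the condition here uses
# only equality, which is the common special case.
dept_mgr = theta_join(DEPARTMENT, EMPLOYEE,
                      lambda d, e: d["MGRSSN"] == e["SSN"])
```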
The most common JOIN involves join conditions with equality comparisons only. Such a JOIN, where the
only comparison operator used is =, is called an EQUIJOIN. Both examples we have considered were
EQUIJOINs. Notice that in the result of an EQUIJOIN we always have one or more pairs of attributes that
have identical values in every tuple. For example, in Figure 3.9 the values of the attributes MGRSSN and
SSN are identical in every tuple of DEPT_MGR because of the equality join condition specified on these
two attributes. Because one of each pair of attributes with identical values is superfluous, a new operation
called NATURAL JOIN—denoted by ‘ * ’ was created to get rid of the second (superfluous) attribute in an
EQUIJOIN condition. The standard definition of NATURAL JOIN requires that the two join attributes (or
each pair of join attributes) have the same name in both relations.
[Figure 3.8: Results of two NATURAL JOIN operations. (a) PROJ_DEPT ← PROJECT * DEPARTMENT. (b) DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS]
If this is not the case, a renaming operation is applied first. In the following example, we first rename the
DNUMBER attribute of DEPARTMENT to DNUM so that it has the same name as the DNUM attribute in
PROJECT, and then apply NATURAL JOIN:
PROJ_DEPT ← PROJECT * ρ(DNAME, DNUM, MGRSSN, MGRSTARTDATE)(DEPARTMENT)
The attribute DNUM is called the join attribute. The resulting relation is illustrated in Figure 3.8 (a). In the
PROJ_DEPT relation, each tuple combines a PROJECT tuple with the DEPARTMENT tuple for the
department that controls the project, but only one join attribute is kept.
If the attributes on which the natural join is specified have the same names in both relations, renaming is
unnecessary. For example, to apply a natural join on the DNUMBER attributes of DEPARTMENT and
DEPT_LOCATIONS, it is sufficient to write:
DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS
The resulting relation is shown in Figure 3.8 (b), which combines each department with its locations and
has one tuple for each location. In general, equating all attribute pairs that have the same name in the two
relations performs NATURAL JOIN. There can be a list of join attributes from each relation, and each
corresponding pair must have the same name.
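A sketch of NATURAL JOIN over same-named attributes, in illustrative Python (the DEPT_LOCATIONS values are invented):

```python
def natural_join(r, s):
    """R * S: equate all same-named attributes and keep only one copy of each."""
    common = [a for a in r[0] if a in s[0]]   # the join attributes, found by name
    return [{**tr, **ts} for tr in r for ts in s
            if all(tr[a] == ts[a] for a in common)]

DEPARTMENT     = [{"DNAME": "Research", "DNUMBER": 5}]
DEPT_LOCATIONS = [{"DNUMBER": 5, "DLOCATION": "Houston"},
                  {"DNUMBER": 5, "DLOCATION": "Bellaire"}]

# DEPT_LOCS <- DEPARTMENT * DEPT_LOCATIONS: one tuple per department location.
dept_locs = natural_join(DEPARTMENT, DEPT_LOCATIONS)
```

Merging the two dicts keeps a single DNUMBER column, mirroring the removal of the superfluous join attribute.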
A more general but non-standard definition for NATURAL JOIN is
Q ← R *(<list1>), (<list2>) S
In this case, <list1> specifies a list of i attributes from R, and <list2> specifies a list of i attributes from S.
The lists are used to form equality comparison conditions between pairs of corresponding attributes; the
conditions are then ANDed together. Only the list corresponding to attributes of the first relation R,
<list1>, is kept in the result Q.
Notice that if no combination of tuples satisfies the join condition, the result of a JOIN is an empty relation
with zero tuples. In general, if R has nR tuples and S has nS tuples, the result of a JOIN operation
R ⋈<join condition> S will have between zero and nR * nS tuples. The expected size of the join result divided by the
maximum size nR * nS gives a ratio called the join selectivity, which is a property of each join condition.
If there is no join condition, all combinations of tuples qualify and the JOIN becomes a CARTESIAN
PRODUCT, also called CROSS PRODUCT or CROSS JOIN.
The join operation is used to combine data from multiple relations so that related information can be
presented in a single table. Note that sometimes a join may be specified between a relation and itself. The
natural join or equijoin operation can also be specified among multiple tables, leading to an n-way join. For
example, consider the following three-way join:
((PROJECT ⋈DNUM=DNUMBER DEPARTMENT) ⋈MGRSSN=SSN EMPLOYEE)
This links each project to its controlling department, and then relates the department to its manager
employee. The net result is a consolidated relation in which each tuple contains this project-department-manager information.
Student Activity 3.1
Before reading the next section, answer the following question:
1. What do you understand by the PROJECT, RENAME, and set theoretic operations?
2. What do you understand by the DIFFERENCE, SELECT and JOIN operations?
If your answer is correct, then proceed to the next section.
A Complete Set of Relational Algebra Operations
It has been shown that the set of relational algebra operations {σ, π, U, –, x} is a complete set; that is, any
of the other relational algebra operations can be expressed as a sequence of operations from this set. For
example, the INTERSECTION operation can be expressed by using UNION and DIFFERENCE as
follows:
R ∩ S ≡ ( R ∪ S ) – ((R – S ) ∪ ( S – R ))
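This identity can be checked directly on small sample sets using Python's built-in set operators (the element values are arbitrary):

```python
R = {1, 2, 3, 4}
S = {3, 4, 5}

lhs = R & S                           # R INTERSECTION S
rhs = (R | S) - ((R - S) | (S - R))   # (R UNION S) - ((R - S) UNION (S - R))
assert lhs == rhs == {3, 4}
```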
Although, strictly speaking, INTERSECTION is not required, it is inconvenient to specify this complex
expression every time we wish to specify an intersection. As another example, a JOIN operation can be
specified as a CARTESIAN PRODUCT followed by a SELECT operation, as we discussed:
R ⋈<condition> S ≡ σ<condition>(R × S)
Similarly, a NATURAL JOIN can be specified as a CARTESIAN PRODUCT preceded by RENAME and
followed by SELECT and PROJECT operations. Hence, the various JOIN operations are also not strictly
necessary for the expressive power of the relational algebra; however, they are very important because they
are convenient to use and are very commonly applied in database applications. Other operations have been
included in the relational algebra for convenience rather than necessity. We discuss one of these—the
DIVISION operation—in the next section.
The DIVISION Operation
The DIVISION operation is useful for a special kind of query that sometimes occurs in database
applications. An example is "Retrieve the names of employees who work on all the projects that 'John
Smith' works on." To express this query using the DIVISION operation, proceed as follows. First, retrieve
the list of project numbers that 'John Smith' works on in the intermediate relation SMITH_PNOS:
SMITH ← σFNAME='John' AND LNAME='Smith'(EMPLOYEE)
SMITH_PNOS ← πPNO(WORKS_ON ⋈ESSN=SSN SMITH)
Next, create a relation that includes a tuple <PNO, ESSN> whenever the employee whose social security
number is ESSN works on the project whose number is PNO in the intermediate relation SSN_PNOS:
SSN_PNOS ← πESSN, PNO(WORKS_ON)
Finally, apply the DIVISION operation to the two relations, which gives the desired employees' social
security numbers:
SSNS(SSN) ← SSN_PNOS ÷ SMITH_PNOS
RESULT ← πFNAME, LNAME(SSNS * EMPLOYEE)
The previous operations are shown in Figure 3.9(a). In general, the DIVISION operation is applied to two
relations R(Z) ÷ S(X), where X ⊆ Z. Let Y = Z – X (and hence Z = X ∪ Y); that is, let Y be the set of
attributes of R that are not attributes of S. The result of DIVISION is a relation T(Y) that includes a tuple t
if tuples tR appear in R with tR[Y] = t, and with tR[X] = tS for every tuple tS in S. This means that, for a
tuple t to appear in the result T of the DIVISION, the values in t must appear in R in combination with
every tuple in S.
Figure 3.9(b) illustrates a DIVISION operation where X = {A}, Y = {B}, and Z = {A, B}. Notice that the
tuples (values) b1 and b4 appear in R in combination with all three tuples in S; that is why they appear in the
resulting relation T. All other values of B in R do not appear with all the tuples in S and are not selected: b2
does not appear with a2 and b3 does not appear with a1.
The DIVISION operator can be expressed as a sequence of π, ×, and – operations as follows:
T1 ← πY( R )
T2 ← πY((S × T1) – R )
T ← T1 – T2
RELATIONAL DATA MANIPULATION
51
[Figure 3.9 The DIVISION operation. (a) Dividing SSN_PNOS by SMITH_PNOS: the resulting SSNS relation contains the SSNs 123456789 and 453453453. (b) T ← R ÷ S, with R(A, B), S(A) = {a1, a2, a3}, and result T(B) = {b1, b4}.]
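As a concrete illustration (not part of the original text), the following Python sketch computes R ÷ S for relations shaped like those of Figure 3.9(b). The sample tuples are assumed values chosen so that b1 and b4 appear with every A-value of S, b2 lacks a2, and b3 lacks a1, as described above; the result is computed both directly from the definition and through the π, ×, – sequence.

```python
# R(A, B) and S(A): b1 and b4 appear with every A-value of S,
# b2 is missing a2, and b3 is missing a1 (assumed sample data).
R = {("a1", "b1"), ("a2", "b1"), ("a3", "b1"), ("a4", "b1"),
     ("a1", "b2"), ("a3", "b2"),
     ("a2", "b3"), ("a3", "b3"), ("a4", "b3"),
     ("a1", "b4"), ("a2", "b4"), ("a3", "b4")}
S = {"a1", "a2", "a3"}

def divide(R, S):
    """T = R / S: the B-values paired in R with every value of S."""
    Bs = {b for (_, b) in R}
    return {b for b in Bs if all((a, b) in R for a in S)}

# The same result via the projection/product/difference sequence:
T1 = {b for (_, b) in R}                               # T1 <- project B from R
T2 = {b for a in S for b in T1 if (a, b) not in R}     # project((S x T1) - R)
T = T1 - T2

print(sorted(divide(R, S)), sorted(T))  # ['b1', 'b4'] ['b1', 'b4']
```

Both formulations return {b1, b4}, matching the tuples selected in Figure 3.9(b).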
52
DATABASE SYSTEMS
In this section we define additional operations to express these requests. These operations enhance the
expressive power of the relational algebra.
Aggregate Functions and Grouping
The first type of request that cannot be expressed in the basic relational algebra is to specify mathematical
aggregate functions on collections of values from the database. Examples of such functions include
retrieving the average or total salary of all employees or the number of employee tuples. Common functions
applied to collections of numeric values include SUM, AVERAGE, MAXIMUM and MINIMUM. The
COUNT function is used for counting tuples or values.
Another common type of request involves grouping the tuples in a relation by the value of some of their
attributes and then applying an aggregate function independently to each group. An example would be to
group employee tuples by DNO, so that each group includes the tuples for employees working in the same
department. We can then list each DNO value along with, say, the average salary of employees within the
department.
We can define an AGGREGATE FUNCTION operation, using the symbol ℑ (pronounced “script F”), to
specify these types of requests as follows:
<grouping attributes> ℑ <function list> (R)
where <grouping attributes> is a list of attributes of the relation specified in R, and <function list> is a list
of (<function> < attribute>) pairs. In each such pair, <function> is one of the allowed functions —such as
SUM, AVERAGE, MAXIMUM, MINIMUM, COUNT— and <attribute> is an attribute of the relation
specified by R. The resulting relation has the grouping attributes plus one attribute for each element in the
function list. For example, to retrieve each department number,
the number of employees in the department, and their average salary, while renaming the resulting
attributes as indicated below, we write:
ρR(DNO, NO_OF_EMPLOYEES, AVERAGE_SAL) (DNO ℑ COUNT SSN, AVERAGE SALARY (EMPLOYEE))
The result of this operation is shown in Figure 3.10
In the above example, we specified a list of attribute names—between parentheses in the rename
operation—for the resulting relation R. If no renaming is applied, then the attributes of the resulting
relation that correspond to the function list will each be the concatenation of the function name with the
attribute name in the form <function>_<attribute>. For example, Figure 3.10 (b) shows the result of the
following operation:
DNO ℑ COUNT SSN, AVERAGE SALARY (EMPLOYEE)
(a)
R
DNO   NO_OF_EMPLOYEES   AVERAGE_SAL
5     4                 33250
4     3                 31000
1     1                 55000

(b)
DNO   COUNT_SSN   AVERAGE_SALARY
5     4           33250
4     3           31000
1     1           55000

(c)
COUNT_SSN   AVERAGE_SALARY
8           35125

Figure 3.10 The AGGREGATE FUNCTION operation: (a) with renaming, (b) without renaming. Figure 3.11 (c) shows the result when no grouping attributes are specified.
If no grouping attributes are specified, the functions are applied to the attribute values of all the tuples in
the relation, so the resulting relation has a single tuple only. For example, Figure 3.11 shows the result of
the following operation :
ℑ COUNT SSN, AVERAGE SALARY(EMPLOYEE)
It is important to note that, in general, duplicates are not eliminated when an aggregate function is applied;
this way, the normal interpretation of functions such as SUM and AVERAGE is computed. It is worth
emphasizing that the result of applying an aggregate function is a relation, not a scalar number—even if it
has a single value.
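The grouping step behind the ℑ operation can be sketched in Python. The salary figures below are assumed sample values, chosen so that the grouped and ungrouped results match the counts and averages quoted above (4, 3, and 1 employees with averages 33250, 31000, and 55000; overall count 8 with average 35125).

```python
from collections import defaultdict

# Illustrative EMPLOYEE tuples, reduced to (DNO, SALARY); the
# salaries are assumed values, not taken from the text.
employees = [(5, 30000), (5, 40000), (5, 38000), (5, 25000),
             (4, 25000), (4, 43000), (4, 25000),
             (1, 55000)]

# DNO group-then-aggregate: one (COUNT, AVERAGE) pair per group.
groups = defaultdict(list)
for dno, salary in employees:
    groups[dno].append(salary)

result = {dno: (len(sals), sum(sals) / len(sals)) for dno, sals in groups.items()}
print(result)  # {5: (4, 33250.0), 4: (3, 31000.0), 1: (1, 55000.0)}

# With no grouping attributes, the functions apply to the whole relation:
all_sals = [s for _, s in employees]
print(len(all_sals), sum(all_sals) / len(all_sals))  # 8 35125.0
```

Note that, as the text observes, the relational result is still a relation (here, a mapping with one tuple per group), even when it holds a single value.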
Recursive Closure Operations
Another type of operation that, in general, cannot be specified in the basic relational algebra is recursive
closure. This operation is applied to a recursive relationship between tuples of the same type, such as the
relationship between an employee and a supervisor. This relationship is described by the foreign key
SUPERSSN of the EMPLOYEE relation in Figure 3.2 and 3.3, which relates each employee tuple (in the
role of supervisee) to another employee tuple (in the role of supervisor). An example of a recursive
operation is to retrieve all supervisees of an employee e at all levels —that is, all employees e’ directly
supervised by e; all employees e’’ directly supervised by each employee e’; all employees e’’’ directly
supervised by each employee e’’; and so on. Although it is straightforward in the relational algebra to
specify all employees supervised by e at a specific level, it is difficult to specify all supervisees at all levels.
For example, to specify the SSNs of all employees e’ directly supervised—at level one—by the employee
e whose name is ‘James Borg’ (see Figure 3.3), we can apply the following operation:
BORG_SSN ← π SSN (σ FNAME=’James’ AND LNAME=’Borg’ (EMPLOYEE))
SUPERVISION(SSN1, SSN2) ← π SSN, SUPERSSN (EMPLOYEE)
RESULT1(SSN) ← π SSN1 (SUPERVISION ⋈ SSN2=SSN BORG_SSN)
To retrieve all employees supervised by Borg at level 2—that is, all employees e’’ supervised by some
employee e’ who is directly supervised by Borg—we can apply another JOIN to the result of the first
query, as follows:
RESULT2(SSN) ← π SSN1 (SUPERVISION ⋈ SSN2=SSN RESULT1)
To get both sets of employees supervised at levels 1 and 2 by ‘James Borg’, we can apply the UNION
operation to the two results, as follows:
RESULT ← RESULT2 ∪ RESULT1
The results of these queries are illustrated in Figure 3.12. Although it is possible to retrieve employees at
each level and then take their UNION, we cannot, in general, specify a query such as “retrieve the
supervisees of ‘James Borg’ at all levels” without utilizing a looping mechanism. An operation called the
transitive closure of relations has been proposed to compute the recursive relationship as far as the
recursion proceeds.
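The looping mechanism that the basic algebra lacks is easy to sketch outside it. The Python fragment below computes the transitive closure of a SUPERVISION relation; the (SSN, SUPERSSN) pairs are assumed sample data consistent with the discussion (Borg, SSN 888665555, directly supervises two employees, who in turn supervise the rest).

```python
from collections import defaultdict

# Assumed SUPERVISION(SSN1, SSN2) sample pairs; Borg's SSN is 888665555.
supervision = [("333445555", "888665555"), ("987654321", "888665555"),
               ("123456789", "333445555"), ("666884444", "333445555"),
               ("453453453", "333445555"), ("999887777", "987654321"),
               ("987987987", "987654321")]

def supervisees(boss, pairs):
    """All supervisees of boss at all levels: one JOIN per loop iteration."""
    direct = defaultdict(set)
    for ssn, superssn in pairs:
        direct[superssn].add(ssn)
    result, frontier = set(), {boss}
    while frontier:
        frontier = {s for b in frontier for s in direct[b]} - result
        result |= frontier
    return result

print(len(supervisees("888665555", supervision)))  # 7: levels 1 and 2 combined
```

The loop stops as soon as a JOIN step produces no new tuples, which is exactly the "as far as the recursion proceeds" behaviour of transitive closure.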
Finally, we discuss some extensions of the JOIN and UNION operations. The JOIN operations described
earlier match tuples that satisfy the join condition. For example, for a NATURAL JOIN operation R * S,
only tuples from R that have matching tuples in S—and vice versa—appear in the result. Hence, tuples
without a matching (or related) tuple are eliminated from the JOIN result. Tuples with null in the join
attributes are also eliminated. A set of operations, called OUTER JOINs, can be used when we want to
keep all the tuples in R, or those in S, or those in both relations in the result of the JOIN, whether or not
they have matching tuples in the other relation. This satisfies the need of queries in which tuples from two
tables are to be combined by matching corresponding rows, but some tuples are liable to be lost for lack of
matching values. In such cases an operation is desirable that would preserve all the tuples whether or not
they produce a match.
For example, suppose that we want a list of all employee names together with the name of the department
they manage if they happen to manage a department; we can apply the LEFT OUTER JOIN operation,
denoted by ⟕, to retrieve the result as follows:
TEMP ← (EMPLOYEE ⟕ SSN=MGRSSN DEPARTMENT)
RESULT ← π FNAME, MINIT, LNAME, DNAME (TEMP)
[Figure 3.12 A two-level recursive query (Borg’s SSN is 888665555). SUPERVISION(SSN1, SSN2) holds the (SSN, SUPERSSN) pairs of EMPLOYEE. RESULT1 contains the SSNs of employees supervised directly by Borg (333445555 and 987654321); RESULT2 contains the SSNs supervised at level two (123456789, 999887777, 666884444, 453453453, 987987987); RESULT = RESULT2 ∪ RESULT1.]
The LEFT OUTER JOIN operation keeps every tuple in the first, or left, relation R in R ⟕ S; if no
matching tuple is found in S, then the attributes of S in the join result are filled, or “padded”, with null
values. The result of these operations is shown in Figure 3.13.
A similar operation, RIGHT OUTER JOIN, denoted by ⟖, keeps every tuple in the second, or right,
relation S in the result of R ⟖ S. A third operation, FULL OUTER JOIN, denoted by ⟗, keeps all
tuples in both the left and the right relations when no matching tuples are found, padding them with null
values as needed.
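The padding behaviour of LEFT OUTER JOIN can be sketched in Python. The employee and department tuples below are assumed sample values; the point is that every left-hand tuple survives, with the department name replaced by None (null) when no match exists.

```python
# Assumed sample tuples: EMPLOYEE(SSN, FNAME) and DEPARTMENT(DNAME, MGRSSN).
employees = [("123456789", "John"), ("333445555", "Franklin"),
             ("888665555", "James")]
departments = [("Research", "333445555"), ("Headquarters", "888665555")]

# LEFT OUTER JOIN on SSN = MGRSSN: keep every employee; pad the
# DNAME attribute with None when the employee manages no department.
dname_of = {mgrssn: dname for dname, mgrssn in departments}
temp = [(ssn, fname, dname_of.get(ssn)) for ssn, fname in employees]
print(temp)
```

Here John has no matching DEPARTMENT tuple, so his DNAME is padded with None, while Franklin and James keep the department they manage.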
The OUTER UNION operation was developed to take the union of tuples from two relations if the relations
are not union compatible. This operation will take the UNION of tuples in two relations that are partially
compatible, meaning that only some of their attributes are union compatible. It is expected that the list of
compatible attributes includes a key for both relations. Tuples from the component relations with the same
key are represented only once in the result and have values for all attributes in the result. The attributes that
are not union compatible from either relation are kept in the result, and tuples that have no values for these
attributes are padded with null values. For example, an OUTER UNION can be applied to two relations
whose schemas are STUDENT (Name , SSN, Department, Advisor) and FACULTY (Name, SSN,
Department, Rank). The resulting relation schema is R (Name, SSN, Department, Advisor, Rank), and all
the tuples from both relations are included in the result. Student tuples will have a null for the Rank
attribute, whereas faculty tuples will have a null for the Advisor attribute. A tuple that exists in both will
have values for all its attributes.
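A sketch of OUTER UNION over the STUDENT and FACULTY schemas mentioned above; the individual tuples are assumed sample data. Tuples sharing the key SSN are merged into one result tuple, and attributes missing from either side are padded with None (null).

```python
# STUDENT(Name, SSN, Department, Advisor) and FACULTY(Name, SSN, Department, Rank),
# keyed by SSN; the tuple values themselves are illustrative assumptions.
students = {"111": ("Sue", "111", "CS", "Dr King")}
faculty = {"111": ("Sue", "111", "CS", "Lecturer"),
           "222": ("Ann", "222", "CS", "Professor")}

# OUTER UNION into R(Name, SSN, Department, Advisor, Rank).
result = {}
for ssn in students.keys() | faculty.keys():
    s, f = students.get(ssn), faculty.get(ssn)
    name, dept = (s or f)[0], (s or f)[2]
    result[ssn] = (name, ssn, dept,
                   s[3] if s else None,   # Advisor padded for faculty-only tuples
                   f[3] if f else None)   # Rank padded for student-only tuples
print(sorted(result.values()))
```

Sue, who appears in both relations, is represented once with values for all five attributes; Ann, who is faculty only, has a null Advisor.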
Another capability that exists in most commercial languages (but not in the basic relational algebra) is that
of specifying operations on values after they are extracted from the database. For example, arithmetic
operations such as +, –, and * can be applied to numeric values.
[Figure 3.13 The result of a LEFT OUTER JOIN operation. RESULT(FNAME, MINIT, LNAME, DNAME): Franklin T Wong appears with DNAME ‘Research’, Jennifer S Wallace with ‘Administration’, and James E Borg with ‘Headquarters’; the remaining employees appear with DNAME padded with null.]
Examples of Queries in Relational Algebra
We now give additional examples to illustrate the use of the relational algebra operations. All examples
refer to the database of Figure 3.2. In general, the same query can be stated in numerous ways using the
various operations. We will state each query in one way and leave it to the reader to come up with
equivalent formulations.
Query 1
Retrieve the name and address of all employees who work for the ‘Research’ department.
RESEARCH_DEPT ← σ DNAME=’Research’ (DEPARTMENT)
RESEARCH_DEPT_EMPS ← (RESEARCH_DEPT ⋈ DNUMBER=DNO EMPLOYEE)
RESULT ← π FNAME, LNAME, ADDRESS (RESEARCH_DEPT_EMPS)
This query could be specified in other ways; for example, the order of the JOIN and SELECT operations
could be reversed, or the JOIN could be replaced by a NATURAL JOIN (after renaming).
Query 2
For every project located in ‘Stafford’, list the project number, the controlling department number, and the
department manager’s last name, address, and birthday.
STAFFORD_PROJS ← σ PLOCATION=’Stafford’ (PROJECT)
CONTR_DEPT ← (STAFFORD_PROJS ⋈ DNUM=DNUMBER DEPARTMENT)
PROJ_DEPT_MGR ← (CONTR_DEPT ⋈ MGRSSN=SSN EMPLOYEE)
RESULT ← π PNUMBER, DNUM, LNAME, ADDRESS, BDATE (PROJ_DEPT_MGR)
Query 3
Find the names of employees who work on all the projects controlled by department number 5.
DEPT5_PROJS(PNO) ←π PNUMBER (σDNUM=5(PROJECT))
EMP_PROJ(SSN, PNO) ←πESSN, PNO(WORKS_ON)
RESULT_EMP_SSNS ← EMP_PROJ ÷ DEPT5_PROJS
RESULT ←πLNAME, FNAME (RESULT_EMP_SSNS * EMPLOYEE)
Query 4
Make a list of project numbers for projects that involve an employee whose last name is ‘Smith’, either as a
worker or as a manager of the department that controls the project.
SMITHS(ESSN) ← π SSN (σ LNAME=’Smith’ (EMPLOYEE))
SMITH_WORKER_PROJS ← π PNO (WORKS_ON * SMITHS)
MGRS ← π LNAME, DNUMBER (EMPLOYEE ⋈ SSN=MGRSSN DEPARTMENT)
SMITH_MANAGED_DEPTS(DNUM) ← π DNUMBER (σ LNAME=’Smith’ (MGRS))
SMITH_MGR_PROJS(PNO) ← π PNUMBER (SMITH_MANAGED_DEPTS * PROJECT)
RESULT ← (SMITH_WORKER_PROJS ∪ SMITH_MGR_PROJS)
Query 5
List the names of all employees with two or more dependents.
Strictly speaking, this query cannot be handled in the basic relational algebra. We have to use the
AGGREGATE Function operation with the COUNT aggregate function. We assume that dependents of the
same employee have distinct DEPENDENT_NAME values.
T1(SSN, NO_OF_DEPENDENTS) ← ESSN ℑ COUNT DEPENDENT_NAME (DEPENDENT)
T2 ← σ NO_OF_DEPENDENTS ≥ 2 (T1)
RESULT ← π LNAME, FNAME (T2 * EMPLOYEE)
Query 6
Retrieve the names of employees who have no dependents.
ALL_EMPS ← π SSN (EMPLOYEE)
EMPS_WITH_DEPS(SSN) ← π ESSN (DEPENDENT)
EMPS_WITHOUT_DEPS ← (ALL_EMPS – EMPS_WITH_DEPS)
RESULT ← π LNAME, FNAME (EMPS_WITHOUT_DEPS * EMPLOYEE)
Query 7
List the names of managers who have at least one dependent.
MGRS(SSN) ← π MGRSSN (DEPARTMENT)
EMPS_WITH_DEPS(SSN) ← π ESSN (DEPENDENT)
MGRS_WITH_DEPS ← (MGRS ∩ EMPS_WITH_DEPS)
RESULT ← π LNAME, FNAME (MGRS_WITH_DEPS * EMPLOYEE)
As we mentioned earlier, the same query can in general be specified in many different ways. For example,
the operations can often be applied in various sequences. In addition, some operations can be used to
replace others; for example, the INTERSECTION operation in Query 7 can be replaced by NATURAL
JOIN. As an exercise, try to do each of the above example queries using different operations.
Student Activity 3.2
Before reading the next section, answer the following questions:
1. What do you understand by aggregate functions?
2. Make a list of aggregate functions.
3. Why do we use the group by clause in a query?
4. What do you understand by the DIVISION operation? Discuss with an example.
If your answers are correct, then proceed to the next section.
Relational Calculus
Relational calculus is an alternative to relational algebra. In contrast to the algebra, which is procedural, the
calculus is nonprocedural, or declarative, in that it allows us to describe the set of answers without being
explicit about how they should be computed. Relational calculus has had a big influence on the design of
commercial query languages such as SQL and, especially, Query-by-Example (QBE).
The variant of the calculus that we present in detail is called the tuple relational calculus (TRC); variables
in TRC take on tuples as values. In another variant, called the domain relational calculus (DRC), the
variables range over field values. TRC has had more of an influence on SQL, while DRC has strongly
influenced QBE.
A tuple variable is a variable that takes on tuples of a particular relation schema as values. That is, every
value assigned to a given tuple variable has the same number and type of fields. A tuple relational calculus
query has the form {T | p(T)}, where T is a tuple variable and p(T) denotes a formula that describes T; we
will shortly define formulas and queries rigorously. The result of this query is the set of all tuples t for
which the formula p(T) evaluates to true when T is assigned the tuple value t. The language for writing
formulas p(T) is thus at the heart of TRC and is essentially a simple subset of first-order logic. As
a simple example, consider the following query.
(Q1) Find all sailors with a rating above 7.
{S | S ∈ Sailors ∧ S.rating > 7}

When this query is evaluated on an instance of the Sailors relation, the tuple variable S is instantiated
successively with each tuple, and the test S.rating > 7 is applied. The answer contains those instances of S
that pass this test. On instance S3 of Sailors, the answer contains the Sailors tuples with sid 31, 32, 58, 71, and 74.
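The evaluation just described is easy to mimic in Python (a sketch, using the standard Sailors instance S3 that the answers quoted later in this section also refer to): instantiate S with each tuple and keep those that pass the test.

```python
# Instance S3 of Sailors as (sid, sname, rating, age) tuples.
sailors = [(22, "Dustin", 7, 45.0), (29, "Brutus", 1, 33.0),
           (31, "Lubber", 8, 55.5), (32, "Andy", 8, 25.5),
           (58, "Rusty", 10, 35.0), (64, "Horatio", 7, 35.0),
           (71, "Zorba", 10, 16.0), (74, "Horatio", 9, 35.0),
           (85, "Art", 3, 25.5), (95, "Bob", 3, 63.5)]

# {S | S in Sailors and S.rating > 7}: bind S to each tuple in turn
# and keep those for which the formula evaluates to true.
answer = [s for s in sailors if s[2] > 7]
print(sorted(s[0] for s in answer))  # [31, 32, 58, 71, 74]
```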
We now define these concepts formally, beginning with the notion of a formula. Let Rel be a relation
name, R and S be tuple variables, a an attribute of R, and b an attribute of S. Let op denote an operator in the
set {<, >, =, ≤, ≥, ≠}. An atomic formula is one of the following:
R ∈ Rel
R.a op S.b
R.a op constant, or constant op R.a
A formula is recursively defined to be one of the following, where p and q are themselves formulas, and
p(R) denotes a formula in which the variable R appears:
Any atomic formula
¬p, p ∧ q, p ∨ q, or p ⇒ q
∃R (p(R)), where R is a tuple variable
∀R (p(R)), where R is a tuple variable
In the last two clauses above, the quantifiers ∃ and ∀ are said to bind the variable R. A variable is said to
be free in a formula or subformula (a formula contained in a larger formula) if the (sub)formula does not
contain an occurrence of a quantifier that binds it.
We observe that every variable in a TRC formula appears in a subformula that is atomic, and every relation
schema specifies a domain for each field; this observation ensures that each variable in a TRC formula has
a well-defined domain from which values for the variable are drawn. That is, each variable has a well-defined
type, in the programming language sense. Informally, an atomic formula R ∈ Rel gives R the type
of tuples in Rel, and comparisons such as R.a op S.b and R.a op constant induce type restrictions on the
field R.a. If a variable R does not appear in an atomic formula of the form R ∈ Rel (i.e., it appears only in
atomic formulas that are comparisons), we will follow the convention that the type of R is a type whose
fields include all (and only) fields of R that appear in the formula.
We will not define types of variables formally, but the type of a variable should be clear in most cases, and
the important point to note is that comparisons of values having different types should always fail. (In
discussions of relational calculus, the simplifying assumption is often made that there is a single domain of
constants and that this is the domain associated with each field of each relation.)
A TRC query is defined to be an expression of the form {T | p(T)}, where T is the only free variable in the
formula p.
What does a TRC query mean? More precisely, what is the set of answer tuples for a given TRC query?
The answer to a TRC query {T | p(T)}, as we noted earlier, is the set of all tuples t for which the formula
p(T) evaluates to true when the tuple value t is assigned to the free variable T.
A query is evaluated on a given instance of the database. Let each free variable in a formula F be bound to
a tuple value. For the given assignment of tuples to variables, with respect to the given database instance, F
evaluates to (or simply ‘is’) true if one of the following holds:
F is an atomic formula R ∈ Rel, and R is assigned a tuple in the instance of relation Rel.
F is a comparison R.a op S.b, R.a op constant, or constant op R.a, and the tuples assigned to R and S
have field values R.a and S.b that make the comparison true.
F is of the form ¬p, and p is not true; or of the form p ∧ q, and both p and q are true; or of the form p ∨
q, and one of them is true; or of the form p ⇒ q, and q is true whenever p is true.
F is of the form ∃R(p(R)), and there is some assignment of tuples to the free variables in p(R),
including the variable R, that makes the formula p(R) true.
F is of the form ∀ R (p(R)), and there is some assignment of tuples to the free variables in p(R) that
makes the formula p (R) true no matter what tuple is assigned to R.
Examples of TRC Queries
We now illustrate the calculus through several examples, using the instances B1 of Boats, R2 of Reserves,
and S3 of Sailors as shown below:
Instance B1 of Boats
bid    bname       color
101    Interlake   blue
102    Interlake   red
103    Clipper     green
104    Marine      red

Instance R2 of Reserves
sid    bid    day
22     101    10/10/98
22     102    10/10/98
22     103    10/8/98
22     104    10/7/98
31     102    11/10/98
31     103    11/6/98
31     104    11/12/98
64     101    9/5/98
64     102    9/8/98
74     103    9/8/98

Instance S3 of Sailors
sid    sname     rating    age
22     Dustin    7         45.0
29     Brutus    1         33.0
31     Lubber    8         55.5
32     Andy      8         25.5
58     Rusty     10        35.0
64     Horatio   7         35.0
71     Zorba     10        16.0
74     Horatio   9         35.0
85     Art       3         25.5
95     Bob       3         63.5
We will use parentheses as needed to make our formulas unambiguous. Often, a formula p(R) includes a
condition R ∈ Rel, and the meaning of the phrases some tuple R and for all tuples R is intuitive. We will
use the notation ∃R ∈ Rel (p(R)) for ∃R (R ∈ Rel ∧ p(R)).
Similarly, we use the notation ∀R ∈ Rel (p(R)) for ∀R (R ∈ Rel ⇒ p(R)).
(Q2) Find the names and ages of sailors with a rating above 7.
{P | ∃S ∈ Sailors (S.rating > 7 ∧ P.name = S.sname ∧ P.age = S.age)}
This query illustrates a useful convention: P is considered to be a tuple variable with exactly two fields,
which are called name and age, because these are the only fields of P that are mentioned and P does not
range over any of the relations in the query; that is, there is no subformula of the form P ∈ Relname. The
result of this query is a relation with two fields, name and age. The atomic formulas P.name = S.sname and
P.age = S.age give values to the fields of an answer tuple P. On instances B1, R2, and S3, the answer is the set
of tuples (Lubber, 55.5), (Andy, 25.5), (Rusty, 35.0), (Zorba, 16.0), and (Horatio,35.0).
(Q3) Find the sailor name, boat id, and reservation date for each reservation.
{P | ∃R ∈ Reserves ∃S ∈ Sailors
(R.sid = S.sid ∧ P.bid = R.bid ∧ P.day = R.day ∧ P.sname = S.sname)}
For each Reserves tuple, we look for a tuple in Sailors with the same sid. Given a pair of such tuples, we
construct an answer tuple P with fields sname, bid, and day by copying the corresponding fields from these
two tuples. This query illustrates how we can combine values from different relations in each answer tuple.
The answer to this query on instances B1, R2, and S3 is shown in figure given below:
sname     bid    day
Dustin    101    10/10/98
Dustin    102    10/10/98
Dustin    103    10/8/98
Dustin    104    10/7/98
Lubber    102    11/10/98
Lubber    103    11/6/98
Lubber    104    11/12/98
Horatio   101    9/5/98
Horatio   102    9/8/98
Horatio   103    9/8/98
(Q4) find the names of sailors who have reserved boat 103.
{P | ∃S ∈ Sailors ∃R ∈ Reserves (R.sid = S.sid ∧ R.bid = 103 ∧ P.sname = S.sname)}
This query can be read as follows: “Retrieve all sailor tuples for which there exists a tuple in Reserves,
having the same value in the sid field, and with bid = 103”. That is, for each sailor tuple, we look for a tuple
in Reserves that shows that this sailor has reserved boat 103. The answer tuple P contains just one field,
sname.
(Q5) Find the names of sailors who have reserved a red boat.
{P | ∃S ∈ Sailors ∃R ∈ Reserves (R.sid = S.sid ∧ P.sname = S.sname
∧ ∃B ∈ Boats (B.bid = R.bid ∧ B.color = ‘red’))}
This query can be read as follows: “Retrieve all sailor tuples S for which there exist tuples R in Reserves
and B in Boats such that S.sid = R.sid, R.bid = B.bid, and B.color =’red’.” Another way to write this query,
which corresponds more closely to this reading, is as follows:
{P | ∃S ∈ Sailors ∃R ∈ Reserves ∃B ∈ Boats
(R.sid = S.sid ∧ B.bid = R.bid ∧ B.color = ‘red’ ∧ P.sname = S.sname)}
(Q6) Find the names of sailors who have reserved at least two boats.
{P | ∃ S ∈Sailors ∃ R1 ∈ Reserves ∃ R2 ∈ Reserves
(S.sid = R1.sid ∧ R1.sid = R2.sid ∧ R1.bid ≠ R2.bid ∧ P.sname = S.sname) }
Contrast this query with the algebra version and see how much simpler the calculus version is. In part, this
difference is due to the cumbersome renaming of fields in the algebra version, but the calculus version
really is simpler.
(Q7) Find the names of sailors who have reserved all boats.
{P | ∃ S ∈ Sailors ∀B ∈ Boats
(∃R ∈ Reserves (S.sid = R.sid ∧R.bid = B.bid ∧ P.sname = S.sname)) }
This query was expressed using the division operator in relational algebra. Notice how easily it is
expressed in the calculus. The calculus query directly reflects how we might express the query in English:
“Find sailors S such that for all boats B there is a Reserves tuple showing that sailor S has reserved boat B.”
(Q8) Find sailors who have reserved all red boats.
{S | S ∈ Sailors ∧ ∀B ∈ Boats
(B.color = ’red’ ⇒ ∃R ∈ Reserves (S.sid = R.sid ∧ R.bid = B.bid))}
This query can be read as follows: For each candidate (sailor), if a boat is red, the sailor must have reserved
it. That is, for a candidate sailor, a boat being red must imply the sailor having reserved it. Observe that
since we can return an entire sailor tuple as the answer instead of just the sailor’s name, we have avoided
introducing a new free variable (e.g., the variable P in the previous example) to hold the answer values. On
instances B1, R2, and S3, the answer contains the Sailors tuples with sid 22 and 31.
We can write this query without using implication, by observing that an expression of the form p ⇒ q is
logically equivalent to ¬p ∨ q:
{S | S ∈ Sailors ∧ ∀B ∈ Boats
(B.color ≠ ‘red’ ∨ ∃R ∈ Reserves (S.sid = R.sid ∧ R.bid = B.bid))}
This query should be read as follows: “Find sailors S such that for all boats B, either the boat is not red or a
Reserves tuple shows that sailor S has reserved boat B.”
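Both formulations of (Q8) can be checked mechanically. The sketch below uses the Boats and Reserves instances B1 and R2 assumed in this section, and confirms that the implication form and the ¬p ∨ q form select the same sailors (sids 22 and 31, as stated above).

```python
# Instances B1 (Boats), R2 (Reserves as (sid, bid) pairs), and the
# sailor sids from S3, as assumed in this section's examples.
boats = [(101, "Interlake", "blue"), (102, "Interlake", "red"),
         (103, "Clipper", "green"), (104, "Marine", "red")]
reserves = {(22, 101), (22, 102), (22, 103), (22, 104),
            (31, 102), (31, 103), (31, 104),
            (64, 101), (64, 102), (74, 103)}
sids = [22, 29, 31, 32, 58, 64, 71, 74, 85, 95]

# Implication form: a boat being red implies a matching Reserves tuple.
impl = {s for s in sids
        if all((s, bid) in reserves for bid, _, color in boats if color == "red")}

# Equivalent "not p or q" form: the boat is not red, or s has reserved it.
disj = {s for s in sids
        if all(color != "red" or (s, bid) in reserves for bid, _, color in boats)}

print(sorted(impl), sorted(disj))  # [22, 31] [22, 31]
```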
The Domain Relational Calculus
A domain variable is a variable that ranges over the values in the domain of some attribute (e.g., the
variable can be assigned an integer if it appears in an attribute whose domain is the set of integers). A DRC
query has the form {(x1, x2, …, xn) | p(x1, x2, …, xn)}, where each xi is either a domain variable or a
constant and p(x1, x2, …, xn) denotes a DRC formula whose only free variables are the variables among
the xi, 1 ≤ i ≤ n. The result of this query is the set of all tuples (x1, x2, …, xn) for which the formula evaluates
to true.
A DRC formula is defined in a manner that is very similar to the definition of a TRC formula. The main
difference is that the variables are now domain variables. Let op denote an operator in the set {<, >, =, ≤, ≥, ≠}
and let X and Y be domain variables. An atomic formula in DRC is one of the following:
(x1, x2, …, xn) ∈ Rel, where Rel is a relation with n attributes; each xi, 1 ≤ i ≤ n, is either a variable or a
constant.
X op Y
X op constant, or constant op X
A formula is recursively defined to be one of the following, where p and q are themselves formulas, and
p(X) denotes a formula in which the variable X appears:
Any atomic formula
¬p, p ∧ q, p ∨ q, or p ⇒ q
∃X (p(X)), where X is a domain variable
∀X (p(X)), where X is a domain variable
The reader is invited to compare this definition with the definition of TRC formulas and see how closely
these two definitions correspond. We will not define the semantics of DRC formulas formally; this is left
as an exercise for the reader.
Examples of DRC Queries
We now illustrate DRC through several examples. The reader is invited to compare these with the TRC
versions.
(Q2) Find all sailors with a rating above 7.
{ (I,N,T,A) | (I, N, T, A) ∈ Sailors ∧ T > 7 }
This differs from the TRC version in giving each attribute a (variable) name. The condition (I, N, T, A) ∈
Sailors ensures that the domain variables I, N, T, and A are restricted to be fields of the same tuple. In
comparison with the TRC query, we can say T>7 instead of S.rating > 7, but we must specify the tuple (I,
N,T, A) in the result, rather than just S.
(Q4) Find the names of sailors who have reserved boat 103.
{ (N) | ∃ I, T, A ((I,N,T,A) ∈ Sailors
∧∃ Ir, Br, D((Ir, Br, D ) ∈ Reserves ∧ Ir = I ∧ Br = 103)) }
Notice that only the sname field is retained in the answer and that only N is a free variable. We use the
notation ∃Ir, Br, D(…) as a shorthand for ∃Ir (∃Br (∃D(…))).
Very often, all the quantified variables appear in a single relation, as in this example. An even more compact
notation in this case is ∃(Ir, Br, D) ∈ Reserves. With this notation, which we will use henceforth, the above
query would be as follows:
{(N) | ∃I, T, A ((I, N, T, A) ∈ Sailors
∧ ∃(Ir, Br, D) ∈ Reserves (Ir = I ∧ Br = 103))}
The comparison with the corresponding TRC formula should now be straightforward. This query can also
be written as follows; notice the repetition of variable I and the use of the constant 103:
{ (N) | ∃ I, T, A((I,N,T,A) ∈Sailors
∧ ∃D ((I, 103, D ) ∈Reserves))}
(Q5) Find the names of sailors who have reserved a red boat.
{(N) | ∃I, T, A ((I, N, T, A) ∈ Sailors
∧ ∃(Ir, Br, D) ∈ Reserves (Ir = I ∧ ∃(Br, BN, ‘red’) ∈ Boats))}
(Q6) Find the names of sailors who have reserved at least two boats.
{(N) | ∃I, T, A ((I, N, T, A) ∈ Sailors ∧ ∃Br1, Br2, D1, D2 ((I, Br1, D1) ∈ Reserves ∧ (I, Br2, D2) ∈
Reserves ∧ Br1 ≠ Br2))}
Notice how the repeated use of variable I ensures that the same sailor has reserved both the boats in
question.
(Q7) Find the names of sailors who have reserved all boats.
{ (N) | ∃I, T, A((I,N, T, A) ∈ Sailors ∧
∀B, BN, C(¬((B, BN, C) ∈ Boats) V
(∃(Ir, Br, D) ∈Reserves (I= Ir ∧ Br = B))))}
This query can be read as follows: “Find all values of N such that there is some tuple (I,N,T,A) in Sailors
satisfying the following condition: for every (B,BN,C), either this is not a tuple in Boats or there is some
tuple (Ir, Br, D ) in Reserves that proves that Sailor I has reserved boat B.” The ∀ quantifier allows the
domain variables B, BN, and C to range over all values in their respective attribute domains, and the pattern
‘ ¬((B, BN, C) ∈ Boats) V’ is necessary to restrict attention to those values that appear in tuples of boats.
This pattern is common in DRC formulas, and the notation ∀(B, BN, C) ∈ Boats can be used as shorthand
instead. This is similar to the notation introduced earlier for ∃. With this notation the query would be
written as follows:
{ (N) | ∃ I, T, A((I,N,T,A) ∈ Sailors ∧ ∀ (B,BN,C ) ∈ Boats
(∃( Ir, Br, D) ∈ Reserves ( I = Ir ∧ Br = B)))}
(Q8) Find sailors who have reserved all red boats.
{(I, N, T, A) | (I, N, T, A) ∈ Sailors ∧ ∀(B, BN, C) ∈ Boats
(C = ’red’ ⇒ ∃(Ir, Br, D) ∈ Reserves (I = Ir ∧ Br = B))}
Here, we find all sailors such that for every red boat there is a tuple in Reserves that shows the sailor has
reserved it.
Although SQL is called a “query language”, it contains many other capabilities besides querying a database. It
includes features for defining the structure of the data, for modifying the data in the database, and for
specifying security constraints. SQL has clearly established itself as the standard relational database
language.
A relational database consists of a collection of relations, each of which is assigned a unique name. Each
relation has a structure.
Student Activity 3.3
Before reading the next section, answer the following questions.
1. How does relational calculus differ from relational algebra?
2. What do you understand by TRC queries?
3. What do you understand by DRC queries?
If your answers are correct, then proceed to the next section.
SQL
The basic structure of an SQL expression consists of three clauses: select, from, and where.
The select clause corresponds to the projection operation of the relational algebra. It is used to list the
attributes desired in the result of a query.
The from clause corresponds to the Cartesian product operation of the relational algebra. It lists the
relations to be scanned in the evaluation of the expression.
The where clause corresponds to the selection predicate of the relational algebra. It consists of a predicate
involving attributes of the relations that appear in the from clause.
A typical SQL query has the form
select A1,A2,..., An
from r1,r2,....,rm
where P
Each Ai represents an attribute, each ri a relation, and P a predicate.
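A minimal, self-contained sketch of the select-from-where pattern, using Python’s built-in sqlite3 module; the employee table and its values are illustrative assumptions, not taken from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (name TEXT, dno INTEGER, salary REAL)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [("Smith", 5, 30000), ("Wong", 5, 40000), ("Zelaya", 4, 25000)])

# select A1 from r1 where P: the select clause projects the name
# attribute, the from clause names the relation, and the where
# clause applies the predicate salary > 28000.
cur.execute("SELECT name FROM employee WHERE salary > 28000")
names = [row[0] for row in cur.fetchall()]
print(names)
```

The select clause plays the role of π, the where clause the role of σ, just as described above.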
SQL * PLUS: GETTING STARTED
Update the std_fee of the student tuple with std_id = 1 to 3000.50.
In SQL:
update student set std_fee = 3000.50 where std_id = 1
Domain Constraint: Data types help determine what values are valid for a particular column.
Referential constraint: It refers to the maintenance of relationships of data rows in multiple tables.
Entity Constraint: It means that we can uniquely identify every row in a table.
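The three constraint types can be demonstrated with sqlite3 (table names and values are illustrative assumptions): a PRIMARY KEY enforces the entity constraint, a column type plus a CHECK clause narrows the domain, and a REFERENCES clause enforces the referential constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite checks foreign keys only when enabled

# Entity constraint: dno uniquely identifies each department row.
conn.execute("CREATE TABLE department (dno INTEGER PRIMARY KEY, dname TEXT NOT NULL)")
# Domain constraint: salary must be a non-negative number.
# Referential constraint: employee.dno must match an existing department.
conn.execute("""CREATE TABLE employee (
                    ssn TEXT PRIMARY KEY,
                    salary REAL CHECK (salary >= 0),
                    dno INTEGER REFERENCES department(dno))""")

conn.execute("INSERT INTO department VALUES (5, 'Research')")
conn.execute("INSERT INTO employee VALUES ('123456789', 30000, 5)")

try:
    conn.execute("INSERT INTO employee VALUES ('999887777', 25000, 9)")  # no department 9
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True: the referential constraint rejected the row
```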
Student Activity 3.4
Before reading the next section, answer the following questions:
1. What are the various types of update operations on relations?
2. Which operation do we use to change an existing value in a table?
If your answers are correct, then proceed to the next section.
Introduction to SQL
Structured Query Language (SQL), pronounced “sequel”, is the set of commands that all programs and
users must use to access data within the Oracle7 database. Application programmes and Oracle7 tools often
allow users to access the database without directly using SQL, but these applications in turn must use SQL
when executing the user’s request.
Historically the paper, “A Relational Model of Data for Large Shared Data Banks,” by Dr E F Codd, was
published in June 1970 in the Association for Computing Machinery (ACM) journal, Communications of the
ACM. Codd’s model is now accepted as the definitive model for relational database management systems
(RDBMS). The language, Structured English Query Language (SEQUEL) was developed by IBM
Corporation, Inc. to use Codd’s model. SEQUEL later became SQL. In 1979, Relational Software, Inc.
(now Oracle Corporation) introduced the first commercially available implementation of SQL. Today, SQL
is accepted as the standard RDBMS language. The latest SQL standard published by ANSI and ISO is often
called SQL-92 (and sometimes SQL2).
Benefits of SQL
This section describes many reasons for SQL’s widespread acceptance by relational database vendors as
well as end users. The strengths of SQL benefit all ranges of users, including application programmers,
database administrators, management, and end users.
Non-procedural Language
SQL is a non-procedural language because it:
•
Processes sets of records rather than just one at a time;
•
Provides automatic navigation to the data.
SQL can be used by all types of users, including:
•
System Administrators
•
Database Administrators
•
Security Administrators
•
Application Programmers
•
Decision Support System personnel
•
Many other types of end users
SQL provides easy-to-learn commands that are both consistent and applicable to all users. The basic SQL
commands can be learned in a few hours and even the most advanced commands can be mastered in a few
days.
SQL provides commands for a variety of tasks, including:

•	Querying data;
•	Inserting, updating and deleting rows in a table;
•	Creating, replacing, altering and dropping objects;
•	Controlling access to the database and its objects;
•	Guaranteeing database consistency and integrity.

SQL unifies all the above tasks in one consistent language.
Because all major relational database management systems support SQL, you can transfer all the skills you have gained with SQL from one database to another. In addition, since all programmes written in SQL are portable, they can often be moved from one database to another with very little modification.
Embedded SQL
Embedded SQL refers to the use of standard SQL commands embedded within a procedural programming language. Embedded SQL is a collection of these commands:

•	All SQL commands, such as SELECT and INSERT, available with interactive SQL tools;
•	Flow control commands, such as PREPARE and OPEN, which integrate the standard SQL commands with a procedural programming language.
RELATIONAL DATA MANIPULATION
67
The Oracle precompilers support embedded SQL. The Oracle precompilers interpret embedded SQL statements and translate them into statements that can be understood by procedural language compilers. Each of these Oracle precompilers translates embedded SQL programmes into a different procedural language:

•	The Pro*Ada precompiler
•	The Pro*C/C++ precompiler
•	The Pro*COBOL precompiler
•	The Pro*FORTRAN precompiler
•	The Pro*Pascal precompiler
•	The Pro*PL/I precompiler
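As an illustration only, here is a hypothetical Pro*C fragment; the host-variable names are invented, and this is a sketch of the embedding style rather than a complete, compilable program:

```
/* Pro*C sketch: host variables are declared to the precompiler,  */
/* then referenced inside EXEC SQL statements with a ':' prefix.  */
EXEC SQL BEGIN DECLARE SECTION;
    int  emp_number;
    char emp_name[11];
EXEC SQL END DECLARE SECTION;

/* The precompiler translates this into ordinary C library calls. */
EXEC SQL SELECT ENAME INTO :emp_name
         FROM EMP
         WHERE EMPNO = :emp_number;
```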
Oracle supports two types of data objects:
Schema Objects: A schema is a collection of logical structures of data, or schema objects. A schema is owned by a database user and has the same name as that user. Each user owns a single schema. Schema objects can be created and manipulated with SQL and include the following types of objects:
•	Clusters
•	Database links
•	Database triggers
•	Indexes
•	Packages
•	Sequences
•	Snapshots
•	Snapshot logs
•	Stored functions and stored procedures
•	Synonyms
•	Tables
•	Views
Non-schema Objects: Other types of objects are also stored in the database and can be created and manipulated with SQL, but are not contained in a schema:

•	Profiles
•	Roles
•	Rollback segments
•	Tablespaces
•	Users
Object Naming Rules

The following rules apply when naming objects:
•	Names must be from 1 to 30 characters long, with the following exceptions: names of databases are limited to 8 characters; names of database links can be as long as 128 characters.
•	Names cannot contain quotation marks.
•	Names are not case-sensitive.
•	A name must begin with an alphabetic character from your database character set unless surrounded by double quotation marks.
•	Names can only contain alphanumeric characters from your database character set and the characters _, $ and #. You are strongly discouraged from using $ and #.
•	If your database character set contains multi-byte characters, it is recommended that each name for a user or a role contain at least one single-byte character.
•	Names of database links can also contain periods (.) and “at” signs (@).
•	Columns in the same table or view cannot have the same name. However, columns in different tables or views can have the same name.
•	Procedures or functions contained in the same package can have the same name, provided that their arguments are not of the same number and data types. Creating multiple procedures or functions with the same name in the same package with different arguments is called overloading the procedure or function.
Object Naming Guidelines

There are several helpful guidelines for naming objects and their parts:
•	Use full, descriptive, pronounceable names (or well-known abbreviations).
•	Use consistent naming rules.
•	Use the same name to describe the same entity or attribute across tables.
•	When naming objects, balance the objective of keeping names short and easy to use with the objective of making names as long and descriptive as possible. When in doubt, choose the more descriptive name, because many people may use the objects in the database over a period of time. Your counterpart ten years from now may have difficulty understanding a database with names like PMDD instead of PAYMENT_DUE_DATE.
•	Using consistent naming rules helps users to understand the part that each table plays in your application. One such rule might be beginning the names of all tables belonging to the FINANCE application with FIN_.
•	Use the same names to describe the same things across tables. For example, the department number columns of the EMP and DEPT tables should both be named DEPTNO.
Data Types

CHAR(size): fixed-length character data of length size. Default and minimum size is 1 byte; maximum size is 255 bytes.
VARCHAR2(size): variable-length character data; a maximum size must be specified, up to 2000 bytes.
NUMBER(p,s): number having precision p (1 to 38 digits) and scale s (-84 to 127).
LONG: character data of variable length, up to 2 gigabytes.
DATE: date and time values; valid dates range from January 1, 4712 BC to December 31, 4712 AD.
RAW(size): raw binary data of length size; maximum size is 255 bytes.
LONG RAW: raw binary data of variable length, up to 2 gigabytes.
ROWID: hexadecimal string representing the unique address of a row in its table; primarily for values returned by the ROWID pseudo-column.
MLSLABEL: binary format of an operating-system label; this data type is used with Trusted Oracle7.

Character Data Types
Character data types are used to manipulate words and free-form text. These data types are used to store character (alphanumeric) data in the database character set. They are less restrictive than other data types and consequently have fewer properties. For example, character columns can store all alphanumeric values, but NUMBER columns can only store numeric values. The character data types are CHAR and VARCHAR2.
CHAR Data Type
The CHAR data type specifies a fixed-length character string. When you create a table with a CHAR column, you can supply the column length in bytes. Oracle7 subsequently ensures that all values stored in that column have this length. If you insert a value that is shorter than the column length, Oracle7 blank-pads the value to the column length. If you try to insert a value that is too long for the column, Oracle7 returns an error. The default length for a CHAR column is 1 character and the maximum allowed is 255 characters. A zero-length string can be inserted into a CHAR column, but the column is blank-padded to 1 character when used in comparisons.
VARCHAR2 Data Type
The VARCHAR2 data type specifies a variable length character string. When you create a VARCHAR2
column, you can supply the maximum number of bytes of data that it can hold. Oracle7 subsequently stores
each value in the column exactly as you specify it, provided it does not exceed the column’s maximum
length.
VARCHAR Data Type
The VARCHAR data type is currently synonymous with the VARCHAR2 data type. It is recommended that you use VARCHAR2 rather than VARCHAR. In a future version of Oracle7, VARCHAR might become a separate data type, used for variable-length character strings with different comparison semantics.
NUMBER Data Type
The NUMBER data type is used to store zero, positive and negative fixed and floating point numbers with magnitudes between 1.0 x 10^-130 and 9.99…x 10^125 (38 nines followed by 88 zeroes), with 38 digits of precision.
DATE Data Type
The DATE data type is used to store date and time information. Although date and time information can be represented in both CHAR and NUMBER data types, the DATE data type has special associated properties. For each DATE value the following information is stored: century, year, month, day, hour, minute and second.

To specify a date value, you must convert a character or numeric value to a date value with the TO_DATE function. Oracle7 automatically converts character values that are in the default date format into date values
when they are used in date expressions. The default date format is specified by the initialization parameter NLS_DATE_FORMAT and is a string such as ‘DD-MON-YY’. This example date format includes a two-digit number for the day of the month, an abbreviation of the month name and the last two digits of the year.
If you specify a date value without a time component, the default time is 12:00 a.m. (midnight). If you specify a time value without a date, the default date is the first day of the current month.

The date function SYSDATE returns the current date and time.
RAW and LONG RAW Data Types
The RAW and LONG RAW data types are used for data that is not to be interpreted (not converted when
moving data between different systems) by Oracle. These data types are intended for binary data or byte
strings. For example, LONG RAW can be used to store graphics, sound, documents or areas of binary data;
the interpretation is dependent on the use.
ROWID Data Type
Each row in the database has an address. You can examine a row’s address by querying the pseudo-column
ROWID. Values of this pseudo-column are hexadecimal strings representing the address of each row.
These strings have the data type ROWID. You can also create tables and clusters that contain actual
columns having the ROWID data type. Oracle7 does not guarantee that the values of such columns are
valid ROWIDs.
MLSLABEL Data Type
The MLSLABEL data type is used to store the binary format of a label used on a secure operating system. Labels are used by Trusted Oracle7 to mediate access to information. You can also define columns with this data type if you are using the standard Oracle7 server.
Nulls

If a column in a row has no value, then the column is said to be null, or to contain a null. Nulls can appear in columns of any data type that are not restricted by NOT NULL or PRIMARY KEY integrity constraints. Use a null when the actual value is not known or when a value would not be meaningful. Oracle7 currently treats a character value with a length of zero as null; however, this may not continue to be true in future versions of Oracle7. Do not use null to represent a value of zero, because they are not equivalent. Any arithmetic expression containing a null always evaluates to null; for example, null added to 10 is null. In fact, all operators (except concatenation) return null when given a null operand.
Tables

All data in a relational database is stored in tables. Every table has a table name and a set of columns and rows in which the data is stored. Each column is given a column name and a data type (defining characteristics of the data to be entered in the column). Usually in a relational database, some of the columns in different tables contain the same information. In this way, the tables can refer to one another.
For example, you might want to create a database containing information about the products your company
manufactures. In a relational database, you can create several tables to store different pieces of information
about your products, such as an inventory table, a manufacturing table and a shipping table. Each table
would include columns to store data appropriate to the table (for example, the inventory table would
include a column showing how much stock is on hand) and a column for the product’s part number.
Views
A view is a customized presentation of the data from one or more tables. Views derive their data from the tables on which they are based, which are known as base tables. All operations performed on a view actually affect the base tables of the view. You can use views for several purposes.

To give an additional level of table security: by restricting access to a predetermined set of table rows and columns. For example, you can create a view of a table that does not include sensitive data (e.g., salary information).
To hide data complexity: Oracle7 databases usually include many tables and by creating a view combining
information from two or more tables, you make it easier for other users to access information from your
database. For example, you might have a view that is a combination of your Employee table and
Department table. A user looking at this view, which you have called emp_dept, only has to go to one place
to get information, instead of having to access the two tables that make up this view.
To present the data in a different perspective from that of the base table: a view provides a means to rename columns without affecting the base table.

To store complex queries: a query might perform extensive calculations with table information. By saving this query as a view, the calculations are performed only when the view is queried.
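The emp_dept view described above can be sketched with Python's sqlite3; the table contents below are invented sample data, and SQLite's dialect differs slightly from Oracle7's:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dept (deptno INTEGER PRIMARY KEY, dname TEXT);
CREATE TABLE emp  (empno INTEGER PRIMARY KEY, ename TEXT, deptno INTEGER);
INSERT INTO dept VALUES (10, 'ACCOUNTING'), (20, 'RESEARCH');
INSERT INTO emp  VALUES (7369, 'SMITH', 20), (7782, 'CLARK', 10);

-- A view joining the two base tables: users query one object
-- instead of joining Employee and Department themselves.
CREATE VIEW emp_dept AS
    SELECT e.ename, d.dname
    FROM emp e JOIN dept d ON e.deptno = d.deptno;
""")

rows = con.execute("SELECT ename, dname FROM emp_dept ORDER BY ename").fetchall()
print(rows)  # [('CLARK', 'ACCOUNTING'), ('SMITH', 'RESEARCH')]
```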
Indexes
An index is used to quickly retrieve information from a database. Just as a book's index helps you find specific information faster, a database index provides faster access to table data. Indexing creates an index file consisting of a list of records in logical record order, along with their corresponding physical positions in the table. You can use indexes to rapidly locate and display records, which is especially important with large tables, or with databases composed of many tables.
Indexes are created on one or more columns of a table. Once created, an index is automatically maintained
and used by the Oracle7 Database. Changes to table data (such as adding new rows, or deleting rows) are
automatically incorporated into all relevant indexes.
To understand how an index works, suppose you have created an employee table containing the first name, last name and employee ID number of hundreds of employees, and that you entered the name of each employee into the table as they were hired. Now, suppose you want to locate a particular record in the table. Because you entered information about each employee in no particular order, the DBMS must do a great deal of database searching to find the record.

If you create an index using the LAST_NAME column of your employee table, the DBMS has to do much less searching and can return the results of a query very quickly.
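The LAST_NAME index can be sketched with sqlite3; all names here are invented for illustration. Once the index exists, the DBMS maintains and uses it automatically, which we can confirm by asking for the query plan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (emp_id INTEGER, first_name TEXT, last_name TEXT)")
con.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [(i, "F%d" % i, "L%d" % i) for i in range(1000)])

# Without an index this query would scan the whole table;
# with the index, the engine searches the index instead.
con.execute("CREATE INDEX idx_last_name ON employee (last_name)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employee WHERE last_name = 'L500'"
).fetchone()
print(plan)  # the plan's detail string mentions idx_last_name
```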
The tables in the following sections provide a functional summary of SQL commands, divided into these categories:

•	Data Definition Language commands
•	Data Manipulation Language commands
•	Transaction Control commands
•	Session Control commands
•	System Control commands
•	Embedded SQL commands
Student Activity 3.5
Before reading the next section, answer the following questions:
1. What do you understand by SQL?
2. What is a view in a DBMS?
3. Why do we use indexes?
If your answers are correct, then proceed to the next section.
Data Definition Language Commands
Data Definition Language (DDL) commands allow you to perform these tasks:

•	Create, alter and drop objects;
•	Grant and revoke privileges and roles;
•	Analyze information on a table, index, or cluster;
•	Establish auditing options.
The CREATE, ALTER and DROP commands require exclusive access to the object being acted upon. For example, an ALTER TABLE command fails if another user has an open transaction on the specified table. The GRANT, REVOKE, ANALYZE, AUDIT and COMMENT commands do not require exclusive access to the object being acted upon. For example, you can analyze a table while other users are updating the table.
The following table shows the Data Definition Language commands.

ALTER CLUSTER: change the storage characteristics of a cluster.
ALTER DATABASE: open or mount the database; add, drop or resize data files; perform media recovery.
ALTER FUNCTION: recompile a stored function.
ALTER INDEX: change the storage allocation of an index.
ALTER PACKAGE: recompile a stored package.
ALTER PROCEDURE: recompile a stored procedure.
ALTER PROFILE: add or remove a resource limit from a profile.
ALTER RESOURCE COST: specify the formula used to calculate the cost of the resources used by a session.
ALTER ROLE: change the authorization needed to use a role.
ALTER ROLLBACK SEGMENT: change a rollback segment’s storage characteristics.
ALTER SEQUENCE: change the sequence of values generated.
ALTER SNAPSHOT: change a snapshot’s storage characteristics or refresh mode and time.
ALTER SNAPSHOT LOG: change a snapshot log’s storage characteristics.
ALTER TABLE: add a column or integrity constraint; redefine a column; change a table’s storage characteristics; enable, disable or drop an integrity constraint; enable or disable table locks or triggers; allocate an extent for the table.
ALTER TABLESPACE: add or rename data files; change storage characteristics; take a tablespace online or offline; begin or end a backup; allow or disallow writing to a tablespace.
ALTER TRIGGER: enable or disable a database trigger.
ALTER USER: change a user’s password, default tablespace, temporary tablespace, tablespace quotas, profile, or default roles.
ALTER VIEW: recompile a view.
ANALYZE: collect performance statistics, validate structure, or identify chained rows for a table, cluster, or index.
AUDIT: choose auditing for specified SQL commands and operations on schema objects.
COMMENT: add a comment about a table, view, snapshot, or column to the data dictionary.
CREATE CLUSTER: create a cluster.
CREATE CONTROLFILE: recreate a control file (can be used to re-create the database).
CREATE DATABASE: create a database.
CREATE DATABASE LINK: create a link to a remote database.
CREATE FUNCTION: create a stored function.
CREATE INDEX: create an index for a table or cluster.
CREATE PACKAGE: create the specification of a stored package.
CREATE PACKAGE BODY: create the body of a stored package.
CREATE PROCEDURE: create a stored procedure.
CREATE PROFILE: create a profile and specify its resource limits.
CREATE ROLE: create a role.
CREATE ROLLBACK SEGMENT: create a rollback segment.
CREATE SCHEMA: issue multiple CREATE TABLE, CREATE VIEW and GRANT statements in a single transaction.
CREATE SEQUENCE: create a sequence for generating sequential values.
CREATE SNAPSHOT: create a snapshot of data from one or more remote master tables.
CREATE SNAPSHOT LOG: create a snapshot log containing changes made to the master table of a snapshot.
CREATE SYNONYM: create a synonym for a schema object.
CREATE TABLE: create a table, defining its columns, integrity constraints and storage allocation.
CREATE TABLESPACE: create a place in the database for storage of schema objects, rollback segments and temporary segments, naming the data files to comprise the tablespace.
CREATE TRIGGER: create a database trigger.
CREATE USER: create a database user.
CREATE VIEW: define a view of one or more tables or views.
DROP CLUSTER: remove a cluster from the database.
DROP DATABASE LINK: remove a database link.
DROP FUNCTION: remove a stored function from the database.
DROP INDEX: remove an index from the database.
DROP PACKAGE: remove a stored package from the database.
DROP PROCEDURE: remove a stored procedure from the database.
DROP PROFILE: remove a profile from the database.
DROP ROLE: remove a role from the database.
DROP ROLLBACK SEGMENT: remove a rollback segment from the database.
DROP SEQUENCE: remove a sequence from the database.
DROP SNAPSHOT: remove a snapshot from the database.
DROP SNAPSHOT LOG: remove a snapshot log from the database.
DROP SYNONYM: remove a synonym from the database.
DROP TABLE: remove a table from the database.
DROP TABLESPACE: remove a tablespace from the database.
DROP TRIGGER: remove a trigger from the database.
DROP USER: remove a user and the objects in the user’s schema from the database.
DROP VIEW: remove a view from the database.
GRANT: grant system privileges, roles and object privileges to users and roles.
NOAUDIT: disable auditing by reversing, partially or completely, the effect of a prior AUDIT statement.
RENAME: change the name of a schema object.
REVOKE: revoke system privileges, roles and object privileges from users and roles.
TRUNCATE: remove all rows from a table or cluster and free the space that the rows used.
Student Activity 3.6
Before reading the next section, answer the following questions:
1. What do you understand by DDL?
2. Make a list of commands used in DDL.
If your answers are correct, then proceed to the next section.
Data Manipulation Language Commands

Data Manipulation Language (DML) commands query and manipulate data in existing schema objects. These commands do not implicitly commit the current transaction.
The following table shows the Data Manipulation Language commands.

DELETE: remove rows from a table.
EXPLAIN PLAN: return the execution plan for a SQL statement.
INSERT: add new rows to a table.
LOCK TABLE: lock a table or view, limiting access to it by other users.
SELECT: select data in rows and columns from one or more tables.
UPDATE: change data in a table.
Transaction Control Commands

Transaction Control commands manage the changes made by Data Manipulation Language commands.
The following table shows the Transaction Control commands.

COMMIT: make permanent the changes made by statements issued since the beginning of a transaction.
ROLLBACK: undo all changes made since the beginning of the transaction or since a savepoint.
SAVEPOINT: establish a point in the transaction back to which you may roll back.
SET TRANSACTION: establish properties for the current transaction.
The description and syntax of some of the primary commands used in SQL are explained next.
Writing SQL Commands
When writing SQL commands, it is important to remember a few simple rules and guidelines in order to
construct valid statements that are easy to read and edit:
• SQL commands may be on one or many lines.
• Clauses are usually placed on separate lines.
• Tabulation can be used.
• Command words cannot be split across lines.
• SQL commands are not case sensitive (unless indicated otherwise).
• An SQL command is entered at the SQL prompt and subsequent lines are numbered. This is called the
SQL buffer.
• Only one statement can be current at any time within the buffer, and it can be run in a number of ways:
Place a semi-colon (;) at the end of the last clause.
Place a semi-colon or forward slash on the last line in the buffer.
Place a forward slash at the SQL prompt.
Issue a RUN command at the SQL prompt.
Any one of the following statements is valid:

SELECT * FROM EMP;

SELECT
*
FROM
EMP
;

SELECT *
FROM EMP;
Here, SQL commands are split into clauses in the interests of clarity.
DESCRIBE

PURPOSE: To view the structure of a table.

SYNTAX:
DESCRIBE <tablename>

E.g.
DESCRIBE emp
SELECT

PURPOSE: To extract information from a table.

SYNTAX:
SELECT column1, column2
FROM tablename
[WHERE condition]

SELECT * FROM tablename

E.g.
SELECT empno, ename, job, salary FROM emp
WHERE ename = 'KAMAL'
COLUMN EXPRESSION

SELECT empno, ename, job, salary*1000 "NET SALARY"
FROM emp
WHERE salary > 1000;
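These SELECT forms can be tried in any SQL engine; below is a sketch using Python's sqlite3 with invented sample rows (SQLite accepts the same basic SELECT/WHERE syntax as Oracle7):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (empno INTEGER, ename TEXT, job TEXT, salary REAL)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?, ?)",
                [(1, "KAMAL", "CLERK", 1200.0),
                 (2, "ANITA", "ANALYST", 900.0)])

# WHERE restricts the rows; the column list projects the columns
row = con.execute(
    "SELECT empno, ename, job, salary FROM emp WHERE ename = 'KAMAL'"
).fetchone()
print(row)  # (1, 'KAMAL', 'CLERK', 1200.0)

# A column expression with an alias, as in the NET SALARY example
rows = con.execute(
    'SELECT ename, salary * 1000 AS "NET SALARY" FROM emp WHERE salary > 1000'
).fetchall()
print(rows)  # [('KAMAL', 1200000.0)]
```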
The SELECT Statement
The SELECT statement retrieves information from the database, implementing all the operators of
Relational Algebra.
In its simplest form, it must include:
1. A SELECT clause, which lists the columns to be displayed i.e. it is essentially a PROJECTION.
2. A FROM clause, which specifies the table involved.
To list all department numbers, employee names and manager numbers in the EMP table you enter the
following:
SELECT DEPTNO, ENAME, MGR
FROM EMP;

[sample output: the DEPTNO, ENAME and MGR of every row in EMP]
Note that column names are separated by a comma.
It is possible to select all columns from the table by specifying an * (asterisk) after the SELECT command word.
SELECT *
FROM EMP;

[sample output: every column — EMPNO, ENAME, JOB, MGR, HIREDATE, SAL, COMM, DEPTNO — of every row in EMP]
Other Items in the SELECT Clause

It is possible to include other items in the SELECT clause:
•	Arithmetic expressions
•	Column aliases
•	Concatenated columns
•	Literals
All these options allow the user to query data and manipulate it for query purposes; for example, performing calculations, joining columns together, or displaying literal text strings.

Arithmetic Expressions

An expression is a combination of one or more values, operators and functions, which evaluates to a value. Arithmetic expressions may contain column names, constant numeric values and the arithmetic operators:
+	add
-	subtract
*	multiply
/	divide

SELECT ENAME, SAL * 12
FROM EMP;
If your arithmetic expression contains more than one operator, the priority is * and / first, then + and - (left to right if there are several operators with the same priority).

In the following example, the multiplication (250 * 12) is evaluated first; then the salary value is added to the result of the multiplication (3000). So for Smith's row: 800 + 3000 = 3800.
SELECT ENAME, SAL + 250 * 12
FROM EMP;
Parentheses may be used to specify the order in which operators are to be executed; if, for example, addition is required before multiplication:
SELECT ENAME, (SAL + 250) * 12
FROM EMP;
Column Aliases
When displaying the result of a query, SQL*Plus normally uses the selected column's name as the heading. In many cases this may be cryptic or meaningless. You can change a column's heading by using an alias.

A column alias gives a column an alternative heading on output. Specify the alias after the column in the select list. By default, alias headings will be forced to uppercase and cannot contain blank spaces, unless the alias is enclosed in double quotes (" ").

To display the column heading ANNSAL for annual salary instead of SAL*12, use a column alias:
SELECT ENAME, SAL * 12 ANNSAL
FROM EMP;
Once defined, an alias can be used with SQL*Plus commands.
Note: Within a SQL statement, a column alias can only be used in the ORDER BY clause.
The concatenation operator (||) allows columns to be linked to other columns, arithmetic expressions or constant values to create a character expression. Columns on either side of the operator are combined to make one single column.
To combine EMPNO and ENAME and give the alias EMPLOYEE to the expression, enter:
SELECT EMPNO||ENAME EMPLOYEE
FROM EMP;

[sample output: a single EMPLOYEE column combining each EMPNO and ENAME]
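The || operator is also supported by SQLite, so the EMPLOYEE alias example can be sketched with sqlite3 (one invented sample row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (empno INTEGER, ename TEXT)")
con.execute("INSERT INTO emp VALUES (7369, 'SMITH')")

# EMPNO and ENAME combined into one aliased column; the number
# is coerced to text before concatenation
row = con.execute("SELECT empno || ename AS employee FROM emp").fetchone()
print(row[0])  # '7369SMITH'
```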
Literals
A literal is any character, expression or number included in the SELECT list which is not a column name or a column alias.

A literal in the SELECT list is output for each row returned. Literal strings of free-format text can be included in the query result and are treated like a column in the select list.

The following statement contains literals selected with concatenation and a column alias:
[garbled example and output: each ENAME concatenated with a literal text string, displayed under a column alias]

Null Values
If a row lacks a data value for a particular column, that value is said to be null.
A null value is a value which is either unavailable, unassigned, unknown or inapplicable. A null value is
not the same as zero. Zero is a number. Null values take up one byte of internal ‘storage’ overhead.
Null Values in Expressions
If any column value in an expression is null, the result is null. In the following statement, only Salesmen
have a remuneration result:
SELECT ENAME, SAL * 12 + COMM ANNUAL_SAL
FROM EMP;
[sample output: ANNUAL_SAL is null for every employee who has no commission]
In order to achieve a result for all employees, it is necessary to convert the null value to a number. We use the NVL function to convert a null value to a non-null value.

Use the NVL function to convert the null values from the previous statement to zero:
SELECT ENAME, SAL * 12 + NVL(COMM, 0) ANNUAL_SAL
FROM EMP;

[sample output: every employee now has a numeric ANNUAL_SAL]
NVL expects two arguments:

1. an expression
2. a non-null value

Note that you can use the NVL function to convert a null number, date or even character string to another number, date or character string, as long as the data types match:

NVL (datecolumn, '01-JAN-88')
NVL (numbercolumn, 9)
NVL (charcolumn, 'string')
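NVL is Oracle-specific; SQLite spells the same idea IFNULL, and the portable form is COALESCE. A sketch of the ANNUAL_SAL calculation with invented rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal REAL, comm REAL)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                [("ALLEN", 1600.0, 300.0), ("SMITH", 800.0, None)])

# Without the conversion, SMITH's result is NULL (Python None)
rows_null = con.execute(
    "SELECT ename, sal*12 + comm FROM emp ORDER BY ename").fetchall()
print(rows_null)  # [('ALLEN', 19500.0), ('SMITH', None)]

# COALESCE plays the role of Oracle's NVL: null COMM becomes 0
rows_nvl = con.execute(
    "SELECT ename, sal*12 + COALESCE(comm, 0) FROM emp ORDER BY ename").fetchall()
print(rows_nvl)  # [('ALLEN', 19500.0), ('SMITH', 9600.0)]
```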
Preventing the Selection of Duplicate Rows
Unless you indicate otherwise, SQL*Plus displays the result of a query without eliminating duplicate entries.

To list all department numbers in the EMP table, enter:
SELECT DEPTNO
FROM EMP;

[sample output: one DEPTNO per employee, with duplicates]
To eliminate duplicate values in the result, include the DISTINCT qualifier in the SELECT command.
To eliminate the duplicate values displayed in the previous example, enter:
SELECT DISTINCT DEPTNO
FROM EMP;

[sample output: each DEPTNO listed once — 10, 20, 30]
Multiple columns may be specified after the DISTINCT qualifier and the DISTINCT affects all selected
columns.
To display distinct values of DEPTNO and JOB, enter:

SELECT DISTINCT DEPTNO, JOB
FROM EMP;

[sample output: each distinct combination of DEPTNO and JOB]
This displays a list of all different combinations of jobs and department numbers.
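Both forms of DISTINCT can be sketched with sqlite3 (invented rows; ORDER BY is added so the output order is fixed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (deptno INTEGER, job TEXT)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [(10, 'CLERK'), (10, 'CLERK'), (10, 'MANAGER'),
                 (20, 'CLERK'), (20, 'CLERK')])

# Duplicates removed over the single selected column
d1 = con.execute("SELECT DISTINCT deptno FROM emp ORDER BY deptno").fetchall()
print(d1)  # [(10,), (20,)]

# DISTINCT applies to the combination of all selected columns
d2 = con.execute(
    "SELECT DISTINCT deptno, job FROM emp ORDER BY deptno, job").fetchall()
print(d2)  # [(10, 'CLERK'), (10, 'MANAGER'), (20, 'CLERK')]
```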
Note that the order of rows returned in a query result is undefined. The ORDER BY clause may be used to sort the rows. If used, ORDER BY must always be the last clause in the SELECT statement.
To sort by ENAME, enter:
SELECT *
FROM EMP
ORDER BY ENAME;

[sample output: all rows of EMP, sorted alphabetically by ENAME]

Default Ordering of Data
The default sort order is ASCENDING:

•	Numeric values lowest first
•	Date values earliest first
•	Character values alphabetically
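Ascending and descending sorts can be sketched with sqlite3 (invented rows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal REAL)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("SMITH", 800.0), ("KING", 5000.0), ("ALLEN", 1600.0)])

# Default order is ascending (alphabetical for character values)
names_asc = [r[0] for r in con.execute("SELECT ename FROM emp ORDER BY ename")]
print(names_asc)  # ['ALLEN', 'KING', 'SMITH']

# DESC after the column reverses the order (highest salary first)
sal_desc = [r[0] for r in con.execute("SELECT ename FROM emp ORDER BY sal DESC")]
print(sal_desc)  # ['KING', 'ALLEN', 'SMITH']
```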
Reversing the Default Order
To reverse the order, the command word DESC is specified after the column name in the ORDER BY
clause.
To reverse the order of the HIREDATE column, so that the latest dates are displayed first, enter:
SELECT ENAME, HIREDATE
FROM EMP
ORDER BY HIREDATE DESC;

[sample output: rows sorted with the latest HIREDATE first]
Ordering by Many Columns
It is possible to ORDER BY more than one column. The limit is the number of columns in the table. In the ORDER BY clause, specify the columns to order by, separated by commas. If any or all are to be reversed, specify DESC after any or each column.

To order by two columns and display in reverse order of salary, enter:
SELECT DEPTNO, ENAME, SAL
FROM EMP
ORDER BY DEPTNO, SAL DESC;

[sample output: rows grouped by DEPTNO and, within each department, sorted by SAL highest first]

The WHERE Clause
The WHERE clause corresponds to the Restriction operator of Relational Algebra.
It contains a condition that rows must meet in order to be displayed.
The WHERE clause, if used, must follow the FROM clause:
SELECT columns
FROM table
WHERE certain condition(s) are met
The WHERE clause may compare values in columns, literal values, arithmetic expressions or functions. The WHERE clause expects three elements:

1. A column name
2. A comparison operator
3. A column name, constant or list of values.

Comparison operators are used in the WHERE clause and can be divided into two categories: logical and SQL.
Logical Operators

The logical operators will test the following conditions:

=	equal to
>	greater than
>=	greater than or equal to
<	less than
<=	less than or equal to
Character Strings and Dates in the WHERE Clause

ORACLE columns may be: Character, Number or Date.

Character strings and dates in the WHERE clause must be enclosed in single quotation marks. Character strings must match the case of the column value unless modified by a function.

To list the names, numbers, jobs and departments of all clerks, enter:
SELECT ENAME, EMPNO, JOB, DEPTNO
FROM EMP
WHERE JOB = 'CLERK';

[sample output: one row per clerk]
To find department names with department numbers greater than 20, enter:
SELECT DNAME, DEPTNO
FROM DEPT
WHERE DEPTNO > 20;

[sample output: the departments whose DEPTNO exceeds 20]
You can compare a column with another column in the same row, as well as with a constant value.

For example, suppose you want to find those employees whose commission is greater than their salary; enter:
SELECT ENAME, SAL, COMM
FROM EMP
WHERE COMM > SAL;

[sample output: the employees whose COMM exceeds their SAL]
SQL Operators

There are four SQL operators, which operate with all data types:

BETWEEN…AND…	between two values (inclusive)
IN (list)	match any of a list of values
LIKE	match a character pattern
IS NULL	is a null value
BETWEEN Operator

Purpose: Tests for values between, and inclusive of, a low and a high range.

Suppose we want to see those employees whose salary is between 1000 and 2000:
SELECT ENAME, SAL
FROM EMP
WHERE SAL BETWEEN 1000 AND 2000;

[sample output: the employees earning between 1000 and 2000 inclusive]
Note that values specified are inclusive and the lower limit must be specified first.
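The inclusiveness of BETWEEN, and the requirement that the lower limit come first, can be checked with constant expressions in sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# BETWEEN includes both limits: 1000 is inside 1000..2000
a = con.execute("SELECT 1000 BETWEEN 1000 AND 2000").fetchone()[0]
print(a)  # 1 (true)

# With the limits reversed the range is empty, so nothing matches
b = con.execute("SELECT 1500 BETWEEN 2000 AND 1000").fetchone()[0]
print(b)  # 0 (false)
```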
IN Operator

Purpose: Tests for existence of values in a specified list.
To find all employees who have one of the three MGR numbers, enter:
SELECT EMPNO, ENAME, SAL, MGR
FROM EMP
WHERE MGR IN (7902, 7566, 7788);

[sample output: the employees whose manager number is 7902, 7566 or 7788]
If characters or dates are used in the list, they must be enclosed in single quotes (' ').
LIKE Operator
Sometimes you may not know the exact value to search for. Using the LIKE operator, it is possible to select rows that match a character pattern. The character pattern matching operation may be referred to as a 'wild-card' search. Two symbols can be used to construct the search string:
%	represents any sequence of zero or more characters
_	represents any single character
To list all employees whose name starts with an S, enter:
SELECT ENAME
FROM EMP
WHERE ENAME LIKE 'S%';
The _ can be used to search for a specific number of characters. For example, to list all employees who have a name exactly 4 characters in length:
2
; G
G
2
;
The % and * may be used in any combination with literal characters.
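A minimal sketch of the wildcard search described above, using sqlite3 with invented names (not the book's EMP table):

```python
import sqlite3

# Invented sample names for demonstrating LIKE patterns.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT)")
con.executemany("INSERT INTO emp VALUES (?)",
                [("SMITH",), ("SCOTT",), ("KING",), ("WARD",), ("FORD",)])

# '%' matches any sequence of characters; '_' matches exactly one character.
starts_with_s = con.execute(
    "SELECT ename FROM emp WHERE ename LIKE 'S%' ORDER BY ename").fetchall()
four_chars = con.execute(
    "SELECT ename FROM emp WHERE ename LIKE '____' ORDER BY ename").fetchall()
print(starts_with_s)  # names beginning with S
print(four_chars)     # names of exactly four characters
```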
"
11
The ISNULL operator specifically tests for values that are NULL.
So to find all employees who have no manager, you are testing for a NULL:
RELATIONAL DATA MANIPULATION
,
87
2
ENAME MGR
KING
(
The following operators are negative tests:

Operator              Definition
!=, ^=, <>            not equal to
NOT BETWEEN x AND y   not between two given values
NOT IN (list)         not in a given list of values
NOT LIKE              does not match a character pattern
IS NOT NULL           is not a null value
To find employees whose salary is not between 1000 and 2000, enter:

SELECT ename, sal
FROM emp
WHERE sal NOT BETWEEN 1000 AND 2000;
NOT LIKE
To list all employees whose name does not start with 'M', enter:

SELECT ename, sal
FROM emp
WHERE ename NOT LIKE 'M%';

IS NOT NULL
To find all employees who have a manager, enter:

SELECT ename, mgr
FROM emp
WHERE mgr IS NOT NULL;
Note
• If a NULL value is used in a comparison, then the comparison operator should be either IS NULL or IS NOT NULL. If these operators are not used and NULL values are compared, the result is always FALSE.
• For example, COMM != NULL is always FALSE. The result is false because a NULL value can neither be equal nor unequal to any other value, even another NULL. Note that an error is not raised; the result is simply always false.
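The note above can be verified directly; this sketch uses sqlite3, which follows the same three-valued NULL logic, with invented sample rows:

```python
import sqlite3

# Invented rows: KING has no commission (NULL).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, comm REAL)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("ALLEN", 300.0), ("TURNER", 0.0), ("KING", None)])

# comm != NULL is neither true nor false for any row, so no row is selected.
neq_null = con.execute("SELECT ename FROM emp WHERE comm != NULL").fetchall()

# IS NULL is the correct test and finds the row.
is_null = con.execute("SELECT ename FROM emp WHERE comm IS NULL").fetchall()
print(neq_null)  # empty: the comparison never succeeds
print(is_null)   # the one employee with no commission
```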
Student Activity 3.7
Before reading the next section, answer the following questions:
1. What do you understand by DML?
2. What is the use of Insert, Delete and Update commands?
3. Why do we use select statement?
If your answers are correct, then proceed to the next section.
AND / OR
The AND and OR operators may be used to make compound logical expressions.
The AND predicate will expect both conditions to be 'true', whereas the OR predicate will expect either condition to be 'true'.
In the following two examples, the conditions are the same, the predicate is different. See how the result is
dramatically changed.
To find all clerks who earn between 1000 and 2000, enter:
SELECT ename, job, sal
FROM emp
WHERE job = 'CLERK'
AND sal BETWEEN 1000 AND 2000;
To find all employees who are either clerks and/or employees who earn between 1000 and 2000, enter:

SELECT ename, job, sal
FROM emp
WHERE job = 'CLERK'
OR sal BETWEEN 1000 AND 2000;
You may combine AND and OR in the same logical expression. When AND and OR appear in the same WHERE clause, all the ANDs are performed first, then all the ORs are performed. We say that AND has a higher precedence than OR.
Since AND has a higher precedence than OR, the following SQL statement returns all managers with salaries over $1500 and all salesmen.
SELECT ename, job, sal
FROM emp
WHERE job = 'MANAGER'
AND sal > 1500
OR job = 'SALESMAN';
If you wanted to select all managers and salesmen with salaries over $1500, you would enter:
SELECT ename, job, sal
FROM emp
WHERE (job = 'MANAGER' OR job = 'SALESMAN')
AND sal > 1500;
The parentheses specify the order in which the operators should be evaluated. In the second example, the
OR operator is evaluated before the AND operator.
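The precedence rule above can be sketched with sqlite3; the rows are invented, but the two WHERE clauses mirror the pair of queries just shown:

```python
import sqlite3

# Invented sample rows: two managers and two salesmen.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, job TEXT, sal REAL)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                [("JONES", "MANAGER", 2975),
                 ("CLARK", "MANAGER", 1450),
                 ("WARD", "SALESMAN", 1250),
                 ("ALLEN", "SALESMAN", 1600)])

# AND binds tighter than OR, so every salesman is kept regardless of salary.
no_parens = con.execute(
    "SELECT ename FROM emp "
    "WHERE job = 'MANAGER' AND sal > 1500 OR job = 'SALESMAN' "
    "ORDER BY ename").fetchall()

# Parentheses force the OR to be evaluated first.
parens = con.execute(
    "SELECT ename FROM emp "
    "WHERE (job = 'MANAGER' OR job = 'SALESMAN') AND sal > 1500 "
    "ORDER BY ename").fetchall()
print(no_parens)  # includes WARD, a salesman earning under 1500
print(parens)     # only the rows that also pass the salary test
```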
"
Purpose: This command is used to add new records into a table.
SYNTAX: INSERT INTO<table name>
VALUES (value 1. value 2…………)
E.G.
INSERT INTO emp
VALUES (‘101’, king’, ‘President’, 17-NOV-91’, 5000, null, ‘10’)
INSERT INTO emp (empno, deptno, ename)
VALUES (‘101’, 29’, ANITA’);
INSERT INTO TABLE (column 1, column 2………..)
SELECT column 1, column 2…………………….
FROM TABLE WHERE (condition);
E.g.
INSERT INTO emp (empno, deptno, ename)
SELECT empno, deptno, ename
FROM emp WHERE <condition>;
UPDATE
Purpose: Used to change the values of fields in a specified table.
Syntax: UPDATE <tablename>
SET column1 = expression, column2 = expression
WHERE condition;
E.g.
UPDATE emp
SET sal = 1000
WHERE empno = 7844;
Note: If the WHERE clause is omitted, all rows are updated.
DELETE
Purpose: Removes rows from a table.
Syntax: DELETE FROM <table name>
WHERE <condition>;
E.g.
DELETE FROM emp
WHERE sal > 1000;
Note: If the WHERE clause is omitted, all the rows of the specified table will be deleted. With a WHERE clause, a particular row (or rows) can be deleted.
COMMIT
Purpose: COMMIT is used to make permanent the changes [Insert, Update, Delete] that have been made.
ROLLBACK
Purpose: This is used to undo (discard) all the changes that have been completed but not made permanent in the table by a COMMIT, i.e., all changes since the last COMMIT.
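COMMIT and ROLLBACK semantics can be sketched with Python's sqlite3 transactions standing in for the Oracle commands described above; the table and rows are invented:

```python
import sqlite3

# A throwaway table to demonstrate transaction behaviour.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT)")
con.execute("INSERT INTO emp VALUES ('KING')")
con.commit()      # COMMIT: the first insert is now permanent

con.execute("INSERT INTO emp VALUES ('SMITH')")
con.rollback()    # ROLLBACK: discards everything since the last COMMIT

names = [r[0] for r in con.execute("SELECT ename FROM emp")]
print(names)  # only the committed row survives
```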
Data Definition Language
• Data definition language is used to create, alter or remove a data structure or table in the ORACLE database structure.
• In the ORACLE database, data is stored in data structures known as tables, comprising rows and columns. A table is created in a logical unit of the database called a table space. A database may have one or more table spaces.
• The table space is divided into segments; a segment is a set of database blocks allocated for the storage of database structures, namely tables, indexes etc.
• Segments are defined using storage parameters, which in turn are expressed in terms of extents of data. An extent is an allocation of database space which itself contains many Oracle blocks, the basic unit of storage.
Purpose: To create a TABLE Structure.
Syntax: CREATE TABLE <tablename>
(column datatype (size) [NULL | NOT NULL],
column datatype (size), …);
E.g. CREATE TABLE emp
(empno    NUMBER(4),
ename     VARCHAR2(10) NOT NULL,
job       VARCHAR2(9),
hiredate  DATE,
sal       NUMBER(7,2),
comm      NUMBER(7,2),
deptno    NUMBER(2) NOT NULL);
92
DATABASE SYSTEMS
E.g. CREATE TABLE dept
(deptno  NUMBER(2) CONSTRAINT dept_prim PRIMARY KEY,
dname    VARCHAR2(14),
loc      VARCHAR2(13));

CREATE TABLE emp
(empno   NUMBER(4) CONSTRAINT emp_prim PRIMARY KEY,
ename    VARCHAR2(10),
sal      NUMBER(7,2) CHECK (sal > 1000),
deptno   NUMBER(2) REFERENCES dept (deptno));
Purpose: ALTER is used to change the structure of an existing table
Syntax: ALTER TABLE <tablename>
[ADD | MODIFY] (column datatype, …);
E.g. ALTER TABLE emp
ADD (address VARCHAR2(20));
It is not possible to change the name of an existing column or delete an existing column.
The data type of an existing column can be changed, if the field is blank for all existing records.
VIEWS
•
View can be defined as a Logical (Virtual) table derived from one or more base tables or Views.
•
It is basically a subschema defined as a subset of the Schema.
•
Views are like windows through which one can view selected or restricted information.
•
A View is a data object that contains no data of its own but whose contents are taken from other tables
through the execution of a query.
•
Since the contents of the table change, the view would also change dynamically.
Syntax: Create View <view name>
As <query>
[with CHECK OPTION];
•
Oracle implementation of WITH CHECK option places no restrictions on the form of query that may
be used in the AS clause.
•
One may UPDATE and DELETE rows in a view based on a single table, provided its query contains neither a GROUP BY clause nor the DISTINCT clause.
•
One may INSERT rows if the view observes the same restrictions and its query contains no columns defined by expressions.
E.g. In order to create a view of the EMP table named DEPT20, to show the employees in department 20 and their annual salary:
CREATE VIEW dept20
AS SELECT ename, sal*12 ANNUAL_SALARY
FROM emp
WHERE deptno = 20;
Once the VIEW is up, it can be treated like any other table
SELECT * FROM dept20;
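The "window" behaviour of a view, including its dynamic tracking of the base table, can be sketched with sqlite3; the DEPT20 view mirrors the one just created, and the rows are invented:

```python
import sqlite3

# Invented base-table rows; the view itself stores no data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal REAL, deptno INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                [("SMITH", 800, 20), ("ALLEN", 1600, 30)])
con.execute("""CREATE VIEW dept20 AS
               SELECT ename, sal * 12 AS annual_salary
               FROM emp WHERE deptno = 20""")

first = con.execute("SELECT * FROM dept20").fetchall()
print(first)  # only the department-20 employee, with annual salary

# The view's contents change dynamically with the base table.
con.execute("INSERT INTO emp VALUES ('SCOTT', 3000, 20)")
rows = con.execute("SELECT ename FROM dept20 ORDER BY ename").fetchall()
print(rows)  # the new department-20 row appears automatically
```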
SEQUENCE
Purpose: Creates a database object to generate unique integers.
Syntax: CREATE SEQUENCE seq_name
[INCREMENT BY n]
[START WITH n]
[MAXVALUE n]
[MINVALUE n]
[CYCLE | NOCYCLE];
E.g.
CREATE SEQUENCE order_seq
INCREMENT BY 1
START WITH 100
MAXVALUE 9999
NOCYCLE;
PURPOSE: To create an index on one or more columns of a table or a cluster.
SYNTAX: CREATE [UNIQUE] INDEX index_name
ON table_name
(column_name[, column_name…])
TABLESPACE tablespace
E.g. CREATE INDEX i_emp_ename ON emp (ename);
Student Activity 3.8
Answer the following questions:
1. What is SQL? Define its types.
2. What is the functionality of the CREATE and ALTER commands?
Summary
•
Basic sets of relational model operations constitute the relational algebra.
•
A sequence of relational algebra operations forms a relational algebra expression, whose result will also
be a relation.
•
The set of relational algebra operations {σ, π, U, –, x } is a complete set; that is, any of the other
relational algebra operations can be expressed as a sequence of operations from this set.
•
Relational calculus is an alternative to relational algebra. The calculus is nonprocedural, or declarative; that is, it allows one to describe the set of answers without being explicit about how they should be computed.
•
Structured Query Language (SQL), pronounced “sequel”, is the set of commands that all programs and
users use to access data within the Oracle7 database.
•
The strengths of SQL benefit all ranges of users including application programmers, database
administrators, management and end users.
Self-check Questions
I. True or False
1. Basic sets of relational model operations constitute the relational algebra.
2. The SELECT operation is used to select a subset of the columns from a relation that satisfy a selection condition.
3. The SELECT operator is binary.
II. Fill in the Blanks
1. The fraction of tuples selected by a __________ condition is referred to as the selectivity of the condition.
2. The _________ operation selects certain columns from the table and discards the other columns.
3. The CARTESIAN PRODUCT creates tuples with the ________ attributes of two relations.
4. ____________ is used to undo (discard) all the changes that have been completed but not made permanent in the table by a COMMIT.
5. Character strings must match case with the _________ value unless modified by a function.
Answers
I. True or False
1. True
2. False
3. False
II. Fill in the Blanks
1. selection
2. PROJECT
3. combined
4. rollback
5. column
Test Yourself
I. True or False
1. Set operations include UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN PRODUCT.
2. The selection operation is not applied to each tuple individually.
3. A NATURAL JOIN can be specified as a CARTESIAN PRODUCT preceded by RENAME and followed by SELECT and PROJECT operations.
4. Character strings and dates in the WHERE clause need not be enclosed in single quotation marks.
5. The IS NULL operator specifically tests for values that are NULL.
II. Fill in the Blanks
1. A ________ must include a set of operations to manipulate the data.
2. Cartesian Product is also known as ___________.
3. The JOIN operation is used to combine ____________ from two relations into single tuples.
4. __________ is an alternative to relational algebra.
5. The CHAR data type specifies a ____________ character string.
Review Questions
1. What is the difference between a key and a super key?
2. Discuss the various reasons that lead to the occurrence of null values in relations.
3. Discuss the entity integrity and referential integrity constraints. Why is each considered important?
4. Define foreign key. What is this used for? How does it play a role in the join operation?
5. Discuss the various update operations on relations and the types of integrity constraints that must be checked for each update operation.
6. List the operations of relational algebra and the purpose of each.
7. Discuss the various types of join operations. Why is join required?
8. How are the OUTER JOIN operations different from the (INNER) JOIN operation? How is the OUTER UNION operation different from UNION?
9. Suppose we have a table having a structure like employee (emp_id number(3), emp_name varchar2(15), dep_no number(2), emp_desig varchar2(5), salary number(8,2))
i) Create a table named employee as given above.
ii) Insert <1, 'John', 01, 'manager', 25000.00> into employee.
iii) Insert <2, 'Ram', 02, 'clerk', 5000.00> into employee.
iv) Insert <3, 'Ramesh', 03, 'Accountant', 7000.00> into employee.
v) Insert <4, 'Raje', 05, 'clerk', 500000.00> into employee.
Now make the following queries:
vi) Find the employee whose designation is manager.
vii) Find the employee's details whose salary is second largest.
viii) Find the names of all employees who are clerks.
ix) Find the employees who belong to the same department.
x) Find the sum of all salaries.
xi) Find the sum of salaries of the employees who belong to the same department.
xii) Update Ramesh's department no. to 04.
10. Design a relational database for a university registrar's office. The office maintains data about each class, including the instructor, the number of students enrolled, and the time and place of the class meetings. For each student-class pair, a grade is recorded.
List two reasons why null values might be introduced into the database.
Introduction
Update Operations on Relations
Functional Dependencies
Closure Of A Set of dependencies
First Normal Form
Second Normal Form
Third Normal Form
Boyce-Codd Normal Form
Fifth Normal Form
Relational Database Design
Learning Objectives
After reading this unit you should appreciate the following:
•
Introduction
•
Functional Dependencies
•
Normalisation
•
First Normal Form
•
Second Normal Form
•
Third Normal Form
•
Boyce-Codd Normal Form
•
Fourth Normal Form
•
Fifth Normal Form
The relational model is an abstract theory of data that is based on certain aspects of mathematics
(principally set theory and predicate logic).
The principles of the relational model were originally laid down in 1969-70 by Dr. E.F. Codd. The relational model is a way of looking at data. It is concerned with three aspects of data: structure, data integrity, and manipulation (for example join, projection etc.).
1. Structural aspect: The data in the database is perceived by the user as tables. It means the database is arranged in tables, and a collection of tables is called a database. Structure means the design view of the database, like data types, sizes etc.
2. Integrity aspect: Those tables satisfy certain integrity constraints like domain constraints, entity integrity, referential integrity and operational constraints.
3. Manipulative aspect: The operators available to the user for manipulating those tables in the database, e.g. for the purpose of retrieval of data, like project, join and restrict.
PURPOSE: Used to validate data entered for the specified column(s).
There are two types of constraints:
• Table Constraint
• Column Constraint
If the constraint spans multiple columns, the user will have to use table level constraints. If the data constraint attached to a specific cell in a table references the contents of another cell in the table, then the user will have to use table level constraints.
Primary key as a table level constraint.
E.g. Create table sales_order_details (s_order_no varchar2(6),
product_no varchar2(6), … PRIMARY KEY (s_order_no, product_no));
If the constraints are defined with the column definition, it is called a column level constraint. They are
local to a specific column.
Primary key as a column level constraint
Create table client (client_no varchar2(6) PRIMARY KEY, …);
Student Activity 4.1
Before reading the next section, answer the following questions:
1. What is relational model? Define its aspects.
2. What is relational model constraint? Define its types.
If your answers are correct, then proceed to the next section.
•
NOT NULL CONDITION
•
UNIQUENESS
•
PRIMARY KEY identification
•
FOREIGN KEY
•
CHECK the column value against a specified condition
RELATIONAL DATABASE DESIGN
101
Some important constraints features and their implementation have been discussed below:
A PRIMARY KEY constraint designates a column or combination of columns as the table’s primary key.
To satisfy a PRIMARY KEY constraint, both the following conditions must be true:
•
No primary key value can appear in more than one row in the table.
•
No column that is part of the primary key can contain a null.
A table can have only one primary key.
A primary key column cannot be of data type LONG or LONG RAW. You cannot designate the same
column or combination of columns as both a primary key and a unique key or as both a primary key and a
cluster key. However, you can designate the same column or combination of columns as both a primary key
and a foreign key.
You can use the column_constraint syntax to define a primary key on a single column.
Example
The following statement creates the DEPT table and defines and enables a primary key on the DEPTNO
column:
CREATE TABLE dept
(deptno NUMBER(2) CONSTRAINT pk_dept PRIMARY KEY,
dname VARCHAR2(10));
The PK_DEPT constraint identifies the DEPTNO column as the primary key of the DEPT table. This
constraint ensures that no two departments in the table have the same department number and that no
department number is NULL.
Alternatively, you can define and enable this constraint with table_ constraint syntax:
CREATE TABLE dept
(deptno NUMBER(2),
dname VARCHAR2(9),
loc VARCHAR2(10),
CONSTRAINT pk_dept PRIMARY KEY (deptno));
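Both PRIMARY KEY conditions above (no duplicate values, no nulls) can be sketched with sqlite3; note that SQLite, unlike Oracle, permits NULL primary keys unless NOT NULL is spelled out, so it is added here. The rows are invented:

```python
import sqlite3

# DEPT-like table; "INT" (not "INTEGER") avoids SQLite's rowid-alias
# behaviour so the NOT NULL check is actually exercised.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dept
               (deptno INT NOT NULL PRIMARY KEY,
                dname  TEXT)""")
con.execute("INSERT INTO dept VALUES (10, 'ACCOUNTING')")

# Condition 1: no primary key value may appear in more than one row.
try:
    con.execute("INSERT INTO dept VALUES (10, 'RESEARCH')")
    duplicate_ok = True
except sqlite3.IntegrityError:
    duplicate_ok = False

# Condition 2: the primary key may not contain a null.
try:
    con.execute("INSERT INTO dept VALUES (NULL, 'SALES')")
    null_ok = True
except sqlite3.IntegrityError:
    null_ok = False

print(duplicate_ok, null_ok)  # both inserts are rejected
```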
Composite Primary Keys
A composite primary key is a primary key made up of a combination of columns. Because Oracle 7 creates
an index on the columns of a primary key, a composite primary key can contain a maximum of 16 columns.
To define a composite primary key, you must use the table_constraint syntax, rather than the
column_constraint syntax.
Example
The following statement defines a composite primary key on the combination of the SHIP_NO and
CONTAINER_NO columns of the SHIP_CONT table:
102
DATABASE SYSTEMS
ALTER TABLE ship_cont
ADD PRIMARY KEY (ship_no, container_no) DISABLE;
This constraint identifies the combination of the SHIP_NO and CONTAINER_NO columns as the primary key of the SHIP_CONT table. The constraint ensures that no two rows in the table have the same values for both the SHIP_NO column and the CONTAINER_NO column.
The CONSTRAINT clause also specifies the following properties of the constraint.
•
Since the constraint definition does not include a constraint name, Oracle 7 generates a name for the
constraint.
•
The DISABLE option causes Oracle7 to define the constraint but not enforce it.
Student Activity 4.2
Before reading the next section, answer the following questions:
1. What are the various features of constraint?
2. Define primary key and composite key.
If your answers are correct, then proceed to the next section.
A referential integrity constraint designates a column or combination of columns as a foreign key and
establishes a relationship between that foreign key and a specified primary or unique key, called the
referenced key. In this relationship, the table containing the foreign key is called the child table and the
table containing the referenced key is called the parent table. Note the following:
•
The child and parent tables must be on the same database. They cannot be on different nodes of a
distributed database.
•
The foreign key and the referenced key can be in the same table. In this case, the parent and child tables
are the same.
•
To satisfy a referential integrity constraint, each row of the child table must meet one of the following
conditions:
The value of the row’s foreign key must appear as a referenced key value in one of the parent
table’s rows. The row in the child table is said to depend on the referenced key in the parent
table.
The value of one of the columns that makes up the foreign key must be null.
A referential integrity constraint is defined in the child table. A referential integrity constraint definition can
include any of the following key words:
•
Foreign Key: Identifies the column or combination of columns in the child table that makes up the
foreign key. Only use this keyword when you define a foreign key with a table constraint clause.
•
Reference: Identifies the parent table and the column or combination of columns that make up the
referenced key. If you only identify the parent table and omit the column names, the foreign key
automatically references the primary key of the parent table. The corresponding columns of the
referenced key and the foreign key must match in number and data types.
RELATIONAL DATABASE DESIGN
103
On Delete Cascade: Allows deletion of referenced key values in the parent table that have dependent rows in the child table and causes Oracle7 to automatically delete dependent rows from the child table to maintain referential integrity. If you omit this option, Oracle7 forbids deletion of referenced key values in the parent table that have dependent rows in the child table.
You cannot define a referential integrity constraint in a CREATE TABLE statement that contains an AS clause. Instead, you can create the table without the constraint and then add it later with an ALTER TABLE statement.
You can define multiple foreign keys in a table. Also, a single column can be part of more than one foreign
key.
You can use column_constraint syntax to define a referential integrity constraint in which the foreign key is
made up of a single column.
Example
The following statement creates the EMP table and defines and enables a foreign key on the DEPTNO
column that references the primary key on the DEPTNO column of the DEPT table:
CREATE TABLE emp
(empno NUMBER(4),
ename VARCHAR2(10),
job VARCHAR2(9),
mgr NUMBER(4),
hiredate DATE,
sal NUMBER(7,2),
comm NUMBER(7,2),
deptno CONSTRAINT fk_deptno REFERENCES dept (deptno));
The constraint FK_DEPTNO ensures that all employees in the EMP table work in a department in the
DEPT table. However, employees can have null department numbers.
Before you define and enable this constraint, you must define and enable a constraint that designates the
DEPTNO column of the DEPT table as a primary or unique key. Note that the referential integrity
constraint definition does not use the FOREIGN KEY keyword to identify the columns that make up the foreign key. Because the constraint is defined with a column constraint clause on the DEPTNO column, the foreign key is automatically on the DEPTNO column.
Note that the constraint definition identifies both the parent table and the columns of the referenced key.
Because the referenced key is the parent table’s primary key, the referenced key column names are
optional.
Note that the above statement omits the DEPTNO column's data type. Because this column is a foreign key, Oracle7 automatically assigns it the data type of the DEPT.DEPTNO column to which the foreign key refers.
Alternatively, you can define a referential integrity constraint with table_constraint syntax :
CREATE TABLE emp
(empno NUMBER(4),
ename VARCHAR2(10),
job VARCHAR2(9),
mgr NUMBER(4),
hiredate DATE,
sal NUMBER(7,2),
comm NUMBER(7,2),
deptno,
CONSTRAINT fk_deptno FOREIGN KEY (deptno) REFERENCES dept (deptno));
Note that the foreign key definitions in both of the above statements omit the ON DELETE CASCADE option, causing Oracle7 to forbid the deletion of a department if any employee works in that department.
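The referential-integrity behaviour described above can be sketched with sqlite3, which also supports ON DELETE CASCADE (though foreign keys must first be switched on with a PRAGMA there). The tables and rows are invented:

```python
import sqlite3

# SQLite enforces foreign keys only when this pragma is on.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE dept (deptno INT PRIMARY KEY, dname TEXT)")
con.execute("""CREATE TABLE emp
               (empno INT PRIMARY KEY,
                ename TEXT,
                deptno INT REFERENCES dept(deptno) ON DELETE CASCADE)""")
con.execute("INSERT INTO dept VALUES (20, 'RESEARCH')")
con.execute("INSERT INTO emp VALUES (7369, 'SMITH', 20)")

# A child row may not reference a department that does not exist.
try:
    con.execute("INSERT INTO emp VALUES (7499, 'ALLEN', 99)")
    orphan_ok = True
except sqlite3.IntegrityError:
    orphan_ok = False

# Deleting the parent cascades to its dependent child rows.
con.execute("DELETE FROM dept WHERE deptno = 20")
remaining = con.execute("SELECT count(*) FROM emp").fetchone()[0]
print(orphan_ok, remaining)  # the orphan insert fails; the child row is gone
```

Without ON DELETE CASCADE, the DELETE of the parent row would itself be rejected, which is the default behaviour the paragraph above describes.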
Student Activity 4.3
Before reading the next section, answer the following questions:
1. What do you understand by primary key constraint and Referential Integrity constraint?
2. Why do we use Null constraint in the table?
If your answers are correct, then proceed to the next section.
Update Operations on Relations
The operations of the relational model can be categorised into retrievals and updates; we will discuss the update operations here.
There are three basic update operations on relations:
(1) Insert, (2) Delete, and (3) Modify.
The Insert Operation
It is used to insert a new tuple (row) or tuples in a relation. It can violate any type of constraint. Domain constraints can be violated if an attribute is given a value that does not appear in the corresponding domain. Key constraints can be violated if the key value in the new tuple already exists in another tuple in the relation r(R). Referential integrity can be violated if the value of any foreign key refers to a tuple that does not exist in the referenced relation.
Suppose we have a table student (std_id number(4), std_name varchar2(10), std_course varchar2(5), std_fee number(7,2)). Then, to insert values in this table, we will use the Insert operation in such a way:
Insert <1, 'john', 'MSc', 5000.50> into student
In SQL (Structured Query Language):
Insert into student values (1, 'john', 'MSc', 5000.50);
The Delete Operation
The Delete operation is used to delete tuples. The Delete operation can violate only referential integrity, if
the tuple being deleted is referenced by the foreign keys from other tuple in the database. To specify
deletion, conditions on the attributes of the relation select the tuple (or tuples) to be deleted. For example, to delete the student tuple with std_id = 2:
Delete from student where std_id = 2;
The Modify Operation
Update (or Modify) is used to change the values of some attributes in existing tuples. Whenever the update operation is applied, the integrity constraints specified on the relational database schema should not be violated. It is necessary to specify a condition on the attributes of the relation to select the tuple (or tuples) to be modified.
Functional Dependencies
A functional dependency (abbreviated as FD) is a many-to-one relationship from one set of attributes to another within a given relation.
Definition: If X and Y are two subsets of a set of attributes of a relation, then attribute Y is said to be
functionally dependent on attribute X, if and only if each X value in the relation has associated with it
exactly one Y value in the relation.
Stated differently, whenever two tuples of the relation agree on their X value, they also agree on their Y
value.
Symbolically, the same can be expressed as:
X →Y
To understand the concept of functional dependency, consider the following relation (BM) with four attributes (S#, Item, P#, Qty):
There are 4 attributes in this relation and at the moment 8 tuples are inserted into it. Note that whenever in
any two tuples, S# value is same, Item value is also same. That is whenever value of S# is ‘S1’, the value of
Item is ‘Book’; if value of S# is ‘S2’, value of Item is ‘Magazine’ etc. Therefore, attribute Item is
functionally dependent on attribute S#, i.e. the set of attributes {S#} → {Item}. However, the converse is
not true, i.e. {Item} does not functionally determine {S#} in this example.
Other functional dependencies valid in the relation are:
{S#, P#} → {Qty}
{S#, P#} → {Item}
{S#, P#} → {Item, Qty}
{S#, P#} → {S#}
{S#, P#} → {P#}
{S#, P#} → {S#, P#}
{S#, P#} → {S#, Item, P#, Qty}
The L.H.S. is called determinant and R.H.S. is called dependent. When the set contains just one attribute
(i.e. S#) we can drop the brackets and can write S# → Item.
The above definition refers not only to the existing tuples but all the possible values of the attributes in all
the tuples.
What is the significance of finding functional dependency after all? Well, this is because FD’s (short for
functional dependencies) represent integrity constraints of the database and therefore, must be enforced.
Now, obviously, the set of all FD’s could be very large. This motivates the database designer to look for a
smaller set of FD’s which is easily manageable and yet implies all the FD’s. Thus finding out a minimal set
of FD’s is of great practical value.
FD’s can be of two types: Trivial and Non Trivial functional dependency. An FD is of trivial type if the
right hand side is a subset (not necessarily a proper subset) of the left hand side such as:
{ S#, P# } → { S#}
Here you see that the right hand side (i.e. dependent S#) is a subset of its determinant (i.e. {S#, P#}) and
hence this is an instance of trivial functional dependency. If this condition does not hold for a dependency,
we call it Non trivial dependency.
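The definition above ("whenever two tuples agree on X, they also agree on Y") translates directly into a check over a relation's tuples. This is a minimal sketch; the BM sample tuples are invented to match the prose (S1 supplies books, S2 magazines), not the book's actual data:

```python
# X -> Y holds if no two tuples agree on X but differ on Y.
def fd_holds(tuples, x, y):
    """Each tuple is a dict mapping attribute name to value."""
    seen = {}
    for t in tuples:
        xv = tuple(t[a] for a in x)
        yv = tuple(t[a] for a in y)
        if xv in seen and seen[xv] != yv:
            return False  # two tuples agree on X but not on Y
        seen[xv] = yv
    return True

# Invented BM-style tuples (S#, Item, P#, Qty).
bm = [{"S#": "S1", "Item": "Book",     "P#": "P1", "Qty": 100},
      {"S#": "S1", "Item": "Book",     "P#": "P2", "Qty": 200},
      {"S#": "S2", "Item": "Magazine", "P#": "P1", "Qty": 300},
      {"S#": "S3", "Item": "Book",     "P#": "P3", "Qty": 400}]

print(fd_holds(bm, ["S#"], ["Item"]))  # True: each S# has one Item
print(fd_holds(bm, ["Item"], ["S#"]))  # False: 'Book' maps to S1 and S3
print(fd_holds(bm, ["S#"], ["Qty"]))   # False: S1 maps to two Qty values
```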
Closure of a Set of Dependencies
A dependency may imply another dependency even if it is not explicitly obvious. For example, the dependency {S#, P#} → {Item, Qty} implies two dependencies, viz. {S#, P#} → Item and {S#, P#} → Qty.
The set of all functional dependencies implied by a given set S of FD’s is called the closure of S and is
denoted by S+.
For the purpose of deriving S+ of a given S, a set of inference rules exists, called Armstrong's inference rules or axioms. Using these rules, the closure of a set of FD's may be derived. We state these rules hereunder:
Let R be a relation and A, B, C arbitrary subsets of the set of attributes of R (i.e. A, B, C ⊆ R); then:
1. Reflexivity: If B is a subset of A, then A → B.
2. Augmentation: If A → B, then A ∪ C → B ∪ C.
3. Transitivity: If A → B and B → C, then A → C.
The rules can be used to derive precisely the closure S+.
The additional rules can be derived from these rules and can be used to simplify the task of computing S+
from S. The derived rules are:
4. Self-determination: A → A.
5. Decomposition: If A → B ∪ C, then A → B and A → C.
6. Union: If A → B and A → C, then A → B ∪ C.
7. Composition: If A → B and C → D, then A ∪ C → B ∪ D.
For simplification, we will represent A ∪ B by AB in what follows.
Example: Suppose we have a relation R with attributes A, B, C, D, E, F and the following FD's:
A → BC
B → E
CD → EF
Show that the FD AD → F holds in R.
Proof:
1. A → BC (given)
2. A → C (1, decomposition)
3. AD → CD (2, augmentation by D)
4. CD → EF (given)
5. AD → EF (3 and 4, transitivity)
6. AD → F (5, decomposition)
Thus we can say that in the given set of FD's, AD → F holds.
It is clear from the ongoing discussion that given a set of FD’s, we can easily determine (by applying
Armstrong axioms or otherwise) whether an FD holds in the relation variable R or not.
In principle, we can always compute the closure of a given set of attributes by repeatedly applying the
Armstrong axioms and derived rules until they stop producing any new FD. However, as stated earlier, it is
difficult this way.
It is more relevant to compute the closure of a set of attributes of relation R (say Z) and a set of functional
dependencies on R (say S). We will call this as closure Z+ of Z under S. One of many possible algorithms is
given below:
Closure(Z, S)
repeat
    for each FD X → Y in S do
        if X is a subset of Closure(Z, S) then
            Closure(Z, S) = Closure(Z, S) ∪ Y
        end if
    end do
    if Closure(Z, S) did not change in the current iteration, leave the loop
end repeat
Example: Let the FD’s of a relation R with attributes A, B, C, D, E, F are:
A → BC
E → CF
B→E
CD → EF
Compute the closure Z+, where Z={A, B} under the given set of FD’s (S).
Solution: Applying the above algorithm we find
1. Closure = Z. That is initialize Closure with {A, B}.
2. Start repeating
a. Number of FD’s is 4, therefore, loop 4 times.
i. First FD is A → BC. Since the LHS is subset of Closure (Z, S), so we add B
and C to the Closure set. Thus, Closure becomes {A, B, C}.
ii. The LHS of FD, E → CF, is not a subset of the Closure set. Therefore, no
change in Closure set.
iii. In FD, B → E, the LHS is a subset of Closure, therefore add RHS to Closure.
Thus Closure={A, B, C, E}.
iv. LHS of FD, CD →EF, CD is not a subset of the Closure, so no change.
b. We go round the ‘for’ loop once again 4 times. Closure set, evidently, does not change
for first, third and fourth iterations. However, in second iteration it changes to
include attribute F. It becomes now {A, B, C, E, F}.
c. We go round the ‘for’ loop once again. But this time there is no change in the Closure
set and hence the algorithm terminates giving the result as {A, B}+={A, B, C, E, F}.
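The algorithm above can be sketched in a few lines of Python and run against this worked example; FDs are represented here as (LHS set, RHS set) pairs, an assumed encoding rather than anything from the text:

```python
# Compute Z+ under a set of FDs, each a (lhs, rhs) pair of attribute sets.
def closure(z, fds):
    result = set(z)
    changed = True
    while changed:          # corresponds to the outer 'repeat'
        changed = False
        for lhs, rhs in fds:
            # If the determinant is already inside the closure,
            # absorb the dependent attributes.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# The worked example's FDs: A -> BC, E -> CF, B -> E, CD -> EF.
fds = [({"A"}, {"B", "C"}),
       ({"E"}, {"C", "F"}),
       ({"B"}, {"E"}),
       ({"C", "D"}, {"E", "F"})]

print(sorted(closure({"A", "B"}, fds)))  # matches {A, B}+ derived above
```

The F attribute only appears on the second pass (once E has entered the closure), exactly as in steps 2(a)-2(b) of the hand computation.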
From the above algorithm two corollaries can be derived.
Corollary 1: An FD X → Y follows from a set of FD's S if and only if Y is a subset of the closure X+ of X under S. Thus, we can determine if an FD follows from S even without computing S+.
Corollary 2: A subset of attributes K of a relation R is a super-key of R if and only if the closure K+ of K under the given set of FD's is exactly the set of all attributes of R.
Irreducible Sets of Dependencies
Sometimes we may have two sets of FD's S1 and S2 such that every FD implied by S1 is also implied by S2 (i.e. S1+ ⊆ S2+). When this happens, S2 is called a cover of S1. The implication of this fact is that if the DBMS enforces the FD's in S2, then it will automatically enforce the FD's in S1.
If it so happens that S1+ ⊆ S2+ and S2+ ⊆ S1+ (i.e. S1+ = S2+) then S1 and S2 are said to be equivalent. In this case, if the FD's of S1 are enforced, it implies that the FD's of S2 are also enforced and vice versa.
A set of FDs is said to be irreducible (also called minimal) if and only if it satisfies the following three
properties:
1.
The RHS (the dependent) of every FD in S consists of just one attribute (i.e. is a singleton set).
2.
No attribute of the LHS (the determinant) can be removed without changing the closure of the set or its equivalents. We term this left-irreducible.
3.
No FD of the set can be discarded without changing the closure S+ or its equivalents.
For example, consider the following FDs in relation P:
P# → PARTNAME
P# → PARTCOLOR
P# → PARTWEIGHT
P# → SUPPLYCITY
The RHS of each of these FDs is a singleton. Each LHS is obviously irreducible in turn, and none of the FDs can be discarded without changing the closure (i.e., without losing some information). The above set of FDs thus has all three properties and is therefore irreducible.
The following sets of FDs are not irreducible, for the stated reasons:
2. A → {A, B}
   A → C
   A → D
   (The RHS of the first FD is not a singleton set.)
3. {A, B} → C
   A → B
   A → D
   A → E
   (The first FD can be simplified by dropping B from its left-hand side without changing the closure; i.e., it is not left-irreducible.)
4. A → A
   A → B
   A → C
   A → D
   A → E
   (The first FD can be discarded without changing the closure.)
Now, the claim is that for every set of FDs there exists at least one equivalent set that is irreducible. This is easy to see. Let S be a set of FDs. By the decomposition axiom, we can assume without loss of generality that every FD in S has a singleton right-hand side. Next, for each FD f in S, we examine each attribute A in the LHS of f; if deleting A from the LHS of f has no effect on the closure S+, we delete A from the LHS of f. Then, for each FD f remaining in S, if deleting f from S has no effect on the closure S+, we delete f from S. The final set S is irreducible and is equivalent to the original set.
Example: Suppose we are given relation R with attributes A, B, C, D and FDs
A → BC
B→ C
A→ B
AB → C
AC → D
We now compute an irreducible set of FDs that is equivalent to this given set.
1. The first step is to rewrite the FDs such that each one has a singleton RHS. Thus:
A → B
A → C
B → C
A → B
AB → C
AC → D
The FD A → B occurs twice, so one occurrence can be eliminated.
2. Next, attribute C can be eliminated from the LHS of AC → D: we have A → C, so by augmentation A → AC; since AC → D, we get A → D by transitivity. Thus C on the LHS of AC → D is redundant and can be eliminated, leaving A → D.
3. Next, we observe that the FD AB → C can be eliminated: again we have A → C, so AB → CB by augmentation, and hence AB → C by decomposition.
4. Finally, the FD A → C is implied by the FDs A → B and B → C, so it can also be eliminated. We are left with:
A → B
B → C
A → D
This is the required irreducible set.
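The four-step reduction just performed can be mechanized. Below is a hedged Python sketch (the function names and FD representation are my own, not the book's) that reproduces the example: given A → BC, B → C, A → B, AB → C, AC → D it returns the irreducible set {A → B, B → C, A → D}:

```python
# Irreducible-cover sketch. An FD is a (LHS, RHS) pair; after step 1
# every RHS is a single attribute.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def irreducible(fds):
    # Step 1: singleton right-hand sides (decomposition axiom).
    fds = [(lhs, a) for lhs, rhs in fds for a in rhs]
    # Step 2: left-reduce -- drop LHS attributes that do not affect the closure.
    reduced = []
    for lhs, a in fds:
        lhs = set(lhs)
        for attr in sorted(lhs):
            smaller = lhs - {attr}
            if smaller and a in closure(smaller, fds):
                lhs = smaller
        reduced.append((frozenset(lhs), a))
    # Drop exact duplicates, then discard FDs implied by the remaining ones.
    result = list(dict.fromkeys(reduced))
    for fd in list(result):
        rest = [f for f in result if f != fd]
        if fd[1] in closure(fd[0], rest):
            result = rest
    return set(result)

given = [('A', 'BC'), ('B', 'C'), ('A', 'B'), ('AB', 'C'), ('AC', 'D')]
print(irreducible(given))   # the set {A -> B, B -> C, A -> D}
```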
Student Activity 4.4
Before reading the next section, answer the following questions:
1. What do you understand by functional dependency?
2. What is an irreducible set of dependencies?
If your answers are correct, then proceed to the next section.
Normalisation
While designing a database, a data model is usually translated into a relational schema. The important question is whether there is a design methodology or whether the process is arbitrary. The simple answer is that a methodology does exist: there are certain properties that a good database design must possess, as dictated by Codd’s rules.
There are many different ways of designing a good database. One such methodology is the method involving ‘Normalization’.
Normalization theory is built around the concept of normal forms. Normalization reduces redundancy.
Redundancy is unnecessary repetition of data. It can cause problems with storage and retrieval of data.
During the process of normalization, dependencies can be identified, which can cause problems during
deletion and updation. Normalization theory is based on the fundamental notion of Dependency.
Normalization helps in simplifying the structure of schema and tables.
For the purpose of illustration of the normal forms, we will take an example of a database of the following
logical design:
Relation S {S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, Primary Key {S#}
Relation P {P#, PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY}, Primary Key {P#}
Relation SP {S#, SUPPLYCITY, P#, PARTQTY},
Primary Key {S#, P#}
Foreign Key {S#} References S
Foreign Key {P#} References P
Now, what prompts the designer to structure the schema this way? Is this the only design? Is it the most appropriate? Could it be better if we modified it? If yes, then how? There are many such questions that a designer has to ask and answer. Let us see what problems we might face if we continue with this design. First of all, let us insert some tuples into the table SP.
SP
S#    SUPPLYCITY    P#    PARTQTY
S1    Delhi         P1    3000
S1    Delhi         P2    2000
S1    Delhi         P3    4000
S1    Delhi         P4    2000
S1    Delhi         P5    1000
S1    Delhi         P6    1000
S2    Mumbai        P1    3000
S2    Mumbai        P2    4000
S3    Mumbai        P2    2000
S4    Delhi         P2    2000
S4    Delhi         P4    3000
S4    Delhi         P5    4000
Let us examine the table above for design discrepancies. A quick glance reveals that some of the data is repeated. That is data redundancy, which is of course undesirable. The fact that a particular supplier is located in a particular city has been repeated many times. This redundancy causes many other related problems. For instance, after an update a supplier may be shown as being from Delhi in one entry and from Mumbai in another. This further gives rise to many other problems.
Therefore, for the above reasons, the tables need to be refined. This process of refinement of a given
schema into another schema or a set of schema possessing qualities of a good database is known as
Normalization.
Database experts have defined a series of Normal forms each conforming to some specified design quality
condition(s). We shall restrict ourselves to the first five normal forms for the simple reason of simplicity.
Each next level of normal form adds another condition. It is interesting to note that the process of
normalization is reversible. The following diagram depicts the relation between various normal forms.
(Diagram: the normal forms nest inside one another — 5NF within 4NF within 3NF within 2NF within 1NF.)
The diagram implies that a relation in 5th Normal form is also in 4th Normal form, which is itself in 3rd Normal form, and so on. These normal forms are not the only ones; there may be 6th, 7th and nth normal forms, but these are not of our concern at this stage.
Before we embark on normalization, however, there are a few more concepts that should be understood.
Decomposition
Decomposition is the process of splitting a relation into two or more relations. This is nothing but the projection process.
Decompositions may or may not lose information. As you will learn shortly, the normalization process involves breaking a given relation into one or more relations, and these decompositions should be reversible as well, so that no information is lost in the process. Thus, we will be interested in the decompositions that incur no loss of information rather than the ones in which information is lost.
Lossless decomposition: A decomposition that results in relations without losing any information is known as a lossless (or nonloss) decomposition. A decomposition that results in loss of information is known as a lossy decomposition.
Consider the relation S{S#, SUPPLYSTATUS, SUPPLYCITY} with some instances of the entries as
shown below.
S
S#    SUPPLYSTATUS    SUPPLYCITY
S3    100             Delhi
S5    100             Mumbai

Let us decompose this table into two, in two alternative ways:

Decomposition (1):
SX                         SY
S#    SUPPLYSTATUS         S#    SUPPLYCITY
S3    100                  S3    Delhi
S5    100                  S5    Mumbai

Decomposition (2):
SX                         SY
S#    SUPPLYSTATUS         SUPPLYSTATUS    SUPPLYCITY
S3    100                  100             Delhi
S5    100                  100             Mumbai
Let us examine these decompositions. In decomposition (1) no information is lost. We can still say that
S3’s status is 100 and location is Delhi and also that supplier S5 has 100 as its status and location Mumbai.
This decomposition is therefore lossless.
In decomposition (2), however, we can still say that status of both S3 and S5 is 100. But the location of
suppliers cannot be determined by these two tables. The information regarding the location of the suppliers
has been lost in this case. This is a lossy decomposition.
Certainly, lossless decomposition is more desirable because otherwise the decomposition will be
irreversible. The decomposition process is in fact projection, where some attributes are selected from a
table.
A natural question arises here: why is the first decomposition lossless while the second one is lossy? How should a given relation be decomposed so that the resulting projections are nonlossy? The answer to these questions lies in functional dependencies and is given by the following theorem.
Heath’s theorem: Let R{A, B, C} be a relation, where A, B and C are sets of attributes. If R satisfies the FD
A→B, then R is equal to the join of its projections on {A, B} and {A, C}.
Let us apply this theorem to the decompositions described above. We observe that relation S satisfies the irreducible set of FDs
S# → SUPPLYSTATUS
S# → SUPPLYCITY
Now, taking A as S#, B as SUPPLYSTATUS, and C as SUPPLYCITY, the theorem confirms that relation S can be nonloss-decomposed into its projections on {S#, SUPPLYSTATUS} and {S#, SUPPLYCITY}.
Note, however, that the theorem does not say why the projections {S#, SUPPLYSTATUS} and {SUPPLYSTATUS, SUPPLYCITY} should be lossy. Yet we can see that one of the FDs is lost in that decomposition: while the FD S# → SUPPLYSTATUS is still represented by the projection on {S#, SUPPLYSTATUS}, the FD S# → SUPPLYCITY has been lost.
An alternative criterion for lossless decomposition is as follows. Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:
R1 ∩ R2 → R1
R1 ∩ R2 → R2
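Both decompositions of the S table can be replayed directly by projecting and joining plain Python tuples (an illustrative sketch only — column indices stand in for the attribute names S#, SUPPLYSTATUS, SUPPLYCITY):

```python
# S{S#, SUPPLYSTATUS, SUPPLYCITY} as a set of tuples.
S = {('S3', 100, 'Delhi'), ('S5', 100, 'Mumbai')}

def project(rows, *cols):
    """Projection: keep only the listed column positions."""
    return {tuple(r[c] for c in cols) for r in rows}

def join_on(r1, r2, i, j):
    """Tuples of r1 extended by r2 wherever column i of r1 = column j of r2."""
    return {a + tuple(v for k, v in enumerate(b) if k != j)
            for a in r1 for b in r2 if a[i] == b[j]}

# Decomposition 1: projections on {S#, STATUS} and {S#, CITY}.
# S# -> SUPPLYSTATUS holds, so Heath's theorem promises a nonloss join.
d1 = join_on(project(S, 0, 1), project(S, 0, 2), 0, 0)
print(d1 == S)        # True: lossless

# Decomposition 2: projections on {S#, STATUS} and {STATUS, CITY}.
d2 = join_on(project(S, 0, 1), project(S, 1, 2), 1, 0)
print(d2 == S)        # False: spurious tuples appear, so the join is lossy
```

The lossy join produces four tuples: both suppliers get paired with both cities, because the join attribute SUPPLYSTATUS (value 100 in every row) does not determine either side.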
Functional Dependency Diagrams: This is a handy tool for representing the functional dependencies existing in a relation.
(FD diagrams for the supplier-part database: S# → SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY; {S#, P#} → PARTQTY; P# → PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY.)
The diagram is very useful for its eloquence and in visualizing the FD’s in a relation. Later in the Unit you
will learn how to use this diagram for normalization purposes.
First Normal Form
A relation is in 1st Normal form (1NF) if and only if, in every legal value of that relation, every tuple
contains exactly one value for each attribute.
Although simplest, 1NF relations suffer from a number of discrepancies, and hence 1NF is not the most desirable form of a relation.
Let us take a relation (modified to illustrate the point in discussion) as
Rel1{S#, SUPPLYSTATUS, SUPPLYCITY, P#, PARTQTY} Primary Key{S#, P#}
FD{SUPPLYCITY → SUPPLYSTATUS}
Note that SUPPLYSTATUS is functionally dependent on SUPPLYCITY; meaning that a supplier’s status
is determined by the location of that supplier – e.g. all suppliers from Delhi must have status of 100. The
primary key of the relation Rel1 is {S#, P#}. The FD diagram is shown below:
(FD diagram for Rel1: {S#, P#} → PARTQTY; S# → SUPPLYCITY; SUPPLYCITY → SUPPLYSTATUS.)
For a good design the diagram should have arrows out of candidate keys only. The additional arrows cause
trouble.
Let us discuss some of the problems with this 1NF relation. For the purpose of illustration, let us insert
some sample tuples into this relation.
REL1
S#    SUPPLYSTATUS    SUPPLYCITY    P#    PARTQTY
S1    200             Delhi         P1    3000
S1    200             Delhi         P2    2000
S1    200             Delhi         P3    4000
S1    200             Delhi         P4    2000
S1    200             Delhi         P5    1000
S1    200             Delhi         P6    1000
S2    100             Mumbai        P1    3000
S2    100             Mumbai        P2    4000
S3    100             Mumbai        P2    2000
S4    200             Delhi         P2    2000
S4    200             Delhi         P4    3000
S4    200             Delhi         P5    4000
The redundancies in the above relation cause many problems — usually known as update anomalies, that is, problems with the INSERT, DELETE and UPDATE operations. Let us examine the problems caused by the supplier-city redundancy corresponding to the FD S# → SUPPLYCITY.
INSERT: In this relation, unless a supplier supplies at least one part, we cannot insert the information
regarding a supplier. Thus, a supplier located in Kolkata is missing from the relation because he has not
supplied any part so far.
DELETE: Let us see what problem we may face during deletion of a tuple. If we delete the only tuple for a particular supplier, we not only delete the fact that the supplier supplied a particular part but also the fact that the supplier is located in a particular city. In our case, if we delete the entries corresponding to S# = S2, we lose the information that the supplier is located in Mumbai. This is definitely undesirable. The problem is that too much information is attached to each tuple, so a deletion forces us to lose too much information.
UPDATE: If we modify the city of a supplier S1 to Mumbai from Delhi, we have to make sure that all the
entries corresponding to S#=S1 are updated otherwise inconsistency will be introduced. As a result some
entries will suggest that the supplier is located at Delhi while others will contradict this fact.
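The UPDATE anomaly can be made concrete with a tiny Python sketch (illustrative only; the dictionary keys are shorthand for the REL1 attributes, and the rows follow the sample data):

```python
# Two of S1's rows from REL1: the supplier's city is stored once per part.
rel1 = [
    {'s': 'S1', 'city': 'Delhi', 'p': 'P1', 'qty': 3000},
    {'s': 'S1', 'city': 'Delhi', 'p': 'P2', 'qty': 2000},
]

# Move S1 to Mumbai, but (erroneously) update only the first matching row.
rel1[0]['city'] = 'Mumbai'

# The database now records two contradictory facts about S1's location.
cities = {row['city'] for row in rel1 if row['s'] == 'S1'}
print(sorted(cities))   # ['Delhi', 'Mumbai']
```

A correct update must touch every S1 row; storing the city once per supplier, as the 2NF decomposition below does, removes the opportunity for this inconsistency altogether.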
Second Normal Form
A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is fully functionally dependent on
the primary key. Here it has been assumed that there is only one candidate key, which is of course primary
key.
A relation in 1NF can always be decomposed into an equivalent set of 2NF relations. The reduction process consists of replacing the 1NF relation by suitable projections.
We have seen the problems arising from the under-normalized (1NF) relation. The remedy is to break the relation into two simpler relations:
REL2{S#, SUPPLYSTATUS, SUPPLYCITY} and
REL3{S#, P#, PARTQTY}
The FD diagrams and sample relations are shown below.
(FD diagrams: in REL2, S# → SUPPLYSTATUS and S# → SUPPLYCITY; in REL3, {S#, P#} → PARTQTY.)

REL2
S#    SUPPLYSTATUS    SUPPLYCITY
S1    200             Delhi
S2    100             Mumbai
S3    100             Mumbai
S4    200             Delhi
S5    300             Kolkata

REL3
S#    P#    PARTQTY
S1    P1    3000
S1    P2    2000
S1    P3    4000
S1    P4    2000
S1    P5    1000
S1    P6    1000
S2    P1    3000
S2    P2    4000
S3    P2    2000
S4    P2    2000
S4    P4    3000
S4    P5    4000
REL2 and REL3 are in 2NF, with primary keys {S#} and {S#, P#} respectively. This is because each nonkey attribute of REL2 (SUPPLYSTATUS and SUPPLYCITY) is fully functionally dependent on the primary key S#. By a similar argument, REL3 is also in 2NF.
Evidently, these two relations have overcome all the update anomalies stated earlier.
Now it is possible to insert the facts regarding supplier S5 even though he has not supplied any part, which was not possible earlier. This solves the insert problem. Similarly, the delete and update problems are also resolved.
These relations in 2NF are still not free from all anomalies. REL3 is free from most of the problems we are going to discuss here; REL2, however, still carries some problems. The reason is that the dependency of SUPPLYSTATUS on S#, though functional, is transitive via SUPPLYCITY: we have the two dependencies S# → SUPPLYCITY and SUPPLYCITY → SUPPLYSTATUS, which together imply S# → SUPPLYSTATUS. Such a relation is said to have a transitive dependency, and we will see that this transitive dependency gives rise to another set of anomalies.
INSERT: We are unable to insert the fact that a particular city has a particular status until we have some
supplier actually located in that city.
DELETE: If we delete the sole REL2 tuple for a particular city, we delete the information that that city has that particular status.
UPDATE: The status for a given city is still stored redundantly. This causes the usual redundancy problems on updates.
Third Normal Form
A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key.
To convert the 2NF relation into 3NF, once again, the REL2 is split into two simpler relations – REL4 and
REL5 as shown below.
REL4 {S#, SUPPLYCITY} and
REL5 {SUPPLYCITY, SUPPLYSTATUS}
The FD diagrams and sample relations are shown below.
(FD diagrams: in REL4, S# → SUPPLYCITY; in REL5, SUPPLYCITY → SUPPLYSTATUS.)

REL4                       REL5
S#    SUPPLYCITY           SUPPLYCITY    SUPPLYSTATUS
S1    Delhi                Delhi         200
S2    Mumbai               Mumbai        100
S3    Mumbai               Kolkata       300
S4    Delhi
S5    Kolkata
Evidently, the above relations REL4 and REL5 are in 3NF, because there are no transitive dependencies. Every 2NF relation can be reduced to 3NF by decomposing it further and removing any transitive dependency.
Alternative Decompositions
The reduction process may suggest a variety of ways in which a relation may be decomposed losslessly. Recall REL2, in which there was a transitive dependency, and which we therefore split into the two 3NF projections:
REL4 {S#, SUPPLYCITY} and
REL5 {SUPPLYCITY, SUPPLYSTATUS}
Let us call this decomposition-1. An alternative decomposition would be:
REL4 {S#, SUPPLYCITY} and
REL5 {S#, SUPPLYSTATUS}
which we will call decomposition-2.
Both the decompositions decomposition-1 and decomposition-2 are 3NF and lossless. However,
decomposition-2 is less satisfactory than decomposition-1. For example, it is still not possible to insert the
information that a particular city has a particular status unless some supplier is located in the city.
In decomposition-1 the two projections are independent of each other, but the same is not true of the second decomposition. Independence here means that updates can be made to either relation without regard to the other, provided the insertion is legal. Independent decompositions also preserve the dependencies of the database: no dependency is lost in the decomposition process.
The concept of independent projections provides for choosing a particular decomposition when there is
more than one choice.
Boyce-Codd Normal Form
The previous normal forms assumed that there was just one candidate key in the relation and that this key was also the primary key. Another class of problems arises when this is not the case. Very often, in practical database design situations, there will be more than one candidate key. To be precise, 1NF, 2NF and 3NF did not deal adequately with relations that
1. had two or more candidate keys, where
2. the candidate keys were composite, and
3. they overlapped (i.e. had at least one attribute in common).
A relation is in BCNF (Boyce-Codd Normal Form) if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant.
Or
A relation is in BCNF if and only if all the determinants are candidate keys.
In other words, the only arrows in the FD diagram are arrows out of candidate keys. It has already been
explained that there will always be arrows out of candidate keys; the BCNF definition says there are no
others, meaning there are no arrows that can be eliminated by the normalization procedure.
These two definitions are apparently different from each other. The difference between the two BCNF
definitions is that we tacitly assume in the former case determinants are "not too big" and that all FDs are
nontrivial.
It should be noted that the BCNF definition is conceptually simpler than the old 3NF definition, in that it
makes no explicit reference to first and second normal forms as such, nor to the concept of transitive
dependence. Furthermore, although BCNF is strictly stronger than 3NF, it is still the case that any given
relation can be nonloss decomposed into an equivalent collection of BCNF relations.
Thus, relations REL1 and REL2, which were not in 3NF, are not in BCNF either; relations REL3, REL4, and REL5, which were in 3NF, are also in BCNF. Relation REL1 contains three determinants,
namely {S#}, {SUPPLYCITY}, and {S#, P#}; of these, only {S#, P#} is a candidate key, so REL1 is not in
BCNF. Similarly, REL2 is not in BCNF either, because the determinant {SUPPLYCITY} is not a candidate
key. Relations REL3, REL4, and REL5, on the other hand, are each in BCNF, because in each case the sole
candidate key is the only determinant in the respective relations.
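The second BCNF definition ("all the determinants are candidate keys") suggests a direct mechanical test: every nontrivial FD must have a superkey as its determinant. A hedged Python sketch, reusing the closure idea from earlier in the unit (attribute names abbreviated; this is illustrative, not from the book):

```python
# BCNF check: a relation is in BCNF iff, for every FD, the FD is trivial
# or its LHS (the determinant) is a superkey of the relation.
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(heading, fds):
    return all(set(rhs) <= set(lhs)                  # trivial FD, or
               or closure(lhs, fds) == set(heading)  # determinant is a superkey
               for lhs, rhs in fds)

# REL2{S#, STATUS, CITY}: the determinant CITY is not a candidate key.
rel2_fds = [({'S#'}, {'CITY'}), ({'CITY'}, {'STATUS'})]
print(is_bcnf({'S#', 'STATUS', 'CITY'}, rel2_fds))    # False

# REL4{S#, CITY}: the sole determinant S# is a key.
print(is_bcnf({'S#', 'CITY'}, [({'S#'}, {'CITY'})]))  # True
```

The test flags REL2 because the closure of {CITY} is only {CITY, STATUS}, not the whole heading — exactly the transitive dependency that 3NF/BCNF decomposition removes.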
We now consider an example involving two disjoint - i.e., nonoverlapping - candidate keys. Suppose that in
the usual suppliers relation REL1{S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, {S#} and
{SUPPLIERNAME} are both candidate keys (i.e., for all time, it is the case that every supplier has a
unique supplier number and also a unique supplier name). Assume, however, that attributes SUPPLYSTATUS and SUPPLYCITY are mutually independent; i.e., the FD SUPPLYCITY → SUPPLYSTATUS no longer holds. Then the FD diagram is as shown below.
(FD diagram: the candidate keys S# and SUPPLIERNAME each determine SUPPLYSTATUS and SUPPLYCITY, as well as each other.)
Relation REL1 is in BCNF. Although the FD diagram does look "more complex" than a 3NF diagram, it is
nevertheless still the case that the only determinants are candidate keys; i.e., the only arrows are arrows out
of candidate keys. So the message of this example is just that having more than one candidate key is not
necessarily bad.
For illustration we will assume that in our relations supplier names are unique. Consider REL6 {S#, SUPPLIERNAME, P#, PARTQTY}. Since it contains two determinants, S# and SUPPLIERNAME, that are not candidate keys of the relation, this relation is not in BCNF. A sample snapshot of this relation is shown below:
REL6
S#    SUPPLIERNAME    P#    PARTQTY
S1    Vinod           P1    3000
S1    Vinod           P2    2000
S1    Vinod           P3    4000
S1    Vinod           P4    2000
As is evident from the figure above, relation REL6 involves the same kind of redundancies as did relations
REL1 and REL2, and hence is subject to the same kind of update anomalies. For example, changing the
name of suppliers from Vinod to Rahul leads, once again, either to search problems or to possibly
inconsistent results. Yet REL6 is in 3NF by the old definition, because that definition did not require an
attribute to be irreducibly dependent on each candidate key if it was itself a component of some candidate
key of the relation, and so the fact that SUPPLIERNAME is not irreducibly dependent on {S#, P#} was
ignored.
The solution to the REL6 problems is, of course, to break the relation down into two projections, in this
case the projections are:
REL7{S#, SUPPLIERNAME} and
REL8{S#, P#, PARTQTY}
Or
REL7{S#, SUPPLIERNAME} and
REL8{SUPPLIERNAME, P#, PARTQTY}
Both of these projections are in BCNF. The original design, consisting of the single relation REL6, is clearly bad; the problems with it are intuitively obvious, and it is unlikely that any competent database designer would ever seriously propose it, even if he or she had no exposure to the ideas of BCNF at all.
BCNF versus 3NF
We have seen two normal forms for relational-database schemas: 3NF and BCNF. There is an advantage to 3NF in that we know it is always possible to obtain a 3NF design without sacrificing a lossless join or dependency preservation. Nevertheless, 3NF has a disadvantage: if we do not eliminate all transitive dependencies, we may have to use null values to represent some of the possible meaningful relationships among data items, and we face the problem of repetition of information.
If we are forced to choose between BCNF and dependency preservation with 3NF, it is generally preferable
to opt for 3NF. If we cannot test for dependency preservation efficiently, we either pay a high penalty in
system performance or risk the integrity of the data in our database. Neither of these alternatives is
attractive. With such alternatives, the limited amount of redundancy imposed by transitive dependencies
allowed under 3NF is the lesser evil. Thus, we normally choose to retain dependency preservation and to
sacrifice BCNF.
In summary, we repeat that our three design goals for a relational-database design are
1. BCNF
2. Lossless join
3. Dependency preservation
If we cannot achieve all three, we accept
1. 3NF
2. Lossless join
3. Dependency preservation
Student Activity 4.5
Before reading the next section, answer the following questions:
1. What do you understand by Normalisation?
2. Why is Normalisation of a database required?
3. Write short notes on the following:
   a. 1st Normal form
   b. 2nd Normal form
   c. 3rd Normal form
4. Discuss the difference between BCNF and 3NF.
If your answers are correct, then proceed to the next section.
Fourth Normal Form
So far we have been normalizing relations based on their functional dependencies. However, these are not the only type of dependencies found in relations, and other types give rise to their own characteristic anomalies. There is another class of higher normal forms (4th and 5th) that revolves around two further types of dependency: multi-valued dependency (MVD) and join dependency (JD).
Multi-valued Dependency
Multi-valued dependency may be formally defined as follows.
Let R be a relation, and let A, B, and C be subsets of the attributes of R. Then we say that B is multi-dependent on A — in symbols,
A →→ B
(read "A multi-determines B," or simply "A double arrow B") — if and only if, in every possible legal value of R, the set of B values matching a given (A value, C value) pair depends only on the A value and is independent of the C value.
To elucidate the meaning of the above statement, let us take one example relation, REL8, as shown below:
REL8
COURSE         TEACHERS                            BOOKS
Computer       { Dr. Wadhwa, Prof. Mittal }        { Graphics, UNIX }
Mathematics    { Prof. Saxena, Prof. Karmeshu }    { Relational Algebra, Discrete Maths }
Assume that for a given course there can exist any number of corresponding teachers and any number of corresponding books. Moreover, let us also assume that teachers and books are quite independent of one another; that is, no matter who actually teaches any particular course, the same books are used. Finally, also assume that a given teacher or a given book can be associated with any number of courses.
Let us try to eliminate the relation-valued attributes. One way to do this is simply to replace relation REL8
by a relation REL9 with three scalar attributes COURSE, TEACHER, and BOOK as indicated below.
REL9
COURSE         TEACHER           BOOK
Computer       Dr. Wadhwa        Graphics
Computer       Dr. Wadhwa        UNIX
Computer       Prof. Mittal      Graphics
Computer       Prof. Mittal      UNIX
Mathematics    Prof. Saxena      Relational Algebra
Mathematics    Prof. Saxena      Discrete Maths
Mathematics    Prof. Karmeshu    Relational Algebra
Mathematics    Prof. Karmeshu    Discrete Maths
As you can see from the relation, each tuple of REL8 gives rise to m * n tuples in REL9, where m and n are
the cardinalities of the TEACHERS and BOOKS relations in that REL8 tuple. Note that the resulting
relation REL9 is "all key".
The meaning of relation REL9 is basically as follows: A tuple {COURSE:c, TEACHER:t, BOOK:x}
appears in REL9 if and only if course c can be taught by teacher t and uses book x as a reference. Observe
that, for a given course, all possible combinations of teacher and book appear: that is, REL9 satisfies the
(relation) constraint
if
tuples (c, t1, x1), (c, t2, x2) both appear
then tuples (c, t1, x2), (c, t2, x1) both appear also
Now, it should be apparent that relation REL9 involves a good deal of redundancy, leading as usual to
certain update anomalies. For example, to add the information that the Computer course can be taught by a
new teacher, it is necessary to insert two new tuples, one for each of the two books. Can we avoid such
problems? Well, it is easy to see that:
1. The problems in question are caused by the fact that teachers and books are completely independent of
one another;
2. Matters would be much improved if REL9 were decomposed into its two projections call them REL10
and REL11 - on {COURSE, TEACHER} and {COURSE, BOOK}, respectively.
To add the information that the Computer course can be taught by a new teacher, all we have to do now is
insert a single tuple into relation REL10. Thus, it does seem reasonable to suggest that there should be a
way of "further normalizing" a relation like REL9.
It is obvious that the design of REL9 is bad and the decomposition into REL10 and REL11 is better. The
trouble is, however, these facts are not formally obvious. Note in particular that REL9 satisfies no
functional dependencies at all (apart from trivial ones such as COURSE → COURSE); in fact, REL9 is in BCNF, since as already noted it is all key — any "all key" relation must necessarily be in BCNF. (Note that
the two projections REL10 and REL11 are also all key and hence in BCNF.) The ideas of the previous
normalization are therefore of no help with the problem at hand.
The existence of "problem" BCNF relations like REL9 was recognized very early on, and the way to deal with them was also soon understood, at least intuitively. However, it was not until 1977 that these intuitive ideas were put on a sound theoretical footing by Fagin's introduction of the notion of multi-valued dependencies (MVDs). Multi-valued dependencies are a generalization of functional dependencies, in the sense that every FD is an MVD, but the converse is not true (i.e., there exist MVDs that are not FDs). In the case of relation REL9 there are two MVDs that hold:
COURSE →→ TEACHER
COURSE →→ BOOK
Note the double arrows; the MVD A→→B is read as "B is multi-dependent on A" or, equivalently, "A
multi-determines B." Let us concentrate on the first MVD, COURSE→→TEACHER. Intuitively, what this
MVD means is that, although a course does not have a single corresponding teacher - i.e., the functional
dependence COURSE→TEACHER does not hold-nevertheless, each course does have a well-defined set
of corresponding teachers. By "well-defined" here we mean, more precisely, that for a given course c and a
given book x, the set of teachers t matching the pair (c, x) in REL9 depends on the value c alone - it makes
no difference which particular value of x we choose. The second MVD, COURSE→→BOOK, is
interpreted analogously.
It is easy to show that, given the relation R{A, B, C}, the MVD A →→ B holds if and only if the MVD A →→ C also holds. MVDs always go together in pairs in this way. For this reason it is common to represent them both in one statement, thus:
COURSE →→ TEACHER | BOOK
Now, we stated above that multi-valued dependencies are a generalization of functional dependencies, in the sense that every FD is an MVD. More precisely, an FD is an MVD in which the set of dependent (right-hand side) values matching a given determinant (left-hand side) value is always a singleton set. Thus, if A → B, then certainly A →→ B.
Returning to our original REL9 problem, we can now see that the trouble with relation such as REL9 is that
they involve MVDs that are not also FDs. (In case it is not obvious, we point out that it is precisely the
existence of those MVDs that leads to the necessity of – for example - inserting two tuples to add another
Computer teacher. Those two tuples are needed in order to maintain the integrity constraint that is
represented by the MVD.) The two projections REL10 and REL11 do not involve any such MVDs, which
is why they represent an improvement over the original design. We would therefore like to replace REL9 by those two projections, and an important theorem proved by Fagin allows us to make exactly that replacement:
Theorem (Fagin): Let R{A, B, C} be a relation, where A, B, and C are sets of attributes. Then R is equal to
the join of its projections on {A, B} and {A, C} if and only if R satisfies the MVDs A→→B | C.
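Fagin's theorem can be checked on REL9 directly with a plain-tuple sketch (illustrative only; per the stated constraint, every teacher/book combination for a course appears in the data):

```python
# REL9 data: all teacher x book combinations for each course.
rel9 = {
    ('Computer', 'Dr. Wadhwa', 'Graphics'),
    ('Computer', 'Dr. Wadhwa', 'UNIX'),
    ('Computer', 'Prof. Mittal', 'Graphics'),
    ('Computer', 'Prof. Mittal', 'UNIX'),
    ('Mathematics', 'Prof. Saxena', 'Relational Algebra'),
    ('Mathematics', 'Prof. Saxena', 'Discrete Maths'),
    ('Mathematics', 'Prof. Karmeshu', 'Relational Algebra'),
    ('Mathematics', 'Prof. Karmeshu', 'Discrete Maths'),
}

rel10 = {(c, t) for c, t, b in rel9}    # projection on {COURSE, TEACHER}
rel11 = {(c, b) for c, t, b in rel9}    # projection on {COURSE, BOOK}

# Join the two projections back over COURSE.
joined = {(c, t, b) for c, t in rel10 for c2, b in rel11 if c == c2}

print(joined == rel9)   # True: COURSE ->-> TEACHER | BOOK holds
```

If one combination were missing — say, a course where a particular teacher never uses a particular book — the rebuilt join would contain a tuple not in REL9, the MVD would not hold, and the two-projection decomposition would be lossy.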
At this stage we are equipped to define fourth normal form:
Fourth normal form: Relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the nontrivial MVD A →→ B is satisfied (an MVD A →→ B is trivial if either A is a superset of B or the union of A and B is the entire heading), then all attributes of R are also functionally dependent on A.
In other words, the only nontrivial dependencies (FDs or MVDs) in R are of the form Y → X (i.e., a functional dependency from a superkey Y to some other attribute X). Equivalently: R is in 4NF if it is in BCNF and all MVDs in R are in fact "FDs out of keys." It follows that 4NF implies BCNF.
Relation REL9 is not in 4NF, since it involves an MVD that is not an FD at all, let alone an FD "out of a
key." The two projections REL10 and REL11 are both in 4NF, however. Thus 4NF is an improvement over
BCNF, in that it eliminates another form of undesirable dependency. What is more, 4NF is always
achievable; that is, any relation can be nonloss decomposed into an equivalent collection of 4NF relations.
You may recall that a relation R{A, B, C} satisfying the FDs A→B and B→C is better decomposed into its
projections on {A, B} and {B, C} rather than into those on {A, B} and {A, C}. The same holds true if we
replace the FDs by the MVDs A→→B and B→→C.
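Fagin's theorem can be illustrated with a short sketch. The relation below is illustrative (the course, teacher, and text values are invented, in the spirit of the REL9 example): because the MVD COURSE→→TEACHER | TEXT holds, the relation equals the join of its two projections, and the two-tuple insertion anomaly disappears after decomposition.

```python
# REL9-style relation: each course has a set of teachers and a set of texts,
# independent of one another (the MVD COURSE ->-> TEACHER | TEXT holds).
rel9 = {
    ("Computer", "Anand", "Basics"),
    ("Computer", "Anand", "Advanced"),
    ("Computer", "Bina",  "Basics"),
    ("Computer", "Bina",  "Advanced"),
}

# The two projections (REL10 and REL11 in the text's terms).
rel10 = {(course, teacher) for (course, teacher, _) in rel9}  # {COURSE, TEACHER}
rel11 = {(course, text) for (course, _, text) in rel9}        # {COURSE, TEXT}

# Natural join of the projections over COURSE.
joined = {(c1, teacher, text)
          for (c1, teacher) in rel10
          for (c2, text) in rel11 if c1 == c2}

# Nonloss: the join reconstructs REL9 exactly (Fagin's theorem).
assert joined == rel9

# Adding another Computer teacher now needs only ONE tuple in rel10,
# instead of two tuples in rel9 - the update anomaly disappears.
rel10.add(("Computer", "Chitra"))
```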
Fifth Normal Form
It might seem from our discussion so far that the sole operation necessary or available in the further
normalization process is the replacement of a relation in a nonloss way by exactly two of its projections.
This assumption has successfully carried us as far as 4NF. It comes perhaps as a surprise, therefore, to
discover that there exist relations that cannot be nonloss-decomposed into two projections but can be
nonloss-decomposed into three (or more). Using an unpleasant but convenient term, we will describe such a
relation as "n-decomposable" (for some n > 2) - meaning that the relation in question can be nonloss-decomposed into n projections but not into m for any m < n.
A relation that can be nonloss-decomposed into two projections we will call "2-decomposable," and the
term "n-decomposable" may be defined similarly. The phenomenon of n-decomposability for n > 2 was
first noted by Aho, Beeri, and Ullman. The particular case n = 3 was also studied by Nicolas.
Consider relation REL12 from the suppliers-parts-projects database, ignoring attribute QTY for simplicity
for the moment. A sample snapshot is shown below. Note that relation REL12 is all key and involves no
nontrivial FDs or MVDs at all, and is therefore in 4NF. The snapshot also shows:
a. The three binary projections REL13, REL14, and REL15 corresponding to the REL12 relation value
displayed at the top of the diagram;
b. The effect of joining the REL13 and REL14 projections (over P#);
c. The effect of joining that result and the REL15 projection (over J# and S#).
REL12                  REL13        REL14        REL15
S#    P#    J#         S#    P#     P#    J#     J#    S#
S1    P1    J2         S1    P1     P1    J2     J2    S1
S1    P2    J1         S1    P2     P1    J1     J1    S1
S2    P1    J1         S2    P1     P2    J1     J1    S2
S1    P1    J1
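The snapshot can be traced in code. The sketch below, using plain Python sets for the sample REL12 value shown above, shows that joining any two of the binary projections introduces a spurious tuple, which joining in the third projection removes again - so REL12 is 3-decomposable but not 2-decomposable.

```python
# Sample value of REL12 from the snapshot above.
rel12 = {("S1", "P1", "J2"), ("S1", "P2", "J1"),
         ("S2", "P1", "J1"), ("S1", "P1", "J1")}

rel13 = {(s, p) for (s, p, j) in rel12}   # projection on {S#, P#}
rel14 = {(p, j) for (s, p, j) in rel12}   # projection on {P#, J#}
rel15 = {(j, s) for (s, p, j) in rel12}   # projection on {J#, S#}

# Step (b): join REL13 and REL14 over P#.
step_b = {(s, p, j)
          for (s, p) in rel13
          for (p2, j) in rel14 if p == p2}
assert step_b == rel12 | {("S2", "P1", "J2")}   # one spurious tuple appears

# Step (c): join that result with REL15 over {J#, S#}.
step_c = {(s, p, j) for (s, p, j) in step_b if (j, s) in rel15}
assert step_c == rel12   # the spurious tuple is eliminated
```

Either of the other two pairings of projections behaves the same way: two projections alone lose information, all three together do not.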
124
DATABASE SYSTEMS
Join Dependency:
Let R be a relation, and let A, B, ..., Z be subsets of the attributes of R. Then we say that R satisfies the Join
Dependency (JD)
*{A, B, ..., Z}
(read "star A, B, ..., Z") if and only if every possible legal value of R is equal to the join of its projections on
A, B, ..., Z.
For example, if we agree to use SP to mean the subset {S#, P#} of the set of attributes of REL12, and
similarly for PJ and JS, then relation REL12 satisfies the JD *{SP, PJ, JS}.
We have seen, then, that relation REL12, with its JD * {REL13, REL14, REL15}, can be 3-decomposed.
The question is, should it be? And the answer is "Probably yes." Relation REL12 (with its JD) suffers from
a number of problems over update operations, problems that are removed when it is 3-decomposed.
Fagin's theorem, to the effect that R{A, B, C} can be nonloss-decomposed into its projections on {A, B}
and {A, C} if and only if the MVDs A→→B and A→→C hold in R, can now be restated as follows:
R{A, B, C} satisfies the JD *{AB, AC} if and only if it satisfies the MVDs A→→B | C.
Since this theorem can be taken as a definition of multi-valued dependency, it follows that an MVD is just a
special case of a JD, or (equivalently) that JDs are a generalization of MVDs.
Thus, to put it formally, we have
A→→B | C ≡ * {AB, AC}
Note that join dependencies are the most general form of dependency possible (using, of course, the term
"dependency" in a very special sense). That is, there does not exist a still higher form of dependency such
that JDs are merely a special case of that higher form - so long as we restrict our attention to dependencies
that deal with a relation being decomposed via projection and recomposed via join.
Coming back to the running example, we can see that the problem with relation REL12 is that it involves a
JD that is not an MVD, and hence not an FD either. We have also seen that it is possible, and probably
desirable, to decompose such a relation into smaller components - namely, into the projections specified by
the join dependency. That decomposition process can be repeated until all resulting relations are in fifth
normal form, which we now define:
Fifth normal form: A relation R is in 5NF - also called projection-join normal form (PJ/NF) - if and only if
every nontrivial join dependency that holds for R is implied by the candidate keys of R.
Let us understand what it means for a JD to be "implied by candidate keys."
Relation REL12 is not in 5NF: it satisfies a certain join dependency, namely the JD *{SP, PJ, JS}, that is
certainly not implied by its sole candidate key (that key being the combination of all of its attributes). Stated
differently, relation REL12 is not in 5NF, because (a) it can be 3-decomposed and (b) that
3-decomposability is not implied by the fact that the combination {S#, P#, J#} is a candidate key. By contrast,
after 3-decomposition, the three projections SP, PJ, and JS are each in 5NF, since they do not involve any
(nontrivial) JDs at all.
Now let us understand through an example, what it means for a JD to be implied by candidate keys.
Suppose that the familiar suppliers relation REL1 has two candidate keys, {S#} and {SUPPLIERNAME}.
Then that relation satisfies several join dependencies - for example, it satisfies the JD
*{ { S#, SUPPLIERNAME, SUPPLYSTATUS }, { S#, SUPPLYCITY } }
That is, relation REL1 is equal to the join of its projections on {S#, SUPPLIERNAME, SUPPLYSTATUS}
and {S#, SUPPLYCITY}, and hence can be nonloss-decomposed into those projections. (This fact does not
mean that it should be so decomposed, of course, only that it could be.) This JD is implied by the fact that
{S#} is a candidate key (in fact it is implied by Heath's theorem). Likewise, relation REL1 also satisfies the
JD
JD
* {{S#, SUPPLIERNAME}, {S#, SUPPLYSTATUS}, {SUPPLIERNAME, SUPPLYCITY}}
This JD is implied by the fact that {S#} and {SUPPLIERNAME} are both candidate keys.
To conclude, we note that it follows from the definition that 5NF is the ultimate normal form with respect
to projection and join (which accounts for its alternative name, projection-join normal form). That is, a
relation in 5NF is guaranteed to be free of anomalies that can be eliminated by taking projections. For a
relation in 5NF, the only join dependencies are those that are implied by candidate keys, and so the only
valid decompositions are ones that are based on those candidate keys. (Each projection in such a
decomposition will consist of one or more of those candidate keys, plus zero or more additional attributes.)
For example, the suppliers relation REL1 is in 5NF. It can be further decomposed in several nonloss ways,
as we saw earlier, but every projection in any such decomposition will still include one of the original
candidate keys, and hence there does not seem to be any particular advantage in that further reduction.
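A JD implied by a candidate key can be demonstrated with a small sketch. The supplier rows below are invented for illustration; because {S#} is a key, any value of REL1 equals the join of its projections on {S#, SUPPLIERNAME, SUPPLYSTATUS} and {S#, SUPPLYCITY}.

```python
# Illustrative value of the suppliers relation REL1
# (S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY); {S#} is a candidate key.
rel1 = {("S1", "Smith", 20, "London"),
        ("S2", "Jones", 10, "Paris"),
        ("S3", "Blake", 30, "Paris")}

# The two projections named in the JD.
proj_a = {(s, name, status) for (s, name, status, city) in rel1}
proj_b = {(s, city) for (s, name, status, city) in rel1}

# Rejoin over S#.
rejoined = {(s, name, status, city)
            for (s, name, status) in proj_a
            for (s2, city) in proj_b if s == s2}

# Nonloss for ANY value of REL1 with key {S#} - the JD is implied by the key.
assert rejoined == rel1
```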
+, &"
$
-$
"&
" ,
,%
$
-,
In normalization of a relation, the basic idea is as follows:
Given some 1NF relation R and some set of FDs, MVDs, and JDs that apply to R, we systematically reduce
R to a collection of "smaller" (i.e., lower-degree) relations that are equivalent to R in a certain well-defined
sense but are also in some way more desirable. (The original relation might have been obtained by first
eliminating certain relation-valued attributes.)
The process is essentially an iterative refinement. Each step of the reduction process consists of taking
projections of the relations resulting from the preceding step. The given constraints are used at each step to
guide the choice of which projections to take next. The overall process can be stated informally as a set of
rules, thus:
1. Take projections of the original 1NF relation to eliminate any FDs that are not irreducible. This step will
produce a collection of 2NF relations.
2. Take projections of those 2NF relations to eliminate any transitive FDs. This step will produce a
collection of 3NF relations.
3. Take projections of those 3NF relations to eliminate any remaining FDs in which the determinant is not a
candidate key. This step will produce a collection of BCNF relations.
Rules 1-3 can be condensed into the single guideline: "Take projections of the original relation to eliminate
all FDs in which the determinant is not a candidate key."
4. Take projections of those BCNF relations to eliminate any MVDs that are not also FDs. This step will
produce a collection of 4NF relations. In practice this step is usually characterized as "separating
independent RVAs," as explained in our discussion of the REL13 example.
5. Take projections of those 4NF relations to eliminate any JDs that are not implied by the candidate keys -
though perhaps we should add "if you can find them." This step will produce a collection of relations in
5NF.
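One reduction step from the rules above can be sketched as follows. The helper and the sample data are illustrative (the attribute names and FDs are invented): given an FD whose determinant is not a candidate key, the relation is replaced by its projections on the FD itself and on the heading minus the dependent attributes.

```python
def decompose_on_fd(rows, heading, lhs, rhs):
    """Split rows (tuples matching heading) on the FD lhs -> rhs:
    return the projection on lhs+rhs and the projection that drops rhs."""
    def project(attrs):
        idx = [heading.index(a) for a in attrs]
        return {tuple(r[i] for i in idx) for r in rows}
    keep = [a for a in heading if a not in rhs]
    return (lhs + rhs, project(lhs + rhs)), (keep, project(keep))

# A 1NF relation with FDs S# -> CITY and CITY -> TAX (CITY -> TAX is
# transitive, so rule 2 applies).
heading = ["S#", "CITY", "TAX"]
rows = {("S1", "Delhi", 5), ("S2", "Delhi", 5), ("S3", "Madras", 8)}

# Project out the transitive FD CITY -> TAX.
(city_tax_h, city_tax), (s_city_h, s_city) = \
    decompose_on_fd(rows, heading, ["CITY"], ["TAX"])
# city_tax holds {CITY, TAX}; s_city holds {S#, CITY} - both now in 3NF.
```

Rejoining the two projections over CITY reconstructs the original rows, since the determinant CITY is a key of the {CITY, TAX} projection (Heath's theorem again).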
Student Activity 4.7
Answer the following questions.
1. What do you understand by decomposition?
2. Discuss the various properties of decomposition.
3. Write short notes on the following:
   a. Dependency-preserving decomposition.
   b. Lossless-join decomposition.
Summary
• Normalization is a technique used to design tables in which data redundancies are minimized.
• The first three normal forms (1NF, 2NF and 3NF) are most commonly encountered.
• From a structural point of view, higher normal forms yield relatively fewer data redundancies in the
database. In other words, 3NF is better than 2NF, which is better than 1NF.
• Almost all business designs use 3NF as the ideal normal form. (A special, more restricted form of 3NF is
known as Boyce-Codd normal form, or BCNF.)
• A table is in 1NF when all the key attributes are defined and when all remaining attributes are
dependent on the primary key. However, a table in 1NF can still contain both partial and transitive
dependencies.
• A partial dependency is one in which an attribute is functionally dependent on only a part of a
multi-attribute primary key.
• A transitive dependency is one in which one attribute is functionally dependent on another non-key
attribute.
• A table with a single-attribute primary key cannot exhibit partial dependencies.
• A table is in 2NF if it is in 1NF and contains no partial dependencies.
• A 1NF table is also in 2NF if its primary key is based on only a single attribute.
• A table is in 3NF if it is in 2NF and contains no transitive dependencies.
• BCNF is a special case of 3NF in which all determinants are also candidate keys.
• A 3NF table having a single candidate key is in BCNF.
Self-Assessment Questions
I. True or False
   1. The data in the database is perceived by the user as a record.
   2. If the constraints are defined within the column definition, it is called a table-level constraint.
   3. A primary key column cannot be of data type LONG or LONG RAW.
II. Fill in the Blanks
   1. The relational model is an abstract theory of data that is based on certain aspects of _____________.
   2. Oracle 7 creates an index on the columns of a _________.
   3. You can use ______________ syntax to define a referential integrity constraint in which the
      foreign key is made up of a single column.
   4. For a table to be in the third normal form, the first condition is that it should also be in the
      ________ normal form.
   5. When decomposing a relation into a number of smaller relations, it is crucial that the
      decomposition be _________.
Answers
I. True or False
   1. False
   2. False
   3. True
II. Fill in the Blanks
   1. mathematics
   2. primary key
   3. column_constraint
   4. second
   5. lossless
True or False
1. Data in a database is arranged in tables, and a collection of tables is called a relational database.
2. A composite primary key is a foreign key made up of a combination of columns.
3. The table containing the foreign key is called the child table and the table containing the
   referenced key is called the parent table.
4. The repetition of information required by the use of our alternative design is desirable.
5. To have a lossless-join decomposition, we need not impose constraints on the set of possible
   relations.
Fill in the Blanks
a. A _________ constraint designates a column or combination of columns as the table's primary
   key.
b. A primary key column cannot be of data type ___________.
c. ____________ helps in reducing redundancy.
d. To determine whether these schemas are in BCNF, we need to determine what _____________
   apply to them.
Review Questions
1. Discuss insertion, deletion, and modification anomalies. Why are they considered bad? Illustrate
   with examples.
2. Why are many nulls in a relation considered bad?
3. Discuss the problem of spurious tuples and how we may prevent it.
4. State the informal guidelines for relation schema design that we discussed. Illustrate how violation of
   these guidelines may be harmful.
5. What is a functional dependency? Who specifies the functional dependencies that hold among the
   attributes of a relation schema?
6. What are Armstrong's inference rules?
7. What is meant by the closure of a set of functional dependencies?
8. When are two sets of functional dependencies equivalent? How can we determine their
   equivalence?
9. What does the term unnormalised relation refer to?
10. Define first, second and third normal forms.
11. Define Boyce-Codd normal form. How does it differ from 3NF? Why is it considered a stronger
    form of 3NF?
12. What is an irreducible set of dependencies? Suppose we are given relation R with attributes B, C, D, E,
    F, G and the FDs
       F → G
       G → E
       B → CD
       D → F
    Find an irreducible set of FDs that is equivalent to this given set.
13. What is normalization? Normalize the following table up to 3NF.
Supplier
S_id   s_city     s_status   P_id
S1     Delhi      10         p1, p2
S2     Calcutta   20         p3
S3     Madras     30         p1, p5
Introduction
Query Processor
Query Processing Strategies
Selections Involving Comparisons
Query Optimization
General Transformation Rules for Relational Algebra Operations
Basic Algorithms for Executing Query Operations
Locking Techniques for Concurrency Control
Concurrency Control Based on Timestamp Ordering
Multiversion Concurrency Control Techniques
Query Processing
Learning Objectives
After reading this unit you should appreciate the following:
• Introduction
• Query Processor
• General Strategies for Query Processing
• Query Optimization
• Concept of Security
• Concurrency
• Recovery
In this chapter we discuss the techniques used by a DBMS to process, optimize, and execute high-level
queries. A query expressed in a high-level query language such as SQL must first be scanned, parsed, and
validated. The scanner identifies the language tokens—such as SQL keywords, attribute names, and
relation names—in the text of the query, whereas the parser checks the query syntax to determine whether it
is formulated according to the syntax rules (rules of grammar) of the query language. The query must also
be validated, by checking that all attribute and relation names are valid and semantically meaningful names
in the schema of the particular database being queried. An internal representation of the query is then
created, usually as a tree data structure called a query tree. It is also possible to represent the query using a
graph data structure called a query graph. The DBMS must then devise an execution strategy for retrieving
the result of the query from the database files. A query typically has many possible execution strategies,
and the process of choosing a suitable one for processing a query is known as query optimization.
Figure 5.1 shows the different steps of processing a high-level query. The query optimizer module has the
task of producing an execution plan, and the code generator generates the code to execute the plan. The
runtime database processor has the task of running the query code, whether in compiled or interpreted
mode, to produce the query result. If a runtime error results, an error message is generated by the runtime
database processor.
The term optimization is actually a misnomer because in some cases the chosen execution plan is not the
optimal (best) strategy - it is just a reasonably efficient strategy for executing the query. Finding the
optimal strategy is usually too time-consuming except for the simplest of queries, and may require
information on how the files are implemented and even on the contents of those files - information that may
not be fully available in the DBMS catalog. Hence, planning of an execution strategy may be a more
accurate description than query optimization.
QUERY PROCESSING
133
For lower-level navigational database languages in legacy systems - such as the network DML or the
hierarchical HDML - the programmer must choose the query execution strategy while writing a database
program. If a DBMS provides only a navigational language, there is limited need or opportunity for
extensive query optimization by the DBMS; instead, the programmer is given the capability to choose the
"optimal" execution strategy. On the other hand, a high-level query language - such as SQL for relational
DBMSs (RDBMSs) or OQL for object DBMSs (ODBMSs) - is more declarative in nature because it
specifies what the intended results of the query are, rather than the details of how the result should be
obtained. Query optimization is thus necessary for queries that are specified in a high-level query language.
The steps involved in processing a query are illustrated in Figure 5.2. The basic steps are:
1. Parsing and translation
2. Optimization
3. Evaluation
Before query processing can begin, the system must translate the query into a usable form. A language such
as SQL is suitable for human use, but is ill suited to be the system’s internal representation of a query. A
more useful internal representation is one based on the extended relational algebra.
Thus, the first action the system must take in query processing is to translate a given query into its internal
form. This translation process is similar to the work performed by the parser of a compiler. In generating
the internal form of the query, the parser checks the syntax of the user's query, verifies that the relation
names appearing in the query are names of relations in the database, and so on. A parse-tree representation
of the query is constructed, which is then translated into a relational-algebra expression. If the query was
expressed in terms of a view, the translation phase also replaces all uses of the view by the relational-algebra
expression that defines the view. Parsing is covered in most compiler texts and is outside the scope
of this book.
In the network and hierarchical models (discussed later), query optimization is left, for the most part, to the
application programmer. That choice is made because the data-manipulation-language statements of these
two models are usually embedded in a host programming language, and it is not easy to transform a
network or hierarchical query into an equivalent one without knowledge of the entire application program.
In contrast, relational-query languages are either declarative or algebraic. Declarative languages permit
users to specify what a query should generate without saying how the system should do the generating.
Algebraic languages allow for algebraic transformation of users' queries. Based on the query specification,
it is relatively easy for an optimizer to generate a variety of equivalent plans for a query, and to choose the
least expensive one.
In this unit, we assume the relational model. We shall see that the algebraic basis provided by this model is
of considerable help in query optimization. Given a query, there are generally a variety of methods for
computing the answer. For example, we have seen that, in SQL, a query could be expressed in several
different ways. Each SQL query can itself be translated into a relational-algebra expression in one of
several ways. Furthermore, the relational-algebra representation of a query specifies only partially how to
evaluate a query; there are usually several ways to evaluate relational-algebra expressions. As an
illustration, consider the query
select balance
from account
where balance < 2500
This query can be translated into either of the following relational-algebra expressions:
σ balance<2500 (π balance (account))
π balance (σ balance<2500 (account))
Further, we can execute each relational-algebra operation using one of several different algorithms. For
example, to implement the preceding selection, we can search every tuple in account to find tuples with
balance less than 2500. If a tree index is available on the attribute balance, we can use the index instead to
locate the tuples.
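The equivalence of the two relational-algebra expressions can be checked with a small sketch, using plain Python sets in place of a DBMS (the account rows are invented for illustration):

```python
# Illustrative account relation: (account-number, branch-name, balance).
account = {("A-101", "Downtown",   500),
           ("A-102", "Perryridge", 400),
           ("A-201", "Brighton",  2900)}

# π balance (σ balance<2500 (account)): select first, then project.
plan1 = {bal for (_no, _branch, bal) in account if bal < 2500}

# σ balance<2500 (π balance (account)): project first, then select.
projected = {bal for (_no, _branch, bal) in account}
plan2 = {bal for bal in projected if bal < 2500}

assert plan1 == plan2 == {400, 500}   # same answer either way
```

Both orderings give the same result, but the sizes of the intermediate relations differ - which is precisely why the optimizer cares about the choice.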
To specify fully how to evaluate a query, we need not only to provide the relational-algebra expression, but
also to annotate it with instructions specifying how to evaluate each operation. Annotations may state the
algorithm to be used for a specific operation, or the particular index or indices to use. A relational-algebra
operation annotated with instructions on how to evaluate it is called an evaluation primitive. Several
primitives may be grouped together into a pipeline, in which several operations are performed in parallel. A
sequence of primitive operations that can be used to evaluate a query is a query-execution plan or
query-evaluation plan. Figure 5.3 illustrates an evaluation plan for our example query, in which a particular index
(denoted in the figure as "index I") is specified for the selection operation. The query-execution engine
takes a query-evaluation plan, executes that plan, and returns the answers to the query.
The different evaluation plans for a given query can have different costs. We do not expect users to write
their queries in a way that suggests the most efficient evaluation plan. Rather, it is the responsibility of the
system to construct a query-evaluation plan that minimizes the cost of query evaluation. As we have
explained, the most relevant performance measure is usually the number of disk accesses.
The evaluation plan of Figure 5.3:
π balance
  σ balance<2500; use index I
    account
Query optimization is the process of selecting the most efficient query-evaluation plan for a query. One
aspect of optimization occurs at the relational- algebra level. An attempt is made to find an expression that
is equivalent to the given expression, but that is more efficient to execute. The other aspect involves the
selection of a detailed strategy for processing the query, such as choosing the algorithm to use for executing
an operation, choosing the specific indices to use, and so on.
To choose among different query-evaluation plans, the optimizer has to estimate the cost of each evaluation
plan. Computing the precise cost of evaluation of a plan is usually not possible without actually evaluating
the plan. Instead, optimizers make use of statistical information about the relations, such as relation sizes
and index depths, to make a good estimate of the cost of a plan.
Consider the preceding example of a selection applied to the account relation. The optimizer estimates the
cost of the different evaluation plans. If an index is available on attribute balance of account, then the
evaluation plan shown in Figure 5.3, in which the selection is done using the index, is likely to have the
lowest cost and thus, to be chosen.
Once the query plan is chosen, the query is evaluated with that plan, and the result of the query is output.
The sequence of steps already described for processing a query is representative; not all databases exactly
follow those steps. For instance, instead of using the relational-algebra representation, several databases use
an annotated parse-tree representation based on the structure of the given SQL query. However, the
concepts that we describe here form the basis of query processing in databases.
In the next section, we construct a cost model that allows us to estimate the cost of various operations.
Using this cost measure, we address the optimal evaluation of individual operations. We examine the
efficiencies that we can achieve by combining multiple operations into one pipelined operation. These tools
allow us to determine the approximate cost of evaluating a given relational-algebra expression optimally.
Finally, we show equivalences among relational-algebra expressions. We can use these equivalences to
replace a relational-algebra expression constructed from a user's query with an equivalent expression whose
estimated cost of evaluation is lower.
The strategy that we choose for query evaluation depends on the estimated cost of the strategy. Query
optimizers make use of statistical information stored in the DBMS catalog to estimate the cost of a plan.
The relevant catalog information about relations includes:
nr: the number of tuples in the relation r.
br: the number of blocks containing tuples of relation r.
sr: the size of a tuple of relation r in bytes.
fr: the blocking factor of relation r - that is, the number of tuples of relation r that fit into one
block.
V(A, r): the number of distinct values that appear in the relation r for attribute A. This value is the
same as the size of ΠA(r). If A is a key for relation r, V(A, r) is nr.
SC(A, r): the selection cardinality of attribute A of relation r. Given relation r and an attribute A of
the relation, SC(A, r) is the average number of records that satisfy an equality condition on attribute
A, given that at least one record satisfies the equality condition. For example, SC(A, r) = 1 if A is a
key attribute of r; for a non-key attribute, we estimate that the V(A, r) distinct values are distributed
evenly among the tuples, yielding SC(A, r) = nr/V(A, r).
The last two statistics, V(A, r) and SC(A, r), can also be maintained for sets of attributes if desired, instead
of just for individual attributes. Thus, given a set of attributes A, V(A, r) is the size of ΠA(r).
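The catalog statistics above can be computed directly for a toy relation. The rows and the blocking factor below are invented for illustration:

```python
import math

# Illustrative relation r: (account-number, branch-name, balance).
rows = [("A-1", "Downtown", 500), ("A-2", "Downtown", 700),
        ("A-3", "Mianus",   700), ("A-4", "Brighton", 750)]

n_r = len(rows)             # nr: number of tuples
f_r = 2                     # fr: blocking factor (assumed 2 tuples per block)
b_r = math.ceil(n_r / f_r)  # br: blocks needed to hold the relation

def V(attr_index):
    """V(A, r): number of distinct values of attribute A in r."""
    return len({t[attr_index] for t in rows})

def SC(attr_index):
    """SC(A, r) = nr / V(A, r) for a non-key attribute, assuming the
    distinct values are distributed evenly among the tuples."""
    return n_r / V(attr_index)

assert (n_r, b_r) == (4, 2)
assert V(1) == 3                    # three distinct branch names
assert abs(SC(1) - 4 / 3) < 1e-9    # about 1.33 tuples per branch name
```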
If we assume that the tuples of relation r are stored together physically in a file, the following equation
holds:
br = ⌈nr / fr⌉
In addition to catalog information about relations, the following catalog information about indices is also
used:
fi: the average fan-out of internal nodes of index i, for tree-structured indices such as
B+-trees.
HTi: the number of levels in index i - that is, the height of index i. For a balanced tree index (such
as a B+-tree) on attribute A of relation r, HTi = ⌈log fi (V(A, r))⌉. For a hash index, HTi is 1.
LBi: the number of lowest-level index blocks in index i - that is, the number of blocks at the leaf
level of the index.
We use the statistical variables to estimate the size of the result and the cost of various operations and
algorithms, as we shall see in the following sections. We refer to the cost estimate of algorithm A as EA.
If we wish to maintain accurate statistics, then every time a relation is modified, we must also update the
statistics. This update incurs a substantial amount of overhead. Therefore, most systems do not update the
statistics on every modification. Instead, the updates are done during periods of light system load.
As a result, the statistics used for choosing a query-processing strategy may not be completely accurate.
However, if not too many updates occur in the intervals between the updates of the statistics, the statistics
will be sufficiently accurate to provide a good estimate of the relative costs of the different plans. The
statistical information noted here is simplified. Real-world optimizers often maintain further statistical
information to improve the accuracy of their cost estimates of evaluation plans.
The cost of query evaluation can be measured in terms of a number of different resources, including disk
accesses, CPU time to execute a query, and, in a distributed or parallel database system, the cost of
communication. The response time for a query-evaluation plan (that is, the clock time required to execute
the plan), assuming no other activity is going on in the computer, would account for all these costs, and
could be used as a good measure of the cost of the plan.
In large database systems, however, disk accesses (which we measure as the number of transfers of blocks
from disk) are usually the most important cost, since disk accesses are slow compared to in-memory
operations. Moreover, CPU speeds have been improving much faster than have disk speeds. Thus, it is
likely that the time spent in disk activity will continue to dominate the total time to execute a query. Finally,
estimating the CPU time is relatively hard, compared to estimating the disk-access cost. Therefore, the
disk-access cost is considered a reasonable measure of the cost of a query-evaluation plan.
To simplify our computation of disk-access cost, we assume that all transfers of blocks have the same cost.
This assumption ignores the variance arising from rotational latency (waiting for the desired data to spin
under the read-write head) and seek time (the time that it takes to move the head over the desired track or
cylinder). Although these factors are significant, they are difficult to estimate in a shared system. Therefore,
we simply use the number of block transfers from disk as a measure of the actual cost.
We also ignore the cost of writing the final result of an operation back to disk. Whatever the
query-evaluation plan used, this cost does not change; hence, ignoring it does not affect the choice of a plan.
The costs of all the algorithms that we consider depend significantly on the size of the buffer in main
memory. In the best case, all data can be read into the buffers, and the disk does not need to be accessed
again. In the worst case, we assume that the buffer can hold only a few blocks of data - approximately one
block per relation. When presenting cost estimates, we generally assume the worst case.
In query processing, the file scan is the lowest-level operator to access data. File scans are search
algorithms that locate and retrieve records that fulfill a selection condition. In relational systems, file scan
allows an entire relation to be read in the cases where the relation is stored in a single, dedicated file.
Consider a selection operation on a relation whose tuples are stored together in one file. Two scan
algorithms to implement the selection operation are as follows:
A1 (linear search). In a linear search, each file block is scanned, and all records are tested to see
whether they satisfy the selection condition. Since all blocks have to be read, EA1 = br. (Recall that
EA1 denotes the estimated cost of algorithm A1.) For a selection on a key attribute, we assume that
one-half of the blocks will be searched before the record is found, at which point the scan can
terminate. The estimate in this case is EA1 = br/2.
Although it may be inefficient in many cases, the linear search algorithm can be applied to any file,
regardless of the ordering of the file or of the availability of indices.
A2 (binary search). If the file is ordered on an attribute, and the selection condition is an equality
comparison on the attribute, we can use a binary search to locate records that satisfy the selection.
The binary search is performed on the blocks of the file, giving the following estimate for the file
blocks to be scanned:
EA2 = ⌈log2(br)⌉ + ⌈SC(A, r)/fr⌉ − 1
The first term, ⌈log2(br)⌉, accounts for the cost of locating the first tuple by a binary search on the blocks.
The total number of records that will satisfy the selection is SC(A, r), and these records will occupy
⌈SC(A, r)/fr⌉ blocks, of which one has already been retrieved, giving the preceding estimate. If the equality
condition is on a key attribute, then SC(A, r) = 1, and the estimate reduces to EA2 = ⌈log2(br)⌉.
The cost estimates for binary search are based on the assumption that the blocks of a relation are stored
contiguously on disk. Otherwise, the cost of looking up the file-access structures (which may be on disk) to
locate the physical address of a block in a file must be added to the estimates. The cost estimates also
depend on the size of the result of the selection.
If we assume a uniform distribution of values (that is, each value appears with equal probability), then the
query σA=a(r) is estimated to have nr/V(A, r) tuples, assuming that the value a appears in attribute A of
some record of r.
The assumption that the value a in the selection appears in some record is generally true, and cost estimates
often make it implicitly. However, it is often not realistic to assume that each value appears with equal
probability. The branch-name attribute in the account relation is an example where the assumption is not
valid. There is one tuple in the account relation for each account. It is reasonable to expect that the large
branches have more accounts than smaller branches. Therefore, certain branch-name values appear with
greater probability than do others. Despite the fact that the uniform-distribution assumption is often not
correct, it is a reasonable approximation of reality in many cases, and it helps us to keep our presentation
relatively simple.
Under the uniform-distribution assumption, the selection cardinality is

SC(A, r) = nr / V(A, r)
As an illustration of this use of the cost estimates, suppose that we have the following statistical
information about the account relation:
faccount=20 (that is, 20 tuples of account fit in one block).
V(branch-name, account) = 50 (that is, there are 50 different branches).
V (balance, account) = 500 (that is, there are 500 different balance values).
naccount=10000 (that is, the account relation has 10,000 tuples).
Consider the query
σbranch− name=" Perryridge" ( account)
Since the relation has 10,000 tuples, and each block holds 20 tuples, the number of blocks is baccount=500. A
simple file scan on account therefore takes 500 block accesses.
Suppose that account is sorted on branch-name. Since V(branch-name, account) is 50, we expect that 10000/50 = 200 tuples of the account relation pertain to the Perryridge branch. These tuples would fit in 200/20 = 10 blocks. A binary search to find the first record would take [log2(500)] = 9 block accesses. Thus, the total cost would be 9 + 10 − 1 = 18 block accesses.
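The two scan-cost estimates above can be computed directly. The following is a rough sketch (the function names are ours, not from the text), reproducing the Perryridge example:

```python
import math

def linear_scan_cost(b_r, key_attribute=False):
    """A1: EA1 = br in general; br/2 on average for a selection on a key attribute."""
    return b_r / 2 if key_attribute else b_r

def binary_search_cost(b_r, sc, f_r):
    """A2: EA2 = ceil(log2(br)) + ceil(SC(A, r) / fr) - 1 for a sorted file."""
    return math.ceil(math.log2(b_r)) + math.ceil(sc / f_r) - 1

# The account example: 10,000 tuples, 20 per block -> 500 blocks;
# 10000/50 = 200 tuples pertain to the Perryridge branch.
b_account = 10000 // 20   # 500 blocks
matching = 10000 // 50    # 200 tuples
print(linear_scan_cost(b_account))              # 500
print(binary_search_cost(b_account, matching, 20))  # 9 + 10 - 1 = 18
```

The binary-search estimate of 18 block accesses matches the worked example above.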
Index structures are referred to as access paths, since they provide a path through which data can be located
and accessed. It is efficient to read the records of a file in an order corresponding closely to physical order.
Recall that a primary index is an index that allows the records of a file to be read in an order that corresponds to the physical order in the file. An index that is not a primary index is called a secondary index.
Search algorithms that use an index are referred to as index scans. Ordered indices, such as
B+-trees, also permit access to tuples in a sorted order, which is useful for implementing range queries.
Although indices can provide fast, direct, and ordered access, their use imposes the overhead of access to
those blocks containing the index. We need to take these block accesses into account when we estimate the cost of a strategy that involves the use of indices. We use the selection predicate to guide us in the choice of the index to use in processing the query.
A3 (primary index, equality on key). For an equality comparison on a key attribute with a primary
index, we can use the index to retrieve a single record that satisfies the corresponding equality
condition. To retrieve a single record, we need to retrieve one block more than the number of index
levels (HTi); the cost is EA3=HTi+1.
A4 (primary index, equality on non-key). We can retrieve multiple records by using a primary index when the selection condition specifies an equality comparison on a non-key attribute A. SC(A, r) records will satisfy the equality condition, and [SC(A, r)/fr] file blocks will be accessed; hence,

EA4 = HTi + [SC(A, r)/fr]
A5 (secondary index, equality). Selections specifying an equality condition can use a secondary
index. This strategy can retrieve a single record if the indexing field is a key; multiple records can
be retrieved if the indexing field is not a key. For an equality condition on attribute A, SC(A, r)
records satisfy the condition. Given that the index is a secondary index, we assume the worst-case
scenario that each matching record resides on a different block, yielding EA5=HTi+ SC(A,r), or, for
a key indexing attribute, EA5=HTi+1.
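As a sketch, the three index-scan estimates A3 to A5 can be written as small cost functions (names are illustrative; HTi is the number of index levels, as in the text):

```python
import math

def primary_index_key_cost(ht_i):
    """A3: equality on a key via a primary index -> EA3 = HTi + 1."""
    return ht_i + 1

def primary_index_nonkey_cost(ht_i, sc, f_r):
    """A4: equality on a non-key via a primary index -> EA4 = HTi + ceil(SC(A,r)/fr)."""
    return ht_i + math.ceil(sc / f_r)

def secondary_index_cost(ht_i, sc, key=False):
    """A5: secondary index; worst case, one block access per matching record."""
    return ht_i + 1 if key else ht_i + sc

# With a two-level index (HTi = 2) and the account statistics from the text:
print(primary_index_key_cost(2))            # 3
print(primary_index_nonkey_cost(2, 200, 20))  # 2 + 10 = 12
print(secondary_index_cost(2, 200))         # 2 + 200 = 202
```

Note how the non-clustering (secondary) case pays for one block per matching tuple, which is why it degrades quickly for low-selectivity conditions.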
We assume the same statistical information about account as used in the earlier example. We also suppose
that the following indices exist on account:
A primary, B+-tree index for attribute branch name.
A secondary, B+-tree index for attribute balance.
As mentioned earlier, we make the simplifying assumption that values are distributed uniformly.
Consider the query
σ branch− name=" Perryridge" ( account)
Since V(branch-name, account) = 50, we expect that 10000/50 = 200 tuples of the account relation pertain to the Perryridge branch. Suppose that we use the index on branch-name. Since the index is a clustering index, 200/20 = 10 block reads are required to read the account tuples. In addition, several index blocks must be read. Assume that the B+-tree index stores 20 pointers per node. Since there are 50 different branch names, the B+-tree index must have between three and five leaf nodes. With this number of leaf nodes, the entire tree has a depth of 2, so two index blocks must be read. Thus, the preceding strategy requires 12 total block reads.
Selections Involving Comparisons
Consider a selection of the form σ A ≤ v ( r ) . In the absence of any further information about the comparison,
we assume that approximately one-half of the records will satisfy the comparison condition; hence, the
result has n r / 2 tuples.
If the actual value used in the comparison (ν) is available at the time of cost estimation, a more accurate
estimate can be made. The lowest and highest values (min(A, r) and max(A, r)) for the attribute can be
stored in the catalog. Assuming that values are uniformly distributed, we can estimate the number of
records that will satisfy the condition
A ≤ ν as 0 if ν < min(A, r), and as

nr × (ν − min(A, r)) / (max(A, r) − min(A, r))

otherwise.
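The interpolation formula above can be sketched as a small function. The catalog values used below (balances uniformly spread between 0 and 2000) are illustrative assumptions, not figures from the text:

```python
def range_count(n_r, v, a_min, a_max):
    """Estimated number of tuples satisfying A <= v, assuming A is
    uniformly distributed between min(A, r) and max(A, r)."""
    if v < a_min:
        return 0            # no tuple can satisfy the condition
    if v >= a_max:
        return n_r          # every tuple satisfies the condition
    return n_r * (v - a_min) / (a_max - a_min)

# Hypothetical catalog entries: 10,000 tuples, balances in [0, 2000].
print(range_count(10000, 1200, 0, 2000))  # 6000.0
```

With these assumed bounds, balance ≤ 1200 is estimated to select 6,000 of the 10,000 tuples, noticeably more than the default one-half guess only because the bounds place 1200 past the midpoint.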
We can implement selections involving comparisons either using a linear or binary search, or using indices
in one of the following ways:
A6 (primary index, comparison). A primary ordered index (for example, a primary B+-tree index)
can be used when the selection condition is a comparison. For comparison conditions of the form A
> ν or A ≥ ν , the primary index can be used to direct the retrieval of tuples, as follows. For A ≥
ν, we look up the value ν in the index to find the first tuple in the file that has a value of A = ν. A
file scan starting from that tuple up to the end of the file returns all tuples that satisfy the condition.
For A > ν , the file scan starts with the first tuple such that A > ν .
For comparisons of the form A < ν or A ≤ ν , an index lookup is not required. For A < ν , we use a
simple file scan starting from the beginning of the file, and continuing up to (but not including) the
first tuple with attribute A = ν . The case of A ≤ ν is similar, except that the scan continues up to
(but not including) the first tuple with attribute A > ν . In either case, the index is not useful.
We assume that approximately one-half of the records will satisfy one of the conditions. Under this
assumption, retrieval using the index has the following cost:
EA6 = HTi + br/2
If the actual value used in the comparison is available at the time of cost estimation, a more
accurate estimate can be made. Let the estimated number of values that satisfy the condition (as
described earlier) be c. Then,
EA6 = HTi + [c/fr]
A7 (secondary index, comparison). We can use a secondary ordered index to guide retrieval for comparison conditions involving <, ≤, ≥, or >. The lowest-level index blocks are scanned either from the smallest value up to ν (for < and ≤), or from ν up to the maximum value (for > and ≥).
For these comparisons, if we assume that at least one-half of the records satisfy the condition, then
one-half of the lowest-level index blocks are accessed and, via the index, one-half of the file
records are accessed. Furthermore, a path must be traversed in the index from the root block to the
first leaf block to be used. Thus, the cost estimate is the following:
EA7 = HTi + LBi/2 + nr/2
As with non-equality comparisons on clustering indices, we can get a more accurate estimate if we
know the actual value used in the comparison at the time of cost estimation. In Tandem's Non-Stop
SQL System, B+-trees are used both for primary data storage and as secondary access paths. The
primary indices are clustering indices, whereas the secondary ones are not. Rather than pointers to
records' physical location, the secondary indices contain keys to search the primary B+-tree. The
cost formulae described previously for secondary indices will have to be modified slightly if such
indices are used.
Although the preceding algorithms show that indices are helpful in processing selections with
comparisons, they are not always so useful. As an illustration, consider the query
σ balance< 1200 ( account)
Suppose that the statistical information about the relations is the same as that used earlier. If we have no
information about the minimum and maximum balances in the account relation, then we assume that one-half of the tuples satisfy the selection.
If we use the index for balance, we estimate the number of block accesses as follows. Let us assume that 20
pointers fit into one node of the B+ tree index for balance. Since there are 500 different balance values, and
each leaf node of the tree must be at least half-full, the tree has between 25 and 50 leaf nodes. So, as was
the case for the index on branch-name, the index for balance has a depth of 2, and two block accesses are
required to read the first index block. In the worst case, there are 50 leaf nodes, one-half of which must be
accessed. This accessing leads to 25 more block reads. Finally, for each tuple that we locate in the index,
we have to retrieve that tuple from the relation. We estimate that 5000 tuples (one-half of the 10,000 tuples)
satisfy the condition. Since the index is non-clustering, in the worst case each of these tuple accesses will
require a separate block access. Thus, we get a total of 5027 block accesses.
In contrast, a simple file scan will take only 10000/20 = 500 block accesses. In this case, it is clearly not
wise to use the index, and we should use the file scan instead.
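The comparison just made can be reproduced arithmetically. This sketch simply recomputes the two plan costs from the example's numbers (variable names are ours):

```python
# Worst-case secondary-index plan for balance < 1200, per the example:
index_traversal = 2           # root-to-leaf path of the balance B+-tree
leaf_blocks = 50 // 2         # one-half of the worst-case 50 leaf nodes
tuple_fetches = 10000 // 2    # one block access per matching tuple
                              # (the index is non-clustering)
index_plan_cost = index_traversal + leaf_blocks + tuple_fetches

file_scan_cost = 10000 // 20  # 500 blocks read sequentially

print(index_plan_cost, file_scan_cost)  # 5027 500 -> the file scan wins
```

The order-of-magnitude gap (5027 versus 500 block accesses) is exactly why an optimizer must cost-compare access paths rather than reach for an index reflexively.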
Implementation of Complex Selections
So far, we have considered only simple selection conditions of the form A op B, where op is an equality or
comparison operation. We now consider more complex selection predicates.
Conjunction: A conjunctive selection is a selection of the form
σ θ1 ∧ ...∧ θn ( r )
We can estimate the result size of such a selection as follows. For each θi, we estimate the size of the selection σθi(r), denoted by si, as described previously. Thus, the probability that a tuple in the relation satisfies selection condition θi is si/nr.
The preceding probability is called the selectivity of the selection σ θi ( r ) . Assuming that the
conditions are independent of each other, the probability that a tuple satisfies all the conditions is
simply the product of all these probabilities. Thus, we estimate the size of the full selection as
nr × (s1 × s2 × ... × sn) / nr^n
Disjunction: A disjunctive selection is a selection of the form

σθ1 ∨ θ2 ∨ ... ∨ θn(r)
A disjunctive condition is satisfied by the union of all records satisfying the individual, simple
conditions θ i .
As before, let si/nr, denote the probability that a tuple satisfies condition θ i . The probability that the
tuple will satisfy the disjunction is then 1 minus the probability that it will satisfy none of the
conditions, or
1 − (1 − s1/nr) × (1 − s2/nr) × ... × (1 − sn/nr)
Multiplying this value by nr, gives us the number of tuples that satisfy the selection.
Negation: The result of a selection σ¬θ(r) is simply the tuples of r that are not in σθ(r). We already know how to estimate the size of σθ(r). The size of σ¬θ(r) is therefore estimated to be

size(r) − size(σθ(r))
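The three estimation formulas can be sketched as functions (the independence assumption is exactly the one stated in the text; function names are ours):

```python
from functools import reduce

def conjunction_size(n_r, sizes):
    """nr * (s1 * ... * sn) / nr**n, assuming the conditions are independent."""
    prod = reduce(lambda a, b: a * b, sizes, 1)
    return n_r * prod / n_r ** len(sizes)

def disjunction_size(n_r, sizes):
    """nr * (1 - (1 - s1/nr) * ... * (1 - sn/nr))."""
    miss = 1.0
    for s in sizes:
        miss *= 1 - s / n_r    # probability a tuple fails this condition
    return n_r * (1 - miss)

def negation_size(n_r, s_theta):
    """size(r) - size(selection being negated)."""
    return n_r - s_theta

# account example: s1 = 200 (Perryridge tuples), s2 = 20 (balance = 1200).
print(conjunction_size(10000, [200, 20]))  # 0.4 -> "about one tuple"
```

The conjunction estimate of 0.4 tuples is why, later in the chapter, the pointer-intersection strategy expects to read only a single block of account.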
We can implement a selection operation involving either a conjunction or a disjunction of simple
conditions using one of the following algorithms:
A8 (conjunctive selection using one index). We first determine whether an access path is available
for an attribute in one of the simple conditions. If one of the selection algorithms A2 through A7 can
retrieve records satisfying that condition, then we complete the operation by testing, in the memory buffer, whether each retrieved record satisfies the remaining simple conditions.
Selectivity is central to determining in what order the simple conditions in a conjunctive selection
should be tested. The most selective condition (that is, the one with the smallest selectivity) will
retrieve the smallest number of records; hence, that condition should constitute the first scan.
A9 (conjunctive selection using composite index). An appropriate composite index may be
available for some conjunctive selections. If the selection specifies an equality condition on two or
more attributes, and a composite index exists on these combined attribute fields, then the index can
be searched directly. The type of index determines which of algorithms A3, A4, or A5 will be used.
A10(conjunctive selection by intersection of identifiers). Another alternative for implementing
conjunctive selection operations involves the use of record pointers or record identifiers. This
algorithm requires indices with record pointers, on the fields involved in the individual conditions.
Each index is scanned for pointers to tuples that satisfy an individual condition. The intersection of
all the retrieved pointers is the set of pointers to tuples that satisfy the conjunctive condition. We
then use the pointers to retrieve the actual records. If indices are not available on all the individual
conditions, then the retrieved records are tested against the remaining conditions.
A11 (disjunctive selection by union of identifiers). If access paths are available on all the conditions
of a disjunctive selection, each index is scanned for pointers to tuples that satisfy the individual
condition. The union of all the retrieved pointers yields the set of pointers to all tuples that satisfy
the disjunctive condition. We then use the pointers to retrieve the actual records.
However, if even one of the conditions does not have an access path, we will have to perform a
linear scan of the relation to find tuples that satisfy the condition. Therefore, if there is even one
such condition in the disjunct, the most efficient access method is a linear scan, with the disjunctive
condition tested on each tuple during the scan.
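Algorithms A10 and A11 reduce to set intersection and union over record identifiers. The sketch below mocks the per-condition index lookups as ready-made pointer sets (the pointer values are invented for illustration):

```python
def conjunctive_by_intersection(pointer_sets):
    """A10: tuples satisfying every condition = intersection of the
    pointer sets retrieved from each per-condition index."""
    result = set(pointer_sets[0])
    for s in pointer_sets[1:]:
        result &= set(s)
    return result

def disjunctive_by_union(pointer_sets):
    """A11: tuples satisfying any condition = union of the pointer sets."""
    result = set()
    for s in pointer_sets:
        result |= set(s)
    return result

s1 = {101, 205, 388}         # e.g. pointers for balance = 1200
s2 = {205, 388, 460, 512}    # e.g. pointers for branch-name = "Perryridge"
print(conjunctive_by_intersection([s1, s2]))  # {205, 388}
print(sorted(disjunctive_by_union([s1, s2])))  # [101, 205, 388, 460, 512]
```

Only after the set operation are the surviving pointers dereferenced to fetch actual records, which is the source of the strategy's savings.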
To illustrate the preceding algorithms, we suppose that we have the query
select account-number
from account
where branch-name = "Perryridge" and balance = 1200
We assume that the statistical information about the account relation is the same as that in the earlier
example.
If we use the index on branch-name, we will need a total of 12 block reads. If we use the index for balance, we
estimate its access as follows. Since V (balance, account) = 500, we expect that 10000/500 = 20 tuples of
the account relation will have a balance of $1200. However, since the index for balance is non-clustering,
we anticipate that one block read will be required for each tuple. Thus, 20 block reads are required just to
read the account tuples.
Let us assume that 20 pointers fit into one node of the B+ tree index for balance. Since there are 500
different balance values, the tree has between 25 and 50 leaf nodes. So, as was the case for the B+-tree
index on branch-name, the index for balance has a depth of 2, and two block accesses are required to read
the necessary index blocks. Therefore, this strategy requires a total of 22 block reads.
Thus, we conclude that it is preferable to use the index for branch-name. Observe that, if both indices were
non-clustering, we would prefer to use the index on balance, since we would expect only 20 tuples to have balance = 1200, versus 200 tuples with branch-name = "Perryridge". Without the clustering property, our
first strategy could require as many as 200 block accesses to read the data, since, in the worst case, each
tuple is on a different block. We add these 200 accesses to the 2 index block accesses, for a total of 202
block reads. However, because of the clustering property of the branch-name index, it is actually less
expensive in this example to use the branch-name index.
Another way in which we could use the indices to process our example query is by using intersection of
identifiers. We use the index for balance to retrieve pointers to records with balance = 1200, rather than
retrieving the records themselves. Let S1 denote this set of pointers. Similarly, we use the index for branch-name to retrieve pointers to records with branch-name = "Perryridge". Let S2 denote this set of pointers. Then, S1 ∩ S2 is a set of pointers to records with branch-name = "Perryridge" and balance = 1200.
This technique requires both indices to be accessed. Both indices have a height of 2, and, for each index,
the number of pointers retrieved, estimated earlier as 20 and 200, will fit into a single leaf page. Thus, we
read a total of four index blocks to retrieve the two sets of pointers. The intersection of the two sets of
pointers can be computed with no further disk I/O. We estimate the number of blocks that must be read
from the account file by estimating the number of pointers in S1 ∩ S2 .
Since V(branch-name, account) = 50 and V(balance, account) = 500, we estimate that one tuple in 50 × 500, or one in 25,000, has both branch-name = "Perryridge" and balance = 1200. This estimate is based on an assumption of uniform distribution (which we made earlier), and on an added assumption that the distributions of branch names and balances are independent. Based on these assumptions, S1 ∩ S2 is estimated to have only one pointer. Thus, only one block of account needs to be read. The total estimated cost of this strategy is five block reads.
Sorting of data plays an important role in database systems for two reasons. First, SQL queries can specify
that the output be sorted. Second, and equally important for query processing, several of the relational
operations, such as joins, can be implemented efficiently if the input relations are first sorted.
We can accomplish sorting by building an index on the sort key and then using that index to read the
relation in sorted order. However, such a process orders the relation only logically, through an index, rather
than physically. Hence, the reading of tuples in the sorted order may lead to a disk access for each tuple.
For this reason, it may be desirable to order the tuples physically.
The problem of sorting has been studied extensively, both for the case where the relation fits entirely in
main memory, and for the case where the relation is bigger than memory. In the first case, standard sorting techniques such as quicksort can be used. Here, we discuss how to handle the second case.
Sorting of relations that do not fit in memory is called external sorting. The most commonly used technique
for external sorting is the external sort-merge algorithm. We describe the external sort-merge algorithm
next. Let M denote the number of page frames in the main-memory buffer (the number of disk blocks
whose contents can be buffered in main memory).
1. In the first stage, a number of sorted runs are created:

   i = 0;
   repeat
       read M blocks of the relation, or the rest of the relation,
       whichever is smaller;
       sort the in-memory part of the relation;
       write the sorted data to run file Ri;
       i = i + 1;
   until the end of the relation
2. In the second stage, the runs are merged. Suppose, for now, that the total number of runs, N, is less than M, so that we can allocate one page frame to each run and have space left to hold one page of output. The merge stage operates as follows:

   read one block of each of the N run files Ri into a buffer page in memory;
   repeat
       choose the first tuple (in sort order) among all buffer pages;
       write the tuple to the output, and delete it from the buffer page;
       if the buffer page of any run Ri is empty and not end-of-file(Ri)
           then read the next block of Ri into the buffer page;
   until all buffer pages are empty
The output of the merge stage is the sorted relation.
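The two stages can be simulated in a few lines. This is a toy sketch, not a real external algorithm: "disk blocks" are just Python lists, and the merge is single-pass, so it assumes the number of runs is below M, as the text does at this point:

```python
import heapq

def external_sort_merge(blocks, m):
    """Simulate external sort-merge: `blocks` is the relation as a list of
    blocks (lists of sort-key values); m is the number of buffer page frames."""
    # Stage 1: read m blocks at a time, sort them in memory, emit a run.
    runs = []
    for i in range(0, len(blocks), m):
        chunk = [t for b in blocks[i:i + m] for t in b]
        runs.append(sorted(chunk))
    # Stage 2: N-way merge of the sorted runs (assumes len(runs) < m + 1).
    return list(heapq.merge(*runs))

data = [[24, 19], [31, 33], [14, 16], [21, 3], [2, 7]]
print(external_sort_merge(data, 3))
# [2, 3, 7, 14, 16, 19, 21, 24, 31, 33]
```

A faithful implementation would write each run to disk and buffer one block per run during the merge; the structure of the computation, however, is exactly the two stages above.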
[Figure 5.4: External sorting using sort-merge]
In general, if the relation is much larger than memory, then there may be M or more runs generated in the
first stage, and it is not possible to allocate a page frame for each run during the merge stage. In this case,
the merge operation is done in multiple passes. Since there is enough memory for M -1 input buffer pages,
each merge can take M - 1 runs as input.
The initial pass functions as follows. The first M - 1 runs are merged (as described previously) to get a
single run for the next pass. Then, the next M - 1 runs are similarly merged, and so on, until all the initial
runs have been processed. At this point, the number of runs has been reduced by a factor of M - 1. If this
reduced number of runs is still greater than or equal to M, another pass is made, with the runs created by the
first pass as input. Each pass reduces the number of runs by a factor of M - 1. These passes are repeated as
many times as required until the number of runs is less than M; a final pass then generates the sorted output.
Figure 5.4 illustrates the steps of the external sort-merge of an example relation. For illustration purposes,
we assume that only one tuple fits in a block (fr = 1), and we assume that memory holds at most three page
frames. During the merge stage, two page frames are used for input and one for output.
Let us compute how many block transfers are required for the external sort merge. In the first stage, every
block of the relation is read and is written out again, giving a total of 2br disk accesses. The initial number
of runs is [br/M]. Since the number of runs is decreased by a factor of M-1 in each merge pass, the total
number of merge passes required is given by [logM − 1 (br / M )] . Each of these passes reads every block of
the relation once and writes it out once with two exceptions. First, the final pass can produce the sorted
output without writing its result to disk. Second, there may be runs that are not read in or written out during
a pass—for example, if there are M runs to be merged in a pass, M - 1 are read in and merged, and one run
is not accessed during the pass. Ignoring the (relatively small) savings due to the latter effect, the total
number of disk accesses for external sorting of the relation is
br (2[logM−1 (br/M)] + 1)

Applying this equation to the example in Figure 5.4, we get a total of 12 × (4 + 1) = 60 block transfers, as you can verify from the figure. Note that this value does not include the cost of writing out the final result.
The Join Operation
In this section, we first show how to estimate the size of the result of a join. We then study several
algorithms for computing the join of relations, and analyze their respective costs. We use the word equi-join
to refer to a join of the form r
r.A=s.B s, where A and B are attributes or sets of attributes of relations r and
s respectively.
We use as a running example the expression

depositor ⋈ customer
We assume the following catalog information about the two relations:
ncustomer = 10,000.
fcustomer = 25, which implies that bcustomer = 10000/25 = 400.
ndepositor = 5000.
fdepositor = 50, which implies that bdepositor = 5000/50 = 100.
V (customer-name, depositor)=2500, which implies that, on average, each customer has two
accounts.
We also assume that customer-name in depositor is a foreign key on customer.
Estimation of the Size of Joins
The Cartesian product r × s contains nr × ns tuples. Each tuple of r × s occupies sr + ss bytes, from which
we can calculate the size of the Cartesian product.
Estimation of the size of a natural join is somewhat more complicated than is estimation of the size of a
selection or of a Cartesian product. Let r(R) and s(S) be relations.
If R ∩ S = ∅, that is, the relations have no attribute in common, then r ⋈ s is the same as r × s, and we can use our estimation technique for Cartesian products.
Student Activity 5.1

1. Discuss the various steps in query processing.
2. What do you understand by measuring the cost of a query?
3. Discuss the steps involved in the basic algorithms of query processing.
4. Discuss sorting for query processing.
Query Optimization
In this lesson, we discuss optimization techniques that apply heuristics rules to modify the internal
representation of a query-which is usually in the form of a query tree or a query graph data structure- to
improve its expected performance. The parser of a high-level query first generates an initial internal
representation, which is then optimized according to heuristic rules. Following that, a query execution plan is generated to execute groups of operations based on the access paths available on the files involved in the query.
One of the main heuristic rules is to apply SELECT and PROJECT operations before applying the JOIN or
other binary operations. This is because the size of the file resulting from a binary operation-such as JOIN
is usually a multiplicative function of the sizes of the input files. The SELECT and PROJECT operations
reduce the size of a file and hence, should be applied before a join or other binary operation.
Query Trees and Query Graphs
A query tree is a tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes. An execution of the query tree consists of executing an internal node operation whenever its operands are available, and then replacing that internal node by the relation that results from executing the operation. The execution terminates when the root node is executed and produces the result relation for the query.
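The bottom-up execution just described can be sketched with a tiny tree class. This is an illustrative toy, not any real optimizer's representation: relations are lists of dicts, and an operation is just a Python callable:

```python
class Node:
    """A query-tree node: leaves hold input relations; internal nodes
    hold a relational operation applied to their children's results."""
    def __init__(self, op=None, children=(), relation=None):
        self.op = op                  # callable for internal nodes
        self.children = list(children)
        self.relation = relation      # input relation, for leaf nodes

    def execute(self):
        if self.op is None:
            return self.relation
        # Operands are produced first; this node is then, in effect,
        # replaced by the relation resulting from its operation.
        return self.op(*[c.execute() for c in self.children])

# A one-operator tree: a selection over a toy PROJECT relation.
projects = Node(relation=[{"PNUMBER": 10, "PLOCATION": "Stafford"},
                          {"PNUMBER": 20, "PLOCATION": "Houston"}])
root = Node(op=lambda r: [t for t in r if t["PLOCATION"] == "Stafford"],
            children=[projects])
print(root.execute())  # [{'PNUMBER': 10, 'PLOCATION': 'Stafford'}]
```

Executing the root recursively forces each internal node to wait for its operands, which is precisely the ordering constraint the text describes for nodes (1), (2), and (3).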
Figure 5.5 shows a query tree for query Q2: for every project located in 'Stafford', retrieve the project number, the controlling department number, and the department manager's last name, address, and birth date. This query is specified on the relational schema and corresponds to the following relational algebra expression:
ΠPNUMBER, DNUM, LNAME, ADDRESS, BDATE (σPLOCATION='Stafford' ∧ DNUM=DNUMBER ∧ MGRSSN=SSN ((PROJECT × DEPARTMENT) × EMPLOYEE))

[Figure 5.5: Query tree for Q2, with leaf nodes P (PROJECT), D (DEPARTMENT), and E (EMPLOYEE); node (1) joins P and D on P.DNUM = D.DNUMBER, node (2) joins the result with E on D.MGRSSN = E.SSN, and node (3) applies the final projection]

This corresponds to the following SQL query:

SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM   PROJECT P, DEPARTMENT D, EMPLOYEE E
WHERE  P.DNUM = D.DNUMBER AND D.MGRSSN = E.SSN AND P.PLOCATION = 'Stafford';
In Figure 5.6, the three relations PROJECT, DEPARTMENT, and EMPLOYEE are represented by leaf nodes P, D, and E, while the relational algebra internal tree nodes represent the operations of the expression. When this query tree is executed, the node marked (1) in Figure 5.4(a) must begin execution before node (2), because some resulting tuples of operation (1) must be available before we can begin executing operation (2). Similarly, node (2) must begin executing and producing results before node (3) can start execution, and so on.
As we can see, the query tree represents a specific order of operations for executing a query. A more neutral
representation of a query is the query graph notation. Figure 5.5 shows the query graph for query Q2.
Relations in the query are represented by relation nodes, which are displayed as single circles. Constant
values, typically from the query selection conditions are represented by the constant nodes, which are
displayed as double circles. Selection and join conditions are represented by the graph edges, as shown in
Fig. 5.5. Finally, the attributes to be retrieved from each relation are displayed in square brackets above
each relation.
The query graph representation does not indicate an order in which the operations are performed. There is only a single graph corresponding to each query. Although some optimization techniques were based on query graphs, it is now generally accepted that query trees are preferable because, in practice, the query optimizer needs to show the order of operations for query execution, which is not possible in query graphs.
Heuristic Optimization of Query Trees
In general, many different relational algebra expressions, and hence many different query trees, can be equivalent; that is, they can correspond to the same query. The query parser will typically generate a standard initial query tree to correspond to an SQL query, without doing any optimization, as in Figure 5.4(b). The CARTESIAN PRODUCT of the relations specified in the FROM clause is first applied; then the selection and join conditions of the WHERE clause are applied, followed by the projection on the SELECT clause attributes. Such a canonical query tree represents a relational algebra expression that is very inefficient if executed directly, because of the CARTESIAN PRODUCT (×) operations. For example, if the PROJECT, DEPARTMENT, and EMPLOYEE relations had record sizes of 100, 50, and 150 bytes and contained 100, 20, and 5,000 tuples, respectively, the result of the CARTESIAN PRODUCT would contain 10 million tuples of record size 300 bytes each. It is now the job of the heuristic query optimizer to transform this initial query tree into a final query tree that is efficient to execute.
The optimizer must include rules for equivalence among relational algebra expressions that can be applied
to the initial tree. The heuristic query optimization rules then utilize these equivalence expressions to
transform the initial tree into the final, optimized query tree. We discuss general transformation rules and
show how they may be used in an algebraic heuristic optimizer.
Example of Transforming a Query. Consider the following query Q on the following database:
[Sample COMPANY database instance: EMPLOYEE, WORKS_ON, and PROJECT relations with sample tuples]
"Find the last names of employees born after 1957 who work on a project named 'Aquarius'." This query can be specified in SQL as follows:
Q:  SELECT LNAME
    FROM   EMPLOYEE, WORKS_ON, PROJECT
    WHERE  PNAME = 'Aquarius' AND PNUMBER = PNO AND ESSN = SSN
           AND BDATE > '1957-12-31';
The initial query tree for Q is shown in Figure 5.4(a). Executing this tree directly first creates a very large
file containing the CARTESIAN PRODUCT of the entire EMPLOYEE, WORKS_ON, and PROJECT
files. However, this query needs only one record from the PROJECT relation, for the 'Aquarius' project, and only the EMPLOYEE records for those employees whose date of birth is after '1957-12-31'. Figure 5.4(b)
shows an improved query tree that first applies the SELECT operations to reduce the number of tuples that
appear in the CARTESIAN PRODUCT.
A further improvement is achieved by switching the positions of the EMPLOYEE and PROJECT relations
in the tree, as shown in Figure 5.6. This uses the information that PNUMBER is a key attribute of the
project relation, and hence the SELECT operation on the PROJECT relation will retrieve a single record
only. We can further improve the query tree by replacing any CARTESIAN PRODUCT operation that is followed by a join condition with a JOIN operation, as shown in Figure 5.6. Another improvement is to
keep only the attributes needed by subsequent operations in the intermediate relation, by including
PROJECT (Π) operations as early as possible in the query tree, as shown in Figure 5.7. This reduces the
attributes (columns) of the intermediate relations, whereas the SELECT operations reduce the number of
tuples (records).
As the preceding example demonstrates, a query tree can be transformed step by step into another query
tree that is more efficient to execute. However, we must sure that the transformation steps always lead to an
equivalent query tree. To do this, the query optimizer must know which transformation rules preserve this
equivalence. We discuss some of these transformation rules next.
Figure 5.4(a): Initial query tree for Q: Π LNAME over σ PNAME=‘Aquarius’ AND PNUMBER=PNO AND ESSN=SSN AND BDATE>‘1957-12-31’, applied to the Cartesian product (EMPLOYEE × WORKS_ON) × PROJECT.
Figures 5.4(b) and 5.5: Improved query trees. In Figure 5.4(b) the SELECT operations σ BDATE>‘1957-12-31’ (on EMPLOYEE) and σ PNAME=‘Aquarius’ (on PROJECT) have been moved down to the leaf relations; in Figure 5.5 the leaf nodes have been rearranged so that the single-record selection on PROJECT is executed first, below the join conditions PNUMBER=PNO and ESSN=SSN.
Figures 5.6 and 5.7: In Figure 5.6, each Cartesian product followed by a join condition is replaced by a JOIN operation (⋈ PNUMBER=PNO and ⋈ ESSN=SSN); in Figure 5.7, PROJECT (Π) operations are added as early as possible so that the intermediate relations keep only the attributes needed by subsequent operations.
General Transformation Rules for Relational Algebra Operations
There are many rules for transforming relational algebra operations into equivalent ones. Here
we are
interested in the meaning of the operations and the resulting relations. Hence, if two relations have the same
set of attributes in a different order but the two relations represent the same information, we consider the
relations equivalent. We now state some transformation rules that are useful in query optimization, without
proving them:
1. Cascade of σ: A conjunctive selection condition can be broken up into a cascade (that is, a sequence) of individual σ operations:
σc1 AND c2 AND … AND cn(R) = σc1(σc2(…(σcn(R))…))
2. Commutativity of σ: The σ operation is commutative:
σc1(σc2(R)) = σc2(σc1(R))
3. Cascade of Π: In a cascade (sequence) of Π operations, all but the last one can be ignored:
ΠList1(ΠList2(…(ΠListn(R))…)) = ΠList1(R)
4. Commuting σ with Π: If the selection condition c involves only those attributes A1, …, An in the projection list, the two operations can be commuted:
ΠA1, A2, …, An(σc(R)) = σc(ΠA1, A2, …, An(R))
5. Commutativity of ⋈ (and ×): The ⋈ operation is commutative, as is the × operation:
R ⋈c S = S ⋈c R
R × S = S × R
Notice that, although the order of attributes may not be the same in the relations resulting from the two joins (or two Cartesian products), the “meaning” is the same because the order of attributes is not important in the alternative definition of a relation.
6. Commuting σ with ⋈ (or ×): If all the attributes in the selection condition c involve only the attributes of one of the relations being joined, say R, the two operations can be commuted as follows:
σc(R ⋈ S) = (σc(R)) ⋈ S
Alternatively, if the selection condition c can be written as (c1 AND c2), where condition c1 involves only the attributes of R and condition c2 involves only the attributes of S, the operations commute as follows:
σc(R ⋈ S) = (σc1(R)) ⋈ (σc2(S))
The same rules apply if the ⋈ is replaced by a × operation.
7. Commuting Π with ⋈ (or ×): Suppose that the projection list is L = {A1, …, An, B1, …, Bm}, where A1, …, An are attributes of R and B1, …, Bm are attributes of S. If the join condition c involves only attributes in L, the two operations can be commuted as follows:
ΠL(R ⋈c S) = (ΠA1, …, An(R)) ⋈c (ΠB1, …, Bm(S))
If the join condition c contains additional attributes not in L, these must be added to the projection list, and a final Π operation is needed. For example, if attributes An+1, …, An+k of R and Bm+1, …, Bm+p of S are involved in the join condition c but are not in the projection list L, the operations commute as follows:
ΠL(R ⋈c S) = ΠL((ΠA1, …, An, An+1, …, An+k(R)) ⋈c (ΠB1, …, Bm, Bm+1, …, Bm+p(S)))
For ×, there is no condition c, so the first transformation rule always applies by replacing ⋈c with ×.
8. Commutativity of set operations: The set operations ∪ and ∩ are commutative, but − is not.
9. Associativity of ⋈, ×, ∪, and ∩: These four operations are individually associative; that is, if θ stands for any one of these four operations (throughout the expression), we have:
(R θ S) θ T = R θ (S θ T)
10. Commuting σ with set operations: The σ operation commutes with ∪, ∩, and −. If θ stands for any one of these three operations (throughout the expression), we have:
σc(R θ S) = (σc(R)) θ (σc(S))
11. The Π operation commutes with ∪:
ΠL(R ∪ S) = (ΠL(R)) ∪ (ΠL(S))
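The rules above can be checked mechanically on small relations. The following Python sketch, with made-up data, verifies Rules 1, 2, and 4:

```python
# A sketch (data assumed) that checks Rules 1, 2, and 4 on a tiny
# relation represented as a list of dictionaries.
R = [{'A': 1, 'B': 10}, {'A': 2, 'B': 20}, {'A': 3, 'B': 30}]

def select(cond, rel):                  # the sigma operation
    return [t for t in rel if cond(t)]

def project(attrs, rel):                # the pi operation (removes duplicates)
    out = []
    for t in rel:
        row = {a: t[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out

c1 = lambda t: t['A'] > 1
c2 = lambda t: t['B'] < 30

# Rule 1: sigma_{c1 AND c2}(R) = sigma_{c1}(sigma_{c2}(R))
rule1 = select(lambda t: c1(t) and c2(t), R) == select(c1, select(c2, R))
# Rule 2: sigma_{c1}(sigma_{c2}(R)) = sigma_{c2}(sigma_{c1}(R))
rule2 = select(c1, select(c2, R)) == select(c2, select(c1, R))
# Rule 4: c1 involves only attribute A, so sigma and pi_{A} commute
rule4 = project(['A'], select(c1, R)) == select(c1, project(['A'], R))
print(rule1, rule2, rule4)              # True True True
```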
Outline of a Heuristic Algebraic Optimization Algorithm
We can now outline the steps of an algorithm that utilizes some of the above rules to transform an initial query tree into an optimized tree that is more efficient to execute (in most cases). The algorithm will lead to transformations similar to those discussed in our example of Figure 5.4. The steps of the algorithm are as follows:
1. Using Rule 1, break up any SELECT operations with conjunctive conditions into a cascade of SELECT operations. This permits a greater degree of freedom in moving SELECT operations down different branches of the tree.
2. Using Rules 2, 4, 6, and 10 concerning the commutativity of SELECT with other operations, move each SELECT operation as far down the query tree as is permitted by the attributes involved in the select condition.
3. Using Rules 5 and 9 concerning the commutativity and associativity of binary operations, rearrange the leaf nodes of the tree using the following criteria. First, position the leaf node relations with the most restrictive SELECT operations so that they are executed first in the query tree representation. The definition of most restrictive SELECT can mean either the one that produces a relation with the fewest tuples or the one that produces a relation with the smallest absolute size. Another possibility is to define the most restrictive SELECT as the one with the smallest selectivity; this is more practical because estimates of selectivity are often available in the DBMS catalog. Second, make sure that the ordering of leaf nodes does not cause CARTESIAN PRODUCT operations. For example, if the two relations with the most restrictive SELECT do not have a direct join condition between them, it may be desirable to change the order of leaf nodes to avoid Cartesian products.
4. Using Rules 3, 4, 7, and 11 concerning the cascading of PROJECT and the commuting of PROJECT with other operations, break down and move lists of projection attributes down the tree as far as possible by creating new PROJECT operations as needed. Only those attributes needed in the query result and in subsequent operations in the query tree should be kept after each PROJECT operation.
5. Identify subtrees that represent groups of operations that can be executed by a single algorithm.
In our example, Figure 5.4(b) shows the tree of Figure 5.4(a) after applying Steps 1 and 2 of the algorithm; Figure 5.5 shows the tree after Step 3; Figure 5.6 shows the tree after Cartesian products followed by join conditions have been replaced by JOIN operations; and Figure 5.7 shows the tree after Step 4. In Step 5, we may group together the operations in the subtree whose root is the operation Π ESSN into a single algorithm. We may also group the remaining operations into another subtree, where the tuples resulting from the first algorithm replace the subtree whose root is the operation Π ESSN, because the first grouping means that this subtree is executed first.
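As an illustration of Step 2, the sketch below pushes a SELECT whose condition mentions only one relation's attributes below a Cartesian product (transformation Rule 6); the tree encoding and the attribute map are assumptions made for this example:

```python
# Which relation owns which attributes (assumed for this sketch).
attrs = {'EMPLOYEE': {'SSN', 'BDATE'}, 'PROJECT': {'PNUMBER', 'PNAME'}}

def push_select(tree):
    """Rewrite SELECT(cond_attrs, X(left, right)) when the selection
    condition touches only one side of the Cartesian product (Rule 6)."""
    if tree[0] == 'SELECT' and tree[2][0] == 'X':
        cond_attrs, (_, left, right) = tree[1], tree[2]
        if cond_attrs <= attrs[left]:
            return ('X', ('SELECT', cond_attrs, left), right)
        if cond_attrs <= attrs[right]:
            return ('X', left, ('SELECT', cond_attrs, right))
    return tree

# sigma_{PNAME=...}(EMPLOYEE x PROJECT) -> EMPLOYEE x sigma_{PNAME=...}(PROJECT)
before = ('SELECT', {'PNAME'}, ('X', 'EMPLOYEE', 'PROJECT'))
after = push_select(before)
print(after)   # ('X', 'EMPLOYEE', ('SELECT', {'PNAME'}, 'PROJECT'))
```

A real optimizer applies such rewrites repeatedly over the whole tree; this sketch shows a single application at the root.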
Summary of Heuristics for Algebraic Optimization
We now summarize the basic heuristics for algebraic optimization. The main heuristic is to apply first the operations that reduce the size of intermediate results. This includes performing SELECT operations as early as possible to reduce the number of tuples, and moving PROJECT operations as far down the tree as possible to reduce the number of attributes. In addition, the SELECT and JOIN operations that are most restrictive, that is, those that result in relations with the fewest tuples or with the smallest absolute size, should be executed before other similar operations. This is done by reordering the leaf nodes of the tree among themselves while avoiding Cartesian products, and adjusting the rest of the tree appropriately.
Student Activity 5.2
Before reading the next section, answer the following questions:
1.
Why do we use Heuristics in Query Optimization?
2.
Write about the Notation for Query trees and Query graphs.
3.
Perform a heuristic optimization of a query tree.
4.
What are the General Transformation Rules for Relational operations?
If your answers are correct, then proceed to the next section.
Converting Query Trees into Query Execution Plans
An execution plan for a relational algebra expression represented as a query tree includes information about the access methods available for each relation as well as the algorithms to be used in computing the relational operators represented in the tree. As a simple example, consider query Q1 from Block 2, whose corresponding relational algebra expression is
Π FNAME, LNAME, ADDRESS (σ DNAME=‘RESEARCH’ (DEPARTMENT) ⋈ DNUMBER=DNO EMPLOYEE)
The query tree is shown in Figure 5.5. To convert this into an execution plan, the optimizer might choose an index search for the SELECT operation (assuming one exists), a table scan as the access method for EMPLOYEE, a nested-loop join algorithm for the join, and a scan of the JOIN result for the PROJECT operator. In addition, the approach taken for executing the query may specify a materialized or a pipelined evaluation.
With materialized evaluation, the result of an operation is stored as a temporary relation (that is, the result is physically materialized). For instance, the join operation can be computed and the entire result stored as a temporary relation, which is then read as input by the algorithm that computes the PROJECT operation, which would produce the query result table. On the other hand, with pipelined evaluation, as the resulting tuples of an operation are produced, they are forwarded directly to the next operation in the query sequence. For example, as the selected tuples from DEPARTMENT are produced by the SELECT operation, they are placed in a buffer; the JOIN operation algorithm would then consume the tuples from the buffer, and those tuples that result from the JOIN operation are pipelined to the projection operation algorithm. The advantage of pipelining is the cost saving in not having to write the intermediate results to disk and not having to read them back for the next operation.
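The contrast can be sketched with Python generators, where a generator plays the role of the pipelined operator; the toy DEPARTMENT-like relation below is an assumption:

```python
# Toy relation of (name, size) pairs, data assumed for illustration.
rows = [('Research', 5), ('Sales', 3), ('HQ', 1)]

# Materialized: the SELECT result is fully stored before PROJECT reads it.
temp = [r for r in rows if r[1] > 2]        # temporary relation ("on disk")
materialized = [name for name, _ in temp]   # PROJECT reads the temp relation

# Pipelined: each selected tuple is forwarded directly to the next operator;
# the generator never builds the intermediate relation.
def select_gen(rel):
    for r in rel:
        if r[1] > 2:
            yield r

pipelined = [name for name, _ in select_gen(rows)]
print(materialized == pipelined)            # True: same result, no temp relation
```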
Student Activity 5.3
Answer the following questions:
1.
How can we convert query trees into query execution plans?
2.
Write some queries and construct query trees for them.
Security Specification in SQL
The SQL data definition language includes commands to grant and revoke privileges. The SQL standard includes delete, insert, select, and update privileges. The select privilege corresponds to the read privilege. SQL also includes a references privilege that restricts a user’s ability to declare foreign keys when creating relations. If the relation to be created includes a foreign key that references attributes of another relation, the user must have been granted the references privilege on those attributes. The reason that the references privilege is a useful feature is somewhat subtle, and is explained later in this section.
The grant statement is used to confer authorization. The basic form of this statement is as follows:
Grant <privileges list> on <relation name or view name> to <user list>
The privilege list allows the granting of several privileges in one command. The following grant statement grants users U1, U2, and U3 select authorization on the branch relation:
Grant select on branch to U1, U2, U3
The update authorization may be given either on all attributes of the relation or on only some. If update authorization is included in a grant statement, the list of attributes on which update authorization is to be granted optionally appears in parentheses immediately after the update keyword. If the list of attributes is omitted, the update privilege is granted on all attributes of the relation. The following grant statement gives users U1, U2, and U3 update authorization on the amount attribute of the loan relation:
Grant update (amount) on loan to U1, U2, U3
156
DATABASE SYSTEMS
In SQL-92, the insert privilege may also specify a list of attributes: any insert to the relation must specify only these attributes, and each of the remaining attributes is either given a default value (if a default is defined for the attribute) or set to null.
The SQL references privilege is granted on specific attributes in a manner similar to that shown for the update privilege. The following grant statement allows user U1 to create relations that reference the key branch-name of the branch relation as a foreign key:
Grant references (branch-name) on branch to U1
Initially, it may appear that there is no reason ever to prevent users from creating foreign keys referencing another relation. However, recall from Chapter 6 that foreign key constraints restrict deletion and update operations on the referenced relation. In the preceding example, if U1 creates a foreign key in a relation r referencing the branch-name attribute of the branch relation, and then inserts a tuple into r pertaining to the Perryridge branch, it is no longer possible to delete the Perryridge branch from the branch relation without also modifying relation r. Thus, the definition of a foreign key by U1 restricts future activity by other users; therefore, there is a need for the references privilege.
The keyword all privileges can be used as a short form for all the allowable privileges. Similarly, the user name public refers to all current and future users of the system. SQL-92 also includes a usage privilege that authorizes a user to use a specified domain (recall that a domain corresponds to the programming-language notion of a type, and may be user defined).
By default, a user who is granted a privilege in SQL is not authorized to grant that privilege to another user. If we wish to grant a privilege and to allow the recipient to pass the privilege on to other users, we append the with grant option clause to the appropriate grant command. For example, if we wish to allow U1 the select privilege on branch and to allow U1 to grant this privilege to others, we write:
Grant select on branch to U1 with grant option
To revoke an authorization, we use the revoke statement. It takes a form almost identical to that of grant:
Revoke <privilege list> on <relation name or view name> from <user list> [restrict | cascade]
Thus, to revoke the privileges that were granted previously, we write:
Revoke select on branch from U1, U2, U3 cascade
Revoke update (amount) on loan from U1, U2, U3
Revoke references (branch-name) on branch from U1
The revocation of a privilege from a user may cause other users also to lose that privilege. This behavior is called cascading of the revoke. The revoke statement may also specify restrict:
Revoke select on branch from U1, U2, U3 restrict
In this case, an error is returned if there are any cascading revokes, and the revoke action is not carried out. The following revoke statement revokes only the grant option, rather than the actual select privilege:
Revoke grant option for select on branch from U1
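The cascading behaviour can be sketched with a small in-memory model; this is illustrative only, since real systems track a full authorization graph per privilege:

```python
# Model a single privilege (say, select on branch). For each user we record
# who granted the privilege to them; grants made "with grant option" let a
# user appear as a grantor themselves.
grants = {}          # e.g. {'U2': 'U1'} means U1 granted the privilege to U2

def grant(grantor, grantee):
    grants[grantee] = grantor

def revoke_cascade(user):
    """Revoke from `user` and, transitively, from everyone `user` granted to."""
    for grantee, grantor in list(grants.items()):
        if grantor == user:
            revoke_cascade(grantee)
    grants.pop(user, None)

grant('DBA', 'U1')   # U1 received the privilege with grant option
grant('U1', 'U2')    # U1 passed it on to U2
grant('U2', 'U3')    # U2 passed it on to U3

revoke_cascade('U1')
print(sorted(grants))   # [] -- U2 and U3 lose the privilege as well
```

With restrict semantics, the revoke would instead be rejected as soon as any dependent grant (here, U1's grant to U2) is found.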
The SQL-92 standard specifies a primitive authorization mechanism for the database schema: only the owner of the schema can carry out any modification to the schema. Thus, schema modifications, such as creating or deleting relations, adding or dropping attributes of relations, and adding or dropping indices, may be executed only by the owner of the schema. Several database implementations have more powerful authorization mechanisms for database schemas, similar to those discussed earlier, but these mechanisms are non-standard.
Student Activity 5.4
Before reading the next section, answer the following questions.
1.
What do you understand by security specification SQL?
2.
What do you understand by authorization revocation, and how will you perform it?
If your answers are correct, then proceed to the next section.
The various provisions that a database system may make for authorization may not provide sufficient protection for highly sensitive data. In such cases, data may be encrypted. It is not possible for encrypted data to be read unless the reader knows how to decipher (decrypt) them.
There are a vast number of techniques for the encryption of data. Simple encryption techniques may not provide adequate security, since it may be easy for an unauthorized user to break the code. As an example of a weak encryption technique, consider the substitution of each character with the next character in the alphabet. Thus, “Perryridge” becomes “Qfsszsjehf”.
If an unauthorized user sees only “Qfsszsjehf”, she probably has insufficient information to break the code. However, if the intruder sees a large number of encrypted branch names, she could use statistical data regarding the relative frequency of characters (for example, e is more common than f) to guess what substitution is being made.
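The weak scheme just described is easy to state in code. The function below, an illustrative sketch that assumes letters-only input, shifts each letter to the next one in the alphabet:

```python
def shift_encrypt(plaintext):
    """Substitute each letter with the next letter in the alphabet,
    wrapping z back to a (a trivially breakable scheme)."""
    out = []
    for ch in plaintext:
        base = ord('A') if ch.isupper() else ord('a')
        out.append(chr(base + (ord(ch) - base + 1) % 26))
    return ''.join(out)

print(shift_encrypt('Perryridge'))   # Qfsszsjehf
```

Because the mapping is fixed, letter-frequency statistics of the ciphertext match those of English shifted by one position, which is exactly what makes the scheme breakable.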
A good encryption technique has the following properties:
It is relatively simple for authorized users to encrypt and decrypt data.
The encryption scheme depends not on the secrecy of the algorithm, but rather on a parameter of the algorithm called the encryption key.
It is extremely difficult for an intruder to determine the encryption key.
A computer system, like any other mechanical or electrical device, is subject to failure. There is a variety of causes of such failure, including disk crash, power failure, software error, a fire in the machine room, or even sabotage. In each of these cases, information may be lost. An integral part of a database system is a recovery scheme that is responsible for the restoration of the database to the consistent state that existed prior to the occurrence of the failure.
Failure Classification
There are various types of failure that may occur in a system, each of which needs to be dealt with in a different manner. The simplest type of failure to deal with is one that does not result in the loss of information in the system. The failures that are more difficult to deal with are those that result in loss of information. Here, we shall consider only the following types of failures:
Transaction failure: There are two types of errors that may cause a transaction to fail:
•
Logical error: The transaction can no longer continue with its normal execution, owing to some internal condition, such as bad input, data not found, overflow, or resource limit exceeded.
•
System error: The system has entered an undesirable state (for example, deadlock), as a result of which a transaction cannot continue with its normal execution. The transaction, however, can be re-executed at a later time.
System crash: There is a hardware malfunction, or a bug in the database software or the operating system, that causes the loss of the content of volatile storage, and brings transaction processing to a halt. The content of nonvolatile storage remains intact and is not corrupted.
The assumption that hardware errors and bugs in the software bring the system to a halt, but do not corrupt the nonvolatile storage contents, is known as the fail-stop assumption. Well-designed systems have numerous internal checks, at the hardware and software level, which bring the system to a halt when there is an error. Hence, the fail-stop assumption is a reasonable one.
Disk failure: A disk loses its content as a result of either a head crash or a failure during a data transfer operation. Copies of the data on other disks, or archival backups on tertiary media such as tapes, are used to recover from the failure.
The various data in the database may be stored and accessed in a number of different storage media. To understand how to ensure the atomicity and durability properties of a transaction, we must gain a better understanding of these storage media and their access methods.
Storage Types
There are various types of storage media; they are distinguished by their relative speed, capacity, and
resilience to failure.
Volatile Storage: Information residing in volatile storage does not usually survive system crashes.
Examples of such storage are main memory and cache memory. Access to volatile storage is extremely fast,
both because of the speed of the memory access itself, and because it is possible to access any data item in
volatile storage directly.
Nonvolatile storage: Information residing in nonvolatile storage survives system crashes. Examples of such storage are disks and magnetic tapes. Disks are used for online storage, whereas tapes are used for archival storage. Both, however, are subject to failure (for example, head crash), which may result in loss of information. At the current state of technology, nonvolatile storage is slower than volatile storage by several orders of magnitude. This distinction is the result of disk and tape devices being electromechanical, rather than based entirely on chips, as is volatile storage. Other nonvolatile media are normally used for backup data.
Stable storage: Information residing in stable storage is never lost. Although stable storage is theoretically impossible to obtain, it can be closely approximated by techniques that make data loss extremely unlikely.
The distinction among the various storage types is often less clear in practice than in our presentation. Certain systems provide battery backup, so that some main memory can survive system crashes and power failures. Alternative forms of nonvolatile storage, such as optical media, provide an even higher degree of reliability than disks do.
Stable-Storage Implementation
To implement stable storage, we need to replicate the needed information in several nonvolatile storage media (usually disks) with independent failure modes, and to update the information in a controlled manner to ensure that a failure during data transfer does not damage the needed information.
Data transfer between memory and disk storage can result in:
Successful completion: The transferred information arrived safely at its destination.
Partial failure: A failure occurred in the midst of the transfer, and the destination block has incorrect information.
Total failure: The failure occurred sufficiently early during the transfer that the destination block remains intact.
We require that, if a data-transfer failure occurs, the system detects it and invokes a recovery procedure to restore the block to a consistent state. To do so, the system must maintain two physical blocks for each logical database block; in the case of mirrored disks, both blocks are at the same location; in the case of remote backup, one of the blocks is local, whereas the other is at a remote site.
160
DATABASE SYSTEMS
An output operation is executed as follows:
1. Write the information onto the first physical block.
2. When the first write completes successfully, write the same information onto the second physical block.
3. The output is completed only after the second write completes successfully.
During recovery, each pair of physical blocks is examined. If both are the same and no detectable error exists, then no further action is necessary. If one block contains a detectable error, then we replace its content with the content of the other block. If both blocks contain no detectable error, but they differ in content, then we replace the content of the first block with the value of the second. This recovery procedure ensures that a write to stable storage either succeeds completely or results in no change.
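The output and recovery procedures can be sketched as follows; the CRC checksum here is an assumed stand-in for whatever error-detection code the disk provides:

```python
import zlib

def write_block(data):
    """Store a block together with its checksum."""
    return (data, zlib.crc32(data))

def ok(block):
    """The 'detectable error' test: does the checksum match?"""
    return block is not None and zlib.crc32(block[0]) == block[1]

def recover(pair):
    first, second = pair
    if ok(first) and ok(second):
        if first[0] != second[0]:
            pair[0] = second          # both good but differ: take the second
    elif ok(second):
        pair[0] = second              # first is damaged: repair it
    elif ok(first):
        pair[1] = first               # second is damaged: repair it
    return pair

# Simulate a crash during the second write: the first block holds the new
# value, while the second block was left corrupt (bad checksum).
pair = [write_block(b'new'), (b'garbage', 0)]
recover(pair)
print(pair[0][0], pair[1][0])         # both copies now agree
```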
The requirement of comparing every corresponding pair of blocks during recovery is expensive to meet. We can reduce the cost greatly by keeping track of block writes that are in progress, using a small amount of nonvolatile RAM.
The database system resides permanently on nonvolatile storage, and is partitioned into fixed-length storage units called blocks.
Data Access
Blocks are the units of data transfer to and from disk, and may contain several data items. We shall assume that no data item spans two or more blocks. This assumption is realistic for most data-processing applications, such as our banking example.
Transactions input information from the disk to main memory, and then output the information back onto the disk. The input and output operations are done in block units. The blocks residing on the disk are referred to as physical blocks; the blocks residing temporarily in main memory are referred to as buffer blocks. The area of memory where blocks reside temporarily is called the disk buffer.
Block movements between disk and main memory are initiated through the following two operations:
input(B) transfers the physical block B to main memory.
output(B) transfers the buffer block B to the disk and replaces the appropriate physical block there.
This scheme is illustrated in Figure 5.1.
Each transaction Ti has a private work area in which copies of all the data items accessed and updated by Ti
are kept. This work area is created when the transaction is initiated; it is removed when the transaction
either commits or aborts. Each data item x kept in the work area of transaction Ti is denoted by xi.
Transaction Ti interacts with the database system by transferring data to and from its work area to the
system buffer. We transfer data using the following two operations:
1. read(X) assigns the value of data item X to the local variable xi. This operation is executed as follows: if the block Bx on which X resides is not in main memory, then issue input(Bx); then assign to xi the value of X from the buffer block.
2. write(X) assigns the value of the local variable xi to data item X in the buffer block. This operation is executed as follows: if the block Bx on which X resides is not in main memory, then issue input(Bx); then assign the value of xi to X in buffer Bx.
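A minimal sketch of these operations, with dictionaries standing in for the disk, the disk buffer, and the transaction's private work area (all names here are illustrative):

```python
disk = {'Bx': {'X': 100}}        # physical blocks on disk
buffer = {}                      # buffer blocks in main memory
local = {}                       # transaction Ti's private work area (xi)

def input_block(b):
    buffer[b] = dict(disk[b])    # input(B): physical block -> buffer block

def read(x, b):
    if b not in buffer:          # block not yet in main memory
        input_block(b)
    local[x] = buffer[b][x]      # assign value of X to local variable xi

def write(x, b):
    if b not in buffer:
        input_block(b)
    buffer[b][x] = local[x]      # assign xi to X in the buffer block only

def output(b):
    disk[b] = dict(buffer[b])    # output(B): buffer block -> disk

read('X', 'Bx')
local['X'] -= 50                 # update the private copy
write('X', 'Bx')
print(disk['Bx']['X'], buffer['Bx']['X'])   # 100 50: disk unchanged until output(B)
```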
Note that both operations may require the transfer of a block from disk to main memory. They do not, however, specifically require the transfer of a block from main memory to disk.
A buffer block is eventually written out to the disk either because the buffer manager needs the memory
space for other purposes or because the database system wishes to reflect the change to B on the disk. We
shall say that the database system force-outputs buffer B if it issues an output(B).
When a transaction needs to access a data item X for the first time, it must execute read(X); after it has updated X for the last time, it must execute write(X) to reflect the change to X in the database itself.
The output(Bx) operation for the buffer block Bx on which X resides does not need to take effect immediately after write(X) is executed, since the block Bx may contain other data items that are still being accessed. Thus, the actual output takes place later. Notice that, if the system crashes after the write(X) operation was executed but before output(Bx) was executed, the new value of X is never written to disk and, thus, is lost.
Recovery and Atomicity
Consider again our simplified banking system and a transaction Ti that transfers 50 dollars from account A to account B, with the initial values of A and B being 1000 and 2000 dollars, respectively. Suppose that a system crash has occurred during the execution of Ti, after output(BA) has taken place, but before output(BB) was executed, where BA and BB denote the buffer blocks on which A and B reside.
Since the memory contents were lost, we do not know the fate of the transaction; thus, we could invoke one of two possible recovery procedures:
Re-execute Ti: This procedure results in the value of A becoming 900 dollars rather than 950. Thus, the system enters an inconsistent state.
Do not re-execute Ti: The current system has values of 950 and 2000 dollars for A and B, respectively. Thus, the system enters an inconsistent state.
In either case, the database is left in an inconsistent state, and thus this simple recovery scheme does not work. The reason for this difficulty is that we have modified the database without having the assurance that the transaction will indeed commit. Our goal is to perform either all or none of the database modifications made by Ti. However, if Ti performed multiple database modifications, several output operations may be required, and a failure may occur before all of them are made. To achieve our goal of atomicity, we must first output information describing the modifications to stable storage, without modifying the database itself. As we shall see, this procedure will allow us to output all the modifications made by a committed transaction, despite failures.
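The inconsistency can be reproduced in a few lines; the crash flag and helper below are assumptions made for this sketch:

```python
def run_transfer(disk, crash_after_first_output):
    """Transfer 50 from A to B, optionally crashing between the two outputs."""
    A, B = disk['A'], disk['B']      # read both items into the work area
    A, B = A - 50, B + 50            # the transfer itself
    disk['A'] = A                    # output(BA)
    if crash_after_first_output:
        return                       # system crash: output(BB) never happens
    disk['B'] = B                    # output(BB)

disk = {'A': 1000, 'B': 2000}
run_transfer(disk, crash_after_first_output=True)
after_crash = dict(disk)             # A lost 50 but B gained nothing

run_transfer(disk, crash_after_first_output=False)   # naive re-execution
print(after_crash, disk)             # inconsistent either way
```

After the crash the state is {A: 950, B: 2000}; naively re-executing yields {A: 900, B: 2050}, so A has lost 100 instead of 50. Neither choice restores consistency, which is why the modifications must first be described on stable storage (a log) before the database itself is changed.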
Student Activity 5.5
Before reading the next section, answer the following questions.
1.
What do you understand by Encryption?
2.
Write a short note on the following: Recovery and Atomicity.
If your answers are correct, then proceed to the next section.
Concurrency
Several problems can occur when concurrent transactions execute in an uncontrolled manner. Some of these problems are discussed here.
Consider an airline reservation database in which a record is stored for each airline flight. Each record includes the number of reserved seats on that flight as a named data item, among other information. Let T1 be a transaction that transfers N reservations from one flight, whose number of reserved seats is stored in the database item named X, to another flight, whose number of reserved seats is stored in the database item named Y. Let T2 be another transaction that just reserves M seats on the first flight (X) referenced in transaction T1.
T1: read_item(X); X := X - N; write_item(X); read_item(Y); Y := Y + N; write_item(Y);
T2: read_item(X); X := X + M; write_item(X);
We now discuss the types of problems we may encounter with these two transactions if they run
concurrently.
The Lost Update Problem
This problem occurs when two transactions that access the same database items have their operations interleaved in a way that makes the values of some database items incorrect. Suppose that transactions T1 and T2 are submitted at approximately the same time, and suppose that their operations are interleaved as shown in Figure 5.12; then the final value of item X is incorrect, because T2 reads the value of X before T1 changes it in the database, and hence the updated value resulting from T1 is lost. For example, if X = 80 at the start (originally there were 80 reservations on the flight), N = 5 (T1 transfers 5 seat reservations from the flight corresponding to X to the flight corresponding to Y), and M = 4 (T2 reserves 4 seats on X), the final result should be X = 79; but in the interleaving of operations shown in Figure 5.12 it is X = 84, because the update in T1 that removed the five seats from X was lost.
Figure 5.12: The lost update problem. Time runs downward; T1 and T2 interleave as follows:
T1: read_item(X); X := X - N;
T2: read_item(X); X := X + M;
T1: write_item(X); read_item(Y);
T2: write_item(X);
T1: Y := Y + N; write_item(Y);
Item X has an incorrect value because its update by T1 is “lost” (overwritten by T2).
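The lost-update schedule can be replayed directly with X = 80, N = 5, M = 4 as in the text (Y's initial value is an assumption; it does not affect X):

```python
db = {'X': 80, 'Y': 40}              # Y's initial value assumed
N, M = 5, 4

x1 = db['X']; x1 -= N                # T1: read_item(X); X := X - N
x2 = db['X']; x2 += M                # T2: read_item(X); reads the OLD value of X
db['X'] = x1                         # T1: write_item(X)
y1 = db['Y']                         # T1: read_item(Y)
db['X'] = x2                         # T2: write_item(X) -- overwrites T1's update
y1 += N; db['Y'] = y1                # T1: Y := Y + N; write_item(Y)

print(db['X'])                       # 84, although a serial execution gives 79
```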
The Temporary Update (Dirty Read) Problem
This problem occurs when one transaction updates a database item and then the transaction fails for some reason. The updated item is accessed by another transaction before it is changed back to its original value. Figure 5.13 shows an example where T1 updates item X and then fails before completion, so the system must change X back to its original value. Before it can do so, transaction T2 reads the “temporary” value of X, which will not be recorded permanently in the database because of the failure of T1. The value of item X that is read by T2 is called dirty data, because it has been created by a transaction that has not completed and committed yet; hence, this problem is also known as the dirty read problem.
Time runs downward; T1 and T2 interleave as follows:
T1: read_item(X); X := X - N; write_item(X);
T2: read_item(X); X := X + M; write_item(X);
T1: read_item(Y); [T1 fails]
Transaction T1 fails and must change the value of X back to its old value; meanwhile, T2 has read the “temporary” incorrect value of X.
Figure 5.13: The temporary update problem.
The Incorrect Summary Problem
If one transaction is calculating an aggregate summary function on a number of records while other transactions are updating some of these records, then the aggregate function may calculate some values before they are updated and others after they are updated. For example, suppose that transaction T3 is calculating the total number of reservations on all the flights; meanwhile, transaction T1 is executing. If the interleaving of operations shown in Figure 5.14 occurs, the result of T3 will be off by an amount N, because T3 reads the value of X after N seats have been subtracted from it but reads the value of Y before those N seats have been added to it.
T1                              T3
                                sum := 0;
                                read-item(A);
                                sum := sum + A;
read-item(X);
X := X - N;
write-item(X);
                                read-item(X);
                                sum := sum + X;
                                read-item(Y);
                                sum := sum + Y;
read-item(Y);
Y := Y + N;
write-item(Y);

(T3 reads X after N is subtracted and reads Y before N is added; a wrong summary is the result, off by N.)

Figure 5.14 The incorrect summary problem

Student Activity 5.6
Before reading the next section, answer the following questions:
164
DATABASE SYSTEMS
1. Make a comparison between single-user and multi-user systems.
2. Why is concurrency control needed?
3. What do you understand by the dirty read problem?
If your answers are correct, then proceed to the next section.
Locking Techniques for Concurrency Control
Some of the main techniques used to control concurrent execution of transactions are based on the concept
of locking data items. A lock is a variable associated with a data item that describes the status of the item
with respect to possible operations that can be applied to it. Generally, there is one lock for each data item
in the database. Locks are used as a means of synchronizing the access by concurrent transactions to the
database items.
Binary Locks
A binary lock can have two states or values: locked and unlocked (or 1 and 0 for simplicity). A distinct
lock is associated with each database item X. If the value of the lock on X is 1, item X cannot be accessed
by the database operation that requests the item. If the value of the lock on X is 0, the item can be accessed
when requested. We refer to the current value of the lock associated with item X as Lock (x).
Two operations, lock-item and unlock-item, are used with binary locking. A transaction requests access to
an item X by first issuing a lock-item(X) operation. If Lock(X) = 1, the transaction is forced to wait. If
Lock(X) = 0, it is set to 1 (the transaction locks the item) and the transaction is allowed to access item X.
When the transaction is through using the item, it issues an unlock-item(X) operation, which sets Lock(X)
to 0 (unlocks the item) so that X may be accessed by other transactions. Hence, a binary lock enforces
mutual exclusion on the data item. A description of the lock-item(X) and unlock-item(X) operations is
shown below.
lock-item(X):
B:  if Lock(X) = 0                  (* item is unlocked *)
        then Lock(X) ← 1            (* lock the item *)
    else begin
        wait (until Lock(X) = 0 and the lock manager wakes up the transaction);
        go to B
    end;

unlock-item(X):
    Lock(X) ← 0;                    (* unlock the item *)
    if any transactions are waiting,
        then wake up one of the waiting transactions;
If the simple binary locking scheme described here is used, every transaction must obey the following rules.
1. A transaction T must issue the operation lock-item(X) before any read-item(X) or write-item(X) operations are performed in T.
2. A transaction T must issue the operation unlock-item(X) after all read-item(X) and write-item(X) operations are completed in T.
3. A transaction T will not issue a lock-item(X) operation if it already holds the lock on item X.
4. A transaction T will not issue an unlock-item(X) operation unless it already holds the lock on item X.
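The lock-item/unlock-item pseudocode above can be sketched in Python, with a condition variable playing the role of the lock manager's wait and wakeup. The class name and one-lock-per-item granularity are illustrative assumptions, not part of any real DBMS API.

```python
import threading

class BinaryLock:
    """A sketch of a binary lock: Lock(X) is 0 (unlocked) or 1 (locked)."""
    def __init__(self):
        self._state = 0                      # 0 = unlocked, 1 = locked
        self._cond = threading.Condition()   # stands in for the lock manager

    def lock_item(self):
        with self._cond:
            while self._state == 1:          # item is locked: wait to be woken
                self._cond.wait()
            self._state = 1                  # lock the item

    def unlock_item(self):
        with self._cond:
            self._state = 0                  # unlock the item
            self._cond.notify()              # wake up one waiting transaction

# A transaction T locks item X before any read/write, and unlocks it after.
lock_x = BinaryLock()
lock_x.lock_item()
lock_x.unlock_item()
```

Because the lock has only two states, it enforces mutual exclusion even between two readers, which motivates the shared/exclusive scheme described next in the text.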
Shared/Exclusive (Read/Write) Locks
The binary locking scheme is too restrictive for database items, because at most one transaction can hold a
lock on a given item. We should allow several transactions to access the same item X if they all access X
for reading purposes only. However, if a transaction is to write an item X, it must have exclusive access to
X. For this purpose, a different type of lock, called a multiple-mode lock, is used. In this scheme, called
shared/exclusive or read/write locks, there are three locking operations: read-lock(X), write-lock(X), and
unlock(X). A lock associated with an item X, Lock(X), now has three possible states: "read-locked", "write-locked"
or "unlocked". A read-locked item is also called share-locked, because other transactions are
allowed to read the item, whereas a write-locked item is called exclusive-locked, because a single
transaction exclusively holds the lock on the item. When we use the shared/exclusive locking scheme, the
system must enforce the following rules.
1. A transaction T must issue the operation read-lock(X) or write-lock(X) before any read-item(X) operation is performed in T.
2. A transaction T must issue the operation write-lock(X) before any write-item(X) operation is performed in T.
3. A transaction T must issue the operation unlock(X) after all read-item(X) and write-item(X) operations are completed in T.
4. A transaction T will not issue a read-lock(X) operation if it already holds a read (shared) lock or a write (exclusive) lock on item X.
5. A transaction T will not issue a write-lock(X) operation if it already holds a read (shared) lock or a write (exclusive) lock on item X.
6. A transaction T will not issue an unlock(X) operation unless it already holds a read (shared) lock or a write (exclusive) lock on item X.
read-lock(X):
B:  if Lock(X) = "unlocked"
        then begin Lock(X) ← "read-locked";
            no_of_reads(X) ← 1
        end
    else if Lock(X) = "read-locked"
        then no_of_reads(X) ← no_of_reads(X) + 1
    else begin
        wait (until Lock(X) = "unlocked" and the lock manager wakes up the transaction);
        go to B
    end;

write-lock(X):
B:  if Lock(X) = "unlocked"
        then Lock(X) ← "write-locked"
    else begin
        wait (until Lock(X) = "unlocked" and the lock manager wakes up the transaction);
        go to B
    end;

unlock(X):
    if Lock(X) = "write-locked"
        then begin Lock(X) ← "unlocked";
            wake up one of the waiting transactions, if any
        end
    else if Lock(X) = "read-locked"
        then begin
            no_of_reads(X) ← no_of_reads(X) - 1;
            if no_of_reads(X) = 0
                then begin Lock(X) ← "unlocked";
                    wake up one of the waiting transactions, if any
                end
        end;

(Locking and unlocking operations for shared/exclusive locks)
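The shared/exclusive operations above can be sketched with a reader count and a writer flag. This is a minimal illustration of the rules (it does not prevent writer starvation, just as the simple pseudocode does not); the class and attribute names are my own.

```python
import threading

class SharedExclusiveLock:
    """Sketch of read-lock/write-lock/unlock for a single item X."""
    def __init__(self):
        self._readers = 0          # no_of_reads(X)
        self._writer = False       # True when X is write-locked
        self._cond = threading.Condition()

    def read_lock(self):
        with self._cond:
            while self._writer:               # wait while exclusively locked
                self._cond.wait()
            self._readers += 1                # share the lock with other readers

    def write_lock(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()             # wait until X is fully unlocked
            self._writer = True               # exclusive access to X

    def unlock(self):
        with self._cond:
            if self._writer:
                self._writer = False          # release the exclusive lock
            else:
                self._readers -= 1            # one reader fewer
            if self._readers == 0:
                self._cond.notify_all()       # wake up waiting transactions

lock_x = SharedExclusiveLock()
lock_x.read_lock()
lock_x.read_lock()     # two transactions may read X concurrently
lock_x.unlock()
lock_x.unlock()
lock_x.write_lock()    # a writer needs X unlocked by all readers first
lock_x.unlock()
```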
Two-Phase Locking
A transaction is said to follow the two-phase locking protocol if all locking operations (read-lock, write-lock)
precede the first unlock operation in the transaction. Such a transaction can be divided into two
phases: an expanding or growing (first) phase, during which new locks on items can be acquired but none
can be released; and a shrinking (second) phase, during which existing locks can be released but no new
locks can be acquired.
Given below are two transactions T1 and T2 which do not follow the two-phase locking protocol.
T1                              T2
read-lock(Y);                   read-lock(X);
read-item(Y);                   read-item(X);
unlock(Y);                      unlock(X);
write-lock(X);                  write-lock(Y);
read-item(X);                   read-item(Y);
X := X + Y;                     Y := Y + 1;
write-item(X);                  write-item(Y);
unlock(X);                      unlock(Y);
This is because the write-lock(X) operation follows the unlock(Y) operation in T1, and similarly the
write-lock(Y) operation follows the unlock(X) operation in T2. If we enforce two-phase locking, the transactions
can be rewritten as T1′ and T2′, as shown below:
T1′                             T2′
read-lock(Y);                   read-lock(X);
read-item(Y);                   read-item(X);
write-lock(X);                  write-lock(Y);
unlock(Y);                      unlock(X);
read-item(X);                   read-item(Y);
X := X + Y;                     Y := Y + 1;
write-item(X);                  write-item(Y);
unlock(X);                      unlock(Y);
It can be proved that, if every transaction in a schedule follows the two-phase locking protocol, the
schedule is guaranteed to be serializable, obviating the need to test the schedules for serializability.
The locking mechanism, by enforcing the two-phase locking rules, also enforces serializability.
Two-phase locking may limit the amount of concurrency that can occur in a schedule. This is because a
transaction T may not be able to release an item X after it is through using it if T must lock an additional
item Y later on; conversely, T must lock the additional item Y before it needs it so that it can release X.
Hence, X must remain locked by T until all items that the transaction needs to read or write have been
locked; only then can X be released by T. Meanwhile, another transaction seeking to access X may be forced
to wait, even though T is done with X; conversely, if Y is locked earlier than it is needed, another
transaction seeking to access Y is forced to wait even though T is not using Y yet. This is the price for
guaranteeing serializability of all schedules without having to check the schedules themselves.
Basic, Conservative, Strict, and Rigorous Two-Phase Locking:
The technique just described is known as basic 2PL. The variation known as conservative 2PL (or static
2PL) requires a transaction to lock all the items it accesses before the transaction begins execution, by
predeclaring its read-set and write-set. If any of the predeclared items cannot be locked, the
transaction does not lock any item; instead, it waits until all the items are available for locking.
Conservative two-phase locking is a deadlock-free protocol. However, it is difficult to use in practice because
of the need to predeclare the read-set and write-set, which is not possible in most situations.
In practice, the most popular variation of 2PL is strict 2PL, which guarantees strict schedules. In this
variation, a transaction T does not release any of its exclusive (write) locks until after it commits or aborts.
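The two-phase property itself can be checked mechanically. The sketch below (the function name and the list-of-tuples encoding of a transaction's lock operations are my own illustrative choices) rejects a schedule of operations in which any lock follows the first unlock:

```python
def is_two_phase(ops):
    """Check whether a sequence of ('lock'|'unlock', item) operations
    follows the two-phase rule: no lock may follow the first unlock."""
    shrinking = False
    for action, _item in ops:
        if action == 'unlock':
            shrinking = True                 # the shrinking phase has begun
        elif shrinking:                      # a lock during the shrinking phase
            return False
    return True

# T1 as originally written: unlock(Y) precedes write-lock(X), so not 2PL.
t1 = [('lock', 'Y'), ('unlock', 'Y'), ('lock', 'X'), ('unlock', 'X')]
# T1 rewritten: all locks precede the first unlock, so it is 2PL.
t1_rewritten = [('lock', 'Y'), ('lock', 'X'), ('unlock', 'Y'), ('unlock', 'X')]
print(is_two_phase(t1), is_two_phase(t1_rewritten))  # False True
```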
Student Activity 5.7
Before reading the next section, answer the following questions:
1. What are the various locking techniques for concurrency control?
2. What is two-phase locking?
If your answers are correct, then proceed to the next section.
Concurrency Control Based on Timestamp Ordering
Time Stamps
A time stamp is a unique identifier created by the DBMS to identify a transaction. Typically, timestamp
values are assigned in the order in which the transactions are submitted to the system, so a timestamp can
be thought of as the transaction start time. We will refer to the time stamp of transaction T as TS(T).
Time Stamp Ordering Algorithm:
The idea of this scheme is to order the transactions based on their timestamps. A schedule in which the
transactions participate is then serializable, and the equivalent serial schedule has the transactions in order
of their timestamp values. This is called timestamp ordering (TO). In timestamp ordering, the schedule is
equivalent to the particular serial order corresponding to the order of the transaction timestamps. The
algorithm must ensure that, for each item accessed by conflicting operations in the schedule, the order in
which the item is accessed does not violate the serializability order. To do this, the algorithm associates
with each database item X two timestamp (TS) values.
1. read_TS(X): The read timestamp of item X; this is the largest timestamp among all the timestamps of transactions that have successfully read item X, i.e., read_TS(X) = TS(T), where T is the youngest transaction that has read X successfully.
2. write_TS(X): The write timestamp of item X; this is the largest of all the timestamps of transactions that have successfully written item X, i.e., write_TS(X) = TS(T), where T is the youngest transaction that has written X successfully.
Basic Timestamp Ordering
Whenever some transaction T tries to issue a read-item(X) or write-item(X) operation, the basic TO
(timestamp ordering) algorithm compares the timestamp of T with read_TS(X) and write_TS(X) to ensure that the timestamp
order of transactions is not violated. If this order is violated, then transaction T is aborted and resubmitted to the
system as a new transaction with a new timestamp. If T is aborted and rolled back, any transaction T1 that
may have used a value written by T must also be rolled back. Similarly, any transaction T2 that may have
used a value written by T1 must also be rolled back, and so on. This effect is known as cascading rollback
and is one of the problems associated with basic TO, since the schedules produced are not recoverable,
cascadeless, or strict. The concurrency control algorithm must
check whether conflicting operations violate the timestamp ordering in the following two cases.
1. Transaction T issues a write-item(X) operation:
   a) If read_TS(X) > TS(T) or if write_TS(X) > TS(T), then abort and roll back T and reject the operation. This should be done because some younger transaction with a timestamp greater than TS(T), and hence after T in the timestamp ordering, has already read or written the value of item X before T had a chance to write X, thus violating the timestamp ordering.
   b) If the condition in part (a) does not occur, then execute the write-item(X) operation of T and set write_TS(X) to TS(T).
2. Transaction T issues a read-item(X) operation:
   a) If write_TS(X) > TS(T), then abort and roll back T and reject the operation. This should be done because some younger transaction with a timestamp greater than TS(T), and hence after T in the timestamp ordering, has already written the value of item X before T had a chance to read X.
   b) If write_TS(X) <= TS(T), then execute the read-item(X) operation of T and set read_TS(X) to the larger of TS(T) and the current read_TS(X).
Hence, whenever the basic TO algorithm detects two conflicting operations that occur in the incorrect
order, it rejects the latter of the two operations by aborting the transaction that issued it.
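The two cases above can be condensed into a small checker. The dictionary representation of read_TS and write_TS per item, and the convention of returning False to mean "abort and roll back T", are illustrative assumptions of this sketch:

```python
def basic_to(ts, op, item, table):
    """One step of the basic timestamp-ordering check (a sketch).
    table maps item -> [read_TS, write_TS]; returns True if the
    operation may proceed, False if the transaction must be aborted."""
    read_ts, write_ts = table.setdefault(item, [0, 0])
    if op == 'write':
        if read_ts > ts or write_ts > ts:    # a younger transaction got there first
            return False                     # abort and roll back T
        table[item][1] = ts                  # set write_TS(X) to TS(T)
    else:  # op == 'read'
        if write_ts > ts:                    # X was written by a younger transaction
            return False                     # abort and roll back T
        table[item][0] = max(read_ts, ts)    # read_TS(X) := max(TS(T), read_TS(X))
    return True

tbl = {}
print(basic_to(2, 'write', 'X', tbl))  # True: first write sets write_TS(X) = 2
print(basic_to(1, 'read', 'X', tbl))   # False: TS(T) = 1 < write_TS(X) = 2, abort
```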
Strict Timestamp Ordering
A variation of basic TO called strict TO ensures that the schedules are both strict and (conflict) serializable. In this
variation, a transaction T that issues a read-item(X) or write-item(X) such that TS(T) > write_TS(X) has its
read or write operation delayed until the transaction T′ that wrote the value of X (hence TS(T′) =
write_TS(X)) has committed or aborted. To implement this algorithm, it is necessary to simulate the locking
of an item X that has been written by transaction T′ until T′ is either committed or aborted. This algorithm
does not cause deadlock, since T waits for T′ only if TS(T) > TS(T′).
Student Activity 5.8
Before reading the next section, answer the following questions:
1. Write brief notes about basic timestamp ordering.
2. Write the timestamp ordering algorithm.
3. What is strict timestamp ordering?
If your answers are correct, then proceed to the next section.
Multiversion Concurrency Control Techniques
Other protocols for concurrency control keep the old values of a data item when the item is updated. These
are known as multiversion concurrency control techniques, because several versions (values) of an item are
maintained. When a transaction requires access to an item, an appropriate version is chosen to maintain the
serializability of the currently executing schedule, if possible. The idea is that some read operations that
would be rejected in other techniques can still be accepted by reading an older version of the item, thus
maintaining serializability. When a transaction writes an item, it writes a new version and the old version of
the item is retained. Some multiversion concurrency control algorithms use the concept of view
serializability rather than conflict serializability.
An obvious drawback of multi-version techniques is that more storage is needed to maintain multiple
versions of the database items.
Several multi-version concurrency control schemes have been proposed. We discuss two schemes here,
one based on timestamp ordering and the other based on 2PL.
Multiversion Technique Based on Timestamp Ordering
In this method, several versions X1, X2, …, Xk of each data item X are maintained. For each version, the
value of version Xi and the following two timestamps are kept:
1. read_TS(Xi): The read timestamp of Xi is the largest of all the timestamps of transactions that have successfully read version Xi.
2. write_TS(Xi): The write timestamp of Xi is the timestamp of the transaction that wrote the value of version Xi.
Whenever a transaction T is allowed to execute a write-item(X) operation, a new version Xk+1 of item X is
created, with both write_TS(Xk+1) and read_TS(Xk+1) set to TS(T). Correspondingly, when a
transaction T is allowed to read the value of version Xi, the value of read_TS(Xi) is set to the larger of the
current read_TS(Xi) and TS(T).
To ensure serializability, the following rules are used:
1. If transaction T issues a write-item(X) operation, and version i of X has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) > TS(T), then abort and roll back transaction T; otherwise, create a new version Xj of X with read_TS(Xj) = write_TS(Xj) = TS(T).
2. If transaction T issues a read-item(X) operation, find the version i of X that has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return the value of Xi to transaction T, and set the value of read_TS(Xi) to the larger of TS(T) and the current read_TS(Xi).
As we can see in case 2, a read-item(X) is always successful, since it finds the appropriate version Xi to read
based on the write_TS of the various existing versions of X. In case 1, however, transaction T may be
aborted and rolled back. This happens if T attempts to write a version of X that should have been
read by another transaction T′ whose timestamp is read_TS(Xi); however, T′ has already read version Xi,
which was written by the transaction with timestamp equal to write_TS(Xi). If this conflict occurs, T is
rolled back; otherwise, a new version of X, written by transaction T, is created. Notice that, if T is rolled
back, cascading rollback may occur. Hence, to ensure recoverability, a transaction T should not be allowed
to commit until after all the transactions that have written some version that T has read have committed.
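The two rules can be sketched as follows. The list-of-lists representation of versions, with one [read_TS, write_TS, value] entry per version, is an illustrative assumption, and the sketch ignores commit ordering and cascading rollback:

```python
def mv_write(ts, value, versions):
    """Multiversion TO write rule (sketch). versions is a list of
    [read_TS, write_TS, value] entries, one per version of item X."""
    # find the version with the highest write_TS that is <= TS(T)
    vi = max((v for v in versions if v[1] <= ts), key=lambda v: v[1])
    if vi[0] > ts:                     # read_TS(Xi) > TS(T): a younger reader saw Xi
        return False                   # abort and roll back T
    versions.append([ts, ts, value])   # new version: read_TS = write_TS = TS(T)
    return True

def mv_read(ts, versions):
    """Multiversion TO read rule: a read is always successful."""
    vi = max((v for v in versions if v[1] <= ts), key=lambda v: v[1])
    vi[0] = max(vi[0], ts)             # bump read_TS(Xi)
    return vi[2]

X = [[0, 0, 100]]                      # one committed initial version of X
print(mv_read(3, X))                   # 100; read_TS of that version becomes 3
print(mv_write(1, 90, X))              # False: that version was read by TS = 3
print(mv_write(4, 95, X))              # True: a new version is created
```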
Multiversion Two-Phase Locking Using Certify Locks
In this multiple-mode locking scheme, there are three locking modes for an item: read, write, and certify,
instead of just the two modes (read, write) discussed previously. Hence the state of Lock(X) for an item X can
be one of read-locked, write-locked, certify-locked, or unlocked. In the standard locking scheme with only
read and write locks, a write lock is an exclusive lock. We can describe the relationship between read and write
locks in the standard scheme by means of the lock compatibility table shown below. An entry of yes
means that, if a transaction T holds the type of lock specified in the column header on item X and
transaction T′ requests the type of lock specified in the row header on the same item X, then T′ can obtain
the lock because the locking modes are compatible. On the other hand, an entry of no in the table indicates
that the locks are not compatible, so T′ must wait until T releases the lock.
In the standard locking scheme, once a transaction obtains a write lock on an item, no other transactions can
access that item. The idea behind multiversion 2PL is to allow other transactions T′ to read an item X while
a single transaction T holds a write lock on X. This is accomplished by allowing two versions of each item
X; one version must always have been written by some committed transaction. The second version X′ is
created when a transaction T acquires a write lock on the item. Other transactions can continue to read the
committed version of X while T holds the write lock. Transaction T can write the value of X′ as needed,
without affecting the value of the committed version X. However, once T is ready to commit, it must obtain
a certify lock on all items that it currently holds write locks on before it can commit. The certify lock is not
compatible with read locks, so the transaction may have to delay its commit until all its write-locked items
are released by any reading transactions, in order to obtain the certify locks. Once the certify locks, which
are exclusive locks, are acquired, the committed version X of the data item is set to the value of version X′,
version X′ is discarded, and the certify locks are released. The lock compatibility table for this scheme
is shown below.
            Read    Write
Read        yes     no
Write       no      no

(A compatibility table for the read/write locking scheme)

            Read    Write   Certify
Read        yes     yes     no
Write       yes     no      no
Certify     no      no      no

(A compatibility table for the read/write/certify locking scheme)
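The compatibility tables can be written directly as lookup structures, which makes the difference between the two schemes easy to query. The dictionary encoding and function name below are illustrative:

```python
# The two lock compatibility tables, keyed (requested mode, held mode).
RW = {
    ('read',  'read'):  True,  ('read',  'write'): False,
    ('write', 'read'):  False, ('write', 'write'): False,
}
RWC = {
    ('read',    'read'): True,  ('read',    'write'): True,  ('read',    'certify'): False,
    ('write',   'read'): True,  ('write',   'write'): False, ('write',   'certify'): False,
    ('certify', 'read'): False, ('certify', 'write'): False, ('certify', 'certify'): False,
}

def compatible(table, held, requested):
    """True if T' may obtain `requested` on X while T holds `held` on X."""
    return table[(requested, held)]

# In the standard scheme a write lock excludes readers; under
# read/write/certify, readers may still read the committed version.
print(compatible(RW, 'write', 'read'))   # False
print(compatible(RWC, 'write', 'read'))  # True
```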
Hence no other transaction can read or write an item that is written by T unless T has committed, leading to
a strict schedule for recoverability. Strict 2PL is not deadlock-free. A more restrictive variation of strict
2PL is rigorous 2PL, which also guarantees strict schedules. In this variation, a transaction T does not release
any of its locks (exclusive or shared) until after it commits or aborts, and so it is easier to implement than
strict 2PL. Notice the difference between conservative and rigorous 2PL: the former must lock all its items
before it starts, so once the transaction starts it is in its shrinking phase, whereas the latter does not unlock any
of its items until after it terminates (by committing or aborting), so the transaction is in its expanding phase
until it ends.
Student Activity 5.9
Answer the following questions:
1. What is meant by the concurrent execution of database transactions in a multi-user system? Discuss why concurrency control is needed, and give informal examples.
2. Why is concurrency control needed in a database system?
3. What is the two-phase locking protocol? How does it guarantee serializability?
Log-Based Recovery
The most widely used structure for recording database modifications is the log. The log is a sequence of log
records, and maintains a record of all the update activities in the database. There are several types of log
records. An update log record describes a single database write, and has the following fields:
Transaction identifier is the unique identifier of the transaction that performed the write operation.
Data-item identifier is the unique identifier of the data item written. Typically, it is the location on disk of
the data item.
Old value is the value of the data item prior to the write.
New value is the value that the data item will have after the write.
Other special log records exist to record significant events during transaction processing, such as the start of
a transaction and the commit or abort of a transaction.
<Ti start>: Transaction Ti has started.
<Ti, Xj, V1, V2>: Transaction Ti has performed a write on data item Xj. Xj had value V1 before the write, and will have value V2 after the write.
<Ti commit>: Transaction Ti has committed.
<Ti abort>: Transaction Ti has aborted.
Whenever a transaction performs a write, it is essential that the log record for that write be created before
the database is modified. Once a log record exists, we can output the modification to the database if that is
desirable. We also have the ability to undo a modification that has already been output to the database: we
undo it by using the old-value field in the log records.
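The undo step can be sketched as follows. The tuple encoding of log records and the backward scan are illustrative assumptions matching the record formats listed above:

```python
# Sketch of undo using the old-value field of update log records.
# Log records follow the forms above: ('start', T), (T, X, old, new),
# ('commit', T), ('abort', T).

def undo(db, log, t):
    """Restore the old value of every item written by transaction t,
    scanning the log backwards so the most recent write is undone first."""
    for rec in reversed(log):
        if len(rec) == 4 and rec[0] == t:   # an update record of transaction t
            _, item, old, _new = rec
            db[item] = old                  # undo using the old value
    return db

db = {'X': 75, 'Y': 105}                    # state after T1's writes
log = [('start', 'T1'), ('T1', 'X', 80, 75), ('T1', 'Y', 100, 105)]
print(undo(db, log, 'T1'))                  # X and Y restored to 80 and 100
```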
• A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimisation.
• The runtime database processor has the task of running the query code, whether in compiled or interpreted mode, to produce the query result.
• The steps involved in processing a query are: parsing and translation, optimisation, and evaluation.
• The most widely used structure for recording database modifications is the log. The log is a sequence of log records, and maintains a record of all the update activities in the database.
True or False
1. A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated.
2. Before query processing can begin, the system need not translate the query into a usable form.
3. Computing the precise cost of evaluation of a plan is usually not possible without actually evaluating the plan.

Fill in the Blanks
1. Planning of an execution strategy may be a more accurate description than _____________.
2. ___________ languages permit users to specify what a query should generate without saying how the system should do the generating.
3. The cost estimates for _____________ are based on the assumption that the blocks of a relation are stored contiguously on disk.
4. We must make sure that the ________________ steps always lead to an equivalent query tree.
5. ___________ identifier is the unique identifier of the transaction that performed the write operation.
Answers

True or False
1. True
2. False
3. True

Fill in the Blanks
1. query optimization
2. Declarative
3. binary search
4. transformation
5. Transaction
True or False
1. A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query processing.
2. The first action the system must take in query processing is to translate a given query into its internal form.
3. The programmer must choose the query execution strategy while writing a database program.
4. The content of nonvolatile storage remains intact and is not corrupted.
5. The idea of the timestamp ordering scheme is to order the transactions based on their timestamps.

Fill in the Blanks
1. A DBMS must devise a _____________ for retrieving the result of the query from the database files.
2. _____________ make use of statistical information stored in the DBMS catalog to estimate the cost of a plan.
3. In query processing, the ____________ is the lowest-level operator to access data.
4. The SQL ____________ includes commands to grant and revoke privileges.
5. A ________________ is a logical unit of database processing.
1. "Query optimization is an important step in query processing." Justify the statement.
2. Discuss the various steps in query processing.
3. What is heuristic optimization of query trees? Discuss with an example.
4. "Security is a basic need for a database system." Justify the statement with a supporting example.
5. Compare binary locks to exclusive/shared locks. Why is the latter type of lock preferable?
6. What is a timestamp? How does the system generate timestamps?
7. Discuss the timestamp ordering protocol for concurrency control. How does strict timestamp ordering differ from basic timestamp ordering?
8. Discuss two multiversion techniques for concurrency control.
9. What is a certify lock? What are the advantages and disadvantages of using certify locks?
10. Fill in the blanks:
Definition and Analysis of Existing Systems
Data Analysis
Preliminary & Final Design of Relational Database
Testing
Process of Testing
Drawbacks of Testing
What is Implementation?
Operation and Tuning
Database Design Project
Learning Objectives
After reading this unit you should appreciate the following:
• Definition and Analysis of Existing Systems
• Data Analysis
• Preliminary and Final Design
• Testing & Implementation
• Maintenance
• Operation and Tuning
Definition and Analysis of Existing Systems
Here, "existing system" means a system which is working within certain constraints and which we now want
to update. In other words, we now want our existing database system to work under
improved constraints. For example, if we are handling the database of employees of a particular
organization, then it needs upgradation from time to time, i.e., we have to add or delete something.
For this upgradation process we have to analyze the system carefully, so that the database constraints are not
changed. We should also take care of data analysis in the following ways.
Data Analysis
Although complex statistical analysis is best left to statistics packages, databases should support simple,
commonly used forms of data analysis. Since the data stored in databases are usually large in volume, they
need to be summarized in some fashion if we are to derive information that humans can use; aggregate
functions are commonly used for this task.
The SQL aggregation functionality is limited, so several extensions have been implemented by different
databases. For instance, although SQL defines only a few aggregate functions, many database systems
provide a richer set of functions, including variance, median, and so on. Some systems also allow users to
add new aggregate functions.
Histograms are frequently used in data analysis. A histogram partitions the values taken by an attribute into
ranges, and computes an aggregate, such as sum, over the values in each range. For example, a histogram
on salary values might count the number of people whose salaries fall in each of the ranges 0 to 20000,
20001 to 40000, 40001 to 60000, and above 60000. Using SQL to construct such a histogram efficiently
would be cumbersome. We leave it as an exercise for you to verify our claim. Extensions to the SQL syntax
that allow functions to be used in the group by clause have been proposed to simplify the task. For instance,
the N_tile function supported on the Red Brick database system divides values into percentiles. Consider
the following query:
select percentile, avg(balance)
from account
group by N_tile(balance, 10) as percentile
Here, N_tile(balance, 10) divides the values for balance into 10 consecutive ranges, with an equal number
of values in each range; duplicates are not eliminated. Thus, the first range would have the bottom 10
percent of the values, and the tenth range would have the top 10 percent of the values. The rest of the query
performs a group by based on these ranges, and returns the average balance for each range.
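What N_tile(balance, 10) computes can be sketched in plain Python. Note that N_tile is a Red Brick extension, not standard SQL; the function name below and the simplifying assumption that the number of values divides evenly into n ranges are my own:

```python
def ntile_avg(values, n):
    """Split the sorted values into n consecutive ranges of equal size and
    average each range: roughly what the N_tile-based query computes."""
    vals = sorted(values)                        # duplicates are not eliminated
    size = len(vals) // n                        # assumes len(values) % n == 0
    return [sum(vals[i*size:(i+1)*size]) / size  # avg(balance) per range
            for i in range(n)]

balances = list(range(100, 1100, 100))           # ten sample account balances
print(ntile_avg(balances, 2))                    # [300.0, 800.0]
```

The first list entry averages the bottom half of the balances and the second the top half; with n = 10, each range is one decile of the values.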
Statistical analysis often requires aggregation on multiple attributes. Consider an application where a shop
wants to find out what kinds of clothes are popular. Let us suppose that clothes are classified based on color
and size, and that we have a relation sales with the schema Sales(color, size, number). To analyze the sales
by color (light versus dark) and size (small, medium, large), a manager may want to see data laid out as
shown in the table in figure 6.1.
The table in Figure 6.1 is an example of a cross-tabulation (cross-tab). Many report writers provide support
for such tables. In this case, the data are two- dimensional, since they are based on two attributes: size and
color. In general, the data can be represented as a multidimensional array, with a value for each element of
the array. Such data are called multidimensional data.
The data in a cross-tabulation cannot be generated by a single SQL query, since totals are taken at several
different levels. Moreover, we can see easily that a cross-tabulation is not the same as a relational table. We
can represent the data in relational form by introducing a special value all to represent subtotals, as shown
in Figure 6.2.
Consider the tuples (Light, all, 53) and (Dark, all, 35). We have obtained these tuples by eliminating
individual tuples with different values for size, and by replacing the value of number by an aggregate,
namely sum. The value all can be thought of as representing the set of values for size. Moving from
finer-granularity data to a coarser granularity by means of aggregation is called doing a rollup. In our example,
we have rolled up the attribute size. The opposite operation, that of moving from coarser-granularity data
to finer granularity, is called drill down. Clearly, finer-granularity data cannot be generated from
coarse-granularity data: they must be generated either from the original data, or from yet finer-granularity
summary data.
The number of different ways in which the tuples can be grouped for aggregation can be large, as you can
verify easily from the table in Figure 6.2. In fact, for a table with n dimensions, rollup can be performed on
each of the 2^n subsets of the n dimensions. Consider a three-dimensional version of the sales relation, with
size, color, and price as the three dimensions. Figure 6.3 shows the subsets of attributes of the
relation as corners of a three-dimensional cube; rollup can be performed on each of these subsets of
attributes. In general, the subsets of attributes of an n-dimensional relation can be visualized as the corners of
a corresponding n-dimensional cube.
(Figure 6.3: the subsets of the attributes size, color, and price, shown as the corners of a cube)
Although we can generate tables such as the one in Figure 6.2 using SQL, doing so is cumbersome. The
query involves the use of the union operation, and can be long; we leave it to you as an exercise to generate
the rows containing all from a table containing the other rows.
There have been proposals to extend the SQL syntax with a cube operator. For instance, the following
extended SQL query would generate the table shown in Figure 6.2:
select color, size, sum(number)
from sales
group by color, size with cube
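In systems without the cube extension, the same table can be built as a union of group-by queries, one per subset of the grouped attributes. The following sketch uses SQLite (which has no cube operator); the sales table and its contents are assumptions, chosen so that the subtotals match the tuples (Light, all, 53) and (Dark, all, 35) discussed above.

```python
# Generating a Figure 6.2-style cross-tabulation, with "all" subtotal rows,
# by a union of group-by queries. Table contents are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (color TEXT, size TEXT, number INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("Light", "S", 28), ("Light", "M", 25),
                  ("Dark", "S", 15), ("Dark", "M", 20)])

rows = conn.execute("""
    SELECT color, size, SUM(number) FROM sales GROUP BY color, size
    UNION ALL
    SELECT color, 'all', SUM(number) FROM sales GROUP BY color  -- rollup on size
    UNION ALL
    SELECT 'all', size, SUM(number) FROM sales GROUP BY size    -- rollup on color
    UNION ALL
    SELECT 'all', 'all', SUM(number) FROM sales                 -- grand total
""").fetchall()
for row in rows:
    print(row)
```

Each branch of the union produces the subtotals for one subset of {color, size}; with n dimensions, a full cube needs 2^n such branches, which is why a built-in cube operator is convenient.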
Here we will discuss the process and the various steps of relational database design. We take an example
and apply the design rules to it step by step.
Before we begin our discussion of normal forms and data dependencies, let us look at what can go wrong in
a bad database design. Among the undesirable properties that a bad design may have are:
- Repetition of information
- Inability to represent certain information
We shall discuss these problems using a modified database design for our banking example, in which the
information concerning loans is kept in one single relation, lending, which is defined over the relation schema
Lending-schema = (branch-name, branch-city, assets, customer-name,
loan-number, amount)
Figure 6.4 shows an instance of the relation lending (Lending-schema). A tuple t in the lending relation has
the following intuitive meaning:
• t[assets] is the asset figure for the branch named t[branch-name].
• t[branch-city] is the city in which the branch named t[branch-name] is located.
• t[loan-number] is the number assigned to a loan made by the branch named
t[branch-name] to the customer named t[customer-name].
• t[amount] is the amount of the loan whose number is t[loan-number].
Suppose that we wish to add a new loan to our database. Say that the loan is made by the Perryridge branch
to Adams in the amount of $1500. Let the loan number be L-31. In our design, we need a tuple with values
on all the attributes of Lending-schema. Thus, we must repeat the asset and city data for the Perryridge
branch, and must add the tuple
(Perryridge, Horseneck, 1700000, Adams, L-31, 1500)
to the lending relation. In general, the asset and city data for a branch must appear once for each loan made
by that branch.
The repetition of information required by the use of our alternative design is undesirable. Repeating
information wastes space. Furthermore, it complicates updating the database. Suppose, for example, that the
Perryridge branch moves from Horseneck to Newtown. Under our original design, one tuple of the branch
relation needs to be changed. Under our alternative design, many tuples of the lending relation need to be
changed. Thus, updates are more costly under the alternative design than under the original design. When
we perform the update in the alternative database, we must ensure that every tuple pertaining to the
Perryridge branch is updated, or else our database will show two cities for the Perryridge branch.
That observation is central to understanding why the alternative design is bad. We know that a bank branch
is located in exactly one city. On the other hand, we know that a branch may make many loans. In other
words, the functional dependency
branch-name → branch-city
holds on Lending-schema, but we do not expect the functional dependency branch-name → loan-number
to hold. The fact that a branch is located in a city and the fact that a branch makes a loan are independent,
and, as we have seen, these facts are best represented in separate relations. We shall see that we can use
functional dependencies to specify formally when a database design is good.
Another problem with the Lending-schema design is that we cannot represent directly the information
concerning a branch (branch-name, branch-city, assets) unless there exists at least one loan at the branch.
The problem is that tuples in the lending relation require values for loan-number, amount, and customer-name.
One solution to this problem is to introduce null values, as we did to handle updates through views. Recall, however,
that null values are difficult to handle. If we are not willing to deal with null values, then we can create the
branch information only when the first loan application at that branch is made. Worse, we would have to
delete this information when all the loans have been paid. Clearly, this situation is undesirable, since, under
our original database design, the branch information would be available regardless of whether or not loans
are currently maintained in the branch, and without resorting to the use of null values.
These problems suggest that we should decompose a relational schema that has many attributes into several
schemas with fewer attributes. Careless decomposition, however, may lead to another form of bad design.
Consider an alternative design in which Lending-schema is decomposed into the following two schemas:
Branch-customer-schema = (branch-name, branch-city, assets, customer-name)
Customer-loan-schema = (customer-name, loan-number, amount)
Using the lending relation of Figure 6.4, we construct our new relations branch-customer (Branch-customer-schema) and customer-loan (Customer-loan-schema) as follows:
branch-customer = Π branch-name, branch-city, assets, customer-name (lending)
customer-loan = Π customer-name, loan-number, amount (lending)
We show the resulting branch-customer and customer-loan relations in Figures 6.5 and 6.6, respectively.
Figure 6.5: The relation branch-customer
Figure 6.6: The relation customer-loan
Of course, there are cases in which we need to reconstruct the lending relation. For example, suppose that we
wish to find all branches that have loans with amounts less than $1000. No relation in our alternative
database contains these data. We need to reconstruct the lending relation. It appears that we can do so by
writing
branch-customer ⋈ customer-loan
Figure 6.7 shows the result of computing branch-customer ⋈ customer-loan. When we compare this
relation and the lending relation with which we started (Figure 6.4), we notice some differences. Although
every tuple that appears in lending appears in branch-customer ⋈ customer-loan, there are tuples in
branch-customer ⋈ customer-loan that are not in lending. In our example,
branch-customer ⋈ customer-loan has the following additional tuples:
(Downtown, Brooklyn, 9000000, Jones, L-93, 500)
(Perryridge, Horseneck, 1700000, Hayes, L-16, 1300)
(Mianus, Horseneck, 400000, Jones, L-17, 1000)
(North Town, Rye, 3700000, Hayes, L-15, 1500)
Consider the query, "Find all branches that have made a loan in an amount less than $1000." If we look
back at Figure 6.4, we see that the only branches with loan amounts less than $1000 are Mianus and Round
Hill. However, when we apply the expression
Π branch-name (σ amount < 1000 (branch-customer ⋈ customer-loan))
we obtain three branch names: Mianus, Round Hill, and Downtown.
Let us examine this example more closely. If a customer happens to have several loans from different
branches, we cannot tell which loan belongs to which branch. Thus, when we join branch-customer and
customer-loan, we obtain not only the tuples we had originally in lending, but also several additional tuples.
Although we have more tuples in branch-customer ⋈ customer-loan, we actually have less information.
We are no longer able, in general, to represent in the database information about which customers are
borrowers from which branch. Because of this loss of information, we call the decomposition of
Lending-schema into Branch-customer-schema and Customer-loan-schema a lossy decomposition, or a
lossy-join decomposition. A decomposition that is not a lossy-join decomposition is a lossless-join
decomposition. It should be clear from our example that a lossy-join decomposition is, in general, a bad
database design.
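The loss of information can be reproduced directly. The sketch below models relations as Python sets of tuples and uses the two Jones tuples implied by the example above (Jones holds loan L-17 at Downtown and L-93 at Mianus); projecting and rejoining manufactures exactly the spurious tuples listed earlier.

```python
# Demonstrating a lossy-join decomposition on two lending tuples.
lending = {
    ("Downtown", "Brooklyn", 9000000, "Jones", "L-17", 1000),
    ("Mianus", "Horseneck", 400000, "Jones", "L-93", 500),
}

# Decompose by projecting onto the two schemas.
branch_customer = {(b, city, a, cust) for (b, city, a, cust, ln, amt) in lending}
customer_loan = {(cust, ln, amt) for (b, city, a, cust, ln, amt) in lending}

# Natural join on the single common attribute, customer-name.
joined = {(b, city, a, cust, ln, amt)
          for (b, city, a, cust) in branch_customer
          for (cust2, ln, amt) in customer_loan
          if cust == cust2}

print(sorted(joined - lending))  # the spurious tuples introduced by the join
```

Because customer-name is the only join attribute, each of Jones's two branch tuples matches each of his two loan tuples, so the join contains four tuples where lending contained two; the original relation is a proper subset of the join, which is precisely what "lossy" means.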
Let us examine the decomposition more closely to see why it is lossy. There is one attribute in common
between Branch-customer-schema and Customer-loan-schema:
Branch-customer-schema ∩ Customer-loan-schema = {customer-name}
The only way that we can represent a relationship between, for example, loan number and branch-name is
through customer-name. This representation is not adequate because a customer may have several loans, yet
these loans are not necessarily obtained from the same branch.
Let us consider another alternative design, in which Lending-schema is decomposed into the following two
schemas:
Branch-schema = (branch-name, branch-city, assets)
Loan-info-schema = (branch-name, customer-name, loan-number, amount)
There is one attribute in common between these two schemas:
Branch-schema ∩ Loan-info-schema = {branch-name}
Thus, the only way that we can represent a relationship between, for example, customer-name and assets is
through branch-name. The difference between this example and the preceding one is that the assets of a
branch are the same, regardless of the customer to which we are referring, whereas the lending branch
associated with a certain loan amount does depend on the customer to which we are referring. For a given
branch-name, there is exactly one assets value and exactly one branch-city, whereas a similar statement
cannot be made for customer-name. That is, the functional dependency
branch-name → assets branch-city
holds, but customer-name does not functionally determine loan-number.
The notion of lossless joins is central to much of relational-database design. Therefore, we re-state the
preceding examples more concisely and more formally. Let R be a relation schema. A set of relation
schemas {R1, R2, ..., Rn} is a decomposition of R if
R = R1 ∪ R2 ∪ … ∪ Rn
That is, {R1, R2, ..., Rn} is a decomposition of R if, for i = 1, 2, ..., n, each Ri is a subset of R, and every
attribute in R appears in at least one Ri.
Let r be a relation on schema R, and let ri = Π Ri (r) for i = 1, 2, ..., n. That is, (r1, r2, ..., rn) is the database that
results from decomposing R into (R1, R2, ..., Rn). It is always the case that
r ⊆ r1 ⋈ r2 ⋈ ... ⋈ rn
To see that this assertion is true, consider a tuple t in relation r. When we compute the relations (r1, r2, ...,
rn), the tuple t gives rise to one tuple ti in each ri, i = 1, 2, ..., n. These n tuples combine to regenerate t
when we compute r1 ⋈ r2 ⋈ ... ⋈ rn. The details are left for you to complete as an exercise. Therefore, every
tuple in r appears in r1 ⋈ r2 ⋈ ... ⋈ rn.
In general, r ≠ r1 ⋈ r2 ⋈ ... ⋈ rn. As an illustration, consider our earlier example, in which
n = 2.
R = Lending-schema.
R1 = Branch-customer-schema.
R2 = Customer-loan-schema.
r = the relation shown in Figure 6.4
r1 = the relation shown in Figure 6.5
r2 = the relation shown in Figure 6.6
r1 ⋈ r2 = the relation shown in Figure 6.7
Note that the relations in Figures 6.4 and 6.7 are not the same.
To have a lossless-join decomposition, we need to impose constraints on the set of possible relations. We
found that the decomposition of Lending-schema into Branch-schema and Loan-info-schema is lossless
because the functional dependency
branch-name → branch-city assets
holds on Branch-schema.
Later in this chapter, we shall introduce constraints other than functional dependencies. We say that a
relation is legal if it satisfies all rules, or constraints, that we impose on our database.
Let C represent a set of constraints on the database. A decomposition {R1, R2, ..., Rn} of a relation schema R
is a lossless-join decomposition for R if, for all relations r on schema R that are legal under C,
r = Π R1 (r) ⋈ Π R2 (r) ⋈ ... ⋈ Π Rn (r)
We shall show how to test whether a decomposition is a lossless-join decomposition in the next few
sections. A major part of this chapter is concerned with the questions of how to specify constraints on the
database, and how to obtain lossless-join decompositions that avoid the pitfalls represented by the examples
of bad database designs that we have seen in this section.
We can use a given set of functional dependencies in designing a relational database in which most of the
undesirable properties do not occur. When we design such systems, it may become necessary to decompose
a relation into several smaller relations. Using functional dependencies, we can define several normal
forms that represent "good" database designs.
In this subsection, we shall illustrate our concepts by considering the Lending schema discussed earlier.
Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount)
The set F of functional dependencies that we require to hold on Lending-schema are
branch-name → assets branch-city
loan-number → amount branch-name
As we discussed earlier, the Lending-schema of Figure 6.4 is an example of a bad database design.
Assume that we decompose it into the following three relations:
Branch-schema = (branch-name, assets, branch-city)
Loan-schema = (branch-name, loan-number, amount)
Borrower-schema = (customer-name, loan-number)
Lossless-Join Decomposition
When decomposing a relation into a number of smaller relations, it is crucial that the decomposition be
lossless. To demonstrate our claim we must first present a criterion for determining whether a
decomposition is lossy.
Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a
decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following
functional dependencies is in F+:
R1 ∩ R2 → R1
R1 ∩ R2 → R2
We now show that our decomposition of Lending-schema is a lossless-join decomposition by showing a
sequence of steps that generate the decomposition. We begin by decomposing Lending-schema into two
schemas:
Branch-schema = (branch-name, branch-city, assets)
Loan-info-schema = (branch-name, customer-name, loan-number, amount)
Since branch-name → branch-city assets, the augmentation rule for functional dependencies implies that
branch-name → branch-name branch-city assets
Since Branch-schema ∩ Loan-info-schema = {branch-name}, it follows that our initial decomposition is a
lossless-join decomposition.
Next, we decompose Loan-info-schema into
Loan-schema = (branch-name, loan-number, amount)
Borrower-schema = (customer-name, loan-number)
This step results in a lossless-join decomposition, since loan-number is a common attribute and
loan-number → amount branch-name.
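The test just applied can be mechanized: compute the closure of the common attributes under F and check whether it contains R1 or R2. The sketch below follows the Lending-schema example; the helper functions are illustrative, not a library API.

```python
def closure(attrs, fds):
    """Closure of an attribute set under FDs given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless_join(r1, r2, fds):
    """True if R1 ∩ R2 → R1 or R1 ∩ R2 → R2 follows from the FDs."""
    common_closure = closure(r1 & r2, fds)
    return r1 <= common_closure or r2 <= common_closure

fds = [({"branch-name"}, {"branch-city", "assets"}),
       ({"loan-number"}, {"amount", "branch-name"})]

branch_schema = {"branch-name", "branch-city", "assets"}
loan_info_schema = {"branch-name", "customer-name", "loan-number", "amount"}
print(lossless_join(branch_schema, loan_info_schema, fds))   # True

# The earlier, lossy decomposition fails the test: the closure of
# {customer-name} contains neither schema.
branch_customer = {"branch-name", "branch-city", "assets", "customer-name"}
customer_loan = {"customer-name", "loan-number", "amount"}
print(lossless_join(branch_customer, customer_loan, fds))    # False
```

The first call succeeds because the closure of {branch-name} under F is {branch-name, branch-city, assets}, which covers all of Branch-schema; the second fails because customer-name determines nothing else.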
Dependency Preservation
There is another goal in relational-database design: dependency preservation. When an update is made to
the database, the system should be able to check that the update will not create an illegal relation—that is,
one that does not satisfy all the given functional dependencies. If we are to check updates efficiently, we
should design relational-database schemas that allow update validation without the computation of joins.
To decide whether joins must be computed, we need to determine what functional dependencies may be
tested by checking each relation individually. Let F be a set of functional dependencies on a schema R, and
let R1, R2, ..., Rn be a decomposition of R. The restriction of F to Ri is the set Fi of all functional
dependencies in F+ that include only attributes of Ri. Since all functional dependencies in a restriction
involve attributes of only one relation schema, it is possible to test satisfaction of such a dependency by
checking only one relation.
The set of restrictions F1, F2, ..., Fn is the set of dependencies that can be checked efficiently. We now
must ask whether testing only the restrictions is sufficient. Let F′ = F1 ∪ F2 ∪ ... ∪ Fn. F′ is a set of functional
dependencies on schema R, but, in general, F′ ≠ F. However, even if F′ ≠ F, it may be that F′+ = F+. If the
latter is true, then every dependency in F is logically implied by F′, and, if we verify that F′ is satisfied, we
have verified that F is satisfied. We say that a decomposition having the property F′+ = F+ is a dependency-preserving
decomposition. Figure 6.8 shows an algorithm for testing dependency preservation. The input is
a set D = {R1, R2, ..., Rn} of decomposed relation schemas, and a set F of functional dependencies.
We can now show that our decomposition of Lending-schema is dependency preserving. We consider each
member of the set F of functional dependencies that we require to hold on Lending-schema, and show that
each one can be tested in at least one relation in the decomposition.
We can test the functional dependency: branch-name → branch-city assets using Branch - schema
= (branch-name, branch-city, assets).
We can test the functional dependency: loan-number → amount branch-name using Loan-schema
= (branch-name, loan-number, amount).
As the preceding example shows, it is often easier not to apply the algorithm of Figure 6.8 to test
dependency preservation, since the first step—computation of F+—takes exponential time.
Figure 6.8: An algorithm to test dependency preservation

Repetition of Information
In Lending-schema, it was necessary to repeat the city and assets of a branch for each loan. The
decomposition separates branch and loan data into distinct relations, thereby eliminating this redundancy.
Similarly, observe that, if a single loan is made to several customers, we must repeat the amount of the loan
once for each customer (as well as the city and assets of the branch). In the decomposition, the relation on
schema Borrower-schema contains the loan-number, customer-name relationship, and no other schema
does. Therefore, we have one tuple for each customer for a loan in only the relation on Borrower-schema.
In the other relations involving loan number (those on schemas Loan-schema and Borrower-schema), only
one tuple per loan needs to appear.
Clearly, the lack of redundancy exhibited by our decomposition is desirable. The degree to which we can
achieve this lack of redundancy is represented by several normal forms, which we shall discuss in the
remainder of this chapter.
Student Activity 6.1
Before reading the next section, answer the following questions:
1. Consider a database of a university computer science department, which has the attributes student-id, course, teacher-id, and center. How would you design the database?
2. How will you analyse an existing system?
If your answers are correct, then proceed to the next section.
Testing
Testing is the most time-consuming, but an essential, activity of a software project. It is vital to the success
of the candidate system. Although programmers also test their programs during the development phase, they
generally do not test the programs in a systematic way. This is because, during the development phase, they
concentrate more on removing syntax errors and some logical errors, and hence neither compare the
outputs with the requirements nor test the complete system. To make the system reliable and error free, the
complete system must be tested in a systematic and organized way. Before discussing the process of testing,
let us first identify the activities that are required to be tested.
Activities to be Tested
During system testing, the following activities must be tested:
(a) Outputs: The system is tested to see whether it provides the desired outputs correctly and efficiently.
(b) Response Time: A system is expected to respond quickly during data entry, modification and query processes. The system should be tested to find the response time for various operations.
(c) Storage: A system should be tested to determine the capacity of the system to store data on the hard disk or other external storage device.
(d) Memory: During execution of the system, the programs require sufficient memory. The system is tested to determine the memory required for running various programs.
(e) Peak Load Processing: The system must also be tested to determine whether it can handle more than one activity simultaneously during the peak of its processing demand. This type of test is generally conducted for multi-user systems such as banking applications, railway reservation systems, etc.
(f) Security: The system must ensure the security of data and information. Therefore, the system is tested to check whether all the security measures are provided in the system or not.
(g) Recovery: Sometimes, due to certain technical or operational problems, data may be lost or damaged. The system must be tested to ensure that an efficient recovery procedure is available in the system to avoid disasters.
Types of Testing
Testing can be of the following types:
(a) Unit Testing: Testing of individual programs or modules is known as unit testing. Unit testing is done during both the development and testing phases.
(b) Integration Testing: Testing the interfaces between related modules of a system is known as integration testing. After the development phase, all modules are tested to check whether they are properly integrated or not.
(c) System Testing: Executing the programs of the entire system under specially prepared test data, assuming that the programs will have logical errors and may not be according to specifications, is known as system testing. System testing is actually the testing that is done during the testing phase.
(d) Acceptance Testing: Running the system under live or realistic data by the actual user is called acceptance testing. It can be done during both the testing and implementation phases.
(e) Verification Testing: Running the system under a simulated environment using simulated data in order to find errors is known as verification testing or alpha testing.
(f) Validation Testing: Running the system under a live environment using live data in order to find errors is known as validation testing or beta testing.
(g) Static Testing: Observing the behaviour of the system not by executing it but through reviews, reading the programs, or any other non-execution method is called static testing.
(h) Dynamic Testing: Observing the behaviour of the system by executing it is called dynamic testing. Except for static testing, all other testing types are actually dynamic testing.
(i) Structural Testing: Testing the structure of the various programs of a system by examining the program and data structures is called structural testing. Structural testing is thus concerned with the implementation of the program.
(j) Functional Testing: Testing the functions of the various programs of a system by executing them and examining the structure of data and programs is called functional testing. Functional testing is concerned with the functionality of the system.
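As a concrete illustration of unit testing, the sketch below tests a single function in isolation against prepared test data and expected outputs, using Python's unittest module. The monthly_interest function and its rate are invented for the example.

```python
# Unit testing one module function in isolation with unittest.
import unittest

def monthly_interest(balance, annual_rate=0.12):
    """Interest accrued in one month at the given annual rate."""
    if balance < 0:
        raise ValueError("balance cannot be negative")
    return balance * annual_rate / 12

class TestMonthlyInterest(unittest.TestCase):
    def test_typical_balance(self):
        self.assertAlmostEqual(monthly_interest(1200), 12.0)

    def test_zero_balance(self):
        self.assertEqual(monthly_interest(0), 0)

    def test_negative_balance_rejected(self):
        with self.assertRaises(ValueError):
            monthly_interest(-1)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMonthlyInterest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test case exercises one condition of the unit, including the error path; integration testing would then check how this function behaves when called by the modules that depend on it.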
Testing Process
There are many activities that must be performed during the testing process. Some important activities are:
1. Preparation of Test Plan: A test plan is the first step of the testing process. A test plan is a general document for the project, which contains the following:
- Identification and specification of test units;
- Software features to be tested, e.g., performance, design constraints;
- Techniques used for testing;
- Preparation of test data;
- Schedule of each testing unit;
- Identification of persons responsible for each activity.
2. Specification of Test Cases: Specification of test cases is the next major step of the testing process. In this step, test data is prepared for testing each and every criterion of a test unit, along with the specifications of conditions and expected outputs. Selecting the test cases manually is a very difficult process. Some data-flow analysis tools help in deciding the test cases.
3. Execution and Analysis of Test Cases: All the test cases are executed and analyzed by the analyst to see whether the system is giving the expected outputs for all the conditions.
4. Special System Tests: Besides testing the normal execution of the system, special tests need to be performed to check the response time, storage capacity, memory requirements, peak load performance, security features and recovery procedures of the system.
Drawbacks of Testing
Although testing is an essential phase of the SDLC, it has the following drawbacks:
- Testing is an expensive method for the identification and removal of faults (bugs) in the system.
- Testing is the most time-consuming activity of the software development process.
Student Activity 6.2
Before reading the next section, answer the following questions:
1. Describe the activities performed during system testing.
2. Differentiate between the following:
a. Unit and Integration Testing
b. Static and Dynamic Testing
c. Verification and Validation Testing
d. Structural and Functional Testing
3. Differentiate between testing and debugging with a suitable example.
If your answers are correct, then proceed to the next section.
Implementation
After testing, the candidate system is installed and implemented at the user's site. The old system is
changed to the new or modified system, and users are provided training to operate the new system. This is a
crucial phase of the SDLC and is known as the implementation phase. Before discussing the activities of
the implementation phase, let us first see what is meant by implementation. The term 'implementation' may
be defined as follows:
Implementation is the process of replacing the manual or old computerized system with the newly
developed system and making it operational, without disturbing the functioning of the organization.
Types of Implementation
Implementation may be of the following three types:
(a) Fresh Implementation: Implementation of a totally new computerized system by replacing the manual system.
(b) Replacement Implementation: Implementation of a new computerized system by replacing the old computerized system.
(c) Modified Implementation: Implementation of a modified computerized system by replacing the old computerized system.
Implementation Process
Whatever be the kind of implementation, the implementation process has the following two parts:
(i) Conversion
(ii) Training of Users
We will discuss these procedures in brief.
Conversion is the process of changing from the old system to the modified or new one. Many different
activities need to be performed in the conversion process, depending upon the type of implementation (as
defined above). During fresh implementation, all necessary hardware is installed and manual files are
converted to computer-compatible files. During replacement implementation, old hardware may be
replaced with new hardware, and old file structures also need to be converted to new ones. The
conversion process is comparatively simpler in the third type of implementation, i.e., modified
implementation. In such implementation, the existing hardware is generally not replaced and no changes
are made in the file structures.
Conversion Plan
Before starting the conversion process, the analyst must prepare a plan for conversion. This plan should be
prepared in consultation with users. The conversion plan contains the following important tasks:
(i) Selection of a conversion method;
(ii) Preparation of a conversion schedule;
(iii) Identification of all data files that need to be converted;
(iv) Identification of documents required for the conversion process;
(v) Selecting team members and assigning them different responsibilities.
Conversion Methods
The following four methods are available for the conversion process:
(a) Direct Cutover: In this method, the old system (whether manual or computerized) is completely dropped on one particular date and the new system is implemented.
(b) Parallel Conversion: In this method, the old system is not dropped at once; both the old and new systems are operated in parallel. When the new system is accepted and successfully implemented, the old system is dropped.
(c) Phase-in Method: In this method, the new system is implemented in many phases. Each phase is carried out only after successful implementation of the previous phase.
(d) Pilot System: In this method, a working version of the new system is implemented in one department of the organization. If the system is accepted in that department, it is implemented in the other departments, either in phases or completely.
Each of the above methods has its advantages and disadvantages. Although direct cutover is the fastest
way of implementing the system, this method is very risky: the organization depends completely on the
new system, and if it fails, there would be a great loss to the company. The parallel method is considered
more secure, but it has many disadvantages. Parallel conversion doubles not only the operating costs but
also the workload. Its major disadvantage is that the outputs of the two systems may mismatch, and in such
cases it becomes very difficult for management to analyze, compare and evaluate the results. Although the
phase-in method and pilot system are more time-consuming methods of implementation, they are
considered more reliable, secure and economical.
Operation and Tuning
Tuning the performance of a system involves adjusting various parameters and design choices to improve its
performance for a specific application. Various aspects of a database-system design, ranging from high-level
aspects such as the schema and transaction design, to database parameters such as buffer size, down to
hardware issues such as the number of disks, affect the performance of an application. Each of these aspects
can be adjusted so that performance is improved.
Location of Bottlenecks
The performance of most systems is limited primarily by the performance of one or a few components,
which are called bottlenecks. For instance, a program may spend 80 percent of its time in a small loop deep
in the code, and the remaining 20 percent in the rest of the code. Improving the speed of the rest of the code
can yield at most a 20 percent improvement overall, whereas improving the speed of the bottleneck loop
could result in an improvement of nearly 80 percent overall, in the best case.
Hence, when tuning a system, we must first try to discover the bottlenecks, and then eliminate them by
improving the performance of the components causing them. When one bottleneck is removed, it may turn
out that another component becomes the bottleneck. In a well-balanced system, no single component is the
bottleneck. If the system contains bottlenecks, the components that are not part of the bottleneck are
underutilized, and could perhaps have been replaced by cheaper components with lower performance.
For simple programs, the time spent in each region of the code determines the overall execution time.
Database systems, however, are much more complex, and are better modeled as queueing systems. A
transaction requests various services from the database system, starting from entry to a server process, and
including disk reads during execution, CPU cycles, and locks for concurrency control. Each of these services
has a queue associated with it, and small transactions may spend most of their time waiting in queues,
especially in disk I/O queues, rather than executing code.
As a result of the numerous queues in the database, bottlenecks in a database system typically show up in
the form of long queues for a particular service, or, equivalently, in high utilizations for a particular service.
If requests are spaced exactly uniformly, and the time to service a request is less than or equal to the time
when the next request arrives, then each request will find the resource idle and can therefore start execution
immediately without waiting. Unfortunately, the arrival of requests in a database system is never so
uniform, and is instead random.
If a resource, such as a disk, has a low utilization, then, when a request is made, the resource is likely to be
idle, in which case the waiting time for the request will be 0. Assuming randomly distributed
arrivals, the length of the queue (and correspondingly the waiting time) goes up exponentially with
utilization; as utilization approaches 100 percent, the queue length increases sharply, resulting in
excessively long waiting times. The utilization of a resource should therefore be kept low enough that queue lengths stay
short. As a rule of thumb, utilizations of around 70 percent are considered good, and utilizations
above 90 percent are considered excessive, since they will result in significant delays. To learn more about
the theory of queuing systems, generally referred to as queuing theory, you can consult the references cited
in the bibliographic notes.
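The rule of thumb above can be made concrete with the textbook M/M/1 queuing formula, under which the average number of requests at a resource with utilization rho is rho/(1 - rho). The short sketch below illustrates that formula only; the M/M/1 model (random arrivals, a single server) is an assumption for illustration, not something fixed by the text.

```python
def expected_queue_length(utilization):
    """Average number of requests at an M/M/1 resource (waiting plus in
    service) at the given utilization rho: L = rho / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must lie in [0, 1)")
    return utilization / (1.0 - utilization)

# Queue length stays modest up to about 70 percent utilization and
# grows sharply past 90 percent, matching the rule of thumb above.
for rho in (0.50, 0.70, 0.90, 0.99):
    print(f"utilization {rho:.0%}: average queue length {expected_queue_length(rho):.1f}")
```

At 70 percent utilization the queue averages about 2.3 requests; at 90 percent it averages 9, which is why the text calls 70 percent good and anything above 90 percent excessive.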
Tunable Parameters
Database administrators can tune a database system at three levels. The lowest level is the hardware
level. Options for systems at this level include adding disks or using a RAID system if disk I/O is a
bottleneck, adding more memory if the disk buffer size is a bottleneck, or moving to a faster processor if
CPU use is a bottleneck.
The second level consists of the database-system parameters, such as buffer size and checkpointing
intervals. The exact set of database-system parameters that can be tuned depends on the specific database
system. Most database-system manuals provide information on what database-system parameters can be
adjusted, and how you should choose values for the parameters. Well-designed database systems perform as
much tuning as possible automatically, freeing the user or database administrator from the burden. For instance,
many database systems have a buffer of fixed size, and the buffer size can be tuned. If the system
automatically adjusts the buffer size by observing indicators such as page-fault rates, then the user will not
have to worry about the buffer size.
The third level is the higher-level design, including the schema and transactions. You can tune the design of
the schema, the indices that are created, and the transactions that are executed. Tuning at this level is
comparatively system independent. In the rest of this section, we discuss tuning of the higher-level design.
DATABASE DESIGN PROJECT
193
The three levels of tuning interact with one another; we must consider them together when tuning a system. For example,
tuning at a higher level may result in the hardware bottleneck changing from the disk system to the CPU, or vice
versa.
Tuning of the Schema
Within the constraints of the normal form adopted, it is possible to partition relations vertically. For
example, consider the account relation, with the schema:
Account (branch-name, account-number, balance)
for which account-number is a key. Within the constraints of the normal forms (BCNF and third normal
form), we can partition the account relation into two relations as follows:
Account-branch (account-number, branch-name)
Account-balance (account-number, balance)
The two representations are logically equivalent, since account- number is a key, but they have different
performance characteristics.
If most accesses to account information look at only the account-number and balance, then they can be run
against the account-balance relation, and access is likely to be somewhat faster, since the branch-name
attribute is not fetched. For the same reason, more tuples of account-balance will fit in the buffer than
corresponding tuples of account, again leading to faster performance. This effect would be particularly
marked if the branch-name attribute were large. Hence, a schema consisting of account-branch and account-balance would be preferable to a schema consisting of the account relation in this case.
On the other hand, if most accesses to account information require both balance and branch-name, using the
account relation would be preferable, since the cost of the join of account-balance and account-branch
would be avoided. Also, the storage overhead would be lower, since there would be only one relation, and the
attribute account-number would not be replicated.
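The two physically different but logically equivalent schemas can be tried out directly. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for any relational engine, with hyphens in the book's names replaced by underscores and the account data invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- account vertically partitioned on its key, account_number
    CREATE TABLE account_branch  (account_number TEXT PRIMARY KEY, branch_name TEXT);
    CREATE TABLE account_balance (account_number TEXT PRIMARY KEY, balance REAL);
    INSERT INTO account_branch  VALUES ('A-101', 'Downtown'), ('A-215', 'Mianus');
    INSERT INTO account_balance VALUES ('A-101', 500.0), ('A-215', 700.0);
""")

# Balance-only lookups touch just the narrow relation, fetching no branch names.
balance = conn.execute(
    "SELECT balance FROM account_balance WHERE account_number = 'A-101'").fetchone()[0]

# The original account relation is recoverable by a join on the shared key,
# which is the extra cost paid whenever both attributes are needed.
account = conn.execute("""
    SELECT branch_name, account_number, balance
    FROM account_branch JOIN account_balance USING (account_number)
    ORDER BY account_number
""").fetchall()
print(balance)
print(account)
```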
Tuning of Indices
We can tune the indices in a system to improve performance. If queries are the bottleneck, we can often
speed them up by creating appropriate indices on relations. If updates are the bottleneck, there may be too
many indices, which have to be updated when the relations are updated. Removing indices may speed up
updates.
The choice of the type of index also is important. Some database systems support different kinds of indices,
such as hash indices and B-tree indices. If range queries are common, B-tree indices are preferable to hash
indices. Whether or not to make an index clustered is another tunable parameter. Only one index on
a relation can be made clustered; generally, the one that benefits the greatest number of queries and updates
should be made clustered.
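A minimal sketch of this tuning step, again using SQLite via Python's sqlite3 module with invented data: the same query is planned before and after an index is created on the searched column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (date TEXT, employee TEXT, department TEXT, amount REAL)")
conn.executemany("INSERT INTO expenses VALUES (?, ?, ?, ?)",
                 [("2002-01-01", "emp%d" % i, "dept%d" % (i % 5), float(i))
                  for i in range(1000)])

query = "SELECT SUM(amount) FROM expenses WHERE department = 'dept3'"
total = conn.execute(query).fetchone()[0]

# Without an index, the plan is a full scan of the relation.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# If such queries are the bottleneck, an index on the searched column helps;
# if updates were the bottleneck instead, dropping it would be the tuning step.
conn.execute("CREATE INDEX idx_dept ON expenses (department)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before)
print(plan_after)
```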
Tuning of Transactions
Both read-only and update transactions can be tuned. In the past, optimizers on many database systems
were not particularly good, so how a query was written would have a big influence on how it was executed,
and therefore on the performance. Today, optimizers are advanced, and can transform even badly written
queries and execute them efficiently. However, optimizers have limits on what they can do. Most systems
provide a mechanism to find out the exact execution plan for a query, which can then be used to tune the query.
194
DATABASE SYSTEMS
In embedded SQL, if a query is executed frequently with different values for a parameter, it may help to combine the calls into a more set-oriented query that is executed only once. The
costs of communication of SQL queries can be high in client-server systems, so combining the embedded
SQL calls is particularly helpful in such systems. For example, consider a program that steps through each
department specified in a list, invoking an embedded SQL query to find the total expenses of the
department using the group by construct on a relation expenses (date, employee, department, amount). If
the expenses relation does not have a clustered index on department, each such query will result in a scan of
the relation. Instead, we can use a single embedded SQL query to find the total expenses of every department,
and to store the totals in a temporary relation; the query can be evaluated with a single scan. The relevant
departments can then be looked up in this (presumably much smaller) temporary relation.
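The difference between the per-department loop and the single set-oriented query can be sketched as follows, with SQLite standing in for the embedded-SQL host and a tiny invented expenses relation; both formulations must produce the same totals, but the second needs only one scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expenses (date TEXT, employee TEXT, department TEXT, amount REAL)")
conn.executemany("INSERT INTO expenses VALUES (?, ?, ?, ?)", [
    ("2002-03-01", "anil",  "sales",    100.0),
    ("2002-03-02", "bina",  "sales",     50.0),
    ("2002-03-02", "chand", "accounts",  75.0),
])
departments = ["sales", "accounts"]

# One query per department: each call may rescan the whole relation.
per_dept = {d: conn.execute("SELECT SUM(amount) FROM expenses WHERE department = ?",
                            (d,)).fetchone()[0]
            for d in departments}

# One set-oriented GROUP BY query: every total computed in a single scan.
combined = dict(conn.execute(
    "SELECT department, SUM(amount) FROM expenses GROUP BY department"))
print(per_dept)
print(combined)
```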
Student Activity 6.3
Answer the following questions.
1.
What do you mean by the concept of maintenance of a database system?
2.
What do you understand by bottlenecks?
•
Although complex statistical analysis is best left to statistics packages, databases should support
simple, commonly used, forms of data analysis.
•
For making the system reliable and error free, the complete system must be tested in a systematic and
organized way.
•
During system testing, the following activities must be tested: Outputs, Response Time, Storage,
Memory, Peak Load Processing, Security and Recovery.
•
Implementation is the process of converting the manual or old computerized system with the newly
developed system and making it operational, without disturbing the functioning of the organization.
I. True or False
1.
The data stored in databases are usually small in volume.
2.
Statistical analysis often requires aggregation on multiple attributes.
3.
The repetition of information required by the use of our alternative design is undesirable.
II. Fill in the Blanks
1.
The SQL aggregation functionality is___________, so several extensions have been
implemented by different databases.
2.
Histograms are frequently used in____________.
3.
Repeating information wastes _________.
4.
To have a_____________________________, we need to impose constraints on the set of
possible relations.
5.
The ________________of a system involves adjusting various parameters design choices to
improve its performance for a specific application.
Answers
I. True or False
1.
False
2.
True
3.
True
II. Fill in the Blanks
1.
limited
2.
data analysis
3.
space
4.
lossless-join decomposition
5.
performance
True or False
1.
Updates are less costly under the alternative design than under the original design.
2.
Careless decomposition may lead to another form of bad design.
3.
If we are to check updates efficiently, we do not design relational-database schemas that
allow update validation without the computation of joins.
4.
Fresh Implementation is the Implementation of totally new computerized system
replacing manual system.
5.
We can tune the indices in a system to improve performance.
Fill in the Blanks
1.
Statistical analysis often requires _____________on multiple attribute.
2.
Decomposing is dividing one __________into two.
3.
When an update is made to the database the system should be able to check that the update
will not create an illegal __________ .
4.
3NF is _________ than BCNF.
1.
“Design of database requires a full conceptual knowledge of BASIC”. Justify the statement with an
example.
2.
BCNF is stronger than 3NF. Justify with a supporting example.
3.
What do you understand by Tuning?
4.
Is testing necessary before implementation? Give your views with an example.
Implementation of SQL Using ORACLE RDBMS
Use of Relational DBMS Package for Class
Project
Learning Objectives
After reading this unit you should appreciate the following:
•
Implementation of SQL using Oracle RDBMS
Top
CHAR: Values of this datatype are fixed-length character strings of maximum length 255 characters.
VARCHAR/VARCHAR2: Values of this datatype are variable-length character strings of maximum
length 2000.
NUMBER: The NUMBER datatype is used to store numbers (fixed or floating point). Numbers of
virtually any magnitude may be stored, up to 30 digits of precision. Numbers as large as 9.99 * 10 to the
power of 124, i.e. approximately 1 followed by 125 zeros, can be stored.
DATE: The standard format is DD-MON-YY, as in 13-DEC-99. To enter dates other than the standard
format, use the appropriate functions. DATE stores time in the 24-hour format. By default, the time
in a date field is 12:00:00 am if no time portion is specified. The default date for a date field is the
first day of the current month.
LONG: It is used for variable-length character strings of up to 65,535 characters.
Creating a Table
Syntax
CREATE TABLE tablename (columnname datatype (size), columnname datatype (size));
Example
1.
Create client-master table where

Column Name    Datatype    Size
Client-no      Varchar2    6
Name           Varchar2    20
Address1       Varchar2    30
Address2       Varchar2    30
City           Varchar2    15
State          Varchar2    15
Pincode        Number      6
Remarks        Varchar2    60
Bal-due        Number      10, 2
CREATE TABLE client-master (Client-no Varchar2 (6),
Name Varchar2 (20),
Address1 Varchar2 (30),
Address2 Varchar2 (30),
City Varchar2 (15),
State Varchar2 (15),
Pincode Number (6),
Remarks Varchar2 (60),
Bal-due Number (10, 2));
2.
Create product-master table where

Column Name      Datatype    Size
Product-no       Varchar2    6
Description      Varchar2    25
Profit-percent   Number      2, 2
Unit-measure     Varchar2    10
Qty-on-hand      Number      8
Reorder-lvl      Number      8
Sell-price       Number      8, 2
Cost-price       Number      8, 2
CREATE TABLE product-master (Product-no varchar2 (6), Description varchar2 (25), Profit-percent
number (2, 2), Unit-measure varchar2 (10), Qty-on-hand number (8), Reorder-lvl number (8), Sell-price
number (8, 2), Cost-price number (8, 2));
Creating a Table from Another Table
Syntax
CREATE TABLE tablename [(columnname, columnname)] AS SELECT columnname, columnname FROM
tablename;
USE OF RELATIONAL DBMS PACKAGE FOR CLASS PROJECT
199
Note: If the source table, from which the target table is being created, has records in it then the target
table is populated with these records as well.
Example
Create table supplier-master from client-master. Select all fields and rename client-no as supplier-no and
name as supplier-name.
CREATE TABLE supplier-master (supplier-no, supplier-name, address1, address2, city, state, pincode,
remarks) AS SELECT client-no, name, address1, address2, city, state, pincode, remarks FROM client-master;
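The note above, that the target table inherits the source's records, can be verified with a small runnable sketch. SQLite is used here in place of Oracle; since SQLite's CREATE TABLE ... AS SELECT takes no column list, the renaming is done with AS aliases in the SELECT, and the client rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client_master (client_no TEXT, name TEXT, city TEXT);
    INSERT INTO client_master VALUES ('C01001', 'Prabhakar Rane', 'Bombay'),
                                     ('C02000', 'Vijay Kadam',    'Delhi');
    -- Create-from-select: the new table is populated with the source records.
    CREATE TABLE supplier_master AS
        SELECT client_no AS supplier_no, name AS supplier_name, city
        FROM client_master;
""")
rows = conn.execute(
    "SELECT supplier_no, supplier_name FROM supplier_master ORDER BY supplier_no"
).fetchall()
print(rows)
```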
Inserting Data into a Table
Syntax
INSERT INTO tablename [(columnname, columnname)]
VALUES (expression, expression);
Example
Insert a record in the client-master table with client-no = C02000, name = Prabhakar Rane, address1 = A-5,
Jay Apartments, address2 = Service Road, Vile Parle, city = Bombay, state = Maharashtra, pincode =
400057.
INSERT INTO client-master (client-no, name, address1, address2, city, state, pincode) VALUES
('C02000', 'Prabhakar Rane', 'A-5, Jay Apartments', 'Service Road, Vile Parle', 'Bombay', 'Maharashtra',
400057);
Note:
The character expressions must be in single quotes.
Inserting Data from Another Table
Syntax
INSERT INTO tablename SELECT columnname, columnname FROM tablename;
Example
Insert records in table supplier-master from table client-master.
INSERT INTO supplier-master SELECT client-no, name, address1, address2, city, state, pincode, remarks
from client-master;
Inserting Selected Data from Another Table
Syntax
INSERT INTO tablename SELECT columnname, columnname FROM tablename WHERE column =
expression;
Example
Insert records into supplier-master from table client-master where client-no = 'C01001';
INSERT INTO supplier-master SELECT client-no, name, address1, address2, city, state, pincode, remarks
FROM client-master WHERE client-no = 'C01001';
Updating the Contents of a Table
Syntax
UPDATE tablename Set columnname = expression, columnname = expression…..Where columnname =
expression;
Example
Update table client-master: set name to 'Vijay Kadam' and address1 to 'SCT Jay Apartments' where
client-no = 'C02000';
UPDATE client-master SET name = 'Vijay Kadam', address1 = 'SCT Jay
Apartments' WHERE client-no = 'C02000';
Deleting Rows from a Table
Syntax
DELETE FROM tablename;
Example
Delete all records from table client-master;
Delete from client-master;
Deletion of a Specified Number of Rows:
Syntax
DELETE FROM tablename WHERE search condition;
Example
Delete from table client-master the record where client-no = 'C02000';
DELETE FROM client-master WHERE client-no = 'C02000';
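The UPDATE and DELETE commands above can be exercised together in a short runnable sketch (SQLite in place of Oracle, underscores for hyphens, invented rows): the WHERE clause confines each command to the matching row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE client_master (client_no TEXT PRIMARY KEY, name TEXT, address1 TEXT);
    INSERT INTO client_master VALUES ('C01001', 'Prabhakar Rane', 'A-5, Jay Apartments'),
                                     ('C02000', 'Ajay Mehta',     'Service Road');
""")

# UPDATE ... WHERE changes only the matching row.
conn.execute("""UPDATE client_master SET name = 'Vijay Kadam',
                address1 = 'SCT Jay Apartments' WHERE client_no = 'C02000'""")
updated = conn.execute(
    "SELECT name FROM client_master WHERE client_no = 'C02000'").fetchone()[0]

# DELETE ... WHERE removes only the matching row; omitting WHERE would empty the table.
conn.execute("DELETE FROM client_master WHERE client_no = 'C02000'")
remaining = conn.execute("SELECT COUNT(*) FROM client_master").fetchone()[0]
print(updated, remaining)
```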
Viewing Data in a Table
Selection of All Rows
Syntax
SELECT * FROM tablename;
Example
Select all records from table client-master;
SELECT * FROM client-master;
Selection of Specified Columns
Syntax
SELECT columnname, columnname FROM tablename;
Examples
Select client-no and name from client-master
SELECT client-no, name FROM client-master;
Elimination of Duplicate Rows
Syntax
SELECT DISTINCT columnname, columnname FROM tablename;
Example
Select unique rows from client-master;
SELECT DISTINCT client-no, name FROM client-master;
Sorting Data
Syntax
SELECT columnname, columnname FROM tablename ORDER BY columnname, columnname;
Example
Select client-no, name, address1, address2, city, pincode from client-master sorted in the ascending order of
client-no;
SELECT client-no, name, address1, address2, city, pincode FROM client-master ORDER BY client-no;
Selection of Specified Rows
Syntax
SELECT columnname, columnname FROM tablename WHERE search condition;
Example
Select client-no, name from client-master where client-no is equal to ‘C01234’;
SELECT client-no, name from client-master where client-no = ‘C01234’;
Note: In the search condition all standard operators such as logic, arithmetic, predicates, etc. can be used.
Modifying the Structure of a Table
Adding New Columns
Syntax
ALTER TABLE tablename ADD (newcolumnname datatype (size), newcolumnname datatype
(size)...);
Example
Add fields, client-tel number (8), client-fax number (15) to table client-master;
ALTER TABLE Client-master
Add (Client-tel number (8), client-fax number (15));
Modifying Existing Columns
Syntax
ALTER TABLE tablename MODIFY (columnname newdatatype (size));
Example
Modify field client-fax to varchar2 (25);
ALTER TABLE client-master MODIFY (client-fax varchar2 (25));
Using the ALTER TABLE clause you cannot perform the following tasks:
Change the name of the table.
Change the name of the column.
Drop a column.
Decrease the size of a column if table data exists.
Dropping a Table
Syntax
DROP TABLE tablename;
Example
Delete table client-master;
DROP TABLE client-master;
Data Constraints
The NOT NULL Constraint
Example
CREATE TABLE client-master
(Client-no varchar2 (6) NOT NULL, name varchar2 (20) NOT NULL, address1 varchar2 (30) NOT
NULL, address2 varchar2 (30) NOT NULL, city varchar2 (15), state varchar2 (15), pincode number (6),
remarks varchar2 (60), bal-due number (10, 2));
The PRIMARY KEY Constraint
A primary key is one or more columns in a table used to uniquely identify each row in the table. Primary key
values must not be null and must be unique across the column.
A multicolumn primary key is called a composite primary key. The only function that a primary key
performs is to uniquely identify a row; thus, if one column is used, it is just as good as if multiple columns
are used. Multiple columns (composite keys) are used only when the system design requires a primary key
that cannot be contained in a single column.
Example
Create client-master where client-no is the primary key.
CREATE TABLE client-master (client-no varchar2 (6) PRIMARY KEY, name varchar2 (20), address1
varchar2 (30), address2 varchar2 (30), city varchar2 (15), state varchar2 (15), pincode number (6),
remarks varchar2 (60), bal-due number (10, 2));
Create a sales-order-details table where

Column Name    Datatype    Size    Attributes
S-order-no     Varchar2    6       Primary key
Product-no     Varchar2    6       Primary key
Qty-ordered    Number      8
Qty-disp       Number      8
Product-rate   Number      8, 2
CREATE TABLE sales-order-details (S-order-no varchar2 (6), Product-no varchar2 (6), Qty-ordered
number (8), Qty-disp number (8), Product-rate number (8, 2), PRIMARY KEY (S-order-no, Product-no));
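A composite primary key can be watched doing its one job, rejecting a duplicate (S-order-no, product-no) pair, in the sketch below (SQLite in place of Oracle, underscores for hyphens, invented order rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_order_details (
        s_order_no  TEXT,
        product_no  TEXT,
        qty_ordered INTEGER,
        PRIMARY KEY (s_order_no, product_no))
""")
conn.execute("INSERT INTO sales_order_details VALUES ('O19001', 'P00001', 4)")
# The same product on a different order is fine: only the pair must be unique.
conn.execute("INSERT INTO sales_order_details VALUES ('O19002', 'P00001', 2)")

# Repeating an existing (order, product) pair violates the composite key.
try:
    conn.execute("INSERT INTO sales_order_details VALUES ('O19001', 'P00001', 9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```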
The UNIQUE Constraint
A unique key is similar to a primary key, except that the purpose of a unique key is to ensure that
information in the column for each record is unique, as with telephone or driver's license numbers.
A table may have multiple unique keys.
Example
Create table client Master with unique constraint on column client-no.
Unique constraint defined at the column level:
CREATE TABLE client-master
(Client-no Varchar2 (6) CONSTRAINT num-key UNIQUE,
Name Varchar2 (20),
Address1 Varchar2 (30),
Address2 Varchar2 (30),
City Varchar2 (15),
State Varchar2 (15),
Pincode Number (6));
Unique constraint defined at the table level:
CREATE TABLE client-master
(Client-no Varchar2 (6), Name Varchar2 (20),
Address1 Varchar2 (30), Address2 Varchar2 (30),
City Varchar2 (15), State Varchar2 (15),
Pincode Number (6),
CONSTRAINT num-key UNIQUE (client-no));
Assigning a Default Value
At the time of cell creation, a default value can be assigned to it. When the user loads a record with
values and leaves this cell empty, the DBMS will automatically load the cell with the default value
specified. The datatype of the default value should match the datatype of the column.
Example
Create sales-order table where:

Column Name    Datatype    Size    Attributes
S-order-no     Varchar2    6       Primary key
S-order-date   Date
Client-no      Varchar2    6
Dely-Add       Varchar2    25
Salesman-no    Varchar2    6
Dely-type      Char        1       Delivery: Part (P)/Full (F), Default 'F'
Dely-date      Date
Order-status   Varchar2    10
CREATE TABLE sales-order
(S-order-no Varchar2 (6) PRIMARY KEY,
S-order-date Date,
Client-no Varchar2 (6),
Dely-Add Varchar2 (25),
Salesman-no Varchar2 (6),
Dely-type Char (1) DEFAULT 'F',
Dely-date Date,
Order-status Varchar2 (10));
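The default-value behaviour described above can be sketched as follows (SQLite in place of Oracle, a trimmed-down sales-order schema with invented order numbers): a row inserted without dely-type receives the declared default 'F', while an explicit value overrides it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_order (
        s_order_no TEXT PRIMARY KEY,
        dely_type  TEXT DEFAULT 'F')
""")
# dely_type omitted: the engine fills in the declared default.
conn.execute("INSERT INTO sales_order (s_order_no) VALUES ('O19001')")
# An explicit value still overrides the default.
conn.execute("INSERT INTO sales_order VALUES ('O19002', 'P')")
rows = conn.execute(
    "SELECT s_order_no, dely_type FROM sales_order ORDER BY s_order_no").fetchall()
print(rows)
```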
The CHECK Constraint
Use the CHECK constraint when you need to enforce integrity rules that can be evaluated based on a
logical expression. Never use CHECK constraints if the constraint can be defined using the not null,
primary key or foreign key constraint.
Following are a few examples of CHECK constraints.
A CHECK constraint on the client-no column of the client-master so that client-no value starts with
‘C’.
A CHECK constraint on name column of the client-master so that the name is entered in upper case.
A CHECK constraint on the city column of the client-master so that only the cities “BOMBAY”,
“NEW DELHI” “MADRAS” and “CALCUTTA” are allowed.
Example
CREATE TABLE client-master
(Client-no varchar2 (6) CONSTRAINT k-client
CHECK (Client-no LIKE 'C%'),
Name varchar2 (20) CONSTRAINT k-name
CHECK (Name = UPPER(Name)),
Address1 varchar2 (30),
Address2 varchar2 (30),
City varchar2 (15) CONSTRAINT k-city
CHECK (City IN ('NEW DELHI', 'BOMBAY', 'CALCUTTA', 'MADRAS')));
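The three CHECK rules above can be exercised in a runnable sketch (SQLite in place of Oracle, invented client rows): one conforming insert succeeds, and each of the three violations is rejected by the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE client_master (
        client_no TEXT CHECK (client_no LIKE 'C%'),
        name      TEXT CHECK (name = UPPER(name)),
        city      TEXT CHECK (city IN ('NEW DELHI', 'BOMBAY', 'CALCUTTA', 'MADRAS')))
""")
conn.execute("INSERT INTO client_master VALUES ('C01001', 'RAVI', 'BOMBAY')")

rejected = 0
for bad_row in (('X01001', 'RAVI', 'BOMBAY'),   # client_no does not start with 'C'
                ('C01002', 'Ravi', 'BOMBAY'),   # name is not in upper case
                ('C01003', 'RAVI', 'POONA')):   # city not in the allowed list
    try:
        conn.execute("INSERT INTO client_master VALUES (?, ?, ?)", bad_row)
    except sqlite3.IntegrityError:
        rejected += 1
print(rejected)
```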
The FOREIGN KEY Constraint
Foreign keys represent relationships between tables. A foreign key is a column (or a group of columns)
whose values are derived from the primary key of the same or some other table.
The existence of a foreign key implies that the table with the foreign key is related to the primary key table
from which the foreign key is derived. A foreign key must have corresponding primary key value in the
primary key table to have a meaning.
For example, the S-order-no column is the primary key of table sales-order. In table sales-order-details, S-order-no is a foreign key that references the S-order-no values in table sales-order.
Example: Create table sales-order-details with primary key as S-order no and product-no and foreign key as
S-order-no referencing column S-order-no in the sales-order table.
CREATE TABLE sales-order-details
(S-order-no Varchar2 (6) REFERENCES sales-order,
Product-no Varchar2 (6),
Qty-ordered Number (8),
Product-rate Number (8, 2),
PRIMARY KEY (S-order-no, Product-no));
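A foreign key's requirement that every child value match a parent key can be seen in the sketch below (SQLite in place of Oracle; note that SQLite enforces foreign keys only after the PRAGMA shown, and the order numbers are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
conn.executescript("""
    CREATE TABLE sales_order (s_order_no TEXT PRIMARY KEY);
    CREATE TABLE sales_order_details (
        s_order_no TEXT REFERENCES sales_order,
        product_no TEXT,
        PRIMARY KEY (s_order_no, product_no));
    INSERT INTO sales_order VALUES ('O19001');
""")

# A detail row whose s_order_no exists in the parent table is accepted.
conn.execute("INSERT INTO sales_order_details VALUES ('O19001', 'P00001')")

# An orphan detail row has no matching primary key value, so it is rejected.
try:
    conn.execute("INSERT INTO sales_order_details VALUES ('O99999', 'P00001')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```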
Defining Integrity Constraints in the ALTER TABLE Command
You can also define integrity constraints using the constraint clause in the ALTER TABLE command. A few
examples are given below.
1.
Add PRIMARY KEY constraint on column supplier-no in table SUPPLIER-MASTER.
ALTER TABLE Supplier-Master
ADD PRIMARY KEY (Supplier-no);
2.
Add FOREIGN KEY constraint on column S-order-no in table sales-order-details referencing table
Sales-order, modify column qty-ordered to include NOT NULL constraint.
ALTER TABLE Sales-order-details
ADD CONSTRAINT order-key
FOREIGN KEY (S-order-no) REFERENCES Sales-order
MODIFY (qty-ordered number (8) NOT NULL);
Dropping Integrity Constraints in the ALTER TABLE Command
You can drop an integrity constraint if the rule that it enforces is no longer true or if the constraint is no
longer needed. Drop the constraint using the alter table command with the DROP clause. The following
example illustrates the dropping of integrity constraints.
1.
Drop the PRIMARY KEY constraint from supplier-master.
ALTER TABLE Supplier-Master
DROP PRIMARY KEY;
2.
Drop FOREIGN KEY constraint on columns product-no in table Sales-order-details;
ALTER TABLE Sales-order-details
DROP CONSTRAINT Product-key;
Arithmetic Operators

+     Addition
-     Subtraction
*     Multiplication
/     Division
**    Exponentiation
()    Enclosed operation
Example
Select Product-no, description and compute sell-price * 0.05 and Sell-price * 1.05 for each row retrieved.
Select product-no, description, Sell-price * 0.05, Sell-price * 1.05 from Product-master;
Here, Sell-price * 0.05 and Sell-price * 1.05 are not columns in the table product-master, but are calculations
done on the contents of the column sell-price of the table product-master. The output will be shown as
follows.
Product No.    Description    Sell-Price * 0.05    Sell-Price * 1.05
P00001         Floppy         25                   525
P03453         Mouse          50                   1050
P07865         Keyboard       150                  3150
Renaming Columns
The default output column names can be renamed by the user if required.
Syntax
SELECT columnname result-columnname, columnname result-columnname FROM tablename;
Example
Select Product-no, description and compute sell-price * 0.05 and sell-price * 1.05 for each row retrieved.
Rename sell-price * 0.05 as Decrease and sell-price * 1.05 as New-price.
SELECT Product-no, Description,
Sell-price * 0.05 Decrease,
Sell-price * 1.05 New-price
FROM Product-master;
The output will be
Product No.    Description    Decrease    New Price
P00001         Floppy         25          525
P03453         Mouse          50          1050
P07865         Keyboard       150         3150
Logical Operators
The logical operators that can be used in SQL sentences are AND, OR and NOT.
Example
1.
Select client information like client-no, name, address 1, address 2, city and pincode for all the clients
in ‘Bombay’ or ‘Delhi’;
SELECT client-no, name, address1, address2, city, pincode
FROM client-master
WHERE City = ‘BOMBAY’ or City = ‘DELHI’;
2.
Select Product-no, description, profit-percent, sell price where profit-percent is between 10 and 20
both inclusive.
SELECT Product-no, description, profit-percent, Sell-price
FROM product-Master
WHERE profit-Percent > = 10 AND
Profit-Percent < = 20;
Example
Select product-no, description, profit-percent, sell-price where profit-percent is not between 10 and 20;
SELECT product-no, description, profit-percent, sell-price
FROM product-master
WHERE profit-percent NOT BETWEEN 10 AND 20;
Pattern Matching with LIKE
For character datatypes, % matches any string of characters and _ (underscore) matches any single
character.
Example
1.
Select supplier_name from supplier_master where the first two characters of name are ‘ja’;
SELECT supplier_name
FROM supplier_master
WHERE supplier_name LIKE ‘ja%’;
2.
Select supplier_name from supplier_master where the second character of the name is 'r' or 'h';
SELECT supplier_name
FROM supplier_master
WHERE supplier_name LIKE '_r%' OR supplier_name LIKE '_h%';
3.
Select supplier_name, address1, address2, city and pincode from supplier_master where the name is 3
characters long and the first two characters are 'ja';
SELECT supplier_name, address1, address2, city, pincode
FROM supplier_master
WHERE supplier_name LIKE 'ja_';
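The three LIKE patterns above ('ja%', a second-character match, and a fixed-length 'ja_') can be tried against a handful of invented supplier names (SQLite in place of Oracle):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier_master (supplier_name TEXT)")
conn.executemany("INSERT INTO supplier_master VALUES (?)",
                 [("jay",), ("jatin",), ("arun",), ("shah",), ("jai",)])

def names(pattern):
    """Supplier names matching the given LIKE pattern, sorted for display."""
    return sorted(row[0] for row in conn.execute(
        "SELECT supplier_name FROM supplier_master WHERE supplier_name LIKE ?",
        (pattern,)))

print(names("ja%"))  # first two characters are 'ja'
print(names("_r%"))  # second character is 'r'
print(names("ja_"))  # exactly three characters, starting with 'ja'
```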
•
Command for creating a table is CREATE TABLE tablename (columnname datatype (size),
columnname datatype (size));
•
Command for inserting records into a table is INSERT INTO tablename [(columnname,
columnname)] VALUES (expression, expression);
•
Command to update a record is UPDATE tablename SET columnname = expression, columnname =
expression... WHERE columnname = expression;
•
Command to delete a record is DELETE FROM tablename;
I. True or False
1.
Values of Char Datatype are fixed length character strings of maximum length 255
characters.
2.
Using the alter table clause you can change the name of the table.
3.
You can define integrity constraints using the constraint clause in the ALTER TABLE
command.
II. Fill in the Blanks
1.
The ______________Datatype is used to store numbers (fixed or floating point).
2.
INSERT INTO tablename [columnname, columnname)]
___________(expression, expression);
3.
_____________keys represent relationships between tables.
4.
DELETE ________table name
5.
You can drop an __________ ___________if the rule that it enforces is no longer true or if
the constraint is no longer needed.
Answers
I. True or False
1.
True
2.
False
3.
True
II. Fill in the Blanks
1.
NUMBER
2.
VALUES
3.
Foreign
4.
FROM
5.
integrity constraint
I. True or False
1.
LONG is used for variable-length character strings of up to 65,535 characters.
2.
The syntax of create table command is CREATE TABLE (Columname datatype (size),
column name datatype (size);
3.
Using the alter table clause you cannot change the name of the column.
4.
Use the CHECK constraint when you need to enforce integrity rules that can be evaluated
based on a logical expression.
5.
At the time of cell creation, a default value cannot be assigned to it.
II. Fill in the Blanks
1.
The standard date format for Oracle is ______________.
2.
LONG is used for variable-length character strings of up to ____________ characters.
3.
CHAR: values of this datatype are fixed-length character strings of maximum length
__________ characters.
4.
VARCHAR: values of this datatype are variable-length character strings of maximum length
______________.
5.
Numbers of virtually any magnitude may be stored, up to ___________ digits of precision.
Exercises
1.
Create table supplier-master from client-master. Select all fields and rename client-no as supplier-no
and name as supplier-name.
2.
Insert a record in the client-master table as client-no=C03000, name=Peter, address1=A-S, Jay
Apartments, address2=Service Road, Vile Parle, city=Bangalore, state=Karnataka,
pincode=300056.
3.
Insert records into supplier-master from table client-master where client-no=C03000.
4.
Update table client-master: set name='Peterwiley' and address1='SXT Jay Apartments' where
client-no='C03000'.
1.
Discuss each of the following terms:
a)
Data
b)
Field
c)
Record
d)
File
2.
What do you suppose to be the difference in requirements for database management systems
intended for use with general-purpose business and administrative systems and those intended for
use with each of the following:
a)
Science Research
b)
Library Retrieval Systems
c)
Inventory Management
d)
Computer Aided Software Design
e)
Personal Management System
3.
What is the difference between data and information?
4.
What is the significance of Information Systems from the Manager’s point of view?
5.
Highlight the difference in the File Management and Database Management approach.
6.
Discuss the advantages and disadvantages of Database approach.
7.
What do you think should the major objectives of a Database be?
8.
What is Data Independence and why do you think is it important?
9.
What do you think is the significance of information for an organisation?
10.
What is meant by Data Processing?
11.
Describe the Storage hierarchy.
12.
Illustrate the benefits of maintaining records on computers.
13.
Compile a list of applications that exist around you, which you think are directly dependent on the
Database technology.
14.
Describe how would you go about collecting information for the following business systems:
a)
Marketing Information Systems
b)
Accounts Receivable System
c)
Purchase Order Processing Systems
d)
Describe the various input, processes and output involved in the above mentioned business
systems.
15.
Collect some real-life examples of how Database technology has helped organisations to grow
better and improve their efficiency and customer service.
16.
Describe the Database System Life Cycle. What is functional operation of each of these stages?
17.
Why do you think proper analysis of the required database is important?
18.
Make a checklist of items that you would consider while you are in the design stage.
19.
What is the Operation and Maintenance stage all about? Describe the role of the DBA in this stage.
20.
Describe the different components of a typical Database Management Systems.
21.
Identify the different files involved in the following business processing system and detail out their
database structure:
a)
Order Processing System
b)
Accounts Payable System
c)
Inventory Management System
d)
Student Information System.
22.
Perform a research on the present set of tools and techniques offered by major RDBMS companies
like Oracle and Sybase.
23.
Describe the following terms:
a)
Schema
b)
Data Dictionary
c)
End-User
d)
Three Levels of Abstraction
24.
Having understood the structure of a typical DBMS package, highlight the role of each individual
component.
25.
Compare the three functional approaches to database design and implementation—Hierarchical,
Network and Relational.
26.
What are the distinct advantages of Hierarchical and Network Database?
27.
Describe the disadvantages of Relational Databases.
28.
To acquire knowledge about latest developments in the area of Relational Databases and reviews
about different database packages, read the latest reviews and articles. Document this information
and draw your comparison chart.
29.
Discuss the benefits of Data Dictionary.
30.
Why is the relational model becoming dominant?
APPENDIX
213
31.
Describe the basic features of the Relational Database model and discuss their importance to the
end user and the designer.
32.
Describe the architecture of a DBMS. What is the importance of Logical and Physical Data
Independence?
33.
Explain the following with relevant examples:
Data Description Language , Device Media Control Language, Data Manipulation Language.
34.
Compare the Insertion and Deletion functions between the three models.
35.
The relational model is one of the major inventions of information technology. Elaborate.
36.
Describe the following terms
i.
Primary Key
ii.
Candidate Key
iii.
Foreign Key
iv.
Cardinality
v.
Degree
vi.
Domain
37.
Explain the relevance and significance of each of the Twelve Codd rules, with appropriate relevant
examples.
38.
In the context of E-R diagrams, diagrammatically represent at least five examples of
each of the following types of relationships:
a)
One – One
b)
One – Many
c)
Many – Many
39.
Describe the technique of Information Modeling and the art of drawing E-R diagrams. Take up any
example of business system that exists around you and try modelling it using this technique.
40.
What is the concept of Normalisation? Why do you think is it required?
41.
Document the features of Oracle latest products—Oracle 8.0, Developer 2000 and Designer 2000.
42.
Why do you think the database needs to be protected?
43.
Highlight the differences between Security and Integrity.
44.
How are the concepts of Integrity maintained in Relational Databases?
45.
What are the major threats and security mechanisms to be adopted against them?
46. Discuss some of the security and integrity functions provided by Oracle.
47. Describe the following terms:
a) Access Control
b) Audit Trail
c) Revoke and Grant
d) Hacking
e) Virus
f) Failure Recovery
g) Backup Techniques
h) Administrative Controls
48. Describe the different types of database failures against which the database should be guarded, and the respective recovery technique to be adopted against each.
49. Describe in detail the role of the Database Administrator.
50. How should one plan for Audit and Control Mechanisms?
51. Describe the technique of Encryption as a control mechanism.
52. What is the difference between deadlock prevention and deadlock resolution?
53. What is Transaction Integrity? Why is it important?
54. HallMart Department Stores runs a multi-user DBMS on a local area network file server. Unfortunately, at present the DBMS does not enforce concurrency control. One HallMart customer had a balance of $250.00 when the following three transactions were processed at about the same time:
a) Payment of $250.00
b) Purchase on credit of $100.00
c) Merchandise return (credit) of $50.00
Each of the three transactions read the customer record when the balance was $250.00 (that is, before the other transactions were completed). The updated customer record was returned to the database in the order shown above.
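The anomaly in this scenario is the classic lost update. A minimal Python sketch (illustrative only, not part of the original problem) shows how three transactions that each read the same starting balance simply overwrite one another's writes:

```python
# Sketch of the lost-update anomaly: without concurrency control, each
# transaction reads the same starting balance, computes its own result,
# and the last write overwrites the others.
balance = 250.00

# Each transaction's intended change: payment and return reduce the
# balance, a purchase on credit increases it.
deltas = [-250.00, +100.00, -50.00]

# Uncontrolled interleaving: all three read 250.00 before any write lands.
reads = [balance] * len(deltas)
for read_value, delta in zip(reads, deltas):
    balance = read_value + delta  # each write clobbers the previous one

print(f"actual balance: {balance:.2f}")   # last write wins
serial = 250.00 + sum(deltas)             # what serial execution would give
print(f"correct serial balance: {serial:.2f}")
```

Running the writes in the order shown leaves only the last transaction's effect in the database, whereas a serial schedule would apply all three changes.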
i) What was the actual balance for the customer after the last transaction was completed?
ii) What balance should have resulted from processing these three transactions?
55. For each of the situations described below, indicate which of the following security measures is most appropriate:
a) Authorisation
b) Encryption
c) Authentication schemes
i) A national brokerage firm uses an electronic funds transfer system (EFTS) to transmit sensitive financial data between locations.
ii) A manufacturing firm uses a simple password system to protect its database, but finds it needs a more comprehensive system to grant different privileges (such as read versus create or update) to different users.
iii) A university has experienced considerable difficulty with unauthorised users who access files and databases by appropriating passwords from legitimate users.
56. Customcraft, Inc., is a mail-order firm specialising in the manufacture of stationery and other paper products. Annual sales of Customcraft are $25 million and are growing at a rate of 15% per year. After several years of experience with conventional data processing systems, Customcraft has decided to organise a data administration function. At present, they have four major candidates for the data administrator position:
a. John Bach, a senior systems analyst with three years' experience at Customcraft, who has attended recent seminars in structured systems design and database design.
b. Margaret Smith, who has been production control manager for the past two years, after a year's experience as programmer/analyst at Customcraft.
c. William Rogers, a systems programmer with extensive experience with DB2 and Oracle, the two database management systems under consideration at Customcraft.
d. Ellen Reddy, who is currently database administrator with a medium-size electronics firm in the same city as Customcraft.
Based on this limited information, rank the four candidates for the data administrator position, and state your reasons.
57. Referring to the Customcraft case in the previous problem, rank the four candidates for the position of database administrator. State your reasons.
58. Visit an organisation that has implemented a database approach. Evaluate each of the following:
a. The organisational placement of the data administration function
b. The functions performed by data administration and database administration
c. The background of the person chosen as head of data administration
d. The status and usage of an information repository (passive, active-in-design, active-in-production)
e. The procedures that are used for security, concurrency control, and backup and recovery
59. Find a recent article describing an incident of computer crime. Was there evidence of inadequate security in this incident? What security measures described in this chapter might have been instrumental in preventing this incident?
60. A primary key is a minimal superkey.
a) True
b) False
c) Partially true
d) Inadequate data
61. The ................................. statement is used to modify one or more records in a specified relation.
a) Update
b) Alter
c) Add, delete, modify
d) Both a and b
62. A database is said to be fully redundant when
a) no replicates of the fragments are allowed
b) complete database copies are distributed at all sites
c) only certain fragments are replicated
d) None of the above
63. In order to modify data, SQL uses update statements; these include
a) Insert and Update
b) Modify, Insert and Delete
c) Insert, Update and Delete
d) None of the above
64. The Network Model is usually used to represent a
a) One-to-One relationship
b) One-to-Many relationship
c) Many-to-Many relationship
d) None of the above
65. If all non-key attributes are dependent on all the attributes of the key in a relational database, the relation is in
a) Second Normal Form
b) Third Normal Form
c) Fourth Normal Form
d) None of the above
66. The stored value of the attribute is referred to as an ..........................................
a) Attribute value
b) Stored field
c) Field
d) All of the above
67. In a Client-Server Architecture, one of the possible choices for the back end could be
a) Oracle
b) Sybase
c) FoxPro
d) All of the above
68. A user in a DDBMS environment does not know the location of the data; this is called
a) Location transparency
b) Replication transparency
c) Fragmentation transparency
d) None of the above
69. The .................................. option specifies that only one record can exist at any time with a given value for the column(s) specified in the statement to create the index.
a) Distinct
b) Unique
c) Cluster
d) Both a and b
70. An existing relation can be deleted from the database by the ................................... SQL statement.
a) delete table tablename
b) delete relation relationname
c) drop table tablename
d) None of the above
71. The objective of Normalisation is
a) to reduce the number of tables in a database
b) to reduce the number of fields in a record
c) to increase the speed of retrieving data
d) None of the above
72. A transaction can end in
a) Successful termination
b) Suicidal termination
c) Murderous termination
d) All of the above
73. An advantage of Distributed Data Processing is
a) Sharing
b) Availability and Reliability
c) Incremental Growth and Parallel Evaluation
d) All of the above
74. Which of the following is not a network topology?
a) Star Topology
b) Bus Topology
c) Synchronous Topology
d) Mesh Topology
75. A ................................. statement is used to delete one or more records from a relation.
a) Alter
b) Modify
c) Drop
d) Delete
76. In a Client-Server Architecture, one of the possible choices for the front end could be
a) Oracle
b) Sybase
c) Power Builder
d) All of the above
77. Which of the following truly represents a binary relationship?
a) 1:1
b) 1:M
c) M:N
d) All of the above
78. The advantages of local data availability in a distributed environment are
a) Access of a non-update type is cheaper
b) Even if access to a remote site is not possible, access to local data is still available
c) Cost and complexity of updates increase
d) Both a and b
79. The characteristics of Distributed Databases are
a) Location Transparency
b) Fragmentation Transparency
c) Replication Transparency and Update Transparency
d) All of the above
80. ................................... is the simplest concurrency control method.
a) Locking
b) Fragmentation
c) Replication
d) Isolation
81. Which of the following is not an SQL built-in function?
a) Count
b) Sum
c) Min
d) Mean
82. System catalogues are used to maintain
a) metadata on database relations
b) hardware information
c) system performance
d) both b and c
83. A Data Dictionary doesn't provide information about
a) where data is located
b) the size of the disk storage device
c) who owns or is responsible for the data
d) how the data is used
84. Which of the following contains a complete record of all activities that affect the contents of a database during a certain period of time?
a) Report writer
b) Transaction Manager
c) Transaction log
d) Database Administrator
85. Which of the following is a traditional data model?
a) Relational
b) Network
c) Hierarchical
d) All of the above
86. A schema describes
a) data elements and attributes
b) records and relationships
c) size of the disk storage and its usage by the database
d) both a and b
87. Security in a database involves
a) Policies to protect data and ensure that it is not accessed or deleted without proper authorisation
b) Appointing a Security Manager
c) Mechanisms to protect the data and ensure that it is not accessed or deleted without proper authorisation
d) both a and b
88. A site needing remote catalogue information requests it and stores it for later use. This scheme is called
a) remote cataloguing
b) caching the remote catalogue
c) remote catalogue caching
d) None of the above
89. ..................................... data models were developed to organise and represent general knowledge; they are also able to express greater interdependence among entities of interest.
a) Network
b) Semantic
c) Hierarchical
d) All of the above
90. The smallest amount of data can be stored in a
a) Bit
b) Byte
c) Nibble
d) Record
91. The two important dimensions for the protection of the data in a database are
a) Confidentiality, and protection from accidental and malicious corruption and destruction
b) Protection and security locks
c) Data compression and encryption techniques
d) None of the above
92. Which of the following is not true about a primary key?
a) It is a unique entity
b) It could be one attribute or a combination of attributes
c) It can be null
d) Both a and b
93. An alternate key is
a) A primary key
b) All candidate keys except the primary key
c) A combination of one or more keys
d) None of the above
94. An Entity-Relationship Diagram
a) Describes the entities, their attributes and their relationships
b) Describes the scope of the project
c) Describes the level of participation of each entity
d) None of the above
95. The three levels of architecture of a DBMS are
a) DBA, Internal schema and User
b) User, Application and Transaction Manager
c) Database, Application and User
d) External level, Conceptual level and Internal level
96. Which of the following is not a component of a DBMS?
a) Data definition language
b) Data manipulation language
c) Query processor
d) All of the above
97. Metadata is
a) Data about data
b) Data into data
c) Description of data and users
d) None of the above
98. An attribute is a
a) Name of a field
b) Property of a given entity
c) Name of the database
d) All of the above
99. The way a particular application views the data from the database, depending on the user requirement, is a
a) Subschema
b) Schema
c) Metadatabase
d) None of the above
100. Which of the following is not a characteristic of the relational database model?
a) Logical relationships
b) Tables
c) Relationships
d) Tree-like structure
101. The set of possible values that a given attribute can have is called its
a) Entity set
b) Domain
c) Object property
d) None of the above
102. A ................................. key is an attribute or combination of attributes in a database that uniquely identifies an instance.
a) Superkey
b) Primary key
c) Candidate key
d) Both a and b
103. A .................................... is a collection of identical record type occurrences pertaining to an entity set, and is labelled to identify the entity set.
a) File
b) Database
c) Field
d) None of the above
104. A ............................... represents a different perspective of a base relation or relations.
a) Virtual table
b) Table
c) Tuple
d) View
105. The mapping between the Conceptual and Internal views is provided by
a) DBA
b) DBMS
c) Operating System
d) Both b and c
106. Disadvantages of Database Processing are
a) Size, complexity, cost and failure impact
b) Needs highly skilled staff
c) Needs expensive hardware
d) Integrity
107. Which of the following is not an advantage of database processing?
a) Data Redundancy
b) Data Sharing
c) Physical and Logical Independence
d) None of the above
108. What is Data Integrity?
a) Protection of the information contained in the database against unauthorised access, modification or destruction
b) The culmination of the administrative policies of the organisation
c) The mechanism that is applied to ensure that the data in the database is correct and consistent
d) None of the above
109. Which of the following is not an example of an RDBMS?
a) Oracle
b) Informix
c) Access
d) Focus
110. Advantages of a Distributed Database are
a) Data Sharing and Distributed Control
b) High Software Development Cost
c) Reliability and Availability
d) Both a and b
111. The four levels of defence generally recognised for database security are
a) Human Factor, Legal Laws, Administrative Controls and Security Policies
b) Human Factor, Authorisation, Encryption and Compression
c) Human Factor, Physical Security, Administrative Controls, and DBMS and OS Security Mechanisms
d) None of the above
112. Content-dependent access control means
a) A user is allowed access to everything unless access is explicitly denied
b) Access is allowed to those data objects whose names are known to the user
c) The concept of least privilege extended to take into account the contents of the database
d) None of the above
113. A ......................... subset view slices the table horizontally.
a) Row
b) Column
c) Join
d) Both a and b
114. The SQL statement to give a privilege to a user is
a) Grant
b) Revoke
c) Select
d) Update
115. The Commit operation is
a) The start of a new operation
b) A signal of an unsuccessful end of an operation
c) The generation of a report
d) None of the above
116. Large collections of files are called
a) Fields
b) Records
c) Databases
d) Record
117. A Subject
a) Is something that needs protection
b) Is an active element in the security mechanism; it operates on objects
c) Is allowed access only to that portion of the database defined by the user's view
d) Both a and c
118. Identification and Authentication can be enforced through
a) Something you have
b) Someone you are
c) Something you know
d) All of the above
119. A Self Join statement is used to
a) Join two records
b) Join two tables
c) Join a table with itself
d) None of the above
120. A .......................... is a well-defined collection of objects.
a) Set
b) Member
c) Field
d) File
121. Duplicate tuples are not permitted in a relation.
a) False
b) Inadequate data
c) True
d) Either a or c
122. A single relation may be stored in more than one file (some attributes in one file and the rest in others); this is called
a) Distribution
b) Cardinality
c) Fragmentation
d) None of the above
123. The number of tuples in a table is called its
a) Cardinality
b) Degree
c) Count
d) None of the above
124. Which of the following is not true about a Foreign Key?
a) It is the primary key of some other table
b) It cannot be NULL
c) It is used to maintain referential integrity
d) It should always be numeric
125. The ........................ option is used in a Select statement to eliminate duplicate tuples in the result.
a) Unique
b) Distinct
c) Exists
d) None of the above
126. The number of columns in a table is called its
a) Cardinality
b) Degree
c) Tuple
d) None of the above
127. For a table of 4 rows by 3 columns, which of the following is true?
a) Cardinality is 12
b) Degree is 4
c) The number of rows is 3
d) None of the above
128. Database management systems are intended to
a) establish relationships among records in different files
b) manage file access
c) maintain data integrity
d) all of the above
129. Which of the following hardware components is the most important to the operation of a database management system?
a) High-resolution video display
b) High-speed printer
c) High-speed, large-capacity disk
d) All of the above
130. Which of the following are functions of a DBA?
a) Database design
b) Backing up the data
c) Maintenance and administration of data
d) All of the above
131. Which of the following is a serious problem of file management?
a) Lack of data independence
b) Data redundancy
c) Non-shareability
d) All of the above
132. The database environment has all of the following components except
a) Users
b) Separate files
c) Metadata
d) DBA
133. A ............................................. is used to represent an entity type in a DBTG model.
a) Owner record type
b) Record type
c) Member record type
d) None of the above
134. The ......................................... model is also called the Navigational Model.
a) Network
b) Hierarchical
c) Relational
d) Both a and b
135. In .................................................. file organisation, the physical location of a record is based on some relationship with its primary key value.
a) Index-sequential
b) Sequential
c) Direct
d) None of the above
136. Atomic domains are sometimes also referred to as
a) Composite domains
b) Structured domains
c) Application-dependent domains
d) Application-independent domains
137. Address is an example of a
a) Composite domain
b) Structured domain
c) Both a and b
d) None of the above
138. Important properties associated with a key are that it
a) Should be from composite domains
b) Should be unique and numeric
c) Should provide unique identification and non-redundancy
d) None of the above
139. The ........................................ operation removes common tuples from the first relation.
a) Difference
b) Intersection
c) Union
d) Both a and b
140. The ........................................ operation is also called the Restriction operation.
a) Difference
b) Selection
c) Union
d) Both a and b
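The two relational algebra operations asked about in the last two questions can be sketched with Python sets standing in for relations (the relations and tuples here are invented for illustration):

```python
# Sketch of relational algebra Difference and Selection (Restriction)
# using Python sets of tuples as stand-ins for relations.
r = {(1, "Asha"), (2, "Ravi"), (3, "Meena")}
s = {(2, "Ravi")}

# Difference: tuples of r that do not appear in s.
difference = r - s

# Selection (Restriction): only the tuples satisfying a predicate.
selection = {t for t in r if t[0] > 1}

print(sorted(difference))  # [(1, 'Asha'), (3, 'Meena')]
print(sorted(selection))   # [(2, 'Ravi'), (3, 'Meena')]
```

Difference removes from the first relation the tuples it has in common with the second, while Selection restricts a single relation to the rows satisfying a condition.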
141. Two tuples of a table can have identical data.
a) True
b) False
c) Partially False
d) Inadequate data
142. Data Security threats include
a) Hardware failure
b) Privacy invasion
c) Fraudulent manipulation of data
d) All of the above
143. The ...................................... function of SQL allows data to be classified into categories.
a) Having Count
b) Count(*)
c) Sum
d) Group by
144. Which of the following is true about the updating of views?
a) Any update operation through a view requires that the user has appropriate authorisation
b) The definition of the view must involve a single relation and include its primary key
c) The value of a nonprime attribute can be modified
d) All of the above
145. A ......................................... is a program unit whose execution may change the contents of a database.
a) Update statement
b) Transaction
c) Event
d) None of the above
146. Which of the following is not a property of a Transaction?
a) Atomicity
b) Consistency
c) Durability
d) Inadequate data
147. Which of the following is not recorded in a transaction log?
a) The size of the database before and after the transaction
b) The record identifiers, which include the who and where information
c) The updated values of the modified records
d) All of the above are recorded
148. A scheme called ......................................... is used to limit the volume of log information that has to be handled and processed in the event of a system failure.
a) Write-ahead log
b) Checkpoint
c) Volume-limit transaction log
d) None of the above
149. Which of the following is also called the Read lock?
a) Non-exclusive lock
b) Exclusive lock
c) Shared lock
d) Non-shared lock
150. In the ...................................... phase, the number of locks increases from zero to the maximum.
a) Two-phase locking
b) Growing phase
c) Contracting phase
d) None of the above
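The locking questions above can be illustrated with a minimal sketch of the two-phase locking protocol (the class and item names are invented for illustration): a transaction first acquires all its locks (growing phase), then releases them (shrinking phase), and may never acquire a lock after its first release.

```python
# Sketch of two-phase locking: the lock count grows from zero to a
# maximum, and once any lock is released no new lock may be taken.
class TwoPhaseTransaction:
    def __init__(self):
        self.held = set()
        self.shrinking = False   # once True, no new locks may be acquired
        self.max_held = 0

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violated: lock acquired after a release")
        self.held.add(item)
        self.max_held = max(self.max_held, len(self.held))

    def unlock(self, item):
        self.shrinking = True    # first release ends the growing phase
        self.held.discard(item)

txn = TwoPhaseTransaction()
for item in ("A", "B", "C"):
    txn.lock(item)               # growing phase: 1, 2, 3 locks held
for item in ("A", "B", "C"):
    txn.unlock(item)             # shrinking phase: 2, 1, 0 locks held

print(txn.max_held)  # 3
```

Any attempt to call lock() after the first unlock() raises an error, which is exactly the discipline that makes two-phase schedules serialisable.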
Suggested Readings
1. Database Management, Fred McFadden, Benjamin/Cummings Publishing.
2. Database Management: Principles and Products, Charles J. Bontempo, Prentice Hall PTR.
3. Database Management Systems, Raghu Ramakrishnan, McGraw-Hill.
4. Database Systems, Thomas M. Connolly, Addison-Wesley.
5. Database Systems, Rob and Coronel, Galgotia Publishers.
6. Fundamentals of Database Systems, Ramez Elmasri, Addison-Wesley.