EMC Greenplum DCA and DIA Getting Started Guide

The Data Computing Division of EMC
EMC® Greenplum® Data Computing Appliance
and Data Integration Accelerator
Getting Started Guide
Version 1.0.3
P/N: 300-012-284
Rev: A01
Copyright © 2010 EMC Corporation. All rights reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to
change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS
OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY
DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software
license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
All other trademarks used herein are the property of their respective owners.
EMC DCA and DIA Getting Started Guide – Contents
Preface
  About This Guide
  Document Conventions
    Text Conventions
    Command Syntax Conventions
  Getting Support
    Product information
    Technical support
Chapter 1: About EMC Greenplum DCA and DIA
  About DCA and DIA
    Available DCA Configurations
    Available DIA Configurations
    Component Specifications
    Additional Solutions - Data Domain Backup and Recovery
  About Greenplum Database
    About the Master Hosts
    About the Segment Hosts
    About the Network Configuration
Chapter 2: Greenplum DCA / DIA Administration
  Troubleshooting and Diagnostic Tools
  Database and System Monitoring Tools
    ConnectEMC Dial Home Capability
    Greenplum Performance Monitor
    Greenplum Database System Catalogs
    Greenplum Database Email and SNMP Alerting
  General Database Maintenance Tasks
    Routine Vacuum and Analyze
    Routine Reindexing
    Managing Greenplum Database Log Files
Chapter 3: Connecting to Greenplum Database
  Establishing a Database Session
  Supported Client Applications
    Greenplum Database Client Applications
    pgAdmin III for Greenplum Database
    Database Application Interfaces
    Third-Party Client Tools
  Troubleshooting Connection Problems
Chapter 4: Next Steps
  Understanding the SQL Features of Greenplum Database
    Core SQL Conformance
    SQL 1992 Conformance
    SQL 1999 Conformance
    SQL 2003 Conformance
    SQL 2008 Conformance
    Greenplum and PostgreSQL Compatibility
  Providing User Access to Greenplum Database
  Creating Databases and Loading Data
Glossary
EMC DCA and DIA Getting Started Guide – Preface
Preface
This guide is intended for database and system administrators who are new to the
Greenplum Data Computing Appliance (Greenplum DCA), Greenplum Data
Integration Accelerator, and to Greenplum Database. This guide provides an overview
of the appliance configuration, as well as general information about using and
administering a Greenplum Database system.
• About This Guide
• Document Conventions
• Getting Support
About This Guide
This guide provides high-level information to help administrators get started with
Greenplum Database. It is intended for system and database administrators
responsible for managing a Greenplum Database system on the Greenplum Data
Computing Appliance.
This guide assumes knowledge of Linux/UNIX system administration, database
management systems, database administration, and structured query language (SQL).
This guide contains the following chapters and appendices:
• Chapter 1, “About EMC Greenplum DCA and DIA” explains the
  architecture, components, and configuration of Greenplum Database on the
  Greenplum Data Computing Appliance.
• Chapter 2, “Greenplum DCA / DIA Administration” describes the general
  database maintenance tasks and the tools available to diagnose, monitor, and
  troubleshoot a Greenplum Database system running on the Greenplum Data
  Computing Appliance.
• Chapter 3, “Connecting to Greenplum Database” explains how to connect to
  Greenplum Database using various client programs.
• Chapter 4, “Next Steps” explains the next steps to implementing your data
  warehouse requirements in Greenplum Database.
• “Glossary” defines Greenplum Database components and terminology.
Document Conventions
The following conventions are used throughout the Greenplum Database
documentation to help you identify certain types of information.
• Text Conventions
• Command Syntax Conventions
Text Conventions
Table 0.1 Text Conventions

Text Convention    Usage                                 Examples
-----------------  ------------------------------------  -------------------------------------
bold               Button, menu, tab, page, and field    Click Cancel to exit the page without
                   names in GUI applications             saving your changes.

italics            New terms where they are defined      The master instance is the postgres
                                                         process that accepts client
                                                         connections.
                   Database objects, such as schema,     Catalog information for Greenplum
                   table, or column names                Database resides in the pg_catalog
                                                         schema.

monospace          File names and path names             Edit the postgresql.conf file.
                   Programs and executables              Use gpstart to start Greenplum
                                                         Database.
                   Command names and syntax
                   Parameter names

monospace italics  Variable information within file      /home/gpadmin/config_file
                   paths and file names
                   Variable information within           COPY tablename FROM 'filename'
                   command syntax

monospace bold     Used to call attention to a           Change the host name, port, and
                   particular part of a command,         database name in the JDBC
                   parameter, or code snippet.           connection URL:
                                                         jdbc:postgresql://host:5432/mydb

UPPERCASE          Environment variables                 Make sure that the Java /bin
                                                         directory is in your $PATH.
                   SQL commands                          SELECT * FROM my_table;
                   Keyboard keys                         Press CTRL+C to escape.
Command Syntax Conventions
Table 0.2 Command Syntax Conventions

Text Convention        Usage                                 Examples
---------------------  ------------------------------------  ------------------------------
{ }                    Within command syntax, curly braces   FROM { 'filename' | STDIN }
                       group related command options. Do
                       not type the curly braces.

[ ]                    Within command syntax, square         TRUNCATE [ TABLE ] name
                       brackets denote optional arguments.
                       Do not type the brackets.

...                    Within command syntax, an ellipsis    DROP TABLE name [, ...]
                       denotes repetition of a command,
                       variable, or option. Do not type the
                       ellipsis.

|                      Within command syntax, the pipe       VACUUM [ FULL | FREEZE ]
                       symbol denotes an “OR” relationship.
                       Do not type the pipe symbol.

$ system_command       Denotes a command prompt - do not     $ createdb mydatabase
# root_system_command  type the prompt symbol. $ and #       # chown gpadmin -R /datadir
=> gpdb_command        denote terminal command prompts.      => SELECT * FROM mytable;
=# su_gpdb_command     => and =# denote Greenplum Database   =# SELECT * FROM pg_database;
                       interactive program command prompts
                       (psql or gpssh, for example).
Getting Support
EMC support, product, and licensing information can be obtained as follows.
Product information
For documentation, release notes, software updates, or for information about EMC
products, licensing, and service, go to the EMC Powerlink website (registration
required) at:
http://Powerlink.EMC.com
Technical support
For technical support, go to Powerlink and choose Support. On the Support page, you
will see several options, including one for making a service request. Note that to open
a service request, you must have a valid support agreement. Please contact your EMC
sales representative for details about obtaining a valid support agreement or with
questions about your account.
EMC DCA and DIA Getting Started Guide – Chapter 1: About EMC Greenplum DCA and DIA
1. About EMC Greenplum DCA and DIA
Greenplum Data Computing Appliance (Greenplum DCA) is a self-contained data
warehouse solution that integrates all of the database software, servers and switches
necessary to perform big data analytics.
EMC Greenplum Data Computing Appliance (DCA) is a turn-key, easy-to-install data
warehouse solution that provides extreme query and loading performance for
analyzing large data sets. The EMC Greenplum DCA integrates Greenplum Database
software with compute, storage, and network components, and is delivered racked and
ready for immediate data loading and query execution. The DCA is available in two
configurations: balanced and capacity. The balanced system uses high-speed SAS
drive technology, while the capacity system uses high-capacity SATA drive
technology.
EMC Greenplum Data Integration Accelerator (DIA) is a fast, parallel data loading
solution, built specifically to integrate with the DCA. The DIA comes pre-configured
with Greenplum’s gpfdist loading tool. The DIA supports health monitoring and
ConnectEMC dial home notifications. For more information on the loading process,
refer to the Greenplum Database Administrator Guide: Loading and Unloading Data.
EMC Greenplum Data Computing Appliance runs the Greenplum Database relational
database management system (RDBMS) software. Greenplum Database utilizes the
DCA components to perform its database operations and processing. See the
following sections for a description of the DCA components and configurations.
• About DCA and DIA
• About Greenplum Database
About DCA and DIA
This section explains the hardware components and specifications of the Greenplum
Data Computing Appliance and Data Integration Accelerator.
• Available DCA Configurations
• Available DIA Configurations
• Component Specifications
Available DCA Configurations
This release of the Greenplum DCA is available in four configurations, each with a
balanced and a capacity model: the Greenplum GP10 and GP10C (quarter-rack
configuration), the Greenplum GP100 and GP100C (half-rack configuration), the
Greenplum GP1000 and GP1000C (full-rack configuration), and the Greenplum
GP1000C plus one scale-out module (two-rack configuration).
The balanced configuration of DCA utilizes 600GB 15k SAS drives in the segment
servers for a usable capacity of 36TB on a full-rack GP1000. The capacity
configuration of DCA utilizes 2TB 7.2k SATA drives in the segment servers for a
usable capacity of 124TB on a full-rack GP1000C.
Figure 1.1 GP10 Quarter-Rack Configuration
Figure 1.2 GP100 Half-Rack Configuration
Figure 1.3 GP1000 One-Rack Configuration
Figure 1.4 GP1000 +1 Scale-Out Module Two-Rack Configuration
Available DIA Configurations
This release of the Greenplum DIA comes in three configurations: the Greenplum
DIA10 (quarter-rack), the Greenplum DIA100 (half-rack), and the Greenplum
DIA1000 (full-rack). The DIA has a usable storage capacity of 71TB for a DIA10,
142TB for a DIA100 and 284TB for a DIA1000.
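The capacities quoted above scale with the number of DIA hosts. The following is a minimal sketch of that arithmetic, assuming the host counts listed in Table 1.1; Python is used only for illustration, and the names DIA_MODELS and usable_tb_per_host are hypothetical, not part of any Greenplum tool.

```python
# Usable capacity (TB) per DIA model as quoted in this section.
# Host counts are taken from Table 1.1 (DCA/DIA Components).
DIA_MODELS = {
    "DIA10": {"hosts": 4, "usable_tb": 71},
    "DIA100": {"hosts": 8, "usable_tb": 142},
    "DIA1000": {"hosts": 16, "usable_tb": 284},
}


def usable_tb_per_host(model: str) -> float:
    """Return the usable capacity per DIA host, in TB, for a given model."""
    spec = DIA_MODELS[model]
    return spec["usable_tb"] / spec["hosts"]


for model in DIA_MODELS:
    print(f"{model}: {usable_tb_per_host(model):.2f} TB usable per host")
```

Under these figures, each model works out to the same usable capacity per host, which is consistent with capacity doubling as the rack size doubles.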
Figure 1.5 DIA10 Quarter-Rack Configuration
Figure 1.6 DIA100 Half-Rack Configuration
Figure 1.7 DIA1000 Full-Rack Configuration
Component Specifications
This section explains the specifications of the various server and networking
components of the Greenplum DCA and DIA. Note that in the Greenplum Database
product and documentation, physical servers are referred to as hosts.
Table 1.1 DCA/DIA Components

Component              Quantity
---------------------  ----------------------------------------------------
Master Host            All configurations = 2 (one primary and one standby)
                       The Greenplum DIA contains no master hosts

Segment/DIA Hosts      GP10/DIA10 = 4
                       GP100/DIA100 = 8
                       GP1000/DIA1000 = 16
                       GP1000+1 = 32

Interconnect Switch    GP10/GP100/GP1000 = 2
                       GP1000+1 = 4
                       DIA10/DIA100/DIA1000 = 2

Administration Switch  GP10/GP100/GP1000 = 1
                       GP1000+1 = 2
                       DIA10/DIA100/DIA1000 = 1
Master Host Specifications
The following diagram shows an example of how a Greenplum Database master host
is configured in the Greenplum DCA. Greenplum DCA has two master hosts (the
primary master and a standby master). The Greenplum DIA contains no master hosts;
however, the DCA’s master hosts are used for management of the DIA.
Figure 1.8 Greenplum Database Master Host Configuration on the Greenplum
DCA
Table 1.2 Master Host Server Specifications

Hardware                             Specifications                               Quantity
-----------------------------------  -------------------------------------------  --------
Processor                            Intel X5680 3.33 GHz (6 core)                2
Memory                               DDR3 1333 MHz                                48 GB
Dual-port Converged Network Adapter  2 x 10 Gbps                                  1
Quad-port Network Adapter            4 x 1 Gbps                                   1
RAID controller                      Dual channel 6 Gb/s SAS                      1
Hard Disks                           600 GB 10 K RPM SAS                          6
                                     (one RAID5 volume of 4+1 with 1 hot spare)
                                     Master Host Server utilizes the same drives
                                     between balanced and capacity systems.
Segment/DIA Host Specifications
The following diagram shows an example of how a host is configured in the
Greenplum DCA and DIA. Each segment host serves 6 Greenplum Database primary
segment instances and 6 mirror segment instances. On the DIA, Greenplum
recommends running two gpfdist loading processes per host; however, this value
may change based on the actual environment.
Figure 1.9 Host Configuration on the Greenplum DCA / DIA
Table 1.3 Host Server Specifications

Hardware                             Specifications                               Quantity
-----------------------------------  -------------------------------------------  --------
Processor                            Intel X5670 2.93 GHz (6 core)                2
Memory                               DDR3 1333 MHz                                48 GB
Dual-port Converged Network Adapter  2 x 10 Gbps                                  1
Dual-port Network Adapter            2 x 1 Gbps                                   1
RAID controller                      Dual channel 6 Gb/s SAS                      1
Hard Disks                           Balanced System:                             12
                                     600 GB 15 K RPM SAS
                                     (two RAID5 volumes of 5+1 disks)
                                     Capacity System and DIA:
                                     2TB 7.2K RPM SATA
                                     (two RAID5 volumes of 5+1 disks)
Network Component Specifications
Hardware             Specifications                                Quantity
-------------------  --------------------------------------------  --------
Interconnect Switch  24-port Converged Enhanced Ethernet (CEE),    2
                     Fibre Channel over Ethernet (FCoE)
                     8 Fibre Channel Ports (future use)
Admin Switch         24-port 1 Gb Ethernet Layer 3                 1
Additional Solutions - Data Domain Backup and Recovery
EMC Data Domain deduplication storage systems dramatically reduce the amount of
disk storage needed to retain and protect enterprise data. By identifying redundant
data as it is being stored, Data Domain provides a storage footprint that is up to 30
times smaller, on average, than the original dataset. Backup data can then be
efficiently replicated and retrieved over existing networks for streamlined disaster
recovery and consolidated tape operations. This allows Data Domain appliances to
integrate seamlessly into database architectures, maintaining existing backup
strategies with no changes to scripts, backup processes, or system architecture.
Figure 1.10 Data Domain Backup Solution for DCA
About Greenplum Database
Greenplum Database is a massively parallel processing (MPP) database management
system (DBMS). Greenplum Database 4.0 uses MPP as the backbone to its database
architecture. MPP refers to a distributed system that has two or more individual
servers, which carry out an operation in parallel. Each server has its own processor(s),
memory, operating system and storage. All servers communicate with each other over
a common network. In this instance a single database system can effectively use the
combined computational performance of all individual MPP servers to provide a
powerful, scalable database system. Greenplum uses this high-performance system
architecture to distribute the load of multi-terabyte data warehouses, and is able to use
all of a system’s resources in parallel to process a query.
Greenplum Database is based on PostgreSQL 8.2.14, and in most cases is very similar
to PostgreSQL with regard to SQL support, features, configuration options, and
end-user functionality. Database users interact with Greenplum Database as they
would a regular PostgreSQL DBMS.
Greenplum Database is able to handle the storage and processing of large amounts of
data by distributing the load across several servers or hosts. The master is the entry
point to the Greenplum Database system. It is the database instance where clients
connect and submit SQL statements. Greenplum DCA comes with two master hosts
— one primary master and a standby master.
The master coordinates the work across the other database instances in the system, the
segments, which handle data processing and storage. Greenplum DCA comes with a
configurable number of segment hosts. Each segment host serves 6 primary and 6
mirror Greenplum segment instances.
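As a rough illustration of how the segment instance count follows from the host count, the sketch below multiplies the segment host quantities from Table 1.1 by the 6 primary and 6 mirror instances served per host. The dictionary and function names are hypothetical and exist only for this example.

```python
# Segment host counts per DCA configuration, as listed in Table 1.1.
SEGMENT_HOSTS = {"GP10": 4, "GP100": 8, "GP1000": 16, "GP1000+1": 32}

# Each segment host serves 6 primary and 6 mirror segment instances.
PRIMARIES_PER_HOST = 6
MIRRORS_PER_HOST = 6


def segment_instances(model: str) -> dict:
    """Return primary, mirror, and total segment instance counts for a model."""
    hosts = SEGMENT_HOSTS[model]
    primaries = hosts * PRIMARIES_PER_HOST
    mirrors = hosts * MIRRORS_PER_HOST
    return {"primary": primaries, "mirror": mirrors, "total": primaries + mirrors}


for model in SEGMENT_HOSTS:
    counts = segment_instances(model)
    print(f"{model}: {counts['primary']} primary, {counts['mirror']} mirror")
```

For a full-rack GP1000, for example, this gives 96 primary and 96 mirror segment instances.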
The segments communicate with each other and with the master over the interconnect,
which is the networking layer of Greenplum Database. The DCA interconnect is
configured on a private LAN and utilizes two high-speed network switches, offering
each segment host 20 Gb non-blocking duplex bandwidth. The Greenplum primary
and mirror segments are configured to use different interconnect switches in order to
provide redundancy in the event of a single switch failure.
In addition to the interconnect switches, Greenplum DCA comes with an
administration switch. Each master and segment server has a dedicated interface for
remote system administration. This controller has its own processor, memory, battery,
and network connection. This allows administrators to access the individual
Greenplum DCA servers as if they were at the local console (terminal).
Figure 1.11 High-Level Greenplum Database Architecture
About the Master Hosts
The master is the entry point to the Greenplum Database system from the public
LAN. It is the database process that accepts client connections and processes the SQL
commands issued by the users of the system. Users connect to Greenplum Database
through the master using PostgreSQL-compatible client programs such as psql or
ODBC.
The master maintains the system catalog (a set of system tables that contain metadata
about the Greenplum Database system itself); however, the master does not contain
any user data. Data resides only on the segments. The master does the work of
authenticating client connections, processing and planning the incoming SQL
commands, distributing the work load between the segments, coordinating the results
returned by each of the segments, and presenting the final results to the client
program.
Master Redundancy - The Standby Master
Greenplum DCA also has a standby master host to serve as a backup in case the
primary master becomes inoperable. The standby master is a warm standby,
meaning failover is not automatic. If the primary master fails, an administrator can
promote the standby master to be the active master for the Greenplum Database
system.
The standby master is kept up to date by a transaction log replication process, which
runs on the standby master host and keeps the data between the primary and standby
master hosts synchronized. If the primary master fails, the log replication process is
shut down, and the standby master can be activated in its place. Upon activation of the
standby master, the replicated logs are used to reconstruct the state of the master host
at the time of the last successfully committed transaction.
About the Segment Hosts
In Greenplum Database, the segments are where the database data is stored and where
the majority of query processing takes place. User-defined tables and their indexes are
distributed across the available number of segments in the Greenplum Database
system, each segment containing a distinct portion of the data. Segment instances are
the database server processes that serve segments. Users and administrators do not
interact directly with the segments in a Greenplum Database system, but do so through
the master.
Data Redundancy - Mirror Segments
Greenplum Database provides data redundancy by deploying mirror segments. Mirror
segments allow database queries to fail over to a backup segment if the primary
segment becomes unavailable. A mirror segment always resides on a different host
than its corresponding primary segment. A Greenplum Database system can remain
operational if a segment host, network interface or interconnect switch goes down as
long as all portions of data are available on the remaining active segments.
During database operations, only the primary segment is active. Changes to a primary
segment are copied over to its mirror using a file block replication process. Until a
failure occurs on the primary segment, there is no live segment instance running on
the mirror host -- only the replication process.
Figure 1.12 Data Mirroring in Greenplum Database
In the event of a segment failure, the file replication process is stopped and the mirror
segment is automatically brought up as the active segment instance. All database
operations then continue using the mirror. While the mirror is active, it is also logging
all transactional changes made to the database. When the failed segment is ready to be
brought back online, administrators initiate a recovery process to bring it back into
operation.
About the Network Configuration
The following diagram shows an example of how the network is configured in
Greenplum GP1000 (full-rack configuration). The Greenplum Database interconnect
and administration networks are configured on a private LAN. Outside access to
Greenplum Database and to the Greenplum DCA systems goes through the master
host.
Figure 1.13 Greenplum DCA Network Configuration
About the Greenplum Interconnect Networks
The interconnect is the networking layer of Greenplum Database. When a user
connects to a database and issues a query, processes are created on each of the
segments to handle the work of that query. The interconnect refers to the inter-process
communication between the segments, as well as the network infrastructure on which
this communication relies.
To maximize throughput, interconnect activity is load-balanced over two interconnect
networks. To ensure redundancy, a primary segment and its corresponding mirror
segment utilize different interconnect networks. With this configuration, Greenplum
Database can continue its operations in the event of a single interconnect switch
failure.
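The two placement rules described in this chapter (a mirror resides on a different host than its primary, and a primary and its mirror use different interconnect networks) can be expressed as a simple check. The sketch below is illustrative only: the SegmentInstance record, its field names, and the host names are hypothetical and do not correspond to a Greenplum catalog table or utility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentInstance:
    content_id: int    # hypothetical ID for the portion of data served
    role: str          # "primary" or "mirror"
    host: str          # hypothetical host name
    interconnect: int  # interconnect network used: 1 or 2


def placement_violations(instances):
    """Return a list of messages for primary/mirror pairs that break the rules."""
    by_content = {}
    for inst in instances:
        by_content.setdefault(inst.content_id, {})[inst.role] = inst

    problems = []
    for content_id, pair in sorted(by_content.items()):
        primary, mirror = pair.get("primary"), pair.get("mirror")
        if primary is None or mirror is None:
            problems.append(f"content {content_id}: missing primary or mirror")
            continue
        if primary.host == mirror.host:
            problems.append(f"content {content_id}: mirror on same host as primary")
        if primary.interconnect == mirror.interconnect:
            problems.append(f"content {content_id}: same interconnect network")
    return problems


layout = [
    SegmentInstance(0, "primary", "host-a", 1),
    SegmentInstance(0, "mirror", "host-b", 2),
]
print(placement_violations(layout))
```

A layout that satisfies both rules yields an empty list, which is why the loss of a single segment host or interconnect switch still leaves every portion of data reachable.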
About the Greenplum DCA Administration Network
The administration network is used for system management facilities and Greenplum
administration utilities, so as not to interfere with the network traffic related to
database processing. Each master and segment host has one administration/iDRAC
network interface.
About iDRAC
The Integrated Dell Remote Access Controller (iDRAC) is a built-in interface in Dell
servers that provides out-of-band system management facilities. The controller has its
own processor, memory, battery, network connection, and access to the system bus.
Key features include power management, virtual media access and remote console
capabilities, all available through a supported web browser. This gives system
administrators the ability to manage a machine as if they were sitting at the local
console. For more information about iDRAC, see the iDRAC User Guide.
EMC DCA and DIA Getting Started Guide – Chapter 2: Greenplum DCA / DIA Administration
2. Greenplum DCA / DIA Administration
This chapter describes the general database maintenance tasks and the tools available
to diagnose, monitor, and troubleshoot a Greenplum Database system running on the
Greenplum Data Computing Appliance. The Greenplum DIA supports ConnectEMC
for dial home of hardware-related issues as well as health monitoring through
Greenplum Performance Monitor.
• Troubleshooting and Diagnostic Tools
• Database and System Monitoring Tools
• General Database Maintenance Tasks
Troubleshooting and Diagnostic Tools
Greenplum Database provides the following troubleshooting and diagnostic tools.
More information on these tools can be found in the Greenplum Database
Administrator Guide:
Table 2.1 Greenplum Database Diagnostic Tools

Tool Name          Description
-----------------  ------------------------------------------------------------
gpcheck            gpcheck is a Greenplum command-line management utility that
                   can be used to validate the configuration and operating
                   system settings of the DCA hosts.

gpcheckperf        gpcheckperf is a Greenplum command-line management utility
                   that can be used to validate baseline hardware performance.
                   If you are experiencing slower than expected response times,
                   running this utility can help identify if the issue is
                   related to a hardware failure rather than a SQL workload or
                   software problem.

gpstate            gpstate is a Greenplum command-line management utility that
                   can be used to check the status and configuration of a
                   running Greenplum Database system. It can be used to
                   identify segment failures and general health of a Greenplum
                   Database system.

gp_toolkit schema  gp_toolkit is an administrative schema that is installed
                   into every database within Greenplum Database. It contains a
                   number of helpful views and database functions to help
                   administrators diagnose common problems, such as checking
                   for tables that need routine maintenance. Administrators
                   access this schema by connecting to any database and issuing
                   SQL queries against the views and functions in this schema.

gpssh              gpssh is a Greenplum command-line management utility that
                   allows administrators to run shell commands on multiple
                   hosts at once using SSH (secure shell). This allows
                   administrators to execute a command on every segment host at
                   once without having to log in to each machine individually.

gplogfilter        gplogfilter is a Greenplum command-line management utility
                   that can be used to search through Greenplum Database log
                   files for specific entries. This can be useful in tracking
                   down more information in the logs about certain database
                   activity or errors.
Database and System Monitoring Tools
Greenplum Data Computing Appliance provides various tools to monitor the status of
Greenplum Database as well as the hardware components that Greenplum Database is
running on.
• ConnectEMC Dial Home Capability
• Greenplum Performance Monitor
• Greenplum Database System Catalogs
• Greenplum Database Email and SNMP Alerting
ConnectEMC Dial Home Capability
The EMC Greenplum Data Computing Appliance and Data Integration Accelerator
support dial home functionality through the ConnectEMC software. ConnectEMC is a
support utility that collects and sends event data - files indicating system errors and
other information - from EMC products to EMC Global Services customer support.
ConnectEMC sends DCA event files using the secure file transfer protocol (FTPS). If
an EMC Secure Remote Support Gateway (ESRS) is used for connectivity, HTTPS or
FTP are available protocols for sending alerts.
The ConnectEMC software is configured on the DCA master and standby master
servers, and alerts are sent out through the external connection (eth1) either to an
ESRS Gateway server or directly to EMC. The DIA routes notifications through the
DCA, so a dedicated connection for dial home is not required.
Dial Home Severity Levels
Alerts that arrive at EMC Global Services can have one of the following severity
levels:
• WARNING: This indicates a condition that might require immediate attention. This
  severity will create a service request.
• ERROR: This indicates that an error occurred on the Greenplum DCA or DIA.
  System operation and/or performance is likely affected. This severity will create a
  service request.
• UNKNOWN: This severity level is associated with hosts and devices on the
  Greenplum DCA or DIA that are either disabled (due to hardware failure) or
  unreachable for some other reason. This severity will create a service request.
• INFO: An event with this severity level indicates that a previously reported error
  condition is now resolved. An event with this severity level is also used to provide
  information about the system that does not require any action. This severity will
  not create a service request. For example, Greenplum Database startup triggers an
  INFO alert.
The severity of an event determines whether a service request is created for EMC
support to act on. The events listed in Table 2.2, “ConnectEMC Events and Symptoms
Codes,” can generate multiple severity levels based on the error condition.
For example, the failure of a segment server disk drive will generate Symptom Code
13 with a severity of ERROR. The ConnectEMC software will dial home to Global
Services customer support, and a service request will be created. Upon successful
replacement of the disk drive, Symptom Code 13 will be generated again, this time
with a severity of INFO to indicate that the disk drive was replaced.
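As a rough illustration of the severity rules described in this section, the following shell sketch maps a severity level to whether a service request is created. The function name creates_service_request is hypothetical and is not part of the ConnectEMC software; it only restates the rules listed above.

```shell
#!/bin/bash
# Hypothetical helper (not part of ConnectEMC): restates the severity rules above.
# Prints "yes" if an alert of the given severity creates a service request,
# "no" if it does not, and returns non-zero for an unrecognized severity.
creates_service_request() {
    case "$1" in
        WARNING|ERROR|UNKNOWN) echo "yes" ;;
        INFO)                  echo "no" ;;
        *)                     echo "unrecognized severity: $1" >&2; return 1 ;;
    esac
}

# Example from the text: a failed disk drive (Symptom Code 13) is reported
# with severity ERROR, and its replacement is later reported with severity INFO.
creates_service_request ERROR
creates_service_request INFO
```

The two calls print yes and then no, matching the disk drive example: the failure opens a service request, and the follow-up INFO event does not.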

Note:
Monitoring in the DCA and DIA is primarily focused on hardware-related
events. The monitoring of Greenplum Database events is limited in this
release and will be expanded in future versions.
ConnectEMC Event Alerts
The table below lists all the conditions that cause ConnectEMC to send event data
alerts to EMC Global Services.
Table 2.2 ConnectEMC Events and Symptoms Codes
Symptom Code  Item                       Description
1             Host Status                A host or device on the Greenplum DCA/DIA is either down
                                         or not reachable.
2             Greenplum Database Status  An alert generated by Greenplum Database. The alert can
                                         indicate successful startup (severity=INFO) or critical
                                         database conditions. The following four error conditions
                                         will generate an alert:
                                         • Unknown transaction status. This may indicate that a
                                           table is corrupted.
                                         • Database recovery interruption. This indicates that
                                           Greenplum Database should be restored from backup.
                                         • Two-phase state file for transaction. This may indicate
                                           corruption of database files.
                                         • Panic conditions. This indicates that Greenplum
                                           Database shutdown is imminent. This condition can be
                                           caused by data corruption and/or system resource
                                           issues.
                                         Other database error conditions are not monitored in the
                                         current DCA release.
3             Power Supply Status        An issue with a power supply was detected.
4             Battery Status             An issue with a battery was detected.
5             Cooling Device Status      An issue with a fan was detected.
6             Processor Status           An issue with a processor was detected.
7             Cache Device Status        An issue with a CPU L1/L2/L3 cache was detected.
8             OS Memory Status           An issue with OS memory was detected.
9             Memory Device Status       An issue with a RAM device was detected.
10            Network Device Status      An issue with a network interface card was detected. The
                                         device might be disconnected or is no longer serviceable.
11            Controller Status          An issue with an IO controller was detected.
11            Controller Battery Status  An issue with an IO controller battery was detected.
12            Virtual Disk Status        An issue with virtual disk configuration was detected.
12            Virtual Disk Write Policy  An issue with virtual disk configuration was detected. A
                                         sub-optimal performance policy can also trigger an alert.
12            Virtual Disk Read Policy   An issue with virtual disk configuration was detected. A
                                         sub-optimal performance policy can also trigger an alert.
12            Virtual Disk State         An issue with virtual disk configuration was detected.
                                         Any sub-optimal state can trigger an alert.
13            Array Disk Status          An issue with a hard disk was detected.
14            Sensor Status              An issue with a network switch hardware component was
                                         detected.
15            SNMP Monitoring Status     An issue with SNMP configuration on a device was
                                         detected. This issue is preventing monitoring functions
                                         from occurring.
Greenplum Performance Monitor
Greenplum Performance Monitor allows administrators to collect query and system
performance metrics from a running Greenplum Database system. Monitor data is
stored within Greenplum Database.
Greenplum Performance Monitor consists of data collection agents that run on
the master host and each segment host. The agents collect performance data about
active queries and system utilization and send it to the Greenplum master at regular
intervals. The data is stored in a dedicated database on the master (called gpperfmon),
where it can be accessed using the Greenplum Performance Monitor Console (a web
application) or using SQL queries.
Greenplum Performance Monitor Console is a browser-based application where
administrators can view active and historical query and system metrics stored in the
gpperfmon database. By default, Greenplum Performance Monitor Console is installed
on the Greenplum Database master host using HTTP port 28080. It can be accessed
through a browser using a URL such as
http://masterhostname.companydomain.com:28080. To log in to Greenplum
Performance Monitor Console, your Greenplum Database administrator must assign
you a username and password (or see the Greenplum Performance Monitor
Administrator Guide for instructions on granting access).
The Dashboard, System Metrics, and Query Monitor tabs of Greenplum Performance
Monitor Console show information about active and historical database and system
workload. This information can help an administrator track system utilization and
performance for specific queries and usage periods.
Figure 2.1 Performance Monitor Dashboard
The Greenplum Performance Monitor Console for Greenplum DCA has an additional
Health tab, which shows the status of Greenplum DCA and DIA hardware
components. For example, a failed network interface would show up as a server error.
A failed fan would show as a warning.
Figure 2.2 Performance Monitor Health Tab
Greenplum Database System Catalogs
Greenplum Database stores metadata information about the database system in special
tables and views within the database called system catalogs. Database superusers can
access the information in these catalogs using SQL commands. The following is a list
of helpful system catalogs that administrators can query to check database activity and
system status. For more information on these system catalogs, see the Greenplum
Database Administrator Guide.
Table 2.3 Useful Greenplum Database System Catalogs
Catalog Name            Description
pg_resqueue_status      Shows status and activity for a workload management resource
                        queue. It shows how many queries are waiting to run and how
                        many queries are currently active in the system from a particular
                        resource queue.
pg_stat_activity        Shows one row per master database process, showing the
                        database name, process ID, user name, current query, query’s
                        waiting status, time at which the current query began execution,
                        time at which the process was started, and client’s address and
                        port number.
pg_stat_last_operation  Shows the last time certain database operations were performed
                        on a database object, for example, the last time a table was
                        vacuumed.
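The catalogs in Table 2.3 can be queried from the shell with psql. The following is a minimal sketch, assuming it is run by a database superuser who can connect to the master; the function names and the PSQL override variable are illustrative and are not part of Greenplum Database.

```shell
#!/bin/bash
# PSQL can be overridden (for example, for a dry run); it defaults to psql.
PSQL="${PSQL:-psql}"

# Show the current master database processes (one row per process).
show_activity() {
    "$PSQL" -d "$1" -c "SELECT * FROM pg_stat_activity;"
}

# Show waiting and active queries for the resource queues.
show_resqueue_status() {
    "$PSQL" -d "$1" -c "SELECT * FROM pg_resqueue_status;"
}

# Show when operations such as VACUUM were last performed on database objects.
show_last_operations() {
    "$PSQL" -d "$1" -c "SELECT * FROM pg_stat_last_operation;"
}
```

For example, show_activity template1 lists the sessions currently connected through the master. Each function takes the database name as its only argument.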
Greenplum Database Email and SNMP Alerting
The Greenplum Database system can be configured to trigger SNMP alerts or send
email notifications to system administrators whenever certain database events occur.
These events can include fatal server errors, segment shutdown and recovery, and
database system shutdown and restart. See the Greenplum Database Administrator
Guide for instructions on enabling system alerts and email notifications.
General Database Maintenance Tasks
Greenplum Database, like any database management system, requires that certain
tasks be performed regularly to achieve optimum performance. The tasks discussed
here are required, but they are repetitive in nature and can easily be automated using
standard UNIX tools such as cron scripts. However, it is the database administrator’s
responsibility to set up appropriate scripts and to check that they execute successfully.
• Routine Vacuum and Analyze
• Routine Reindexing
• Managing Greenplum Database Log Files
Routine Vacuum and Analyze
Because of the multi-version concurrency control (MVCC) transaction model used in
Greenplum Database, data rows that are deleted or updated still occupy physical space
on disk even though they are not visible to any new transactions. If you have a
database with lots of updates and deletes, you will generate a lot of expired rows.
Running the VACUUM SQL command will reclaim this disk space. The VACUUM
command also collects table-level statistics such as number of rows and pages, so it is
necessary to periodically run VACUUM on all tables.
Transaction ID Management
Greenplum’s MVCC transaction semantics depend on being able to compare
transaction ID (XID) numbers to determine visibility to other transactions. But since
transaction IDs have limited size, a Greenplum system that runs for a long time (more
than 4 billion transactions) would suffer transaction ID wraparound: the XID counter
wraps around to zero, and all of a sudden transactions that were in the past appear to
be in the future — which means their outputs become invisible. To avoid this, it is
necessary to run VACUUM on every table in every database at least once every two
billion transactions.
See the Greenplum Database Administrator Guide for more information.
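The 4 billion and 2 billion figures follow from the size of the transaction ID. The sketch below assumes 32-bit transaction IDs, as in PostgreSQL, and shows the arithmetic. It also defines a helper that reports how old the oldest unfrozen transaction ID in each database is, using the standard PostgreSQL age() function and the pg_database.datfrozenxid column; their availability in your Greenplum Database release is an assumption, and the function name show_xid_age is illustrative.

```shell
#!/bin/bash
# Assumption: transaction IDs are 32-bit, as in PostgreSQL.
XID_SPACE=$((1 << 32))          # number of distinct XIDs (about 4 billion)
XID_HORIZON=$((XID_SPACE / 2))  # VACUUM every table before this many transactions (about 2 billion)
echo "XID space:   $XID_SPACE"
echo "XID horizon: $XID_HORIZON"

# PSQL can be overridden for a dry run; it defaults to psql.
PSQL="${PSQL:-psql}"

# Report, per database, how many transactions have passed since its oldest
# unfrozen XID. Databases whose age approaches XID_HORIZON need VACUUM soonest.
show_xid_age() {
    "$PSQL" -d "$1" -c "SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;"
}
```

The script prints 4294967296 for the XID space and 2147483648 for the horizon. Running show_xid_age template1 on the master lists the databases with the oldest one first.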
System Catalog Maintenance
Numerous database updates with CREATE and DROP commands can cause growth in
the size of the system catalog that affects system performance. For example, after a
large number of DROP TABLE statements, the overall performance of the system
begins to degrade due to excessive data scanning during metadata operations on the
catalog tables. Depending on your system, the performance loss may appear after
anywhere from thousands to tens of thousands of DROP TABLE statements.
Greenplum recommends that you periodically run VACUUM on the system catalog to
clear the space occupied by deleted objects. If numerous DROP statements are a part of
regular database operations, it is safe and appropriate to run a system catalog
maintenance procedure with VACUUM daily at off-peak hours. This can be done while
the system is running and available.
The following example script performs a VACUUM of the Greenplum Database system
catalog:
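#!/bin/bash
# Generate one VACUUM ANALYZE statement per pg_catalog table,
# then pipe the statements to a second psql session that runs them.
DBNAME="<database_name>"
VCOMMAND="VACUUM ANALYZE"
psql -tc "select '$VCOMMAND' || ' pg_catalog.' || relname || ';'
  from pg_class a, pg_namespace b
  where a.relnamespace = b.oid
  and b.nspname = 'pg_catalog' and a.relkind = 'r'" "$DBNAME" |
  psql -a "$DBNAME"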
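To run a catalog maintenance script daily at off-peak hours, it can be scheduled with cron. The sketch below builds a crontab line; the script path, the log path, and the function name cron_entry are hypothetical and should be adapted to where the script is actually saved.

```shell
#!/bin/bash
# Hypothetical paths: wherever the catalog VACUUM script and its output log live.
SCRIPT="/home/gpadmin/scripts/vacuum_catalog.sh"
LOG="/home/gpadmin/gpAdminLogs/vacuum_catalog.log"

# Build a crontab line that runs the script every day at the given minute and hour.
cron_entry() {
    local minute="$1" hour="$2"
    echo "$minute $hour * * * $SCRIPT >> $LOG 2>&1"
}

# Print the entry for 02:30 every day. Review it, then add it with "crontab -e"
# as the gpadmin user.
cron_entry 30 2
```

Redirecting the output to a log file makes it possible to check afterward that each run executed successfully.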
Vacuum and Analyze for Query Optimization
Greenplum Database uses a cost-based query planner that relies on database statistics.
Accurate statistics allow the query planner to better estimate selectivity and the
number of rows retrieved by a query operation in order to choose the most efficient
query plan. The ANALYZE command collects column-level statistics needed by the
query planner.
Both VACUUM and ANALYZE operations can be run in the same command. For example:
=# VACUUM ANALYZE mytable;
Routine Reindexing
For B-tree indexes, a freshly-constructed index is somewhat faster to access than one
that has been updated many times, because logically adjacent pages are usually also
physically adjacent in a newly built index. It might be worthwhile to reindex
periodically to improve access speed. Also, if all but a few index keys on a page have
been deleted, there will be wasted space on the index page. A reindex will reclaim that
wasted space. In Greenplum Database it is often faster to drop an index (DROP INDEX)
and then recreate it (CREATE INDEX) than it is to use the REINDEX command.
Bitmap indexes are not updated when changes are made to the indexed column(s). If
you have updated a table that has a bitmap index, you must drop and recreate the
index for it to remain current.
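The drop-and-recreate approach can be scripted. The following sketch emits the two SQL statements for a bitmap index; the function name and the index, table, and column names in the example are hypothetical.

```shell
#!/bin/bash
# Emit SQL that drops and then recreates a bitmap index.
# The index, table, and column names are supplied by the caller.
rebuild_bitmap_index_sql() {
    local index="$1" table="$2" column="$3"
    cat <<SQL
DROP INDEX $index;
CREATE INDEX $index ON $table USING bitmap ($column);
SQL
}

# Print the statements for a hypothetical index. To run them, pipe the output to psql:
#   rebuild_bitmap_index_sql sales_region_idx sales region | psql -a mydb
rebuild_bitmap_index_sql sales_region_idx sales region
```

For a B-tree index, omit the USING bitmap clause. Queries cannot use the index between the DROP and the CREATE, so this is best done at off-peak hours.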
Managing Greenplum Database Log Files
• Database Server Log Files
• Management Utility Log Files
Database Server Log Files
Greenplum Database log output tends to be voluminous (especially at higher debug
levels) and you do not need to save it indefinitely. Administrators need to rotate the
log files periodically so that new log files are started and old ones are removed after a
reasonable period of time.
Greenplum Database has log file rotation enabled on the master and all segment
instances. Daily log files are created in pg_log of the master and each segment data
directory using the naming convention of: gpdb-YYYY-MM-DD.log. Although log
files are rolled over daily, they are not automatically truncated or deleted.
Administrators will need to implement some script or program to periodically clean up
old log files in the pg_log directory of the master and each segment instance.
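A cleanup job of this kind can be as small as one find command. The following is a minimal sketch, assuming GNU find; the function name clean_old_logs and the 30-day default retention are illustrative choices, not Greenplum recommendations.

```shell
#!/bin/bash
# Delete daily server log files (gpdb-YYYY-MM-DD.log) older than a number of days.
#   $1: a pg_log directory of the master or of a segment instance
#   $2: retention in days (hypothetical default of 30)
clean_old_logs() {
    local log_dir="$1" keep_days="${2:-30}"
    find "$log_dir" -maxdepth 1 -type f -name 'gpdb-*.log' -mtime +"$keep_days" -delete
}

# Example (assumes MASTER_DATA_DIRECTORY points to the master data directory):
#   clean_old_logs "$MASTER_DATA_DIRECTORY/pg_log" 30
```

The function only touches files that match the gpdb-*.log naming convention, so other files in the directory are left alone. It must be run on each host for the pg_log directory of every instance on that host.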
Management Utility Log Files
Log files for the Greenplum Database management utilities are written to
~/gpAdminLogs by default (the home directory of the gpadmin user). The naming
convention for management log files is:
<script_name>_<date>.log
The log file for a particular utility execution is appended to its daily log file each time
that utility is run. Administrators will need to implement some script or program to
periodically clean up old log files in ~/gpAdminLogs.
EMC DCA and DIA Getting Started Guide – Chapter 3: Connecting to Greenplum Database
3. Connecting to Greenplum Database
This chapter describes how to connect to a Greenplum Database system running on
the Greenplum Data Computing Appliance.

Note:
Database users and administrators always connect to the Greenplum
Database master.
Establishing a Database Session
Users can connect to Greenplum Database using a PostgreSQL-compatible client
program, such as psql. Users and administrators always connect to Greenplum
Database through the master - the segments cannot accept client connections.
In order to establish a connection to the Greenplum Database master, you will need to
know the following connection information and configure your client program
accordingly.
Table 3.1 Client Connection Parameters
Connection Parameter  Description                                 Environment Variable
Database name         The name of the database to which you       $PGDATABASE
                      want to connect. For a newly initialized
                      system, use the template1 database to
                      connect for the first time to create your
                      database.
Host name             The host name of the Greenplum Database     $PGHOST
                      master. The default host is the local
                      host.
Port                  The port number that the Greenplum          $PGPORT
                      Database master instance is running on.
                      The default is 5432.
User name             The database user (role) name to connect    $PGUSER
                      as. Every Greenplum Database system has
                      one superuser account that is created
                      automatically at initialization time. This
                      account has the same name as the OS user
                      who initialized the Greenplum system
                      (gpadmin).
Supported Client Applications
Users can connect to Greenplum Database using various client applications:
• A number of Greenplum Database Client Applications are provided with your
Greenplum installation. The psql client application provides an interactive
command-line interface to Greenplum Database.
• pgAdmin III for Greenplum Database is an enhanced version of the popular
management tool pgAdmin III. Since version 1.10.0, the pgAdmin III client
available from PostgreSQL Tools includes support for Greenplum-specific
features. Installation packages are available for download from Greenplum
Network and from the pgAdmin download site.
• Using standard Database Application Interfaces, such as ODBC and JDBC, users
can create their own client applications that interface to Greenplum Database.
Because Greenplum Database is based on PostgreSQL, it uses the standard
PostgreSQL database drivers.
• Most Third-Party Client Tools that use standard database interfaces, such as
ODBC and JDBC, can be configured to connect to Greenplum Database.
Greenplum Database Client Applications
Greenplum Database comes installed with a number of client applications located in
$GPHOME/bin of your Greenplum Database master host installation. The following
are the most commonly used client applications:
Table 3.2 Commonly Used Client Applications
Name        Usage
createdb    create a new database
createuser  define a new database role
dropdb      remove a database
dropuser    remove a role
psql        PostgreSQL interactive terminal
reindexdb   reindex a database
vacuumdb    vacuum and analyze a database
When using these client applications, you must connect to a database through the
Greenplum master instance. You will need to know the name of your target database,
the host name and port number of the master, and what database user name to connect
as. This information can be provided on the command-line using the options -d, -h,
-p, and -U respectively. If an argument is found that does not belong to any option, it
will be interpreted as the database name first.
All of these options have default values which will be used if the option is not
specified. The default host is the local host. The default port number is 5432. The
default user name is your OS user name, as is the default database name. Note
that OS user names and Greenplum Database user names are not necessarily the same.
If the default values are not correct, you can save yourself some typing by setting the
environment variables PGDATABASE, PGHOST, PGPORT, and PGUSER to the appropriate
values.
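The sketch below shows how the environment variables and the defaults interact. The helper effective_connection is hypothetical: it only prints the values that a client would end up using under the defaults described above (the local host is rendered here as localhost), and the exported values are placeholders.

```shell
#!/bin/bash
# Print the connection values a client would use, following the defaults above:
# host = local host, port = 5432, user = OS user name, database = user name.
effective_connection() {
    local user="${PGUSER:-$(id -un)}"
    echo "host=${PGHOST:-localhost} port=${PGPORT:-5432} user=$user dbname=${PGDATABASE:-$user}"
}

# Placeholder values: set once per session instead of repeating -h, -p, -U, and -d.
export PGHOST=master_host
export PGPORT=5432
export PGUSER=gpadmin
export PGDATABASE=gpdatabase
effective_connection
```

With these variables exported, running psql with no options connects to gpdatabase on master_host as gpadmin. Options given on the command line still take precedence over the environment variables.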
Connecting with psql
Depending on the default values used or the environment variables you have set, the
following examples show how to access a database via psql:
$ psql -d gpdatabase -h master_host -p 5432 -U gpadmin
$ psql gpdatabase
$ psql
If a user-defined database has not yet been created, you can access the system by
connecting to the template1 database. For example:
$ psql template1
After connecting to a database, psql provides a prompt with the name of the database
to which psql is currently connected, followed by the string => (or =# if you are the
database superuser). For example:
gpdatabase=>
At the prompt, you may type in SQL commands. A SQL command must end with a ;
(semicolon) in order to be sent to the server and executed. For example:
=> SELECT * FROM mytable;
Getting Help in psql
psql also has a number of meta-commands (backslash commands) that allow you to
easily look up information in the Greenplum Database system catalogs. To see a list of
all meta-commands, use \?. For example:
=> \?
To get help with SQL command syntax, use the \h meta-command. For example, to
see a list of all available SQL commands:
=> \h
To see the syntax reference for a particular SQL command, follow the \h
meta-command by the SQL command name. For example:
=> \h SELECT
Some other commonly used psql meta-commands are:
Table 3.3 Common psql Meta-Commands
Command              Description
\l                   List all databases in the system.
\c <database_name>   Connect to the specified database.
\dn                  List all schemas in the current database.
\dt                  List all user-created tables in the current database.
\dtS                 List all system catalog tables.
\d+ <object_name>    Show the definition of the specified database object (table, index,
                     etc.).
\du                  List all users (roles) in the system.
For more information on using the psql client application, see the Greenplum
Database Administrator Guide.
pgAdmin III for Greenplum Database
pgAdmin III is an open source graphical user interface (GUI) for PostgreSQL, which
is also compatible with Greenplum Database. As of version 1.10.0, the pgAdmin III
client includes support for Greenplum-specific features.
pgAdmin III for Greenplum Database supports the following Greenplum-specific
features:
• External tables
• Append-only tables, including compressed append-only tables
• Table partitioning
• Resource queues
• Graphical EXPLAIN ANALYZE
• Greenplum server configuration parameters
Figure 3.1 Greenplum Options in pgAdmin III
Installing pgAdmin III for Greenplum Database
The installation package for pgAdmin III for Greenplum Database is available for
download from the official pgAdmin III download site (http://www.pgadmin.org).
Installation instructions are included in the installation package.
Documentation for pgAdmin III for Greenplum Database
For general help on the features of the graphical interface, select Help contents from
the Help menu.
For help with Greenplum-specific SQL support, select Greenplum Database Help
from the Help menu. If you have an active internet connection, you will be directed to
online Greenplum SQL reference documentation.
Database Application Interfaces
You may want to develop your own client applications that interface to Greenplum
Database. PostgreSQL provides a number of database drivers for the most commonly
used database application programming interfaces (APIs), which can also be used
with Greenplum Database. These drivers are not packaged with the Greenplum
Database base distribution. Each driver is an independent PostgreSQL development
project and must be downloaded, installed and configured to connect to Greenplum
Database. The following drivers are available:
Table 3.4 Greenplum Database Interfaces
API         Driver                       Download Link
ODBC        pgodbc                       The PostgreSQL ODBC driver is available in the
                                         Greenplum Database Connectivity package, which can
                                         be downloaded from Greenplum Network.
ODBC        DataDirect ODBC Driver for   DataDirect offers an enterprise ODBC Driver for
            Greenplum Database           Greenplum Database.
                                         http://web.datadirect.com/products/odbc/greenplum/index.html
JDBC        pgjdbc                       Available in the Greenplum Database Connectivity
                                         package, which can be downloaded from Greenplum
                                         Network.
Perl DBI    pgperl                       http://gborg.postgresql.org/project/pgperl
Python DBI  pygresql                     http://www.pygresql.org
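Because the standard PostgreSQL drivers are used, client applications address the Greenplum Database master with the usual PostgreSQL connection settings. As a small sketch, the following builds a connection URL in the format used by the PostgreSQL JDBC driver; the function name jdbc_url and the host and database names are placeholders.

```shell
#!/bin/bash
# Build a connection URL in the PostgreSQL JDBC driver format,
# pointing at the Greenplum Database master.
#   $1: master host name, $2: master port (defaults to 5432), $3: database name
jdbc_url() {
    local host="$1" port="${2:-5432}" database="$3"
    echo "jdbc:postgresql://${host}:${port}/${database}"
}

# Placeholder host and database names.
jdbc_url master_host 5432 gpdatabase
```

The host and port must always be those of the master, since the segments do not accept client connections.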
Third-Party Client Tools
Most third-party extract-transform-load (ETL) and business intelligence (BI) tools use
standard database interfaces, such as ODBC and JDBC, and can be configured to
connect to Greenplum Database. Greenplum has certified the following third-party
client applications with Greenplum Database:
• Business Objects
• Microstrategy
• Informatica Power Center
• Microsoft SQL Server Integration Services (SSIS) and Reporting Services
(SSRS)
• Ascential Datastage
• SAS
• Cognos
Greenplum Professional Services can assist users in configuring their chosen
third-party tool for use with Greenplum Database.
Troubleshooting Connection Problems
A number of things can prevent a client application from successfully connecting to
Greenplum Database. This section explains some of the common causes of connection
problems and how to correct them.
Table 3.5 Common Connection Problems
Problem                                Solution
No pg_hba.conf entry for host or user  In order for Greenplum Database to be able to accept remote
                                       client connections, you must configure your Greenplum
                                       Database master instance so that connections are allowed
                                       from the client hosts and database users that will be
                                       connecting to Greenplum Database. This is done by adding the
                                       appropriate entries to the pg_hba.conf configuration file
                                       (located in the master instance’s data directory). For more
                                       detailed information, see the Greenplum Database
                                       Administrator Guide.
Greenplum Database is not running      If the Greenplum Database master instance is down, users
                                       will not be able to connect. You can verify that the
                                       Greenplum Database system is up by running the gpstate
                                       utility on the Greenplum master host.
Network problems                       If users are connecting to the Greenplum master host from a
                                       remote client, network problems may be preventing a
                                       connection (for example, DNS host name resolution problems,
                                       the host system is down, etc.). To ensure that network
                                       problems are not the cause, try connecting to the Greenplum
                                       master host from the remote client host. For example:
                                       ping hostname
Too many clients already               By default, Greenplum Database is configured to allow a
                                       maximum of 25 concurrent user connections. A connection
                                       attempt that causes that limit to be exceeded will be
                                       refused. This limit is controlled by the max_connections
                                       parameter in the postgresql.conf configuration file of the
                                       Greenplum Database master. See the Greenplum Database
                                       Administrator Guide for more information on increasing the
                                       allowed connections.
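The checks in Table 3.5 can be collected into a short script. The sketch below is illustrative only: the function names are hypothetical, the pg_hba.conf line uses the standard PostgreSQL field order with md5 password authentication chosen as an example, and the client network shown is a placeholder.

```shell
#!/bin/bash
# Format a pg_hba.conf line that allows TCP/IP connections with md5 password authentication.
# Fields: connection type, database, user, client address (CIDR), authentication method.
hba_entry() {
    local database="$1" user="$2" cidr="$3"
    printf 'host  %s  %s  %s  md5\n' "$database" "$user" "$cidr"
}

# Run on the remote client host: check that the master host is reachable.
check_from_client() {
    ping -c 3 "$1"
}

# Run on the master host as gpadmin (assumes gpstate and psql are on the PATH):
# check that the system is up and show the current connection limit.
check_on_master() {
    gpstate
    psql -d template1 -c "SHOW max_connections;"
}

# Placeholder example: allow gpadmin to reach every database from 192.168.1.0/24.
hba_entry all gpadmin 192.168.1.0/24
```

The printed line can be appended to pg_hba.conf in the master data directory. The configuration must then be reloaded before the new entry takes effect; see the Greenplum Database Administrator Guide for the procedure.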
EMC DCA and DIA Getting Started Guide – Chapter 4: Next Steps
4. Next Steps
This chapter explains the next steps to implementing your data warehouse
requirements in Greenplum Database.
Understanding the SQL Features of Greenplum Database
It is important to note that there are no commercial database systems that are fully
compliant with the SQL standard. Greenplum Database is almost fully compliant with
the SQL 1992 standard, with most of the features from SQL 1999. Several features
from SQL 2003 have also been implemented (most notably the SQL OLAP features).
This section addresses the important conformance issues of Greenplum Database as
they relate to the SQL standards. For a feature-by-feature list of Greenplum’s support
of the latest SQL standard, see the Greenplum Database Administrator Guide.
• Core SQL Conformance
• SQL 1992 Conformance
• SQL 1999 Conformance
• SQL 2003 Conformance
• SQL 2008 Conformance
• Greenplum and PostgreSQL Compatibility
Core SQL Conformance
In the process of building a parallel, shared-nothing database system and query
optimizer, certain common SQL constructs are not currently implemented in
Greenplum Database. The following SQL constructs are not supported:
1. UPDATE statements that update the distribution key columns of a hash-distributed
Greenplum table. There is currently no way for the system to redistribute a row to
a different segment when its hash value changes.
2. UPDATE and DELETE statements that require data to move from one segment to
another. This restricts the use of joins in update and delete operations to
hash-distributed tables that have the same distribution key column(s), and the join
condition must specify equality on the distribution key column(s).
3. Correlated subqueries that Greenplum’s parallel optimizer cannot internally
rewrite into non-correlated joins. Most simple uses of correlated subqueries do
work. Those that do not can be manually rewritten using outer joins.
4. Certain rare cases of multi-row subqueries that Greenplum’s parallel optimizer
cannot internally rewrite into equijoins.
5. Some set returning subqueries in EXISTS or NOT EXISTS clauses that
Greenplum’s parallel optimizer cannot rewrite into joins.
6. UNION ALL of joined tables with subqueries.
7. Set-returning functions in the FROM clause of a subquery.
8. Backwards scrolling cursors, including the use of FETCH PRIOR, FETCH FIRST,
FETCH ABSOLUTE, and FETCH RELATIVE.
9. In CREATE TABLE statements (on hash-distributed tables): a UNIQUE or PRIMARY
KEY clause must include all of (or a superset of) the distribution key columns.
Because of this restriction, only one UNIQUE clause or PRIMARY KEY clause is
allowed in a CREATE TABLE statement. UNIQUE or PRIMARY KEY clauses are not
allowed on randomly-distributed tables.
10. CREATE UNIQUE INDEX statements that do not contain all of (or a superset of) the
distribution key columns. CREATE UNIQUE INDEX is not allowed on
randomly-distributed tables.
11. VOLATILE or STABLE functions cannot execute on the segments, and so are
generally limited to being passed literal values as the arguments to their
parameters.
12. Triggers are not supported since they typically rely on the use of VOLATILE
functions.
13. Referential integrity constraints (foreign keys) are not enforced in Greenplum
Database. Users can declare foreign keys and this information is kept in the
system catalog, however.
14. Sequence manipulation functions CURRVAL and LASTVAL.
15. DELETE WHERE CURRENT OF and UPDATE WHERE CURRENT OF (positioned
delete and positioned update operations).
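To make restriction 9 concrete, the following sketch prints example DDL for a hash-distributed table. The function name and the table and column names are hypothetical; the second statement is left commented out because it illustrates a definition that the restriction rules out.

```shell
#!/bin/bash
# Print example DDL illustrating the PRIMARY KEY restriction on hash-distributed tables.
# Table and column names are hypothetical.
unique_key_examples() {
    cat <<'SQL'
-- Allowed: the primary key includes the distribution key column (order_id).
CREATE TABLE orders (
    order_id  int,
    line_no   int,
    amount    numeric,
    PRIMARY KEY (order_id, line_no)
) DISTRIBUTED BY (order_id);

-- Not allowed under restriction 9: the primary key omits the distribution key column.
-- CREATE TABLE orders_bad (
--     order_id  int,
--     line_no   int,
--     PRIMARY KEY (line_no)
-- ) DISTRIBUTED BY (order_id);
SQL
}

# Print the statements; pipe the output to psql to run the allowed definition.
unique_key_examples
```

The reason for the restriction is that each segment can only check uniqueness among its own rows, so the rows that could collide must be guaranteed to land on the same segment.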
SQL 1992 Conformance
The following features of SQL 1992 are not supported in Greenplum Database:
1. NATIONAL CHARACTER (NCHAR) and NATIONAL CHARACTER VARYING
(NVARCHAR). Users can declare the NCHAR and NVARCHAR types, however they are
just synonyms for CHAR and VARCHAR in Greenplum Database.
2. CREATE ASSERTION statement.
3. INTERVAL literals are supported in Greenplum Database, but do not conform to
the standard.
4. GET DIAGNOSTICS statement.
5. GRANT INSERT or UPDATE privileges on columns. Privileges can only be granted
on tables in Greenplum Database.
6. GLOBAL TEMPORARY TABLEs and LOCAL TEMPORARY TABLEs. Greenplum
TEMPORARY TABLEs do not conform to the SQL standard, but many commercial
database systems have implemented temporary tables in the same way.
Greenplum temporary tables are the same as VOLATILE TABLEs in Teradata.
7. UNIQUE predicate.
8. MATCH PARTIAL for referential integrity checks (most likely will not be
implemented in Greenplum Database).
SQL 1999 Conformance
The following features of SQL 1999 are not supported in Greenplum Database:
1. Large Object data types: BLOB, CLOB, NCLOB. However, the BYTEA and TEXT
columns can store very large amounts of data in Greenplum Database (hundreds
of megabytes).
2. Recursive WITH clause or the WITH RECURSIVE clause (recursive queries).
Non-recursive WITH clauses can easily be rewritten by moving the common table
expression into the FROM clause as a derived table.
3. MODULE (SQL client modules).
4. CREATE PROCEDURE (SQL/PSM). This can be worked around in Greenplum
Database by creating a FUNCTION that returns void, and invoking the function as
follows:
    SELECT myfunc(args);
5. The PostgreSQL/Greenplum function definition language (PL/PGSQL) is a subset
of Oracle’s PL/SQL, rather than being compatible with the SQL/PSM function
definition language. Greenplum Database also supports function definitions
written in Python, Perl, and R.
6. BIT and BIT VARYING data types (intentionally omitted). These were deprecated
in SQL 2003, and replaced in SQL 2008.
7. Greenplum supports identifiers up to 63 characters long. The SQL standard
requires support for identifiers up to 128 characters long.
8. Prepared transactions (PREPARE TRANSACTION, COMMIT PREPARED, ROLLBACK
PREPARED). This also means Greenplum does not support XA Transactions
(two-phase commit coordination of database transactions with external transactions).
9. CHARACTER SET option on the definition of CHAR() or VARCHAR() columns.
10. Specification of CHARACTERS or OCTETS (BYTES) on the length of a CHAR() or
VARCHAR() column. For example, VARCHAR(15 CHARACTERS) or VARCHAR(15
OCTETS) or VARCHAR(15 BYTES).
11. CURRENT_SCHEMA function.
12. CREATE DISTINCT TYPE statement. CREATE DOMAIN can be used as a
work-around in Greenplum.
13. The explicit table construct.
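The CREATE PROCEDURE work-around in item 4 can be sketched as follows. The function name log_message and its body are hypothetical, and the example assumes that the PL/pgSQL language is available in the target database.

```shell
#!/bin/bash
# Print SQL that defines a function returning void and then invokes it,
# as a stand-in for a stored procedure. Names are hypothetical.
procedure_workaround_sql() {
    cat <<'SQL'
-- A function that returns void takes the place of a stored procedure.
CREATE FUNCTION log_message(msg text) RETURNS void AS $$
BEGIN
    RAISE NOTICE 'message: %', msg;
END;
$$ LANGUAGE plpgsql;

-- Invoke it with SELECT rather than with a CALL statement.
SELECT log_message('load finished');
SQL
}

# Print the statements; pipe the output to psql to run them.
procedure_workaround_sql
```

The heredoc delimiter is quoted so that the shell leaves the $$ function-body markers untouched.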
SQL 2003 Conformance
The following features of SQL 2003 are not supported in Greenplum Database:
1. XML data type (PostgreSQL does support this).
2. MERGE statements.
3. IDENTITY columns and the associated GENERATED ALWAYS/GENERATED BY
   DEFAULT clause. The SERIAL or BIGSERIAL data types are very similar to INT or
   BIGINT GENERATED BY DEFAULT AS IDENTITY.
4. MULTISET modifiers on data types.
5. ROW data type.
6. Greenplum Database syntax for using sequences is non-standard. For example,
   nextval('seq') is used in Greenplum instead of the standard NEXT VALUE FOR
   seq.
7. GENERATED ALWAYS AS columns. Views can be used as a work-around.
8. The sample clause (TABLESAMPLE) on SELECT statements. The random()
   function can be used as a work-around to get random samples from tables.
9. NULLS FIRST/NULLS LAST clause on SELECT statements and subqueries (nulls
   are always last in Greenplum Database).
10. The partitioned join tables construct (PARTITION BY in a join).
11. GRANT SELECT privileges on columns. Privileges can only be granted on tables in
    Greenplum Database. Views can be used as a work-around.
12. For CREATE TABLE x (LIKE(y)) statements, Greenplum does not support the
    [INCLUDING|EXCLUDING] [DEFAULTS|CONSTRAINTS|INDEXES] clauses.
13. Greenplum array data types are almost SQL standard compliant with some
    exceptions. Generally customers should not encounter any problems using them.
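Three of the work-arounds named in the list above (sequences, random sampling, and column-level access through views) might look like the following. The invoices table, the invoice_id_seq sequence, the invoice_summary view, and the reporting role are hypothetical names chosen for this sketch, and the 1 percent sampling rate is an arbitrary assumption.

```sql
-- Hypothetical objects, used only for this illustration.
CREATE TABLE invoices (invoice_id int, customer text, amount numeric)
DISTRIBUTED BY (invoice_id);
CREATE ROLE reporting;

-- Sequences: nextval('seq') in place of the standard NEXT VALUE FOR seq.
CREATE SEQUENCE invoice_id_seq;
SELECT nextval('invoice_id_seq');

-- Sampling: random() in place of TABLESAMPLE. Each row is kept with a
-- probability of about 1 percent, so the sample size varies from run to run.
SELECT * FROM invoices WHERE random() < 0.01;

-- Column-level access: expose only the permitted columns through a view
-- and grant SELECT on the view instead of on individual columns.
CREATE VIEW invoice_summary AS
    SELECT invoice_id, customer FROM invoices;
GRANT SELECT ON invoice_summary TO reporting;
```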
SQL 2008 Conformance
The following features of SQL 2008 are not supported in Greenplum Database:
1. BINARY and VARBINARY data types. BYTEA can be used in place of VARBINARY in
   Greenplum Database.
2. FETCH FIRST or FETCH NEXT clause for SELECT, for example:
      SELECT id, name FROM tab1 ORDER BY id OFFSET 20 ROWS FETCH
      NEXT 10 ROWS ONLY;
   Greenplum has LIMIT and LIMIT OFFSET clauses instead.
3. The ORDER BY clause is ignored in views and subqueries unless a LIMIT clause is
   also used. This is intentional, as the Greenplum optimizer cannot determine when
   it is safe to avoid the sort, causing an unexpected performance impact for such
   ORDER BY clauses. To work around this, you can specify a very large LIMIT. For
   example:
      SELECT * FROM mytable ORDER BY 1 LIMIT 9999999999
4. The row subquery construct is not supported.
5. TRUNCATE TABLE does not accept the CONTINUE IDENTITY and RESTART
   IDENTITY clauses.
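As a brief sketch of the LIMIT alternative mentioned above, the FETCH NEXT example can be expressed with LIMIT and OFFSET. The table tab1 and its columns come from the example in the list and are assumed to exist.

```sql
-- SQL 2008 form (not supported in Greenplum Database):
--   SELECT id, name FROM tab1 ORDER BY id OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;

-- Greenplum form: skip the first 20 rows, then return the next 10.
SELECT id, name
FROM tab1
ORDER BY id
LIMIT 10 OFFSET 20;
```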
Greenplum and PostgreSQL Compatibility
Greenplum Database is based on PostgreSQL 8.2 with a few features added in from
the 8.3 release. To support the distributed nature and typical workload of a Greenplum
Database system, some SQL commands have been added or modified, and there are a
few PostgreSQL features that are not supported. Greenplum has also added features
not found in PostgreSQL, such as physical data distribution, parallel query
optimization, external tables, resource queues for workload management and
enhanced table partitioning. For full SQL syntax and references, see the Greenplum
Database Administrator Guide.
Table 4.1 SQL Support in Greenplum Database
SQL Command                 Supported in  Modifications, Limitations, Exceptions
                            Greenplum
ALTER AGGREGATE             YES
ALTER CONVERSION            YES
ALTER DATABASE              YES
ALTER DOMAIN                YES
ALTER FILESPACE             YES           Greenplum Database parallel tablespace feature -
                                          not in PostgreSQL 8.2.14.
ALTER FUNCTION              YES
ALTER GROUP                 YES           An alias for ALTER ROLE
ALTER INDEX                 YES
ALTER LANGUAGE              YES
ALTER OPERATOR              YES
ALTER OPERATOR CLASS        NO
ALTER RESOURCE QUEUE        YES           Greenplum Database workload management feature -
                                          not in PostgreSQL.
ALTER ROLE                  YES           Greenplum Database Clauses:
                                            RESOURCE QUEUE queue_name | none
ALTER SCHEMA                YES
ALTER SEQUENCE              YES
ALTER TABLE                 YES           Unsupported Clauses / Options:
                                            CLUSTER ON
                                            ENABLE/DISABLE TRIGGER
                                          Greenplum Database Clauses:
                                            ADD | DROP | RENAME | SPLIT | EXCHANGE
                                            PARTITION | SET SUBPARTITION TEMPLATE | SET
                                            WITH (REORGANIZE=true | false) | SET
                                            DISTRIBUTED BY
ALTER TABLESPACE            YES
ALTER TRIGGER               NO
ALTER TYPE                  YES
ALTER USER                  YES           An alias for ALTER ROLE
ANALYZE                     YES
BEGIN                       YES
CHECKPOINT                  YES
CLOSE                       YES
CLUSTER                     YES
COMMENT                     YES
COMMIT                      YES
COMMIT PREPARED             NO
COPY                        YES           Modified Clauses:
                                            ESCAPE [ AS ] 'escape' | 'OFF'
                                          Greenplum Database Clauses:
                                            [LOG ERRORS INTO error_table]
                                            SEGMENT REJECT LIMIT count [ROWS|PERCENT]
CREATE AGGREGATE            YES           Unsupported Clauses / Options:
                                            [ , SORTOP = sort_operator ]
                                          Greenplum Database Clauses:
                                            [ , PREFUNC = prefunc ]
                                          Limitations:
                                          The functions used to implement the aggregate
                                          must be IMMUTABLE functions.
CREATE CAST                 YES
CREATE CONSTRAINT TRIGGER   NO
CREATE CONVERSION           YES
CREATE DATABASE             YES
CREATE DOMAIN               YES
CREATE EXTERNAL TABLE       YES           Greenplum Database parallel ETL feature - not in
                                          PostgreSQL 8.2.14.
CREATE FILESPACE            YES           Greenplum Database parallel tablespace feature -
                                          not in PostgreSQL 8.2.14.
CREATE FUNCTION             YES           Limitations:
                                          Functions defined as STABLE or VOLATILE can be
                                          executed in Greenplum Database provided that they
                                          are executed on the master only. STABLE and
                                          VOLATILE functions cannot be used in statements
                                          that execute at the segment level.
CREATE GROUP                YES           An alias for CREATE ROLE
CREATE INDEX                YES           Greenplum Database Clauses:
                                            USING bitmap (bitmap indexes)
                                          Limitations:
                                          UNIQUE indexes are allowed only if they contain
                                          all of (or a superset of) the Greenplum
                                          distribution key columns.
                                          CONCURRENTLY keyword not supported in Greenplum.
CREATE LANGUAGE             YES
CREATE OPERATOR             YES           Limitations:
                                          The function used to implement the operator must
                                          be an IMMUTABLE function.
CREATE OPERATOR CLASS       NO
CREATE OPERATOR FAMILY      NO
CREATE RESOURCE QUEUE       YES           Greenplum Database workload management feature -
                                          not in PostgreSQL 8.2.14.
CREATE ROLE                 YES           Greenplum Database Clauses:
                                            RESOURCE QUEUE queue_name | none
CREATE RULE                 YES
CREATE SCHEMA               YES
CREATE SEQUENCE             YES           Limitations:
                                          • The lastval and currval functions are not
                                            supported.
                                          • The setval function is only allowed in queries
                                            that do not operate on distributed data.
CREATE TABLE                YES           Unsupported Clauses / Options:
                                            [GLOBAL | LOCAL]
                                            REFERENCES
                                            FOREIGN KEY
                                            [DEFERRABLE | NOT DEFERRABLE]
                                          Limited Clauses:
                                          • UNIQUE or PRIMARY KEY constraints are only
                                            allowed on hash-distributed tables
                                            (DISTRIBUTED BY), and the constraint columns
                                            must be the same as or a superset of the
                                            table’s distribution key columns.
                                          Greenplum Database Clauses:
                                            DISTRIBUTED BY (column, [ ... ] ) |
                                            DISTRIBUTED RANDOMLY
                                            PARTITION BY type (column [, ...])
                                            ( partition_specification, [...] )
                                            WITH (appendonly=true
                                            [,compresslevel=value,blocksize=value]
                                            )
CREATE TABLE AS             YES           See CREATE TABLE
CREATE TABLESPACE           NO            Greenplum Database Clauses:
                                            FILESPACE filespace_name
CREATE TRIGGER              NO
CREATE TYPE                 YES           Limitations:
                                          The functions used to implement a new base type
                                          must be IMMUTABLE functions.
CREATE USER                 YES           An alias for CREATE ROLE
CREATE VIEW                 YES
DEALLOCATE                  YES
DECLARE                     YES           Unsupported Clauses / Options:
                                            SCROLL
                                            FOR UPDATE [ OF column [, ...] ]
                                          Limitations:
                                          Cursors are non-updatable, and cannot be
                                          backward-scrolled. Forward scrolling is
                                          supported.
DELETE                      YES           Unsupported Clauses / Options:
                                            RETURNING
                                          Limitations:
                                          • Joins must be on a common Greenplum
                                            distribution key (equijoins)
                                          • Cannot use STABLE or VOLATILE functions in a
                                            DELETE statement if mirrors are enabled
DROP AGGREGATE              YES
DROP CAST                   YES
DROP CONVERSION             YES
DROP DATABASE               YES
DROP DOMAIN                 YES
DROP EXTERNAL TABLE         YES           Greenplum Database parallel ETL feature - not in
                                          PostgreSQL 8.2.14.
DROP FILESPACE              YES           Greenplum Database parallel tablespace feature -
                                          not in PostgreSQL 8.2.14.
DROP FUNCTION               YES
DROP GROUP                  YES           An alias for DROP ROLE
DROP INDEX                  YES
DROP LANGUAGE               YES
DROP OPERATOR               YES
DROP OPERATOR CLASS         NO
DROP OWNED                  NO
DROP RESOURCE QUEUE         YES           Greenplum Database workload management feature -
                                          not in PostgreSQL 8.2.14.
DROP ROLE                   YES
DROP RULE                   YES
DROP SCHEMA                 YES
DROP SEQUENCE               YES
DROP TABLE                  YES
DROP TABLESPACE             NO
DROP TRIGGER                NO
DROP TYPE                   YES
DROP USER                   YES           An alias for DROP ROLE
DROP VIEW                   YES
END                         YES
EXECUTE                     YES
EXPLAIN                     YES
FETCH                       YES           Unsupported Clauses / Options:
                                            LAST
                                            PRIOR
                                            BACKWARD
                                            BACKWARD ALL
                                          Limitations:
                                          Cannot fetch rows in a nonsequential fashion;
                                          backward scan is not supported.
GRANT                       YES
INSERT                      YES           Unsupported Clauses / Options:
                                            RETURNING
LISTEN                      NO
LOAD                        YES
LOCK                        YES
MOVE                        YES           See FETCH
NOTIFY                      NO
PREPARE                     YES
PREPARE TRANSACTION         NO
REASSIGN OWNED              YES
REINDEX                     YES
RELEASE SAVEPOINT           YES
RESET                       YES
REVOKE                      YES
ROLLBACK                    YES
ROLLBACK PREPARED           NO
ROLLBACK TO SAVEPOINT       YES
SAVEPOINT                   YES
SELECT                      YES           Limitations:
                                          • Limited use of VOLATILE and STABLE functions
                                            in FROM or WHERE clauses
                                          • Limited use of correlated subquery expressions
                                          • Text search (Tsearch2) is not supported
                                          • FETCH FIRST or FETCH NEXT clauses not supported
                                          Greenplum Database Clauses (OLAP):
                                            [GROUP BY grouping_element [, ...]]
                                            [WINDOW window_name AS
                                            (window_specification)]
                                            [FILTER (WHERE condition)] applied to an
                                            aggregate function in the SELECT list
SELECT INTO                 YES           See SELECT
SET                         YES
SET CONSTRAINTS             NO            In PostgreSQL, this only applies to foreign key
                                          constraints, which are currently not enforced in
                                          Greenplum Database.
SET ROLE                    YES
SET SESSION AUTHORIZATION   YES           Deprecated as of PostgreSQL 8.1 - see SET ROLE.
SET TRANSACTION             YES
SHOW                        YES
START TRANSACTION           YES
TRUNCATE                    YES
UNLISTEN                    NO
UPDATE                      YES           Unsupported Clauses:
                                            RETURNING
                                          Limitations:
                                          • SET not allowed for Greenplum distribution key
                                            columns.
                                          • Joins must be on a common Greenplum
                                            distribution key (equijoins).
                                          • Cannot use STABLE or VOLATILE functions in an
                                            UPDATE statement if mirrors are enabled.
VACUUM                      YES           Limitations:
                                          VACUUM FULL is not recommended in Greenplum
                                          Database.
VALUES                      YES
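To make the Greenplum-specific CREATE TABLE clauses from Table 4.1 concrete, here is a minimal sketch. The customers and sales tables, their columns, the date range, and the compression level are hypothetical choices for illustration. See the Greenplum Database Administrator Guide for the full syntax.

```sql
-- Heap table with a PRIMARY KEY. The constraint columns match the
-- distribution key, as the Limited Clauses note in the table requires.
CREATE TABLE customers (
    customer_id int PRIMARY KEY,
    name        text
)
DISTRIBUTED BY (customer_id);

-- Append-only, compressed table, hash-distributed on sale_id and
-- range-partitioned by month on sale_date.
CREATE TABLE sales (
    sale_id     bigint,
    customer_id int,
    sale_date   date,
    amount      numeric(12,2)
)
WITH (appendonly=true, compresslevel=5)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2010-01-01') INCLUSIVE
    END   (date '2011-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);

-- Table with no natural distribution key: rows are spread across the
-- segments without regard to column values.
CREATE TABLE web_clicks (
    clicked_at timestamp,
    url        text
)
DISTRIBUTED RANDOMLY;
```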
Providing User Access to Greenplum Database
Greenplum Database manages database access permissions using the concept of roles.
The concept of roles subsumes the concepts of users and groups. A role can be a
database user, a group, or both. Roles can own database objects (for example, tables)
and can assign privileges on those objects to other roles to control access to the
objects. Roles can be members of other roles, so a member role can inherit the
object privileges of its parent role.
Every Greenplum Database system contains a set of database roles (users and groups).
Those roles are separate from the users and groups managed by the operating system
on which the database process runs. However, for convenience you may want to
maintain a relationship between operating system user names and Greenplum
Database role names, since many of the client applications use the current operating
system user name as the default.
In Greenplum Database, users log in and connect through the master instance, which
then verifies their role and access privileges. In order to bootstrap the Greenplum
Database system, a freshly initialized system always contains one predefined
superuser role. This role will have the same name as the operating system user that
initialized the Greenplum Database system. Customarily, this role is named gpadmin.
In order to create more roles you first have to connect as this initial role.
See the Greenplum Database Administrator Guide for more information on creating
additional roles in Greenplum Database.
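As a rough sketch of the role concepts described above, the following statements create a group role, a login role that is a member of it, and a privilege that the member inherits. The role names, the password, and the quarterly_totals table are hypothetical, and the statements are assumed to be run while connected as the initial superuser role (for example, gpadmin).

```sql
-- A role used as a group: it cannot log in by itself.
CREATE ROLE analysts;

-- A role used as a database user; replace the placeholder password.
CREATE ROLE jsmith LOGIN PASSWORD 'change_me';

-- Make jsmith a member of analysts.
GRANT analysts TO jsmith;

-- Hypothetical table owned by the role that creates it.
CREATE TABLE quarterly_totals (quarter text, total numeric)
DISTRIBUTED BY (quarter);

-- Grant access to the group; jsmith inherits it through membership.
GRANT SELECT ON quarterly_totals TO analysts;
```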
Creating Databases and Loading Data
After establishing your database connections, the next step is to begin creating
databases and loading data. See the Greenplum Database Administrator Guide for
more information about creating databases, schemas, tables, and other database
objects in Greenplum Database and loading your data.
Glossary
A
append-only tables
An append-only (AO) table is a storage representation in which new rows can be
appended to a table, but existing rows cannot be updated or deleted. Because each
row does not need to store the MVCC transaction visibility information, storage on
disk is more compact, saving 20 bytes per row. AO tables can also be compressed.
array
The set of physical devices (hosts, servers, network switches, etc.) used to house a
Greenplum Database system.
B
bandwidth
Bandwidth is the maximum amount of information that can be transmitted along a
channel, such as a network or I/O channel. This data transfer rate is usually measured
in megabytes or gigabytes per second (MB/s or GB/s).
BI
Business Intelligence (BI) is a broad category of applications and technologies for
gathering, storing, analyzing, and providing access to data with the goal of helping
users make better business decisions.
C
catalog
See system catalog.
column-oriented table
Greenplum provides a choice of storage orientation models for a table: row or
column. A column-oriented table stores its content on disk by column rather than by
row. This storage model has performance advantages for certain types of queries.
Only append-only tables can be column-oriented; heap tables are always
row-oriented.
D
data directory
The data directory is the file system location on disk where database data is stored.
The master data directory contains the global system catalog only — no user data is
stored on the master. The data directory on the segment instances has user data for
that segment plus a local copy of the system catalog. The data directory contains
several subdirectories, control files, and configuration files as well.
DCA
Data Computing Appliance. See Greenplum Data Computing Appliance.
distributed
Certain database objects in Greenplum Database, such as tables and indexes, are
distributed. They are divided into equal parts and spread out among the segment
instances based on a hashing algorithm. To the end-user and client software,
however, a distributed object appears as a conventional database object.
distribution key
In a Greenplum table that uses hash distribution, one or more columns are used as
the distribution key, meaning those columns are used to divide the data among all of
the segments. The distribution key should be the primary key of the table or a unique
column or set of columns.
distribution policy
The distribution policy determines how to divide the rows of a table among the
Greenplum segments. Greenplum Database provides two types of distribution
policy: hash distribution and random distribution.
DDL
Data Definition Language. A subset of SQL commands used for defining the
structure of a database.
DML
Data Manipulation Language. SQL commands that store, manipulate, and
retrieve data from tables. INSERT, UPDATE, DELETE, and SELECT are DML
commands.
E
ELT
Extract, load, and transform (ELT) is a process in data warehousing that involves
extracting data from outside data sources, loading the raw data into a
high-performance database management system (such as Greenplum Database), and
then performing the data transformations within the database itself.
ETL
Extract, transform, and load (ETL) is a process in data warehousing that involves
extracting data from outside data sources, transforming it to meet the operational
requirements of the data warehouse, and loading it into the target database.
G
gang
For each slice of the query plan there is at least one query executor worker process
assigned. During query execution, each segment will have a number of processes
working on the query in parallel. Related processes that are working on the same
portion of the query plan on different segments are referred to as gangs.
Greenplum Database
Greenplum Database is the industry’s first massively parallel processing (MPP)
database server based on open-source technology. It is explicitly designed to support
business intelligence (BI) applications and large, multi-terabyte data warehouses.
Greenplum Database is based on PostgreSQL.
Greenplum Database system
An associated set of segment instances and a master instance running on an array,
which can be composed of one or more hosts.
Greenplum Data Computing Appliance
Greenplum Data Computing Appliance (Greenplum DCA) is a self-contained data
warehouse solution that integrates all of the database software, servers and switches
necessary to perform big data analytics. Greenplum DCA is delivered racked and
ready for immediate data loading and query execution.
Greenplum GP100
The model name of the Greenplum Data Computing Appliance half rack solution.
Greenplum GP1000
The model name of the Greenplum Data Computing Appliance full rack solution.
Greenplum instance
The process that serves a database. An instance of Greenplum Database consists
of a master instance and two or more segment instances. Users and
administrators always connect to the database via the master instance.
GP100
See Greenplum GP100.
GP1000
See Greenplum GP1000.
H
hash distribution
With hash distribution, one or more table columns are used as the distribution key for
the table. The distribution key is used by a hashing algorithm to assign each row to
a particular segment. Keys of the same value will always hash to the same segment.
heap tables
Whenever you create a table without specifying a storage structure, the default is a
heap storage structure. In a heap structure, the table is an unordered collection of
data that allows multiple copies or versions of a row. Heap tables have row-level
versioning information and allow updates and deletes. See also append-only tables
and multiversion concurrency control.
host
A host represents a physical machine or compute node in a Greenplum Database
system. In Greenplum Database, one host is designated as the master. The other
hosts in the system have one or more segments on them.
I
interconnect
The interconnect is the networking layer of Greenplum Database. When a user
connects to a database and issues a query, processes are created on each of the
segments to handle the work of that query. The interconnect refers to the
inter-process communication between the segments and master, as well as the
network infrastructure on which this communication relies.
I/O
Input/Output (I/O) refers to the transfer of data to and from a system or device using
a communication channel.
J
JDBC
Java Database Connectivity is an application program interface (API) specification
for connecting programs written in Java to data in a database management system
(DBMS). The application program interface lets you encode access request
statements in SQL that are then passed to the program that manages the database.
M
master
The master is the entry point to a Greenplum Database system. It is the database
listener process (postmaster) that accepts client connections and dispatches the SQL
commands issued by the users of the system.
The master is where the global system catalog resides. However, the master does not
contain any user data. User data resides only on the segments. The master does the
work of authenticating user connections, parsing and planning the incoming SQL
commands, distributing the query plan to the segments for execution, coordinating
the results returned by each of the segments, and presenting the final results to the
user.
master instance
The database process that serves the Greenplum master. See master.
mirror
A mirror is a backup copy of a segment (or master) that is stored on a different host
than the primary copy. Mirrors are useful for maintaining operations if a host in your
Greenplum Database system fails. Mirroring is an optional feature of Greenplum
Database. Mirror segments are evenly distributed among other hosts in the array. If
a host that holds a primary segment fails, Greenplum Database will switch to the
mirror or secondary host.
motion node
A motion node is a portion of a query execution plan that indicates data movement
between the various database instances of Greenplum Database (segments and the
master). Some operations, such as joins, require segments to send and receive tuples
to one another in order to satisfy the operation. A motion node can also indicate data
movement from the segments back up to the master.
MPP
Massively Parallel Processing.
multiversion concurrency control
Unlike traditional database systems, which use locks for concurrency control,
Greenplum Database (like PostgreSQL) maintains data consistency by using a
multiversion model (multiversion concurrency control or MVCC). This means that
while querying a database, each transaction sees a snapshot of data which protects
the transaction from viewing inconsistent data that could be caused by (other)
concurrent updates on the same data rows. This provides transaction isolation for
each database session.
MVCC, by eschewing explicit locking methodologies of traditional database
systems, minimizes lock contention in order to allow for reasonable performance in
multiuser environments. The main advantage to using the MVCC model of
concurrency control rather than locking is that in MVCC locks acquired for querying
(reading) data do not conflict with locks acquired for writing data, and so reading
never blocks writing and writing never blocks reading.
MVCC
See multiversion concurrency control.
O
ODBC
Open Database Connectivity, a standard database access method that makes it
possible to access any data from any client application, regardless of which database
management system (DBMS) is handling the data. ODBC manages this by inserting
a middle layer, called a database driver, between a client application and the DBMS.
The purpose of this layer is to translate the application’s data queries into commands
that the DBMS understands.
OLAP
Online Analytical Processing (OLAP) is a category of technologies for collecting,
managing, processing and presenting multidimensional data for analysis and
management. OLAP leverages existing data from a relational schema or data
warehouse (data source) by placing key performance indicators (measures) into
context (dimensions). As of release 3.1, OLAP functions are supported in
Greenplum Database. In practice, OLAP functions allow application developers to
compose analytic business queries more easily and more efficiently. For example,
moving averages and moving sums can be calculated over various intervals;
aggregations and ranks can be reset as selected column values change; and complex
ratios can be expressed in simple terms.
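The moving averages and ranks mentioned above can be expressed with window functions. The following is a sketch under stated assumptions: the daily_sales table and its columns are hypothetical, and the three-row window is an arbitrary choice.

```sql
-- Hypothetical table, used only for this illustration.
CREATE TABLE daily_sales (sale_date date, region text, amount numeric)
DISTRIBUTED BY (sale_date);

SELECT sale_date,
       region,
       amount,
       -- Moving average over the current row and the two preceding rows
       -- within each region, ordered by date.
       avg(amount) OVER (PARTITION BY region
                         ORDER BY sale_date
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg,
       -- Rank restarts for each region as the partitioning value changes.
       rank() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
FROM daily_sales;
```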
OLTP
Online Transactional Processing (OLTP) is a mode of database processing
involving single, small updates from end-point applications and real-time
transactional systems.
P
partitioned tables
Partitioning is a way to logically divide the data in a table for better performance and
easier maintenance. In Greenplum Database, partitioning is a procedure that creates
multiple sub-tables (or child tables) from a single large table (or parent table). The
primary purpose is to improve performance by scanning only the relevant data
needed to satisfy a query. Note that partitioned tables are also distributed.
Perl DBI
Perl Database Interface (DBI) is an API for connecting programs written in Perl to
database management systems (DBMS). It is the most common database interface
for the Perl programming language.
PostgreSQL
PostgreSQL is a SQL compliant, open source relational database management
system (RDBMS). Greenplum Database uses a modified version of PostgreSQL as
its underlying database server. For more information on PostgreSQL go to
http://www.postgresql.org.
postgresql.conf
The server configuration file that configures various aspects of the database server.
This configuration file is located in the data directory of the database instance. In
Greenplum Database, the master and each segment instance have their own
postgresql.conf file.
postgres process
The postgres executable is the actual PostgreSQL server process that processes
queries. The database listener postgres process (also known as the postmaster)
creates other postgres subprocesses as needed to handle client connections.
postmaster
In releases prior to Greenplum Database 3.2 and PostgreSQL 8.2, the database
listener process was called postmaster. The postmaster process was renamed to
the postgres process in Greenplum Database 3.2 and PostgreSQL 8.2. However, many
users who are familiar with PostgreSQL still refer to the database listener process as
the postmaster. In Greenplum Database, there is a postgres database listener
process for the Greenplum master instance and each segment instance.
psql
This is the interactive terminal to PostgreSQL and Greenplum Database. You can
use psql to access a database and issue SQL commands.
Q
QD
See query dispatcher.
QE
See query executor.
query dispatcher
The query dispatcher (QD) is a process that is initiated when users connect to the
master and issue SQL commands. This process represents a user session and is
responsible for sending the query plan to the segments and coordinating the results
it gets back. The query dispatcher process spawns one or more query executor
processes to assist in the execution of SQL commands.
query executor
A query executor process (QE) is associated with a query dispatcher (QD) process
and operates on its behalf. Query executor processes run on the segment instances
and execute their slice of the query plan on a segment.
query plan
A query plan is the set of operations that Greenplum Database will perform to
produce the answer to a given query. Each node or step in the plan represents a
database operation such as a table scan, join, aggregation or sort. Plans are read and
executed from bottom to top. Greenplum Database supports an additional plan node
type called a motion node. See also slice.
R
rack
A type of shelving to which computer components can be attached vertically, one
on top of the other. Components are normally screwed into front-mounted, tapped
metal strips with holes which are spaced so as to accommodate the height of devices
of various U-sizes. Racks usually have their height denominated in U-units.
RAID
Redundant Array of Independent (or Inexpensive) Disks. RAID is a system of using
multiple hard drives for sharing or replicating data among the drives. The benefit of
RAID is increased data integrity, fault-tolerance and/or performance. Multiple hard
drives are grouped and seen by the OS as one logical hard drive.
RAM
Random Access Memory. The main memory of a computer system used for storing
programs and data. RAM provides temporary read/write storage while hard disks
offer semi-permanent storage.
random distribution
With random distribution, table rows are sent to the segments as they come in,
cycling across the segments in a round-robin fashion. Rows with columns having the
same values will not necessarily be located on the same segment. Although a random
distribution ensures even data distribution, there are performance advantages to
choosing a hash distribution policy whenever possible.
S
segment
A segment represents a portion of data in a Greenplum database. User-defined tables
and their indexes are distributed across the available number of segment instances in
the Greenplum Database system. Each segment instance contains a distinct portion
of the user data. A primary segment instance and its mirror both store the same
segment of data.
segment instance
The segment instance is the database server process (postmaster) that serves
segments. Users do not connect to segment instances directly, but through the
master.
server
See host.
slice
In order to achieve maximum parallelism during query execution, Greenplum
divides the work of the query plan into slices. A slice is a portion of the plan that can
be worked on independently at the segment level. A query plan is sliced wherever a
motion node occurs in the plan, one slice on each side of the motion. Plans that do
not require data movement (such as catalog lookups on the master) are known as
single-slice plans.
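One way to see slices and motion nodes is to run EXPLAIN on a join whose tables are distributed on different keys. The tables below are hypothetical and the plan produced depends on the system and the data, so no output is shown. In the resulting plan, motion nodes (for example Gather Motion or Redistribute Motion) mark the boundaries between slices.

```sql
-- Hypothetical tables distributed on different keys, so the join below
-- is likely to require data movement between segments.
CREATE TABLE dim_customer (customer_id int, name text)
DISTRIBUTED BY (customer_id);
CREATE TABLE fact_sales (sale_id int, customer_id int, amount numeric)
DISTRIBUTED BY (sale_id);

-- Display the query plan without executing the query.
EXPLAIN
SELECT c.name, sum(f.amount) AS total
FROM fact_sales f
JOIN dim_customer c ON f.customer_id = c.customer_id
GROUP BY c.name;
```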
star schema
A relational database design often used in data warehousing. The star schema is
organized around a central table (fact table) joined to a few smaller tables
(dimension tables) using foreign key references. The fact table contains raw numeric
items that represent relevant business facts (price, number of units sold, etc.).
system catalog
The system catalogs are the place where a relational database management system
stores schema metadata, such as information about tables and columns, and internal
bookkeeping information. The system catalog in Greenplum Database is the same as
the PostgreSQL catalog with some additional tables to support the distributed nature
of the Greenplum system and databases. In Greenplum Database, the master
contains the global system catalog tables. The segments also maintain their own
local copy of the system catalog.
T
tuple
A tuple is another name for a row or record in a relational database table.
W
WAL
Write-Ahead Logging (WAL) is a standard approach to transaction logging. WAL’s
central concept is that changes to data files (where tables and indexes reside) are
logged before they are written to permanent storage. Data pages do not need to be
flushed to disk on every transaction commit. In the event of a crash, data changes
not yet applied to the database can be recovered from the log. A major benefit of
using WAL is a significantly reduced number of disk writes.