null  null
Business Statistics
Made Easy in SAS
®
Gregory Lee
From Business Statistics Made Easy in SAS®.
Full book available for purchase here.
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
About the Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Chapter 1 • Introduction to the Central Textbook Example . . . . . . . . . . . . . . . . . . . . . . 1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
The Company . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Current Research Needs of the Company . . . . . . . . . . . . . 2
Your Brief for the Case Example . . . . . . . . . . . . . . . . . . . . 5
Extended Analytical Skills Needed in the Project . . . . . . . . 6
Chapter 2 • Introduction to the Statistics Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Introductory Case: Big Data in the Airline Industry . . . . . . 9
Introduction to the Statistics Process . . . . . . . . . . . . . . . . 11
Step 1: Your Needs & Requirements . . . . . . . . . . . . . . . . 12
Step 2: Getting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Step 3: Extracting Statistics from the Data . . . . . . . . . . . . 15
Step 4: Understanding & Decision Making . . . . . . . . . . . . 17
Summary: Challenges in the Statistics Process . . . . . . . . 17
Advice to the Statistically Terrified . . . . . . . . . . . . . . . . . . 18
Chapter 3 • Introduction to Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Introductory Case: Royal FrieslandCampina . . . . . . . . . . 21
Brief Introduction to Samples, Populations & Data . . . . . 23
Basic Characteristics of Variables . . . . . . . . . . . . . . . . . . 27
Chapter 4 • Data Collection & Capture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Correct Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Choose Constructs and Variable Measurements . . . . . . .
Initial Data Capture: Which Package? . . . . . . . . . . . . . . .
Dealing with Data Once It Has Been Captured . . . . . . . .
33
34
35
43
43
iv Contents
Database & Data Analysis Software . . . . . . . . . . . . . . . . 48
Some Complications in Datasets . . . . . . . . . . . . . . . . . . . 48
Chapter 5 • Introduction to SAS® . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Introductory Vignette: SAS On Top of the
Analytics World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Brief Introduction to SAS . . . . . . . . . . . . . . . . . . . . . . . . .
Introduction to the Textbook Materials . . . . . . . . . . . . . . .
Getting Started with SAS 9 or SAS Studio . . . . . . . . . . . .
51
52
53
53
Chapter 6 • Basics of SAS Programs, Data Manipulation, Analysis & Reporting . . . . 69
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
The Running Data Example . . . . . . . . . . . . . . . . . . . . . . . 70
The Pre-Analysis Data Cleaning & Preparation Steps . . . 72
Overview of the Three Big Tasks in Business Statistics . . 73
Basic Introduction to SAS Programming . . . . . . . . . . . . . 73
Major Task #1: Data Manipulation in SAS . . . . . . . . . . . . 77
Major Task #2: Data Analysis . . . . . . . . . . . . . . . . . . . . . . 83
Major Task #3: SAS Reporting through Output Formats . 84
The Visual Programmer Mode in SAS Studio . . . . . . . . . 86
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Chapter 7 • Descriptive Statistics: Understand your Data . . . . . . . . . . . . . . . . . . . . . . 89
Introductory Case: 2007 AngloGold
Ashanti Look Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
End Outcome of a Descriptive Statistics Analysis . . . . . . 91
Getting Descriptive Statistics in SAS . . . . . . . . . . . . . . . . 92
Statistics Measuring Centrality . . . . . . . . . . . . . . . . . . . . . 94
Basic Statistics Assessing Variable Spread . . . . . . . . . . . 97
Assessing Shape of a Variable’s Distribution . . . . . . . . . . 99
Conclusion on Descriptive Statistics . . . . . . . . . . . . . . . 104
Appendix A to Chapter 7: Basic Normality Statistics . . . 104
Chapter 8 • Basics of Associating Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Contents
What is Statistical Association? . . . . . . . . . . . . . . . . . . .
Association Does Not Mean Causation . . . . . . . . . . . . .
Overview of Associations for Different Variable Types . .
Relating Continuous or Ordinal Data:
Correlation & Covariance . . . . . . . . . . . . . . . . . . . . . .
Relating Categorical Variables . . . . . . . . . . . . . . . . . . . .
v
110
110
111
112
119
Chapter 9 • Using Basic Statistics to Check & Fix Data . . . . . . . . . . . . . . . . . . . . . . . 123
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Inappropriate Data Points . . . . . . . . . . . . . . . . . . . . . . .
Dealing Practically with Missing Data . . . . . . . . . . . . . .
Checking Centrality & Spread . . . . . . . . . . . . . . . . . . . .
Strange Variable Distributions . . . . . . . . . . . . . . . . . . . .
Dealing Practically with Multi-Item Scales . . . . . . . . . . .
123
124
126
127
128
128
Chapter 10 • Introduction to Graphing in SAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Major Graphing Procedures in SAS . . . . . . . . . . . . . . . . 136
The PROC SGPLOT Routine in SAS . . . . . . . . . . . . . . . 138
Multiple Plots Simultaneously through
PROC SGPANEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Business Dashboards through PROC GKPI . . . . . . . . . 143
Geographical Mapping Using PROC GMAP . . . . . . . . . 145
PROC SGSCATTER for Multiple Scatterplots . . . . . . . . 146
Conclusion on SAS Graphing . . . . . . . . . . . . . . . . . . . . 147
Chapter 11 • The Statistics Process: Fitting Models to Data . . . . . . . . . . . . . . . . . . . 149
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Look for Patterns in the Data (Fit) . . . . . . . . . . . . . . . . .
Step 3: Interpret the Pattern . . . . . . . . . . . . . . . . . . . . . .
Summary of the Statistics Process . . . . . . . . . . . . . . . .
149
151
164
168
Chapter 12 • Key Concepts: Size & Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Illustrative Case: Pharmaceuticals I –
AstraZeneca’s Crestor . . . . . . . . . . . . . . . . . . . . . . . . 172
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
vi Contents
Issue # 1: Size of a Statistic . . . . . . . . . . . . . . . . . . . . . .
Issue # 2: Accuracy of Statistics . . . . . . . . . . . . . . . . . .
The Aspects of Inaccuracy . . . . . . . . . . . . . . . . . . . . . . .
Putting Statistical Size and Accuracy Together . . . . . . .
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Appendix A to Chapter 12: More on
Accuracy (optional) . . . . . . . . . . . . . . . . . . . . . . . . . .
173
177
179
200
202
203
Chapter 13 • Introduction to Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Illustrative Case: West Point . . . . . . . . . . . . . . . . . . . . . 212
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
The Core Textbook Case Example for Chapter 13 . . . . 213
Introduction to Linear Regression . . . . . . . . . . . . . . . . . 215
A Pictorial Walk through Regression . . . . . . . . . . . . . . . 217
Implementing Multiple Regression in SAS . . . . . . . . . . . 226
Step 1: Collect, Capture and Clean Data . . . . . . . . . . . . 227
Step 2: Run an Initial Regression Analysis . . . . . . . . . . 231
Step 3: Assess Fit and Apply Remedies If Necessary . . 233
Step 4: Interpret the Regression Slopes . . . . . . . . . . . . 257
Step 5: Reporting a Multiple Regression Result . . . . . . 265
Other Statistical Forms . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
Chapter 14 • Categories Explaining a Continuous Variable:
Comparing Two Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Introduction to Comparison of Categories . . . . . . . . . . . 270
Features of the Continuous Variable to
Compare Across Categories . . . . . . . . . . . . . . . . . . . 270
Two Types of Categories to Compare . . . . . . . . . . . . . . 271
Numbers of Categories to Compare: Two
vs. More than Two . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Data Assumptions and Alternatives when
Comparing Categories . . . . . . . . . . . . . . . . . . . . . . . . 273
Comparing Two Means: T-Tests . . . . . . . . . . . . . . . . . . . 275
Contents
vii
Comparing Means for More than Two
Categories: ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Chapter 15 • Categorical Data Distributions & Associations . . . . . . . . . . . . . . . . . . . 285
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Repeat: One-Way Categorical Distributions . . . . . . . . .
Repeat: Linking Categorical Variables Together . . . . . .
Further Statistical Questions about Categorical Data . .
Assessing One-Way Frequencies . . . . . . . . . . . . . . . . .
Tests of Categorical Variable Association . . . . . . . . . . .
Conclusion on Categorical Data Analysis . . . . . . . . . . .
285
286
287
287
288
293
298
Chapter 16 • Reporting Business Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Reminder - Your Brief for the Textbook Case Study . . .
Your Tasks in the Analytics and Reporting Stages . . . . .
Background Analyses Versus Displayed
Reports for the CEO . . . . . . . . . . . . . . . . . . . . . . . . .
Conclusion on Business Statistics Reporting . . . . . . . . .
299
300
300
308
Chapter 17 • Business Analysis from Statistics: Introduction . . . . . . . . . . . . . . . . . . 309
Case Study: Oracle South Africa . . . . . . . . . . . . . . . . . . 310
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Overall Financial Extrapolation Process . . . . . . . . . . . . 312
Step 1: Statistics Gives Level of or
Change in Focal Variables . . . . . . . . . . . . . . . . . . . . . 313
Step 2: Financial Estimates of Revenue or
Cost of One Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Step 3: Combine Statistics with Per-Unit
Financial Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Step 4: Include Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Steps 5 and 6: Net Profitability Calculations . . . . . . . . . 319
Some Simple Examples of Business Extrapolation . . . . 321
Conclusion of Statistical Business Extrapolation . . . . . . 323
Chapter 18 • Miscellaneous Business Statistics Topics . . . . . . . . . . . . . . . . . . . . . . . 325
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
viii Contents
Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Data Warehousing . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Machine Learning & Algorithms . . . . . . . . . . . . . . . . . . .
Simulation in Business Situations . . . . . . . . . . . . . . . . .
Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
326
330
335
336
340
342
Chapter 19 • Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Books and Articles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Index
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
From Business Statistics Made Easy in SAS®, by Gregory John Lee. Copyright © 2015, SAS Institute
Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
From Business Statistics Made Easy in SAS®.
Full book available for purchase here.
69
6
Basics of SAS Programs, Data
Manipulation, Analysis &
Reporting
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
The Running Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Reminder of the Main Textbook Case Study . . . . . . . . . . . . . . . . . . . . . . 70
Reminder of Your Brief for the Case Example . . . . . . . . . . . . . . . . . . . . . 71
The Pre-Analysis Data Cleaning & Preparation Steps . . . . . . . . . . . . . 72
Overview of the Three Big Tasks in Business Statistics . . . . . . . . . . 73
Basic Introduction to SAS Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Running SAS Tasks through Point-and-Click Windows . . . . . . . . . . . . 73
Doing SAS Tasks through Programming Code (Syntax) . . . . . . . . . . . 74
Major Task #1: Data Manipulation in SAS . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Introduction to Data Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Creating New Datasets in SAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Creating Temporary Datasets in the Work Library . . . . . . . . . . . . . . . . . 79
Create New Variables or Manipulate Current Variables in SAS . . . . 80
Combining Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Major Task #2: Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Major Task #3: SAS Reporting through Output Formats . . . . . . . . . . 84
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Different ODS Outputs in SAS Studio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Different ODS Outputs in SAS 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
70 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
The Visual Programmer Mode in SAS Studio . . . . . . . . . . . . . . . . . . . . . . 86
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Introduction
This chapter begins the many sections of this book that teach the practical implementation of
statistical techniques through SAS. We start in this chapter with an overview of SAS
programs and programming, data manipulation, the basics of SAS statistical analysis, and
different types of documentary reports in SAS.
The Running Data Example
Reminder of the Main Textbook Case Study
To facilitate the discussion of the next few chapters, I will continue to work with the Accu-Phi
case study from Chapter 1, specifically with the following variables (see Figure 6.1 on page
71 below for a reminder of the initial data format):
n Sales: Measured as actual services sales in dollars in the first year of sales.
n License: A description of what license the customer has ( “Freeware” or “Premium”).
n Size: A description of the size of the customer by turnover, with the character values
“Small,” “Medium,” or “Big.”
n Trust: The trust the customer has in your product and company. You have measured trust
through four questions in an online survey, on a 0-100 point sliding scale.
n Customer satisfaction: Measured through four questions in an online customer survey,
but from 1-7.
n Enquiries: The average number of enquiries about the core software product logged with
the call center or online help by customers, per month, since starting use of the product.
The Running Data Example
71
Figure 6.1 First lines of initial dataset
The download available on the course website in the “Textbook Materials” folder, gives this
initial dataset (“Data01_Initial”).
Reminder of Your Brief for the Case Example
Let us say that your CEO wants you to analyze the data and answer the following questions
which are important to the company:
n How did the first-year sales go?
n Are our customers satisfied and to what extent do they trust us?
n How many enquiries do customers make?
n Do sales, satisfaction, trust or enquiries differ depending on whether the customer has a
premium or freeware contract, and depending on customer size?
n What is the distribution of licenses between the levels of size?
n Is sales seemingly substantially associated with any of the other variables?
72 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
The Pre-Analysis Data Cleaning & Preparation
Steps
Before actually analyzing data to answer questions such as the CEO’s queries above, you
will need to assess the data for integrity, clean any obvious errors and mistakes, and prepare
the data for final analysis. These checks may include:
1 Initial data assessment and cleaning. Notice the following in Figure 6.1 on page 71:
a Size of the fifth respondent is captured as “Bigg,” obviously a typographical error.
b The “Satisfaction03” score for Respondent 2 is captured as a “55”, but this is
supposed to be a 1-7 scale.
c These are data entry mistakes. While easy to spot in such a small with the eye, you’ll
not see this in a bigger table easily. Mis-entered data can seriously impact any
analysis.
2 Missing data: There is missing data; we need to assess and possibly deal with this as
discussed in Chapter 4.
3 Multi-item scales assessing trust and satisfaction: We need to assess and aggregate
these into single measures of the variables if possible.
We need to pre-assess and clean our data. We usually do these sorts of assessments
through basic descriptive statistics and variable associations. Therefore, the next four
chapters will sequentially discuss the following:
n Chapter 6 discusses how to create, change and manipulate data, as well as give an
overview of some other topics. To do things like create aggregated variables from multiitem scales, we’ll need these skills.
n Chapter 7 discusses the essential descriptive statistics we use for single variables.
n Chapter 8 discusses basic measures of variable association.
n Chapter 9 discusses using these analyses in an initial set of steps for the purposes of
data checking, cleaning, and preparation.
Basic Introduction to SAS Programming
73
Overview of the Three Big Tasks in Business
Statistics
Having been introduced to the SAS products in the previous chapter, we now turn our
attention to a basic introduction to the three major types of tasks you may wish to perform in
SAS:
1 Data manipulation tasks are those where you wish to change or add to your current data
set. For instance, you may wish to sort your current dataset by some variable, or add a
new column of data that is the sum of three other columns. Appendix A to this chapter
gives you some lessons on how to do such tasks, including manipulating data and
creating new datasets.
2 Data analysis involves generating representative numbers or pictures of the data that tell
you something you wish to know about the data. This could range from an analysis as
simple as the average of a variable to complex analysis of the relationships between
many variables.
3 Reporting obviously means formatting your findings into a useful report that will be
appropriate and engaging for the user.
The next sections introduce each of these major steps in greater or less detail, after an initial
overview of SAS programming in general.
Basic Introduction to SAS Programming
Running SAS Tasks through Point-and-Click
Windows
You can use various point-and-click windows to perform tasks in SAS. This method is
relatively simple to use, and favored by many people. If you were using the point-and-click
options you could open and use SAS products that work like this, such as SAS Enterprise
Guide or JMP. SAS Studio also has a version of this sort of approach built in, called
the ”Visual Programmer.”
Point-and-click has serious disadvantages, however, because there are often a great number
of check boxs and options, and SAS does not remember your settings. Therefore, every time
you re-start a certain section of SAS you have to re-enter many check box options. For this
74 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
reason, we will not use the point-and-click options very much in this book, as they are very
slow and inefficient.
Doing SAS Tasks through Programming
Code (Syntax)
Advantages of Programming Code
Instead of point-and click, SAS usually uses programming code in the SAS 9 Editor window
or the SAS Studio Code window to input keywords that tell SAS what you want. Note the
following about programming code in general:
1 Programming is efficient: The programming code input method is very efficient and
advantageous. It is far quicker than using point-and-click. You can save programming
code for later use more easily than you can in many point-and-click programs. Finally,
point-and-click takes a lot of time to go through if you are in a classroom teaching
situation, whereas opening and running a programming code file is quick.
2 Saving and re-using programming code: You can save your programming code files and
re-use them time and time again (see for instance the programming code files in the
“Textbook Materials” folder). Generally, once you have the programming files you like to
use, the only thing you have to do is change the names of the datasets and variables.
3 This book mostly uses programming code: Because of the advantages of programming
code, I will mostly use and teach this input method in this book. You will not have to learn
what programming to use; the textbook comes with pre-written programming code files
(see the “Textbook Materials” folder at http://support.sas.com/publishing/authors/lee.html).
Each time we run an analysis, you will be directed to open and run a pre-existing file as
described below.
First Lessons on SAS Programming
Programming can be a daunting task for many people. However, it is actually a very easy
language simply composed of a few keywords, as well as a basic structure to which you
need to stick.
For instance, take a look at Figure 6.2 on page 75, which shows an example of SAS code
in either the Editor window of SAS 9 or the Code window of SAS Studio. Here, you can see
various keywords and variable names that tell SAS what dataset to analyze, which variables
to analyze, and what statistical analysis to do on these variables.
Basic Introduction to SAS Programming
75
Figure 6.2 Example of programming code in a SAS Editor or Code window
Figure 6.2 on page 75 is a specific type of code that runs a statistical analysis. We can see
the following in this figure:
1 To run a SAS statistical procedure, you usually start with the keyword PROC followed by
a specific keyword that identifies which particular statistical analysis you want. For
instance, in Figure 6.2 on page 75 the keyword MEANS asks SAS to do basic descriptive
statistics on variables, as described in later chapters.
2 When running procedures, we next usually identify the dataset to be analyzed by its
library and then its dataset name, i.e. the general structure is “<Name of the
library>.<Name of the dataset>.” In Figure 6.2 on page 75, the dataset to be analyzed is
the “Profits” dataset within the “MBA” library, as identified by the “Data=MBA.Profits” part
of the code.
3 There are often extra keywords to identify further statistical options.
4 Usually, the middle section of SAS procedure code contains a description of the variables
to be analyzed. In the simple example in Figure 6.2 on page 75, we simply list the
variables to be analyzed after the keyword VAR. In more complex procedures that are
mostly beyond the scope of this book, we sometimes also have to tell SAS how the
variables are related.
There are also certain general SAS programming rules that can be seen in the example in
Figure 6.2 on page 75:
1 Capitalization of words in SAS code:
a SAS programs usually do not care about capitalization of words. For instance, in
Figure 6.2 on page 75, keywords such as “Proc Means” could easily be spelled
“PROC means” or any combination or lower and uppercase, as can the dataset
names.
76 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
b Almost the only time that SAS cares about capitalization is if you are referring to
specific text data within a dataset. For instance if “Gregory Lee” is a field in a dataset,
then if you need to refer to this data in code, you must get the exact capitalization
correct.
2 Spacing, lines and tabs in SAS code:
a It does matter that you keep at least one space between different keywords of SAS
programming (e.g. you can’t put “PROCMEANS” above).
b However, other than that, SAS does not mind where in the code or editor window you
place code so long as the basic statements are in the right order. You can place
different statements on different lines, run them together without line breaks, or use
multiple spaces or tabs between pieces of code, etc.
3 Semicolons as the key for endings of sections: Sections of SAS programs end with a
semicolon (“;”). If you try to run a SAS program and find that it does not work, it is often
because you have failed to add the semicolon at the end of a section.
4 The Run command as the key for the end of a program: SAS programs usually end with a
“Run;” command.
5 Running a SAS program: To actually make the program run, you click the little running
person icon in the SAS 9 or SAS Studio toolbar, as seen in Figure 6.3 on page 76
below.
Figure 6.3
Running a SAS Program
One cardinal rule is to always check the SAS log after running code to see if the program has
worked and to determine if there are errors (e.g. misspelling the dataset name). In such
cases, SAS will warn you in the log with red error sections. This is particularly easy in SAS
Studio, which lists any errors at the top of the log section.
Finally, note that “PROC”-type code to invoke SAS statistical analyses are not the only form
of programming. Notably, the very important DATA keyword is used to create and manipulate
datasets, as described below in “Major Task #1: Data Manipulation in SAS” on page 77.
Major Task #1: Data Manipulation in SAS
77
Opening Existing SAS Code Files
As I have discussed above, this book does not expect the reader to become a SAS
programmer immediately. All the analyses taught in the book are given to you as pre-written
programming code files that you simply have to open and run to get the results. As you work
with these files, you will quickly see how the underlying programs work, and soon be able to
apply them to your own datasets and variables with little change.
Even if you were to write your own programs from scratch, you would usually save the code
files and then re-open and run them later when you wish to recreate the analysis.
To open existing programming code files like those in the “Textbook Materials” folder, do the
following:
n In SAS 9, go to File > Open Program and navigate to where the file is stored on your hard
drive.
n In SAS Studio, go to the Server Files and Folders section, and open the code file by
double clicking on it (for instance, see the many code files in the “Textbook Materials SAS
Studio” folder).
As mentioned in the chapter introduction, there are three big tasks in SAS, namely, data
manipulation, data analysis, and report generation. The following sections discuss these
steps further.
Major Task #1: Data Manipulation in SAS
Introduction to Data Manipulation
Data manipulation – in other words, changing data or creating new data – is one of the most
important tasks in practical business statistics. After capturing data, it is rarely the case that
the initial sheet or database query is completely perfect for analysis. Often, changes need to
be made, for various reasons such as:
n Imperfections in the original data that need to be fixed
n The need to add new data
n The need to combine multiple datasets
While you can manipulate data in more basic spreadsheet programs like Microsoft Excel, you
can also do so in SAS, and far more simply, flexibly and reliably. This book cannot cover
much of the SAS data manipulation universe, which is enormous and world-leading. The next
few sections can cover only a few salient topics. For a broader introduction to these topics,
the reader should consult texts such as Delwitch & Slaughter (2012).
78 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
Creating New Datasets in SAS
As a first topic, we often create new datasets in SAS programming code. This section
discusses the basics of doing this.
Of course, one way to create new datasets in SAS is to import them from elsewhere, such as
importing Microsoft Excel files. Chapter 5 describes how to do this. This chapter is more
interested in dealing with data once it is in SAS. To create or manipulate data in SAS we use
a “DATA” statement. Figure 6.4 on page 78 shows the outline of a data step for creating a
new dataset SAS.
Figure 6.4 Creating a new dataset in SAS
As seen in Figure 6.4 on page 78, if you wish to create a new dataset you do the following:
1 Start with the keyword DATA, which tells SAS that you wish to create a new dataset.
2 Name the new dataset. Note the following:
a Specify the name of a library and a dataset name, separated by a period (e.g.
“Textbook.Transformed” in Figure 6.4 on page 78). The new SAS dataset will appear
in the physical folder you have associated with this library. Of course, you have to
have associated this library name with the folder beforehand, as described in Chapter
5.
b There are basic rules for naming SAS datasets. This can be any name – in the code
above we used the name “Transformed” – so long as it follows these rules:
i
A SAS name can contain from one to 32 characters.
ii The first character must be a letter or an underscore (_).
Major Task #1: Data Manipulation in SAS
79
iii Subsequent characters must be letters, numbers, or underscores.
iv Blanks cannot appear in SAS names. If you want to separate parts of the dataset
name, use underscores, e.g. “Dataset_03.”
c If you leave out the library name and give only a dataset name (e.g. the “Data
Transformed;” line in Figure 6.5 on page 81 below) then the new dataset will be
created in the special “Work” library. In other words, calling the dataset “MyData” is
the same as calling it “Work.MyData”. The “Work” library is automatically created as
part of the SAS installation, and I explain it in more detail in the next section. Using
this option is often desirable.
d If you choose the same name and library as an existing dataset, then you will
overwrite (i.e. replace) the original version of the dataset.
3 Populate the new dataset with initial data. There are two main choices here:
a Populate the new dataset with data from another dataset. We frequently base the new
dataset on the data from an existing dataset. Think of this as a copy and paste, i.e.
you are copying data from an existing dataset into your new dataset. As seen in
Figure 6.4 on page 78, we can do this by putting the line “SET <name of existing
dataset>;” into a DATA step. In Figure 6.4 on page 78, we are using the SET
statement to copy all the contents of the “Data02_Cleaned” dataset into the new
“Transformed” dataset. In this code, both datasets are located in the “Textbook” library.
b Enter raw data directly into SAS. You can also enter data literally in SAS in a DATA
step. This book will not cover this direct data input option. I personally advocate
importing initial raw data from a spreadsheet program such as Microsoft Excel.
4 If desired, manipulate the data. In the DATA step, we can manipulate the data in a great
number of ways. “Create New Variables or Manipulate Current Variables in SAS” on page
80 below describes more on such steps.
5 Other programming notes: As seen in Figure 6.4 on page 78, do not forget to place
semicolons between major statements and add a “Run;” statement at the end before
running.
Creating Temporary Datasets in the Work
Library
The previous section noted that if you do not give a library name as part of a dataset name
then you are automatically linking the dataset with the special “Work” folder (so specifying
“Profits” is the same as saying “Work.Profits”).
The Work library has a special property: all datasets contained within it are deleted when you
close SAS. This is desirable in many cases for two major reasons:
n Datasets created in the Work folder do not clutter your hard drive or server, as they are
deleted once you close SAS. However, because you can save the code used to create
80 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
them, these datasets can be re-created every time you re-run the code. Programming
code takes up far less space on a computer than data.
n If you keep your original data and copy it to a Work library dataset, then changes you
make to the new dataset do not affect the original data, which means you are never at
risk of harming your original dataset.
This method of creating datasets out of programming code only for the duration of your
session – and analyzing the temporary data as you need - is highly efficient and often used
by SAS analysts.
On the other hand, giving a SAS library name other than Work causes the dataset to be
stored permanently in the folder associated with that library. This is, of course, desirable in
cases where you do wish to maintain a permanent copy.
Create New Variables or Manipulate Current
Variables in SAS
There are many situations in business statistics where you wish to create a new variable that
is, in effect, a transformation of an existing variable’s data. Here are some initial examples:
n Creating an index such as a financial ratio (such as creating a price/earnings ratio from
two columns containing price and earnings data, respectively).
n Creating mathematical transformations of variables, such as a new variable that is the
square root or log of another variable.
n Using the birthdates of people to create a new column that, on a consistently updating
basis, calculates their ages.
In addition, you can change and manipulate existing variables in SAS.
In our main textbook example, so far we have two major types of such tasks:
1 Creating new variables that reverse the data of reverse-worded survey questions.
Specifically, Satisfaction04 is a reverse-worded survey item (see Chapter 4 and Chapter
9 for more on this), which required us to create a new variable that reverses its data.
2 Creating two new factor variables, which are the aggregation of multi-item scales. Trust
and satisfaction ultimately needed to be created as factors which are an average of the
individual multi-item scores. (Of course, we can’t do this step without having assessed
internal reliability. Again, see Chapters 4 and 9).
One of the many things SAS is brilliant at is data manipulation. You can manipulate data by
using the SAS point-and-click interfaces like SAS Enterprise Guide, but it is quicker and
easier to use code in programs like SAS 9 or SAS Studio. The DATA step in SAS not only
creates new datasets or edits existing ones, but manipulates data columns or rows.
Figure 6.5 on page 81 shows a sample SAS data step in which the new dataset is created
based on an existing dataset (specifically, we create a dataset called “Transformed” in the
Work library because no library is specified, and we copy and paste everything from the
Textbook.Data02_Cleaned dataset using the SET statement).
Major Task #1: Data Manipulation in SAS
81
Figure 6.5 Example of creating new variables in the SAS DATA step
Then, each subsequent line creates a new variable:
n We create a new variable called “Rev_Satisfaction04” that takes the data from the
existing variable Satisfaction04 and reverses it using the principles discussed in Chapter
4.
n We create new variables called “Trust” and “Satisfaction” that are averages of some of
the individual currently existing multi-item scale columns. Note the way the average
works. Also, note here that I have only averaged the values for Satisfaction01Satisfaction03; see Chapter 9 a little later for why.
n We create two new mathematical transformations of the Sales variable, one the natural
log and one for the square (each Sales number to the power of two).
n We create several conditional variables using the IF-THEN concept, where the new
variable only takes on a certain value if a given condition is true. In the first of these, we
create a new variable called “Premium” that will have the value 1 whenever the currently
existing License variable contains the value “Premium” in a row, and takes the value 0 for
all rows where License is not “Premium.”
Take note of the following programming notes about this sort of programming:
n Take another look at the IF-THEN statements in Figure 6.5 on page 81. Note here that
this is the only situation in which capitalization counts in SAS. Take the example of the if
License = “Premium” section of the code. Here, we are asking SAS to go look in the
dataset for all rows where this exact condition is true including the exact capitalization of
82 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
“Premium,” and then apply the result only in those rows. If there are also entries in the
License column spelled “premium” then the above condition will not identify these rows.
So, be careful of capitalization in these situations only.
n As always, note that all statements are separated by semicolons and the entire set ends
with a “Run;” statement.
You could do so much more. For instance, you could create a new variable that is the sum of
other variables (replace MEAN in the above code with SUM). You can identify rows to delete
based on certain rules. SAS has an almost endless set of possible variable manipulations –
see the SAS helpfiles (notably SAS/STAT 13.2 User’s Guide) for more.
Once you have told SAS what you want to do, submit the code using the Run button as seen
above. Once you have done so, always check the log for errors and always open the new
dataset to check that it is right. (And then close it: an open dataset in SAS cannot be
replaced).
You can see the code from this section in the textbook resources files, under “Code06
Manipulating data example.”
Combining Datasets
Often in the business world, we need to combine two or more datasets together. You can
combine datasets side-by-side, one on top of the other, merge them based on a match in a
certain variable, and so on.
Let us look at one of the most common examples: match merging. Imagine you are an
organization with the following two datasets:
1 A database of customer account data, where each customer is identified by a unique
customer number.
2 A different database of customer satisfaction survey data. Again, each customer’s survey
responses are identified by the customer number. Typically, only a limited subset of
customers would have filled in the survey.
Now, let us say that you wish to combine these two datasets so that you can link the data.
Each row needs to be matched up by customer number. You can do this in SAS using the
MERGE statement. See the following example:
Example Code 6.1 Example of merge matching data in SAS
Data Customers.Merged;
Merge Customers.Accounts Customers.Survey2016;
By Customer_ID;
Run;
There are many nuances and complexities to combining datasets – for instance, to match
merge by a common variable as I show above, both datasets must be sorted by the common
matching variable (i.e. you would have to sort both of the above datasets by Customer_ID).
For more on combining datasets, reference the SAS helpfiles or books such as Delwitch &
Slaughter (2012).
Major Task #2: Data Analysis
83
This basic understanding of SAS data manipulation will help us in various parts of the rest of
the book, since data manipulation is frequently required in statistical analysis.
Major Task #2: Data Analysis
“Basic Introduction to SAS Programming” on page 73 above discussed the basics of
programming a PROC step in SAS, which is the foundation of SAS statistical analyses. The
rest of this book gives various examples of core SAS statistical analyses in the context of
business.
Just a few more general points apply to thinking about SAS data analyses:
n Knowing which analysis is the appropriate one for your situation is obviously critical. This
book discusses many introductory analyses to help you begin this journey. However,
especially when you are entering into more complex modelling, you should first carefully
investigate the general ideas behind what the correct analysis is. Thereafter, you can
read up on how SAS implements that specific analysis through code.
n You can easily find prior examples of SAS code for your desired analysis in the SAS
helpfiles, online through SAS User Group articles or the like, or in books like this one.
Then, you can copy the code developed in those sources and simply change the names
of the dataset and variables for your particular analysis. In a similar vein, SAS Studio has
pre-written code in the Tasks section.
n Often, in the same SAS program, we will first manipulate data and then – immediately
below the DATA step – place the PROC step that references and analyses the dataset
created above. We can then run the set together, change the data or analysis steps again
if required, and so on. Example Code 6.2 on page 83 is an example.
Example Code 6.2 Example of running DATA and PROC steps together
Data Transformed;
Set MBA.Profits;
LogRevenue = Log(Revenue);
Run;
Proc Means data=Transformed;
Var LogRevenue Cost Profit;
Run;
84 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
Major Task #3: SAS Reporting through Output
Formats
Introduction
In the early days of its development, SAS reproduced statistical reports in very simple, oldfashioned listing type format, which was designed for line printers. How times have changed!
Now, modern SAS technologies work with their proprietary Output Delivery System (ODS)
system, which allows you to tell SAS to output reports like tables and graphs in multiple
different formats. For instance, SAS can put output into:
1 Attractive HTML files. This is set up as the default in newer SAS versions, and we have
already seen in Chapter 5 how to change the automatic settings of how this output will
look. You can save the automatic output as an HTML file in either SAS 9 or SAS Studio.
2 Rich Text Files, which will open as Microsoft Word or similar files.
3 PDF files, which will open in Adobe Acrobat or other PDF readers.
4 Datasets created from output. These, in turn, can be exported to spreadsheet or
database programs such as Microsoft Excel or Access.
5 Several more.
These output delivery options are incredibly flexible, easy, and attractive. How to get these
different formatted outputs differs between SAS Studio and SAS 9.
Different ODS Outputs in SAS Studio
In SAS Studio, you can download results in HTML (web browser), PDF (Acrobat or similar) or
RTF (Microsoft Word or similar) formats at the click of a button, as seen in Figure 6.6 on
page 85.
Major Task #3: SAS Reporting through Output Formats
85
Figure 6.6 Downloading ODS results in different formats in SAS Studio
This is a major advantage for SAS Studio. Note also that the PDF output contains a menu
allowing you to navigate between different sections of a longer report.
Different ODS Outputs in SAS 9
In SAS 9, you need to program the ODS outputs. Luckily, this is mostly very easy. For
instance, say you wish to create a rich text output of various tables and graphs that you have
created with SAS. Then, merely enter code like that in Example Code 6.3 on page 85
which will open your default program for processing rich text (like MS Word) and create a
new file containing your SAS output (as you can see, you stipulate a filename and location
for it to be saved to):
Example Code 6.3 Example: Output in a rich text format that will open in MS Word or
similar
ODS RTF file=‘c://Output.rtf’;
<Insert SAS code here to create output like statistical tables & graphs>
86 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
ODS RTF close;
The ODS formats need to be studied by the dedicated user, but they all mostly work as
simply as the above example. The following are further examples:
Example Code 6.4 Example of changing HTML output style
ODS HTML style =HTMLBlue;
<Insert SAS code here to create output like statistical graphs>
ODS HTML style = Journal2;
The above example changes the HTML output style – which will usually open in SAS when
you run anything – to a specific style called HTMLBlue. I set your style to Journal2 above
because it produces clean black-and-white tables, however, in Chapter 10 later we will do
graphing which is often best done in color. In the above code, you change to HTMLBlue
which allows color output, then change back to Journal2.
Example Code 6.5 Example: Writing SAS output to a PDF that will open in Acrobat or
similar
ODS PDF file=‘c://Output.pdf’;
<Insert SAS code here to create output like statistical tables & graphs>
ODS PDF close;
Once again, this will save and open a PDF file of your output.
SAS ODS is an incredibly powerful system for crafting your SAS output. Any time you want
to say “hey, I’m creating such-and-such analysis in SAS and I would want it to look like that
and come out in such-and-such a format,” then ODS can usually do it for you.
The Visual Programmer Mode in SAS Studio
So far, I have demonstrated programming in SAS. As much as I have argued for using
programming as the most efficient way of achieving analysis and teaching statistics in many
cases, SAS Studio has created a clever way of generating your programs that allows you the
comfort of a point-and-click type approach that works with SAS programming. This is known
as the Visual Programmer mode.
In the SAS Studio Visual Programmer mode, you can define your dataset, task and variables
for SAS Studio using easy-to-understand drag-and-drop methods. As an example of the use
of this mode, see Figure 6.7 on page 87 below.
The Visual Programmer Mode in SAS Studio
87
Figure 6.7 Example of using the Visual Programmer mode in SAS Studio
In this example, I have generated a bar chart simply by doing the following easy steps:
n Initiate SAS Studio Visual Programmer mode by switching from SAS Programmer mode
at the top right. This opens a process flow window.
n Drag a pre-defined task from the Task window (in this case the Graphs > Bar Chart task)
to the process flow.
n Double click the resulting Bar Chart process piece gives the settings. Here I define the
dataset and variables using easy drop-down fields.
88 Chapter 6 / Basics of SAS Programs, Data Manipulation, Analysis & Reporting
n Click the “running person” icon to get the results. Note: To see the graph in color, switch
to the HTMLBlue results style in Preferences.
There are many other Tasks and what are called “Snippets” (pieces of code that can be used
in various places). You should browse through these – perhaps after reading the book and
acquainting yourself with the field of basic statistics – to see what Visual Programmer has to
offer. It is an intuitive and pleasing way to generate simple tasks, but has other
disadvantages of point-and-click modes, such as lack of the full functionality SAS
programming can offer.
Conclusion
This chapter has introduced data manipulation in SAS, the absolute basics of analysis, and it
has shown us how to create results in various formats. The rest of this book discusses a
variety of analyses and principles that – used correctly - will launch you on a productive and
profitable business statistics path.
From Business Statistics Made Easy in SAS®, by Gregory John Lee. Copyright © 2015, SAS Institute
Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
From Business Statistics Made Easy in SAS®.
Full book available for purchase here.
351
Index
A
a-priori power 196
abseentism 318
accuracy
about 203
assessing 203
of regression slopes 260
of statistics 177
statistical power 192
statistics size and 200
Acemoglu, D. 163
advanced statistical analysis packages
43
agreement tests 297
airline industry, big data in 9
analysis of level 313
analytical skills 6
analytics and reporting stages, tasks in
300
AngloGold Ashanti Look Ahead
See descriptive statistics 91
ANN (artificial neural networks) 336
annotations, placing in graphs 137
ANOVA 284
ANOVA F-Statistic 256
answer formats 37, 39
Apache Hadoop® 328
artificial neural networks (ANN) 336
associating variables
about 109
causation and 110
continuous data 112
correlation 112
correlation coefficients 113
covariance 112
ordinal data 112
relating categorical variables 119
statistical association 110
variable categories 111
AstraZeneca case 172
autocorrelation 251
average 15, 94, 95
B
background analyses, versus displayed
reports 300
bar graphs, in SGPLOT procedure 141
Barr, J. 51
Bauer, H.H. 316
Bayesian statistics
about 340, 341
classical statistics 341
final answer (posterior) 342
pre-existing guesses (proprs) 342
sample data 342
BCA (Bias Corrected and Accelerated)
210
Becker, G. 163
Berengueres, J. 10
Bias Corrected and Accelerated (BCA)
210
big data
about 326, 330
characteristics of 327
in airline industry 9
solutions for 328
bimodal distribution 102
binomial data 291
binomial proportions, assessing
categories through 291
black-and-white graphs, versus color
graphs 137
Blattberg, R.C. 10
352 Index
Boom, A. 163
bootstrapped confidence intervals 245
bootstrapping 190, 208, 210, 249, 274,
280
Boudreau, J.W. 317
box-and-whisper plots, in SGPLOT
procedure 141
breakeven 319
Burmeister, S. 90
business analysis
about 311
combining statistics with per-unit
financial values 318
examples of business extrapolation
321
financial estimates of revenue or cost of
one unit 314
financial extrapolation process 312
focal variables 313
net profitability 319
scope 319
business statistics
interpretation of 7
reporting 308
tasks in 73
C
CALIS procedure 234, 253
capitalization, in SAS code 75
Cascio, W.F. 317
categorical data
about 30, 285, 298
linear regression and 227
linking categorical variables 287
one-way categorical distributions 286
statistical questions about 287
categorical predictors 227
categorical variables
about 111, 119
associating 293
centrality for 95
crosstabs 119
FREQ procedure for associating 294
linking 119, 120, 287
relating 119
spread for 99
testing general association between
295
testing possibilities in association 297
categories
assessing through binomial proportions
291
comparing 270, 272
comparing continuous variables across
270
comparing means for more than two
284
comparing means with related
categories 281
CATMOD procedure 296, 298
causation
associating variables and 110
between independent variables 234
central tendency 91
centrality
about 94
as a variable characteristic 31
checking 127
for categorical variables 95
for continuous variables 94
for ordinal variables 95
change analysis 314
change situations, static situations and
313
character (text) data, versus numerical
data 25
chart modules 103
Cherrier, J. 10
Chi-Square 295
classical statistics 341
CLV (customer lifetime value) 316
Cochran-Mantel-Haenszel Statistics 296
code files
capitalization in 75
opening existing 77
Code window (SAS Studio) 62
coefficients, implications of 165
color graphs, versus black-and-white
graphs 137
comparison
of categories 270
Index
of dependent variables 271
of independent variables 271
of means 271, 281, 284
of means for more than two categories
284
of means with related samples or
categories 281
of more than two categories 272
of related categories 272
of two categories 272
of two means 275
computers, versus math 16
computing power and speed, growth in
327
concepts, measuring relationships
between 15
condition indices 236
conditional variables 81
confidence intervals 184, 259
confirmatory factor analysis 133
constellations 157
constructs
about 35
choosing 35
control 37
defined 163
focal 36
importance of 35
predictor 36
context 16, 223
Contingency Coefficient 295
contingency tables 119
continuous (ratio or interval) data 29, 39,
111, 112
continuous variable spread 97
continuous variables
centrality for 94
comparing aross categories 270
interquartile range for 98
linking to categorical variables 120
control constructs, data and 37
convergent validity 133
Cook's D 247
CORR procedure 115, 130
correlation analysis 117
correlation coefficients 113
correlation tables 116
353
correlations
as back-up diagnostics 235
between independent variables 237
calculating 115
compared with covariance 117
sizes of 116
types of 115
cost of one unit, financial estimates of
314
covariance
about 117
compared with correlation 117
Cramer's V statistic 295
Crestor case 172
Cronbach alpha 130, 132
crosstabs 119
customer lifetime value (CLV) 316
customer satisfaction, as a variable 3
D
data
about 21
assumptions about 273, 278
binomial 291
capturing 33, 34, 43, 227
charactertistics of variables 27
checking for mistakes in 43
cleaning 72, 227
collecting 227
continuous (ratio or interval) 29, 39,
111, 112
control constructs and 37
defined 163
dichotomous 291
entering 231
existing 38
extracting statistics from 15
fitting 155
fitting complex mathematical equations
to 161
focal constructs and 36
forming data tables 24
gathering 13, 33, 34, 37
importance of in statistics 13
354 Index
importing 231
initial assumptions about 233
interval 29, 39, 111, 112
issues with 23, 227
manipulating 73, 77
modeling preconceived ideas about
153
multi-row 49
objects 23
observations 23
ordinal 30, 111, 112
populations 23
post-capturing issues of 43
ratio 29, 39, 111, 112
real-time 38
samples 23
See also big data 326
See also categorical data 285
See also data patterns 151
See also errors 123
See also missing data 44
See descriptive statistics 91
shape issues with 237
testing for normal distributions 159
testing for straight line shapes 158
data analysis
about 73, 83
in data warehousing 333
software for 48
data architecture skills 7
DATA keyword 76
data management
combining datasets 82
creating datasets 78
creating temporary datasets in Work
library 79
creating variables 80
manipulating current variables 80
data mining
about 336
compared with theory-based analysis
154
patterns and 154
theory versus 153
data patterns
about 151
comparing theory-based analysis and
data mining 154
defined 163
fitting mathematical models 155
forcing 162
multivariate patterns 152
plots versus statistical fit measures of
155
See also interpreting patterns 164
single variable patterns 151
theory versus data mining 153
troubleshooting 163
data points, inappropriate 124
data tables, forming 24
data warehousing
about 330, 335
issues and alternatives in 333
steps in traditional 331
database software 48
dataset analysis 252
datasets
combining 82
complex types of 49
complications in 48
creating 78, 79
creating in Work library 79
dispersed 330
incongruent 330
integrating 26
longitudinal 49
multi-level 49
primary 26
secondary 26
vulnerable 330
dates, capturing 48
Davenport, T.H. 327
decision-making, in statistics process 17
deep learning 336
Delwiche, L.D. 77
dependent variables
characteristics of 255
comparison of 271
missing 45
transforming 273, 280
descriptive statistics
about 91, 104
assessing distribution 103
Index
centrality 94
end outcome of analysis of 91
getting in SAS 92
shape 99
spread 97
dichotomous data 291
discriminant validity 133
dispersed datasets 330
displayed reports, versus background
analyses 300
distributed computing, improved storage
and processing through 328
distribution, assessing 103
Dull, T. 334
dummy variables 227, 264
Durbin-Watson statistic 251
Dyché, J. 327
E
Editor window (SAS 9) 54
Efimov, D. 10
Ellison, L. 310
employee stocks 316
employee-related variables, value of 316
employees
movement of 316
performance of 317
reductions in expensive behaviors 317
turnover of 317
endogeneity 234
enquiries
as a variable 3
of customers 303
Enterprise Resource Programs (ERPs)
48
equivalence, testing for 293
ERPs (Enterprise Resource Programs)
48
errors
about 123
checking centrality and spread 127
inappropriate data points 124
missing data 126
multi-item scales 128
355
residuals and 222
strange variable distributions 128
ETL (Extract-Transform-Load) 334
examples
about 1
brief 5, 299
company 2
correlation analysis 117
current research needs 2
of business extrapolation 321
of interpreting when patterns are not
found 167
of SGPLOT procedure graphs 138
of simulation 337
existing data 38
exploratory factor analysis 133
Explorer window (SAS 9) 54
Extract-Transform-Load (ETL) 334
extracting
statistics from data 15
to data marts 333
F
face-to-face interviews 38
Facebook 326
feedback loops 234
FIML (full information maximum
likelihood) 127, 253
final statistic parameters and coefficients,
intermediate fit statistics versus 161
financial extrapolation process 312
financial profitability 311
financial variables, values of 318
fit
about 151, 155, 222
assessing 233
steps in 233
troubleshooting 224, 257
fitting models
See statistics process 149
focal constructs, data and 36
focal variables 313
folders and files, linking with 62, 63
follow-up recommendations 308
356 Index
formats
answer 37, 39
question 37, 38
formatting, in SGPLOT procedure 142
FREQ procedure 92, 95, 120, 287, 294,
296, 298
full information maximum likelihood
(FIML) 127, 253
G
Garbage in, Garbage out (GIGO) 14
geographical mapping, using GMAP
procedure 145
GIGO (Garbage in, Garbage out) 14
GKPI procedure 136, 143
global fit, troubleshooting 257
GMAP procedure 136, 145
good fit 221
Goodnight, Jim 51
GPLOT procedure 136
graphing
about 135, 147
black-and-white versus color 137
flexibility in 136
GKPI procedure 136, 143
GMAP procedure 136, 145
modules for 136
placing annotations in graphs 137
procedures for 136
SGPANEL procedure 143
SGPLOT procedure 138
SGSCATTER procedure 146
groups, comparing 271
H
Hammerschmidt, M. 316
Hats (leverage scores) 247
Heath, D. 142
Helwig, J. 51
heteroscedasticity
about 237
effects of 239
in residual plots 243
remedies for 244
Hoeffding Dependence Cpefficient 115
Hong, S.J. 10
HTML files 84
hypothesis testing 184
I
IF-THEN concept 81
in-memory processing, improved
processing through 329
inaccuracy, faces of 179
inappropriate data points 124
incongruent datasets 330
independent variable slopes 259
independent variables
causal relationships between 234
comparison of 271
correlations between 237
influence, defined 247
influential outliers 245
initial phase, in data warehousing 332
inputs, costs of 315
integration phase, in data warehousing
332
intercept 258
intermediate fit statistics, versus final
statistical parameters and coefficients
161
interpretations 308
interpreting patterns
about 164
implications of model and coefficients
165
steps in 164
interquartile range, for continuous and
ordinal variables 98
interval data 29, 39, 111, 112
issues 23
Index
357
interpreting regression slopes 257
ordinal predictors 227
reporting multiple regression results
Jackofsky, E.F. 153
265
Janmaat, E. 22
running regresson analysis 231
JIPSA (Joint Initiative for Priority Skills
simplest case of 217
Acquisition) 90
single Likert-type scale items 231
JMP® 53, 73
variables in 216
Joint Initiative for Priority Skills Acquisition
variables in multiple regression 216
(JIPSA) 90
linearity 112
lines, in SAS code 76
loading, in data warehousing 333
Log window
K
SAS 9 54
SAS Studio 62
Kendall's Tau 115
logic, importance of 235
knowledge 7
lognormal distribution 100
Kuhfeld, W. 142
longitudinal datasets 49
kurtosis 105, 106, 160
J
L
Lawrence, R.D. 10
Lehrer, J. 162
leverage scores (Hats) 247
libraries
creating in SAS 9 55
creating in SAS Studio 63
licenses
as variable 3
distribution of 304
variables analyzed by 303
Likelihood Ratio Chi-Square 295
Likert-type scale 39, 231
line plots, in SGPLOT procedure 138
linear regression
about 213, 215
aim of 216
applying remedies 233
assessing fit 233
categorical predictors 227
core textbook example 213
defined 215
implementing multiple regression 226
initial data issues 227
M
machine learning 335, 336
magnitude
See size, of statistics 173
Malthouse, E.C. 10
Mantel-Haenszel Chi-Square test 296
Mardia score 176
marketing outcomes 316
Matange, S. 142
math, versus computers 16
mathematical models, fitting 155
mathematical simulations 338
means
about 94
comparing 271
comparing for more than two categories
284
comparing to population benchmarks
283
comparing two 275
comparing with related samples or
categories 281
MEANS procedure 92, 93, 95, 121
measurement error 223
measurement, growth in 327
358 Index
medians 94, 95
MI procedure 253
MIANALYZE procedure 253
Miner, Bob 310
Mining Qualifications Authority (MQA) 90
missing data
as a diagnostic issue 252
assessing in observations 126
assessing in variables 127
dealing with 44
diagnosis of in regression 252
in observations 253
in variables 253
linear regression and 227
remedies for 252
steps 126
mode 95
model fitting
See statistics process 149
MODEL statement 232
models
structures of 234
theoretical and practical implications of
166
modules, for graphing 136
MQA (Mining Qualifications Authority) 90
multi-item assessment 41
multi-item scales
about 45, 128
aggregating multiple items into
summary variables 132
assessing internal reliability of each
129
dealing with 47
linear regression and 227
reversed items 128
tasks in preparing 47
multi-item variables 127
multi-level datasets 49
multi-row datasets 49
multicollinearity 235, 236
multiple imputations 127, 253
multiple regression
implementing 226
reporting results of 265
multivariate patterns 152
N
needs, for statistics process 12
negative linearity 112
net profitability
about 319
basic profit 319
breakeven 319
return on investment (ROI) 320
New Import Data wizard 65
Nirmalanof, G. 161
non-linearity
about 237
effects of 239
in residual plots 242
remedies for 244
noninferiority tests 298
nonparametric statistics 274
nonparametric T-test 280
normal distribution 100, 159
normality 273
normality statistics 104
Normalized Multivariate Kurtosis score
176
numerical data, versus text (character)
data 25
O
Oates, E. 310
objects 23
observations
about 23
loss of 252
missing data in 126, 253
odds ratio test, homogeneity of 297
ODS Graphics engine 136
ODS outputs
in SAS 9 85
in SAS Studio 84
one-way categorical distributions 286
one-way frequencies
about 288
Index
assessing categories through binomial
proportions 291
assessing distribution of 289
online slider scale 40
operational time, value of 315
operational variables, costs and revenues
of 315
Oracle South Africa case study 310
Oracle VirtualBox 60
ordinal data 30, 111, 112
ordinal predictors
single Likert-type scale items as 231
special treatment of 227
ordinal variables
about 228
centrality for 95
interquartile range for 98
testing trend in 296
outlier weighting 249
outlier, defined 247
output formats, reporting through 84
overtime 318
P
p-value 186, 205, 256, 259
paired samples 282
parabola 225
paradigms, patterns and 153
parametric 273, 274
parametric approach
about 204
p-value 205
standard error 205
test statistic 205
patterns
implications of 154
over time 15
reasons for 154
See also data patterns 151
PDF files 84, 137
Pearson correlations 115
people variables 316
per-unit financial values, combining
statistics with 318
359
Phi coefficient 295
physical simulations 337
Pischke, J-S. 163
plots, versus statistical fit measures of
patterns 155
point-and-click 73
populations 23, 283
positive linearity 112
post-capturing 43
POWER procedure 196
pre-analysis data cleaning and
preparation 72
pre-existing guesses (proprs) 342
predictor constructs 36
primary datasets 26
process simulations 337
products, value of 315
programming code
about 73
advantages of 74
doing tasks through 74
lessons on 74
running 76
protocols, in data analysis software 49
psychometric measures 41
Q
question formats 37, 38
R
R-Sq statistics
about 254
interpreting size of 255
random patterms 154
ratio data 29, 39, 111, 112
raw data records 25
raw datasets 332
real-time data 38
REG procedure 232, 252
regression 119
See also linear regression 231
360 Index
regression analysis, running 231
regression parameters
about 257
independent variable slopes 259
intercept 258
regression slopes
about 119, 259, 264
interpreting 257
process for interpreting 259
significance and accuracy of 260
size of significance and accuracy of
262
reliability output, assessing 131
remedies, applying 233
REPORT procedure 92
reporting
about 73
skills for 6
through output formats 84
representativity 34
requirements, for statistics process 12
residual plots
about 240
diagnosing data shape issues with 240
heteroscedasticity in 243
non-linearity in 242
residuals
about 273
error and 222
normality of 250
Results window
SAS 9 54
SAS Studio 62
return on investment (ROI) 320
returns 316
revenue, financial estimates of 314
reverse-worded items 42
reversed items, dealing with 128
Rich Text Files 84, 137
robust regression 248
ROBUSTREG procedure 248
ROI (return on investment) 320
Royal FrieslandCampina example
See data 21
Run command 76
S
sales 305, 316
Sall, J. 51
sample size 34
samples and sampling 23, 34
SAS
about 51, 52
website 60
SAS® 9
about 52
creating libraries in 55
importing data into 58
installing 53
ODS outputs in 85
opening 53
opening code files in 77
setting options 59
setting up 53
SAS® Enterprise Guide® 52, 73
SAS® Enterprise Miner 52
SAS® LASR 329, 334
SAS® Studio
about 52, 60, 61, 73
creating libraries 63
importing data 65
installing 60
linking libraries with folders 63
linking with folders and files on
computers 62
ODS outputs in 84
opening 60
opening code files in 77
setting options 67
setting up 60
Visual Programmer mode 86
SAS® Text Miner 329
SAS® University Edition 52, 60
SAS® Visual Analytics 53
satisfaction, of customers 302
scalability 329
scatter graphs, in SGPLOT procedure
139
scatterplots, SGSCATTER procedure for
multiple 146
scope 319
Index
SD (standard deviation) 97
secondary datasets 26
semantic differential 40
semicolons 76
Server Files and Folders (SAS Studio) 61
services, value of 315
SGPANEL procedure 136, 143
SGPLOT procedure
about 136, 138
examples of graphs 138
graphing options and formatting in 142
SGSCATTER procedure 136, 146
shapes
about 99
bimodal distribution 102
fitting data to exact mathematical 155
lognormal distribution 100
normal distribution 100
testing data for straight line 158
uniform distribution 101
significance, of regression slopes 260
simple imputations 127, 253
simulation
about 336, 340
example of 337
types of 337
single accuracy estimates 186
single data points 25
single variable patterns 151
size
as variable 3
levels of 304
of correlations 116
of R-Sq statistics 255
of significance and accuracy of
regression slopes 262
of statistics 173, 174, 200
variables analyzed by 303
skewness 105, 106, 160
skills
data architecture 7
extending your 6
reporting 6
Slaughter, S.J. 77
slow to access data 331
Snippets 88
social media 326
361
software
data analysis 48
spacing, in SAS code 76
Spearman correlations 115
specification error 223
spread
about 97
as a variable characteristic 31
calculating variables spread 99
checking 127
continuous variable 97
for categorical variables 99
interquartile range for continuous and
ordinal variables 98
Sreekumar, K.P. 161
staging phase, in data warehousing 332
standard deviation (SD) 97
standard error, parametric approach and
205
standardized slopes 259, 263
standardized statistics 175
static situations, change situations and
313
statistical association 110
statistical effect 313
statistical extrapolation
about 323
examples of 321
means-based example of 321
regression-based example of 322
statistical power
about 192
before and after testing 195
elements of 194
measurement of 192
problems with 198
understanding 192
statistical significance
about 183
bootstrapping 190
confidence intervals 184
single inaccuracy estimates and pvalues 186
statistical tests of distribution 103
statistics
about 15
accuracy of 15, 177
362 Index
advice on 18
classical 341
combining with per-unit financial values
318
extracting from data 15
generating 16
importance of data in 13
meaning of 15
nonparametric 274
normality 104
See also descriptive statistics 91
standardized 175
statistics process
about 9, 149, 168
challenges in 17
decision-making 17
extracting statistics from data 15
getting data 13
needs and requirements for 12
patterns in data 151
understanding 17
storage, growth in 327
strikes 318
structural equation modeling 234
Studentized Residual 247
subgroups, comparing 271
summary variables 132
superiority tests 298
supervised learning algorithms 336
surveys 38
SYSLIN procedure 234
T
T-tests
about 275
assessing data assumptios 278
end-point of 276
implementing nonparametric 280
related data 283
running initial 278
versions of traditional parametric 279
tabs, in SAS code 76
TABULATE procedure 92
tasks
doing through programming code
(syntax) 74
in analytics and reporting stages 300
running through point-and-click 73
test statistic, parametric approach and
205
testing
assessing power before and after 195
for statistical significance 183
text (character) data, versus numerical
data 25
textbook materials 53
textual analysis 329
theory
defined 163
importance of 235
versus data mining 153
theory-based analysis, compared with
data mining 154
times
capturing 48
changes in 317
traditional parametric t-test, versions of
279
transformations 244
trust
as variable 3
of customers 302
TTEST procedure 278
Twitter 326
two-stage least squares regression 234
type, as a variable characteristic 28
U
understanding, in statistics process 17
unequal variances t-test 280
uniform distribution 101
UNIVARIATE procedure 92, 93, 95, 136
unstandardized slopes 259, 262
unstructured data, growth in 327
unsupervised learning 336
Index
V
value, of big data 328
variable distribution 91
variables
about 35
analyzed by license and size 303
assessing missing data in 127
calculating spread 99
categories of 111
characteristics of 27
choosing 32
choosing the right 154
conditional 81
continuous 94, 98, 120, 270
creating 80
dependent 45, 255, 271, 273, 280
dummy 227, 264
focal 313
importance of types 30
in linear regression 216
independent 234, 237, 271
manipulating 80
missing data in 253
ordinal 95, 98, 228, 296
sales and 305
See also associating variables 109
See also categorical variables 119
specifying 130
strange distributions of 128
summary 132
variance inflation factors (VIFs) 236
363
variances 97, 216, 273
variety, of big data 328
velocity, of big data 328
veracity, of big data 328
Viewers (SAS 9) 54
VIFs (variance inflation factors) 236
virtualization program 60
Visa 326
Visual Programmer mode 73, 86
VMWare Player 60
volume, of big data 327
vulnerable datasets 330
W
wage bill, changes in 317
Walmart 326
weak relationship 221
weighted regression 245
West Point
See linear regression 213
Windows folders, linking SAS library to
55
Work library, creating datasets in 79
workforce numbers, changes in 317
Z
zero relationship 221
From Business Statistics Made Easy in SAS®, by Gregory John Lee. Copyright © 2015, SAS Institute
Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED.
xv
About the Author
Professor Gregory Lee is currently the Research Director and an
Associate Professor in Research Methodology and Decision Sciences at
the AMBA-rated Wits Business School.
He has published prior books on human resources (HR) metrics, and has
many article publications in the international arena such as the Human
Resource Management Journal, European Journal of Operational
Research, Scientometrics, Journal of Business-to-Business Marketing,
The International Journal of Human Resource Management, International
Journal of Manpower, Review of Income & Wealth, Journal of Human
Resource Costing & Accounting and many others.
He focuses on issues in human resource management, notably HR
metrics (in which he has established himself as a leading expert) and
other areas such as training, employee turnover and the employeecustomer link.
He has served in many capacities within the international academic field.
He has sat on the Graduate Management Admissions Council (GMAC©)
advisory council, the editorial boards of the Journal of Organizational and
Occupational Psychology, and engages in frequent reviewing for many
journals.
In addition, he is a well-known consultant, writer and speaker in the
corporate and practical management arenas, notably in the area of HR
metrics, but extending to other areas such as human resources strategy
and foresight.
Gain Greater Insight into Your
SAS Software with SAS Books.
®
Discover all that you need on your journey to knowledge and empowerment.
support.sas.com/bookstore
for additional books and resources.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names
are trademarks of their respective companies. © 2013 SAS Institute Inc. All rights reserved. S107969US.0613
Was this manual useful for you? yes no
Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement