MATLAB Arrays
Statistics Toolbox™ 7
User’s Guide
How to Contact MathWorks

Web: www.mathworks.com
Newsgroup: comp.soft-sys.matlab
Technical Support: www.mathworks.com/contact_TS.html

[email protected] (Product enhancement suggestions)
[email protected] (Bug reports)
[email protected] (Documentation error reports)
[email protected] (Order status, license renewals, passcodes)
[email protected] (Sales, pricing, and general information)

508-647-7000 (Phone)
508-647-7001 (Fax)

The MathWorks, Inc.
3 Apple Hill Drive
Natick, MA 01760-2098

For contact information about worldwide offices, see the MathWorks Web site.
Statistics Toolbox™ User’s Guide
© COPYRIGHT 1993–2011 by The MathWorks, Inc.
The software described in this document is furnished under a license agreement. The software may be used
or copied only under the terms of the license agreement. No part of this manual may be photocopied or
reproduced in any form without prior written consent from The MathWorks, Inc.
FEDERAL ACQUISITION: This provision applies to all acquisitions of the Program and Documentation
by, for, or through the federal government of the United States. By accepting delivery of the Program
or Documentation, the government hereby agrees that this software or documentation qualifies as
commercial computer software or commercial computer software documentation as such terms are used
or defined in FAR 12.212, DFARS Part 227.72, and DFARS 252.227-7014. Accordingly, the terms and
conditions of this Agreement and only those rights specified in this Agreement, shall pertain to and govern
the use, modification, reproduction, release, performance, display, and disclosure of the Program and
Documentation by the federal government (or other entity acquiring for or through the federal government)
and shall supersede any conflicting contractual terms or conditions. If this License fails to meet the
government’s needs or is inconsistent in any respect with federal procurement law, the government agrees
to return the Program and Documentation, unused, to The MathWorks, Inc.
Trademarks
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See
www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand
names may be trademarks or registered trademarks of their respective holders.
Patents
MathWorks products are protected by one or more U.S. patents. Please see
www.mathworks.com/patents for more information.
Revision History

September 1993   First printing     Version 1.0
March 1996       Second printing    Version 2.0
January 1997     Third printing     Version 2.11
November 2000    Fourth printing    Revised for Version 3.0 (Release 12)
May 2001         Fifth printing     Minor revisions
July 2002        Sixth printing     Revised for Version 4.0 (Release 13)
February 2003    Online only        Revised for Version 4.1 (Release 13.0.1)
June 2004        Seventh printing   Revised for Version 5.0 (Release 14)
October 2004     Online only        Revised for Version 5.0.1 (Release 14SP1)
March 2005       Online only        Revised for Version 5.0.2 (Release 14SP2)
September 2005   Online only        Revised for Version 5.1 (Release 14SP3)
March 2006       Online only        Revised for Version 5.2 (Release 2006a)
September 2006   Online only        Revised for Version 5.3 (Release 2006b)
March 2007       Eighth printing    Revised for Version 6.0 (Release 2007a)
September 2007   Ninth printing     Revised for Version 6.1 (Release 2007b)
March 2008       Online only        Revised for Version 6.2 (Release 2008a)
October 2008     Online only        Revised for Version 7.0 (Release 2008b)
March 2009       Online only        Revised for Version 7.1 (Release 2009a)
September 2009   Online only        Revised for Version 7.2 (Release 2009b)
March 2010       Online only        Revised for Version 7.3 (Release 2010a)
September 2010   Online only        Revised for Version 7.4 (Release 2010b)
April 2011       Online only        Revised for Version 7.5 (Release 2011a)
Contents

1  Getting Started
    Product Overview

2  Organizing Data
    Introduction
    MATLAB Arrays
        Numerical Data
        Heterogeneous Data
        Statistical Functions
    Statistical Arrays
        Introduction
        Categorical Arrays
        Dataset Arrays
    Grouped Data
        Grouping Variables
        Level Order Definition
        Functions for Grouped Data
        Using Grouping Variables

3  Descriptive Statistics
    Introduction
    Measures of Central Tendency
    Measures of Dispersion
    Measures of Shape
    Resampling Statistics
        The Bootstrap
        The Jackknife
        Parallel Computing Support for Resampling Methods
    Data with Missing Values

4  Statistical Visualization
    Introduction
    Scatter Plots
    Box Plots
    Distribution Plots
        Normal Probability Plots
        Quantile-Quantile Plots
        Cumulative Distribution Plots
        Other Probability Plots

5  Probability Distributions
    Using Probability Distributions
    Supported Distributions
        Parametric Distributions
        Nonparametric Distributions
    Working with Distributions Through GUIs
        Exploring Distributions
        Modeling Data Using the Distribution Fitting Tool
        Visually Exploring Random Number Generation
    Statistics Toolbox Distribution Functions
        Probability Density Functions
        Cumulative Distribution Functions
        Inverse Cumulative Distribution Functions
        Distribution Statistics Functions
        Distribution Fitting Functions
        Negative Log-Likelihood Functions
        Random Number Generators
    Using Probability Distribution Objects
        Using Distribution Objects
        What are Objects?
        Creating Distribution Objects
        Object-Supported Distributions
        Performing Calculations Using Distribution Objects
        Capturing Results Using Distribution Objects
    Probability Distributions Used for Multivariate Modeling
        Gaussian Mixture Models
        Copulas

6  Random Number Generation
    Generating Random Data
    Random Number Generation Functions
    Common Generation Methods
        Direct Methods
        Inversion Methods
        Acceptance-Rejection Methods
    Representing Sampling Distributions Using Markov Chain Samplers
        Using the Metropolis-Hastings Algorithm
        Using Slice Sampling
    Generating Quasi-Random Numbers
        Quasi-Random Sequences
        Quasi-Random Point Sets
        Quasi-Random Streams
    Generating Data Using Flexible Families of Distributions
        Pearson and Johnson Systems
        Generating Data Using the Pearson System
        Generating Data Using the Johnson System

7  Hypothesis Tests
    Introduction
    Hypothesis Test Terminology
    Hypothesis Test Assumptions
    Example: Hypothesis Testing
    Available Hypothesis Tests

8  Analysis of Variance
    Introduction
    ANOVA
        One-Way ANOVA
        Two-Way ANOVA
        N-Way ANOVA
        Other ANOVA Models
        Analysis of Covariance
        Nonparametric Methods
    MANOVA
        Introduction
        ANOVA with Multiple Responses

9  Parametric Regression Analysis
    Introduction
    Linear Regression
        Linear Regression Models
        Multiple Linear Regression
        Robust Regression
        Stepwise Regression
        Ridge Regression
        Partial Least Squares
        Polynomial Models
        Response Surface Models
        Generalized Linear Models
        Multivariate Regression
    Nonlinear Regression
        Nonlinear Regression Models
        Parametric Models
        Mixed-Effects Models

10  Multivariate Methods
    Introduction
    Multidimensional Scaling
        Introduction
        Classical Multidimensional Scaling
        Nonclassical Multidimensional Scaling
        Nonmetric Multidimensional Scaling
    Procrustes Analysis
        Comparing Landmark Data
        Data Input
        Preprocessing Data for Accurate Results
        Example: Comparing Handwritten Shapes
    Feature Selection
        Introduction
        Sequential Feature Selection
    Feature Transformation
        Introduction
        Nonnegative Matrix Factorization
        Principal Component Analysis (PCA)
        Factor Analysis

11  Cluster Analysis
    Introduction
    Hierarchical Clustering
        Introduction
        Algorithm Description
        Similarity Measures
        Linkages
        Dendrograms
        Verifying the Cluster Tree
        Creating Clusters
    K-Means Clustering
        Introduction
        Creating Clusters and Determining Separation
        Determining the Correct Number of Clusters
        Avoiding Local Minima
    Gaussian Mixture Models
        Introduction
        Clustering with Gaussian Mixtures

12  Parametric Classification
    Introduction
    Discriminant Analysis
        Introduction
        Example: Discriminant Analysis
    Naive Bayes Classification
        Supported Distributions
    Performance Curves
        Introduction
        What are ROC Curves?
        Evaluating Classifier Performance Using perfcurve

13  Supervised Learning
    Supervised Learning (Machine Learning) Workflow and Algorithms
        Steps in Supervised Learning (Machine Learning)
        Characteristics of Algorithms
    Classification Using Nearest Neighbors
        Pairwise Distance
        k-Nearest Neighbor Search
    Classification Trees and Regression Trees
        What Are Classification Trees and Regression Trees?
        Creating Classification Trees and Regression Trees
        Predicting Responses With Classification Trees and Regression Trees
        Improving Classification Trees and Regression Trees
        Alternative: classregtree
    Ensemble Methods
        Framework for Ensemble Learning
        Basic Ensemble Examples
        Test Ensemble Quality
        Classification: Imbalanced Data or Unequal Misclassification Costs
        Example: Classification with Many Categorical Levels
        Example: Surrogate Splits
        Ensemble Regularization
        Example: Tuning RobustBoost
        TreeBagger Examples
        Ensemble Algorithms
    Bibliography

14  Markov Models
    Introduction
    Markov Chains
    Hidden Markov Models (HMM)
        Introduction
        Analyzing Hidden Markov Models

15  Design of Experiments
    Introduction
    Full Factorial Designs
        Multilevel Designs
        Two-Level Designs
    Fractional Factorial Designs
        Introduction
        Plackett-Burman Designs
        General Fractional Designs
    Response Surface Designs
        Introduction
        Central Composite Designs
        Box-Behnken Designs
    D-Optimal Designs
        Introduction
        Generating D-Optimal Designs
        Augmenting D-Optimal Designs
        Specifying Fixed Covariate Factors
        Specifying Categorical Factors
        Specifying Candidate Sets

16  Statistical Process Control
    Introduction
    Control Charts
    Capability Studies

17  Parallel Statistics
    Quick Start Parallel Computing for Statistics Toolbox
        What Is Parallel Statistics Functionality?
        How To Compute in Parallel
        Example: Parallel Treebagger
    Concepts of Parallel Computing in Statistics Toolbox
        Subtleties in Parallel Computing
        Vocabulary for Parallel Computation
    When to Run Statistical Functions in Parallel
        Why Run in Parallel?
        Factors Affecting Speed
        Factors Affecting Results
    Working with parfor
        How Statistical Functions Use parfor
        Characteristics of parfor
    Reproducibility in Parallel Statistical Computations
        Issues and Considerations in Reproducing Parallel Computations
        Running Reproducible Parallel Computations
        Subtleties in Parallel Statistical Computation Using Random Numbers
    Examples of Parallel Statistical Functions
        Example: Parallel Jackknife
        Example: Parallel Cross Validation
        Example: Parallel Bootstrap

18  Function Reference
    File I/O
    Data Organization
        Categorical Arrays
        Dataset Arrays
        Grouped Data
    Descriptive Statistics
        Summaries
        Measures of Central Tendency
        Measures of Dispersion
        Measures of Shape
        Statistics Resampling
        Data with Missing Values
        Data Correlation
    Statistical Visualization
        Distribution Plots
        Scatter Plots
        ANOVA Plots
        Regression Plots
        Multivariate Plots
        Cluster Plots
        Classification Plots
        DOE Plots
        SPC Plots
    Probability Distributions
        Distribution Objects
        Distribution Plots
        Probability Density
        Cumulative Distribution
        Inverse Cumulative Distribution
        Distribution Statistics
        Distribution Fitting
        Negative Log-Likelihood
        Random Number Generators
        Quasi-Random Numbers
        Piecewise Distributions
    Hypothesis Tests
    Analysis of Variance
        ANOVA Plots
        ANOVA Operations
    Parametric Regression Analysis
        Regression Plots
        Linear Regression
        Nonlinear Regression
    Multivariate Methods
        Multivariate Plots
        Multidimensional Scaling
        Procrustes Analysis
        Feature Selection
        Feature Transformation
    Cluster Analysis
        Cluster Plots
        Hierarchical Clustering
        K-Means Clustering
        Gaussian Mixture Models
    Model Assessment
    Parametric Classification
        Classification Plots
        Discriminant Analysis
        Naive Bayes Classification
        Distance Computation and Nearest Neighbor Search
    Supervised Learning
        Classification Trees
        Regression Trees
        Ensemble Methods — Classification
        Ensemble Methods — Regression
    Hidden Markov Models
    Design of Experiments
        DOE Plots
        Full Factorial Designs
        Fractional Factorial Designs
        Response Surface Designs
        D-Optimal Designs
        Latin Hypercube Designs
        Quasi-Random Designs
    Statistical Process Control
        SPC Plots
        SPC Functions
    GUIs
    Utilities

19  Class Reference
    Data Organization
        Categorical Arrays
        Dataset Arrays
    Probability Distributions
        Distribution Objects
        Quasi-Random Numbers
        Piecewise Distributions
    Gaussian Mixture Models
    Model Assessment
    Parametric Classification
        Naive Bayes Classification
        Distance Classifiers
    Supervised Learning
        Classification Trees
        Classification Ensemble Classes
        Regression Trees
        Regression Ensemble Classes
    Quasi-Random Design of Experiments

20  Functions — Alphabetical List

A  Data Sets

B  Distribution Reference
    Bernoulli Distribution (Definition of the Bernoulli Distribution, See Also)
    Beta Distribution (Definition, Background, Parameters, Example, See Also)
    Binomial Distribution (Definition, Background, Parameters, Example, See Also)
    Birnbaum-Saunders Distribution (Definition, Background, Parameters, See Also)
    Chi-Square Distribution (Definition, Background, Example, See Also)
    Copulas
    Custom Distributions
    Exponential Distribution (Definition, Background, Parameters, Example, See Also)
    Extreme Value Distribution (Definition, Background, Parameters, Example, See Also)
    F Distribution (Definition, Background, Example, See Also)
    Gamma Distribution (Definition, Background, Parameters, Example, See Also)
    Gaussian Distribution
    Gaussian Mixture Distributions
    Generalized Extreme Value Distribution (Definition, Background, Parameters, Example, See Also)
    Generalized Pareto Distribution (Definition, Background, Parameters, Example, See Also)
    Geometric Distribution (Definition, Background, Example, See Also)
    Hypergeometric Distribution (Definition, Background, Example, See Also)
    Inverse Gaussian Distribution (Definition, Background, Parameters, See Also)
    Inverse Wishart Distribution (Definition, Background, Example, See Also)
    Johnson System
    Logistic Distribution (Definition, Background, Parameters, See Also)
    Loglogistic Distribution (Definition, Parameters, See Also)
    Lognormal Distribution (Definition, Background, Example, See Also)
    Multinomial Distribution (Definition, Background, Example)
    Multivariate Gaussian Distribution
    Multivariate Normal Distribution (Definition, Background, Example, See Also)
    Multivariate t Distribution (Definition, Background, Example, See Also)
    Nakagami Distribution (Definition, Background, Parameters, See Also)
    Negative Binomial Distribution (Definition, Background, Parameters, Example, See Also)
    Noncentral Chi-Square Distribution (Definition, Background, Example)
    Noncentral F Distribution (Definition, Background, Example, See Also)
    Noncentral t Distribution (Definition, Background, Example, See Also)
    Nonparametric Distributions
    Normal Distribution (Definition, Background, Parameters, Example, See Also)
    Pareto Distribution
    Pearson System
    Piecewise Distributions
    Poisson Distribution (Definition, Background, Parameters, Example, See Also)
    Rayleigh Distribution (Definition, Background, Parameters, Example, See Also)
    Rician Distribution (Definition, Background, Parameters, See Also)
    Student’s t Distribution (Definition, Background, Example, See Also)
    t Location-Scale Distribution (Definition, Background, Parameters, See Also)
    Uniform Distribution (Continuous) (Definition, Background, Parameters, Example, See Also)
    Uniform Distribution (Discrete) (Definition, Background, Example, See Also)
    Weibull Distribution (Definition, Background, Parameters, Example, See Also)
    Wishart Distribution (Definition, Background, Example, See Also)

C  Bibliography

Index
1  Getting Started
Product Overview
Statistics Toolbox™ software extends MATLAB® to support a wide range of
common statistical tasks. The toolbox contains two categories of tools:
• Building-block statistical functions for use in MATLAB programming
• Graphical user interfaces (GUIs) for interactive data analysis
Code for the building-block functions is open and extensible. Use the MATLAB
Editor to review, copy, and edit code for any function. Extend the toolbox by
copying code to new files or by writing files that call toolbox functions.
GUIs allow you to perform statistical visualization and analysis without
writing code. You interact with the GUIs using sliders, input fields, buttons,
and other controls, and the GUIs automatically call building-block functions.
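For example, a short sketch of calling building-block functions from the command line on made-up sample data (normrnd, normfit, and histfit are toolbox functions; the data here is generated purely for illustration):

x = normrnd(10,2,100,1);          % illustrative sample: 100 random values
[muHat,sigmaHat] = normfit(x)     % estimate normal parameters from the data
histfit(x)                        % histogram of x with a fitted normal density

The same calls could be placed in a file of your own that extends or wraps the toolbox functions.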
2  Organizing Data
• “Introduction” on page 2-2
• “MATLAB Arrays” on page 2-4
• “Statistical Arrays” on page 2-11
• “Grouped Data” on page 2-34
Introduction
MATLAB data is placed into “data containers” in the form of workspace
variables. All workspace variables organize data into some form of array. For
statistical purposes, arrays are viewed as tables of values.
MATLAB variables use different structures to organize data:
• 2-D numerical arrays (matrices) organize observations and measured
variables by rows and columns, respectively. (See “Other Data Structures”
in the MATLAB documentation.)
• Multidimensional arrays organize multidimensional observations or
experimental designs. (See “Multidimensional Arrays” in the MATLAB
documentation.)
• Cell and structure arrays organize heterogeneous data of different types,
sizes, units, etc. (See “Cell Arrays” and “Structures” in the MATLAB
documentation.)
Data types determine the kind of data variables contain. (See “Classes (Data
Types)” in the MATLAB documentation.)
These basic MATLAB container variables are reviewed, in a statistical
context, in the section on “MATLAB Arrays” on page 2-4.
These variables are not specifically designed for statistical data, however.
Statistical data generally involves observations of multiple variables, with
measurements of heterogeneous type and size. Data may be numerical (of
type single or double), categorical, or in the form of descriptive metadata.
Fitting statistical data into basic MATLAB variables, and accessing it
efficiently, can be cumbersome.
Statistics Toolbox software offers two additional types of container variables
specifically designed for statistical data:
• “Categorical Arrays” on page 2-13 accommodate data in the form of discrete
levels, together with its descriptive metadata.
• “Dataset Arrays” on page 2-23 encapsulate heterogeneous data and
metadata, including categorical data, which is accessed and manipulated
using familiar methods analogous to those for numerical matrices.
These statistical container variables are discussed in the section on
“Statistical Arrays” on page 2-11.
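As a brief sketch of what these containers look like in practice, the Fisher iris data used later in this chapter can be placed into a categorical array and a dataset array (the variable names here are chosen only for illustration; nominal and dataset are the relevant toolbox constructors):

load fisheriris                        % 150-by-4 numeric meas, 150-by-1 cell species
spec = nominal(species);               % categorical array of discrete levels
ds = dataset(meas(:,1),meas(:,2),spec, ...
     'VarNames',{'SepalLength','SepalWidth','Species'});
ds(1:3,:)                              % dataset arrays index much like matrices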
MATLAB Arrays
In this section...
“Numerical Data” on page 2-4
“Heterogeneous Data” on page 2-7
“Statistical Functions” on page 2-9
Numerical Data
MATLAB two-dimensional numerical arrays (matrices) containing statistical
data use rows to represent observations and columns to represent measured
variables. For example,
load fisheriris % Fisher's iris data (1936)
loads the variables meas and species into the MATLAB workspace. The meas
variable is a 150-by-4 numerical matrix, representing 150 observations of 4
different measured variables (by column: sepal length, sepal width, petal
length, and petal width, respectively).
The observations in meas are of three different species of iris (setosa,
versicolor, and virginica), which can be separated from one another using the
150-by-1 cell array of strings species:
setosa_indices = strcmp('setosa',species);
setosa = meas(setosa_indices,:);
The resulting setosa variable is 50-by-4, representing 50 observations of the
4 measured variables for iris setosa.
To access and display the first five observations in the setosa data, use row,
column parenthesis indexing:
SetosaObs = setosa(1:5,:)

SetosaObs =
    5.1000    3.5000    1.4000    0.2000
    4.9000    3.0000    1.4000    0.2000
    4.7000    3.2000    1.3000    0.2000
    4.6000    3.1000    1.5000    0.2000
    5.0000    3.6000    1.4000    0.2000
The data are organized into a table with implicit column headers “Sepal
Length,” “Sepal Width,” “Petal Length,” and “Petal Width.” Implicit row
headers are “Observation 1,” “Observation 2,” “Observation 3,” etc.
Similarly, 50 observations for iris versicolor and iris virginica can be extracted
from the meas container variable:
versicolor_indices = strcmp('versicolor',species);
versicolor = meas(versicolor_indices,:);
virginica_indices = strcmp('virginica',species);
virginica = meas(virginica_indices,:);
Because the data sets for the three species happen to be of the same size, they
can be reorganized into a single 50-by-4-by-3 multidimensional array:
iris = cat(3,setosa,versicolor,virginica);
The iris array is a three-layer table with the same implicit row and column
headers as the setosa, versicolor, and virginica arrays. The implicit layer
names, along the third dimension, are “Setosa,” “Versicolor,” and “Virginica.”
The utility of such a multidimensional organization depends on assigning
meaningful properties of the data to each dimension.
To access and display data in a multidimensional array, use parenthesis
indexing, as for 2-D arrays. The following gives the first five observations
of sepal lengths in the setosa data:
SetosaSL = iris(1:5,1,1)
SetosaSL =
5.1000
4.9000
4.7000
4.6000
5.0000
Multidimensional arrays provide a natural way to organize numerical data
for which the observations, or experimental designs, have many dimensions.
If, for example, data with the structure of iris are collected by multiple
observers, in multiple locations, over multiple dates, the entirety of the data
can be organized into a single higher dimensional array with dimensions
for “Observer,” “Location,” and “Date.” Likewise, an experimental design
calling for m observations of n p-dimensional variables could be stored in
an m-by-n-by-p array.
Numerical arrays have limitations when organizing more general statistical
data. One limitation is the implicit nature of the metadata. Another is the
requirement that multidimensional data be of commensurate size across all
dimensions. If variables have different lengths, or the number of variables
differs by layer, then multidimensional arrays must be artificially padded
with NaNs to indicate “missing values.” These limitations are addressed by
dataset arrays (see “Dataset Arrays” on page 2-23), which are specifically
designed for statistical data.
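For example, a minimal sketch of such padding (the unequal-length variables
x1 and x2 here are hypothetical, not part of the iris data):

x1 = [5.1; 4.9; 4.7];        % a variable with three observations
x2 = [3.5; 3.0];             % a variable with only two observations
padded = NaN(3,2);           % preallocate with NaN "missing values"
padded(1:numel(x1),1) = x1;
padded(1:numel(x2),2) = x2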
Heterogeneous Data
MATLAB data types include two container variables—cell arrays and
structure arrays—that allow you to combine metadata with variables of
different types and sizes.
The data in the variables setosa, versicolor, and virginica created in
“Numerical Data” on page 2-4 can be organized in a cell array, as follows:
iris1 = cell(51,5,3); % Container variable
obsnames = strcat({'Obs'},num2str((1:50)','%-d'));
iris1(2:end,1,:) = repmat(obsnames,[1 1 3]);
varnames = {'SepalLength','SepalWidth',...
'PetalLength','PetalWidth'};
iris1(1,2:end,:) = repmat(varnames,[1 1 3]);
iris1(2:end,2:end,1) = num2cell(setosa);
iris1(2:end,2:end,2) = num2cell(versicolor);
iris1(2:end,2:end,3) = num2cell(virginica);
iris1{1,1,1} = 'Setosa';
iris1{1,1,2} = 'Versicolor';
iris1{1,1,3} = 'Virginica';
To access and display the cells, use parenthesis indexing. The following
displays the first five observations in the setosa sepal data:
SetosaSLSW = iris1(1:6,1:3,1)
SetosaSLSW =
    'Setosa'    'SepalLength'    'SepalWidth'
    'Obs1'      [5.1000]         [3.5000]
    'Obs2'      [4.9000]         [3]
    'Obs3'      [4.7000]         [3.2000]
    'Obs4'      [4.6000]         [3.1000]
    'Obs5'      [5]              [3.6000]
Here, the row and column headers have been explicitly labeled with metadata.
To extract the data subset, use row, column curly brace indexing:
subset = reshape([iris1{2:6,2:3,1}],5,2)
subset =
    5.1000    3.5000
    4.9000    3.0000
    4.7000    3.2000
    4.6000    3.1000
    5.0000    3.6000
While cell arrays are useful for organizing heterogeneous data, they may
be cumbersome when it comes to manipulating and analyzing the data.
MATLAB and Statistics Toolbox statistical functions do not accept data in the
form of cell arrays. For processing, data must be extracted from the cell array
to a numerical container variable, as in the preceding example. The indexing
can become complicated for large, heterogeneous data sets. This limitation of
cell arrays is addressed by dataset arrays (see “Dataset Arrays” on page 2-23),
which are designed to store general statistical data and provide easy access.
The data in the preceding example can also be organized in a structure array,
as follows:
iris2.data = cat(3,setosa,versicolor,virginica);
iris2.varnames = {'SepalLength','SepalWidth',...
'PetalLength','PetalWidth'};
iris2.obsnames = strcat({'Obs'},num2str((1:50)','%-d'));
iris2.species = {'setosa','versicolor','virginica'};
The data subset is then returned using a combination of dot and parenthesis
indexing:
subset = iris2.data(1:5,1:2,1)
subset =
    5.1000    3.5000
    4.9000    3.0000
    4.7000    3.2000
    4.6000    3.1000
    5.0000    3.6000
For statistical data, structure arrays have many of the same limitations as
cell arrays. Once again, dataset arrays (see “Dataset Arrays” on page 2-23),
designed specifically for general statistical data, address these limitations.
Statistical Functions
One of the advantages of working in the MATLAB language is that functions
operate on entire arrays of data, not just on single scalar values. The
functions are said to be vectorized. Vectorization allows for both efficient
problem formulation, using array-based data, and efficient computation,
using vectorized statistical functions.
When MATLAB and Statistics Toolbox statistical functions operate on a
vector of numerical data (either a row vector or a column vector), they return
a single computed statistic:
% Fisher's setosa data:
load fisheriris
setosa_indices = strcmp('setosa',species);
setosa = meas(setosa_indices,:);
% Single variable from the data:
setosa_sepal_length = setosa(:,1);
% Standard deviation of the variable:
std(setosa_sepal_length)
ans =
0.3525
When statistical functions operate on a matrix of numerical data, they treat
the columns independently, as separate measured variables, and return a
vector of statistics—one for each variable:
std(setosa)
ans =
    0.3525    0.3791    0.1737    0.1054
The four standard deviations are for measurements of sepal length, sepal
width, petal length, and petal width, respectively.
Compare this to
std(setosa(:))
ans =
1.8483
which gives the standard deviation across the entire array (all measurements).
Compare the preceding statistical calculations to the more generic
mathematical operation
sin(setosa)
This operation returns a 50-by-4 array the same size as setosa. The sin
function is vectorized in a different way than the std function, computing one
scalar value for each element in the array.
MATLAB and Statistics Toolbox statistical functions, like std, must be
distinguished from general mathematical functions like sin. Both are
vectorized, and both are useful for working with array-based data, but
only statistical functions summarize data across observations (rows) while
preserving variables (columns). This property of statistical functions may be
explicit, as with std, or implicit, as with regress. To see how a particular
function handles array-based data, consult its reference page.
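As a rough sketch of the implicit case (reusing the setosa matrix created
above; the particular regression is chosen only for illustration), regress also
summarizes across the 50 observations, returning one coefficient per predictor
rather than one value per row:

% Regress sepal width (column 2) on sepal length (column 1), with intercept
y = setosa(:,2);
X = [ones(size(setosa,1),1) setosa(:,1)];
b = regress(y,X)    % 2-by-1 vector of coefficients, not 50-by-1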
MATLAB statistical functions expect data input arguments to be in the form
of numerical arrays. If data is stored in a cell or structure array, it must
be extracted to a numerical array, via indexing, for processing. Statistics
Toolbox functions are more flexible. Many toolbox functions accept data input
arguments in the form of both numerical arrays and dataset arrays (see
“Dataset Arrays” on page 2-23), which are specifically designed for storing
general statistical data.
Statistical Arrays
In this section...
“Introduction” on page 2-11
“Categorical Arrays” on page 2-13
“Dataset Arrays” on page 2-23
Introduction
As discussed in “MATLAB Arrays” on page 2-4, MATLAB data types include
arrays for numerical, logical, and character data, as well as cell and structure
arrays for heterogeneous collections of data.
Statistics Toolbox software offers two additional types of arrays specifically
designed for statistical data:
• “Categorical Arrays” on page 2-13
• “Dataset Arrays” on page 2-23
Categorical arrays store data with values in a discrete set of levels. Each level
is meant to capture a single, defining characteristic of an observation. If no
ordering is encoded in the levels, the data and the array are nominal. If an
ordering is encoded, the data and the array are ordinal.
Categorical arrays also store labels for the levels. Nominal labels typically
suggest the type of an observation, while ordinal labels suggest the position
or rank.
Dataset arrays collect heterogeneous statistical data and metadata, including
categorical data, into a single container variable. Like the numerical matrices
discussed in “Numerical Data” on page 2-4, dataset arrays can be viewed as
tables of values, with rows representing different observations and columns
representing different measured variables. Like the cell and structure
arrays discussed in “Heterogeneous Data” on page 2-7, dataset arrays can
accommodate variables of different types, sizes, units, etc.
2-11
2
Organizing Data
Dataset arrays combine the organizational advantages of these basic
MATLAB data types while addressing their shortcomings with respect to
storing complex statistical data.
Both categorical and dataset arrays have associated methods for assembling,
accessing, manipulating, and processing the collected data. Basic array
operations parallel those for numerical, cell, and structure arrays.
Categorical Arrays
• “Categorical Data” on page 2-13
• “Categorical Arrays” on page 2-14
• “Using Categorical Arrays” on page 2-16
Categorical Data
Categorical data take on values from only a finite, discrete set of categories
or levels. Levels may be determined before the data are collected, based on
the application, or they may be determined by the distinct values in the data
when converting them to categorical form. Predetermined levels, such as a
set of states or numerical intervals, are independent of the data they contain.
Any number of values in the data may attain a given level, or no data at all.
Categorical data show which measured values share common levels, and
which do not.
Levels may have associated labels. Labels typically express a defining
characteristic of an observation, captured by its level.
If no ordering is encoded in the levels, the data are nominal. Nominal
labels typically indicate the type of an observation. Examples of nominal
labels are {false, true}, {male, female}, and {Afghanistan, ..., Zimbabwe}.
For nominal data, the numeric or lexicographic order of the labels is
irrelevant—Afghanistan is not considered to be less than, equal to, or greater
than Zimbabwe.
If an ordering is encoded in the levels—for example, if levels labeled “red”,
“green”, and “blue” represent wavelengths—the data are ordinal. Labels
for ordinal levels typically indicate the position or rank of an observation.
Examples of ordinal labels are {0, 1}, {mm, cm, m, km}, and {poor, satisfactory,
outstanding}. The ordering of the levels may or may not correspond to the
numeric or lexicographic order of the labels.
Categorical Arrays
Categorical data can be represented using MATLAB integer arrays, but
this method has a number of drawbacks. First, it removes all of the useful
metadata that might be captured in labels for the levels. Labels must be
stored separately, in character arrays or cell arrays of strings. Secondly, this
method suggests that values stored in the integer array have their usual
numeric meaning, which, for categorical data, they may not. Finally, integer
types have a fixed set of levels (for example, -128:127 for all int8 arrays),
which cannot be changed.
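A brief sketch of the contrast (the integer coding shown here is a hypothetical
do-it-yourself approach, not a toolbox feature):

load fisheriris
% Integer coding: the labels must be carried in a separate variable
codes = zeros(size(species));
codes(strcmp(species,'setosa'))     = 1;
codes(strcmp(species,'versicolor')) = 2;
codes(strcmp(species,'virginica'))  = 3;
labels = {'setosa','versicolor','virginica'};
% Categorical coding: the levels and labels travel with the data
n = nominal(species);
getlabels(n)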
Categorical arrays, available in Statistics Toolbox software, are specifically
designed for storing, manipulating, and processing categorical data and
metadata. Unlike integer arrays, each categorical array has its own set of
levels, which can be changed. Categorical arrays also accommodate labels for
levels in a natural way. Like numerical arrays, categorical arrays take on
different shapes and sizes, from scalars to N-D arrays.
Organizing data in a categorical array can be an end in itself. Often, however,
categorical arrays are used for further statistical processing. They can be
used to index into other variables, creating subsets of data based on the
category of observation, or they can be used with statistical functions that
accept categorical inputs. For examples, see “Grouped Data” on page 2-34.
Categorical arrays come in two types, depending on whether the collected
data is understood to be nominal or ordinal. Nominal arrays are constructed
with nominal; ordinal arrays are constructed with ordinal. For example,
load fisheriris
ndata = nominal(species,{'A','B','C'});
creates a nominal array with levels A, B, and C from the species data in
fisheriris.mat, while
odata = ordinal(ndata,{},{'C','A','B'});
encodes an ordering of the levels with C < A < B. See “Using Categorical
Arrays” on page 2-16, and the reference pages for nominal and ordinal, for
further examples.
Categorical arrays are implemented as objects of the categorical class.
The class is abstract, defining properties and methods common to both
the nominal class and ordinal class. Use the corresponding constructors,
nominal or ordinal, to create categorical arrays. Methods of the classes are
used to display, summarize, convert, concatenate, and access the collected
data. Many of these methods are invoked using operations analogous to those
for numerical arrays, and do not need to be called directly (for example, []
invokes horzcat). Other methods, such as reorderlevels, must be called
directly.
Using Categorical Arrays
This section provides an extended tutorial example demonstrating the use of
categorical arrays with methods of the nominal class and ordinal class.
• “Constructing Categorical Arrays” on page 2-16
• “Accessing Categorical Arrays” on page 2-18
• “Combining Categorical Arrays” on page 2-19
• “Computing with Categorical Arrays” on page 2-20
Constructing Categorical Arrays.
1 Load the 150-by-4 numerical array meas and the 150-by-1 cell array of
strings species:
load fisheriris % Fisher's iris data (1936)
The data are 150 observations of four measured variables (by column
number: sepal length, sepal width, petal length, and petal width,
respectively) over three species of iris (setosa, versicolor, and virginica).
2 Use nominal to create a nominal array from species:
n1 = nominal(species);
3 Open species and n1 side by side in the Variable Editor (see “Viewing and
Editing Workspace Variables with the Variable Editor” in the MATLAB
documentation). Note that the string information in species has been
converted to categorical form, leaving only information on which data share
the same values, indicated by the labels for the levels.
By default, levels are labeled with the distinct values in the data (in this
case, the strings in species). Give alternate labels with additional input
arguments to the nominal constructor:
n2 = nominal(species,{'species1','species2','species3'});
4 Open n2 in the Variable Editor, and compare it with species and n1. The
levels have been relabeled.
5 Suppose that the data are considered to be ordinal. A characteristic of the
data that is not reflected in the labels is the diploid chromosome count,
which orders the levels corresponding to the three species as follows:
species1 < species3 < species2
Use ordinal to cast n2 as an ordinal array:
o1 = ordinal(n2,{},{'species1','species3','species2'});
The second input argument to ordinal is the same as for nominal—a list
of labels for the levels in the data. If it is unspecified, as above, the labels
are inherited from the data, in this case n2. The third input argument of
ordinal indicates the ordering of the levels, in ascending order.
6 When displayed side by side in the Variable Editor, o1 does not appear any
different than n2. This is because the data in o1 have not been sorted. It
is important to recognize the difference between the ordering of the levels
in an ordinal array and sorting the actual data according to that ordering.
Use sort to sort ordinal data in ascending order:
o2 = sort(o1);
When displayed in the Variable Editor, o2 shows the data sorted by diploid
chromosome count.
7 To find which elements moved up in the sort, use the < operator for ordinal
arrays:
moved_up = (o1 < o2);
The operation returns a logical array moved_up, indicating which elements
have moved up (the data for species3).
8 Use getlabels to display the labels for the levels in ascending order:
labels2 = getlabels(o2)
labels2 =
'species1'
'species3'
'species2'
9 The sort function reorders the display of the data, but not the order of the
levels. To reorder the levels, use reorderlevels:
o3 = reorderlevels(o2,labels2([1 3 2]));
labels3 = getlabels(o3)
labels3 =
'species1'
'species2'
'species3'
o4 = sort(o3);
These operations return the levels in the data to their original ordering, by
species number, and then sort the data for display purposes.
Accessing Categorical Arrays. Categorical arrays are accessed using
parenthesis indexing, with syntax that parallels similar operations for
numerical arrays (see “Numerical Data” on page 2-4).
Parenthesis indexing on the right-hand side of an assignment is used to
extract the lowest 50 elements from the ordinal array o4:
low50 = o4(1:50);
Suppose you want to categorize the data in o4 with only two levels: low (the
data in low50) and high (the rest of the data). One way to do this is to use an
assignment with parenthesis indexing on the left-hand side:
o5 = o4; % Copy o4
o5(1:50) = 'low';
Warning: Categorical level 'low' being added.
o5(51:end) = 'high';
Warning: Categorical level 'high' being added.
Note the warnings: the assignments move data to new levels. The old levels,
though empty, remain:
getlabels(o5)
ans =
'species1' 'species2' 'species3' 'low' 'high'
The old levels are removed using droplevels:
o5 = droplevels(o5,{'species1','species2','species3'});
Another approach to creating two categories in o5 from the three categories in
o4 is to merge levels, using mergelevels:
o5 = mergelevels(o4,{'species1'},'low');
o5 = mergelevels(o5,{'species2','species3'},'high');
getlabels(o5)
ans =
'low'
'high'
The merged levels are removed and replaced with the new levels.
Combining Categorical Arrays. Categorical arrays are concatenated using
square brackets. Again, the syntax parallels similar operations for numerical
arrays (see “Numerical Data” on page 2-4). There are, however, restrictions:
• Only categorical arrays of the same type can be combined. You cannot
concatenate a nominal array with an ordinal array.
• Only ordinal arrays with the same levels, in the same order, can be
combined.
• Nominal arrays with different levels can be combined to produce a nominal
array whose levels are the union of the levels in the component arrays.
First use ordinal to create ordinal arrays from the variables for sepal length
and sepal width in meas. Categorize the data as short or long depending on
whether they are below or above the median of the variable, respectively:
sl = meas(:,1); % Sepal length data
sw = meas(:,2); % Sepal width data
SL1 = ordinal(sl,{'short','long'},[],...
[min(sl),median(sl),max(sl)]);
SW1 = ordinal(sw,{'short','long'},[],...
[min(sw),median(sw),max(sw)]);
Because SL1 and SW1 are ordinal arrays with the same levels, in the same
order, they can be concatenated:
S1 = [SL1,SW1];
S1(1:10,:)
ans =
     short     long
     short     long
     short     long
     short     long
     short     long
     short     long
     short     long
     short     long
     short     short
     short     long
The result is an ordinal array S1 with two columns.
If, on the other hand, the measurements are cast as nominal, different levels
can be used for the different variables, and the two nominal arrays can still
be combined:
SL2 = nominal(sl,{'short','long'},[],...
[min(sl),median(sl),max(sl)]);
SW2 = nominal(sw,{'skinny','wide'},[],...
[min(sw),median(sw),max(sw)]);
S2 = [SL2,SW2];
getlabels(S2)
ans =
'short' 'long' 'skinny' 'wide'
S2(1:10,:)
ans =
     short     wide
     short     wide
     short     wide
     short     wide
     short     wide
     short     wide
     short     wide
     short     wide
     short     skinny
     short     wide
Computing with Categorical Arrays. Categorical arrays are used to
index into other variables, creating subsets of data based on the category
of observation, and they are used with statistical functions that accept
categorical inputs, such as those described in “Grouped Data” on page 2-34.
Use ismember to create logical variables based on the category of observation.
For example, the following creates a logical index the same size as species
that is true for observations of iris setosa and false elsewhere. Recall that
n1 = nominal(species):
SetosaObs = ismember(n1,'setosa');
Since the code above compares elements of n1 to a single value, the same
operation is carried out by the equality operator:
SetosaObs = (n1 == 'setosa');
The SetosaObs variable is used to index into meas to extract only the setosa
data:
SetosaData = meas(SetosaObs,:);
Categorical arrays are also used as grouping variables. The following plot
summarizes the sepal length data in meas by category:
boxplot(sl,n1)
Dataset Arrays
• “Statistical Data” on page 2-23
• “Dataset Arrays” on page 2-24
• “Using Dataset Arrays” on page 2-25
Statistical Data
MATLAB data containers (variables) are suitable for completely homogeneous
data (numeric, character, and logical arrays) and for completely heterogeneous
data (cell and structure arrays). Statistical data, however, are often a mixture
of homogeneous variables of heterogeneous types and sizes. Dataset arrays
are suitable containers for this kind of data.
Dataset arrays can be viewed as tables of values, with rows representing
different observations or cases and columns representing different measured
variables. In this sense, dataset arrays are analogous to the numerical
arrays for statistical data discussed in “Numerical Data” on page 2-4. Basic
methods for creating and manipulating dataset arrays parallel the syntax of
corresponding methods for numerical arrays.
While each column of a dataset array must be a variable of a single type,
each row may contain an observation consisting of measurements of different
types. In this sense, dataset arrays lie somewhere between variables that
enforce complete homogeneity on the data and those that enforce nothing.
Because of the potentially heterogeneous nature of the data, dataset arrays
have indexing methods with syntax that parallels corresponding methods for
cell and structure arrays (see “Heterogeneous Data” on page 2-7).
Dataset Arrays
Dataset arrays are variables created with dataset. For example, the
following creates a dataset array from observations that are a combination of
categorical and numerical measurements:
load fisheriris
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({nominal(species),'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
iris(1:5,:)
ans =
            species    SL     SW     PL     PW
    Obs1    setosa     5.1    3.5    1.4    0.2
    Obs2    setosa     4.9      3    1.4    0.2
    Obs3    setosa     4.7    3.2    1.3    0.2
    Obs4    setosa     4.6    3.1    1.5    0.2
    Obs5    setosa       5    3.6    1.4    0.2
When creating a dataset array, variable names and observation names can be
assigned together with the data. Other metadata associated with the array
can be assigned with set and accessed with get:
iris = set(iris,'Description','Fisher''s Iris Data');
get(iris)
Description: 'Fisher's Iris Data'
Units: {}
DimNames: {'Observations' 'Variables'}
UserData: []
ObsNames: {150x1 cell}
VarNames: {'species' 'SL' 'SW' 'PL' 'PW'}
Dataset arrays are implemented as objects of the dataset class. Methods of
the class are used to display, summarize, convert, concatenate, and access
the collected data. Many of these methods are invoked using operations
analogous to those for numerical arrays, and do not need to be called directly
(for example, [] invokes horzcat). Other methods, such as sortrows, must
be called directly.
Using Dataset Arrays
This section provides an extended tutorial example demonstrating the use of
dataset arrays with methods of the dataset class.
• “Constructing Dataset Arrays” on page 2-25
• “Accessing Dataset Arrays” on page 2-27
• “Combining Dataset Arrays” on page 2-29
• “Removing Observations from Dataset Arrays” on page 2-31
• “Computing with Dataset Arrays” on page 2-31
Constructing Dataset Arrays. Load the 150-by-4 numerical array meas and
the 150-by-1 cell array of strings species:
load fisheriris % Fisher's iris data (1936)
The data are 150 observations of four measured variables (by column number:
sepal length, sepal width, petal length, and petal width, respectively) over
three species of iris (setosa, versicolor, and virginica).
Use dataset to create a dataset array iris from the data, assigning variable
names species, SL, SW, PL, and PW and observation names Obs1, Obs2, Obs3,
etc.:
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({nominal(species),'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
iris(1:5,:)
ans =
            species    SL     SW     PL     PW
    Obs1    setosa     5.1    3.5    1.4    0.2
    Obs2    setosa     4.9      3    1.4    0.2
    Obs3    setosa     4.7    3.2    1.3    0.2
    Obs4    setosa     4.6    3.1    1.5    0.2
    Obs5    setosa       5    3.6    1.4    0.2
The cell array of strings species is first converted to a categorical array of
type nominal before inclusion in the dataset array. For further information
on categorical arrays, see “Categorical Arrays” on page 2-13.
Use set to set properties of the array:
desc = 'Fisher''s iris data (1936)';
units = [{''} repmat({'cm'},1,4)];
info = 'http://en.wikipedia.org/wiki/R.A._Fisher';
iris = set(iris,'Description',desc,...
'Units',units,...
'UserData',info);
Use get to view properties of the array:
get(iris)
    Description: 'Fisher's iris data (1936)'
          Units: {'' 'cm' 'cm' 'cm' 'cm'}
       DimNames: {'Observations' 'Variables'}
       UserData: 'http://en.wikipedia.org/wiki/R.A._Fisher'
       ObsNames: {150x1 cell}
       VarNames: {'species' 'SL' 'SW' 'PL' 'PW'}
get(iris(1:5,:),'ObsNames')
ans =
'Obs1'
'Obs2'
'Obs3'
'Obs4'
'Obs5'
For a table of accessible properties of dataset arrays, with descriptions, see
the reference on the dataset class.
Accessing Dataset Arrays. Dataset arrays support multiple types of
indexing. Like the numerical matrices described in “Numerical Data” on page
2-4, parenthesis () indexing is used to access data subsets. Like the cell
and structure arrays described in “Heterogeneous Data” on page 2-7, dot .
indexing is used to access data variables and curly brace {} indexing is used
to access data elements.
Use parenthesis indexing to assign a subset of the data in iris to a new
dataset array iris1:
iris1 = iris(1:5,2:3)
iris1 =
            SL     SW
    Obs1    5.1    3.5
    Obs2    4.9      3
    Obs3    4.7    3.2
    Obs4    4.6    3.1
    Obs5      5    3.6
Similarly, use parenthesis indexing to assign new data to the first variable
in iris1:
iris1(:,1) = dataset([5.2;4.9;4.6;4.6;5])
iris1 =
            SL     SW
    Obs1    5.2    3.5
    Obs2    4.9      3
    Obs3    4.6    3.2
    Obs4    4.6    3.1
    Obs5      5    3.6
Variable and observation names can also be used to access data:
SepalObs = iris1({'Obs1','Obs3','Obs5'},'SL')
SepalObs =
            SL
    Obs1    5.2
    Obs3    4.6
    Obs5      5
Dot indexing is used to access variables in a dataset array, and can be
combined with other indexing methods. For example, apply zscore to the
data in SepalObs as follows:
ScaledSepalObs = zscore(iris1.SL([1 3 5]))
ScaledSepalObs =
0.8006
-1.1209
0.3203
The following code extracts the sepal lengths in iris1 corresponding to sepal
widths greater than 3:
BigSWLengths = iris1.SL(iris1.SW > 3)
BigSWLengths =
5.2000
4.6000
4.6000
5.0000
Dot indexing also allows entire variables to be deleted from a dataset array:
iris1.SL = []
iris1 =
            SW
    Obs1    3.5
    Obs2      3
    Obs3    3.2
    Obs4    3.1
    Obs5    3.6
Dynamic variable naming works for dataset arrays just as it does for structure
arrays. For example, the units of the SW variable are changed in iris1 as
follows:
varname = 'SW';
iris1.(varname) = iris1.(varname)*10
iris1 =
            SW
    Obs1    35
    Obs2    30
    Obs3    32
    Obs4    31
    Obs5    36
iris1 = set(iris1,'Units',{'mm'});
Curly brace indexing is used to access individual data elements. The following
are equivalent:
iris1{1,1}
ans =
35
iris1{'Obs1','SW'}
ans =
35
Combining Dataset Arrays. Combine two dataset arrays into a single
dataset array using square brackets:
SepalData = iris(:,{'SL','SW'});
PetalData = iris(:,{'PL','PW'});
newiris = [SepalData,PetalData];
size(newiris)
ans =
   150     4
For horizontal concatenation, as in the preceding example, the number of
observations in the two dataset arrays must agree. Observations are matched
up by name (if given), regardless of their order in the two data sets.
The following concatenates variables within a dataset array and then deletes
the component variables:
newiris.SepalData = [newiris.SL,newiris.SW];
newiris.PetalData = [newiris.PL,newiris.PW];
newiris(:,{'SL','SW','PL','PW'}) = [];
size(newiris)
ans =
   150     2
size(newiris.SepalData)
ans =
   150     2
newiris is now a 150-by-2 dataset array containing two 150-by-2 numerical
arrays as variables.
Vertical concatenation is also handled in a manner analogous to numerical
arrays:
newobs = dataset({[5.3 4.2; 5.0 4.1],'PetalData'},...
{[5.5 2; 4.8 2.1],'SepalData'});
newiris = [newiris;newobs];
size(newiris)
ans =
   152     2
For vertical concatenation, as in the preceding example, the names of the
variables in the two dataset arrays must agree. Variables are matched up by
name, regardless of their order in the two data sets.
Expansion of variables is also accomplished using direct assignment to new
rows:
newiris(153,:) = dataset({[5.1 4.0],'PetalData'},...
{[5.1 4.2],'SepalData'});
A different type of concatenation is performed by join, which takes the data
in one dataset array and assigns it to the rows of another dataset array, based
on matching values in a common key variable. For example, the following
creates a dataset array with diploid chromosome counts for each species of iris:
snames = nominal({'setosa';'versicolor';'virginica'});
CC = dataset({snames,'species'},{[38;108;70],'cc'})
CC =
    species       cc
    setosa         38
    versicolor    108
    virginica      70
This data is broadcast to the rows of iris using join:
iris2 = join(iris,CC);
iris2([1 2 51 52 101 102],:)
ans =
              species       SL     SW     PL     PW     cc
    Obs1      setosa        5.1    3.5    1.4    0.2     38
    Obs2      setosa        4.9      3    1.4    0.2     38
    Obs51     versicolor      7    3.2    4.7    1.4    108
    Obs52     versicolor    6.4    3.2    4.5    1.5    108
    Obs101    virginica     6.3    3.3      6    2.5     70
    Obs102    virginica     5.8    2.7    5.1    1.9     70
Removing Observations from Dataset Arrays. Use one of the following
commands to remove observations or variables from a dataset (ds):
• Remove a variable by name:
ds.var = [];
• Remove the jth variable:
ds(:,j) = [];
• Remove the jth variable:
ds = ds(:,[1:(j-1) (j+1):end]);
• Remove the ith observation:
ds(i,:) = [];
• Remove the ith observation:
ds = ds([1:(i-1) (i+1):end],:);
• Remove the jth variable and ith observation:
ds = ds([1:(i-1) (i+1):end],[1:(j-1) (j+1):end]);
Computing with Dataset Arrays. Use summary to provide summary
statistics for the component variables of a dataset array:
summary(newiris)
Fisher's iris data (1936)

SepalData: [153x2 double]
             min       4.3000         2
             1st Q     5.1000    2.8000
             median    5.8000         3
             3rd Q     6.4000    3.3250
             max       7.9000    4.4000

PetalData: [153x2 double]
             min            1    0.1000
             1st Q     1.6000    0.3000
             median    4.4000    1.3000
             3rd Q     5.1000    1.8000
             max       6.9000    4.2000
To apply other statistical functions, use dot indexing to access relevant
variables:
SepalMeans = mean(newiris.SepalData)
SepalMeans =
    5.8294    3.0503
The same result is obtained with datasetfun, which applies functions to
dataset array variables:
means = datasetfun(@mean,newiris,'UniformOutput',false)
means =
    [1x2 double]    [1x2 double]
SepalMeans = means{1}
SepalMeans =
    5.8294    3.0503
An alternative approach is to cast data in a dataset array as double and
apply statistical functions directly. Compare the following two methods
for computing the covariance of the length and width of the SepalData in
newiris:
covs = datasetfun(@cov,newiris,'UniformOutput',false)
covs =
    [2x2 double]    [2x2 double]
SepalCovs = covs{1}
SepalCovs =
    0.6835   -0.0373
   -0.0373    0.2054
SepalCovs = cov(double(newiris(:,1)))
SepalCovs =
    0.6835   -0.0373
   -0.0373    0.2054
Grouped Data
In this section...
“Grouping Variables” on page 2-34
“Level Order Definition” on page 2-35
“Functions for Grouped Data” on page 2-35
“Using Grouping Variables” on page 2-37
Grouping Variables
Grouping variables are utility variables used to indicate which elements
in a data set are to be considered together when computing statistics and
creating visualizations. They may be numeric vectors, string arrays, cell
arrays of strings, or categorical arrays. Logical vectors can be used to indicate
membership (or not) in a single group.
Grouping variables have the same length as the variables (columns) in a data
set. Observations (rows) i and j are considered to be in the same group if the
values of the corresponding grouping variable are identical at those indices.
Grouping variables with multiple columns are used to specify different groups
within multiple variables.
For example, the following loads the 150-by-4 numerical array meas and the
150-by-1 cell array of strings species into the workspace:
load fisheriris % Fisher's iris data (1936)
The data are 150 observations of four measured variables (by column number:
sepal length, sepal width, petal length, and petal width, respectively)
over three species of iris (setosa, versicolor, and virginica). To group the
observations by species, the following are all acceptable (and equivalent)
grouping variables:
group1 = species;             % Cell array of strings
group2 = grp2idx(species);    % Numeric vector
group3 = char(species);       % Character array
group4 = nominal(species);    % Categorical array
These grouping variables can be supplied as input arguments to any of the
functions described in “Functions for Grouped Data” on page 2-35. Examples
are given in “Using Grouping Variables” on page 2-37.
Level Order Definition
Each level of a grouping variable defines a group. The levels and the order of
levels are decided as follows:
• For a categorical vector G, the set of group levels and their order match the
output of the getlabels(G) method.
• For a numeric vector or a logical vector G, the set of group levels is the
distinct values of G. The order is the sorted order of the unique values.
• For a cell vector of strings or a character matrix G, the set of group levels
is the distinct strings of G. The order for strings is the order of their first
appearance in G.
Some functions, such as grpstats, can take a cell array of several grouping
variables (such as {G1 G2 G3}) to group the observations in the data set by
each combination of the grouping variable levels. The order is decided first
by the order of the first grouping variable, then by the order of the second
grouping variable, and so on.
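For example, a minimal sketch of grouping by two variables at once (reusing
the fisheriris data and the median-based ordinal binning shown earlier in
this chapter) might look like this:

load fisheriris                          % Fisher's iris data (1936)
sl = meas(:,1);                          % Sepal length
binned = ordinal(sl,{'short','long'},[],...
                 [min(sl),median(sl),max(sl)]);
% Count observations for each combination of species and bin
counts = grpstats(sl,{species,binned},'numel')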
Functions for Grouped Data
The following table lists Statistics Toolbox functions that accept a grouping
variable group as an input argument. The grouping variable may be in the
form of a vector, string array, cell array of strings, or categorical array, as
described in “Grouping Variables” on page 2-34.
For a full description of the syntax of any particular function, and examples
of its use, consult its reference page, linked from the table. “Using Grouping
Variables” on page 2-37 also includes examples.
Function           Basic Syntax for Grouped Data
andrewsplot        andrewsplot(X, ... ,'Group',group)
anova1             p = anova1(X,group)
anovan             p = anovan(x,group)
aoctool            aoctool(x,y,group)
boxplot            boxplot(x,group)
classify           class = classify(sample,training,group)
controlchart       controlchart(x,group)
crosstab           crosstab(group1,group2)
cvpartition        c = cvpartition(group,'Kfold',k) or
                   c = cvpartition(group,'Holdout',p)
dummyvar           D = dummyvar(group)
gagerr             gagerr(x,group)
gplotmatrix        gplotmatrix(x,y,group)
grp2idx            [G,GN] = grp2idx(group)
grpstats           means = grpstats(X,group)
gscatter           gscatter(x,y,group)
interactionplot    interactionplot(X,group)
kruskalwallis      p = kruskalwallis(X,group)
maineffectsplot    maineffectsplot(X,group)
manova1            d = manova1(X,group)
multivarichart     multivarichart(x,group)
parallelcoords     parallelcoords(X, ... ,'Group',group)
silhouette         silhouette(X,group)
tabulate           tabulate(group)
treefit            T = treefit(X,y,'cost',S) or
                   T = treefit(X,y,'priorprob',S),
                   where S.group = group
vartestn           vartestn(X,group)
Using Grouping Variables
This section provides an example demonstrating the use of grouping variables
and associated functions. Grouping variables are introduced in “Grouping
Variables” on page 2-34. A list of functions accepting grouping variables as
input arguments is given in “Functions for Grouped Data” on page 2-35.
Load the 150-by-4 numerical array meas and the 150-by-1 cell array of strings
species:
load fisheriris % Fisher's iris data (1936)
The data are 150 observations of four measured variables (by column number:
sepal length, sepal width, petal length, and petal width, respectively) over
three species of iris (setosa, versicolor, and virginica).
Create a categorical array (see “Categorical Arrays” on page 2-13) from
species to use as a grouping variable:
group = nominal(species);
While species, as a cell array of strings, is itself a grouping variable, the
categorical array has the advantage that it can be easily manipulated with
methods of the categorical class.
Compute some basic statistics for the data (median and interquartile range),
by group, using the grpstats function:
[order,number,group_median,group_iqr] = ...
grpstats(meas,group,{'gname','numel',@median,@iqr})
order =
    'setosa'
    'versicolor'
    'virginica'
number =
    50    50    50    50
    50    50    50    50
    50    50    50    50
group_median =
    5.0000    3.4000    1.5000    0.2000
    5.9000    2.8000    4.3500    1.3000
    6.5000    3.0000    5.5500    2.0000
group_iqr =
    0.4000    0.5000    0.2000    0.1000
    0.7000    0.5000    0.6000    0.3000
    0.7000    0.4000    0.8000    0.5000
The statistics appear in 3-by-4 arrays, corresponding to the 3 groups
(categories) and 4 variables in the data. The order of the groups (not encoded
in the nominal array group) is indicated by the group names in order.
To improve the labeling of the data, create a dataset array (see “Dataset
Arrays” on page 2-23) from meas:
NumObs = size(meas,1);
NameObs = strcat({'Obs'},num2str((1:NumObs)','%-d'));
iris = dataset({group,'species'},...
{meas,'SL','SW','PL','PW'},...
'ObsNames',NameObs);
When you call grpstats with a dataset array as an argument, you invoke the
grpstats method of the dataset class, rather than the grpstats function.
The method has a slightly different syntax than the function, but it returns
the same results, with better labeling:
stats = grpstats(iris,'species',{@median,@iqr})
stats =
                  species       GroupCount
    setosa        setosa        50
    versicolor    versicolor    50
    virginica     virginica     50

                  median_SL    iqr_SL
    setosa        5            0.4
    versicolor    5.9          0.7
    virginica     6.5          0.7

                  median_SW    iqr_SW
    setosa        3.4          0.5
    versicolor    2.8          0.5
    virginica     3            0.4

                  median_PL    iqr_PL
    setosa        1.5          0.2
    versicolor    4.35         0.6
    virginica     5.55         0.8

                  median_PW    iqr_PW
    setosa        0.2          0.1
    versicolor    1.3          0.3
    virginica     2            0.5
Grouping variables are also used to create visualizations based on categories
of observations. The following scatter plot, created with the gscatter
function, shows the correlation between sepal length and sepal width in two
species of iris. Use ismember to subset the two species from group:
subset = ismember(group,{'setosa','versicolor'});
scattergroup = group(subset);
gscatter(iris.SL(subset),...
iris.SW(subset),...
scattergroup)
xlabel('Sepal Length')
ylabel('Sepal Width')
3
Descriptive Statistics
• “Introduction” on page 3-2
• “Measures of Central Tendency” on page 3-3
• “Measures of Dispersion” on page 3-5
• “Measures of Shape” on page 3-7
• “Resampling Statistics” on page 3-9
• “Data with Missing Values” on page 3-14
Introduction
You may need to summarize large, complex data sets—both numerically
and visually—to convey their essence to the data analyst and to allow for
further processing. This chapter focuses on numerical summaries; Chapter 4,
“Statistical Visualization” focuses on visual summaries.
Measures of Central Tendency
Measures of central tendency locate a distribution of data along an
appropriate scale.
The following table lists the functions that calculate the measures of central
tendency.
Function Name    Description
geomean          Geometric mean
harmmean         Harmonic mean
mean             Arithmetic average
median           50th percentile
mode             Most frequent value
trimmean         Trimmed mean
The average is a simple and popular estimate of location. If the data sample
comes from a normal distribution, then the sample mean is also optimal
(MVUE of µ).
Unfortunately, outliers, data entry errors, or glitches exist in almost all
real data. The sample mean is sensitive to these problems. One bad data
value can move the average away from the center of the rest of the data by
an arbitrarily large distance.
The median and trimmed mean are two measures that are resistant (robust)
to outliers. The median is the 50th percentile of the sample, which will only
change slightly if you add a large perturbation to any value. The idea behind
the trimmed mean is to ignore a small percentage of the highest and lowest
values of a sample when determining the center of the sample.
The geometric mean and harmonic mean, like the average, are not robust
to outliers. They are useful when the sample follows a lognormal or heavily
skewed distribution.
The following example shows the behavior of the measures of location for a
sample with one outlier.
x = [ones(1,6) 100]
x =
     1     1     1     1     1     1   100
locate = [geomean(x) harmmean(x) mean(x) median(x)...
trimmean(x,25)]
locate =
    1.9307    1.1647   15.1429    1.0000    1.0000
You can see that the mean is far from any data value because of the influence
of the outlier. The median and trimmed mean ignore the outlying value and
describe the location of the rest of the data values.
Measures of Dispersion
The purpose of measures of dispersion is to find out how spread out the data
values are on the number line. Another term for these statistics is measures
of spread.
The table gives the function names and descriptions.
Function Name    Description
iqr              Interquartile range
mad              Mean absolute deviation
moment           Central moment of all orders
range            Range
std              Standard deviation
var              Variance
The range (the difference between the maximum and minimum values) is the
simplest measure of spread. But if there is an outlier in the data, it will be the
minimum or maximum value. Thus, the range is not robust to outliers.
The standard deviation and the variance are popular measures of spread that
are optimal for normally distributed samples. The sample variance is the
MVUE of the normal parameter σ². The standard deviation is the square root
of the variance and has the desirable property of being in the same units as
the data. That is, if the data is in meters, the standard deviation is in meters
as well. The variance is in meters², which is more difficult to interpret.
Neither the standard deviation nor the variance is robust to outliers. A data
value that is separate from the body of the data can increase the value of the
statistics by an arbitrarily large amount.
The mean absolute deviation (MAD) is also sensitive to outliers. But the
MAD does not move quite as much as the standard deviation or variance in
response to bad data.
The interquartile range (IQR) is the difference between the 75th and 25th
percentile of the data. Since only the middle 50% of the data affects this
measure, it is robust to outliers.
The following example shows the behavior of the measures of dispersion for a
sample with one outlier.
x = [ones(1,6) 100]
x =
     1     1     1     1     1     1   100
stats = [iqr(x) mad(x) range(x) std(x)]
stats =
         0   24.2449   99.0000   37.4185
Measures of Shape
Quantiles and percentiles provide information about the shape of data as
well as its location and spread.
The quantile of order p (0 ≤ p ≤ 1) is the smallest x value where the cumulative
distribution function equals or exceeds p. The function quantile computes
quantiles as follows:
1 n sorted data points are the 0.5/n, 1.5/n, ..., (n–0.5)/n quantiles.
2 Linear interpolation is used to compute intermediate quantiles.
3 The data min or max are assigned to quantiles outside the range.
4 Missing values are treated as NaN, and removed from the data.
Percentiles, computed by the prctile function, are quantiles for a certain
percentage of the data, specified for 0 ≤ p ≤ 100.
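The following small sketch (not part of the original example) illustrates the
first three rules on four sorted data points:

x = [2 4 6 8];                         % n = 4 sorted points
quantile(x,[0.125 0.375 0.625 0.875])  % the data points themselves (rule 1)
quantile(x,0.5)                        % interpolates between 4 and 6: returns 5 (rule 2)
quantile(x,0)                          % below the lowest quantile: returns the minimum, 2 (rule 3)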
The following example shows the result of looking at every quartile (quantiles
with orders that are multiples of 0.25) of a sample containing a mixture of
two distributions.
x = [normrnd(4,1,1,100) normrnd(6,0.5,1,200)];
p = 100*(0:0.25:1);
y = prctile(x,p);
z = [p;y]
z =
         0   25.0000   50.0000   75.0000  100.0000
    1.8293    4.6728    5.6459    6.0766    7.1546
A box plot helps to visualize the statistics:
boxplot(x)
The long lower tail and plus signs show the lack of symmetry in the sample
values. For more information on box plots, see “Box Plots” on page 4-6.
The shape of a data distribution is also measured by the Statistics Toolbox
functions skewness, kurtosis, and, more generally, moment.
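As a minimal sketch (using a hypothetical right-skewed sample), these shape
measures might be computed as follows:

x = exprnd(10,100,1);   % exponential sample: right-skewed
skewness(x)             % positive for a right-skewed sample
kurtosis(x)             % 3 for a normal distribution; larger for heavy tails
moment(x,3)             % third central moment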
Resampling Statistics
In this section...
“The Bootstrap” on page 3-9
“The Jackknife” on page 3-12
“Parallel Computing Support for Resampling Methods” on page 3-13
The Bootstrap
The bootstrap procedure involves choosing random samples with replacement
from a data set and analyzing each sample the same way. Sampling with
replacement means that each observation is selected separately at random
from the original dataset. So a particular data point from the original data
set could appear multiple times in a given bootstrap sample. The number of
elements in each bootstrap sample equals the number of elements in the
original data set. The range of sample estimates you obtain enables you to
establish the uncertainty of the quantity you are estimating.
This example from Efron and Tibshirani [33] compares Law School Admission
Test (LSAT) scores and subsequent law school grade point average (GPA) for
a sample of 15 law schools.
load lawdata
plot(lsat,gpa,'+')
lsline
The least-squares fit line indicates that higher LSAT scores go with higher
law school GPAs. But how certain is this conclusion? The plot provides some
intuition, but nothing quantitative.
You can calculate the correlation coefficient of the variables using the corr
function.
rhohat = corr(lsat,gpa)
rhohat =
0.7764
Now you have a number describing the positive connection between LSAT
and GPA; though it may seem large, you still do not know if it is statistically
significant.
Using the bootstrp function you can resample the lsat and gpa vectors as
many times as you like and consider the variation in the resulting correlation
coefficients.
Here is an example.
rhos1000 = bootstrp(1000,'corr',lsat,gpa);
This command resamples the lsat and gpa vectors 1000 times and computes
the corr function on each sample. Here is a histogram of the result.
hist(rhos1000,30)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
Nearly all the estimates lie on the interval [0.4 1.0].
It is often desirable to construct a confidence interval for a parameter
estimate in statistical inferences. Using the bootci function, you can use
bootstrapping to obtain a confidence interval. The confidence interval for the
lsat and gpa data is computed as:
ci = bootci(5000,@corr,lsat,gpa)
ci =
0.3313
0.9427
Therefore, a 95% confidence interval for the correlation coefficient between
LSAT and GPA is [0.33 0.94]. This is strong quantitative evidence that LSAT
and subsequent GPA are positively correlated. Moreover, this evidence does
not require any strong assumptions about the probability distribution of the
correlation coefficient.
Although the bootci function computes the bias corrected and accelerated
(BCa) interval as the default type, it can also compute various other
types of bootstrap confidence intervals, such as the studentized bootstrap
confidence interval.
The Jackknife
Similar to the bootstrap is the jackknife, which uses resampling to estimate
the bias of a sample statistic. Sometimes it is also used to estimate standard
error of the sample statistic. The jackknife is implemented by the Statistics
Toolbox function jackknife.
The jackknife resamples systematically, rather than at random as the
bootstrap does. For a sample with n points, the jackknife computes sample
statistics on n separate samples of size n-1. Each sample is the original data
with a single observation omitted.
In the previous bootstrap example you measured the uncertainty in
estimating the correlation coefficient. You can use the jackknife to estimate
the bias, which is the tendency of the sample correlation to over-estimate or
under-estimate the true, unknown correlation. First compute the sample
correlation on the data:
load lawdata
rhohat = corr(lsat,gpa)
rhohat =
0.7764
Next compute the correlations for jackknife samples, and compute their mean:
jackrho = jackknife(@corr,lsat,gpa);
meanrho = mean(jackrho)
meanrho =
0.7759
Now compute an estimate of the bias:
n = length(lsat);
biasrho = (n-1) * (meanrho-rhohat)
biasrho =
-0.0065
The sample correlation probably underestimates the true correlation by about
this amount.
Parallel Computing Support for Resampling Methods
For information on computing resampling statistics in parallel, see Chapter
17, “Parallel Statistics”.
Data with Missing Values
Many data sets have one or more missing values. It is convenient to code
missing values as NaN (Not a Number) to preserve the structure of data sets
across multiple variables and observations.
For example:
X = magic(3);
X([1 5]) = [NaN NaN]
X =
   NaN     1     6
     3   NaN     7
     4     9     2
Normal MATLAB arithmetic operations yield NaN values when operands
are NaN:
s1 = sum(X)
s1 =
   NaN   NaN    15
Removing the NaN values would destroy the matrix structure. Removing
the rows containing the NaN values would discard data. Statistics Toolbox
functions in the following table remove NaN values only for the purposes of
computation.
Function     Description
nancov       Covariance matrix, ignoring NaN values
nanmax       Maximum, ignoring NaN values
nanmean      Mean, ignoring NaN values
nanmedian    Median, ignoring NaN values
nanmin       Minimum, ignoring NaN values
nanstd       Standard deviation, ignoring NaN values
nansum       Sum, ignoring NaN values
nanvar       Variance, ignoring NaN values
For example:
s2 = nansum(X)
s2 =
     7    10    15
Other Statistics Toolbox functions also ignore NaN values. These include iqr,
kurtosis, mad, prctile, range, skewness, and trimmean.
4
Statistical Visualization
• “Introduction” on page 4-2
• “Scatter Plots” on page 4-3
• “Box Plots” on page 4-6
• “Distribution Plots” on page 4-8
Introduction
Statistics Toolbox data visualization functions add to the extensive graphics
capabilities already in MATLAB.
• Scatter plots are a basic visualization tool for multivariate data. They
are used to identify relationships among variables. Grouped versions of
these plots use different plotting symbols to indicate group membership.
The gname function is used to label points on these plots with a text label
or an observation number.
• Box plots display a five-number summary of a set of data: the median,
the two ends of the interquartile range (the box), and two extreme values
(the whiskers) above and below the box. Because they show less detail
than histograms, box plots are most useful for side-by-side comparisons
of two distributions.
• Distribution plots help you identify an appropriate distribution family
for your data. They include normal and Weibull probability plots,
quantile-quantile plots, and empirical cumulative distribution plots.
Advanced Statistics Toolbox visualization functions are available for
specialized statistical analyses.
Note For information on creating visualizations of data by group, see
“Grouped Data” on page 2-34.
Scatter Plots
A scatter plot is a simple plot of one variable against another. The MATLAB
functions plot and scatter produce scatter plots. The MATLAB function
plotmatrix can produce a matrix of such plots showing the relationship
between several pairs of variables.
Statistics Toolbox functions gscatter and gplotmatrix produce grouped
versions of these plots. These are useful for determining whether the values
of two variables or the relationship between those variables is the same in
each group.
Suppose you want to examine the weight and mileage of cars from three
different model years.
load carsmall
gscatter(Weight,MPG,Model_Year,'','xos')
This shows that not only is there a strong relationship between the weight of
a car and its mileage, but also that newer cars tend to be lighter and have
better gas mileage than older cars.
The default arguments for gscatter produce a scatter plot with the different
groups shown with the same symbol but different colors. The last two
arguments above request that all groups be shown in default colors and with
different symbols.
The carsmall data set contains other variables that describe different aspects
of cars. You can examine several of them in a single display by creating a
grouped plot matrix.
xvars = [Weight Displacement Horsepower];
yvars = [MPG Acceleration];
gplotmatrix(xvars,yvars,Model_Year,'','xos')
The upper right subplot displays MPG against Horsepower, and shows that
over the years the horsepower of the cars has decreased but the gas mileage
has improved.
The gplotmatrix function can also graph all pairs from a single list of
variables, along with histograms for each variable. See “MANOVA” on page
8-39.
Box Plots
The graph below, created with the boxplot command, compares petal lengths
in samples from two species of iris.
load fisheriris
s1 = meas(51:100,3);
s2 = meas(101:150,3);
boxplot([s1 s2],'notch','on',...
'labels',{'versicolor','virginica'})
This plot has the following features:
• The tops and bottoms of each “box” are the 25th and 75th percentiles of the
samples, respectively. The distances between the tops and bottoms are the
interquartile ranges.
• The line in the middle of each box is the sample median. If the median is
not centered in the box, it shows sample skewness.
• The whiskers are lines extending above and below each box. Whiskers are
drawn from the ends of the interquartile ranges to the furthest observations
within the whisker length (the adjacent values).
• Observations beyond the whisker length are marked as outliers. By
default, an outlier is a value that is more than 1.5 times the interquartile
range away from the top or bottom of the box, but this value can be adjusted
with additional input arguments (see the sketch following this list). Outliers
are displayed with a red + sign.
• Notches display the variability of the median between samples. The width
of a notch is computed so that box plots whose notches do not overlap (as
above) have different medians at the 5% significance level. The significance
level is based on a normal distribution assumption, but comparisons of
medians are reasonably robust for other distributions. Comparing box-plot
medians is like a visual hypothesis test, analogous to the t test used for
means.
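A hedged sketch of the whisker adjustment mentioned in the list above (the
value 1.0 is arbitrary; the default is 1.5):

load fisheriris
s1 = meas(51:100,3);
s2 = meas(101:150,3);
% Same comparison as above, but with shorter whiskers, so more
% observations are flagged as outliers
boxplot([s1 s2],'notch','on',...
        'labels',{'versicolor','virginica'},...
        'whisker',1.0)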
Distribution Plots
In this section...
“Normal Probability Plots” on page 4-8
“Quantile-Quantile Plots” on page 4-10
“Cumulative Distribution Plots” on page 4-12
“Other Probability Plots” on page 4-14
Normal Probability Plots
Normal probability plots are used to assess whether data comes from a
normal distribution. Many statistical procedures make the assumption that
an underlying distribution is normal, so normal probability plots can provide
some assurance that the assumption is justified, or else provide a warning of
problems with the assumption. An analysis of normality typically combines
normal probability plots with hypothesis tests for normality, as described in
Chapter 7, “Hypothesis Tests”.
The following example shows a normal probability plot created with the
normplot function.
x = normrnd(10,1,25,1);
normplot(x)
The plus signs plot the empirical probability versus the data value for each
point in the data. A solid line connects the 25th and 75th percentiles in the
data, and a dashed line extends it to the ends of the data. The y-axis values
are probabilities from zero to one, but the scale is not linear. The distance
between tick marks on the y-axis matches the distance between the quantiles
of a normal distribution. The quantiles are close together near the median
(probability = 0.5) and stretch out symmetrically as you move away from
the median.
In a normal probability plot, if all the data points fall near the line, an
assumption of normality is reasonable. Otherwise, the points will curve away
from the line, and an assumption of normality is not justified.
For example:
x = exprnd(10,100,1);
normplot(x)
The plot is strong evidence that the underlying distribution is not normal.
Quantile-Quantile Plots
Quantile-quantile plots are used to determine whether two samples come from
the same distribution family. They are scatter plots of quantiles computed
from each sample, with a line drawn between the first and third quartiles. If
the data falls near the line, it is reasonable to assume that the two samples
come from the same distribution. The method is robust with respect to
changes in the location and scale of either distribution.
To create a quantile-quantile plot, use the qqplot function.
The following example shows a quantile-quantile plot of two samples from
Poisson distributions.
x = poissrnd(10,50,1);
y = poissrnd(5,100,1);
qqplot(x,y);
Even though the parameters and sample sizes are different, the approximate
linear relationship suggests that the two samples may come from the same
distribution family. As with normal probability plots, hypothesis tests,
as described in Chapter 7, “Hypothesis Tests”, can provide additional
justification for such an assumption. For statistical procedures that depend
on the two samples coming from the same distribution, however, a linear
quantile-quantile plot is often sufficient.
The following example shows what happens when the underlying distributions
are not the same.
x = normrnd(5,1,100,1);
y = wblrnd(2,0.5,100,1);
qqplot(x,y);
These samples clearly are not from the same distribution family.
Cumulative Distribution Plots
An empirical cumulative distribution function (cdf) plot shows the proportion
of data less than each x value, as a function of x. The scale on the y-axis is
linear; in particular, it is not scaled to any particular distribution. Empirical
cdf plots are used to compare data cdfs to cdfs for particular distributions.
To create an empirical cdf plot, use the cdfplot function (or ecdf and stairs).
The following example compares the empirical cdf for a sample from an
extreme value distribution with a plot of the cdf for the sampling distribution.
In practice, the sampling distribution would be unknown, and would be
chosen to match the empirical cdf.
y = evrnd(0,3,100,1);
cdfplot(y)
hold on
x = -20:0.1:10;
f = evcdf(x,0,3);
plot(x,f,'m')
legend('Empirical','Theoretical','Location','NW')
Other Probability Plots
A probability plot, like the normal probability plot, is just an empirical cdf plot
scaled to a particular distribution. The y-axis values are probabilities from
zero to one, but the scale is not linear. The distance between tick marks is the
distance between quantiles of the distribution. In the plot, a line is drawn
between the first and third quartiles in the data. If the data falls near the
line, it is reasonable to choose the distribution as a model for the data.
To create probability plots for different distributions, use the probplot
function.
For example, the following plot assesses two samples, one from a Weibull
distribution and one from a Rayleigh distribution, to see if they may have
come from a Weibull population.
x1 = wblrnd(3,3,100,1);
x2 = raylrnd(3,100,1);
probplot('weibull',[x1 x2])
legend('Weibull Sample','Rayleigh Sample','Location','NW')
The plot gives justification for modeling the first sample with a Weibull
distribution; much less so for the second sample.
A distribution analysis typically combines probability plots with hypothesis
tests for a particular distribution, as described in Chapter 7, “Hypothesis
Tests”.
Probability Distributions
• “Using Probability Distributions” on page 5-2
• “Supported Distributions” on page 5-3
• “Working with Distributions Through GUIs” on page 5-9
• “Statistics Toolbox Distribution Functions” on page 5-52
• “Using Probability Distribution Objects” on page 5-84
• “Probability Distributions Used for Multivariate Modeling” on page 5-99
Using Probability Distributions
Probability distributions are theoretical distributions based on
assumptions about a source population. They assign probability to the event
that a random variable takes on a specific, discrete value, or falls within a
specified range of continuous values. There are two main types of models:
• Parametric Models—Choose a model based on a parametric family of
probability distributions and then adjust the parameters to fit the data.
For information on supported parametric distributions, see “Parametric
Distributions” on page 5-4.
• Nonparametric Models—When data or statistics do not follow
any standard probability distribution, nonparametric models may be
appropriate. For information on supported nonparametric distributions,
see “Nonparametric Distributions” on page 5-8.
The Statistics Toolbox provides several ways of working with both parametric
and nonparametric probability distributions:
• Graphical User Interfaces (GUIs)—Interact with the distributions to
visualize distributions, fit a distribution to your data, or generate random
data using a specific distribution. For more information, see “Working with
Distributions Through GUIs” on page 5-9.
• Command Line Functions—Use command-line functions to further
explore the distributions, fit relevant models to your data, or generate
random data. For more information on using functions, see “Statistics
Toolbox Distribution Functions” on page 5-52.
• Distribution Objects—Use objects to explore and fit your data to a
distribution, save the results to a single entity, and generate random
data from the resulting parameters. For more information, see “Using
Probability Distribution Objects” on page 5-84.
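As a rough sketch of how the three approaches relate, the following commands fit a normal distribution to the same hypothetical data set interactively, with a command-line function, and with a distribution object; the data and parameter values are arbitrary:
x = normrnd(10,2,100,1);          % hypothetical data
% dfittool(x)                     % GUI: fit distributions to x interactively
[muhat,sigmahat] = normfit(x);    % command-line fitting function
pd = fitdist(x,'normal');         % probability distribution object
r = random(pd,5,1);               % random data from the fitted object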
Supported Distributions
In this section...
“Parametric Distributions” on page 5-4
“Nonparametric Distributions” on page 5-8
Probability distributions supported by the Statistics Toolbox are
cross-referenced with their supporting functions and GUIs in the following
tables. The tables use the following abbreviations for distribution functions:
• pdf — Probability density functions
• cdf — Cumulative distribution functions
• inv — Inverse cumulative distribution functions
• stat — Distribution statistics functions
• fit — Distribution fitting functions
• like — Negative log-likelihood functions
• rnd — Random number generators
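For instance, the Weibull distribution illustrates the naming pattern behind these abbreviations; the parameter values below are arbitrary and serve only as a sketch:
r = wblrnd(2,1.5,100,1);      % rnd:  random numbers
y = wblpdf(1,2,1.5);          % pdf:  density at x = 1
F = wblcdf(1,2,1.5);          % cdf:  cumulative probability at x = 1
q = wblinv(0.5,2,1.5);        % inv:  median (inverse cdf at p = 0.5)
[m,v] = wblstat(2,1.5);       % stat: mean and variance
phat = wblfit(r);             % fit:  parameter estimates from data
nll = wbllike(phat,r);        % like: negative log-likelihood at phat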
For more detailed explanations of each supported distribution, see Appendix
B, “Distribution Reference”.
Parametric Distributions
Continuous Distributions (Data)
• Beta: pdf: betapdf, pdf; cdf: betacdf, cdf; inv: betainv, icdf; stat: betastat; fit: betafit, mle; like: betalike; rnd: betarnd, random, randtool
• Birnbaum-Saunders: fit: dfittool
• Exponential: pdf: exppdf, pdf; cdf: expcdf, cdf; inv: expinv, icdf; stat: expstat; fit: expfit, mle, dfittool; like: explike; rnd: exprnd, random, randtool
• Extreme value: pdf: evpdf, pdf; cdf: evcdf, cdf; inv: evinv, icdf; stat: evstat; fit: evfit, mle, dfittool; like: evlike; rnd: evrnd, random, randtool
• Gamma: pdf: gampdf, pdf; cdf: gamcdf, cdf; inv: gaminv, icdf; stat: gamstat; fit: gamfit, mle, dfittool; like: gamlike; rnd: gamrnd, randg, random, randtool
• Generalized extreme value: pdf: gevpdf, pdf; cdf: gevcdf, cdf; inv: gevinv, icdf; stat: gevstat; fit: gevfit, mle, dfittool; like: gevlike; rnd: gevrnd, random, randtool
• Generalized Pareto: pdf: gppdf, pdf; cdf: gpcdf, cdf; inv: gpinv, icdf; stat: gpstat; fit: gpfit, mle, dfittool; like: gplike; rnd: gprnd, random, randtool
• Inverse Gaussian: fit: dfittool
• Johnson system: fit: johnsrnd; rnd: johnsrnd
• Logistic: fit: dfittool
• Loglogistic: fit: dfittool
• Lognormal: pdf: lognpdf, pdf; cdf: logncdf, cdf; inv: logninv, icdf; stat: lognstat; fit: lognfit, mle, dfittool; like: lognlike; rnd: lognrnd, random, randtool
• Nakagami: fit: dfittool
• Normal (Gaussian): pdf: normpdf, pdf; cdf: normcdf, cdf; inv: norminv, icdf; stat: normstat; fit: normfit, mle, dfittool; like: normlike; rnd: normrnd, randn, random, randtool
• Pearson system: fit: pearsrnd; rnd: pearsrnd
• Piecewise: pdf: pdf; cdf: cdf; inv: icdf; fit: paretotails; rnd: random
• Rayleigh: pdf: raylpdf, pdf; cdf: raylcdf, cdf; inv: raylinv, icdf; stat: raylstat; fit: raylfit, mle, dfittool; rnd: raylrnd, random, randtool
• Rician: fit: dfittool
• Uniform (continuous): pdf: unifpdf, pdf; cdf: unifcdf, cdf; inv: unifinv, icdf; stat: unifstat; fit: unifit, mle; rnd: unifrnd, rand, random
• Weibull: pdf: wblpdf, pdf; cdf: wblcdf, cdf; inv: wblinv, icdf; stat: wblstat; fit: wblfit, mle, dfittool; like: wbllike; rnd: wblrnd, random
Continuous Distributions (Statistics)
• Chi-square: pdf: chi2pdf, pdf; cdf: chi2cdf, cdf; inv: chi2inv, icdf; stat: chi2stat; rnd: chi2rnd, random, randtool
• F: pdf: fpdf, pdf; cdf: fcdf, cdf; inv: finv, icdf; stat: fstat; rnd: frnd, random, randtool
• Noncentral chi-square: pdf: ncx2pdf, pdf; cdf: ncx2cdf, cdf; inv: ncx2inv, icdf; stat: ncx2stat; rnd: ncx2rnd, random, randtool
• Noncentral F: pdf: ncfpdf, pdf; cdf: ncfcdf, cdf; inv: ncfinv, icdf; stat: ncfstat; rnd: ncfrnd, random, randtool
• Noncentral t: pdf: nctpdf, pdf; cdf: nctcdf, cdf; inv: nctinv, icdf; stat: nctstat; rnd: nctrnd, random, randtool
• Student’s t: pdf: tpdf, pdf; cdf: tcdf, cdf; inv: tinv, icdf; stat: tstat; rnd: trnd, random, randtool
• t location-scale: fit: dfittool
Discrete Distributions
• Binomial: pdf: binopdf, pdf; cdf: binocdf, cdf; inv: binoinv, icdf; stat: binostat; fit: binofit, mle, dfittool; rnd: binornd, random, randtool
• Bernoulli: fit: mle
• Geometric: pdf: geopdf, pdf; cdf: geocdf, cdf; inv: geoinv, icdf; stat: geostat; fit: mle; rnd: geornd, random, randtool
• Hypergeometric: pdf: hygepdf, pdf; cdf: hygecdf, cdf; inv: hygeinv, icdf; stat: hygestat; rnd: hygernd, random
• Multinomial: pdf: mnpdf; rnd: mnrnd
• Negative binomial: pdf: nbinpdf, pdf; cdf: nbincdf, cdf; inv: nbininv, icdf; stat: nbinstat; fit: nbinfit, mle, dfittool; rnd: nbinrnd, random, randtool
• Poisson: pdf: poisspdf, pdf; cdf: poisscdf, cdf; inv: poissinv, icdf; stat: poisstat; fit: poissfit, mle, dfittool; rnd: poissrnd, random, randtool
• Uniform (discrete): pdf: unidpdf, pdf; cdf: unidcdf, cdf; inv: unidinv, icdf; stat: unidstat; fit: mle; rnd: unidrnd, random, randtool
Multivariate Distributions
• Gaussian copula: pdf: copulapdf; cdf: copulacdf; stat: copulastat; fit: copulafit; rnd: copularnd
• Gaussian mixture: pdf: pdf; cdf: cdf; fit: fit; rnd: random
• t copula: pdf: copulapdf; cdf: copulacdf; stat: copulastat; fit: copulafit; rnd: copularnd
• Clayton copula: pdf: copulapdf; cdf: copulacdf; stat: copulastat; fit: copulafit; rnd: copularnd
• Frank copula: pdf: copulapdf; cdf: copulacdf; stat: copulastat; fit: copulafit; rnd: copularnd
• Gumbel copula: pdf: copulapdf; cdf: copulacdf; stat: copulastat; fit: copulafit; rnd: copularnd
• Inverse Wishart: rnd: iwishrnd
• Multivariate normal: pdf: mvnpdf; cdf: mvncdf; rnd: mvnrnd
• Multivariate t: pdf: mvtpdf; cdf: mvtcdf; rnd: mvtrnd
• Wishart: rnd: wishrnd
Nonparametric Distributions
• Nonparametric: pdf: ksdensity; cdf: ksdensity; inv: ksdensity; fit: ksdensity, dfittool
Working with Distributions Through GUIs
In this section...
“Exploring Distributions” on page 5-9
“Modeling Data Using the Distribution Fitting Tool” on page 5-11
“Visually Exploring Random Number Generation” on page 5-49
This section describes Statistics Toolbox GUIs that provide convenient,
interactive access to the distribution functions described in “Statistics Toolbox
Distribution Functions” on page 5-52.
Exploring Distributions
To interactively see the influence of parameter changes on the shapes of the
pdfs and cdfs of supported Statistics Toolbox distributions, use the Probability
Distribution Function Tool.
Run the tool by typing disttool at the command line.
The tool window provides controls for choosing the distribution and the function type (cdf or pdf), a function plot with draggable reference lines, a function value readout, and parameter values, bounds, and controls, plus fields for any additional parameters.
Start by selecting a distribution. Then choose the function type: probability
density function (pdf) or cumulative distribution function (cdf).
After the plot appears, you can
• Calculate a new function value by:
  - Typing a new x value in the text box on the x-axis.
  - Dragging the vertical reference line.
  - Clicking in the figure where you want the line to be.
  The new function value appears in the text box to the left of the plot.
• For cdf plots, find critical values corresponding to a specific probability by
typing the desired probability in the text box on the y-axis or by dragging
the horizontal reference line.
• Use the controls at the bottom of the window to set parameter values for
the distribution and to change their upper and lower bounds.
Modeling Data Using the Distribution Fitting Tool
The Distribution Fitting Tool is a GUI for fitting univariate distributions to
data. This section describes how to use the Distribution Fitting Tool and
covers the following topics:
• “Opening the Distribution Fitting Tool” on page 5-12
• “Creating and Managing Data Sets” on page 5-14
• “Creating a New Fit” on page 5-19
• “Displaying Results” on page 5-24
• “Managing Fits” on page 5-26
• “Evaluating Fits” on page 5-28
• “Excluding Data” on page 5-32
• “Saving and Loading Sessions” on page 5-38
• “Example: Fitting a Distribution” on page 5-39
• “Generating a File to Fit and Plot Distributions” on page 5-46
• “Using Custom Distributions” on page 5-47
• “Additional Distributions Available in the Distribution Fitting Tool” on
page 5-49
Opening the Distribution Fitting Tool
To open the Distribution Fitting Tool, enter the command
dfittool
The main window provides controls to select the display type and, for probability plots only, the distribution, along with task buttons to import data from the workspace, create a new fit, manage multiple fits, evaluate distributions at selected points, and exclude data from a fit.
Adjusting the Plot. Buttons at the top of the tool allow you to adjust the
plot displayed in this window:
• Toggle the legend on (default) or off.
• Toggle grid lines on or off (default).
• Restore default axes limits.
Displaying the Data. The Display type field specifies the type of plot
displayed in the main window. Each type corresponds to a probability
function, for example, a probability density function. The following display
types are available:
• Density (PDF) — Display a probability density function (PDF) plot for
the fitted distribution.
• Cumulative probability (CDF) — Display a cumulative probability plot
of the data.
• Quantile (inverse CDF) — Display a quantile (inverse CDF) plot.
• Probability plot — Display a probability plot.
• Survivor function — Display a survivor function plot of the data.
• Cumulative hazard — Display a cumulative hazard plot of the data.
Inputting and Fitting Data. The task buttons enable you to perform the
tasks necessary to fit distributions to data. Each button opens a new dialog
box in which you perform the task. The buttons include:
• Data — Import and manage data sets. See “Creating and Managing Data
Sets” on page 5-14.
• New Fit — Create new fits. See “Creating a New Fit” on page 5-19.
• Manage Fits — Manage existing fits. See “Managing Fits” on page 5-26.
• Evaluate — Evaluate fits at any points you choose. See “Evaluating Fits”
on page 5-28.
• Exclude — Create rules specifying which values to exclude when fitting a
distribution. See “Excluding Data” on page 5-32.
The display pane displays plots of the data sets and fits you create. Whenever
you make changes in one of the dialog boxes, the results in the display pane
update.
Saving and Customizing Distributions. The Distribution Fitting Tool
menus contain items that enable you to do the following:
• Save and load sessions. See “Saving and Loading Sessions” on page 5-38.
• Generate a file with which you can fit distributions to data and plot the
results independently of the Distribution Fitting Tool. See “Generating a
File to Fit and Plot Distributions” on page 5-46.
• Define and import custom distributions. See “Using Custom Distributions”
on page 5-47.
Creating and Managing Data Sets
This section describes how to create and manage data sets.
To begin, click the Data button in the Distribution Fitting Tool to open the
Data dialog box shown in the following figure.
Importing Data. The Import workspace vectors pane enables you to
create a data set by importing a vector from the MATLAB workspace. The
following sections describe the fields in this pane and give appropriate values
for vectors imported from the MATLAB workspace:
• Data — The drop-down list in the Data field contains the names of all
matrices and vectors, other than 1-by-1 matrices (scalars) in the MATLAB
workspace. Select the array containing the data you want to fit. The actual
data you import must be a vector. If you select a matrix in the Data field,
the first column of the matrix is imported by default. To select a different
column or row of the matrix, click Select Column or Row. This displays
the matrix in the Variable Editor, where you can select a row or column
by highlighting it with the mouse.
Alternatively, you can enter any valid MATLAB expression in the Data
field.
When you select a vector in the Data field, a histogram of the data appears
in the Data preview pane.
• Censoring — If some of the points in the data set are censored, enter
a Boolean vector, of the same size as the data vector, specifying the
censored entries of the data. A 1 in the censoring vector specifies that the
corresponding entry of the data vector is censored, while a 0 specifies that
the entry is not censored. If you enter a matrix, you can select a column or
row by clicking Select Column or Row. If you do not want to censor any
data, leave the Censoring field blank.
• Frequency — Enter a vector of positive integers of the same size as the
data vector to specify the frequency of the corresponding entries of the data
vector. For example, a value of 7 in the 15th entry of the frequency vector
specifies that there are 7 data points corresponding to the value in the 15th
entry of the data vector. If all entries of the data vector have frequency 1,
leave the Frequency field blank. (A sketch of typical censoring and frequency
vectors appears after this list.)
• Data set name — Enter a name for the data set you import from the
workspace, such as My data.
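As a minimal sketch, the following commands build the kinds of workspace variables these fields expect; the variable names and the censoring limit are hypothetical:
data = exprnd(10,100,1);        % data vector to import
cens = double(data > 20);       % 1 marks a censored entry, 0 an observed one
data(cens == 1) = 20;           % censored observations recorded at the limit
freq = ones(size(data));        % frequency of each entry (all 1 here)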
After you have entered the information in the preceding fields, click Create
Data Set to create the data set My data.
Managing Data Sets. The Manage data sets pane enables you to view
and manage the data sets you create. When you create a data set, its name
appears in the Data sets list. The following figure shows the Manage data
sets pane after creating the data set My data.
For each data set in the Data sets list, you can:
• Select the Plot check box to display a plot of the data in the main
Distribution Fitting Tool window. When you create a new data set, Plot is
selected by default. Clearing the Plot check box removes the data from the
plot in the main window. You can specify the type of plot displayed in the
Display type field in the main window.
• If Plot is selected, you can also select Bounds to display confidence
interval bounds for the plot in the main window. These bounds are
pointwise confidence bounds around the empirical estimates of these
functions. The bounds are only displayed when you set Display Type in
the main window to one of the following:
- Cumulative probability (CDF)
- Survivor function
- Cumulative hazard
The Distribution Fitting Tool cannot display confidence bounds on density
(PDF), quantile (inverse CDF), or probability plots. Clearing the Bounds
check box removes the confidence bounds from the plot in the main window.
When you select a data set from the list, the following buttons are enabled:
• View — Display the data in a table in a new window.
• Set Bin Rules — Define the histogram bins used in a density (PDF) plot.
• Rename — Rename the data set.
• Delete — Delete the data set.
Setting Bin Rules. To set bin rules for the histogram of a data set, click Set
Bin Rules. This opens the Set Bin Width Rules dialog box.
You can select from the following rules:
• Freedman-Diaconis rule — Algorithm that chooses bin widths and
locations automatically, based on the sample size and the spread of the
data. This rule, which is the default, is suitable for many kinds of data.
• Scott rule — Algorithm intended for data that are approximately normal.
The algorithm chooses bin widths and locations automatically.
• Number of bins — Enter the number of bins. All bins have equal widths.
• Bins centered on integers — Specifies bins centered on integers.
• Bin width — Enter the width of each bin. If you select this option, you
can also select:
- Automatic bin placement — Place the edges of the bins at integer multiples of the Bin width.
- Bin boundary at — Enter a scalar to specify the boundaries of the bins. The boundary of each bin is equal to this scalar plus an integer multiple of the Bin width.
The Set Bin Width Rules dialog box also provides the following options:
• Apply to all existing data sets — Apply the rule to all data sets.
Otherwise, the rule is only applied to the data set currently selected in
the Data dialog box.
• Save as default — Apply the current rule to any new data sets that you
create. You can also set default bin width rules by selecting Set Default
Bin Rules from the Tools menu in the main window.
Creating a New Fit
This section describes how to create a new fit. To begin, click the New Fit
button at the top of the main window to open the New Fit dialog box. If you
created the data set My data, it appears in the Data field.
The New Fit dialog box contains the following fields:
• Fit Name: Enter a name for the fit in the Fit Name field.
• Data: The Data field contains a drop-down list of the data sets you have created. Select the data set to which you want to fit a distribution.
• Distribution: Select the type of distribution to fit from the Distribution drop-down list. See “Available Distributions” on page 5-22 for a list of distributions supported by the Distribution Fitting Tool. Only the distributions that apply to the values of the selected data set appear in the Distribution field. For example, positive distributions are not displayed when the data include values that are zero or negative. You can specify either a parametric or a nonparametric distribution. When you select a parametric distribution from the drop-down list, a description of its parameters appears in the Normal pane. The Distribution Fitting Tool estimates these parameters to fit the distribution to the data set. When you select Nonparametric fit, options for the fit appear in the pane, as described in “Further Options for Nonparametric Fits” on page 5-23.
• Exclusion rule: Specify a rule to exclude some data in the Exclusion rule field. Create an exclusion rule by clicking Exclude in the Distribution Fitting Tool. For more information, see “Excluding Data” on page 5-32.
Apply the New Fit. Click Apply to fit the distribution. For a parametric
fit, the Results pane displays the values of the estimated parameters. For a
nonparametric fit, the Results pane displays information about the fit.
When you click Apply, the Distribution Fitting Tool displays a plot of the
distribution, along with the corresponding data.
Note When you click Apply, the title of the dialog box changes to Edit Fit.
You can now make changes to the fit you just created and click Apply again
to save them. After closing the Edit Fit dialog box, you can reopen it from the
Fit Manager dialog box at any time to edit the fit.
After applying the fit, you can save the information to the workspace using
probability distribution objects by clicking Save to workspace. See “Using
Probability Distribution Objects” on page 5-84 for more information.
Available Distributions. This section lists the distributions available in
the Distribution Fitting Tool.
Most, but not all, of the distributions available in the Distribution Fitting
Tool are supported elsewhere in Statistics Toolbox software (see “Supported
Distributions” on page 5-3), and have dedicated distribution fitting functions.
These functions compute the majority of the fits in the Distribution Fitting
Tool, and are referenced in the list below.
Other fits are computed using functions internal to the Distribution Fitting
Tool. Distributions that do not have corresponding Statistics Toolbox
fitting functions are described in “Additional Distributions Available in the
Distribution Fitting Tool” on page 5-49.
Not all of the distributions listed below are available for all data sets. The
Distribution Fitting Tool determines the extent of the data (nonnegative, unit
interval, etc.) and displays appropriate distributions in the Distribution
drop-down list. Distribution data ranges are given parenthetically in the
list below.
• Beta (unit interval values) distribution, fit using the function betafit.
• Binomial (nonnegative values) distribution, fit using the function binopdf.
• Birnbaum-Saunders (positive values) distribution.
• Exponential (nonnegative values) distribution, fit using the function
expfit.
• Extreme value (all values) distribution, fit using the function evfit.
• Gamma (positive values) distribution, fit using the function gamfit.
• Generalized extreme value (all values) distribution, fit using the function
gevfit.
• Generalized Pareto (all values) distribution, fit using the function gpfit.
• Inverse Gaussian (positive values) distribution.
• Logistic (all values) distribution.
• Loglogistic (positive values) distribution.
• Lognormal (positive values) distribution, fit using the function lognfit.
• Nakagami (positive values) distribution.
• Negative binomial (nonnegative values) distribution, fit using the function
nbinpdf.
• Nonparametric (all values) distribution, fit using the function ksdensity.
See “Further Options for Nonparametric Fits” on page 5-23 for a description
of available options.
• Normal (all values) distribution, fit using the function normfit.
• Poisson (nonnegative integer values) distribution, fit using the function
poisspdf.
• Rayleigh (positive values) distribution using the function raylfit.
• Rician (positive values) distribution.
• t location-scale (all values) distribution.
• Weibull (positive values) distribution using the function wblfit.
Further Options for Nonparametric Fits. When you select
Non-parametric in the Distribution field, a set of options appears in the
Non-parametric pane, as shown in the following figure.
The options for nonparametric distributions are:
• Kernel — Type of kernel function to use.
- Normal
- Box
- Triangle
- Epanechnikov
• Bandwidth — The bandwidth of the kernel smoothing window. Select
auto for a default value that is optimal for estimating normal densities.
This value appears in the Fit results pane after you click Apply. Select
specify and enter a smaller value to reveal features such as multiple
modes or a larger value to make the fit smoother.
• Domain — The allowed x-values for the density.
- unbounded — The density extends over the whole real line.
- positive — The density is restricted to positive values.
- specify — Enter lower and upper bounds for the domain of the density.
When you select positive or specify, the nonparametric fit has zero
probability outside the specified domain.
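Because nonparametric fits in the tool are computed with the ksdensity function, these GUI options correspond roughly to ksdensity arguments. A minimal command-line sketch, with arbitrary data and option values:
x = gamrnd(2,1,200,1);                        % hypothetical positive data
[f,xi] = ksdensity(x,'kernel','epanechnikov', ...
    'width',0.5,'support','positive');        % kernel, bandwidth, and domain
plot(xi,f)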
Displaying Results
This section explains the different ways to display results in the Distribution
Fitting Tool window. This window displays plots of:
• The data sets for which you select Plot in the Data dialog box
• The fits for which you select Plot in the Fit Manager dialog box
• Confidence bounds for:
- Data sets for which you select Bounds in the Data dialog box
- Fits for which you select Bounds in the Fit Manager dialog box
The following fields are available.
Display Type. The Display Type field in the main window specifies the type
of plot displayed. Each type corresponds to a probability function, for example,
a probability density function. The following display types are available:
• Density (PDF) — Display a probability density function (PDF) plot for the
fitted distribution. The main window displays data sets using a probability
histogram, in which the height of each rectangle is the fraction of data
points that lie in the bin divided by the width of the bin. This makes the
sum of the areas of the rectangles equal to 1.
• Cumulative probability (CDF) — Display a cumulative probability
plot of the data. The main window displays data sets using a cumulative
probability step function. The height of each step is the cumulative sum of
the heights of the rectangles in the probability histogram.
• Quantile (inverse CDF) — Display a quantile (inverse CDF) plot.
• Probability plot — Display a probability plot of the data. You can
specify the type of distribution used to construct the probability plot in the
Distribution field, which is only available when you select Probability
plot. The choices for the distribution are:
- Exponential
- Extreme value
- Logistic
- Log-Logistic
- Lognormal
- Normal
- Rayleigh
- Weibull
In addition to these choices, you can create a probability plot against a
parametric fit that you create in the New Fit pane. These fits are added at
the bottom of the Distribution drop-down list when you create them.
• Survivor function — Display a survivor function plot of the data.
• Cumulative hazard — Display a cumulative hazard plot of the data.
Note Some distributions are unavailable if the plotted data includes 0 or
negative values.
Confidence Bounds. You can display confidence bounds for data sets and fits
when you set Display Type to Cumulative probability (CDF), Survivor
function, Cumulative hazard, or, for fits only, Quantile (inverse CDF).
• To display bounds for a data set, select Bounds next to the data set in the
Data sets pane of the Data dialog box.
• To display bounds for a fit, select Bounds next to the fit in the Fit Manager
dialog box. Confidence bounds are not available for all fit types.
To set the confidence level for the bounds, select Confidence Level from the
View menu in the main window and choose from the options.
Managing Fits
This section describes how to manage fits that you have created. To begin,
click the Manage Fits button in the Distribution Fitting Tool. This opens the
Fit Manager dialog box as shown in the following figure.
The Table of fits displays a list of the fits you create, with the following
options:
• Plot — Select Plot to display a plot of the fit in the main window of the
Distribution Fitting Tool. When you create a new fit, Plot is selected by
default. Clearing the Plot check box removes the fit from the plot in the
main window.
• Bounds — If Plot is selected, you can also select Bounds to display
confidence bounds in the plot. The bounds are displayed when you set
Display Type in the main window to one of the following:
- Cumulative probability (CDF)
- Quantile (inverse CDF)
- Survivor function
- Cumulative hazard
The Distribution Fitting Tool cannot display confidence bounds on density
(PDF) or probability plots. In addition, bounds are not supported for
nonparametric fits and some parametric fits.
Clearing the Bounds check box removes the confidence intervals from
the plot in the main window.
When you select a fit in the Table of fits, the following buttons are enabled
below the table:
- New Fit — Open a New Fit window.
- Copy — Create a copy of the selected fit.
- Edit — Open an Edit Fit dialog box, where you can edit the fit.
  Note You can only edit the currently selected fit in the Edit Fit dialog box. To edit a different fit, select it in the Table of fits and click Edit to open another Edit Fit dialog box.
- Save to workspace — Save the selected fit as a distribution object. See “Using Probability Distribution Objects” on page 5-84 for more information.
- Delete — Delete the selected fit.
Evaluating Fits
The Evaluate dialog box enables you to evaluate any fit at whatever points you
choose. To open the dialog box, click the Evaluate button in the Distribution
Fitting Tool. The following figure shows the Evaluate dialog box.
The Evaluate dialog box contains the following items:
• Fit pane — Display the names of existing fits. Select one or more fits that
you want to evaluate. Using your platform-specific functionality, you can
select multiple fits.
• Function — Select the type of probability function you want to evaluate
for the fit. The available functions are
- Density (PDF) — Computes a probability density function.
- Cumulative probability (CDF) — Computes a cumulative probability function.
- Quantile (inverse CDF) — Computes a quantile (inverse CDF) function.
- Survivor function — Computes a survivor function.
- Cumulative hazard — Computes a cumulative hazard function.
- Hazard rate — Computes the hazard rate.
• At x = — Enter a vector of points or the name of a workspace variable
containing a vector of points at which you want to evaluate the distribution
function. If you change Function to Quantile (inverse CDF), the field
name changes to At p = and you enter a vector of probability values.
• Compute confidence bounds — Select this box to compute confidence
bounds for the selected fits. The check box is only enabled if you set
Function to one of the following:
- Cumulative probability (CDF)
- Quantile (inverse CDF)
- Survivor function
- Cumulative hazard
The Distribution Fitting Tool cannot compute confidence bounds for
nonparametric fits and for some parametric fits. In these cases, the tool
returns NaN for the bounds.
• Level — Set the level for the confidence bounds.
• Plot function — Select this box to display a plot of the distribution
function, evaluated at the points you enter in the At x = field, in a new
window.
Note The settings for Compute confidence bounds, Level, and Plot
function do not affect the plots that are displayed in the main window of
the Distribution Fitting Tool. The settings only apply to plots you create by
clicking Plot function in the Evaluate window.
Click Apply to apply these settings to the selected fit. The following figure
shows the results of evaluating the cumulative distribution function for the fit My
fit, created in “Example: Fitting a Distribution” on page 5-39, at the points
in the vector -3:0.5:3.
The window displays the following values in the columns of the table to the
right of the Fit pane:
• X — The entries of the vector you enter in At x = field
• Y — The corresponding values of the CDF at the entries of X
• LB — The lower bounds for the confidence interval, if you select Compute
confidence bounds
• UB — The upper bounds for the confidence interval, if you select Compute
confidence bounds
To save the data displayed in the Evaluate window, click Export to
Workspace. This saves the values in the table to a matrix in the MATLAB
workspace.
Excluding Data
To exclude values from fit, click the Exclude button in the main window of
the Distribution Fitting Tool. This opens the Exclude window, in which you
can create rules for excluding specified values. You can use these rules to
exclude data when you create a new fit in the New Fit window. The following
figure shows the Exclude window.
To create an exclusion rule:
1 Exclusion Rule Name—Enter a name for the exclusion rule in the
Exclusion rule name field.
2 Exclude Sections—In the Exclude sections pane, you can specify
bounds for the excluded data:
• In the Lower limit: exclude Y drop-down list, select <= or < from the
drop-down list and enter a scalar in the field to the right. This excludes
values that are either less than or equal to or less than that scalar,
respectively.
• In the Upper limit: exclude Y drop-down list, select >= or > from the
drop-down list and enter a scalar in the field to the right to exclude
values that are either greater than or equal to or greater than the scalar,
respectively.
OR
Exclude Graphically—The Exclude Graphically button enables you
to define the exclusion rule by displaying a plot of the values in a data
set and selecting the bounds for the excluded data with the mouse. For
example, if you created the data set My data, described in “Creating
and Managing Data Sets” on page 5-14, select it from the drop-down list
next to Exclude graphically and then click the Exclude graphically
button. This displays the values in My data in a new window as shown in
the following figure.
To set a lower limit for the boundary of the excluded region, click Add
Lower Limit. This displays a vertical line on the left side of the plot
window. Move the line with the mouse to the point where you want
the lower limit, as shown in the following figure.
Moving the vertical line changes the value displayed in the Lower limit:
exclude data field in the Exclude window, as shown in the following figure.
The value displayed corresponds to the x-coordinate of the vertical line.
Similarly, you can set the upper limit for the boundary of the excluded
region by clicking Add Upper Limit and moving the vertical line that
appears at the right side of the plot window. After setting the lower and
upper limits, click Close and return to the Exclude window.
3 Create Exclusion Rule—Once you have set the lower and upper limits
for the boundary of the excluded data, click Create Exclusion Rule
to create the new rule. The name of the new rule now appears in the
Existing exclusion rules pane.
When you select an exclusion rule in the Existing exclusion rules pane,
the following buttons are enabled:
• Copy — Creates a copy of the rule, which you can then modify. To save
the modified rule under a different name, click Create Exclusion Rule.
• View — Opens a new window in which you can see which data points
are excluded by the rule. The following figure shows a typical example.
The shaded areas in the plot graphically display which data points are
excluded. The table to the right lists all data points. The shaded rows
indicate excluded points:
• Rename — Renames the rule
• Delete — Deletes the rule
Once you define an exclusion rule, you can use it when you fit a distribution
to your data. The rule does not exclude points from the display of the data
set.
Saving and Loading Sessions
This section explains how to save your work in the current Distribution
Fitting Tool session and then load it in a subsequent session, so that you can
continue working where you left off.
Saving a Session. To save the current session, select Save Session from
the File menu in the main window. This opens a dialog box that prompts you
to enter a filename, such as my_session.dfit, for the session. Clicking Save
saves the following items created in the current session:
• Data sets
• Fits
• Exclusion rules
• Plot settings
• Bin width rules
Loading a Session. To load a previously saved session, select Load Session
from the File menu in the main window and enter the name of a previously
saved session. Clicking Open restores the information from the saved session
to the current session of the Distribution Fitting Tool.
Example: Fitting a Distribution
This section presents an example that illustrates how to use the Distribution
Fitting Tool. The example involves the following steps:
• “Step 1: Generate Random Data” on page 5-39
• “Step 2: Import Data” on page 5-39
• “Step 3: Create a New Fit” on page 5-42
Step 1: Generate Random Data. To try the example, first generate some
random data to which you will fit a distribution. The following command
generates a vector data, of length 100, whose entries are random numbers
from a normal distribution with mean .36 and standard deviation 1.4.
data = normrnd(.36, 1.4, 100, 1);
Step 2: Import Data. Open the distribution fitting tool:
dfittool
To import the vector data into the Distribution Fitting Tool, click the Data
button in main window. This opens the window shown in the following figure.
The Data field displays all numeric arrays in the MATLAB workspace. Select
data from the drop-down list, as shown in the following figure.
This displays a histogram of the data in the Data preview pane.
In the Data set name field, type a name for the data set, such as My data,
and click Create Data Set to create the data set. The main window of the
Distribution Fitting Tool now displays a larger version of the histogram in the
Data preview pane, as shown in the following figure.
Note Because the example uses random data, you might see a slightly
different histogram if you try this example for yourself.
Step 3: Create a New Fit. To fit a distribution to the data, click New Fit
in the main window of the Distribution Fitting Tool. This opens the window
shown in the following figure.
To fit a normal distribution, the default entry of the Distribution field, to
My data:
1 Enter a name for the fit, such as My fit, in the Fit name field.
2 Select My data from the drop-down list in the Data field.
3 Click Apply.
The Results pane displays the mean and standard deviation of the normal
distribution that best fits My data, as shown in the following figure.
The main window of the Distribution Fitting Tool displays a plot of the
normal distribution with this mean and standard deviation, as shown in the
following figure.
Generating a File to Fit and Plot Distributions
The Generate Code option in the File menu enables you to create a file that
• Fits the distributions used in the current session to any data vector in the
MATLAB workspace.
• Plots the data and the fits.
After you end the current session, you can use the file to create plots in a
standard MATLAB figure window, without having to reopen the Distribution
Fitting Tool.
As an example, assuming you created the fit described in “Creating a New
Fit” on page 5-19, do the following steps:
1 Select Generate Code from the File menu.
2 Choose File > Save as in the MATLAB Editor window. Save the file as
normal_fit.m in a folder on the MATLAB path.
You can then apply the function normal_fit to any vector of data in the
MATLAB workspace. For example, the following commands
new_data = normrnd(4.1, 12.5, 100, 1);
newfit = normal_fit(new_data)
legend('New Data', 'My fit')
generate newfit, a normal distribution fitted to the data, and a plot of the
data and the fit.
newfit =
normal distribution
mu = 3.19148
sigma = 12.5631
Note By default, the file labels the data in the legend using the same name as
the data set in the Distribution Fitting Tool. You can change the label using
the legend command, as illustrated by the preceding example.
Using Custom Distributions
This section explains how to use custom distributions with the Distribution
Fitting Tool.
Defining Custom Distributions. To define a custom distribution, select
Define Custom Distribution from the File menu. This opens a file
template in the MATLAB editor. You then edit this file so that it computes
the distribution you want.
The template includes example code that computes the Laplace distribution,
beginning at the lines
%
% Remove the following return statement to define the
% Laplace distribution
%
return
To use this example, simply delete the command return and save the
file. If you save the template in a folder on the MATLAB path, under its
default name dfittooldists.m, the Distribution Fitting Tool reads it in
automatically when you start the tool. You can also save the template under a
different name, such as laplace.m, and then import the custom distribution
as described in the following section.
Importing Custom Distributions. To import a custom distribution, select
Import Custom Distributions from the File menu. This opens a dialog box
in which you can select the file that defines the distribution. For example, if
you created the file laplace.m, as described in the preceding section, you can
enter laplace.m and select Open in the dialog box. The Distribution field of
the New Fit window now contains the option Laplace.
Additional Distributions Available in the Distribution Fitting Tool
The following distributions are available in the Distribution Fitting Tool,
but do not have dedicated distribution functions as described in “Statistics
Toolbox Distribution Functions” on page 5-52. The distributions can be used
with the functions pdf, cdf, icdf, and mle in a limited capacity. See the
reference pages for these functions for details on the limitations.
• “Birnbaum-Saunders Distribution” on page B-10
• “Inverse Gaussian Distribution” on page B-45
• “Loglogistic Distribution” on page B-50
• “Logistic Distribution” on page B-49
• “Nakagami Distribution” on page B-70
• “Rician Distribution” on page B-93
• “t Location-Scale Distribution” on page B-97
For a complete list of the distributions available for use with the Distribution
Fitting Tool, see “Supported Distributions” on page 5-3. Distributions listing
dfittool in the fit column of the tables in that section can be used with
the Distribution Fitting Tool.
Visually Exploring Random Number Generation
The Random Number Generation Tool is a graphical user interface that
generates random samples from specified probability distributions and
displays the samples as histograms. Use the tool to explore the effects of
changing parameters and sample size on the distributions.
Run the tool by typing randtool at the command line.
The tool window provides controls for choosing the distribution and the sample size, a histogram of the current sample, parameter values, bounds, and controls, fields for any additional parameters, a button to sample again from the same distribution, and a button to export the sample to the workspace.
Start by selecting a distribution, then enter the desired sample size.
You can also
• Use the controls at the bottom of the window to set parameter values for
the distribution and to change their upper and lower bounds.
• Draw another sample from the same distribution, with the same size and
parameters.
• Export the current sample to your workspace. A dialog box enables you
to provide a name for the sample.
Statistics Toolbox Distribution Functions
In this section...
“Probability Density Functions” on page 5-52
“Cumulative Distribution Functions” on page 5-62
“Inverse Cumulative Distribution Functions” on page 5-66
“Distribution Statistics Functions” on page 5-68
“Distribution Fitting Functions” on page 5-70
“Negative Log-Likelihood Functions” on page 5-77
“Random Number Generators” on page 5-80
For each distribution supported by Statistics Toolbox software, a selection of
the distribution functions described in this section is available for statistical
programming. This section gives a general overview of the use of each type
of function, independent of the particular distribution. For specific functions
available for specific distributions, see “Supported Distributions” on page 5-3.
Probability Density Functions
• “Estimating PDFs with Parameters” on page 5-52
• “Estimating PDFs without Parameters” on page 5-55
Estimating PDFs with Parameters
Probability density functions (pdfs) for supported Statistics Toolbox
distributions all end with pdf, as in binopdf or exppdf. For more information
on specific function names for specific distributions see “Supported
Distributions” on page 5-3.
Each function represents a parametric family of distributions. Input
arguments are arrays of outcomes followed by a list of parameter values
specifying a particular member of the distribution family.
For discrete distributions, the pdf assigns a probability to each outcome. In
this context, the pdf is often called a probability mass function (pmf).
For example, the discrete binomial pdf
$$f(k) = \binom{n}{k}\, p^{k} (1-p)^{n-k}$$
assigns probability to the event of k successes in n trials of a Bernoulli process
(such as coin flipping) with probability p of success at each trial. Each of the
integers k = 0, 1, 2, ..., n is assigned a positive probability, with the sum of the
probabilities equal to 1. Compute the probabilities with the binopdf function:
p = 0.2; % Probability of success for each trial
n = 10; % Number of trials
k = 0:n; % Outcomes
m = binopdf(k,n,p); % Probability mass vector
bar(k,m) % Visualize the probability distribution
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
grid on
For continuous distributions, the pdf assigns a probability density to each
outcome. The probability of any single outcome is zero. The pdf must be
integrated over a set of outcomes to compute the probability that an outcome
falls within that set. The integral over the entire set of outcomes is 1.
For example, the continuous exponential pdf
$$f(t) = \lambda e^{-\lambda t}$$
is used to model the probability that a process with constant failure rate λ will
have a failure within time t . Each time t > 0 is assigned a positive probability
density. Densities are computed with the exppdf function:
lambda = 2; % Failure rate
t = 0:0.01:3; % Outcomes
f = exppdf(t,1/lambda); % Probability density vector
plot(t,f) % Visualize the probability distribution
grid on
Probabilities for continuous pdfs can be computed with the quad function.
In the example above, the probability of failure in the time interval [0,1] is
computed as follows:
f_lambda = @(t)exppdf(t,1/lambda); % Pdf with fixed lambda
P = quad(f_lambda,0,1) % Integrate from 0 to 1
P =
0.8647
Alternatively, the cumulative distribution function (cdf) for the exponential
function, expcdf, can be used:
P = expcdf(1,1/lambda) % Cumulative probability from 0 to 1
P =
0.8647
Estimating PDFs without Parameters
A distribution of data can be described graphically with a histogram:
cars = load('carsmall','MPG','Origin');
MPG = cars.MPG;
hist(MPG)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
You can also describe a data distribution by estimating its density.
The ksdensity function does this using a kernel smoothing method. A
nonparametric density estimate of the previous data, using the default kernel
and bandwidth, is given by:
[f,x] = ksdensity(MPG);
plot(x,f);
title('Density estimate for MPG')
Controlling Probability Density Curve Smoothness. The choice of
kernel bandwidth controls the smoothness of the probability density curve.
The following graph shows the density estimate for the same mileage data
using different bandwidths. The default bandwidth is in blue and looks
like the preceding graph. Estimates for smaller and larger bandwidths are
in red and green.
The first call to ksdensity returns the default bandwidth, u, of the kernel
smoothing function. Subsequent calls modify this bandwidth.
[f,x,u] = ksdensity(MPG);
plot(x,f)
title('Density estimate for MPG')
hold on
[f,x] = ksdensity(MPG,'width',u/3);
plot(x,f,'r');
[f,x] = ksdensity(MPG,'width',u*3);
plot(x,f,'g');
legend('default width','1/3 default','3*default')
hold off
The default bandwidth seems to be doing a good job—reasonably smooth,
but not so smooth as to obscure features of the data. This bandwidth is
the one that is theoretically optimal for estimating densities for the normal
distribution.
The green curve shows a density with the kernel bandwidth set too high.
This curve smooths out the data so much that the end result looks just like
the kernel function. The red curve has a smaller bandwidth and is rougher
looking than the blue curve. It may be too rough, but it does provide an
indication that there might be two major peaks rather than the single peak
of the blue curve. A reasonable choice of width might lead to a curve that is
intermediate between the red and blue curves.
Specifying Kernel Smoothing Functions. You can also specify a kernel
function by supplying either the function name or a function handle. The four
preselected functions, 'normal', 'epanechnikov', 'box', and 'triangle',
are all scaled to have standard deviation equal to 1, so they perform a
comparable degree of smoothing.
Using default bandwidths, you can now plot the same mileage data, using
each of the available kernel functions.
hname = {'normal' 'epanechnikov' 'box' 'triangle'};
colors = {'r' 'b' 'g' 'm'};
for j=1:4
[f,x] = ksdensity(MPG,'kernel',hname{j});
plot(x,f,colors{j});
hold on;
end
legend(hname{:});
hold off
The density estimates are roughly comparable, but the box kernel produces a
density that is rougher than the others.
Comparing Density Estimates. While it is difficult to overlay two
histograms to compare them, you can easily overlay smooth density estimates.
For example, the following graph shows the MPG distributions for cars from
different countries of origin:
Origin = cellstr(cars.Origin);
I = strcmp('USA',Origin);
J = strcmp('Japan',Origin);
K = ~(I|J);
MPG_USA = MPG(I);
MPG_Japan = MPG(J);
MPG_Europe = MPG(K);
[fI,xI] = ksdensity(MPG_USA);
plot(xI,fI,'b')
hold on
[fJ,xJ] = ksdensity(MPG_Japan);
plot(xJ,fJ,'r')
[fK,xK] = ksdensity(MPG_Europe);
plot(xK,fK,'g')
legend('USA','Japan','Europe')
hold off
For piecewise probability density estimation, using kernel smoothing in the
center of the distribution and Pareto distributions in the tails, see “Fitting
Piecewise Distributions” on page 5-72.
Cumulative Distribution Functions
• “Estimating Parametric CDFs” on page 5-62
• “Estimating Empirical CDFs” on page 5-63
Estimating Parametric CDFs
Cumulative distribution functions (cdfs) for supported Statistics Toolbox
distributions all end with cdf, as in binocdf or expcdf. Specific function
names for specific distributions can be found in “Supported Distributions”
on page 5-3.
Each function represents a parametric family of distributions. Input
arguments are arrays of outcomes followed by a list of parameter values
specifying a particular member of the distribution family.
For discrete distributions, the cdf F is related to the pdf f by
$$F(x) = \sum_{y \le x} f(y)$$
For continuous distributions, the cdf F is related to the pdf f by
$$F(x) = \int_{-\infty}^{x} f(y)\, dy$$
Cdfs are used to compute probabilities of events. In particular, if F is a cdf
and x and y are outcomes, then
• P(y ≤ x) = F(x)
• P(y > x) = 1 – F(x)
• P(x1 < y ≤ x2) = F(x2) – F(x1)
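As a quick numerical sketch of these identities, using the exponential cdf with an arbitrary parameter:
x1 = 1; x2 = 2;
P1 = expcdf(x2,1);                    % P(y <= x2)
P2 = 1 - expcdf(x2,1);                % P(y > x2)
P3 = expcdf(x2,1) - expcdf(x1,1);     % P(x1 < y <= x2)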
For example, the t-statistic
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
follows a Student’s t distribution with n – 1 degrees of freedom when computed
from repeated random samples from a normal population with mean μ. Here
$\bar{x}$ is the sample mean, s is the sample standard deviation, and n is the sample
size. The probability of observing a t-statistic greater than or equal to the
value computed from a sample can be found with the tcdf function:
mu = 1; % Population mean
sigma = 2; % Population standard deviation
n = 100; % Sample size
x = normrnd(mu,sigma,n,1); % Random sample from population
xbar = mean(x); % Sample mean
s = std(x); % Sample standard deviation
t = (xbar-mu)/(s/sqrt(n)) % t-statistic
t =
0.2489
p = 1-tcdf(t,n-1) % Probability of larger t-statistic
p =
0.4020
This probability is the same as the p value returned by a t-test of the null
hypothesis that the sample comes from a normal population with mean μ:
[h,ptest] = ttest(x,mu,0.05,'right')
h =
0
ptest =
0.4020
Estimating Empirical CDFs
The ksdensity function produces an empirical version of a probability
density function (pdf). That is, instead of selecting a density with a particular
parametric form and estimating the parameters, it produces a nonparametric
density estimate that adapts itself to the data.
Similarly, it is possible to produce an empirical version of the cumulative
distribution function (cdf). The ecdf function computes this empirical cdf. It
returns the values of a function F such that F(x) represents the proportion of
observations in a sample less than or equal to x.
The idea behind the empirical cdf is simple. It is a function that assigns
probability 1/n to each of n observations in a sample. Its graph has a
stair-step appearance. If a sample comes from a distribution in a parametric
family (such as a normal distribution), its empirical cdf is likely to resemble
the parametric distribution. If not, its empirical distribution still gives an
estimate of the cdf for the distribution that generated the data.
The following example generates 20 observations from a normal distribution
with mean 10 and standard deviation 2. You can use ecdf to calculate the
empirical cdf and stairs to plot it. Then you overlay the normal distribution
curve on the empirical function.
x = normrnd(10,2,20,1);
[f,xf] = ecdf(x);
stairs(xf,f)
hold on
xx=linspace(5,15,100);
yy = normcdf(xx,10,2);
plot(xx,yy,'r:')
hold off
legend('Empirical cdf','Normal cdf',2)
The empirical cdf is especially useful in survival analysis applications. In
such applications the data may be censored, that is, not observed exactly.
Some individuals may fail during a study, and you can observe their failure
time exactly. Other individuals may drop out of the study, or may not fail
until after the study is complete. The ecdf function has arguments for dealing
with censored data. In addition, you can use the coxphfit function when
individuals have predictors that are not the same.
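A minimal sketch of ecdf with right-censored data; the values and the censoring limit are arbitrary:
y = exprnd(10,30,1);                % hypothetical failure times
cens = y > 15;                      % observations censored at time 15
y(cens) = 15;                       % record censored times at the limit
[f,x] = ecdf(y,'censoring',cens);   % empirical cdf accounting for censoring
stairs(x,f)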
For piecewise probability density estimation, using the empirical cdf in the
center of the distribution and Pareto distributions in the tails, see “Fitting
Piecewise Distributions” on page 5-72.
Inverse Cumulative Distribution Functions
Inverse cumulative distribution functions for supported Statistics Toolbox
distributions all end with inv, as in binoinv or expinv. Specific function
names for specific distributions can be found in “Supported Distributions”
on page 5-3.
Each function represents a parametric family of distributions. Input
arguments are arrays of cumulative probabilities from 0 to 1 followed by a list
of parameter values specifying a particular member of the distribution family.
For continuous distributions, the inverse cdf returns the unique outcome
whose cdf value is the input cumulative probability.
For example, the expinv function can be used to compute inverses of
exponential cumulative probabilities:
x = 0.5:0.2:1.5       % Outcomes
x =
    0.5000    0.7000    0.9000    1.1000    1.3000    1.5000
p = expcdf(x,1)       % Cumulative probabilities
p =
    0.3935    0.5034    0.5934    0.6671    0.7275    0.7769
expinv(p,1)           % Return original outcomes
ans =
    0.5000    0.7000    0.9000    1.1000    1.3000    1.5000
For discrete distributions, there may be no outcome whose cdf value is the
input cumulative probability. In these cases, the inverse cdf returns the first
outcome whose cdf value equals or exceeds the input cumulative probability.
For example, the binoinv function can be used to compute inverses of
binomial cumulative probabilities:
x = 0:5                % Some possible outcomes
p = binocdf(x,10,0.2)  % Their cumulative probabilities
p =
    0.1074    0.3758    0.6778    0.8791    0.9672    0.9936
q = [.1 .2 .3 .4]      % New trial probabilities
q =
    0.1000    0.2000    0.3000    0.4000
binoinv(q,10,0.2)      % Their corresponding outcomes
ans =
     0     1     1     2
The inverse cdf is useful in hypothesis testing, where critical outcomes of a
test statistic are computed from cumulative significance probabilities. For
example, norminv can be used to compute a 95% confidence interval under
the assumption of normal variability:
p = [0.025 0.975]; % Interval containing 95% of [0,1]
x = norminv(p,0,1) % Assume standard normal variability
x =
-1.9600 1.9600 % 95% confidence interval
n = 20; % Sample size
y = normrnd(8,1,n,1); % Random sample (assume mean is unknown)
ybar = mean(y);
ci = ybar + (1/sqrt(n))*x % Confidence interval for mean
ci =
7.6779 8.5544
Distribution Statistics Functions
Distribution statistics functions for supported Statistics Toolbox distributions
all end with stat, as in binostat or expstat. Specific function names for
specific distributions can be found in “Supported Distributions” on page 5-3.
Each function represents a parametric family of distributions. Input
arguments are lists of parameter values specifying a particular member
of the distribution family. Functions return the mean and variance of the
distribution, as a function of the parameters.
For example, the wblstat function can be used to visualize the mean of the
Weibull distribution as a function of its two distribution parameters:
a = 0.5:0.1:3;
b = 0.5:0.1:3;
[A,B] = meshgrid(a,b);
M = wblstat(A,B);
surfc(A,B,M)
Distribution Fitting Functions
• “Fitting Regular Distributions” on page 5-70
• “Fitting Piecewise Distributions” on page 5-72
Fitting Regular Distributions
Distribution fitting functions for supported Statistics Toolbox distributions all
end with fit, as in binofit or expfit. Specific function names for specific
distributions can be found in “Supported Distributions” on page 5-3.
Each function represents a parametric family of distributions. Input
arguments are arrays of data, presumed to be samples from some member
of the selected distribution family. Functions return maximum likelihood
estimates (MLEs) of distribution parameters, that is, parameters for the
distribution family member with the maximum likelihood of producing the
data as a random sample.
The Statistics Toolbox function mle is a convenient front end to the individual
distribution fitting functions, and more. The function computes MLEs for
distributions beyond those for which Statistics Toolbox software provides
specific pdf functions.
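For example, a minimal sketch of calling mle directly on simulated data:

data = betarnd(2,5,100,1);              % simulated sample
phat = mle(data,'distribution','beta')  % MLEs of the two beta parameters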
For some pdfs, MLEs can be given in closed form and computed directly.
For other pdfs, a search for the maximum likelihood must be employed. The
search can be controlled with an options input argument, created using
the statset function. For efficient searches, it is important to choose a
reasonable distribution model and set appropriate convergence tolerances.
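A sketch of passing search controls through statset; the tolerance and iteration values here are arbitrary:

opts = statset('MaxIter',500,'MaxFunEvals',1000,'TolX',1e-8);
data = gamrnd(2,1,100,1);                                % simulated sample
phat = mle(data,'distribution','gamma','options',opts);  % MLE search with custom options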
MLEs can be heavily biased, especially for small samples. As sample size
increases, however, MLEs become unbiased minimum variance estimators
with approximate normal distributions. This is used to compute confidence
bounds for the estimates.
For example, consider the following distribution of means from repeated
random samples of an exponential distribution:
mu = 1; % Population parameter
n = 1e3; % Sample size
ns = 1e4; % Number of samples
samples = exprnd(mu,n,ns); % Population samples
means = mean(samples); % Sample means
The Central Limit Theorem says that the means will be approximately
normally distributed, regardless of the distribution of the data in the samples.
The normfit function can be used to find the normal distribution that best
fits the means:
[muhat,sigmahat,muci,sigmaci] = normfit(means)
muhat =
1.0003
sigmahat =
0.0319
muci =
0.9997
1.0010
sigmaci =
0.0314
0.0323
The function returns MLEs for the mean and standard deviation and their
95% confidence intervals.
To visualize the distribution of sample means together with the fitted normal
distribution, you must scale the fitted pdf, with area = 1, to the area of the
histogram being used to display the means:
numbins = 50;
hist(means,numbins)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
hold on
[bincounts,binpositions] = hist(means,numbins);
binwidth = binpositions(2) - binpositions(1);
histarea = binwidth*sum(bincounts);
x = binpositions(1):0.001:binpositions(end);
y = normpdf(x,muhat,sigmahat);
plot(x,histarea*y,'r','LineWidth',2)
Fitting Piecewise Distributions
The parametric methods discussed in “Fitting Regular Distributions” on
page 5-70 fit data samples with smooth distributions that have a relatively
low-dimensional set of parameters controlling their shape. These methods
work well in many cases, but there is no guarantee that a given sample will be
described accurately by any of the supported Statistics Toolbox distributions.
The empirical distributions computed by ecdf and discussed in “Estimating
Empirical CDFs” on page 5-63 assign equal probability to each observation in
a sample, providing an exact match of the sample distribution. However, the
distributions are not smooth, especially in the tails where data may be sparse.
The paretotails function fits a distribution by piecing together the empirical
distribution in the center of the sample with smooth generalized Pareto
distributions (GPDs) in the tails. The output is an object of the paretotails
class, with associated methods to evaluate the cdf, inverse cdf, and other
functions of the fitted distribution.
As an example, consider the following data, with about 20% outliers:
left_tail = -exprnd(1,10,1);
right_tail = exprnd(5,10,1);
center = randn(80,1);
data = [left_tail;center;right_tail];
Neither a normal distribution nor a t distribution fits the tails very well:
probplot(data);
p = fitdist(data,'tlocationscale');
h = probplot(gca,p);
set(h,'color','r','linestyle','-')
title('{\bf Probability Plot}')
legend('Normal','Data','t','Location','NW')
On the other hand, the empirical distribution provides a perfect fit, but the
outliers make the tails very discrete:
ecdf(data)
Random samples generated from this distribution by inversion might include,
for example, values around 4.33 and 9.25, but nothing in-between.
The paretotails function provides a single, well-fit model for the entire
sample. The following uses generalized Pareto distributions (GPDs) for the
lower and upper 10% of the data:
pfit = paretotails(data,0.1,0.9)
pfit =
Piecewise distribution with 3 segments
-Inf < x < -1.30726 (0 < p < 0.1)
lower tail, GPD(-1.10167,1.12395)
-1.30726 < x < 1.27213 (0.1 < p < 0.9)
interpolated empirical cdf
1.27213 < x < Inf (0.9 < p < 1)
upper tail, GPD(1.03844,0.726038)
x = -4:0.01:10;
plot(x,cdf(pfit,x))
Access information about the fit using the methods of the paretotails class.
Options allow for nonparametric estimation of the center of the cdf.
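For instance, a brief sketch using the inverse cdf of the pfit object created above, including random sampling by inversion:

q = icdf(pfit,[0.01 0.5 0.99])   % quantiles at selected cumulative probabilities
r = icdf(pfit,rand(1000,1));     % random sample drawn from the fitted distribution by inversion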
Negative Log-Likelihood Functions
Negative log-likelihood functions for supported Statistics Toolbox
distributions all end with like, as in explike. Specific function names for
specific distributions can be found in “Supported Distributions” on page 5-3.
Each function represents a parametric family of distributions. Input
arguments are lists of parameter values specifying a particular member of
the distribution family followed by an array of data. Functions return the
negative log-likelihood of the parameters, given the data.
Negative log-likelihood functions are used as objective functions in
search algorithms such as the one implemented by the MATLAB function
fminsearch. Additional search algorithms are implemented by Optimization
Toolbox™ functions and Global Optimization Toolbox functions.
When used to compute maximum likelihood estimates (MLEs), negative
log-likelihood functions allow you to choose a search algorithm and exercise
low-level control over algorithm execution. By contrast, the functions
discussed in “Distribution Fitting Functions” on page 5-70 use preset
algorithms with options limited to those set by the statset function.
Likelihoods are conditional probability densities. A parametric family of
distributions is specified by its pdf f(x,a), where x and a represent the
variables and parameters, respectively. When a is fixed, the pdf is used
to compute the density at x, f(x|a). When x is fixed, the pdf is used to
compute the likelihood of the parameters a, f(a|x). The joint likelihood of the
parameters over an independent random sample X is
$$L(a) = \prod_{x \in X} f(a \mid x)$$
Given X, MLEs maximize L(a) over all possible a.
In numerical algorithms, the log-likelihood function, log(L(a)), is
(equivalently) optimized. The logarithm transforms the product of potentially
small likelihoods into a sum of logs, which is easier to distinguish from 0
in computation. For convenience, Statistics Toolbox negative log-likelihood
functions return the negative of this sum, since the optimization algorithms to
which the values are passed typically search for minima rather than maxima.
For example, use gamrnd to generate a random sample from a specific gamma
distribution:
a = [1,2];
X = gamrnd(a(1),a(2),1e3,1);
Given X, the gamlike function can be used to visualize the negative
log-likelihood surface in the neighborhood of a:
mesh = 50;
delta = 0.5;
a1 = linspace(a(1)-delta,a(1)+delta,mesh);
a2 = linspace(a(2)-delta,a(2)+delta,mesh);
logL = zeros(mesh); % Preallocate memory
for i = 1:mesh
for j = 1:mesh
logL(i,j) = gamlike([a1(i),a2(j)],X);
end
end
[A1,A2] = meshgrid(a1,a2);
surfc(A1,A2,logL)
The MATLAB function fminsearch is used to search for the minimum of
the negative log-likelihood surface:
LL = @(u)gamlike([u(1),u(2)],X); % Negative log-likelihood given X
MLES = fminsearch(LL,[1,2])
MLES =
    1.0231    1.9729
These can be compared to the MLEs returned by the gamfit function, which
uses a combination search and solve algorithm:
ahat = gamfit(X)
ahat =
    1.0231    1.9728
The MLEs can be added to the surface plot (rotated to show the minimum):
hold on
plot3(MLES(1),MLES(2),LL(MLES),...
'ro','MarkerSize',5,...
'MarkerFaceColor','r')
Random Number Generators
The Statistics Toolbox supports the generation of random numbers from
various distributions. Each RNG represents a parametric family of
distributions. RNGs return random numbers from the specified distribution
in an array of the specified dimensions. Specific RNG names for specific
distributions are in “Supported Distributions” on page 5-3.
Other random number generation functions which do not support specific
distributions include:
• cvpartition
• hmmgenerate
• lhsdesign
• lhsnorm
• mhsample
• random
• randsample
• slicesample
RNGs in Statistics Toolbox software depend on MATLAB’s default random
number stream via the rand and randn functions; each RNG uses one of
the techniques discussed in “Common Generation Methods” on page 6-5 to
generate random numbers from a given distribution.
By controlling the default random number stream and its state, you can
control how the RNGs in Statistics Toolbox software generate random values.
For example, to reproduce the same sequence of values from an RNG, you can
save and restore the default stream’s state, or reset the default stream. For
details on managing the default random number stream, see “Managing the
Global Stream”.
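For instance, a sketch of reproducing a sequence by saving and restoring the default stream's state, using the R2011-era RandStream interface (the same class used in the startup.m example below):

defaultStream = RandStream.getDefaultStream;  % handle to the default stream
savedState = defaultStream.State;             % save the current state
x1 = exprnd(1,5,1);                           % draw some values
defaultStream.State = savedState;             % restore the saved state
x2 = exprnd(1,5,1);                           % x2 is identical to x1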
MATLAB initializes the default random number stream to the same state
each time it starts up. Thus, RNGs in Statistics Toolbox software will
generate the same sequence of values for each MATLAB session unless you
modify that state at startup. One simple way to do that is to add commands
to startup.m such as
stream = RandStream('mt19937ar','seed',sum(100*clock));
RandStream.setDefaultStream(stream);
that initialize MATLAB’s default random number stream to a different state
for each session.
Dependencies of the Random Number Generators
The following table lists the dependencies of Statistics Toolbox RNGs on the
MATLAB base RNGs rand and/or randn.
RNG            MATLAB Base RNG
betarnd        rand, randn
binornd        rand
chi2rnd        rand, randn
evrnd          rand
exprnd         rand
frnd           rand, randn
gamrnd         rand, randn
geornd         rand
gevrnd         rand
gprnd          rand
hygernd        rand
iwishrnd       rand, randn
johnsrnd       randn
lhsdesign      rand
lhsnorm        rand
lognrnd        randn
mhsample       rand or randn, depending on the RNG given for the proposal distribution
mvnrnd         randn
mvtrnd         rand, randn
nbinrnd        rand, randn
ncfrnd         rand, randn
nctrnd         rand, randn
ncx2rnd        randn
normrnd        randn
pearsrnd       rand or randn, depending on the distribution type
poissrnd       rand, randn
random         rand or randn, depending on the specified distribution
randsample     rand
raylrnd        randn
slicesample    rand
trnd           rand, randn
unidrnd        rand
unifrnd        rand
wblrnd         rand
wishrnd        rand, randn
Using Probability Distribution Objects
In this section...
“Using Distribution Objects” on page 5-84
“What are Objects?” on page 5-85
“Creating Distribution Objects” on page 5-88
“Object-Supported Distributions” on page 5-89
“Performing Calculations Using Distribution Objects” on page 5-90
“Capturing Results Using Distribution Objects” on page 5-97
Using Distribution Objects
For many distributions supported by Statistics Toolbox software, objects are
available for statistical analysis. This section gives a general overview of the
uses of distribution objects, including sample work flows. For information
on objects available for specific distributions, see “Object-Supported
Distributions” on page 5-89.
Probability distribution objects allow you to easily fit, access, and store
distribution information for a given data set. The following operations are
easier to perform using distribution objects:
• Grouping a single dataset in a number of different ways using group
names, and then fitting a distribution to each group. For an example of how
to fit distributions to grouped data, see “Example: Fitting Distributions to
Grouped Data Within a Single Dataset” on page 5-91.
• Fitting different distributions to the same set of data. For an example of
how objects make fitting multiple distribution types easier, see “Example:
Fitting Multiple Distribution Types to a Single Dataset” on page 5-95.
• Sharing fitted distributions across workspaces. For an example of sharing
information using probability distribution objects, see “Example: Saving
and Sharing Distribution Fit Data” on page 5-97.
Deciding to Use Distribution Objects
If you know the type of distribution you would like to use, objects provide a
less complex interface than functions and a more efficient functionality than
the dfittool GUI.
If you are a novice statistician who would like to explore how various
distributions look without having to manipulate data, see “Working with
Distributions Through GUIs” on page 5-9.
If you have no data to fit, but want to calculate a pdf, cdf, and so on for various
parameters, see “Statistics Toolbox Distribution Functions” on page 5-52.
What are Objects?
Objects are, in short, a convenient way of storing data. They allow you to set
rules for the types of data to store, while maintaining some flexibility for the
actual values of the data. For example, in statistics groups of distributions
have some general things in common:
• All distributions have a name (for example, Normal).
• Parametric distributions have parameters.
• Nonparametric distributions have kernel-smoothing functions.
Objects store all this information within properties. Classes of related
objects (for example, all univariate parametric distributions) have the same
properties with values and types relevant to a specified distribution. In
addition to storing information within objects, you can perform certain actions
(called methods) on objects.
Subclasses (for example, ProbDistParametric is a subclass of ProbDist)
contain the same properties and methods as the original class, in addition to
other properties relevant to that subclass. This concept is called inheritance.
Inheritance means that subclasses of a class have all of its properties and
methods. For example, parametric distributions, which are a subset (subclass)
of probability distributions, have input data and a distribution name. The
following diagram illustrates this point:
The left side of this diagram shows the inheritance line from all probability
distributions down to univariate parametric probability distributions. The
right side shows the lineage down to univariate kernel distributions. Here is
how to interpret univariate parametric distribution lineage:
• ProbDist is a class of objects that includes all probability distributions. All
probability distribution objects have at least these properties:
  - DistName — the name of the distribution (for example, Normal or Weibull)
  - InputData — the data fit to the distribution
  In addition, you can perform the following actions on these objects, using
  the following methods:
  - cdf — Return the cumulative distribution function for a specified
    distribution.
  - pdf — Return the probability density function for a specified distribution.
  - random — Generate random numbers based on a specified distribution.
• ProbDistParametric is a class of objects that includes all parametric
  probability distributions. All parametric probability distribution objects
  have the properties and methods of a ProbDist object, in addition to at
  least the following properties:
  - NLogL — Negative log likelihood for input data
  - NumParams — Number of parameters for that distribution
  - ParamCov — Covariance matrix of parameter estimates
  - ParamDescription — Descriptions of parameters
  - ParamNames — Names of parameters
  - Params — Values of parameters
  No additional unique methods apply to ProbDistParametric objects.
• ProbDistUnivParam is a class of objects that includes only univariate
  parametric probability distributions. In addition to the properties and
  methods of ProbDist and ProbDistParametric objects, these objects also
  have at least the following methods:
  - icdf — Return the inverse cumulative distribution function for a specified
    distribution based on a given set of data.
  - iqr — Return the interquartile range for a specified distribution based
    on a given set of data.
  - mean — Return the mean for a specified distribution based on a given
    set of data.
  - median — Return the median for a specified distribution based on a
    given set of data.
  - paramci — Return the parameter confidence intervals for a specified
    distribution based on a given set of data.
  - std — Return the standard deviation for a specified distribution based
    on a given set of data.
  - var — Return the variance for a specified distribution based on a given
    set of data.
  No additional unique properties apply to ProbDistUnivParam objects.
The univariate nonparametric lineage reads in a similar manner, with
different properties and methods. For more information on nonparametric
objects and their methods and properties, see ProbDistKernel and
ProbDistUnivKernel.
For more detailed information on object-oriented programming in MATLAB,
see Object-Oriented Programming.
Creating Distribution Objects
There are two ways to create distribution objects:
• Use the fitdist function. See “Creating Distribution Objects Using
fitdist” on page 5-88.
• Use the object constructor. See “Creating Distribution Objects Using
Constructors” on page 5-88.
Creating Distribution Objects Using fitdist
Using the fitdist function is the simplest way of creating distribution
objects. Like the *fit functions, fitdist fits your data to a specified
distribution and returns relevant distribution information. fitdist creates
an object relevant to the type of distribution you specify: if you specify a
parametric distribution, it returns a ProbDistUnivParam object. For examples
of how to use fitdist to fit your data, see “Performing Calculations Using
Distribution Objects” on page 5-90.
Creating Distribution Objects Using Constructors
If you know the distribution you would like to use and would like to create a
univariate parametric distribution with known parameters, you can use the
ProbDistUnivParam constructor. For example, create a normal distribution
with mean 100 and standard deviation 10:
pd = ProbDistUnivParam('normal',[100 10])
For nonparametric distributions, you must have a dataset. Using
fitdist is a simpler way to fit nonparametric data, but you can use
the ProbDistUnivKernel constructor as well. For example, create a
nonparametric distribution of the MPG data from carsmall.mat:
load carsmall
pd = ProbDistUnivKernel(MPG)
Object-Supported Distributions
Object-oriented programming in the Statistics Toolbox supports the following
distributions.
Parametric Distributions
Use the following distributions to create ProbDistUnivParam objects using
fitdist. For more information on the cumulative distribution function (cdf)
and probability density function (pdf) methods, as well as other available
methods, see the ProbDistUnivParam class reference page.
Supported Distribution                                    Input to fitdist
“Beta Distribution” on page B-4                           'beta'
“Binomial Distribution” on page B-7                       'binomial'
“Birnbaum-Saunders Distribution” on page B-10             'birnbaumsaunders'
“Exponential Distribution” on page B-16                   'exponential'
“Extreme Value Distribution” on page B-19                 'extreme value' or 'ev'
“Gamma Distribution” on page B-27                         'gamma'
“Generalized Extreme Value Distribution” on page B-32     'generalized extreme value' or 'gev'
“Generalized Pareto Distribution” on page B-37            'generalized pareto' or 'gp'
“Inverse Gaussian Distribution” on page B-45              'inversegaussian'
“Logistic Distribution” on page B-49                      'logistic'
“Loglogistic Distribution” on page B-50                   'loglogistic'
“Lognormal Distribution” on page B-51                     'lognormal'
“Nakagami Distribution” on page B-70                      'nakagami'
“Negative Binomial Distribution” on page B-72             'negative binomial' or 'nbin'
“Normal Distribution” on page B-83                        'normal'
“Poisson Distribution” on page B-89                       'poisson'
“Rayleigh Distribution” on page B-91                      'rayleigh'
“Rician Distribution” on page B-93                        'rician'
“t Location-Scale Distribution” on page B-97              'tlocationscale'
“Weibull Distribution” on page B-103                      'weibull' or 'wbl'
Nonparametric Distributions
Use the following distributions to create ProbDistUnivKernel objects.
For more information on the cumulative distribution function (cdf) and
probability density function (pdf) methods, as well as other available
methods, see the ProbDistUnivKernel class reference page.
Supported Distribution                                    Input to fitdist
“Nonparametric Distributions” on page B-82                'kernel'
Performing Calculations Using Distribution Objects
Distribution objects make it easier for you to perform calculations on complex
datasets. The following sample workflows show some of the functionality
of these objects.
• “Example: Fitting a Single Distribution to a Single Dataset” on page 5-91
• “Example: Fitting Distributions to Grouped Data Within a Single Dataset”
on page 5-91
• “Example: Fitting Multiple Distribution Types to a Single Dataset” on
page 5-95
Example: Fitting a Single Distribution to a Single Dataset
Fit a single Normal distribution to a dataset using fitdist:
load carsmall
NormDist = fitdist(MPG,'normal')
NormDist =
normal distribution
mu = 23.7181
sigma = 8.03573
The output MATLAB returns is a ProbDistUnivParam object with a DistName
property of 'normal distribution'. The ParamNames property contains the
strings mu and sigma, while the Params property contains the parameter
values.
Example: Fitting Distributions to Grouped Data Within a Single
Dataset
Often, datasets are collections of data you can group in different ways. Using
fitdist and the data from carsmall.mat, group the MPG data by country of
origin, then fit a Weibull distribution to each group:
load carsmall
[WeiByOrig, Country] = fitdist(MPG,'weibull','by',Origin)
Warning: Error while fitting group 'Italy':
Not enough data in X to fit this distribution.
> In fitdist at 171
WeiByOrig =
Columns 1 through 4
    [1x1 ProbDistUnivParam]    [1x1 ProbDistUnivParam]    [1x1 ProbDistUnivParam]    [1x1 ProbDistUnivParam]
Columns 5 through 6
    [1x1 ProbDistUnivParam]    []
Country =
'USA'
'France'
'Japan'
'Germany'
'Sweden'
'Italy'
A warning appears informing you that, since the data only represents one
Italian car, fitdist cannot fit a Weibull distribution to that group. Each
one of the five other groups now has a distribution object associated with it,
represented in the cell array WeiByOrig. Each object contains properties that hold
information about the data, the distribution, and the parameters. For more
information on what properties exist and what information they contain, see
ProbDistUnivParam or ProbDistUnivKernel.
Now access two of the objects and their properties:
% Get USA fit
distusa = WeiByOrig{1};
% Use the InputData property of ProbDistUnivParam objects to see
% the actual data used to fit the distribution:
dusa = distusa.InputData.data;
% Get Japan fit and data
distjapan = WeiByOrig{3};
djapan = distjapan.InputData.data;
Now you can easily compare PDFs using the pdf method of the
ProbDistUnivParam class:
time = linspace(0,45);
pdfjapan = pdf(distjapan,time);
pdfusa = pdf(distusa,time);
hold on
plot(time,[pdfjapan;pdfusa])
l = legend('Japan','USA')
set(l,'Location','Best')
xlabel('MPG')
ylabel('Probability Density')
You could then further group the data and compare, for example, MPG by
year for American cars:
load carsmall
[WeiByYearOrig, Names] = fitdist(MPG,'weibull','by',...
{Origin Model_Year});
USA70 = WeiByYearOrig{1};
USA76 = WeiByYearOrig{2};
USA82 = WeiByYearOrig{3};
time = linspace(0,45);
pdf70 = pdf(USA70,time);
pdf76 = pdf(USA76,time);
pdf82 = pdf(USA82,time);
line(time,[pdf70;pdf76;pdf82])
l = legend('1970','1976','1982')
set(l,'Location','Best')
title('USA Car MPG by Year')
xlabel('MPG')
ylabel('Probability Density')
Example: Fitting Multiple Distribution Types to a Single Dataset
Distribution objects make it easy to fit multiple distributions to the same
dataset, while minimizing workspace clutter. For example, use fitdist to
group the MPG data by country of origin, then fit Weibull, Normal, Logistic,
and nonparametric distributions for each group:
load carsmall;
[WeiByOrig, Country] = fitdist(MPG,'weibull','by',Origin);
[NormByOrig, Country] = fitdist(MPG,'normal','by',Origin);
[LogByOrig, Country] = fitdist(MPG,'logistic','by',Origin);
[KerByOrig, Country] = fitdist(MPG,'kernel','by',Origin);
Extract the fits for American cars and compare the fits visually against a
histogram of the original data:
WeiUSA = WeiByOrig{1};
NormUSA = NormByOrig{1};
LogUSA = LogByOrig{1};
KerUSA = KerByOrig{1};
% Since all four fits use the same set of data,
% you can extract the data from any of them:
data = WeiUSA.InputData.data;
% Create a histogram of the data:
[n,y] = hist(data,10);
b = bar(y,n,'hist');
set(b,'FaceColor',[1,0.8,0])
% Scale the density by the histogram area, for easier display:
area = sum(n) * (y(2)-y(1));
time = linspace(0,45);
pdfWei = pdf(WeiUSA,time);
pdfNorm = pdf(NormUSA,time);
pdfLog = pdf(LogUSA,time);
pdfKer = pdf(KerUSA,time);
allpdf = [pdfWei;pdfNorm;pdfLog;pdfKer];
line(time,area * allpdf)
l = legend('Data','Weibull','Normal','Logistic','Kernel')
set(l,'Location','Best')
title('USA Car')
xlabel('MPG')
You can see that only the nonparametric kernel distribution, KerUSA, comes
close to revealing the two modes in the data.
Capturing Results Using Distribution Objects
Distribution objects allow you to share both your dataset and your analysis
results simply by saving the information to a .mat file.
Example: Saving and Sharing Distribution Fit Data
Using the premise from the previous set of examples, group the MPG data
in carsmall.mat by country of origin and fit four different distributions to
each of the six sets of data:
load carsmall;
[WeiByOrig, Country] = fitdist(MPG,'weibull','by',Origin);
[NormByOrig, Country] = fitdist(MPG,'normal','by',Origin);
[LogByOrig, Country] = fitdist(MPG,'logistic','by',Origin);
[KerByOrig, Country] = fitdist(MPG,'kernel','by',Origin);
Combine all four fits and the country labels into a single cell array, including
“headers” to indicate which distributions correspond to which objects. Then,
save the array to a .mat file:
AllFits = ['Country' Country'; 'Weibull' WeiByOrig;...
    'Normal' NormByOrig; 'Logistic' LogByOrig; 'Kernel',...
    KerByOrig];
save('CarSmallFits.mat','AllFits');
To show that the data is both safely saved and easily restored, clear your
workspace of relevant variables. This command clears only those variables
associated with this example:
clear('Weight','Acceleration','AllFits','Country',...
'Cylinders','Displacement','Horsepower','KerByOrig',...
'LogByOrig','MPG','Model','Model_Year','NormByOrig',...
'Origin','WeiByOrig')
Now, load the data:
load CarSmallFits
AllFits
You can now access the distributions objects as in the previous examples.
Probability Distributions Used for Multivariate Modeling
In this section...
“Gaussian Mixture Models” on page 5-99
“Copulas” on page 5-107
Gaussian Mixture Models
• “Creating Gaussian Mixture Models” on page 5-99
• “Simulating Gaussian Mixtures” on page 5-105
Gaussian mixture models are formed by combining multivariate normal
density components. For information on individual multivariate normal
densities, see “Multivariate Normal Distribution” on page B-58 and related
distribution functions listed under “Multivariate Distributions” on page 5-8.
In Statistics Toolbox software, use the gmdistribution class to fit data
using an expectation maximization (EM) algorithm, which assigns posterior
probabilities to each component density with respect to each observation. The
fitting method uses an iterative algorithm that converges to a local optimum.
Clustering using Gaussian mixture models is sometimes considered a soft
clustering method. The posterior probabilities for each point indicate that
each data point has some probability of belonging to each cluster.
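As a sketch, assuming obj is a fitted gmdistribution object and X is the data it was fit to (both are constructed later in this section), the posterior method returns these membership probabilities:

P = posterior(obj,X);   % row i holds the posterior probability of each component for observation i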
For more information on clustering with Gaussian mixture models, see
“Gaussian Mixture Models” on page 11-28. This section describes their
creation.
Creating Gaussian Mixture Models
• “Specifying a Model” on page 5-100
• “Fitting a Model to Data” on page 5-102
Specifying a Model. Use the gmdistribution constructor to create
Gaussian mixture models with specified means, covariances, and mixture
proportions. The following creates an object of the gmdistribution class
defining a two-component mixture of bivariate Gaussian distributions:
MU = [1 2;-3 -5]; % Means
SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]); % Covariances
p = ones(1,2)/2; % Mixing proportions
obj = gmdistribution(MU,SIGMA,p);
Display properties of the object with the MATLAB function fieldnames:
properties = fieldnames(obj)
properties =
'NDimensions'
'DistName'
'NComponents'
'PComponents'
'mu'
'Sigma'
'NlogL'
'AIC'
'BIC'
'Converged'
'Iters'
'SharedCov'
'CovType'
'RegV'
The gmdistribution reference page describes these properties. To access the
value of a property, use dot indexing:
dimension = obj.NDimensions
dimension =
2
name = obj.DistName
name =
gaussian mixture distribution
Use the methods pdf and cdf to compute values and visualize the object:
ezsurf(@(x,y)pdf(obj,[x y]),[-10 10],[-10 10])
ezsurf(@(x,y)cdf(obj,[x y]),[-10 10],[-10 10])
Fitting a Model to Data. You can also create Gaussian mixture models
by fitting a parametric model with a specified number of components to
data. The fit method of the gmdistribution class uses the syntax obj =
gmdistribution.fit(X,k), where X is a data matrix and k is the specified
number of components. Choosing a suitable number of components k is
essential for creating a useful model of the data—too few components fails to
model the data accurately; too many components leads to an over-fit model
with singular covariance matrices.
The following example illustrates this approach.
First, create some data from a mixture of two bivariate Gaussian distributions
using the mvnrnd function:
MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000);
mvnrnd(MU2,SIGMA2,1000)];
scatter(X(:,1),X(:,2),10,'.')
Next, fit a two-component Gaussian mixture model:
options = statset('Display','final');
obj = gmdistribution.fit(X,2,'Options',options);
hold on
h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);
hold off
Among the properties of the fit are the parameter estimates:
ComponentMeans = obj.mu
ComponentMeans =
    0.9391    2.0322
   -2.9823   -4.9737
ComponentCovariances = obj.Sigma
ComponentCovariances(:,:,1) =
    1.7786   -0.0528
   -0.0528    0.5312
ComponentCovariances(:,:,2) =
    1.0491   -0.0150
   -0.0150    0.9816
MixtureProportions = obj.PComponents
MixtureProportions =
    0.5000    0.5000
The two-component model minimizes the Akaike information:
AIC = zeros(1,4);
obj = cell(1,4);
for k = 1:4
obj{k} = gmdistribution.fit(X,k);
AIC(k)= obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC);
numComponents
numComponents =
2
model = obj{2}
model =
Gaussian mixture distribution
with 2 components in 2 dimensions
Component 1:
Mixing proportion: 0.500000
Mean:    0.9391    2.0322
Component 2:
Mixing proportion: 0.500000
Mean:   -2.9823   -4.9737
Both the Akaike and Bayes information are negative log-likelihoods for the
data with penalty terms for the number of estimated parameters. You can use
them to determine an appropriate number of components for a model when
the number of components is unspecified.
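A parallel sketch using the BIC values of the fitted objects obj{k} from the loop above:

BIC = cellfun(@(g) g.BIC, obj);   % Bayes information for each candidate number of components
[minBIC,kBIC] = min(BIC)          % number of components minimizing BIC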
Simulating Gaussian Mixtures
Use the method random of the gmdistribution class to generate random data
from a Gaussian mixture model created with gmdistribution or fit.
For example, the following specifies a gmdistribution object consisting of a
two-component mixture of bivariate Gaussian distributions:
MU = [1 2;-3 -5];
SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]);
p = ones(1,2)/2;
obj = gmdistribution(MU,SIGMA,p);
ezcontour(@(x,y)pdf(obj,[x y]),[-10 10],[-10 10])
hold on
Use random (gmdistribution) to generate 1000 random values:
Y = random(obj,1000);
scatter(Y(:,1),Y(:,2),10,'.')
Copulas
• “Determining Dependence Between Simulation Inputs” on page 5-108
• “Constructing Dependent Bivariate Distributions” on page 5-112
• “Using Rank Correlation Coefficients” on page 5-116
• “Using Bivariate Copulas” on page 5-119
• “Higher Dimension Copulas” on page 5-126
• “Archimedean Copulas” on page 5-128
• “Simulating Dependent Multivariate Data Using Copulas” on page 5-130
• “Example: Fitting Copulas to Data” on page 5-135
Copulas are functions that describe dependencies among variables, and
provide a way to create distributions that model correlated multivariate data.
Using a copula, you can construct a multivariate distribution by specifying
marginal univariate distributions, and then choose a copula to provide a
correlation structure between variables. Bivariate distributions, as well as
distributions in higher dimensions, are possible.
Determining Dependence Between Simulation Inputs
One of the design decisions for a Monte Carlo simulation is a choice of
probability distributions for the random inputs. Selecting a distribution
for each individual variable is often straightforward, but deciding what
dependencies should exist between the inputs may not be. Ideally, input
data to a simulation should reflect what you know about dependence among
the real quantities you are modeling. However, there may be little or no
information on which to base any dependence in the simulation. In such cases,
it is useful to experiment with different possibilities in order to determine
the model’s sensitivity.
It can be difficult to generate random inputs with dependence when they have
distributions that are not from a standard multivariate distribution. Further,
some of the standard multivariate distributions can model only limited types
of dependence. It is always possible to make the inputs independent, and
while that is a simple choice, it is not always sensible and can lead to the
wrong conclusions.
For example, a Monte-Carlo simulation of financial risk could have two
random inputs that represent different sources of insurance losses. You could
model these inputs as lognormal random variables. A reasonable question
to ask is how dependence between these two inputs affects the results of the
simulation. Indeed, you might know from real data that the same random
conditions affect both sources; ignoring that in the simulation could lead to
the wrong conclusions.
Example: Generate and Exponentiate Normal Random Variables.
The lognrnd function simulates independent lognormal random variables. In
the following example, the mvnrnd function generates n pairs of independent
normal random variables, which are then exponentiated. Notice that the
covariance matrix used here is diagonal:
n = 1000;
sigma = .5;
SigmaInd = sigma.^2 .* [1 0; 0 1]
SigmaInd =
    0.25    0
    0       0.25
ZInd = mvnrnd([0 0],SigmaInd,n);
XInd = exp(ZInd);
plot(XInd(:,1),XInd(:,2),'.')
axis([0 5 0 5])
axis equal
xlabel('X1')
ylabel('X2')
Dependent bivariate lognormal random variables are also easy to generate
using a covariance matrix with nonzero off-diagonal terms:
rho = .7;
SigmaDep = sigma.^2 .* [1 rho; rho 1]
SigmaDep =
    0.25     0.175
    0.175    0.25
ZDep = mvnrnd([0 0],SigmaDep,n);
XDep = exp(ZDep);
A second scatter plot demonstrates the difference between these two bivariate
distributions:
plot(XDep(:,1),XDep(:,2),'.')
axis([0 5 0 5])
axis equal
xlabel('X1')
ylabel('X2')
It is clear that there is a tendency in the second data set for large values of
X1 to be associated with large values of X2, and similarly for small values.
The correlation parameter, ρ, of the underlying bivariate normal determines
this dependence. The conclusions drawn from the simulation could well
depend on whether you generate X1 and X2 with dependence. The bivariate
lognormal distribution is a simple solution in this case; it easily generalizes
to higher dimensions in cases where the marginal distributions are different
lognormals.
Other multivariate distributions also exist. For example, the multivariate
t and the Dirichlet distributions simulate dependent t and beta random
variables, respectively. But the list of simple multivariate distributions is not
long, and they only apply in cases where the marginals are all in the same
family (or even the exact same distributions). This can be a serious limitation
in many situations.
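For example, a one-line sketch of dependent t random variables using mvtrnd (correlation 0.7, five degrees of freedom):

T = mvtrnd([1 .7; .7 1],5,1000);   % 1000 dependent bivariate t(5) pairs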
Constructing Dependent Bivariate Distributions
Although the construction discussed in the previous section creates a
bivariate lognormal that is simple, it serves to illustrate a method that is
more generally applicable.
1 Generate pairs of values from a bivariate normal distribution. There is
statistical dependence between these two variables, and each has a normal
marginal distribution.
2 Apply a transformation (the exponential function) separately to each
variable, changing the marginal distributions into lognormals. The
transformed variables still have a statistical dependence.
If a suitable transformation can be found, this method can be generalized to
create dependent bivariate random vectors with other marginal distributions.
In fact, a general method of constructing such a transformation does exist,
although it is not as simple as exponentiation alone.
By definition, applying the normal cumulative distribution function (cdf),
denoted here by Φ, to a standard normal random variable results in a random
variable that is uniform on the interval [0,1]. To see this, if Z has a standard
normal distribution, then the cdf of U = Φ(Z) is
$$\Pr\{U \le u\} = \Pr\{\Phi(Z) \le u\} = \Pr\{Z \le \Phi^{-1}(u)\} = u$$
and that is the cdf of a Unif(0,1) random variable. Histograms of some
simulated normal and transformed values demonstrate that fact:
n = 1000;
z = normrnd(0,1,n,1);
hist(z,-3.75:.5:3.75)
xlim([-4 4])
title('1000 Simulated N(0,1) Random Values')
xlabel('Z')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
u = normcdf(z);
hist(u,.05:.1:.95)
title('1000 Simulated N(0,1) Values Transformed to Unif(0,1)')
xlabel('U')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
Borrowing from the theory of univariate random number generation, applying
the inverse cdf of any distribution, F, to a Unif(0,1) random variable results in
a random variable whose distribution is exactly F (see “Inversion Methods”
on page 6-7). The proof is essentially the opposite of the preceding proof for
the forward case. Another histogram illustrates the transformation to a
gamma distribution:
x = gaminv(u,2,1);
hist(x,.25:.5:9.75)
title('1000 Simulated N(0,1) Values Transformed to Gamma(2,1)')
xlabel('X')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
You can apply this two-step transformation to each variable of a standard
bivariate normal, creating dependent random variables with arbitrary
marginal distributions. Because the transformation works on each component
separately, the two resulting random variables need not even have the same
marginal distributions. The transformation is defined as:
$$Z = [Z_1, Z_2] \sim N\!\left([0,0],\ \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right)$$
$$U = [\Phi(Z_1), \Phi(Z_2)]$$
$$X = [G_1(U_1), G_2(U_2)]$$
where G1 and G2 are inverse cdfs of two possibly different distributions. For
example, the following generates random vectors from a bivariate distribution
with t5 and Gamma(2,1) marginals:
n = 1000; rho = .7;
Z = mvnrnd([0 0],[1 rho; rho 1],n);
U = normcdf(Z);
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];
scatterhist(X(:,1),X(:,2),'Direction','out')
This plot has histograms alongside a scatter plot to show both the marginal
distributions, and the dependence.
Using Rank Correlation Coefficients
The correlation parameter, ρ, of the underlying bivariate normal determines
the dependence between X1 and X2 in this construction. However, the linear
correlation of X1 and X2 is not ρ. For example, in the original lognormal case,
a closed form for that correlation is:
$$\mathrm{cor}(X_1, X_2) = \frac{e^{\rho\sigma^2} - 1}{e^{\sigma^2} - 1}$$
which is strictly less than ρ, unless ρ is exactly 1. In more general cases such
as the Gamma/t construction, the linear correlation between X1 and X2 is
difficult or impossible to express in terms of ρ, but simulations show that the
same effect happens.
That is because the linear correlation coefficient expresses the linear
dependence between random variables, and when nonlinear transformations
are applied to those random variables, linear correlation is not preserved.
Instead, a rank correlation coefficient, such as Kendall’s τ or Spearman’s ρ,
is more appropriate.
Roughly speaking, these rank correlations measure the degree to which
large or small values of one random variable associate with large or small
values of another. However, unlike the linear correlation coefficient, they
measure the association only in terms of ranks. As a consequence, the rank
correlation is preserved under any monotonic transformation. In particular,
the transformation method just described preserves the rank correlation.
Therefore, knowing the rank correlation of the bivariate normal Z exactly
determines the rank correlation of the final transformed random variables,
X. While the linear correlation coefficient, ρ, is still needed to parameterize
the underlying bivariate normal, Kendall’s τ or Spearman’s ρ are more useful
in describing the dependence between random variables, because they are
invariant to the choice of marginal distribution.
For the bivariate normal, there is a simple one-to-one mapping between
Kendall’s τ or Spearman’s ρ, and the linear correlation coefficient ρ:
⎛ ⎞
or  = sin ⎜ ⎟
⎝ 2⎠
6
⎛⎞
⎛ ⎞
s = arcsin ⎜ ⎟ or  = 2sin ⎜ s ⎟

⎝ 6⎠
⎝2⎠
 =
2
arcsin (  )

The following plot shows the relationship:
rho = -1:.01:1;
tau = 2.*asin(rho)./pi;
rho_s = 6.*asin(rho./2)./pi;
plot(rho,tau,'b-','LineWidth',2)
hold on
plot(rho,rho_s,'g-','LineWidth',2)
plot([-1 1],[-1 1],'k:','LineWidth',2)
axis([-1 1 -1 1])
xlabel('rho')
ylabel('Rank correlation coefficient')
legend('Kendall''s {\it\tau}', ...
'Spearman''s {\it\rho_s}', ...
'location','NW')
Thus, it is easy to create the desired rank correlation between X1 and X2,
regardless of their marginal distributions, by choosing the correct ρ parameter
value for the linear correlation between Z1 and Z2.
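For instance, a sketch of targeting a Kendall's τ of 0.5 for the Gamma/t construction by inverting the mapping above:

tau_target = 0.5;
rho = sin(tau_target*pi/2);              % linear correlation for the underlying bivariate normal
Z = mvnrnd([0 0],[1 rho; rho 1],1000);
U = normcdf(Z);
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];
corr(X(:,1),X(:,2),'type','kendall')     % sample Kendall's tau, close to 0.5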
For the multivariate normal distribution, Spearman’s rank correlation is
almost identical to the linear correlation. However, this is not true once you
transform to the final random variables.
Using Bivariate Copulas
The first step of the construction described in the previous section defines
what is known as a bivariate Gaussian copula. A copula is a multivariate
probability distribution, where each random variable has a uniform marginal
distribution on the unit interval [0,1]. These variables may be completely
independent, deterministically related (e.g., U2 = U1), or anything in between.
Because of the possibility for dependence among variables, you can use a
copula to construct a new multivariate distribution for dependent variables.
By transforming each of the variables in the copula separately using the
inversion method, possibly using different cdfs, the resulting distribution can
have arbitrary marginal distributions. Such multivariate distributions are
often useful in simulations, when you know that the different random inputs
are not independent of each other.
Statistics Toolbox functions compute:
• Probability density functions (copulapdf) and the cumulative distribution
functions (copulacdf) for Gaussian copulas
• Rank correlations from linear correlations (copulastat) and vice versa
(copulaparam)
• Random vectors (copularnd)
• Parameters for copulas fit to data (copulafit)
For example, use the copularnd function to create scatter plots of random
values from a bivariate Gaussian copula for various levels of ρ, to illustrate the
range of different dependence structures. The family of bivariate Gaussian
copulas is parameterized by the linear correlation matrix:
$$P = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$$
U1 and U2 approach linear dependence as ρ approaches ±1, and approach
complete independence as ρ approaches zero:
n = 500;
U = copularnd('Gaussian',[1 .8; .8 1],n);
subplot(2,2,1)
plot(U(:,1),U(:,2),'.')
title('{\it\rho} = 0.8')
xlabel('U1')
ylabel('U2')
U = copularnd('Gaussian',[1 .1; .1 1],n);
subplot(2,2,2)
plot(U(:,1),U(:,2),'.')
title('{\it\rho} = 0.1')
xlabel('U1')
ylabel('U2')
U = copularnd('Gaussian',[1 -.1; -.1 1],n);
subplot(2,2,3)
plot(U(:,1),U(:,2),'.')
title('{\it\rho} = -0.1')
xlabel('U1')
ylabel('U2')
U = copularnd('Gaussian',[1 -.8; -.8 1],n);
subplot(2,2,4)
plot(U(:,1),U(:,2),'.')
title('{\it\rho} = -0.8')
xlabel('U1')
ylabel('U2')
The dependence between U1 and U2 is completely separate from the marginal
distributions of X1 = G(U1) and X2 = G(U2). X1 and X2 can be given any
marginal distributions, and still have the same rank correlation. This is
one of the main appeals of copulas—they allow this separate specification
of dependence and marginal distribution. You can also compute the pdf
(copulapdf) and the cdf (copulacdf) for a copula. For example, these plots
show the pdf and cdf for ρ = .8:
u1 = linspace(1e-3,1-1e-3,50);
u2 = linspace(1e-3,1-1e-3,50);
[U1,U2] = meshgrid(u1,u2);
Rho = [1 .8; .8 1];
f = copulapdf('t',[U1(:) U2(:)],Rho,5);
f = reshape(f,size(U1));
surf(u1,u2,log(f),'FaceColor','interp','EdgeColor','none')
view([-15,20])
xlabel('U1')
ylabel('U2')
zlabel('Probability Density')
u1 = linspace(1e-3,1-1e-3,50);
u2 = linspace(1e-3,1-1e-3,50);
[U1,U2] = meshgrid(u1,u2);
F = copulacdf('t',[U1(:) U2(:)],Rho,5);
F = reshape(F,size(U1));
surf(u1,u2,F,'FaceColor','interp','EdgeColor','none')
view([-15,20])
xlabel('U1')
ylabel('U2')
zlabel('Cumulative Probability')
A different family of copulas can be constructed by starting from a bivariate t
distribution and transforming using the corresponding t cdf. The bivariate t
distribution is parameterized with P, the linear correlation matrix, and ν, the
degrees of freedom. Thus, for example, you can speak of a t1 or a t5 copula,
based on the multivariate t with one and five degrees of freedom, respectively.
Just as for Gaussian copulas, Statistics Toolbox functions for t copulas
compute:
• Probability density functions (copulapdf) and the cumulative distribution
functions (copulacdf) for t copulas
• Rank correlations from linear correlations (copulastat) and vice versa
(copulaparam)
• Random vectors (copularnd)
• Parameters for copulas fit to data (copulafit)
For example, use the copularnd function to create scatter plots of random
values from a bivariate t1 copula for various levels of ρ, to illustrate the range
of different dependence structures:
n = 500;
nu = 1;
U = copularnd('t',[1 .8; .8 1],nu,n);
subplot(2,2,1)
plot(U(:,1),U(:,2),'.')
title('{\it\rho} = 0.8')
xlabel('U1')
ylabel('U2')
U = copularnd('t',[1 .1; .1 1],nu,n);
subplot(2,2,2)
plot(U(:,1),U(:,2),'.')
title('{\it\rho} = 0.1')
xlabel('U1')
ylabel('U2')
U = copularnd('t',[1 -.1; -.1 1],nu,n);
subplot(2,2,3)
plot(U(:,1),U(:,2),'.')
title('{\it\rho} = -0.1')
xlabel('U1')
ylabel('U2')
U = copularnd('t',[1 -.8; -.8 1],nu, n);
subplot(2,2,4)
plot(U(:,1),U(:,2),'.')
title('{\it\rho} = -0.8')
xlabel('U1')
ylabel('U2')
A t copula has uniform marginal distributions for U1 and U2, just as a
Gaussian copula does. The rank correlation τ or ρs between components in a t
copula is also the same function of ρ as for a Gaussian. However, as these plots
demonstrate, a t1 copula differs quite a bit from a Gaussian copula, even when
their components have the same rank correlation. The difference is in their
dependence structure. Not surprisingly, as the degrees of freedom parameter
ν is made larger, a tν copula approaches the corresponding Gaussian copula.
As with a Gaussian copula, any marginal distributions can be imposed over
a t copula. For example, using a t copula with 1 degree of freedom, you can
again generate random vectors from a bivariate distribution with Gamma(2,1)
and t5 marginals using copularnd:
n = 1000;
rho = .7;
nu = 1;
U = copularnd('t',[1 rho; rho 1],nu,n);
X = [gaminv(U(:,1),2,1) tinv(U(:,2),5)];
scatterhist(X(:,1),X(:,2),'Direction','out')
Compared to the bivariate Gamma/t distribution constructed earlier, which
was based on a Gaussian copula, the distribution constructed here, based on a
t1 copula, has the same marginal distributions and the same rank correlation
between variables but a very different dependence structure. This illustrates
the fact that multivariate distributions are not uniquely defined by their
marginal distributions, or by their correlations. The choice of a particular
copula in an application may be based on actual observed data, or different
copulas may be used as a way of determining the sensitivity of simulation
results to the input distribution.
Higher Dimension Copulas
The Gaussian and t copulas are known as elliptical copulas. It is easy to
generalize elliptical copulas to a higher number of dimensions. For example,
simulate data from a trivariate distribution with Gamma(2,1), Beta(2,2), and
t5 marginals using a Gaussian copula and copularnd, as follows:
n = 1000;
Rho = [1 .4 .2; .4 1 -.8; .2 -.8 1];
U = copularnd('Gaussian',Rho,n);
X = [gaminv(U(:,1),2,1) betainv(U(:,2),2,2) tinv(U(:,3),5)];
subplot(1,1,1)
plot3(X(:,1),X(:,2),X(:,3),'.')
grid on
view([-55, 15])
xlabel('X1')
ylabel('X2')
zlabel('X3')
Notice that the relationship between the linear correlation parameter ρ and,
for example, Kendall’s τ, holds for each entry in the correlation matrix P
used here. You can verify that the sample rank correlations of the data are
approximately equal to the theoretical values:
tauTheoretical = 2.*asin(Rho)./pi
tauTheoretical =
          1      0.26198      0.12819
    0.26198            1     -0.59033
    0.12819     -0.59033            1
tauSample = corr(X,'type','Kendall')
tauSample =
          1      0.27254      0.12701
    0.27254            1     -0.58182
    0.12701     -0.58182            1
Archimedean Copulas
Statistics Toolbox functions are available for three bivariate Archimedean
copula families:
• Clayton copulas
• Frank copulas
• Gumbel copulas
These are one-parameter families that are defined directly in terms of their
cdfs, rather than being defined constructively using a standard multivariate
distribution.
To compare these three Archimedean copulas to the Gaussian and t bivariate
copulas, first use the copulastat function to find the rank correlation for
a Gaussian or t copula with linear correlation parameter of 0.8, and then
use the copulaparam function to find the Clayton copula parameter that
corresponds to that rank correlation:
tau = copulastat('Gaussian',.8 ,'type','kendall')
tau =
0.59033
alpha = copulaparam('Clayton',tau,'type','kendall')
alpha =
2.882
Finally, plot a random sample from the Clayton copula with copularnd.
Repeat the same procedure for the Frank and Gumbel copulas:
n = 500;
U = copularnd('Clayton',alpha,n);
subplot(3,1,1)
plot(U(:,1),U(:,2),'.');
title(['Clayton Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)])
xlabel('U1')
ylabel('U2')
alpha = copulaparam('Frank',tau,'type','kendall');
U = copularnd('Frank',alpha,n);
subplot(3,1,2)
plot(U(:,1),U(:,2),'.')
title(['Frank Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)])
xlabel('U1')
ylabel('U2')
alpha = copulaparam('Gumbel',tau,'type','kendall');
U = copularnd('Gumbel',alpha,n);
subplot(3,1,3)
plot(U(:,1),U(:,2),'.')
title(['Gumbel Copula, {\it\alpha} = ',sprintf('%0.2f',alpha)])
xlabel('U1')
ylabel('U2')
Simulating Dependent Multivariate Data Using Copulas
To simulate dependent multivariate data using a copula, you must specify
each of the following:
• The copula family (and any shape parameters)
• The rank correlations among variables
• Marginal distributions for each variable
Suppose you have return data for two stocks and want to run a Monte Carlo
simulation with inputs that follow the same distributions as the data:
load stockreturns
nobs = size(stocks,1);
subplot(2,1,1)
hist(stocks(:,1),10)
xlim([-3.5 3.5])
xlabel('X1')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
subplot(2,1,2)
hist(stocks(:,2),10)
xlim([-3.5 3.5])
xlabel('X2')
ylabel('Frequency')
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
You could fit a parametric model separately to each dataset, and use those
estimates as the marginal distributions. However, a parametric model may
not be sufficiently flexible. Instead, you can use a nonparametric model
to transform to the marginal distributions. All that is needed is a way to
compute the inverse cdf for the nonparametric model.
The simplest nonparametric model is the empirical cdf, as computed by the
ecdf function. For a discrete marginal distribution, this is appropriate.
However, for a continuous distribution, use a model that is smoother than
the step function computed by ecdf. One way to do that is to estimate
the empirical cdf and interpolate between the midpoints of the steps with
a piecewise linear function. Another way is to use kernel smoothing with
ksdensity. For example, compare the empirical cdf to a kernel smoothed cdf
estimate for the first variable:
[Fi,xi] = ecdf(stocks(:,1));
stairs(xi,Fi,'b','LineWidth',2)
hold on
Fi_sm = ksdensity(stocks(:,1),xi,'function','cdf','width',.15);
plot(xi,Fi_sm,'r-','LineWidth',1.5)
xlabel('X1')
ylabel('Cumulative Probability')
legend('Empirical','Smoothed','Location','NW')
grid on
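The piecewise linear alternative mentioned above is not demonstrated in this example; a minimal sketch (the variable names xmid, Fmid, and Flin are illustrative) interpolates between the midpoints of the empirical cdf steps:
[Fi,xi] = ecdf(stocks(:,1));
xi = xi(2:end); Fi = Fi(2:end);        % drop the duplicated first point
xmid = (xi(1:end-1) + xi(2:end))/2;    % midpoints of the step edges
Fmid = (Fi(1:end-1) + Fi(2:end))/2;    % midpoints of the step heights
Flin = @(x) interp1(xmid,Fmid,x,'linear','extrap');  % assumes distinct data values
plot(xi,Flin(xi),'g--','LineWidth',1)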
For the simulation, experiment with different copulas and correlations.
Here, you will use a bivariate t copula with a fairly small degrees of freedom
parameter. For the correlation parameter, you can compute the rank
correlation of the data, and then find the corresponding linear correlation
parameter for the t copula using copulaparam:
nu = 5;
tau = corr(stocks(:,1),stocks(:,2),'type','kendall')
tau =
0.51798
rho = copulaparam('t', tau, nu, 'type','kendall')
rho =
0.72679
Next, use copularnd to generate random values from the t copula and
transform using the nonparametric inverse cdfs. The ksdensity function
allows you to make a kernel estimate of the distribution and evaluate the inverse
cdf at the copula points all in one step:
n = 1000;
U = copularnd('t',[1 rho; rho 1],nu,n);
X1 = ksdensity(stocks(:,1),U(:,1),...
'function','icdf','width',.15);
X2 = ksdensity(stocks(:,2),U(:,2),...
'function','icdf','width',.15);
Alternatively, when you have a large amount of data or need to simulate more
than one set of values, it may be more efficient to compute the inverse cdf
over a grid of values in the interval (0,1) and use interpolation to evaluate it
at the copula points:
p = linspace(0.00001,0.99999,1000);
G1 = ksdensity(stocks(:,1),p,'function','icdf','width',0.15);
X1 = interp1(p,G1,U(:,1),'spline');
G2 = ksdensity(stocks(:,2),p,'function','icdf','width',0.15);
X2 = interp1(p,G2,U(:,2),'spline');
scatterhist(X1,X2,'Direction','out')
The marginal histograms of the simulated data are a smoothed version of the
histograms for the original data. The amount of smoothing is controlled by
the bandwidth input to ksdensity.
Example: Fitting Copulas to Data
The copulafit function is used to calibrate copulas with data. To generate
data Xsim with a distribution “just like” (in terms of marginal distributions
and correlations) the distribution of data in the matrix X:
1 Fit marginal distributions to the columns of X.
2 Use appropriate cdf functions to transform X to U, so that U has values
between 0 and 1.
3 Use copulafit to fit a copula to U.
4 Generate new data Usim from the copula.
5 Use appropriate inverse cdf functions to transform Usim to Xsim.
The following example illustrates the procedure.
Load and plot simulated stock return data:
load stockreturns
x = stocks(:,1);
y = stocks(:,2);
scatterhist(x,y,'Direction','out')
Transform the data to the copula scale (unit square) using a kernel estimator
of the cumulative distribution function:
u = ksdensity(x,x,'function','cdf');
v = ksdensity(y,y,'function','cdf');
scatterhist(u,v,'Direction','out')
xlabel('u')
ylabel('v')
Fit a t copula:
[Rho,nu] = copulafit('t',[u v],'Method','ApproximateML')
Rho =
    1.0000    0.7220
    0.7220    1.0000
nu =
  3.2017e+006
Generate a random sample from the t copula:
r = copularnd('t',Rho,nu,1000);
u1 = r(:,1);
v1 = r(:,2);
scatterhist(u1,v1,'Direction','out')
xlabel('u')
ylabel('v')
set(get(gca,'children'),'marker','.')
Transform the random sample back to the original scale of the data:
x1 = ksdensity(x,u1,'function','icdf');
y1 = ksdensity(y,v1,'function','icdf');
scatterhist(x1,y1,'Direction','out')
set(get(gca,'children'),'marker','.')
As the example illustrates, copulas integrate naturally with other distribution
fitting functions.
6 Random Number Generation
• “Generating Random Data” on page 6-2
• “Random Number Generation Functions” on page 6-3
• “Common Generation Methods” on page 6-5
• “Representing Sampling Distributions Using Markov Chain Samplers”
on page 6-13
• “Generating Quasi-Random Numbers” on page 6-15
• “Generating Data Using Flexible Families of Distributions” on page 6-25
Generating Random Data
Pseudorandom numbers are generated by deterministic algorithms. They are
"random" in the sense that, on average, they pass statistical tests regarding
their distribution and correlation. They differ from true random numbers in
that they are generated by an algorithm, rather than a truly random process.
Random number generators (RNGs) like those in MATLAB are algorithms for
generating pseudorandom numbers with a specified distribution.
For more information on random number generators for supported
distributions, see “Random Number Generators” on page 5-80.
For more information on the GUI for generating random numbers from
supported distributions, see “Visually Exploring Random Number Generation”
on page 5-49.
Random Number Generation Functions
The following table lists the supported distributions and their respective
random number generation functions. For more information on other
functions for each distribution, see “Supported Distributions” on page 5-3.
For more information on random number generators, see “Random Number
Generators” on page 5-80.
Distribution                  Random Number Generation Function
Beta                          betarnd, random, randtool
Binomial                      binornd, random, randtool
Chi-square                    chi2rnd, random, randtool
Clayton copula                copularnd
Exponential                   exprnd, random, randtool
Extreme value                 evrnd, random, randtool
F                             frnd, random, randtool
Frank copula                  copularnd
Gamma                         gamrnd, randg, random, randtool
Gaussian copula               copularnd
Gaussian mixture              random
Generalized extreme value     gevrnd, random, randtool
Generalized Pareto            gprnd, random, randtool
Geometric                     geornd, random, randtool
Gumbel copula                 copularnd
Hypergeometric                hygernd, random
Inverse Wishart               iwishrnd
Johnson system                johnsrnd
Lognormal                     lognrnd, random, randtool
Multinomial                   mnrnd
Multivariate normal           mvnrnd
Multivariate t                mvtrnd
Negative binomial             nbinrnd, random, randtool
Noncentral chi-square         ncx2rnd, random, randtool
Noncentral F                  ncfrnd, random, randtool
Noncentral t                  nctrnd, random, randtool
Normal (Gaussian)             normrnd, randn, random, randtool
Pearson system                pearsrnd
Piecewise                     random
Poisson                       poissrnd, random, randtool
Rayleigh                      raylrnd, random, randtool
Student’s t                   trnd, random, randtool
t copula                      copularnd
Uniform (continuous)          unifrnd, rand, random
Uniform (discrete)            unidrnd, random, randtool
Weibull                       wblrnd, random
Wishart                       wishrnd
Common Generation Methods
In this section...
“Direct Methods” on page 6-5
“Inversion Methods” on page 6-7
“Acceptance-Rejection Methods” on page 6-9
Methods for generating pseudorandom numbers usually start with uniform
random numbers, such as those the MATLAB rand function produces. The methods
described in this section detail how to produce random numbers from other
distributions.
Direct Methods
Direct methods directly use the definition of the distribution.
For example, consider binomial random numbers. A binomial random number
is the number of heads in N tosses of a coin with probability p of heads on
any single toss. If you generate N uniform random numbers on the interval
(0,1) and count the number less than p, then the count is a binomial random
number with parameters N and p.
This function is a simple implementation of a binomial RNG using the direct
approach:
function X = directbinornd(N,p,m,n)
X = zeros(m,n); % Preallocate memory
for i = 1:m*n
u = rand(N,1);
X(i) = sum(u < p);
end
For example:
X = directbinornd(100,0.3,1e4,1);
hist(X,101)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
The Statistics Toolbox function binornd uses a modified direct method, based
on the definition of a binomial random variable as the sum of Bernoulli
random variables.
You can easily convert the previous method to a random number generator
for the Poisson distribution with parameter λ. The Poisson distribution is
the limiting case of the binomial distribution as N approaches infinity, p
approaches zero, and Np is held fixed at λ. To generate Poisson random
numbers, create a version of the previous generator that inputs λ rather than
N and p, and internally sets N to some large number and p to λ/N.
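A minimal sketch of that modification follows (the function name directpoissrnd and the choice N = 1e4 are illustrative, not part of the toolbox):
function X = directpoissrnd(lambda,m,n)
N = 1e4;                 % a "large" number of Bernoulli trials
p = lambda/N;            % success probability so that N*p = lambda
X = zeros(m,n);          % Preallocate memory
for i = 1:m*n
    u = rand(N,1);
    X(i) = sum(u < p);   % approximately Poisson(lambda) for large N
end
For example, X = directpoissrnd(5,1e4,1); generates 10,000 values that are approximately Poisson with mean 5.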
The Statistics Toolbox function poissrnd actually uses two direct methods:
• A waiting time method for small values of λ
• A method due to Ahrens and Dieter for larger values of λ
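The waiting-time idea, for example, can be sketched as follows (an illustrative helper, not the actual poissrnd implementation): count how many uniform random numbers can be multiplied together before the running product drops below exp(-λ).
function X = waitingtimepoissrnd(lambda)
% Count unit-rate exponential arrivals in an interval of length lambda
% by multiplying uniforms until the product falls below exp(-lambda).
L = exp(-lambda);
X = 0;
prodU = rand;
while prodU > L
    X = X + 1;
    prodU = prodU*rand;
end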
Inversion Methods
Inversion methods are based on the observation that continuous cumulative
distribution functions (cdfs) range uniformly over the interval (0,1). If u is a
uniform random number on (0,1), then X = F⁻¹(u) generates a random number X from a continuous distribution with specified cdf F.
For example, the following code generates random numbers from a specific
exponential distribution using the inverse cdf and the MATLAB uniform
random number generator rand:
mu = 1;
X = expinv(rand(1e4,1),mu);
Compare the distribution of the generated random numbers to the pdf of the
specified exponential by scaling the pdf to the area of the histogram used
to display the distribution:
numbins = 50;
hist(X,numbins)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
hold on
[bincounts,binpositions] = hist(X,numbins);
binwidth = binpositions(2) - binpositions(1);
histarea = binwidth*sum(bincounts);
x = binpositions(1):0.001:binpositions(end);
y = exppdf(x,mu);
plot(x,histarea*y,'r','LineWidth',2)
Inversion methods also work for discrete distributions. To generate a random
number X from a discrete distribution with probability mass vector P(X = x_i) = p_i, where x_0 < x_1 < x_2 < ..., generate a uniform random number u on (0,1) and then set X = x_i if F(x_(i-1)) < u < F(x_i).
For example, the following function implements an inversion method for a
discrete distribution with probability mass vector p:
function X = discreteinvrnd(p,m,n)
X = zeros(m,n); % Preallocate memory
for i = 1:m*n
u = rand;
I = find(u < cumsum(p));
X(i) = min(I);
end
Use the function to generate random numbers from any discrete distribution:
p = [0.1 0.2 0.3 0.2 0.1 0.1]; % Probability mass vector
X = discreteinvrnd(p,1e4,1);
[n,x] = hist(X,length(p));
bar(1:length(p),n)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
Acceptance-Rejection Methods
The functional form of some distributions makes it difficult or time-consuming
to generate random numbers using direct or inversion methods.
Acceptance-rejection methods provide an alternative in these cases.
Acceptance-rejection methods begin with uniform random numbers, but
require an additional random number generator. If your goal is to generate a
random number from a continuous distribution with pdf f, acceptance-rejection
methods first generate a random number from a continuous distribution with
pdf g satisfying f(x) ≤ cg(x) for some c and all x.
A continuous acceptance-rejection RNG proceeds as follows:
1 Chooses a density g.
2 Finds a constant c such that f(x)/g(x) ≤ c for all x.
3 Generates a uniform random number u.
4 Generates a random number v from g.
5 If cu ≤ f(v)/g(v), accepts and returns v.
6 Otherwise, rejects v and goes to step 3.
For efficiency, a “cheap” method is necessary for generating random numbers
from g, and the scalar c should be small. The expected number of iterations to
produce a single random number is c.
The following function implements an acceptance-rejection method for
generating random numbers from pdf f, given f, g, the RNG grnd for g, and
the constant c:
function X = accrejrnd(f,g,grnd,c,m,n)
X = zeros(m,n); % Preallocate memory
for i = 1:m*n
accept = false;
while accept == false
u = rand();
v = grnd();
if c*u <= f(v)/g(v)
X(i) = v;
accept = true;
end
end
end
For example, the function f(x) = x·e^(−x²/2) satisfies the conditions for a pdf on [0,∞)
(nonnegative and integrates to 1). The exponential pdf with mean 1, g(x) = e^(−x),
dominates f, with f(x) ≤ cg(x), for c greater than about 2.2. Thus, you can use rand and exprnd
to generate random numbers from f:
f = @(x)x.*exp(-(x.^2)/2);
g = @(x)exp(-x);
grnd = @()exprnd(1);
X = accrejrnd(f,g,grnd,2.2,1e4,1);
The pdf f is actually a Rayleigh distribution with shape parameter 1. This
example compares the distribution of random numbers generated by the
acceptance-rejection method with those generated by raylrnd:
Y = raylrnd(1,1e4,1);
hist([X Y])
h = get(gca,'Children');
set(h(1),'FaceColor',[.8 .8 1])
legend('A-R RNG','Rayleigh RNG')
The Statistics Toolbox function raylrnd uses a transformation method,
expressing a Rayleigh random variable in terms of a chi-square random
variable, which you compute using randn.
Acceptance-rejection methods also work for discrete distributions. In this case,
the goal is to generate random numbers from a distribution with probability
mass P_p(X = i) = p_i, assuming that you have a method for generating random numbers from a distribution with probability mass P_q(X = i) = q_i. The RNG
proceeds as follows:
1 Chooses a density P_q.
2 Finds a constant c such that p_i/q_i ≤ c for all i.
3 Generates a uniform random number u.
4 Generates a random number v from P_q.
5 If cu ≤ p_v/q_v, accepts and returns v.
6 Otherwise, rejects v and goes to step 3.
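The following function is a minimal sketch of these steps (a generic helper, not a toolbox function). Here p and q are probability mass vectors over the support 1:K, qrnd is an RNG that returns indices distributed according to q, and c bounds p(i)/q(i) for all i:
function X = discreteaccrejrnd(p,q,qrnd,c,m,n)
X = zeros(m,n);              % Preallocate memory
for i = 1:m*n
    accept = false;
    while ~accept
        u = rand();
        v = qrnd();          % candidate index drawn from the proposal mass q
        if c*u <= p(v)/q(v)
            X(i) = v;
            accept = true;
        end
    end
end
For example, with a uniform proposal over six values:
p = [0.1 0.2 0.3 0.2 0.1 0.1];
q = ones(1,6)/6;
X = discreteaccrejrnd(p,q,@() randi(6),max(p./q),1e4,1);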
Representing Sampling Distributions Using Markov Chain
Samplers
In this section...
“Using the Metropolis-Hastings Algorithm” on page 6-13
“Using Slice Sampling” on page 6-14
The methods in “Common Generation Methods” on page 6-5 might be
inadequate when sampling distributions are difficult to represent in
computations. Such distributions arise, for example, in Bayesian data
analysis and in the large combinatorial problems of Markov chain Monte
Carlo (MCMC) simulations. An alternative is to construct a Markov chain
with a stationary distribution equal to the target sampling distribution, using
the states of the chain to generate random numbers after an initial burn-in
period in which the state distribution converges to the target.
Using the Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm draws samples from a distribution that
is only known up to a constant. Random numbers are generated from a
distribution with a probability density function that is equal to or proportional
to a proposal function.
To generate random numbers:
1 Assume an initial value x(t).
2 Draw a sample, y(t), from a proposal distribution q(y|x(t)).
3 Accept y(t) as the next sample x(t + 1) with probability r(x(t),y(t)), and keep
x(t) as the next sample x(t + 1) with probability 1 – r(x(t),y(t)), where:
r(x, y) = min{ [f(y) q(x|y)] / [f(x) q(y|x)], 1 }
4 Increment t → t+1, and repeat steps 2 and 3 until you get the desired
number of samples.
Generate random numbers using the Metropolis-Hastings method with
the mhsample function. To produce quality samples efficiently with the
Metropolis-Hastings algorithm, it is crucial to select a good proposal
distribution. If it is difficult to find an efficient proposal distribution, use
the slice sampling algorithm (slicesample) without explicitly specifying a
proposal distribution.
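For example, a brief sketch (the target density and proposal here are illustrative choices, not from the original text) samples from an unnormalized standard normal density using a symmetric random-walk proposal, so no proposal density needs to be supplied:
pdf = @(x) exp(-x.^2/2);         % target density, known only up to a constant
proprnd = @(x) x + randn;        % symmetric random-walk proposal
smpl = mhsample(0,5000,'pdf',pdf,'proprnd',proprnd,'symmetric',1);
hist(smpl,50)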
Using Slice Sampling
In instances where it is difficult to find an efficient Metropolis-Hastings
proposal distribution, the slice sampling algorithm does not require an explicit
specification. The slice sampling algorithm draws samples from the region
under the density function using a sequence of vertical and horizontal steps.
First, it selects a height at random from 0 to the density function f(x). Then,
it selects a new x value at random by sampling from the horizontal “slice” of
the density above the selected height. A similar slice sampling algorithm is
used for a multivariate distribution.
If a function f(x) proportional to the density function is given, then do the
following to generate random numbers:
1 Assume an initial value x(t) within the domain of f(x).
2 Draw a real value y uniformly from (0, f(x(t))), thereby defining a horizontal
“slice” as S = {x: y < f(x)}.
3 Find an interval I = (L, R) around x(t) that contains all, or much of the
“slice” S.
4 Draw the new point x(t+1) within this interval.
5 Increment t → t+1 and repeat steps 2 through 4 until you get the desired
number of samples.
Slice sampling can generate random numbers from a distribution with an
arbitrary form of the density function, provided that an efficient numerical
procedure is available to find the interval I = (L,R), which is the “slice” of
the density.
Generate random numbers using the slice sampling method with the
slicesample function.
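For example, a minimal sketch (the target density here is an illustrative choice) draws samples from a density that is known only up to a normalizing constant:
f = @(x) exp(-x.^2/2).*(1 + sin(3*x).^2);  % unnormalized target density
x = slicesample(1,2000,'pdf',f);           % start at x = 1, draw 2000 samples
hist(x,50)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])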
Generating Quasi-Random Numbers
In this section...
“Quasi-Random Sequences” on page 6-15
“Quasi-Random Point Sets” on page 6-16
“Quasi-Random Streams” on page 6-23
Quasi-Random Sequences
Quasi-random number generators (QRNGs) produce highly uniform samples
of the unit hypercube. QRNGs minimize the discrepancy between the
distribution of generated points and a distribution with equal proportions of
points in each sub-cube of a uniform partition of the hypercube. As a result,
QRNGs systematically fill the “holes” in any initial segment of the generated
quasi-random sequence.
Unlike the pseudorandom sequences described in “Common Generation
Methods” on page 6-5, quasi-random sequences fail many statistical tests for
randomness. Approximating true randomness, however, is not their goal.
Quasi-random sequences seek to fill space uniformly, and to do so in such a
way that initial segments approximate this behavior up to a specified density.
QRNG applications include:
• Quasi-Monte Carlo (QMC) integration. Monte Carlo techniques are
often used to evaluate difficult, multi-dimensional integrals without a
closed-form solution. QMC uses quasi-random sequences to improve the
convergence properties of these techniques.
• Space-filling experimental designs. In many experimental settings,
taking measurements at every factor setting is expensive or infeasible.
Quasi-random sequences provide efficient, uniform sampling of the design
space.
• Global optimization. Optimization algorithms typically find a local
optimum in the neighborhood of an initial value. By using a quasi-random
sequence of initial values, searches for global optima uniformly sample the
basins of attraction of all local minima.
Example: Using Scramble, Leap, and Skip
Imagine a simple 1-D sequence that produces the integers from 1 to 10. This
is the basic sequence and the first three points are [1,2,3].
Now look at how Scramble, Leap, and Skip work together:
• Scramble — Scrambling shuffles the points in one of several different
ways. In this example, assume a scramble turns the sequence into
1,3,5,7,9,2,4,6,8,10. The first three points are now [1,3,5].
• Skip — A Skip value specifies the number of initial points to ignore. In this
example, set the Skip value to 2. The sequence is now 5,7,9,2,4,6,8,10
and the first three points are [5,7,9].
• Leap — A Leap value specifies the number of points to ignore for each one
you take. Continuing the example with the Skip set to 2, if you set the Leap
to 1, the sequence uses every other point. In this example, the sequence is
now 5,9,4,8 and the first three points are [5,9,4].
Quasi-Random Point Sets
Statistics Toolbox functions support these quasi-random sequences:
• Halton sequences. Produced by the haltonset function. These sequences
use different prime bases to form successively finer uniform partitions of
the unit interval in each dimension.
• Sobol sequences. Produced by the sobolset function. These sequences
use a base of 2 to form successively finer uniform partitions of the unit
interval, and then reorder the coordinates in each dimension.
• Latin hypercube sequences. Produced by the lhsdesign function.
Though not quasi-random in the sense of minimizing discrepancy,
these sequences nevertheless produce sparse uniform samples useful in
experimental designs.
Quasi-random sequences are functions from the positive integers to the unit
hypercube. To be useful in application, an initial point set of a sequence must
be generated. Point sets are matrices of size n-by-d, where n is the number of
points and d is the dimension of the hypercube being sampled. The functions
haltonset and sobolset construct point sets with properties of a specified
quasi-random sequence. Initial segments of the point sets are generated by
the net method of the qrandset class (parent class of the haltonset class
and sobolset class), but points can be generated and accessed more generally
using parenthesis indexing.
Because of the way in which quasi-random sequences are generated, they
may contain undesirable correlations, especially in their initial segments, and
especially in higher dimensions. To address this issue, quasi-random point
sets often skip, leap over, or scramble values in a sequence. The haltonset
and sobolset functions allow you to specify both a Skip and a Leap property
of a quasi-random sequence, and the scramble method of the qrandset class
allows you apply a variety of scrambling techniques. Scrambling reduces
correlations while also improving uniformity.
Example: Generate a Quasi-Random Point Set
This example uses haltonset to construct a 2-D Halton point set—an object,
p, of the haltonset class—that skips the first 1000 values of the sequence
and then retains every 101st point:
p = haltonset(2,'Skip',1e3,'Leap',1e2)
p =
Halton point set in 2 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : none
The object p encapsulates properties of the specified quasi-random sequence.
The point set is finite, with a length determined by the Skip and Leap
properties and by limits on the size of point set indices (maximum index value of 2^53).
Use scramble to apply reverse-radix scrambling:
p = scramble(p,'RR2')
p =
Halton point set in 2 dimensions (8.918019e+013 points)
Properties:
Skip : 1000
Leap : 100
ScrambleMethod : RR2
Use net to generate the first 500 points:
X0 = net(p,500);
This is equivalent to:
X0 = p(1:500,:);
Values of the point set X0 are not generated and stored in memory until you
access p using net or parenthesis indexing.
To appreciate the nature of quasi-random numbers, create a scatter of the
two dimensions in X0:
scatter(X0(:,1),X0(:,2),5,'r')
axis square
title('{\bf Quasi-Random Scatter}')
Compare this to a scatter of uniform pseudorandom numbers generated by
the MATLAB rand function:
X = rand(500,2);
scatter(X(:,1),X(:,2),5,'b')
axis square
title('{\bf Uniform Random Scatter}')
The quasi-random scatter appears more uniform, avoiding the clumping in
the pseudorandom scatter.
In a statistical sense, quasi-random numbers are too uniform to pass
traditional tests of randomness. For example, a Kolmogorov-Smirnov test,
performed by kstest, is used to assess whether or not a point set has a
uniform random distribution. When performed repeatedly on uniform
pseudorandom samples, such as those generated by rand, the test produces
a uniform distribution of p-values:
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
x = rand(sampSize,1);
[h,pval] = kstest(x,[x,x]);
PVALS(test) = pval;
end
hist(PVALS,100)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('{\it p}-values')
ylabel('Number of Tests')
The results are quite different when the test is performed repeatedly on
uniform quasi-random samples:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
x = p(test:test+(sampSize-1),:);
[h,pval] = kstest(x,[x,x]);
PVALS(test) = pval;
end
hist(PVALS,100)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('{\it p}-values')
ylabel('Number of Tests')
Small p-values call into question the null hypothesis that the data are
uniformly distributed. If the hypothesis is true, about 5% of the p-values are
expected to fall below 0.05. The results are remarkably consistent in their
failure to challenge the hypothesis.
Quasi-Random Streams
Quasi-random streams, produced by the qrandstream function, are used
to generate sequential quasi-random outputs, rather than point sets of a
specific size. Streams are used like pseudoRNGs, such as rand, when client
applications require a source of quasi-random numbers of indefinite size that
can be accessed intermittently. Properties of a quasi-random stream, such
as its type (Halton or Sobol), dimension, skip, leap, and scramble, are set
when the stream is constructed.
In implementation, quasi-random streams are essentially very large
quasi-random point sets, though they are accessed differently. The state of a
quasi-random stream is the scalar index of the next point to be taken from the
stream. Use the qrand method of the qrandstream class to generate points
from the stream, starting from the current state. Use the reset method to
reset the state to 1. Unlike point sets, streams do not support parenthesis
indexing.
Example: Generate a Quasi-Random Stream
For example, the following code, taken from the example at the end of
“Quasi-Random Point Sets” on page 6-16, uses haltonset to create a
quasi-random point set p, and then repeatedly increments the index into the
point set, test, to generate different samples:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
x = p(test:test+(sampSize-1),:);
[h,pval] = kstest(x,[x,x]);
PVALS(test) = pval;
end
The same results are obtained by using qrandstream to construct a
quasi-random stream q based on the point set p and letting the stream take
care of increments to the index:
p = haltonset(1,'Skip',1e3,'Leap',1e2);
p = scramble(p,'RR2');
q = qrandstream(p)
nTests = 1e5;
sampSize = 50;
PVALS = zeros(nTests,1);
for test = 1:nTests
X = qrand(q,sampSize);
[h,pval] = kstest(X,[X,X]);
PVALS(test) = pval;
end
Generating Data Using Flexible Families of Distributions
In this section...
“Pearson and Johnson Systems” on page 6-25
“Generating Data Using the Pearson System” on page 6-26
“Generating Data Using the Johnson System” on page 6-28
Pearson and Johnson Systems
As described in “Using Probability Distributions” on page 5-2, choosing an
appropriate parametric family of distributions to model your data can be
based on a priori or a posteriori knowledge of the data-producing process,
but the choice is often difficult. The Pearson and Johnson systems can make
such a choice unnecessary. Each system is a flexible parametric family of
distributions that includes a wide range of distribution shapes, and it is often
possible to find a distribution within one of these two systems that provides
a good match to your data.
Data Input
The following parameters define each member of the Pearson and Johnson
systems:
• Mean — Estimated by mean
• Standard deviation — Estimated by std
• Skewness — Estimated by skewness
• Kurtosis — Estimated by kurtosis
These statistics can also be computed with the moment function. The Johnson
system, while based on these four parameters, is more naturally described
using quantiles, estimated by the quantile function.
The Statistics Toolbox functions pearsrnd and johnsrnd take input
arguments defining a distribution (parameters or quantiles, respectively) and
return the type and the coefficients of the distribution in the corresponding
system. Both functions also generate random numbers from the specified
distribution.
As an example, load the data in carbig.mat, which includes a variable MPG
containing measurements of the gas mileage for each car.
load carbig
MPG = MPG(~isnan(MPG));
[n,x] = hist(MPG,15);
bar(x,n)
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
The following two sections model the distribution with members of the
Pearson and Johnson systems, respectively.
Generating Data Using the Pearson System
The statistician Karl Pearson devised a system, or family, of distributions
that includes a unique distribution corresponding to every valid combination
of mean, standard deviation, skewness, and kurtosis. If you compute sample
values for each of these moments from data, it is easy to find the distribution
in the Pearson system that matches these four moments and to generate a
random sample.
The Pearson system embeds seven basic types of distribution together in
a single parametric framework. It includes common distributions such
as the normal and t distributions, simple transformations of standard
distributions such as a shifted and scaled beta distribution and the inverse
gamma distribution, and one distribution—the Type IV—that is not a simple
transformation of any standard distribution.
For a given set of moments, there are distributions that are not in the system
that also have those same first four moments, and the distribution in the
Pearson system may not be a good match to your data, particularly if the
data are multimodal. But the system does cover a wide range of distribution
shapes, including both symmetric and skewed distributions.
To generate a sample from the Pearson distribution that closely matches
the MPG data, simply compute the four sample moments and treat those as
distribution parameters.
moments = {mean(MPG),std(MPG),skewness(MPG),kurtosis(MPG)};
[r,type] = pearsrnd(moments{:},10000,1);
The optional second output from pearsrnd indicates which type of distribution
within the Pearson system matches the combination of moments.
type
type =
1
In this case, pearsrnd has determined that the data are best described with a
Type I Pearson distribution, which is a shifted, scaled beta distribution.
Verify that the sample resembles the original data by overlaying the empirical
cumulative distribution functions.
ecdf(MPG);
[Fi,xi] = ecdf(r);
hold on, stairs(xi,Fi,'r'); hold off
Generating Data Using the Johnson System
Statistician Norman Johnson devised a different system of distributions that
also includes a unique distribution for every valid combination of mean,
standard deviation, skewness, and kurtosis. However, since it is more natural
to describe distributions in the Johnson system using quantiles, working with
this system is different than working with the Pearson system.
The Johnson system is based on three possible transformations of a normal
random variable, plus the identity transformation. The three nontrivial cases
are known as SL, SU, and SB, corresponding to exponential, hyperbolic sine, and
logistic transformations, respectively. All three can be written as
X = γ + δ · Γ((Z − ξ)/λ)
where Z is a standard normal random variable, Γ is the transformation, and
γ, δ, ξ, and λ are scale and location parameters. The fourth case, SN, is the
identity transformation.
To generate a sample from the Johnson distribution that matches the MPG
data, first define the four quantiles to which the four evenly spaced standard
normal quantiles of -1.5, -0.5, 0.5, and 1.5 should be transformed. That is, you
compute the sample quantiles of the data for the cumulative probabilities of
0.067, 0.309, 0.691, and 0.933.
probs = normcdf([-1.5 -0.5 0.5 1.5])
probs =
     0.066807      0.30854      0.69146      0.93319
quantiles = quantile(MPG,probs)
quantiles =
   13.0000   18.0000   27.2000   36.0000
Then treat those quantiles as distribution parameters.
[r1,type] = johnsrnd(quantiles,10000,1);
The optional second output from johnsrnd indicates which type of distribution
within the Johnson system matches the quantiles.
type
type =
SB
You can verify that the sample resembles the original data by overlaying the
empirical cumulative distribution functions.
ecdf(MPG);
[Fi,xi] = ecdf(r1);
hold on, stairs(xi,Fi,'r'); hold off
In some applications, it may be important to match the quantiles better in
some regions of the data than in others. To do that, specify four evenly spaced
standard normal quantiles at which you want to match the data, instead
of the default -1.5, -0.5, 0.5, and 1.5. For example, you might care more
about matching the data in the right tail than in the left, and so you specify
standard normal quantiles that emphasize the right tail.
qnorm = [-.5 .25 1 1.75];
probs = normcdf(qnorm);
qemp = quantile(MPG,probs);
r2 = johnsrnd([qnorm; qemp],10000,1);
However, while the new sample matches the original data better in the right
tail, it matches much worse in the left tail.
[Fj,xj] = ecdf(r2);
hold on, stairs(xj,Fj,'g'); hold off
7 Hypothesis Tests
• “Introduction” on page 7-2
• “Hypothesis Test Terminology” on page 7-3
• “Hypothesis Test Assumptions” on page 7-5
• “Example: Hypothesis Testing” on page 7-7
• “Available Hypothesis Tests” on page 7-13
Introduction
Hypothesis testing is a common method of drawing inferences about a
population based on statistical evidence from a sample.
As an example, suppose someone says that at a certain time in the state
of Massachusetts the average price of a gallon of regular unleaded gas was
$1.15. How could you determine the truth of the statement? You could try to
find prices at every gas station in the state at the time. That approach would
be definitive, but it could be time-consuming, costly, or even impossible.
A simpler approach would be to find prices at a small number of randomly
selected gas stations around the state, and then compute the sample average.
Sample averages differ from one another due to chance variability in the
selection process. Suppose your sample average comes out to be $1.18. Is the
$0.03 difference an artifact of random sampling or significant evidence that
the average price of a gallon of gas was in fact greater than $1.15? Hypothesis
testing is a statistical method for making such decisions.
Hypothesis Test Terminology
All hypothesis tests share the same basic terminology and structure.
• A null hypothesis is an assertion about a population that you would like to
test. It is “null” in the sense that it often represents a status quo belief,
such as the absence of a characteristic or the lack of an effect. It may be
formalized by asserting that a population parameter, or a combination of
population parameters, has a certain value. In the example given in the
“Introduction” on page 7-2, the null hypothesis would be that the average
price of gas across the state was $1.15. This is written H0: µ = 1.15.
• An alternative hypothesis is a contrasting assertion about the population
that can be tested against the null hypothesis. In the example given in the
“Introduction” on page 7-2, possible alternative hypotheses are:
H1: µ ≠ 1.15 — State average was different from $1.15 (two-tailed test)
H1: µ > 1.15 — State average was greater than $1.15 (right-tail test)
H1: µ < 1.15 — State average was less than $1.15 (left-tail test)
• To conduct a hypothesis test, a random sample from the population is
collected and a relevant test statistic is computed to summarize the sample.
This statistic varies with the type of test, but its distribution under the null
hypothesis must be known (or assumed).
• The p value of a test is the probability, under the null hypothesis, of
obtaining a value of the test statistic as extreme or more extreme than the
value computed from the sample.
• The significance level of a test is a threshold of probability α agreed to before
the test is conducted. A typical value of α is 0.05. If the p value of a test is
less than α, the test rejects the null hypothesis. If the p value is greater
than α, there is insufficient evidence to reject the null hypothesis. Note
that lack of evidence for rejecting the null hypothesis is not evidence for
accepting the null hypothesis. Also note that substantive “significance” of
an alternative cannot be inferred from the statistical significance of a test.
• The significance level α can be interpreted as the probability of rejecting
the null hypothesis when it is actually true—a type I error. The distribution
of the test statistic under the null hypothesis determines the probability
α of a type I error. Even if the null hypothesis is not rejected, it may still
be false—a type II error. The distribution of the test statistic under the
alternative hypothesis determines the probability β of a type II error. Type
II errors are often due to small sample sizes. The power of a test, 1 – β, is
the probability of correctly rejecting a false null hypothesis.
• Results of hypothesis tests are often communicated with a confidence
interval. A confidence interval is an estimated range of values with a
specified probability of containing the true population value of a parameter.
Upper and lower bounds for confidence intervals are computed from the
sample estimate of the parameter and the known (or assumed) sampling
distribution of the estimator. A typical assumption is that estimates will be
normally distributed with repeated sampling (as dictated by the Central
Limit Theorem). Wider confidence intervals correspond to poor estimates
(smaller samples); narrow intervals correspond to better estimates
(larger samples). If the null hypothesis asserts the value of a population
parameter, the test rejects the null hypothesis when the hypothesized
value lies outside the computed confidence interval for the parameter.
Hypothesis Test Assumptions
Different hypothesis tests make different assumptions about the distribution
of the random variable being sampled in the data. These assumptions must
be considered when choosing a test and when interpreting the results.
For example, the z-test (ztest) and the t-test (ttest) both assume that
the data are independently sampled from a normal distribution. Statistics
Toolbox functions are available for testing this assumption, such as chi2gof,
jbtest, lillietest, and normplot.
Both the z-test and the t-test are relatively robust with respect to departures
from this assumption, so long as the sample size n is large enough. Both
tests compute a sample mean x̄, which, by the Central Limit Theorem,
has an approximately normal sampling distribution with mean equal to the
population mean μ, regardless of the population distribution being sampled.
The difference between the z-test and the t-test is in the assumption of the
standard deviation σ of the underlying normal distribution. A z-test assumes
that σ is known; a t-test does not. As a result, a t-test must compute an
estimate s of the standard deviation from the sample.
Test statistics for the z-test and the t-test are, respectively,
z = (x̄ − μ) / (σ/√n)      t = (x̄ − μ) / (s/√n)
Under the null hypothesis that the population is distributed with mean μ, the
z-statistic has a standard normal distribution, N(0,1). Under the same null
hypothesis, the t-statistic has Student’s t distribution with n – 1 degrees of
freedom. For small sample sizes, Student’s t distribution is flatter and wider
than N(0,1), compensating for the decreased confidence in the estimate s.
As sample size increases, however, Student’s t distribution approaches the
standard normal distribution, and the two tests become essentially equivalent.
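As a brief illustration (the simulated data and parameter values below are assumptions, not part of the original example), a moderately large normal sample with known σ gives similar p-values from ztest and ttest:
x = 1.15 + 0.04*randn(100,1);   % simulated prices with sigma = 0.04
[hz,pz] = ztest(x,1.15,0.04);   % z-test: sigma treated as known
[ht,pt] = ttest(x,1.15);        % t-test: sigma estimated from the sample
[pz pt]                         % the two p-values are typically close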
Knowing the distribution of the test statistic under the null hypothesis allows
for accurate calculation of p-values. Interpreting p-values in the context of
the test assumptions allows for critical analysis of test results.
Assumptions underlying Statistics Toolbox hypothesis tests are given in the
reference pages for implementing functions.
Example: Hypothesis Testing
This example uses the gas price data in the file gas.mat. The file contains two
random samples of prices for a gallon of gas around the state of Massachusetts
in 1993. The first sample, price1, contains 20 random observations around
the state on a single day in January. The second sample, price2, contains 20
random observations around the state one month later.
load gas
prices = [price1 price2];
As a first step, you might want to test the assumption that the samples come
from normal distributions.
A normal probability plot gives a quick idea.
normplot(prices)
Both scatters approximately follow straight lines through the first and third
quartiles of the samples, indicating approximate normal distributions.
The February sample (the right-hand line) shows a slight departure from
normality in the lower tail. A shift in the mean from January to February is
evident.
A hypothesis test is used to quantify the test of normality. Since each sample
is relatively small, a Lilliefors test is recommended.
lillietest(price1)
ans =
0
lillietest(price2)
ans =
0
The default significance level of lillietest is 5%. The logical 0 returned by
each test indicates a failure to reject the null hypothesis that the samples are
normally distributed. This failure may reflect normality in the population or
it may reflect a lack of strong evidence against the null hypothesis due to
the small sample size.
Now compute the sample means:
sample_means = mean(prices)
sample_means =
115.1500 118.5000
You might want to test the null hypothesis that the mean price across the
state on the day of the January sample was $1.15. If you know that the
standard deviation in prices across the state has historically, and consistently,
been $0.04, then a z-test is appropriate.
[h,pvalue,ci] = ztest(price1/100,1.15,0.04)
h =
0
pvalue =
0.8668
ci =
1.1340
1.1690
The logical output h = 0 indicates a failure to reject the null hypothesis
at the default significance level of 5%. This is a consequence of the high
probability under the null hypothesis, indicated by the p value, of observing
a value of the z-statistic as extreme as, or more extreme than, the one computed from the
sample. The 95% confidence interval on the mean, [1.1340 1.1690], includes
the hypothesized population mean of $1.15.
Does the later sample offer stronger evidence for rejecting a null hypothesis
of a state-wide average price of $1.15 in February? The shift shown in the
probability plot and the difference in the computed sample means suggest
this. The shift might indicate a significant fluctuation in the market, raising
questions about the validity of using the historical standard deviation. If a
known standard deviation cannot be assumed, a t-test is more appropriate.
[h,pvalue,ci] = ttest(price2/100,1.15)
h =
1
pvalue =
4.9517e-004
ci =
1.1675
1.2025
The logical output h = 1 indicates a rejection of the null hypothesis at the
default significance level of 5%. In this case, the 95% confidence interval on
the mean does not include the hypothesized population mean of $1.15.
You might want to investigate the shift in prices a little more closely.
The function ttest2 tests if two independent samples come from normal
distributions with equal but unknown standard deviations and the same
mean, against the alternative that the means are unequal.
[h,sig,ci] = ttest2(price1,price2)
h =
1
sig =
0.0083
ci =
-5.7845
-0.9155
The null hypothesis is rejected at the default 5% significance level, and
the confidence interval on the difference of means does not include the
hypothesized value of 0.
A notched box plot is another way to visualize the shift.
boxplot(prices,1)
set(gca,'XTick',[1 2])
set(gca,'XtickLabel',{'January','February'})
xlabel('Month')
ylabel('Prices ($0.01)')
The plot displays the distribution of the samples around their medians. The
heights of the notches in each box are computed so that the side-by-side
boxes have nonoverlapping notches when their medians are different at a
default 5% significance level. The computation is based on an assumption
of normality in the data, but the comparison is reasonably robust for other
distributions. The side-by-side plots provide a kind of visual hypothesis test,
comparing medians rather than means. The plot above appears to barely
reject the null hypothesis of equal medians.
The nonparametric Wilcoxon rank sum test, implemented by the function
ranksum, can be used to quantify the test of equal medians. It tests if two
independent samples come from identical continuous (not necessarily normal)
distributions with equal medians, against the alternative that they do not
have equal medians.
[p,h] = ranksum(price1,price2)
p =
0.0095
h =
1
The test rejects the null hypothesis of equal medians at the default 5%
significance level.
Available Hypothesis Tests
Function        Description
ansaribradley   Ansari-Bradley test. Tests if two independent samples come from the same distribution, against the alternative that they come from distributions that have the same median and shape but different variances.
chi2gof         Chi-square goodness-of-fit test. Tests if a sample comes from a specified distribution, against the alternative that it does not come from that distribution.
dwtest          Durbin-Watson test. Tests if the residuals from a linear regression are uncorrelated, against the alternative that there is autocorrelation among them.
jbtest          Jarque-Bera test. Tests if a sample comes from a normal distribution with unknown mean and variance, against the alternative that it does not come from a normal distribution.
kstest          One-sample Kolmogorov-Smirnov test. Tests if a sample comes from a continuous distribution with specified parameters, against the alternative that it does not come from that distribution.
kstest2         Two-sample Kolmogorov-Smirnov test. Tests if two samples come from the same continuous distribution, against the alternative that they do not come from the same distribution.
lillietest      Lilliefors test. Tests if a sample comes from a distribution in the normal family, against the alternative that it does not come from a normal distribution.
linhyptest      Linear hypothesis test. Tests if H*b = c for parameter estimates b with estimated covariance H and specified c, against the alternative that H*b ≠ c.
ranksum         Wilcoxon rank sum test. Tests if two independent samples come from identical continuous distributions with equal medians, against the alternative that they do not have equal medians.
runstest        Runs test. Tests if a sequence of values comes in random order, against the alternative that the ordering is not random.
signrank        One-sample or paired-sample Wilcoxon signed rank test. Tests if a sample comes from a continuous distribution symmetric about a specified median, against the alternative that it does not have that median.
signtest        One-sample or paired-sample sign test. Tests if a sample comes from an arbitrary continuous distribution with a specified median, against the alternative that it does not have that median.
ttest           One-sample or paired-sample t-test. Tests if a sample comes from a normal distribution with unknown variance and a specified mean, against the alternative that it does not have that mean.
ttest2          Two-sample t-test. Tests if two independent samples come from normal distributions with unknown but equal (or, optionally, unequal) variances and the same mean, against the alternative that the means are unequal.
vartest         One-sample chi-square variance test. Tests if a sample comes from a normal distribution with specified variance, against the alternative that it comes from a normal distribution with a different variance.
vartest2        Two-sample F-test for equal variances. Tests if two independent samples come from normal distributions with the same variance, against the alternative that they come from normal distributions with different variances.
vartestn        Bartlett multiple-sample test for equal variances. Tests if multiple samples come from normal distributions with the same variance, against the alternative that they come from normal distributions with different variances.
ztest           One-sample z-test. Tests if a sample comes from a normal distribution with known variance and specified mean, against the alternative that it does not have that mean.
Note In addition to the previous functions, Statistics Toolbox functions are
available for analysis of variance (ANOVA), which perform hypothesis tests in
the context of linear modeling. These functions are discussed in Chapter 8,
“Analysis of Variance”.
8 Analysis of Variance
• “Introduction” on page 8-2
• “ANOVA” on page 8-3
• “MANOVA” on page 8-39
Introduction
Analysis of variance (ANOVA) is a procedure for assigning sample variance to
different sources and deciding whether the variation arises within or among
different population groups. Samples are described in terms of variation
around group means and variation of group means around an overall mean. If
variations within groups are small relative to variations between groups, a
difference in group means may be inferred. Chapter 7, “Hypothesis Tests” are
used to quantify decisions.
This chapter treats ANOVA among groups, that is, among categorical
predictors. ANOVA for regression, with continuous predictors, is discussed in
“Tabulating Diagnostic Statistics” on page 9-13.
Multivariate analysis of variance (MANOVA), for data with multiple
measured responses, is also discussed in this chapter.
ANOVA
In this section...
“One-Way ANOVA” on page 8-3
“Two-Way ANOVA” on page 8-9
“N-Way ANOVA” on page 8-12
“Other ANOVA Models” on page 8-26
“Analysis of Covariance” on page 8-27
“Nonparametric Methods” on page 8-35
One-Way ANOVA
• “Introduction” on page 8-3
• “Example: One-Way ANOVA” on page 8-4
• “Multiple Comparisons” on page 8-6
• “Example: Multiple Comparisons” on page 8-7
Introduction
The purpose of one-way ANOVA is to find out whether data from several
groups have a common mean. That is, to determine whether the groups are
actually different in the measured characteristic.
One-way ANOVA is a simple special case of the linear model. The one-way
ANOVA form of the model is
yij = . j +  ij
where:
• yij is a matrix of observations in which each column represents a different
group.
• α.j is a matrix whose columns are the group means. (The “dot j” notation
means that α applies to all rows of column j. That is, the value αij is the
same for all i.)
• εij is a matrix of random disturbances.
The model assumes that the columns of y are a constant plus a random
disturbance. You want to know if the constants are all the same.
Example: One-Way ANOVA
The data below comes from a study by Hogg and Ledolter [48] of bacteria
counts in shipments of milk. The columns of the matrix hogg represent
different shipments. The rows are bacteria counts from cartons of milk chosen
randomly from each shipment. Do some shipments have higher counts than
others?
load hogg
hogg
hogg =
    24    14    11     7    19
    15     7     9     7    24
    21    12     7     4    19
    27    17    13     7    15
    33    14    12    12    10
    23    16    18    18    20
[p,tbl,stats] = anova1(hogg);
p
p =
1.1971e-04
The standard ANOVA table has columns for the sums of squares, degrees of
freedom, mean squares (SS/df), F statistic, and p value.
You can use the F statistic to do a hypothesis test to find out if the bacteria
counts are the same. anova1 returns the p value from this hypothesis test.
In this case the p value is about 0.0001, a very small value. This is a strong
indication that the bacteria counts from the different shipments are not the
same. An F statistic as extreme as the observed F would occur by chance only
once in 10,000 times if the counts were truly equal.
The p value returned by anova1 depends on assumptions about the random
disturbances εij in the model equation. For the p value to be correct, these
disturbances need to be independent, normally distributed, and have constant
variance.
You can get some graphical assurance that the means are different by
looking at the box plots in the second figure window displayed by anova1.
Note, however, that the notches are used for a comparison of medians, not a
comparison of means. For more information on this display, see “Box Plots”
on page 4-6.
Multiple Comparisons
Sometimes you need to determine not just whether there are any differences
among the means, but specifically which pairs of means are significantly
different. It is tempting to perform a series of t tests, one for each pair of
means, but this procedure has a pitfall.
In a t test, you compute a t statistic and compare it to a critical value. The
critical value is chosen so that when the means are really the same (any
apparent difference is due to random chance), the probability that the t
statistic will exceed the critical value is small, say 5%. When the means
are different, the probability that the statistic will exceed the critical value
is larger.
In this example there are five means, so there are 10 pairs of means to
compare. It stands to reason that if all the means are the same, and if there is
a 5% chance of incorrectly concluding that there is a difference in one pair,
then the probability of making at least one incorrect conclusion among all 10
pairs is much larger than 5%.
Fortunately, there are procedures known as multiple comparison procedures
that are designed to compensate for multiple tests.
Example: Multiple Comparisons
You can perform a multiple comparison test using the multcompare function
and supplying it with the stats output from anova1.
load hogg
[p,tbl,stats] = anova1(hogg);
[c,m] = multcompare(stats)
c =
    1.0000    2.0000    2.4953   10.5000   18.5047
    1.0000    3.0000    4.1619   12.1667   20.1714
    1.0000    4.0000    6.6619   14.6667   22.6714
    1.0000    5.0000   -2.0047    6.0000   14.0047
    2.0000    3.0000   -6.3381    1.6667    9.6714
    2.0000    4.0000   -3.8381    4.1667   12.1714
    2.0000    5.0000  -12.5047   -4.5000    3.5047
    3.0000    4.0000   -5.5047    2.5000   10.5047
    3.0000    5.0000  -14.1714   -6.1667    1.8381
    4.0000    5.0000  -16.6714   -8.6667   -0.6619
m =
   23.8333    1.9273
   13.3333    1.9273
   11.6667    1.9273
    9.1667    1.9273
   17.8333    1.9273
The first output from multcompare has one row for each pair of groups, with
an estimate of the difference in group means and a confidence interval for that
difference. For example, the second row has the values
    1.0000    3.0000    4.1619   12.1667   20.1714
indicating that the mean of group 1 minus the mean of group 3 is
estimated to be 12.1667, and a 95% confidence interval for this difference is
[4.1619, 20.1714]. This interval does not contain 0, so you can conclude that
the means of groups 1 and 3 are different.
The second output contains the mean and its standard error for each group.
It is easier to visualize the difference between group means by looking at the
graph that multcompare produces.
There are five groups. The graph instructs you to Click on the group you
want to test. Three groups have means significantly different from group one.
The graph shows that group 1 is significantly different from groups 2, 3, and
4. By using the mouse to select group 4, you can determine that it is also
significantly different from group 5. Other pairs are not significantly different.
Two-Way ANOVA
• “Introduction” on page 8-9
• “Example: Two-Way ANOVA” on page 8-10
Introduction
The purpose of two-way ANOVA is to find out whether data from several
groups have a common mean. One-way ANOVA and two-way ANOVA differ
in that the groups in two-way ANOVA have two categories of defining
characteristics instead of one.
Suppose an automobile company has two factories, and each factory makes
the same three models of car. It is reasonable to ask if the gas mileage in the
cars varies from factory to factory as well as from model to model. There are
two predictors, factory and model, to explain differences in mileage.
There could be an overall difference in mileage due to a difference in the
production methods between factories. There is probably a difference in the
mileage of the different models (irrespective of the factory) due to differences
in design specifications. These effects are called additive.
Finally, a factory might make high mileage cars in one model (perhaps
because of a superior production line), but not be different from the other
factory for other models. This effect is called an interaction. It is impossible
to detect an interaction unless there are duplicate observations for some
combination of factory and car model.
Two-way ANOVA is a special case of the linear model. The two-way ANOVA
form of the model is
yijk = μ + α.j + βi. + γij + εijk
where, with respect to the automobile example above:
• yijk is a matrix of gas mileage observations (with row index i, column index
j, and repetition index k).
• μ is a constant matrix of the overall mean gas mileage.
• α.j is a matrix whose columns are the deviations of each car’s gas mileage
(from the mean gas mileage μ) that are attributable to the car’s model. All
values in a given column of α.j are identical, and the values in each row of
α.j sum to 0.
• βi. is a matrix whose rows are the deviations of each car’s gas mileage
(from the mean gas mileage μ) that are attributable to the car’s factory. All
values in a given row of βi. are identical, and the values in each column
of βi. sum to 0.
• γij is a matrix of interactions. The values in each row of γij sum to 0, and the
values in each column of γij sum to 0.
• εijk is a matrix of random disturbances.
Example: Two-Way ANOVA
The purpose of the example is to determine the effect of car model and factory
on the mileage rating of cars.
load mileage
mileage
mileage =
   33.3000   34.5000   37.4000
   33.4000   34.8000   36.8000
   32.9000   33.8000   37.6000
   32.6000   33.4000   36.6000
   32.5000   33.7000   37.0000
   33.0000   33.9000   36.7000
cars = 3;
[p,tbl,stats] = anova2(mileage,cars);
p
p =
0.0000
0.0039
0.8411
There are three models of cars (columns) and two factories (rows). The reason
there are six rows in mileage instead of two is that each factory provides
three cars of each model for the study. The data from the first factory is in the
first three rows, and the data from the second factory is in the last three rows.
The standard ANOVA table has columns for the sums of squares,
degrees-of-freedom, mean squares (SS/df), F statistics, and p-values.
You can use the F statistics to do hypotheses tests to find out if the mileage is
the same across models, factories, and model-factory pairs (after adjusting for
the additive effects). anova2 returns the p value from these tests.
The p value for the model effect is zero to four decimal places. This is a strong
indication that the mileage varies from one model to another. An F statistic
as extreme as the observed F would occur by chance less than once in 10,000
times if the gas mileage were truly equal from model to model. If you used the
multcompare function to perform a multiple comparison test, you would find
that each pair of the three models is significantly different.
The p value for the factory effect is 0.0039, which is also highly significant.
This indicates that one factory is out-performing the other in the gas mileage
of the cars it produces. The observed p value indicates that an F statistic as
extreme as the observed F would occur by chance about four out of 1000 times
if the gas mileage were truly equal from factory to factory.
There does not appear to be any interaction between factories and models.
The p value, 0.8411, means that the observed result is quite likely (84 out 100
times) given that there is no interaction.
The p-values returned by anova2 depend on assumptions about the random
disturbances εijk in the model equation. For the p-values to be correct these
disturbances need to be independent, normally distributed, and have constant
variance.
In addition, anova2 requires that data be balanced, which in this case means
there must be the same number of cars for each combination of model and
factory. The next section discusses a function that supports unbalanced data
with any number of predictors.
N-Way ANOVA
• “Introduction” on page 8-12
• “N-Way ANOVA with a Small Data Set” on page 8-13
• “N-Way ANOVA with a Large Data Set” on page 8-15
• “ANOVA with Random Effects” on page 8-19
Introduction
You can use N-way ANOVA to determine if the means in a set of data differ
when grouped by multiple factors. If they do differ, you can determine which
factors or combinations of factors are associated with the difference.
N-way ANOVA is a generalization of two-way ANOVA. For three factors, the
model can be written
yijkl = μ + α.j. + βi.. + γ..k + (αβ)ij. + (αγ)i.k + (βγ).jk + (αβγ)ijk + εijkl
In this notation parameters with two subscripts, such as (αβ)ij., represent
the interaction effect of two factors. The parameter (αβγ)ijk represents the
three-way interaction. An ANOVA model can have the full set of parameters
or any subset, but conventionally it does not include complex interaction
terms unless it also includes all simpler terms for those factors. For example,
one would generally not include the three-way interaction without also
including all two-way interactions.
The anovan function performs N-way ANOVA. Unlike the anova1 and anova2
functions, anovan does not expect data in a tabular form. Instead, it expects
a vector of response measurements and a separate vector (or text array)
containing the values corresponding to each factor. This input data format is
more convenient than matrices when there are more than two factors or when
the number of measurements per factor combination is not constant.
N-Way ANOVA with a Small Data Set
Consider the following two-way example using anova2.
m = [23 15 20;27 17 63;43 3 55;41 9 90]
m =
    23    15    20
    27    17    63
    43     3    55
    41     9    90
anova2(m,2)
ans =
0.0197
0.2234
0.2663
The factor information is implied by the shape of the matrix m and the number
of measurements at each factor combination (2). Although anova2 does not
actually require arrays of factor values, for illustrative purposes you could
create them as follows.
cfactor = repmat(1:3,4,1)
cfactor =
     1     2     3
     1     2     3
     1     2     3
     1     2     3
rfactor = [ones(2,3); 2*ones(2,3)]
rfactor =
     1     1     1
     1     1     1
     2     2     2
     2     2     2
The cfactor matrix shows that each column of m represents a different level
of the column factor. The rfactor matrix shows that the top two rows of m
represent one level of the row factor, and bottom two rows of m represent a
second level of the row factor. In other words, each value m(i,j) represents
an observation at column factor level cfactor(i,j) and row factor level
rfactor(i,j).
To solve the above problem with anovan, you need to reshape the matrices m,
cfactor, and rfactor to be vectors.
m = m(:);
cfactor = cfactor(:);
rfactor = rfactor(:);
[m cfactor rfactor]
ans =
    23     1     1
    27     1     1
    43     1     2
    41     1     2
    15     2     1
    17     2     1
     3     2     2
     9     2     2
    20     3     1
    63     3     1
    55     3     2
    90     3     2
anovan(m,{cfactor rfactor},2)
ans =
    0.0197
    0.2234
    0.2663
N-Way ANOVA with a Large Data Set
The previous example used anova2 to study a small data set measuring car
mileage. This example illustrates how to analyze a larger set of car data with
mileage and other information on 406 cars made between 1970 and 1982.
First, load the data set and look at the variable names.
load carbig
whos
  Name              Size      Bytes  Class

  Acceleration    406x1        3248  double array
  Cylinders       406x1        3248  double array
  Displacement    406x1        3248  double array
  Horsepower      406x1        3248  double array
  MPG             406x1        3248  double array
  Model           406x36      29232  char array
  Model_Year      406x1        3248  double array
  Origin          406x7        5684  char array
  Weight          406x1        3248  double array
  cyl4            406x5        4060  char array
  org             406x7        5684  char array
  when            406x5        4060  char array
The example focuses on four variables. MPG is the number of miles per gallon
for each of 406 cars (though some have missing values coded as NaN). The
other three variables are factors: cyl4 (four-cylinder car or not), org (car
originated in Europe, Japan, or the USA), and when (car was built early in the
period, in the middle of the period, or late in the period).
First, fit the full model, requesting up to three-way interactions and Type 3
sums-of-squares.
varnames = {'Origin';'4Cyl';'MfgDate'};
anovan(MPG,{org cyl4 when},3,3,varnames)
ans =
    0.0000
       NaN
         0
    0.7032
    0.0001
    0.2072
    0.6990
Note that many terms are marked by a # symbol as not having full rank,
and one of them has zero degrees of freedom and is missing a p value. This
can happen when there are missing factor combinations and the model has
higher-order terms. In this case, the cross-tabulation below shows that there
are no cars made in Europe during the early part of the period with other than
four cylinders, as indicated by the 0 in table(2,1,1).
[table, chi2, p, factorvals] = crosstab(org,when,cyl4)
table(:,:,1) =
    82    75    25
     0     4     3
     3     3     4

table(:,:,2) =
    12    26    38
    22    12    17
    23    25    32

chi2 =
  207.7689

p =
     0

factorvals =
    'USA'       'Early'    'Other'
    'Europe'    'Mid'      'Four'
    'Japan'     'Late'          []
Consequently it is impossible to estimate the three-way interaction effects,
and including the three-way interaction term in the model makes the fit
singular.
Using even the limited information available in the ANOVA table, you can see
that the three-way interaction has a p value of 0.699, so it is not significant.
So this time you examine only two-way interactions.
[p,tbl,stats,terms] = anovan(MPG,{org cyl4 when},2,3,varnames);
terms
terms =
     1     0     0
     0     1     0
     0     0     1
     1     1     0
     1     0     1
     0     1     1
Now all terms are estimable. The p-values for interaction term 4
(Origin*4Cyl) and interaction term 6 (4Cyl*MfgDate) are much larger than
a typical cutoff value of 0.05, indicating these terms are not significant. You
could choose to omit these terms and pool their effects into the error term.
The output terms variable returns a matrix of codes, each of which is a bit
pattern representing a term. You can omit terms from the model by deleting
their entries from terms and running anovan again, this time supplying the
resulting vector as the model argument.
terms([4 6],:) = []
terms =
     1     0     0
     0     1     0
     0     0     1
     1     0     1
anovan(MPG,{org cyl4 when},terms,3,varnames)
ans =
  1.0e-003 *
    0.0000
         0
         0
    0.1140
Now you have a more parsimonious model indicating that the mileage of
these cars seems to be related to all three factors, and that the effect of the
manufacturing date depends on where the car was made.
ANOVA with Random Effects
• “Introduction” on page 8-19
• “Setting Up the Model” on page 8-20
• “Fitting a Random Effects Model” on page 8-21
• “F Statistics for Models with Random Effects” on page 8-22
• “Variance Components” on page 8-24
Introduction. In an ordinary ANOVA model, each grouping variable
represents a fixed factor. The levels of that factor are a fixed set of values.
Your goal is to determine whether different factor levels lead to different
response values. This section presents an example that shows how to use
anovan to fit models where a factor’s levels represent a random selection from
a larger (infinite) set of possible levels.
Setting Up the Model. To set up the example, first load the data, which is
stored in a 6-by-3 matrix, mileage.
load mileage
The anova2 function works only with balanced data, and it infers the values
of the grouping variables from the row and column numbers of the input
matrix. The anovan function, on the other hand, requires you to explicitly
create vectors of grouping variable values. To create these vectors, do the
following steps:
1 Create an array indicating the factory for each value in mileage. This
array is 1 for the first column, 2 for the second, and 3 for the third.
factory = repmat(1:3,6,1);
2 Create an array indicating the car model for each mileage value. This array
is 1 for the first three rows of mileage, and 2 for the remaining three rows.
carmod = [ones(3,3); 2*ones(3,3)];
3 Turn these matrices into vectors and display them.
mileage = mileage(:);
factory = factory(:);
carmod = carmod(:);
[mileage factory carmod]
ans =
   33.3000    1.0000    1.0000
   33.4000    1.0000    1.0000
   32.9000    1.0000    1.0000
   32.6000    1.0000    2.0000
   32.5000    1.0000    2.0000
   33.0000    1.0000    2.0000
   34.5000    2.0000    1.0000
   34.8000    2.0000    1.0000
   33.8000    2.0000    1.0000
   33.4000    2.0000    2.0000
   33.7000    2.0000    2.0000
   33.9000    2.0000    2.0000
   37.4000    3.0000    1.0000
   36.8000    3.0000    1.0000
   37.6000    3.0000    1.0000
   36.6000    3.0000    2.0000
   37.0000    3.0000    2.0000
   36.7000    3.0000    2.0000
Fitting a Random Effects Model. Continuing the example from the
preceding section, suppose you are studying a few factories but you want
information about what would happen if you build these same car models in
a different factory—either one that you already have or another that you
might construct. To get this information, fit the analysis of variance model,
specifying a model that includes an interaction term and that the factory
factor is random.
[pvals,tbl,stats] = anovan(mileage, {factory carmod}, ...
'model',2, 'random',1,'varnames',{'Factory' 'Car Model'});
In the fixed effects version of this fit, which you get by omitting the inputs
'random',1 in the preceding code, the effect of car model is significant, with a
p value of 0.0039. But in this example, which takes into account the random
variation of the effect of the variable 'Car Model' from one factory to another,
the effect is still significant, but with a higher p value of 0.0136.
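For reference, the fixed effects fit mentioned above is obtained by the same call without the 'random' argument:

% Fixed effects version for comparison (the car model p value is 0.0039 here)
[pvalsFixed,tblFixed] = anovan(mileage, {factory carmod}, ...
    'model',2, 'varnames',{'Factory' 'Car Model'});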
F Statistics for Models with Random Effects. The F statistic in a model
having random effects is defined differently than in a model having all fixed
effects. In the fixed effects model, you compute the F statistic for any term by
taking the ratio of the mean square for that term with the mean square for
error. In a random effects model, however, some F statistics use a different
mean square in the denominator.
In the example described in “Setting Up the Model” on page 8-20, the effect
of the variable 'Factory' could vary across car models. In this case, the
interaction mean square takes the place of the error mean square in the F
statistic. The F statistic for factory is:
F = 1.445 / 0.02
F =
72.2500
The degrees of freedom for the statistic are the degrees of freedom for the
numerator (1) and denominator (2) mean squares. Therefore the p value
for the statistic is:
pval = 1 - fcdf(F,1,2)
pval =
0.0136
With random effects, the expected value of each mean square depends not only
on the variance of the error term, but also on the variances contributed by
the random effects. You can see these dependencies by writing the expected
values as linear combinations of contributions from the various model terms.
To find the coefficients of these linear combinations, enter stats.ems, which
returns the ems field of the stats structure:
stats.ems
ans =
    6.0000    0.0000    3.0000    1.0000
    0.0000    9.0000    3.0000    1.0000
    0.0000         0    3.0000    1.0000
    0.0000         0         0    1.0000
To see text representations of the linear combinations, enter
stats.txtems
ans =
'6*V(Factory)+3*V(Factory*Car Model)+V(Error)'
'9*Q(Car Model)+3*V(Factory*Car Model)+V(Error)'
'3*V(Factory*Car Model)+V(Error)'
'V(Error)'
The expected value for the mean square due to car model (second term)
includes contributions from a quadratic function of the car model effects, plus
three times the variance of the interaction term’s effect, plus the variance
of the error term. Notice that if the car model effects were all zero, the
expression would reduce to the expected mean square for the third term (the
interaction term). That is why the F statistic for the car model effect uses the
interaction mean square in the denominator.
In some cases there is no single term whose expected value matches the one
required for the denominator of the F statistic. In that case, the denominator is
a linear combination of mean squares. The stats structure contains fields
giving the definitions of the denominators for each F statistic. The txtdenom
field, stats.txtdenom, gives a text representation, and the denom field gives
a matrix that defines a linear combination of the variances of terms in the
model. For balanced models like this one, the denom matrix, stats.denom,
contains zeros and ones, because the denominator is just a single term’s mean
square:
stats.txtdenom
ans =
'MS(Factory*Car Model)'
'MS(Factory*Car Model)'
'MS(Error)'
stats.denom
ans =
   -0.0000    1.0000    0.0000
    0.0000    1.0000   -0.0000
    0.0000         0    1.0000
Variance Components. For the model described in “Setting Up the Model”
on page 8-20, consider the mileage for a particular car of a particular model
made at a random factory. The variance of that car is the sum of components,
or contributions, one from each of the random terms.
stats.rtnames
ans =
'Factory'
'Factory*Car Model'
'Error'
You do not know those variances, but you can estimate them from the data.
Recall that the ems field of the stats structure expresses the expected value
of each term’s mean square as a linear combination of unknown variances for
random terms, and unknown quadratic forms for fixed terms. If you take
the expected mean square expressions for the random terms, and equate
those expected values to the computed mean squares, you get a system of
equations that you can solve for the unknown variances. These solutions
are the variance component estimates. The varest field contains a variance
component estimate for each term. The rtnames field contains the names
of the random terms.
stats.varest
ans =
4.4426
-0.0313
0.1139
Under some conditions, the variability attributed to a term is unusually low,
and that term’s variance component estimate is negative. In those cases it
is common to set the estimate to zero, which you might do, for example, to
create a bar graph of the components.
bar(max(0,stats.varest))
set(gca,'xtick',1:3,'xticklabel',stats.rtnames)
You can also compute confidence bounds for the variance estimate. The
anovan function does this by computing confidence bounds for the variance
expected mean squares, and finding lower and upper limits on each variance
component containing all of these bounds. This procedure leads to a set
of bounds that is conservative for balanced data. (That is, 95% confidence
bounds will have a probability of at least 95% of containing the true variances
if the number of observations for each combination of grouping variables
is the same.) For unbalanced data, these are approximations that are not
guaranteed to be conservative.
[{'Term' 'Estimate' 'Lower' 'Upper'};
stats.rtnames, num2cell([stats.varest stats.varci])]
ans =
    'Term'                 'Estimate'    'Lower'     'Upper'
    'Factory'              [  4.4426]    [1.0736]    [175.6038]
    'Factory*Car Model'    [ -0.0313]    [   NaN]    [      NaN]
    'Error'                [  0.1139]    [0.0586]    [   0.3103]
Other ANOVA Models
The anovan function also has arguments that enable you to specify two other
types of model terms. First, the 'nested' argument specifies a matrix that
indicates which factors are nested within other factors. A nested factor is one
that takes different values within each level of the factor in which it is nested.
For example, the mileage data from the previous section assumed that the
two car models produced in each factory were the same. Suppose instead,
each factory produced two distinct car models for a total of six car models, and
we numbered them 1 and 2 for each factory for convenience. Then, the car
model is nested in factory. A more accurate and less ambiguous numbering of
car model would be as follows:
    Factory    Car Model
       1           1
       1           2
       2           3
       2           4
       3           5
       3           6
However, it is common with nested models to number the nested factor the
same way within each level of the nesting factor.
Second, the 'continuous' argument specifies that some factors are to be
treated as continuous variables. The remaining factors are categorical
variables. Although the anovan function can fit models with multiple
continuous and categorical predictors, the simplest model that combines one
predictor of each type is known as an analysis of covariance model. The next
section describes a specialized tool for fitting this model.
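As a purely illustrative sketch (the variable choices here are hypothetical, not part of an example in this guide), a call of the following form treats the second grouping variable as continuous:

load carbig
% Origin is categorical; Weight is treated as a continuous predictor
anovan(MPG,{org Weight},'continuous',2,'varnames',{'Origin','Weight'});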
Analysis of Covariance
• “Introduction” on page 8-27
• “Analysis of Covariance Tool” on page 8-27
• “Confidence Bounds” on page 8-32
• “Multiple Comparisons” on page 8-34
Introduction
Analysis of covariance is a technique for analyzing grouped data having a
response (y, the variable to be predicted) and a predictor (x, the variable
used to do the prediction). Using analysis of covariance, you can model y as
a linear function of x, with the coefficients of the line possibly varying from
group to group.
Analysis of Covariance Tool
The aoctool function opens an interactive graphical environment for fitting
and prediction with analysis of covariance (ANOCOVA) models. It fits the
following models for the ith group:
    Same mean         y = α + ε
    Separate means    y = (α + αi) + ε
    Same line         y = α + βx + ε
    Parallel lines    y = (α + αi) + βx + ε
    Separate lines    y = (α + αi) + (β + βi)x + ε
For example, in the parallel lines model the intercept varies from one group
to the next, but the slope is the same for each group. In the same mean
model, there is a common intercept and no slope. In order to make the group
coefficients well determined, the tool imposes the constraints
∑αi = ∑βi = 0
The following steps describe the use of aoctool.
1 Load the data. The Statistics Toolbox data set carsmall.mat contains
information on cars from the years 1970, 1976, and 1982. This example
studies the relationship between the weight of a car and its mileage,
and whether this relationship has changed over the years. To start the
demonstration, load the data set.
load carsmall
The Workspace Browser shows the variables in the data set.
You can also use aoctool with your own data.
2 Start the tool. The following command calls aoctool to fit a separate line
to the column vectors Weight and MPG for each of the three model year groups
defined in Model_Year. The initial fit models the y variable, MPG, as a linear
function of the x variable, Weight.
[h,atab,ctab,stats] = aoctool(Weight,MPG,Model_Year);
See the aoctool function reference page for detailed information about
calling aoctool.
3 Examine the output. The graphical output consists of a main window
with a plot, a table of coefficient estimates, and an analysis of variance
table. In the plot, each Model_Year group has a separate line. The data
points for each group are coded with the same color and symbol, and the fit
for each group has the same color as the data points.
The coefficients of the three lines appear in the figure titled ANOCOVA
Coefficients. You can see that the slopes are roughly –0.0078, with a small
deviation for each group:
• Model year 1970: y = (45.9798 – 8.5805) + (–0.0078 + 0.002)x + ε
• Model year 1976: y = (45.9798 – 3.8902) + (–0.0078 + 0.0011)x + ε
• Model year 1982: y = (45.9798 + 12.4707) + (–0.0078 – 0.0031)x + ε
Because the three fitted lines have slopes that are roughly similar, you may
wonder if they really are the same. The Model_Year*Weight interaction
expresses the difference in slopes, and the ANOVA table shows a test for
the significance of this term. With an F statistic of 5.23 and a p value of
0.0072, the slopes are significantly different.
4 Constrain the slopes to be the same. To examine the fits when the
slopes are constrained to be the same, return to the ANOCOVA Prediction
Plot window and use the Model pop-up menu to select a Parallel Lines
model. The window updates to show the following graph.
Though this fit looks reasonable, it is significantly worse than the Separate
Lines model. Use the Model pop-up menu again to return to the original
model.
Confidence Bounds
The example in “Analysis of Covariance Tool” on page 8-27 provides estimates
of the relationship between MPG and Weight for each Model_Year, but how
accurate are these estimates? To find out, you can superimpose confidence
bounds on the fits by examining them one group at a time.
1 In the Model_Year menu at the lower right of the figure, change the
setting from All Groups to 82. The data and fits for the other groups are
dimmed, and confidence bounds appear around the 82 fit.
The dashed lines form an envelope around the fitted line for model year 82.
Under the assumption that the true relationship is linear, these bounds
provide a 95% confidence region for the true line. Note that the fits for the
other model years are well outside these confidence bounds for Weight
values between 2000 and 3000.
2 Sometimes it is more valuable to be able to predict the response value for
a new observation, not just estimate the average response value. Use the
aoctool function Bounds menu to change the definition of the confidence
bounds from Line to Observation. The resulting wider intervals reflect
the uncertainty in the parameter estimates as well as the randomness
of a new observation.
Like the polytool function, the aoctool function has cross hairs that you
can use to manipulate the Weight and watch the estimate and confidence
bounds along the y-axis update. These values appear only when a single
group is selected, not when All Groups is selected.
Multiple Comparisons
You can perform a multiple comparison test by using the stats output
structure from aoctool as input to the multcompare function. The
multcompare function can test either slopes, intercepts, or population
marginal means (the predicted MPG of the mean weight for each group). The
example in “Analysis of Covariance Tool” on page 8-27 shows that the slopes
are not all the same, but could it be that two are the same and only the other
one is different? You can test that hypothesis.
multcompare(stats,0.05,'on','','s')
ans =
    1.0000    2.0000   -0.0012    0.0008    0.0029
    1.0000    3.0000    0.0013    0.0051    0.0088
    2.0000    3.0000    0.0005    0.0042    0.0079
This matrix shows that the estimated difference between the slopes of
groups 1 and 2 (1970 and 1976) is 0.0008, and a confidence interval for the
difference is [–0.0012, 0.0029]. There is no significant difference between the
two. There are significant differences, however, between the slope for 1982
and each of the other two. The graph shows the same information.
Note that the stats structure was created in the initial call to the aoctool
function, so it is based on the initial model fit (typically a separate-lines
model). If you change the model interactively and want to base your multiple
comparisons on the new model, you need to run aoctool again to get another
stats structure, this time specifying your new model as the initial model.
Nonparametric Methods
• “Introduction” on page 8-36
• “Kruskal-Wallis Test” on page 8-36
• “Friedman’s Test” on page 8-37
Introduction
Statistics Toolbox functions include nonparametric versions of one-way and
two-way analysis of variance. Unlike classical tests, nonparametric tests
make only mild assumptions about the data, and are appropriate when the
distribution of the data is non-normal. On the other hand, they are less
powerful than classical methods for normally distributed data.
Both of the nonparametric functions described here will return a stats
structure that can be used as an input to the multcompare function for
multiple comparisons.
Kruskal-Wallis Test
The example “Example: One-Way ANOVA” on page 8-4 uses one-way
analysis of variance to determine if the bacteria counts of milk varied from
shipment to shipment. The one-way analysis rests on the assumption that
the measurements are independent, and that each has a normal distribution
with a common variance and with a mean that was constant in each column.
You can conclude that the column means were not all the same. The following
example repeats that analysis using a nonparametric procedure.
The Kruskal-Wallis test is a nonparametric version of one-way analysis of
variance. The assumption behind this test is that the measurements come
from a continuous distribution, but not necessarily a normal distribution. The
test is based on an analysis of variance using the ranks of the data values, not
the data values themselves. Output includes a table similar to an ANOVA
table, and a box plot.
You can run this test as follows:
load hogg
p = kruskalwallis(hogg)
p =
0.0020
The low p value means the Kruskal-Wallis test results agree with the one-way
analysis of variance results.
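Because kruskalwallis also returns a stats output, you can follow up with a multiple comparison of the groups, for example:

load hogg
[p,tbl,stats] = kruskalwallis(hogg);
multcompare(stats)   % compares the average ranks of the five shipments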
Friedman’s Test
“Example: Two-Way ANOVA” on page 8-10 uses two-way analysis of variance
to study the effect of car model and factory on car mileage. The example
tests whether either of these factors has a significant effect on mileage, and
whether there is an interaction between these factors. The conclusion of
the example is there is no interaction, but that each individual factor has
a significant effect. The next example examines whether a nonparametric
analysis leads to the same conclusion.
Friedman’s test is a nonparametric test for data having a two-way layout (data
grouped by two categorical factors). Unlike two-way analysis of variance,
Friedman’s test does not treat the two factors symmetrically and it does not
test for an interaction between them. Instead, it is a test for whether the
columns are different after adjusting for possible row differences. The test is
based on an analysis of variance using the ranks of the data across categories
of the row factor. Output includes a table similar to an ANOVA table.
You can run Friedman’s test as follows.
load mileage
p = friedman(mileage,3)
p =
7.4659e-004
Recall the classical analysis of variance gave a p value to test column effects,
row effects, and interaction effects. This p value is for column effects. Using
either this p value or the p value from ANOVA (p < 0.0001), you conclude that
there are significant column effects.
In order to test for row effects, you need to rearrange the data to swap the
roles of the rows and columns. For a data matrix x with no replications, you
could simply transpose the data and type
p = friedman(x')
With replicated data it is slightly more complicated. A simple way is to
transform the matrix into a three-dimensional array with the first dimension
representing the replicates, swapping the other two dimensions, and restoring
the two-dimensional shape.
x = reshape(mileage, [3 2 3]);
x = permute(x,[1 3 2]);
x = reshape(x,[9 2])
x =
   33.3000   32.6000
   33.4000   32.5000
   32.9000   33.0000
   34.5000   33.4000
   34.8000   33.7000
   33.8000   33.9000
   37.4000   36.6000
   36.8000   37.0000
   37.6000   36.7000
friedman(x,3)
ans =
0.0082
Again, the conclusion is similar to that of the classical analysis of variance.
Both this p value and the one from ANOVA (p = 0.0039) lead you to conclude
that there are significant row effects.
You cannot use Friedman’s test to test for interactions between the row and
column factors.
MANOVA
In this section...
“Introduction” on page 8-39
“ANOVA with Multiple Responses” on page 8-39
Introduction
The analysis of variance technique in “Example: One-Way ANOVA” on
page 8-4 takes a set of grouped data and determines whether the mean of a
variable differs significantly among groups. Often there are multiple response
variables, and you are interested in determining whether the entire set of
means is different from one group to the next. There is a multivariate version
of analysis of variance that can address the problem.
ANOVA with Multiple Responses
The carsmall data set has measurements on a variety of car models from
the years 1970, 1976, and 1982. Suppose you are interested in whether the
characteristics of the cars have changed over time.
First, load the data.
load carsmall
whos
  Name              Size     Bytes  Class

  Acceleration    100x1        800  double array
  Cylinders       100x1        800  double array
  Displacement    100x1        800  double array
  Horsepower      100x1        800  double array
  MPG             100x1        800  double array
  Model           100x36      7200  char array
  Model_Year      100x1        800  double array
  Origin          100x7       1400  char array
  Weight          100x1        800  double array
Four of these variables (Acceleration, Displacement, Horsepower, and
MPG) are continuous measurements on individual car models. The variable
Model_Year indicates the year in which the car was made. You can create a
grouped plot matrix of these variables using the gplotmatrix function.
x = [MPG Horsepower Displacement Weight];
gplotmatrix(x,[],Model_Year,[],'+xo')
(When the second argument of gplotmatrix is empty, the function graphs
the columns of the x argument against each other, and places histograms
along the diagonals. The empty fourth argument produces a graph with the
default colors. The fifth argument controls the symbols used to distinguish
between groups.)
It appears the cars do differ from year to year. The upper right plot, for
example, is a graph of MPG versus Weight. The 1982 cars appear to have
higher mileage than the older cars, and they appear to weigh less on average.
But as a group, are the three years significantly different from one another?
The manova1 function can answer that question.
[d,p,stats] = manova1(x,Model_Year)
d =
2
p =
1.0e-006 *
0
0.1141
stats =
           W: [4x4 double]
           B: [4x4 double]
           T: [4x4 double]
         dfW: 90
         dfB: 2
         dfT: 92
      lambda: [2x1 double]
       chisq: [2x1 double]
     chisqdf: [2x1 double]
    eigenval: [4x1 double]
    eigenvec: [4x4 double]
       canon: [100x4 double]
       mdist: [100x1 double]
      gmdist: [3x3 double]
The manova1 function produces three outputs:
• The first output, d, is an estimate of the dimension of the group means. If
the means were all the same, the dimension would be 0, indicating that the
means are at the same point. If the means differed but fell along a line,
the dimension would be 1. In the example the dimension is 2, indicating
that the group means fall in a plane but not along a line. This is the largest
possible dimension for the means of three groups.
• The second output, p, is a vector of p-values for a sequence of tests. The
first p value tests whether the dimension is 0, the next whether the
dimension is 1, and so on. In this case both p-values are small. That’s
why the estimated dimension is 2.
• The third output, stats, is a structure containing several fields, described
in the following section.
The Fields of the stats Structure
The W, B, and T fields are matrix analogs to the within, between, and total sums
of squares in ordinary one-way analysis of variance. The next three fields are
the degrees of freedom for these matrices. Fields lambda, chisq, and chisqdf
are the ingredients of the test for the dimensionality of the group means. (The
p-values for these tests are the first output argument of manova1.)
The next three fields are used to do a canonical analysis. Recall that in
principal components analysis (“Principal Component Analysis (PCA)” on
page 10-31) you look for the combination of the original variables that has the
largest possible variation. In multivariate analysis of variance, you instead
look for the linear combination of the original variables that has the largest
separation between groups. It is the single variable that would give the most
significant result in a univariate one-way analysis of variance. Having found
that combination, you next look for the combination with the second highest
separation, and so on.
The eigenvec field is a matrix that defines the coefficients of the linear
combinations of the original variables. The eigenval field is a vector
measuring the ratio of the between-group variance to the within-group
variance for the corresponding linear combination. The canon field is a matrix
of the canonical variable values. Each column is a linear combination of the
mean-centered original variables, using coefficients from the eigenvec matrix.
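As a sketch (assuming the variables created above, and allowing for rounding and for any observations with missing values), you can reproduce the canonical scores by applying the eigenvec coefficients to the mean-centered data:

xc = x - repmat(nanmean(x),size(x,1),1);   % mean-center the original variables
approxCanon = xc*stats.eigenvec;           % linear combinations defined by eigenvec
[approxCanon(1:3,:) stats.canon(1:3,:)]    % compare a few rows side by side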
A grouped scatter plot of the first two canonical variables shows more
separation between groups than a grouped scatter plot of any pair of original
variables. In this example it shows three clouds of points, overlapping but
with distinct centers. One point in the bottom right sits apart from the others.
By using the gname function, you can see that this is the 20th point.
c1 = stats.canon(:,1);
c2 = stats.canon(:,2);
gscatter(c2,c1,Model_Year,[],'oxs')
gname
Roughly speaking, the first canonical variable, c1, separates the 1982 cars
(which have high values of c1) from the older cars. The second canonical
variable, c2, reveals some separation between the 1970 and 1976 cars.
The final two fields of the stats structure are Mahalanobis distances. The
mdist field measures the distance from each point to its group mean. Points
with large values may be outliers. In this data set, the largest outlier is the
one in the scatter plot, the Buick Estate station wagon. (Note that you could
have supplied the model name to the gname function above if you wanted to
label the point with its model name rather than its row number.)
max(stats.mdist)
ans =
31.5273
find(stats.mdist == ans)
ans =
20
Model(20,:)
ans =
buick_estate_wagon_(sw)
The gmdist field measures the distances between each pair of group means.
The following commands examine the group means and their distances:
grpstats(x, Model_Year)
ans =
  1.0e+003 *
    0.0177    0.1489    0.2869    3.4413
    0.0216    0.1011    0.1978    3.0787
    0.0317    0.0815    0.1289    2.4535

stats.gmdist
ans =
         0    3.8277   11.1106
    3.8277         0    6.1374
   11.1106    6.1374         0
As might be expected, the multivariate distance between the extreme years
1970 and 1982 (11.1) is larger than the difference between more closely
spaced years (3.8 and 6.1). This is consistent with the scatter plots, where the
points seem to follow a progression as the year changes from 1970 through
1976 to 1982. If you had more groups, you might find it instructive to use
the manovacluster function to draw a diagram that presents clusters of the
groups, formed using the distances between their means.
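With the stats structure above, that diagram is a single call:

manovacluster(stats)   % dendrogram of clusters of the group means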
9
Parametric Regression Analysis
• “Introduction” on page 9-2
• “Linear Regression” on page 9-3
• “Nonlinear Regression” on page 9-58
Introduction
Regression is the process of fitting models to data. The process depends on the
model. If a model is parametric, regression estimates the parameters from the
data. If a model is linear in the parameters, estimation is based on methods
from linear algebra that minimize the norm of a residual vector. If a model
is nonlinear in the parameters, estimation is based on search methods from
optimization that minimize the norm of a residual vector. Nonparametric
models, like “Classification Trees and Regression Trees” on page 13-25, use
methods all their own.
This chapter considers data and models with continuous predictors and
responses. Categorical predictors are the subject of Chapter 8, “Analysis of
Variance”. Categorical responses are the subject of Chapter 12, “Parametric
Classification” and Chapter 13, “Supervised Learning”.
Linear Regression
In this section...
“Linear Regression Models” on page 9-3
“Multiple Linear Regression” on page 9-8
“Robust Regression” on page 9-14
“Stepwise Regression” on page 9-19
“Ridge Regression” on page 9-29
“Partial Least Squares” on page 9-32
“Polynomial Models” on page 9-37
“Response Surface Models” on page 9-45
“Generalized Linear Models” on page 9-52
“Multivariate Regression” on page 9-57
Linear Regression Models
In statistics, linear regression models often take the form of something like
this:
y = 0 + 1 x1 +  2 x2 +  3 x1 x2 +  4 x12 + 5 x22 + 
Here a response variable y is modeled as a combination of constant, linear,
interaction, and quadratic terms formed from two predictor variables x1 and
x2. Uncontrolled factors and experimental errors are modeled by ε. Given data
on x1, x2, and y, regression estimates the model parameters βj (j = 1, ..., 5).
More general linear regression models represent the relationship between a
continuous response y and a continuous or categorical predictor x in the form:
y = 1 f1 ( x) + ... +  p f p ( x) + 
The response is modeled as a linear combination of (not necessarily linear)
functions of the predictor, plus a random error ε. The expressions fj(x) (j = 1,
..., p) are the terms of the model. The βj (j = 1, ..., p) are the coefficients. Errors
ε are assumed to be uncorrelated and distributed with mean 0 and constant
(but unknown) variance.
Examples of linear regression models with a scalar predictor variable x
include:
• Linear additive (straight-line) models — Terms are f1(x) = 1 and f2(x) = x.
• Polynomial models — Terms are f1(x) = 1, f2(x) = x, …, fp(x) = xp–1.
• Chebyshev orthogonal polynomial models — Terms are f1(x) = 1, f2(x) = x,
…, fp(x) = 2xfp–1(x) – fp–2(x).
• Fourier trigonometric polynomial models — Terms are f1(x) = 1/2 and sines
and cosines of different frequencies.
Examples of linear regression models with a vector of predictor variables x
= (x1, ..., xN) include:
• Linear additive (hyperplane) models — Terms are f1(x) = 1 and fk+1(x) =
xk (k = 1, ..., N).
• Pairwise interaction models — Terms are linear additive terms plus gk1k2(x)
= xk1xk2 (k1, k2 = 1, ..., N, k1 ≠ k2).
• Quadratic models — Terms are pairwise interaction terms plus hk(x) =
xk2 (k = 1, ..., N).
• Pure quadratic models — Terms are quadratic terms minus the gk1k2(x)
terms.
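For many of these standard forms, the Statistics Toolbox function x2fx builds the corresponding design matrix from a matrix of predictor columns. The following sketch uses hypothetical predictor data:

x1 = randn(10,1);  x2 = randn(10,1);  % hypothetical predictor columns
X = [x1 x2];
D1 = x2fx(X,'linear');                % constant and linear terms
D2 = x2fx(X,'interaction');           % adds the pairwise interaction term
D3 = x2fx(X,'quadratic');             % adds squared terms as well
D4 = x2fx(X,'purequadratic');         % constant, linear, and squared terms only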
Whether or not the predictor x is a vector of predictor variables, multivariate
regression refers to the case where the response y = (y1, ..., yM) is a vector of
M response variables. See “Multivariate Regression” on page 9-57 for more
on multivariate regression models.
Given n independent observations (x1, y1), …, (xn, yn) of the predictor x and the
response y, the linear regression model becomes an n-by-p system of equations:
⎛ y1 ⎞   ⎛ f1(x1)  ⋯  fp(x1) ⎞ ⎛ β1 ⎞   ⎛ ε1 ⎞
⎜  ⋮ ⎟ = ⎜   ⋮            ⋮  ⎟ ⎜  ⋮ ⎟ + ⎜  ⋮ ⎟
⎝ yn ⎠   ⎝ f1(xn)  ⋯  fp(xn) ⎠ ⎝ βp ⎠   ⎝ εn ⎠
   y               X              β        ε
X is the design matrix of the system. The columns of X are the terms of the
model evaluated at the predictors. To fit the model to the data, the system
must be solved for the p coefficient values in β = (β1, …, βp)T.
The MATLAB backslash operator \ (mldivide) solves systems of linear
equations. Ignoring the unknown error ε, MATLAB estimates model
coefficients in y = Xβ using
betahat = X\y
where X is the design matrix and y is the vector of observed responses.
MATLAB returns the least-squares solution to the system; betahat minimizes
the norm of the residual vector y-X*beta over all beta. If the system is
consistent, the norm is 0 and the solution is exact. In this case, the regression
model interpolates the data. In more typical regression cases where n > p and
the system is overdetermined, the least-squares solution estimates model
coefficients obscured by the error ε.
The least-squares estimator betahat has several important statistical
properties. First, it is unbiased, with expected value β. Second, by the
Gauss-Markov theorem, it has minimum variance among all unbiased
estimators formed from linear combinations of the response data. Under the
additional assumption that ε is normally distributed, betahat is a maximum
likelihood estimator. The assumption also implies that the estimates
themselves are normally distributed, which is useful for computing confidence
intervals. Even without the assumption, by the Central Limit theorem, the
estimates have an approximate normal distribution if the sample size is large
enough.
Visualize the least-squares estimator as follows.
For betahat to minimize norm(y-X*beta), y-X*betahat must be
perpendicular to the column space of X, which contains all linear combinations
of the model terms. This requirement is summarized in the normal equations,
which express vanishing inner products between y-X*betahat and the
columns of X:
Xᵀ(y − Xβ̂) = 0
or
XᵀXβ̂ = Xᵀy
If X is n-by-p, the normal equations are a p-by-p square system with solution
betahat = inv(X'*X)*X'*y, where inv is the MATLAB inverse operator.
The matrix inv(X'*X)*X' is the pseudoinverse of X, computed by the
MATLAB function pinv.
The normal equations are often badly conditioned relative to the original
system y = Xβ (the coefficient estimates are much more sensitive to the model
error ε), so the MATLAB backslash operator avoids solving them directly.
Instead, a QR (orthogonal, triangular) decomposition of X is used to create a
simpler, more stable triangular system:
X T X ˆ
(QR)T (QR) ˆ
RT QT QRˆ
XT y
=
= (QR)T y
=
RT QT y
RT Rˆ
=
RT QT y
Rˆ
=
QT y
Statistics Toolbox functions like regress and regstats call the MATLAB
backslash operator to perform linear regression. The QR decomposition is also
used for efficient computation of confidence intervals.
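As a sketch of the idea (not the exact internal computation), you can form and solve this triangular system with an economy-size QR decomposition:

[Q,R] = qr(X,0);      % economy-size QR of the design matrix
betahat = R\(Q'*y);   % back substitution on R*betahat = Q'*y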
Once betahat is computed, the model can be evaluated at the predictor data:
yhat = X*betahat
or
yhat = X*inv(X'*X)*X'*y
H = X*inv(X'*X)*X' is the hat matrix. It is a square, symmetric n-by-n
matrix determined by the predictor data. The diagonal elements H(i,i)
(i = 1, ..., n) give the leverage of the ith observation. Since yhat = H*y,
leverage values determine the influence of the observed response y(i) on
the predicted response yhat(i). For leverage values near 1, the predicted
response approximates the observed response. The Statistics Toolbox function
leverage computes leverage values from a QR decomposition of X.
Component residual values in y-yhat are useful for detecting failures in
model assumptions. Like the errors in ε, residuals have an expected value
of 0. Unlike the errors, however, residuals are correlated, with nonconstant
variance. Residuals may be “Studentized” (scaled by an estimate of their
standard deviation) for comparison. Studentized residuals are used by
Statistics Toolbox functions like regress and robustfit to identify outliers
in the data.
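For example, regstats can return Studentized residuals directly. The sketch below uses the moore data that appears later in this chapter; regstats adds the constant column itself:

load moore
X2 = moore(:,1:5);  y = moore(:,6);
s = regstats(y,X2,'linear',{'studres'});
s.studres             % Studentized residuals, one per observation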
Multiple Linear Regression
• “Introduction” on page 9-8
• “Programmatic Multiple Linear Regression” on page 9-9
• “Interactive Multiple Linear Regression” on page 9-11
• “Tabulating Diagnostic Statistics” on page 9-13
Introduction
The system of linear equations
⎛ y1 ⎞   ⎛ f1(x1)  ⋯  fp(x1) ⎞ ⎛ β1 ⎞   ⎛ ε1 ⎞
⎜  ⋮ ⎟ = ⎜   ⋮            ⋮  ⎟ ⎜  ⋮ ⎟ + ⎜  ⋮ ⎟
⎝ yn ⎠   ⎝ f1(xn)  ⋯  fp(xn) ⎠ ⎝ βp ⎠   ⎝ εn ⎠
   y               X              β        ε
in “Linear Regression Models” on page 9-3 expresses a response y as a linear
combination of model terms fj(x) (j = 1, ..., p) at each of the observations (x1,
y1), …, (xn, yn).
If the predictor x is multidimensional, so are the functions fj that form the
terms of the model. For example, if the predictor is x = (x1, x2), terms for the
model might include f1(x) = x1 (a linear term), f2(x) = x12 (a quadratic term),
and f3(x) = x1x2 (a pairwise interaction term). Typically, the function f(x) = 1 is
included among the fj, so that the design matrix X contains a column of 1s and
the model contains a constant (y-intercept) term.
Multiple linear regression models are useful for:
• Understanding which terms fj(x) have greatest effect on the response
(coefficients βj with greatest magnitude)
• Finding the direction of the effects (signs of the βj)
• Predicting unobserved values of the response (y(x) for new x)
The Statistics Toolbox functions regress and regstats are used for multiple
linear regression analysis.
Programmatic Multiple Linear Regression
For example, the file moore.mat contains the 20-by-6 data matrix moore. The
first five columns are measurements of biochemical oxygen demand on five
predictor variables. The final column contains the observed responses. Use
regress to find coefficient estimates betahat for a linear additive model as
follows. Before using regress give the design matrix X1 a first column of 1s to
include a constant term in the model, betahat(1).
load moore
X1 = [ones(size(moore,1),1) moore(:,1:5)];
y = moore(:,6);
betahat = regress(y,X1)
betahat =
-2.1561
-0.0000
0.0013
0.0001
0.0079
0.0001
The MATLAB backslash (mldivide) operator, which regress calls, obtains
the same result:
betahat = X1\y
betahat =
-2.1561
-0.0000
0.0013
0.0001
0.0079
0.0001
The advantage of working with regress is that it allows for additional inputs
and outputs relevant to statistical analysis of the regression. For example:
alpha = 0.05;
[betahat,Ibeta,res,Ires,stats] = regress(y,X1,alpha);
returns not only the coefficient estimates in betahat, but also
• Ibeta — A p-by-2 matrix of 95% confidence intervals for the coefficient
estimates, using a 100*(1-alpha)% confidence level. The first column
contains lower confidence bounds for each of the p coefficient estimates; the
second column contains upper confidence bounds.
• res — An n-by-1 vector of residuals.
• Ires — An n-by-2 matrix of intervals that can be used to diagnose outliers.
If the interval Ires(i,:) for observation i does not contain zero, the
corresponding residual is larger than expected in 100*(1-alpha)% of new
observations, suggesting an outlier.
• stats — A 1-by-4 vector that contains, in order, the R2 statistic, the
F statistic and its p value, and an estimate of the error variance. The
statistics are computed assuming the model contains a constant term, and
are incorrect otherwise.
Visualize the residuals, in case (row number) order, with the rcoplot
function:
rcoplot(res,Ires)
The interval around the first residual, shown in red when plotted, does not
contain zero. This indicates that the residual is larger than expected in 95%
of new observations, and suggests the data point is an outlier.
Outliers in regression appear for a variety of reasons:
1 If there is sufficient data, 5% of the residuals, by the definition of rint,
are too big.
2 If there is a systematic error in the model (that is, if the model is not
appropriate for generating the data under model assumptions), the mean
of the residuals is not zero.
3 If the errors in the model are not normally distributed, the distributions
of the residuals may be skewed or leptokurtic (with heavy tails and more
outliers).
When errors are normally distributed, Ires(i,:) is a confidence interval for
the mean of res(i) and checking if the interval contains zero is a test of the
null hypothesis that the residual has zero mean.
Interactive Multiple Linear Regression
The function regstats also performs multiple linear regression, but computes
more statistics than regress. By default, regstats automatically adds a first
column of 1s to the design matrix (necessary for computing the F statistic
and its p value), so a constant term should not be included explicitly as for
regress. For example:
X2 = moore(:,1:5);
stats = regstats(y,X2);
creates a structure stats with fields containing regression statistics. An
optional input argument allows you to specify which statistics are computed.
To interactively specify the computed statistics, call regstats without an
output argument. For example:
regstats(y,X2)
opens the following interface.
Select the check boxes corresponding to the statistics you want to compute and
click OK. Selected statistics are returned to the MATLAB workspace. Names
of container variables for the statistics appear on the right-hand side of the
interface, where they can be changed to any valid MATLAB variable name.
Tabulating Diagnostic Statistics
The regstats function computes statistics that are typically used in
regression diagnostics. Statistics can be formatted into standard tabular
displays in a variety of ways. For example, the tstat field of the stats output
structure of regstats is itself a structure containing statistics related to the
estimated coefficients of the regression. Dataset arrays (see “Dataset Arrays”
on page 2-23) provide a natural tabular format for the information:
t = stats.tstat;
CoeffTable = dataset({t.beta,'Coef'},{t.se,'StdErr'}, ...
{t.t,'tStat'},{t.pval,'pVal'})
CoeffTable =
            Coef          StdErr        tStat        pVal
         -2.1561         0.91349      -2.3603      0.0333
    -9.0116e-006      0.00051835    -0.017385     0.98637
       0.0013159       0.0012635       1.0415     0.31531
       0.0001278     7.6902e-005       1.6618     0.11876
       0.0078989           0.014      0.56421     0.58154
      0.00014165     7.3749e-005       1.9208    0.075365
The MATLAB function fprintf gives you control over tabular formatting.
For example, the fstat field of the stats output structure of regstats is a
structure with statistics related to the analysis of variance (ANOVA) of the
regression. The following commands produce a standard regression ANOVA
table:
f = stats.fstat;
fprintf('\n')
fprintf('Regression ANOVA');
fprintf('\n\n')
fprintf('%6s','Source');
fprintf('%10s','df','SS','MS','F','P');
fprintf('\n')
fprintf('%6s','Regr');
fprintf('%10.4f',f.dfr,f.ssr,f.ssr/f.dfr,f.f,f.pval);
fprintf('\n')
fprintf('%6s','Resid');
fprintf('%10.4f',f.dfe,f.sse,f.sse/f.dfe);
fprintf('\n')
fprintf('%6s','Total');
fprintf('%10.4f',f.dfe+f.dfr,f.sse+f.ssr);
fprintf('\n')
The result looks like this:
Regression ANOVA

Source        df        SS        MS         F         P
  Regr    5.0000    4.1084    0.8217   11.9886    0.0001
 Resid   14.0000    0.9595    0.0685
 Total   19.0000    5.0679
Robust Regression
• “Introduction” on page 9-14
• “Programmatic Robust Regression” on page 9-15
• “Interactive Robust Regression” on page 9-16
Introduction
The models described in “Linear Regression Models” on page 9-3 are based on
certain assumptions, such as a normal distribution of errors in the observed
responses. If the distribution of errors is asymmetric or prone to outliers,
model assumptions are invalidated, and parameter estimates, confidence
intervals, and other computed statistics become unreliable. The Statistics
Toolbox function robustfit is useful in these cases. The function implements
a robust fitting method that is less sensitive than ordinary least squares to
large changes in small parts of the data.
Robust regression works by assigning a weight to each data point. Weighting
is done automatically and iteratively using a process called iteratively
reweighted least squares. In the first iteration, each point is assigned equal
weight and model coefficients are estimated using ordinary least squares. At
subsequent iterations, weights are recomputed so that points farther from
model predictions in the previous iteration are given lower weight. Model
coefficients are then recomputed using weighted least squares. The process
continues until the values of the coefficient estimates converge within a
specified tolerance.
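The following is a conceptual sketch of a single reweighting step using a bisquare weight function. It illustrates the idea only; it is not the exact computation inside robustfit, which, among other refinements, also adjusts the residuals for leverage.

load moore                                    % example data used elsewhere in this chapter
X1 = [ones(size(moore,1),1) moore(:,1:5)];    % design matrix with a constant column
y = moore(:,6);
betahat = X1\y;                               % ordinary least-squares starting fit
r = y - X1*betahat;                           % residuals from the current fit
s = mad(r,1)/0.6745;                          % robust estimate of the residual scale
u = r/(4.685*s);                              % scaled residuals (bisquare tuning constant)
w = (abs(u) < 1).*(1 - u.^2).^2;              % bisquare weights; large residuals get weight 0
sw = sqrt(w);
betahat = (repmat(sw,1,size(X1,2)).*X1)\(sw.*y);   % weighted least-squares update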
Programmatic Robust Regression
The example in “Multiple Linear Regression” on page 9-8 shows an outlier
when ordinary least squares is used to model the response variable as a linear
combination of the five predictor variables. To determine the influence of the
outlier, compare the coefficient estimates computed by regress:
load moore
X1 = [ones(size(moore,1),1) moore(:,1:5)];
y = moore(:,6);
betahat = regress(y,X1)
betahat =
-2.1561
-0.0000
0.0013
0.0001
0.0079
0.0001
to those computed by robustfit:
X2 = moore(:,1:5);
robustbeta = robustfit(X2,y)
robustbeta =
-1.7516
0.0000
0.0009
0.0002
0.0060
0.0001
By default, robustfit automatically adds a first column of 1s to the design
matrix, so a constant term does not have to be included explicitly as for
regress. In addition, the order of inputs is reversed for robustfit and
regress.
To understand the difference in the coefficient estimates, look at the final
weights given to the data points in the robust fit:
[robustbeta,stats] = robustfit(X2,y);
stats.w'
ans =
  Columns 1 through 5
    0.0246    0.9986    0.9763    0.9323    0.9704
  Columns 6 through 10
    0.8597    0.9180    0.9992    0.9590    0.9649
  Columns 11 through 15
    0.9769    0.9868    0.9999    0.9976    0.8122
  Columns 16 through 20
    0.9733    0.9892    0.9988    0.8974    0.6774
The first data point has a very low weight compared to the other data points,
and so is effectively ignored in the robust regression.
Interactive Robust Regression
The robustdemo function shows the difference between ordinary least squares
and robust fitting for data with a single predictor. You can use data provided
with the demo, or you can supply your own data. The following steps show
you how to use robustdemo.
1 Start the demo. To begin using robustdemo with the built-in data, simply
enter the function name at the command line:
robustdemo
The resulting figure shows a scatter plot with two fitted lines. The red line
is the fit using ordinary least-squares regression. The green line is the
fit using robust regression. At the bottom of the figure are the equations
for the fitted lines, together with the estimated root mean squared errors
for each fit.
2 View leverages and robust weights. Right-click on any data point to
see its least-squares leverage and robust weight.
In the built-in data, the rightmost point has a relatively high leverage of
0.35. The point exerts a large influence on the least-squares fit, but its
small robust weight shows that it is effectively excluded from the robust fit.
3 See how changes in the data affect the fits. With the left mouse
button, click and hold on any data point and drag it to a new location.
When you release the mouse button, the displays update.
Bringing the rightmost data point closer to the least-squares line makes
the two fitted lines nearly identical. The adjusted rightmost data point has
significant weight in the robust fit.
Stepwise Regression
• “Introduction” on page 9-20
• “Programmatic Stepwise Regression” on page 9-21
• “Interactive Stepwise Regression” on page 9-27
Introduction
Multiple linear regression models, as described in “Multiple Linear
Regression” on page 9-8, are built from a potentially large number of
predictive terms. The number of interaction terms, for example, increases
exponentially with the number of predictor variables. If there is no theoretical
basis for choosing the form of a model, and no assessment of correlations
among terms, it is possible to include redundant terms in a model that confuse
the identification of significant effects.
Stepwise regression is a systematic method for adding and removing terms
from a multilinear model based on their statistical significance in a regression.
The method begins with an initial model and then compares the explanatory
power of incrementally larger and smaller models. At each step, the p value of
an F-statistic is computed to test models with and without a potential term. If
a term is not currently in the model, the null hypothesis is that the term would
have a zero coefficient if added to the model. If there is sufficient evidence to
reject the null hypothesis, the term is added to the model. Conversely, if a
term is currently in the model, the null hypothesis is that the term has a zero
coefficient. If there is insufficient evidence to reject the null hypothesis, the
term is removed from the model. The method proceeds as follows:
1 Fit the initial model.
2 If any terms not in the model have p-values less than an entrance tolerance
(that is, if it is unlikely that they would have zero coefficient if added to
the model), add the one with the smallest p value and repeat this step;
otherwise, go to step 3.
3 If any terms in the model have p-values greater than an exit tolerance (that
is, if it is unlikely that the hypothesis of a zero coefficient can be rejected),
remove the one with the largest p value and go to step 2; otherwise, end.
Depending on the terms included in the initial model and the order in which
terms are moved in and out, the method may build different models from the
same set of potential terms. The method terminates when no single step
improves the model. There is no guarantee, however, that a different initial
model or a different sequence of steps will not lead to a better fit. In this
sense, stepwise models are locally optimal, but may not be globally optimal.
Statistics Toolbox functions for stepwise regression are:
• stepwisefit — A function that proceeds automatically from a specified
initial model and entrance/exit tolerances
• stepwise — An interactive tool that allows you to explore individual steps
in the regression
Programmatic Stepwise Regression
For example, load the data in hald.mat, which contains observations of the
heat of reaction of various cement mixtures:
load hald
whos
  Name             Size      Bytes  Class     Attributes

  Description      22x58      2552  char
  hald             13x5        520  double
  heat             13x1        104  double
  ingredients      13x4        416  double
The response (heat) depends on the quantities of the four predictors (the
columns of ingredients).
Use stepwisefit to carry out the stepwise regression algorithm, beginning
with no terms in the model and using entrance/exit tolerances of 0.05/0.10
on the p-values:
stepwisefit(ingredients,heat,...
'penter',0.05,'premove',0.10);
Initial columns included: none
Step 1, added column 4, p=0.000576232
Step 2, added column 1, p=1.10528e-006
Final columns included: 1 4
    'Coeff'      'Std.Err.'    'Status'    'P'
    [ 1.4400]    [  0.1384]    'In'        [1.1053e-006]
    [ 0.4161]    [  0.1856]    'Out'       [     0.0517]
    [-0.4100]    [  0.1992]    'Out'       [     0.0697]
    [-0.6140]    [  0.0486]    'In'        [1.8149e-007]
stepwisefit automatically includes an intercept term in the model, so you do
not add it explicitly to ingredients as you would for regress. For terms not
in the model, coefficient estimates and their standard errors are those that
result if the term is added.
The inmodel parameter is used to specify terms in an initial model:
initialModel = ...
[false true false false]; % Force in 2nd term
stepwisefit(ingredients,heat,...
'inmodel',initialModel,...
'penter',.05,'premove',0.10);
Initial columns included: 2
Step 1, added column 1, p=2.69221e-007
Final columns included: 1 2
    'Coeff'      'Std.Err.'    'Status'    'P'
    [ 1.4683]    [  0.1213]    'In'        [2.6922e-007]
    [ 0.6623]    [  0.0459]    'In'        [5.0290e-008]
    [ 0.2500]    [  0.1847]    'Out'       [     0.2089]
    [-0.2365]    [  0.1733]    'Out'       [     0.2054]
The preceding two models, built from different initial models, use different
subsets of the predictive terms. Terms 2 and 4, swapped in the two models,
are highly correlated:
term2 = ingredients(:,2);
term4 = ingredients(:,4);
R = corrcoef(term2,term4)
R =
    1.0000   -0.9730
   -0.9730    1.0000
To compare the models, use the stats output of stepwisefit:
[betahat1,se1,pval1,inmodel1,stats1] = ...
stepwisefit(ingredients,heat,...
'penter',.05,'premove',0.10,...
'display','off');
[betahat2,se2,pval2,inmodel2,stats2] = ...
stepwisefit(ingredients,heat,...
'inmodel',initialModel,...
'penter',.05,'premove',0.10,...
'display','off');
RMSE1 = stats1.rmse
RMSE1 =
2.7343
RMSE2 = stats2.rmse
RMSE2 =
2.4063
The second model has a lower Root Mean Square Error (RMSE).
An added variable plot is used to determine the unique effect of adding a new
term to a model. The plot shows the relationship between the part of the
response unexplained by terms already in the model and the part of the new
term unexplained by terms already in the model. The “unexplained” parts
are measured by the residuals of the respective regressions. A scatter of the
residuals from the two regressions forms the added variable plot.
For example, suppose you want to add term2 to a model that already contains
the single term term1. First, consider the ability of term2 alone to explain
the response:
load hald
term2 = ingredients(:,2);
[b2,Ib2,res2] = regress(heat,[ones(size(term2)) term2]);
scatter(term2,heat)
xlabel('Term 2')
ylabel('Heat')
hold on
x2 = 20:80;
y2 = b2(1) + b2(2)*x2;
plot(x2,y2,'r')
title('{\bf Response Explained by Term 2: Ignoring Term 1}')
Next, consider the following regressions involving the model term term1:
term1 = ingredients(:,1);
[b1,Ib1,res1] = regress(heat,[ones(size(term1)) term1]);
[b21,Ib21,res21] = regress(term2,[ones(size(term1)) term1]);
bres = regress(res1,[ones(size(res21)) res21]);
A scatter of the residuals res1 vs. the residuals res21 forms the added
variable plot:
figure
scatter(res21,res1)
xlabel('Residuals: Term 2 on Term 1')
ylabel('Residuals: Heat on Term 1')
hold on
xres = -30:30;
yres = bres(1) + bres(2)*xres;
plot(xres,yres,'r')
title('{\bf Response Explained by Term 2: Adjusted for Term 1}')
Since the plot adjusted for term1 shows a stronger relationship (less variation
along the fitted line) than the plot ignoring term1, the two terms act jointly to
explain extra variation. In this case, adding term2 to a model consisting of
term1 would reduce the RMSE.
The Statistics Toolbox function addedvarplot produces added variable plots.
The previous plot is essentially the one produced by the following:
figure
inmodel = [true false false false];
addedvarplot(ingredients,heat,2,inmodel)
In addition to the scatter of residuals, the plot shows 95% confidence intervals
on predictions from the fitted line. The fitted line has intercept zero because,
under the assumptions outlined in “Linear Regression Models” on page 9-3,
both of the plotted variables have mean zero. The slope of the fitted line is the
coefficient that term2 would have if it were added to the model with term1.
The addedvarplot function is useful for considering the unique effect of adding
a new term to an existing model with any number of terms.
Interactive Stepwise Regression
The stepwise interface provides interactive features that allow you to
investigate individual steps in a stepwise regression, and to build models
from arbitrary subsets of the predictive terms. To open the interface with
data from hald.mat:
load hald
stepwise(ingredients,heat)
The upper left of the interface displays estimates of the coefficients for all
potential terms, with horizontal bars indicating 90% (colored) and 95% (grey)
confidence intervals. The red color indicates that, initially, the terms are not
in the model. Values displayed in the table are those that would result if
the terms were added to the model.
The middle portion of the interface displays summary statistics for the entire
model. These statistics are updated with each step.
The lower portion of the interface, Model History, displays the RMSE for
the model. The plot tracks the RMSE from step to step, so you can compare
the optimality of different models. Hover over the blue dots in the history to
see which terms were in the model at a particular step. Click on a blue dot
in the history to open a copy of the interface initialized with the terms in
the model at that step.
Initial models, as well as entrance/exit tolerances for the p-values of
F-statistics, are specified using additional input arguments to stepwise.
Defaults are an initial model with no terms, an entrance tolerance of 0.05,
and an exit tolerance of 0.10.
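For example, the following sketch opens the interface with the second term forced into the initial model and the same tolerances used in the programmatic examples above (the positional arguments give the initial model and the entrance/exit tolerances):

load hald
initialModel = [false true false false];   % force in the 2nd term
stepwise(ingredients,heat,initialModel,0.05,0.10)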
To center and scale the input data (compute z-scores) to improve conditioning
of the underlying least-squares problem, select Scale Inputs from the
Stepwise menu.
You proceed through a stepwise regression in one of two ways:
1 Click Next Step to select the recommended next step. The recommended
next step either adds the most significant term or removes the least
significant term. When the regression reaches a local minimum of RMSE,
the recommended next step is “Move no terms.” You can perform all of the
recommended steps at once by clicking All Steps.
2 Click a line in the plot or in the table to toggle the state of the corresponding
term. Clicking a red line, corresponding to a term not currently in the
model, adds the term to the model and changes the line to blue. Clicking
a blue line, corresponding to a term currently in the model, removes the
term from the model and changes the line to red.
To call addedvarplot and produce an added variable plot from the stepwise
interface, select Added Variable Plot from the Stepwise menu. A list of
terms is displayed. Select the term you want to add, and then click OK.
Click Export to display a dialog box that allows you to select information
from the interface to save to the MATLAB workspace. Check the information
you want to export and, optionally, change the names of the workspace
variables to be created. Click OK to export the information.
Ridge Regression
• “Introduction” on page 9-29
• “Example: Ridge Regression” on page 9-30
Introduction
Coefficient estimates for the models described in “Multiple Linear Regression”
on page 9-8 rely on the independence of the model terms. When terms are
correlated and the columns of the design matrix X have an approximate
linear dependence, the matrix (XTX)–1 becomes close to singular. As a result,
the least-squares estimate
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
becomes highly sensitive to random errors in the observed response y,
producing a large variance. This situation of multicollinearity can arise, for
example, when data are collected without an experimental design.
Ridge regression addresses the problem by estimating regression coefficients
using
$$\hat{\beta} = (X^T X + kI)^{-1} X^T y$$
where k is the ridge parameter and I is the identity matrix. Small positive
values of k improve the conditioning of the problem and reduce the variance
of the estimates. While biased, the reduced variance of ridge estimates
often results in a smaller mean squared error when compared to least-squares
estimates.
The Statistics Toolbox function ridge carries out ridge regression.
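Before turning to the ridge function, the estimator above can be computed directly. The following is a minimal sketch using simulated data; the standardization shown here (z-scored predictors, centered response) is an assumption for illustration and may differ in detail from the conventions ridge uses internally:

n = 50;
X = randn(n,3);                          % simulated predictors
y = X*[1; 2; -1] + randn(n,1);           % simulated response
Z = zscore(X);                           % standardized predictors
yc = y - mean(y);                        % centered response
k = 0.1;                                 % one assumed value of the ridge parameter
bk = (Z'*Z + k*eye(size(Z,2))) \ (Z'*yc) % ridge estimate for this k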
Example: Ridge Regression
For example, load the data in acetylene.mat, with observations of the
predictor variables x1, x2, x3, and the response variable y:
load acetylene
Plot the predictor variables against each other:
subplot(1,3,1)
plot(x1,x2,'.')
xlabel('x1'); ylabel('x2'); grid on; axis square
subplot(1,3,2)
plot(x1,x3,'.')
xlabel('x1'); ylabel('x3'); grid on; axis square
subplot(1,3,3)
plot(x2,x3,'.')
xlabel('x2'); ylabel('x3'); grid on; axis square
Note the correlation between x1 and the other two predictor variables.
Use ridge and x2fx to compute coefficient estimates for a multilinear model
with interaction terms, for a range of ridge parameters:
X = [x1 x2 x3];
D = x2fx(X,'interaction');
D(:,1) = []; % No constant term
k = 0:1e-5:5e-3;
betahat = ridge(y,D,k);
Plot the ridge trace:
figure
plot(k,betahat,'LineWidth',2)
ylim([-100 100])
grid on
xlabel('Ridge Parameter')
ylabel('Standardized Coefficient')
title('{\bf Ridge Trace}')
legend('x1','x2','x3','x1x2','x1x3','x2x3')
The estimates stabilize to the right of the plot. Note that the coefficient of
the x2x3 interaction term changes sign at a value of the ridge parameter
≈ 5 × 10^-4.
Partial Least Squares
• “Introduction” on page 9-33
• “Example: Partial Least Squares” on page 9-33
Introduction
Partial least-squares (PLS) regression is a technique used with data that
contain correlated predictor variables. This technique constructs new
predictor variables, known as components, as linear combinations of the
original predictor variables. PLS constructs these components while
considering the observed response values, leading to a parsimonious model
with reliable predictive power.
The technique is something of a cross between multiple linear regression
and principal component analysis:
• Multiple linear regression finds a combination of the predictors that best fit
a response.
• Principal component analysis finds combinations of the predictors with
large variance, reducing correlations. The technique makes no use of
response values.
• PLS finds combinations of the predictors that have a large covariance with
the response values.
PLS therefore combines information about the variances of both the predictors
and the responses, while also considering the correlations among them.
PLS shares characteristics with other regression and feature transformation
techniques. It is similar to ridge regression in that it is used in situations with
correlated predictors. It is similar to stepwise regression (or more general
feature selection techniques) in that it can be used to select a smaller set of
model terms. PLS differs from these methods, however, by transforming the
original predictor space into the new component space.
The Statistics Toolbox function plsregress carries out PLS regression.
Example: Partial Least Squares
For example, consider the data on biochemical oxygen demand in moore.mat,
padded with noisy versions of the predictors to introduce correlations:
load moore
y = moore(:,6);              % Response
X0 = moore(:,1:5);           % Original predictors
X1 = X0+10*randn(size(X0));  % Correlated predictors
X = [X0,X1];
Use plsregress to perform PLS regression with the same number of
components as predictors, then plot the percentage variance explained in the
response as a function of the number of components:
[XL,yl,XS,YS,beta,PCTVAR] = plsregress(X,y,10);
plot(1:10,cumsum(100*PCTVAR(2,:)),'-bo');
xlabel('Number of PLS components');
ylabel('Percent Variance Explained in y');
Choosing the number of components in a PLS model is a critical step. The plot
gives a rough indication, showing nearly 80% of the variance in y explained
by the first component, with as many as five additional components making
significant contributions.
The following computes the six-component model:
[XL,yl,XS,YS,beta,PCTVAR,MSE,stats] = plsregress(X,y,6);
yfit = [ones(size(X,1),1) X]*beta;
plot(y,yfit,'o')
The scatter shows a reasonable correlation between fitted and observed
responses, and this is confirmed by the R^2 statistic:
TSS = sum((y-mean(y)).^2);
RSS = sum((y-yfit).^2);
Rsquared = 1 - RSS/TSS
Rsquared =
0.8421
A plot of the weights of the ten predictors in each of the six components shows
that two of the components (the last two computed) explain the majority of
the variance in X:
plot(1:10,stats.W,'o-');
legend({'c1','c2','c3','c4','c5','c6'},'Location','NW')
xlabel('Predictor');
ylabel('Weight');
A plot of the mean-squared errors suggests that as few as two components
may provide an adequate model:
[axes,h1,h2] = plotyy(0:6,MSE(1,:),0:6,MSE(2,:));
set(h1,'Marker','o')
set(h2,'Marker','o')
legend('MSE Predictors','MSE Response')
xlabel('Number of Components')
The calculation of mean-squared errors by plsregress is controlled by
optional parameter name/value pairs specifying cross-validation type and the
number of Monte Carlo repetitions.
Polynomial Models
• “Introduction” on page 9-37
• “Programmatic Polynomial Regression” on page 9-38
• “Interactive Polynomial Regression” on page 9-43
Introduction
Polynomial models are a special case of the linear models discussed in “Linear
Regression Models” on page 9-3. Polynomial models have the advantages of
being simple, familiar in their properties, and reasonably flexible for following
data trends. They are also robust with respect to changes in the location and
scale of the data (see “Conditioning Polynomial Fits” on page 9-41). However,
polynomial models may be poor predictors of new values. They oscillate
between data points, especially as the degree is increased to improve the fit.
Asymptotically, they follow power functions, leading to inaccuracies when
extrapolating other long-term trends. Choosing a polynomial model is often a
trade-off between a simple description of overall data trends and the accuracy
of predictions made from the model.
Programmatic Polynomial Regression
• “Functions for Polynomial Fitting” on page 9-38
• “Displaying Polynomial Fits” on page 9-40
• “Conditioning Polynomial Fits” on page 9-41
Functions for Polynomial Fitting. To fit polynomials to data, MATLAB
and Statistics Toolbox software offer a number of dedicated functions. The
MATLAB function polyfit computes least-squares coefficient estimates for
polynomials of arbitrary degree. For example:
x = 0:5;              % x data
y = [2 1 4 4 3 2];    % y data
p = polyfit(x,y,3)    % Degree 3 fit
p =
   -0.1296    0.6865   -0.1759    1.6746
Polynomial coefficients in p are listed from highest to lowest degree, so p(x)
≈ -0.13x^3 + 0.69x^2 - 0.18x + 1.67. For convenience, polyfit sets up the
Vandermonde design matrix (vander) and calls backslash (mldivide) to
perform the fit.
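As a sketch of the computation behind this degree-3 fit (polyfit also applies its own safeguards, so this shows only the basic idea):

x = (0:5)';                          % x data as a column
y = [2 1 4 4 3 2]';                  % y data as a column
V = [x.^3 x.^2 x ones(size(x))];     % Vandermonde-style design, highest degree first
p_manual = (V\y)'                    % matches polyfit(x,y,3) up to round-off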
Once the coefficients of a polynomial are collected in a vector p, use the
MATLAB function polyval to evaluate the polynomial at arbitrary inputs.
For example, the following plots the data and the fit over a range of inputs:
plot(x,y,'ro','LineWidth',2) % Plot data
hold on
xfit = -1:0.01:6;
yfit = polyval(p,xfit);
plot(xfit,yfit,'LineWidth',2) % Plot fit
ylim([0,5])
grid on
Use the MATLAB function roots to find the roots of p:
r = roots(p)
r =
5.4786
-0.0913 + 1.5328i
-0.0913 - 1.5328i
The MATLAB function poly solves the inverse problem, finding a polynomial
with specified roots. poly is the inverse of roots up to ordering, scaling, and
round-off error.
An optional output from polyfit is passed to polyval or to the Statistics
Toolbox function polyconf to compute prediction intervals for the fit.
For example, the following computes 95% prediction intervals for new
observations at each value of the predictor x:
[p,S] = polyfit(x,y,3);
[yhat,delta] = polyconf(p,x,S);
PI = [yhat-delta;yhat+delta]'
PI =
   -5.3022    8.6514
   -4.2068    8.3179
   -2.9899    9.0534
   -2.1963    9.8471
   -2.6036    9.9211
   -5.2229    8.7308
Optional input arguments to polyconf allow you to compute prediction
intervals for estimated values (yhat) as well as new observations, and to
compute the bounds simultaneously for all x instead of nonsimultaneously
(the default). The confidence level for the intervals can also be set.
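For instance, the following sketch requests 99% simultaneous bounds for the fitted curve rather than for new observations, reusing p, x, and S from the fit above:

[yhatc,deltac] = polyconf(p,x,S,'alpha',0.01,...
                          'predopt','curve','simopt','on');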
Displaying Polynomial Fits. The documentation example function
polydemo combines the functions polyfit, polyval, roots, and polyconf to
produce a formatted display of data with a polynomial fit.
Note Statistics Toolbox documentation example files are located in the
\help\toolbox\stats\examples subdirectory of your MATLAB root folder
(matlabroot). This subdirectory is not on the MATLAB path at installation.
To use the files in this subdirectory, either add the subdirectory to the
MATLAB path (addpath) or make the subdirectory your current working
folder (cd).
For example, the following uses polydemo to produce a display of simulated
data with a quadratic trend, a fitted polynomial, and 95% prediction intervals
for new observations:
x = -5:5;
y = x.^2 - 5*x - 3 + 5*randn(size(x));
p = polydemo(x,y,2,0.05)
p =
    0.8107   -4.5054   -1.1862
polydemo calls the documentation example function polystr to convert the
coefficient vector p into a string for the polynomial expression displayed in the
figure title.
Conditioning Polynomial Fits. If x and y data are on very different
scales, polynomial fits may be badly conditioned, in the sense that coefficient
estimates are very sensitive to random errors in the data. For example,
using polyfit to estimate coefficients of a cubic fit to the U.S. census data in
census.mat produces the following warning:
load census
x = cdate;
y = pop;
p = polyfit(x,y,3);
Warning: Polynomial is badly conditioned.
Add points with distinct X values,
reduce the degree of the polynomial,
or try centering and scaling as
described in HELP POLYFIT.
The following implements the suggested centering and scaling, and
demonstrates the robustness of polynomial fits under these transformations:
plot(x,y,'ro') % Plot data
hold on
z = zscore(x); % Compute z-scores of x data
zfit = linspace(z(1),z(end),100);
pz = polyfit(z,y,3); % Compute conditioned fit
yfit = polyval(pz,zfit);
xfit = linspace(x(1),x(end),100);
plot(xfit,yfit,'b-') % Plot conditioned fit vs. x data
grid on
Interactive Polynomial Regression
The functions polyfit, polyval, and polyconf are interactively applied to
data using two graphical interfaces for polynomial fitting:
• “The Basic Fitting Tool” on page 9-43
• “The Polynomial Fitting Tool” on page 9-44
The Basic Fitting Tool. The Basic Fitting Tool is a MATLAB interface,
discussed in “Interactive Fitting” in the MATLAB documentation. The tool
allows you to:
• Fit interpolants and polynomials of degree ≤ 10
• Plot residuals and compute their norm
• Interpolate or extrapolate values from the fit
• Save results to the MATLAB workspace
The Polynomial Fitting Tool. The Statistics Toolbox function polytool
opens the Polynomial Fitting Tool. For example, the following opens the
interface using simulated data with a quadratic trend and displays a fitted
polynomial with 95% prediction intervals for new observations:
x = -5:5;
y = x.^2 - 5*x - 3 + 5*randn(size(x));
polytool(x,y,2,0.05)
The tool allows you to:
• Interactively change the degree of the fit. Change the value in the Degree
text box at the top of the figure.
• Evaluate the fit and the bounds using a movable crosshair. Click, hold, and
drag the crosshair to change its position.
• Export estimated coefficients, predicted values, prediction intervals, and
residuals to the MATLAB workspace. Click Export to open a dialog box
with choices for exporting the data.
Options for the displayed bounds and the fitting method are available through
menu options at the top of the figure:
• The Bounds menu lets you choose between bounds on new observations
(the default) and bounds on estimated values. It also lets you choose
between nonsimultaneous (the default) and simultaneous bounds. See
polyconf for a description of these options.
• The Method menu lets you choose between ordinary least-squares
regression and robust regression, as described in “Robust Regression” on
page 9-14.
Response Surface Models
• “Introduction” on page 9-45
• “Programmatic Response Surface Methodology” on page 9-46
• “Interactive Response Surface Methodology” on page 9-51
Introduction
Polynomial models are generalized to any number of predictor variables xi (i
= 1, ..., N) as follows:

$$y(x) = a_0 + \sum_{i=1}^{N} a_i x_i + \sum_{i<j} a_{ij} x_i x_j + \sum_{i=1}^{N} a_{ii} x_i^2 + \cdots$$
The model includes, from left to right, an intercept, linear terms, quadratic
interaction terms, and squared terms. Higher order terms would follow, as
necessary.
Response surface models are multivariate polynomial models. They typically
arise in the design of experiments (see Chapter 15, “Design of Experiments”),
where they are used to determine a set of design variables that optimize a
response. Linear terms alone produce models with response surfaces that
are hyperplanes. The addition of interaction terms allows for warping of
the hyperplane. Squared terms produce the simplest models in which the
response surface has a maximum or minimum, and so an optimal response.
Response surface methodology (RSM) is the process of adjusting predictor
variables to move the response in a desired direction and, iteratively, to an
optimum. The method generally involves a combination of both computation
and visualization. The use of quadratic response surface models makes the
method much simpler than standard nonlinear techniques for determining
optimal designs.
Programmatic Response Surface Methodology
The file reaction.mat contains simulated data on the rate of a chemical
reaction:
load reaction
The variables include:
• rate — A 13-by-1 vector of observed reaction rates
• reactants — A 13-by-3 matrix of reactant concentrations
• xn — The names of the three reactants
• yn — The name of the response
In “Nonlinear Regression” on page 9-58, the nonlinear Hougen-Watson model
is fit to the data using nlinfit. However, there may be no theoretical basis
for choosing a particular model to fit the data. A quadratic response surface
model provides a simple way to determine combinations of reactants that
lead to high reaction rates.
As described in “Multiple Linear Regression” on page 9-8, the regress and
regstats functions fit linear models—including response surface models—to
data using a design matrix of model terms evaluated at predictor data. The
x2fx function converts predictor data to design matrices for quadratic models.
The regstats function calls x2fx when instructed to do so.
For example, the following fits a quadratic response surface model to the
data in reaction.mat:
stats = regstats(rate,reactants,'quadratic','beta');
b = stats.beta; % Model coefficients
The 10-by-1 vector b contains, in order, a constant term and then the
coefficients for the model terms x1, x2, x3, x1x2, x1x3, x2x3, x1^2, x2^2, and x3^2, where
x1, x2, and x3 are the three columns of reactants. The order of coefficients for
quadratic models is described in the reference page for x2fx.
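As a quick sketch, passing a single row of predictor values to x2fx shows this ordering directly:

D1 = x2fx([1 2 3],'quadratic')
% Columns: constant, x1, x2, x3, x1*x2, x1*x3, x2*x3, x1^2, x2^2, x3^2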
Since the model involves only three predictors, it is possible to visualize the
entire response surface using a color dimension for the reaction rate:
x1 = reactants(:,1);
x2 = reactants(:,2);
x3 = reactants(:,3);
xx1 = linspace(min(x1),max(x1),25);
xx2 = linspace(min(x2),max(x2),25);
xx3 = linspace(min(x3),max(x3),25);
[X1,X2,X3] = meshgrid(xx1,xx2,xx3);
RATE = b(1) + b(2)*X1 + b(3)*X2 + b(4)*X3 + ...
b(5)*X1.*X2 + b(6)*X1.*X3 + b(7)*X2.*X3 + ...
b(8)*X1.^2 + b(9)*X2.^2 + b(10)*X3.^2;
hmodel = scatter3(X1(:),X2(:),X3(:),5,RATE(:),'filled');
hold on
hdata = scatter3(x1,x2,x3,'ko','filled');
axis tight
xlabel(xn(1,:))
ylabel(xn(2,:))
zlabel(xn(3,:))
hbar = colorbar;
ylabel(hbar,yn);
title('{\bf Quadratic Response Surface Model}')
legend(hdata,'Data','Location','NE')
The plot shows a general increase in model response, within the space of
the observed data, as the concentration of n-pentane increases and the
concentrations of hydrogen and isopentane decrease.
Before trying to determine optimal values of the predictors, perhaps by
collecting more data in the direction of increased reaction rate indicated by
the plot, it is helpful to evaluate the geometry of the response surface. If x
= (x1, x2, x3)^T is the vector of predictors, and H is the matrix such that x^T H x
gives the quadratic terms of the model, the model has a unique optimum if
and only if H is positive definite. For the data in this example, the model does
not have a unique optimum:
H = [b(8),b(5)/2,b(6)/2; ...
b(5)/2,b(9),b(7)/2; ...
b(6)/2,b(7)/2,b(10)];
lambda = eig(H)
lambda =
1.0e-003 *
-0.1303
0.0412
0.4292
The negative eigenvalue shows a lack of positive definiteness. The saddle in
the model is visible if the range of the predictors in the plot (xx1, xx2, and
xx3) is expanded:
When the number of predictors makes it impossible to visualize the entire
response surface, 3-, 2-, and 1-dimensional slices provide local views. The
MATLAB function slice displays 2-dimensional contours of the data at fixed
values of the predictors:
delete(hmodel)
X2slice = 200; % Fix n-Pentane concentration
slice(X1,X2,X3,RATE,[],X2slice,[])
One-dimensional contours are displayed by the Response Surface Tool,
rstool, described in the next section.
Interactive Response Surface Methodology
The Statistics Toolbox function rstool opens a GUI for interactively
investigating simultaneous one-dimensional contours of multidimensional
response surface models. For example, the following opens the interface with
a quadratic response surface fit to the data in reaction.mat:
load reaction
alpha = 0.01; % Significance level
rstool(reactants,rate,'quadratic',alpha,xn,yn)
A sequence of plots is displayed, each showing a contour of the response
surface against a single predictor, with all other predictors held fixed.
Confidence intervals for new observations are shown as dashed red curves
above and below the response. Predictor values are displayed in the text
boxes on the horizontal axis and are marked by vertical dashed blue lines
in the plots. Predictor values are changed by editing the text boxes or by
dragging the dashed blue lines. When you change the value of a predictor, all
plots update to show the new point in predictor space.
Note The Statistics Toolbox demonstration function rsmdemo generates
simulated data for experimental settings specified by either the user or by
a D-optimal design generated by cordexch. It uses the rstool interface to
visualize response surface models fit to the data, and it uses the nlintool
interface to visualize a nonlinear model fit to the data.
Generalized Linear Models
• “Introduction” on page 9-52
• “Example: Generalized Linear Models” on page 9-53
Introduction
Linear regression models describe a linear relationship between a response
and one or more predictive terms. Many times, however, a nonlinear
relationship exists. “Nonlinear Regression” on page 9-58 describes general
nonlinear models. A special class of nonlinear models, known as generalized
linear models, makes use of linear methods.
Recall that linear models have the following characteristics:
• At each set of values for the predictors, the response has a normal
distribution with mean μ.
• A coefficient vector b defines a linear combination Xb of the predictors X.
• The model is μ = Xb.
In generalized linear models, these characteristics are generalized as follows:
• At each set of values for the predictors, the response has a distribution
that may be normal, binomial, Poisson, gamma, or inverse Gaussian, with
parameters including a mean μ.
• A coefficient vector b defines a linear combination Xb of the predictors X.
• A link function f defines the model as f(μ) = Xb.
Example: Generalized Linear Models
The following data are derived from carbig.mat, which contains
measurements of large cars of various weights. Each weight in w has a
corresponding number of cars in total and a corresponding number of
poor-mileage cars in poor:
w = [2100 2300 2500 2700 2900 3100 ...
3300 3500 3700 3900 4100 4300]';
total = [48 42 31 34 31 21 23 23 21 16 17 21]';
poor = [1 2 0 3 8 8 14 17 19 15 17 21]';
A plot shows that the proportion of poor-mileage cars follows an S-shaped
sigmoid:
plot(w,poor./total,'x','LineWidth',2)
grid on
xlabel('Weight')
ylabel('Proportion of Poor-Mileage Cars')
The logistic model is useful for proportion data. It defines the relationship
between the proportion p and the weight w by:
log[p/(1 – p)] = b1 + b2w
Some of the proportions in the data are 0 and 1, making the left-hand side of
this equation undefined. To keep the proportions within range, add relatively
small perturbations to the poor and total values. A semi-log plot then shows
a nearly linear relationship, as predicted by the model:
p_adjusted = (poor+.5)./(total+1);
semilogy(w,p_adjusted./(1-p_adjusted),'x','LineWidth',2)
grid on
xlabel('Weight')
ylabel('Adjusted p / (1 - p)')
It is reasonable to assume that the values of poor follow binomial
distributions, with the number of trials given by total and the percentage
of successes depending on w. This distribution can be accounted for in the
context of a logistic model by using a generalized linear model with link
function log(µ/(1 – µ)) = Xb.
Use the glmfit function to carry out the associated regression:
b = glmfit(w,[poor total],'binomial','link','logit')
b =
-13.3801
0.0042
To use the coefficients in b to compute fitted proportions, invert the logistic
relationship:
p = 1/(1 + exp(–b1 – b2w))
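As a sketch, evaluating this expression directly reproduces the fitted proportions that glmval returns below:

x = 2100:100:4500;
pfit = 1./(1 + exp(-(b(1) + b(2)*x)));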
Use the glmval function to compute the fitted values:
x = 2100:100:4500;
y = glmval(b,x,'logit');
plot(w,poor./total,'x','LineWidth',2)
hold on
plot(x,y,'r-','LineWidth',2)
grid on
xlabel('Weight')
ylabel('Proportion of Poor-Mileage Cars')
The previous is an example of logistic regression. For an example of a kind
of stepwise logistic regression, analogous to stepwise regression for linear
models, see “Sequential Feature Selection” on page 10-23.
Multivariate Regression
Whether or not the predictor x is a vector of predictor variables, multivariate
regression refers to the case where the response y = (y1, ..., yM) is a vector of
M response variables.
The Statistics Toolbox functions mvregress and mvregresslike are used
for multivariate regression analysis.
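The following is a minimal sketch of mvregress on simulated data; the variable names and the model are illustrative assumptions, not from a shipped data set. Each observation gets its own d-by-K design matrix, built with kron so that the two response dimensions have separate intercepts and slopes:

n = 50;
x = rand(n,1);                                % a single predictor
Y = [1 + 2*x, -1 + 0.5*x] + 0.1*randn(n,2);   % two response variables
d = size(Y,2);
Xcell = cell(n,1);
for i = 1:n
    Xcell{i} = kron(eye(d),[1 x(i)]);         % d-by-(2d) design for observation i
end
[beta,Sigma] = mvregress(Xcell,Y)             % coefficients and error covariance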
Nonlinear Regression
In this section...
“Nonlinear Regression Models” on page 9-58
“Parametric Models” on page 9-59
“Mixed-Effects Models” on page 9-64
Nonlinear Regression Models
The models described in “Linear Regression Models” on page 9-3 are often
called empirical models, because they are based solely on observed data.
Model parameters typically have no relationship to any mechanism producing
the data. To increase the accuracy of a linear model within the range of
observations, the number of terms is simply increased.
Nonlinear models, on the other hand, typically involve parameters with
specific physical interpretations. While they require a priori assumptions
about the data-producing process, they are often more parsimonious than
linear models, and more accurate outside the range of observed data.
Parametric nonlinear models represent the relationship between a continuous
response variable and one or more predictor variables (either continuous or
categorical) in the form y = f(X, β) + ε, where
• y is an n-by-1 vector of observations of the response variable.
• X is an n-by-p design matrix determined by the predictors.
• β is a p-by-1 vector of unknown parameters to be estimated.
• f is any function of X and β.
• ε is an n-by-1 vector of independent, identically distributed random
disturbances.
Nonparametric models do not attempt to characterize the relationship
between predictors and response with model parameters. Descriptions are
often graphical, as in the case of “Classification Trees and Regression Trees”
on page 13-25.
Parametric Models
• “A Parametric Nonlinear Model” on page 9-59
• “Confidence Intervals for Parameter Estimates” on page 9-61
• “Confidence Intervals for Predicted Responses” on page 9-61
• “Interactive Nonlinear Parametric Regression” on page 9-62
A Parametric Nonlinear Model
The Hougen-Watson model (Bates and Watts, [2], pp. 271–272) for reaction
kinetics is an example of a parametric nonlinear model. The form of the
model is
$$\text{rate} = \frac{\beta_1 x_2 - x_3/\beta_5}{1 + \beta_2 x_1 + \beta_3 x_2 + \beta_4 x_3}$$
where rate is the reaction rate, x1, x2, and x3 are concentrations of hydrogen,
n-pentane, and isopentane, respectively, and β1, β2, ... , β5 are the unknown
parameters.
The file reaction.mat contains simulated reaction data:
load reaction
The variables are:
• rate — A 13-by-1 vector of observed reaction rates
• reactants — A 13-by-3 matrix of reactant concentrations
• beta — A 5-by-1 vector of initial parameter estimates
• model — The name of a function file for the model
• xn — The names of the reactants
• yn — The name of the response
The function for the model is hougen, which looks like this:
type hougen
function yhat = hougen(beta,x)
%HOUGEN Hougen-Watson model for reaction kinetics.
%   YHAT = HOUGEN(BETA,X) gives the predicted values of the
%   reaction rate, YHAT, as a function of the vector of
%   parameters, BETA, and the matrix of data, X.
%   BETA must have five elements and X must have three
%   columns.
%
%   The model form is:
%   y = (b1*x2 - x3/b5)./(1+b2*x1+b3*x2+b4*x3)

b1 = beta(1);
b2 = beta(2);
b3 = beta(3);
b4 = beta(4);
b5 = beta(5);

x1 = x(:,1);
x2 = x(:,2);
x3 = x(:,3);

yhat = (b1*x2 - x3/b5)./(1+b2*x1+b3*x2+b4*x3);
The function nlinfit is used to find least-squares parameter estimates
for nonlinear models. It uses the Gauss-Newton algorithm with
Levenberg-Marquardt modifications for global convergence.
nlinfit requires the predictor data, the responses, and an initial guess of the
unknown parameters. It also requires a function handle to a function that
takes the predictor data and parameter estimates and returns the responses
predicted by the model.
To fit the reaction data, call nlinfit using the following syntax:
load reaction
betahat = nlinfit(reactants,rate,@hougen,beta)
betahat =
1.2526
0.0628
0.0400
0.1124
1.1914
The output vector betahat contains the parameter estimates.
The function nlinfit has robust options, similar to those for robustfit, for
fitting nonlinear models to data with outliers.
Confidence Intervals for Parameter Estimates
To compute confidence intervals for the parameter estimates, use the function
nlparci, together with additional outputs from nlinfit:
[betahat,resid,J] = nlinfit(reactants,rate,@hougen,beta);
betaci = nlparci(betahat,resid,J)
betaci =
   -0.7467    3.2519
   -0.0377    0.1632
   -0.0312    0.1113
   -0.0609    0.2857
   -0.7381    3.1208
The columns of the output betaci contain the lower and upper bounds,
respectively, of the (default) 95% confidence intervals for each parameter.
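For intervals at a different confidence level, nlparci accepts an 'alpha' parameter; the following sketch requests 99% intervals (the 'jacobian' name-value form is assumed here):

betaci99 = nlparci(betahat,resid,'jacobian',J,'alpha',0.01);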
Confidence Intervals for Predicted Responses
The function nlpredci is used to compute confidence intervals for predicted
responses:
[yhat,delta] = nlpredci(@hougen,reactants,betahat,resid,J);
opd = [rate yhat delta]
opd =
    8.5500    8.4179    0.2805
    3.7900    3.9542    0.2474
    4.8200    4.9109    0.1766
    0.0200   -0.0110    0.1875
    2.7500    2.6358    0.1578
   14.3900   14.3402    0.4236
    2.5400    2.5662    0.2425
    4.3500    4.0385    0.1638
   13.0000   13.0292    0.3426
    8.5000    8.3904    0.3281
    0.0500   -0.0216    0.3699
   11.3200   11.4701    0.3237
    3.1300    3.4326    0.1749
The output opd contains the observed rates in the first column and the
predicted rates in the second column. The (default) 95% simultaneous
confidence intervals on the predictions are the values in the second column ±
the values in the third column. These are not intervals for new observations
at the predictors, even though most of the confidence intervals do contain the
original observations.
Interactive Nonlinear Parametric Regression
Calling nlintool opens a graphical user interface (GUI) for interactive
exploration of multidimensional nonlinear functions, and for fitting
parametric nonlinear models. The GUI calls nlinfit, and requires the same
inputs. The interface is analogous to polytool and rstool for polynomial
models.
Open nlintool with the reaction data and the hougen model by typing
load reaction
nlintool(reactants,rate,@hougen,beta,0.01,xn,yn)
You see three plots. The response variable for all plots is the reaction rate,
plotted in green. The red lines show confidence intervals on predicted
responses. The first plot shows hydrogen as the predictor, the second shows
n-pentane, and the third shows isopentane.
Each plot displays the fitted relationship of the reaction rate to one predictor
at a fixed value of the other two predictors. The fixed values are in the text
boxes below each predictor axis. Change the fixed values by typing in a new
value or by dragging the vertical lines in the plots to new positions. When
you change the value of a predictor, all plots update to display the model
at the new point in predictor space.
While this example uses only three predictors, nlintool can accommodate
any number of predictors.
Note The Statistics Toolbox demonstration function rsmdemo generates
simulated data for experimental settings specified by either the user or by
a D-optimal design generated by cordexch. It uses the rstool interface to
visualize response surface models fit to the data, and it uses the nlintool
interface to visualize a nonlinear model fit to the data.
Mixed-Effects Models
• “Introduction” on page 9-64
• “Mixed-Effects Model Hierarchy” on page 9-65
• “Specifying Mixed-Effects Models” on page 9-67
• “Specifying Covariate Models” on page 9-70
• “Choosing nlmefit or nlmefitsa” on page 9-71
• “Using Output Functions with Mixed-Effects Models” on page 9-74
• “Example: Mixed-Effects Models Using nlmefit and nlmefitsa” on page 9-79
• “Example: Examining Residuals for Model Verification” on page 9-93
Introduction
In statistics, an effect is anything that influences the value of a response
variable at a particular setting of the predictor variables. Effects are
translated into model parameters. In linear models, effects become
coefficients, representing the proportional contributions of model terms. In
nonlinear models, effects often have specific physical interpretations, and
appear in more general nonlinear combinations.
Fixed effects represent population parameters, assumed to be the same each
time data is collected. Estimating fixed effects is the traditional domain of
regression modeling. Random effects, by comparison, are sample-dependent
random variables. In modeling, random effects act like additional error terms,
and their distributions and covariances must be specified.
For example, consider a model of the elimination of a drug from the
bloodstream. The model uses time t as a predictor and the concentration
of the drug C as the response. The nonlinear model term C0e^(-rt) combines
parameters C0 and r, representing, respectively, an initial concentration
and an elimination rate. If data is collected across multiple individuals, it
is reasonable to assume that the elimination rate is a random variable ri
depending on individual i, varying around a population mean r̄. The term
C0e^(-rt) becomes

$$C_0 e^{-[\bar{r} + (r_i - \bar{r})]t} = C_0 e^{-(\beta + b_i)t},$$

where β = r̄ is a fixed effect and bi = ri − r̄ is a random effect.
Random effects are useful when data falls into natural groups. In the drug
elimination model, the groups are simply the individuals under study. More
sophisticated models might group data by an individual’s age, weight, diet,
etc. Although the groups are not the focus of the study, adding random effects
to a model extends the reliability of inferences beyond the specific sample of
individuals.
Mixed-effects models account for both fixed and random effects. As with
all regression models, their purpose is to describe a response variable as a
function of the predictor variables. Mixed-effects models, however, recognize
correlations within sample subgroups. In this way, they provide a compromise
between ignoring data groups entirely and fitting each group with a separate
model.
Mixed-Effects Model Hierarchy
Suppose data for a nonlinear regression model falls into one of m distinct
groups i = 1, ..., m. To account for the groups in a model, write response j
in group i as:
yij = f ( , xij ) +  ij
yij is the response, xij is a vector of predictors, φ is a vector of model
parameters, and εij is the measurement or process error. The index j ranges
from 1 to ni, where ni is the number of observations in group i. The function
f specifies the form of the model. Often, xij is simply an observation time tij.
The errors are usually assumed to be independent and identically, normally
distributed, with constant variance.
Estimates of the parameters in φ describe the population, assuming those
estimates are the same for all groups. If, however, the estimates vary by
group, the model becomes
yij = f ( i , xij ) +  ij
In a mixed-effects model, φi may be a combination of a fixed and a random
effect:

$$\varphi_i = \beta + b_i$$
The random effects bi are usually described as multivariate normally
distributed, with mean zero and covariance Ψ. Estimating the fixed effects
β and the covariance of the random effects Ψ provides a description of the
population that does not assume the parameters φi are the same across
groups. Estimating the random effects bi also gives a description of specific
groups within the data.
Model parameters do not have to be identified with individual effects. In
general, design matrices A and B are used to identify parameters with linear
combinations of fixed and random effects:

$$\varphi_i = A\beta + Bb_i$$
If the design matrices differ among groups, the model becomes

$$\varphi_i = A_i\beta + B_ib_i$$
If the design matrices also differ among observations, the model becomes

$$\varphi_{ij} = A_{ij}\beta + B_{ij}b_i$$
$$y_{ij} = f(\varphi_{ij}, x_{ij}) + \varepsilon_{ij}$$
Some of the group-specific predictors in xij may not change with observation j.
Calling those vi, the model becomes
yij = f ( ij , xij , vi ) +  ij
Specifying Mixed-Effects Models
Suppose data for a nonlinear regression model falls into one of m distinct
groups i = 1, ..., m. (Specifically, suppose that the groups are not nested.) To
specify a general nonlinear mixed-effects model for this data:
1 Define group-specific model parameters φi as linear combinations of fixed
effects β and random effects bi.
2 Define response values yi as a nonlinear function f of the parameters and
group-specific predictor variables Xi.
The model is:
$$\varphi_i = A_i\beta + B_ib_i$$
$$y_i = f(\varphi_i, X_i) + \varepsilon_i$$
$$b_i \sim N(0, \Psi)$$
$$\varepsilon_i \sim N(0, \sigma^2)$$
This formulation of the nonlinear mixed-effects model uses the following
notation:
φi      A vector of group-specific model parameters
β       A vector of fixed effects, modeling population parameters
bi      A vector of multivariate normally distributed group-specific
        random effects
Ai      A group-specific design matrix for combining fixed effects
Bi      A group-specific design matrix for combining random effects
Xi      A data matrix of group-specific predictor values
yi      A data vector of group-specific response values
f       A general, real-valued function of φi and Xi
εi      A vector of group-specific errors, assumed to be independent,
        identically, normally distributed, and independent of bi
Ψ       A covariance matrix for the random effects
σ^2     The error variance, assumed to be constant across observations
For example, consider a model of the elimination of a drug from the
bloodstream. The model incorporates two overlapping phases:
• An initial phase p during which drug concentrations reach equilibrium
with surrounding tissues
• A second phase q during which the drug is eliminated from the bloodstream
For data on multiple individuals i, the model is
$$y_{ij} = C_{pi} e^{-r_{pi} t_{ij}} + C_{qi} e^{-r_{qi} t_{ij}} + \varepsilon_{ij},$$
where yij is the observed concentration in individual i at time tij. The model
allows for different sampling times and different numbers of observations for
different individuals.
The elimination rates rpi and rqi must be positive to be physically meaningful.
Enforce this by introducing the log rates Rpi = log(rpi) and Rqi = log(rqi) and
reparametrizing the model:
$$y_{ij} = C_{pi} e^{-\exp(R_{pi}) t_{ij}} + C_{qi} e^{-\exp(R_{qi}) t_{ij}} + \varepsilon_{ij}$$
Choosing which parameters to model with random effects is an important
consideration when building a mixed-effects model. One technique is to add
random effects to all parameters, and use estimates of their variances to
determine their significance in the model. An alternative is to fit the model
separately to each group, without random effects, and look at the variation
of the parameter estimates. If an estimate varies widely across groups, or if
confidence intervals for each group have minimal overlap, the parameter is a
good candidate for a random effect.
To introduce fixed effects β and random effects bi for all model parameters,
reexpress the model as follows:
$$\begin{aligned}
y_{ij} &= [\bar{C}_p + (C_{pi} - \bar{C}_p)]\, e^{-\exp[\bar{R}_p + (R_{pi} - \bar{R}_p)]t_{ij}} + [\bar{C}_q + (C_{qi} - \bar{C}_q)]\, e^{-\exp[\bar{R}_q + (R_{qi} - \bar{R}_q)]t_{ij}} + \varepsilon_{ij} \\
       &= (\beta_1 + b_{1i})\, e^{-\exp(\beta_2 + b_{2i})t_{ij}} + (\beta_3 + b_{3i})\, e^{-\exp(\beta_4 + b_{4i})t_{ij}} + \varepsilon_{ij}
\end{aligned}$$
In the notation of the general model:
$$\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_4 \end{pmatrix}, \quad
b_i = \begin{pmatrix} b_{i1} \\ \vdots \\ b_{i4} \end{pmatrix}, \quad
y_i = \begin{pmatrix} y_{i1} \\ \vdots \\ y_{in_i} \end{pmatrix}, \quad
X_i = \begin{pmatrix} t_{i1} \\ \vdots \\ t_{in_i} \end{pmatrix},$$
where ni is the number of observations of individual i. In this case, the design
matrices Ai and Bi are, at least initially, 4-by-4 identity matrices. Design
matrices may be altered, as necessary, to introduce weighting of individual
effects, or time dependency.
Fitting the model and estimating the covariance matrix Ψ often leads to
further refinements. A relatively small estimate for the variance of a random
effect suggests that it can be removed from the model. Likewise, relatively
small estimates for covariances among certain random effects suggests that a
full covariance matrix is unnecessary. Since random effects are unobserved,
Ψ must be estimated indirectly. Specifying a diagonal or block-diagonal
covariance pattern for Ψ can improve convergence and efficiency of the fitting
algorithm.
Statistics Toolbox functions nlmefit and nlmefitsa fit the general nonlinear
mixed-effects model to data, estimating the fixed and random effects. The
functions also estimate the covariance matrix Ψ for the random effects.
Additional diagnostic outputs allow you to assess tradeoffs between the
number of model parameters and the goodness of fit.
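As a sketch of requesting such a pattern, the following simulates a small two-parameter exponential-decay data set and fits it with a diagonal covariance for the random effects through the 'CovPattern' option; the data, model, and starting values are illustrative assumptions.

ngroups = 6; nobs = 10;
t = repmat(linspace(0.1,5,nobs)',ngroups,1);   % observation times
group = kron((1:ngroups)',ones(nobs,1));       % group labels
phi = [10 0.8] + 0.5*randn(ngroups,2);         % per-group "true" parameters
y = phi(group,1).*exp(-phi(group,2).*t) + 0.2*randn(size(t));
modelfun = @(b,x) b(1)*exp(-b(2)*x);           % model for one parameter vector
[beta,PSI] = nlmefit(t,y,group,[],modelfun,[8 1],...
    'CovPattern',eye(2))                       % diagonal Psi: independent random effects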
Specifying Covariate Models
If the model in “Specifying Mixed-Effects Models” on page 9-67 assumes a
group-dependent covariate such as weight (w), the model becomes:

$$\begin{pmatrix} \varphi_1 \\ \varphi_2 \\ \varphi_3 \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 0 & w_i \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{pmatrix} +
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$
Thus, the parameter φi for any individual in the ith group is:
⎛ 1
⎜ i
⎜ 2
⎜ i
⎜ 3
⎝ i
⎞ ⎛  +  * w ⎞ ⎛ b1
i
i
⎟ ⎜ 1 4
⎟ ⎜⎜
⎟=⎜
2
⎟ + ⎜ b2i
⎟ ⎜
⎟ ⎜

⎟ ⎝
3
⎠ ⎝ b3i
⎠
⎞
⎟
⎟
⎟
⎟
⎠
To specify a covariate model, use the 'FEGroupDesign' option.
'FEGroupDesign' is a p-by-q-by-m array specifying a different p-by-q
fixed-effects design matrix for each of the m groups. Using the previous
example, the array resembles the following:
1 Create the array.
% Number of parameters in the model (Phi)
num_params = 3;
% Number of covariates
num_cov = 1;
% Assuming number of groups in the data set is 7
num_groups = 7;
% Array of covariate values
covariates = [75; 52; 66; 55; 70; 58; 62 ];
A = repmat(eye(num_params, num_params+num_cov),...
[1,1,num_groups]);
A(1,num_params+1,1:num_groups) = covariates(:,1)
2 Create a struct with the specified design matrix.
options.FEGroupDesign = A;
3 Specify the arguments for nlmefit (or nlmefitsa) as shown in “Example:
Mixed-Effects Models Using nlmefit and nlmefitsa” on page 9-79.
Choosing nlmefit or nlmefitsa
Statistics Toolbox provides two functions, nlmefit and nlmefitsa, for fitting
nonlinear mixed-effects models. Each function provides different capabilities,
which may help you decide which to use.
• “Approximation Methods” on page 9-71
• “Parameters Specific to nlmefitsa” on page 9-72
• “Model and Data Requirements” on page 9-73
Approximation Methods. nlmefit provides the following four
approximation methods for fitting non-linear mixed-effects models:
• 'LME' — Use the likelihood for the linear mixed-effects model at the
current conditional estimates of beta and B. This is the default.
• 'RELME' — Use the restricted likelihood for the linear mixed-effects model
at the current conditional estimates of beta and B.
• 'FO' — First-order Laplacian approximation without random effects.
• 'FOCE' — First-order Laplacian approximation at the conditional estimates
of B.
nlmefitsa provides an additional approximation method, Stochastic
Approximation Expectation-Maximization (SAEM) [24], with three steps:
1 Simulation: Generate simulated values of the random effects b from the
posterior density p(b|Σ) given the current parameter estimates.
2 Stochastic approximation: Update the expected value of the log likelihood
function by taking its value from the previous step, and moving part
way toward the average value of the log likelihood calculated from the
simulated random effects.
3 Maximization step: Choose new parameter estimates to maximize the log
likelihood function given the simulated values of the random effects.
Both nlmefit and nlmefitsa attempt to find parameter estimates to
maximize a likelihood function, which is difficult to compute. nlmefit deals
with the problem by approximating the likelihood function in various ways,
and maximizing the approximate function. It uses traditional optimization
techniques that depend on things like convergence criteria and iteration
limits.
nlmefitsa, on the other hand, simulates random values of the parameters in
such a way that in the long run they converge to the values that maximize
the exact likelihood function. The results are random, and traditional
convergence tests don’t apply. Therefore nlmefitsa provides options to plot
the results as the simulation progresses, and to re-start the simulation
multiple times. You can use these features to judge whether the results have
converged to the accuracy you desire.
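For example, the following sketch (again with placeholder data and model arguments) runs several independent replicates of the simulation; the default output function plots the parameter estimates as the algorithm progresses, so you can judge convergence visually:

% Sketch: run three independent SAEM replicates and inspect the
% progress plots produced by the default output function
[beta,PSI,stats] = nlmefitsa(X,y,group,[],modelfun,beta0, ...
                             'Replicates',3);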
Parameters Specific to nlmefitsa. The following parameters are specific to
nlmefitsa. Most control the stochastic algorithm.
• Cov0 — Initial value for the covariance matrix PSI. Must be an r-by-r
positive definite matrix. If empty, the default value depends on the values
of BETA0.
• ComputeStdErrors — true to compute standard errors for the coefficient
estimates and store them in the output STATS structure, or false (default)
to omit this computation.
• LogLikMethod — Specifies the method for approximating the log likelihood.
• NBurnIn — Number of initial burn-in iterations during which the
parameter estimates are not recomputed. Default is 5.
• NIterations — Controls how many iterations are performed for each of
three phases of the algorithm.
• NMCMCIterations — Number of Markov Chain Monte Carlo (MCMC)
iterations.
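These parameters are passed as name-value pairs in the nlmefitsa call. The following sketch (placeholder data and model arguments) illustrates a few of them:

% Sketch: supply an initial covariance matrix, lengthen the burn-in,
% and request standard errors for the coefficient estimates.
% Here all parameters are assumed to have random effects, so Cov0 is
% numel(beta0)-by-numel(beta0).
[beta,PSI,stats] = nlmefitsa(X,y,group,[],modelfun,beta0, ...
                             'Cov0',0.1*eye(numel(beta0)), ...
                             'NBurnIn',10, ...
                             'ComputeStdErrors',true);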
Model and Data Requirements. There are some differences in the
capabilities of nlmefit and nlmefitsa. Therefore some data and models
are usable with either function, but some may require you to choose just
one of them.
• Error models — nlmefitsa supports a variety of error models. For
example, the standard deviation of the response can be constant,
proportional to the function value, or a combination of the two. nlmefit fits
models under the assumption that the standard deviation of the response
is constant. One of the error models, 'exponential', specifies that the log
of the response has a constant standard deviation. You can fit such models
using nlmefit by providing the log response as input, and by re-writing the
model function to produce the log of the nonlinear function value.
• Random effects — Both functions fit data to a nonlinear function with
parameters, and the parameters may be simple scalar values or linear
functions of covariates. nlmefit allows any coefficients of the linear
functions to have both fixed and random effects. nlmefitsa supports
random effects only for the constant (intercept) coefficient of the linear
functions, but not for slope coefficients. So in the example in “Specifying
Covariate Models” on page 9-70, nlmefitsa can treat only the first three
beta values as random effects.
• Model form — nlmefit supports a very general model specification, with
few restrictions on the design matrices that relate the fixed coefficients and
the random effects to the model parameters. nlmefitsa is more restrictive:
- The fixed effect design must be constant in every group (for every individual), so an observation-dependent design is not supported.
- The random effect design must be constant for the entire data set, so neither an observation-dependent design nor a group-dependent design is supported.
- As mentioned under Random effects, the random effect design must not specify random effects for slope coefficients. This implies that the design must consist of zeros and ones.
- The random effect design must not use the same random effect for multiple coefficients, and cannot use more than one random effect for any single coefficient.
- The fixed effect design must not use the same coefficient for multiple parameters. This implies that it can have at most one non-zero value in each column.
If you want to use nlmefitsa for data in which the covariate effects are
random, include the covariates directly in the nonlinear model expression.
Don’t include the covariates in the fixed or random effect design matrices.
• Convergence — As described under Model form above, nlmefit and nlmefitsa
have different approaches to measuring convergence. nlmefit uses
traditional optimization measures, and nlmefitsa provides diagnostics to
help you judge the convergence of a random simulation.
In practice, nlmefitsa tends to be more robust, and less likely to fail on
difficult problems. However, nlmefit may converge faster on problems where
it converges at all. Some problems may benefit from a combined strategy,
for example by running nlmefitsa for a while to get reasonable parameter
estimates, and using those as a starting point for additional iterations using
nlmefit.
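A sketch of this combined strategy, again with placeholder data and model arguments, might look like:

% Sketch: get rough estimates with nlmefitsa, then refine them with
% nlmefit, using the stochastic estimates as the starting point
beta1 = nlmefitsa(X,y,group,[],modelfun,beta0);
[beta2,PSI2,stats2] = nlmefit(X,y,group,[],modelfun,beta1);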
Using Output Functions with Mixed-Effects Models
The OutputFcn field of the options structure specifies one or more functions
that the solver calls after each iteration. Typically, you might use an output
function to plot points at each iteration or to display optimization quantities
from the algorithm. To set up an output function:
1 Write the output function as a MATLAB file function or subfunction.
2 Use statset to set the value of OutputFcn to be a function handle, that is,
the name of the function preceded by the @ sign. For example, if the output
function is outfun.m, the command
options = statset('OutputFcn', @outfun);
specifies OutputFcn to be the handle to outfun. To specify multiple output
functions, use the syntax:
options = statset('OutputFcn',{@outfun, @outfun2});
3 Call the optimization function with options as an input argument.
For an example of an output function, see “Sample Output Function” on page
9-79.
Structure of the Output Function. The function definition line of the
output function has the following form:
stop = outfun(beta,status,state)
where
• beta is the current fixed effects.
• status is a structure containing data from the current iteration. “Fields in
status” on page 9-75 describes the structure in detail.
• state is the current state of the algorithm. “States of the Algorithm” on
page 9-76 lists the possible values.
• stop is a flag that is true or false depending on whether the optimization
routine should quit or continue. See “Stop Flag” on page 9-77 for more
information.
The solver passes the values of the input arguments to outfun at each
iteration.
Fields in status. The following list describes the fields of the status structure:

procedure — One of:
• 'ALT' — alternating algorithm for the optimization of the linear mixed-effects or restricted linear mixed-effects approximations
• 'LAP' — optimization of the Laplacian approximation for first-order or first-order conditional estimation

iteration — An integer starting from 0.

inner — A structure describing the status of the inner iterations within the ALT and LAP procedures, with the fields:
• procedure — When procedure is 'ALT': 'PNLS' (penalized nonlinear least squares), 'LME' (linear mixed-effects estimation), or 'none'. When procedure is 'LAP': 'PLM' (profiled likelihood maximization) or 'none'.
• state — one of 'init', 'iter', 'done', or 'none'.
• iteration — an integer starting from 0, or NaN. For nlmefitsa with burn-in iterations, the output function is called after each of those iterations with a negative value for STATUS.iteration.

fval — The current log likelihood.
Psi — The current random-effects covariance matrix.
theta — The current parameterization of Psi.
mse — The current error variance.
States of the Algorithm. The following list gives the possible values for state:

'init' — The algorithm is in the initial state before the first iteration.
'iter' — The algorithm is at the end of an iteration.
'done' — The algorithm is in the final state after the last iteration.
The following code illustrates how the output function might use the value of
state to decide which tasks to perform at the current iteration:
switch state
case 'iter'
% Make updates to plot or guis as needed
case 'init'
% Setup for plots or guis
case 'done'
% Cleanup of plots, guis, or final plot
otherwise
end
Stop Flag. The output argument stop is a flag that is true or false.
The flag tells the solver whether it should quit or continue. The following
examples show typical ways to use the stop flag.
Stopping an Optimization Based on Intermediate Results
The output function can stop the estimation at any iteration based on the
values of arguments passed into it. For example, the following code sets stop
to true based on the value of the log likelihood stored in the 'fval' field of
the status structure:
function stop = outfun(beta,status,state)
stop = false;
% Stop the estimation if the log likelihood exceeds -132
if status.fval > -132
    stop = true;
end
Stopping an Iteration Based on GUI Input
If you design a GUI to perform nlmefit iterations, you can make the output
function stop when a user clicks a Stop button on the GUI. For example, the
following code implements a dialog to cancel calculations:
function retval = stop_outfcn(beta,str,status)
persistent h stop;
if isequal(str.inner.state,'none')
    switch(status)
        case 'init'
            % Initialize dialog
            stop = false;
            h = msgbox('Press STOP to cancel calculations.',...
                'NLMEFIT: Iteration 0 ');
            button = findobj(h,'type','uicontrol');
            set(button,'String','STOP','Callback',@stopper)
            pos = get(h,'Position');
            pos(3) = 1.1 * pos(3);
            set(h,'Position',pos)
            drawnow
        case 'iter'
            % Display iteration number in the dialog title
            set(h,'Name',sprintf('NLMEFIT: Iteration %d',...
                str.iteration))
            drawnow;
        case 'done'
            % Delete dialog
            delete(h);
    end
end
if stop
    % Stop if the dialog button has been pressed
    delete(h)
end
retval = stop;
    function stopper(varargin)
        % Set flag to stop when button is pressed
        stop = true;
        disp('Calculation stopped.')
    end
end
Sample Output Function. nlmefitoutputfcn is the sample Statistics
Toolbox output function for nlmefit and nlmefitsa. It initializes or updates
a plot with the fixed-effects (BETA) and variance of the random effects
(diag(STATUS.Psi)). For nlmefit, the plot also includes the log-likelihood
(STATUS.fval).
nlmefitoutputfcn is the default output function for nlmefitsa. To use it
with nlmefit, specify a function handle for it in the options structure:
opt = statset('OutputFcn', @nlmefitoutputfcn, …)
beta = nlmefit(…, 'Options', opt, …)

To prevent nlmefitsa from using this function, specify an empty value for
the output function:

opt = statset('OutputFcn', [], …)
beta = nlmefitsa(…, 'Options', opt, …)
nlmefitoutputfcn stops nlmefit or nlmefitsa if you close the figure that
it produces.
Example: Mixed-Effects Models Using nlmefit and nlmefitsa
The following example also works with nlmefitsa in place of nlmefit.
The data in indomethacin.mat records concentrations of the drug
indomethacin in the bloodstream of six subjects over eight hours:
load indomethacin
gscatter(time,concentration,subject)
xlabel('Time (hours)')
ylabel('Concentration (mcg/ml)')
title('{\bf Indomethacin Elimination}')
hold on
“Specifying Mixed-Effects Models” on page 9-67 discusses a useful model for
this type of data. Construct the model via an anonymous function as follows:
model = @(phi,t)(phi(1)*exp(-exp(phi(2))*t) + ...
phi(3)*exp(-exp(phi(4))*t));
Use the nlinfit function to fit the model to all of the data, ignoring
subject-specific effects:
phi0 = [1 1 1 1];
[phi,res] = nlinfit(time,concentration,model,phi0);
numObs = length(time);
numParams = 4;
df = numObs-numParams;
mse = (res'*res)/df
mse =
0.0304
tplot = 0:0.01:8;
plot(tplot,model(phi,tplot),'k','LineWidth',2)
hold off
A boxplot of residuals by subject shows that the boxes are mostly above or
below zero, indicating that the model has failed to account for subject-specific
effects:
colors = 'rygcbm';
h = boxplot(res,subject,'colors',colors,'symbol','o');
set(h(~isnan(h)),'LineWidth',2)
hold on
boxplot(res,subject,'colors','k','symbol','ko')
grid on
xlabel('Subject')
ylabel('Residual')
hold off
To account for subject-specific effects, fit the model separately to the data
for each subject:
phi0 = [1 1 1 1];
PHI = zeros(4,6);
RES = zeros(11,6);
for I = 1:6
tI = time(subject == I);
cI = concentration(subject == I);
[PHI(:,I),RES(:,I)] = nlinfit(tI,cI,model,phi0);
end
PHI

PHI =
    0.1915    0.4989    1.6757    0.2545    3.5661    0.9685
   -1.7878   -1.6354   -0.4122   -1.6026    1.0408   -0.8731
    2.0293    2.8277    5.4683    2.1981    0.2915    3.0023
    0.5794    0.8013    1.7498    0.2423   -1.5068    1.0882
numParams = 24;
df = numObs-numParams;
mse = (RES(:)'*RES(:))/df
mse =
0.0057
gscatter(time,concentration,subject)
xlabel('Time (hours)')
ylabel('Concentration (mcg/ml)')
title('{\bf Indomethacin Elimination}')
hold on
for I = 1:6
plot(tplot,model(PHI(:,I),tplot),'Color',colors(I))
end
axis([0 8 0 3.5])
hold off
PHI gives estimates of the four model parameters for each of the six subjects.
The estimates vary considerably, but taken as a 24-parameter model of the
data, the mean-squared error of 0.0057 is a significant reduction from 0.0304
in the original four-parameter model.
A boxplot of residuals by subject shows that the larger model accounts for
most of the subject-specific effects:
h = boxplot(RES,'colors',colors,'symbol','o');
set(h(~isnan(h)),'LineWidth',2)
hold on
boxplot(RES,'colors','k','symbol','ko')
grid on
xlabel('Subject')
ylabel('Residual')
hold off
The spread of the residuals (the vertical scale of the boxplot) is much smaller
than in the previous boxplot, and the boxes are now mostly centered on zero.
While the 24-parameter model successfully accounts for variations due
to the specific subjects in the study, it does not consider the subjects as
representatives of a larger population. The sampling distribution from which
the subjects are drawn is likely more interesting than the sample itself. The
purpose of mixed-effects models is to account for subject-specific variations
more broadly, as random effects varying around population means.
Use the nlmefit function to fit a mixed-effects model to the data.
The following anonymous function, nlme_model, adapts the four-parameter
model used by nlinfit to the calling syntax of nlmefit by allowing separate
parameters for each individual. By default, nlmefit assigns random effects
to all the model parameters. Also by default, nlmefit assumes a diagonal
covariance matrix (no covariance among the random effects) to avoid
overparametrization and related convergence issues.
nlme_model = @(PHI,t)(PHI(:,1).*exp(-exp(PHI(:,2)).*t) + ...
PHI(:,3).*exp(-exp(PHI(:,4)).*t));
phi0 = [1 1 1 1];
[phi,PSI,stats] = nlmefit(time,concentration,subject, ...
[],nlme_model,phi0)
phi =
0.4606
-1.3459
2.8277
0.7729
PSI =
    0.0124         0         0         0
         0    0.0000         0         0
         0         0    0.3264         0
         0         0         0    0.0250
stats =
logl: 54.5884
mse: 0.0066
aic: -91.1767
bic: -71.4698
sebeta: NaN
dfe: 57
The mean-squared error of 0.0066 is comparable to the 0.0057 of the
24-parameter model without random effects, and significantly better than the
0.0304 of the four-parameter model without random effects.
The estimated covariance matrix PSI shows that the variance of the second
random effect is essentially zero, suggesting that you can remove it to simplify
the model. To do this, use the REParamsSelect parameter to specify the
indices of the parameters to be modeled with random effects in nlmefit:
[phi,PSI,stats] = nlmefit(time,concentration,subject, ...
[],nlme_model,phi0, ...
'REParamsSelect',[1 3 4])
phi =
0.4606
-1.3460
2.8277
0.7729
PSI =
    0.0124         0         0
         0    0.3270         0
         0         0    0.0250

stats =
      logl: 54.5876
       mse: 0.0066
       aic: -93.1752
       bic: -75.6580
    sebeta: NaN
       dfe: 58
The log-likelihood logl is almost identical to what it was with random effects
for all of the parameters, the Akaike information criterion aic is reduced
from -91.1767 to -93.1752, and the Bayesian information criterion bic is
reduced from -71.4698 to -75.6580. These measures support the decision to
drop the second random effect.
Refitting the simplified model with a full covariance matrix allows for
identification of correlations among the random effects. To do this, use the
CovPattern parameter to specify the pattern of nonzero elements in the
covariance matrix:
[phi,PSI,stats] = nlmefit(time,concentration,subject, ...
[],nlme_model,phi0, ...
'REParamsSelect',[1 3 4], ...
'CovPattern',ones(3))
phi =
0.5613
-1.1407
2.8148
0.8293
PSI =
    0.0236    0.0500    0.0032
    0.0500    0.4768    0.1152
    0.0032    0.1152    0.0321

stats =
      logl: 58.4731
       mse: 0.0061
       aic: -94.9462
       bic: -70.8600
    sebeta: NaN
       dfe: 55
The estimated covariance matrix PSI shows that the random effects on the
last two parameters have a relatively strong correlation, and both have a
relatively weak correlation with the first random effect. This structure in
the covariance matrix is more apparent if you convert PSI to a correlation
matrix using corrcov:
RHO = corrcov(PSI)
RHO =
    1.0000    0.4707    0.1179
    0.4707    1.0000    0.9316
    0.1179    0.9316    1.0000
clf; imagesc(RHO)
set(gca,'XTick',[1 2 3],'YTick',[1 2 3])
title('{\bf Random Effect Correlation}')
h = colorbar;
set(get(h,'YLabel'),'String','Correlation');
Incorporate this structure into the model by changing the specification of the
covariance pattern to block-diagonal:
P = [1 0 0;0 1 1;0 1 1] % Covariance pattern
P =
     1     0     0
     0     1     1
     0     1     1
[phi,PSI,stats,b] = nlmefit(time,concentration,subject, ...
[],nlme_model,phi0, ...
'REParamsSelect',[1 3 4], ...
'CovPattern',P)
phi =
    0.5850
   -1.1087
    2.8056
    0.8476

PSI =
    0.0331         0         0
         0    0.4793    0.1069
         0    0.1069    0.0294

stats =
      logl: 57.4996
       mse: 0.0061
       aic: -96.9992
       bic: -77.2923
    sebeta: NaN
       dfe: 57

b =
   -0.2438    0.0723    0.2014    0.0592   -0.2181    0.1289
   -0.8500   -0.1237    0.9538   -0.7267    0.5895    0.1571
   -0.1591    0.0033    0.1568   -0.2144    0.1834    0.0300
The block-diagonal covariance structure reduces aic from -94.9462 to
-96.9992 and bic from -70.8600 to -77.2923 without significantly affecting
the log-likelihood. These measures support the covariance structure used in
the final model.
The output b gives predictions of the three random effects for each of the six
subjects. These are combined with the estimates of the fixed effects in phi
to produce the mixed-effects model.
Use the following commands to plot the mixed-effects model for each of the six
subjects. For comparison, the model without random effects is also shown.
PHI = repmat(phi,1,6) + ...                  % Fixed effects
      [b(1,:);zeros(1,6);b(2,:);b(3,:)];     % Random effects
RES = zeros(11,6); % Residuals
RES = zeros(11,6); % Residuals
colors = 'rygcbm';
for I = 1:6
fitted_model = @(t)(PHI(1,I)*exp(-exp(PHI(2,I))*t) + ...
PHI(3,I)*exp(-exp(PHI(4,I))*t));
tI = time(subject == I);
cI = concentration(subject == I);
RES(:,I) = cI - fitted_model(tI);
subplot(2,3,I)
scatter(tI,cI,20,colors(I),'filled')
hold on
plot(tplot,fitted_model(tplot),'Color',colors(I))
plot(tplot,model(phi,tplot),'k')
axis([0 8 0 3.5])
xlabel('Time (hours)')
ylabel('Concentration (mcg/ml)')
legend(num2str(I),'Subject','Fixed')
end
If obvious outliers in the data (visible in previous box plots) are ignored, a
normal probability plot of the residuals shows reasonable agreement with
model assumptions on the errors:
clf; normplot(RES(:))
Example: Examining Residuals for Model Verification
You can examine the stats structure, which is returned by both nlmefit and
nlmefitsa, to determine the quality of your model. The stats structure
contains fields with conditional weighted residuals (cwres field) and
individual weighted residuals (iwres field). Since the model assumes that
residuals are normally distributed, you can examine the residuals to see how
well this assumption holds.
This example generates synthetic data using normal distributions. It shows
how the fit statistics look:
• Good when testing against the same type of model as generates the data
• Poor when tested against incorrect data models
1 Initialize a 2-D model with 100 individuals:
nGroups = 100; % 100 Individuals
nlmefun = @(PHI,t)(PHI(:,1)*5 + PHI(:,2)^2.*t); % Regression fcn
REParamSelect = [1 2]; % Both Parameters have random effect
errorParam = .03;
beta0 = [ 1.5 5]; % Parameter means
psi = [ 0.35 0; ...      % Covariance Matrix
        0    0.51 ];
time =[0.25;0.5;0.75;1;1.25;2;3;4;5;6];
nParameters = 2;
rng(0,'twister') % for reproducibility
2 Generate the data for fitting with a proportional error model:
b_i = mvnrnd(zeros(1, numel(REParamSelect)), psi, nGroups);
individualParameters = zeros(nGroups,nParameters);
individualParameters(:, REParamSelect) = ...
bsxfun(@plus,beta0(REParamSelect), b_i);
groups = repmat(1:nGroups,numel(time),1);
groups = vertcat(groups(:));
y = zeros(numel(time)*nGroups,1);
x = zeros(numel(time)*nGroups,1);
for i = 1:nGroups
idx = groups == i;
f = nlmefun(individualParameters(i,:), time);
% Make a proportional error model for y:
y(idx) = f + errorParam*f.*randn(numel(f),1);
x(idx) = time;
end
P = [ 1 0 ; 0 1 ];
3 Fit the data using the same regression function and error model as the
model generator:
[~,~,stats] = nlmefit(x,y,groups, ...
[],nlmefun,[1 1],'REParamsSelect',REParamSelect,...
'ErrorModel','Proportional','CovPattern',P);
4 Create a plotting routine by copying the following function definition, and
creating a file plotResiduals.m on your MATLAB path:
function plotResiduals(stats)
pwres = stats.pwres;
iwres = stats.iwres;
cwres = stats.cwres;
figure
subplot(2,3,1);
normplot(pwres); title('PWRES')
subplot(2,3,4);
createhistplot(pwres);
subplot(2,3,2);
normplot(cwres); title('CWRES')
subplot(2,3,5);
createhistplot(cwres);
subplot(2,3,3);
normplot(iwres); title('IWRES')
subplot(2,3,6);
createhistplot(iwres); title('IWRES')
function createhistplot(pwres)
[x, n] = hist(pwres);
d = n(2)- n(1);
x = x/sum(x*d);
bar(n,x);
ylim([0 max(x)*1.05]);
hold on;
x2 = -4:0.1:4;
f2 = normpdf(x2,0,1);
plot(x2,f2,'r');
end
end
5 Plot the residuals using the plotResiduals function:
plotResiduals(stats);
The upper probability plots look straight, meaning the residuals are
normally distributed. The bottom histogram plots match the superimposed
normal density plot. So you can conclude that the error model matches
the data.
6 For comparison, fit the data using a constant error model, instead of the
proportional model that created the data:
[~,~,stats] = nlmefit(x,y,groups, ...
[],nlmefun,[0 0],'REParamsSelect',REParamSelect,...
'ErrorModel','Constant','CovPattern',P);
plotResiduals(stats);
The upper probability plots are not straight, indicating the residuals are
not normally distributed. The bottom histogram plots are fairly close to the
superimposed normal density plots.
7 For another comparison, fit the data to a different structural model than
created the data:
nlmefun2 = @(PHI,t)(PHI(:,1)*5 + PHI(:,2).*t.^4);
[~,~,stats] = nlmefit(x,y,groups, ...
[],nlmefun2,[0 0],'REParamsSelect',REParamSelect,...
'ErrorModel','constant', 'CovPattern',P);
plotResiduals(stats);
Not only are the upper probability plots not straight, but the histogram
plot is quite skewed compared to the superimposed normal density. These
residuals are not normally distributed, and do not match the model.
10
Multivariate Methods
• “Introduction” on page 10-2
• “Multidimensional Scaling” on page 10-3
• “Procrustes Analysis” on page 10-14
• “Feature Selection” on page 10-23
• “Feature Transformation” on page 10-28
Introduction
Large, high-dimensional data sets are common in the modern era
of computer-based instrumentation and electronic data storage.
High-dimensional data present many challenges for statistical visualization,
analysis, and modeling.
Data visualization, of course, is impossible beyond a few dimensions. As a
result, pattern recognition, data preprocessing, and model selection must
rely heavily on numerical methods.
A fundamental challenge in high-dimensional data analysis is the so-called
curse of dimensionality. Observations in a high-dimensional space are
necessarily sparser and less representative than those in a low-dimensional
space. In higher dimensions, data over-represent the edges of a sampling
distribution, because regions of higher-dimensional space contain the majority
of their volume near the surface. (A d-dimensional spherical shell has a
volume, relative to the total volume of the sphere, that approaches 1 as d
approaches infinity.) In high dimensions, typical data points at the interior of
a distribution are sampled less frequently.
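A quick numerical illustration of this volume effect: the fraction of a unit d-dimensional ball that lies in the thin outer shell between radius 0.9 and 1 is 1 - 0.9^d, which rapidly approaches 1 as d grows:

% Fraction of a unit d-dimensional ball lying outside radius 0.9
d = [2 10 50 100];
shellFraction = 1 - 0.9.^d
% Roughly 0.19, 0.65, 0.99, and essentially 1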
Often, many of the dimensions in a data set—the measured features—are
not useful in producing a model. Features may be irrelevant or redundant.
Regression and classification algorithms may require large amounts of
storage and computation time to process raw data, and even if the algorithms
are successful the resulting models may contain an incomprehensible number
of terms.
Because of these challenges, multivariate statistical methods often begin with
some type of dimension reduction, in which data are approximated by points
in a lower-dimensional space. Dimension reduction is the goal of the methods
presented in this chapter. Dimension reduction often leads to simpler models
and fewer measured variables, with consequent benefits when measurements
are expensive and visualization is important.
Multidimensional Scaling
In this section...
“Introduction” on page 10-3
“Classical Multidimensional Scaling” on page 10-3
“Nonclassical Multidimensional Scaling” on page 10-8
“Nonmetric Multidimensional Scaling” on page 10-10
Introduction
One of the most important goals in visualizing data is to get a sense of how
near or far points are from each other. Often, you can do this with a scatter
plot. However, for some analyses, the data that you have might not be in
the form of points at all, but rather in the form of pairwise similarities or
dissimilarities between cases, observations, or subjects. There are no points
to plot.
Even if your data are in the form of points rather than pairwise distances,
a scatter plot of those data might not be useful. For some kinds of data,
the relevant way to measure how near two points are might not be their
Euclidean distance. While scatter plots of the raw data make it easy to
compare Euclidean distances, they are not always useful when comparing
other kinds of inter-point distances, city block distance for example, or even
more general dissimilarities. Also, with a large number of variables, it is very
difficult to visualize distances unless the data can be represented in a small
number of dimensions. Some sort of dimension reduction is usually necessary.
Multidimensional scaling (MDS) is a set of methods that address all these
problems. MDS allows you to visualize how near points are to each other
for many kinds of distance or dissimilarity metrics and can produce a
representation of your data in a small number of dimensions. MDS does not
require raw data, but only a matrix of pairwise distances or dissimilarities.
Classical Multidimensional Scaling
• “Introduction” on page 10-4
• “Example: Multidimensional Scaling” on page 10-6
Introduction
The function cmdscale performs classical (metric) multidimensional scaling,
also known as principal coordinates analysis. cmdscale takes as an input a
matrix of inter-point distances and creates a configuration of points. Ideally,
those points are in two or three dimensions, and the Euclidean distances
between them reproduce the original distance matrix. Thus, a scatter plot
of the points created by cmdscale provides a visual representation of the
original distances.
As a very simple example, you can reconstruct a set of points from only their
inter-point distances. First, create some four dimensional points with a small
component in their fourth coordinate, and reduce them to distances.
X = [ normrnd(0,1,10,3), normrnd(0,.1,10,1) ];
D = pdist(X,'euclidean');
Next, use cmdscale to find a configuration with those inter-point distances.
cmdscale accepts distances as either a square matrix, or, as in this example,
in the vector upper-triangular form produced by pdist.
[Y,eigvals] = cmdscale(D);
cmdscale produces two outputs. The first output, Y, is a matrix containing the
reconstructed points. The second output, eigvals, is a vector containing the
sorted eigenvalues of what is often referred to as the “scalar product matrix,”
which, in the simplest case, is equal to Y*Y'. The relative magnitudes of those
eigenvalues indicate the relative contribution of the corresponding columns of
Y in reproducing the original distance matrix D with the reconstructed points.
format short g
[eigvals eigvals/max(abs(eigvals))]
ans =
       12.623            1
       4.3699      0.34618
       1.9307      0.15295
     0.025884    0.0020505
  1.7192e-015  1.3619e-016
  6.8727e-016  5.4445e-017
  4.4367e-017  3.5147e-018
 -9.2731e-016 -7.3461e-017
  -1.327e-015 -1.0513e-016
 -1.9232e-015 -1.5236e-016
If eigvals contains only positive and zero (within round-off error) eigenvalues,
the columns of Y corresponding to the positive eigenvalues provide an exact
reconstruction of D, in the sense that their inter-point Euclidean distances,
computed using pdist, for example, are identical (within round-off) to the
values in D.
maxerr4 = max(abs(D - pdist(Y))) % exact reconstruction
maxerr4 =
2.6645e-015
If two or three of the eigenvalues in eigvals are much larger than the rest,
then the distance matrix based on the corresponding columns of Y nearly
reproduces the original distance matrix D. In this sense, those columns
form a lower-dimensional representation that adequately describes the
data. However it is not always possible to find a good low-dimensional
reconstruction.
% good reconstruction in 3D
maxerr3 = max(abs(D - pdist(Y(:,1:3))))
maxerr3 =
0.029728
% poor reconstruction in 2D
maxerr2 = max(abs(D - pdist(Y(:,1:2))))
maxerr2 =
0.91641
The reconstruction in three dimensions reproduces D very well, but the
reconstruction in two dimensions has errors that are of the same order of
magnitude as the largest values in D.
max(max(D))
ans =
3.4686
Often, eigvals contains some negative eigenvalues, indicating that the
distances in D cannot be reproduced exactly. That is, there might not be any
configuration of points whose inter-point Euclidean distances are given by
D. If the largest negative eigenvalue is small in magnitude relative to the
largest positive eigenvalues, then the configuration returned by cmdscale
might still reproduce D well.
Example: Multidimensional Scaling
Given only the distances between 10 US cities, cmdscale can construct a map
of those cities. First, create the distance matrix and pass it to cmdscale.
In this example, D is a full distance matrix: it is square and symmetric, has
positive entries off the diagonal, and has zeros on the diagonal.
cities = ...
{'Atl','Chi','Den','Hou','LA','Mia','NYC','SF','Sea','WDC'};
D = [    0  587 1212  701 1936  604  748 2139 2182  543;
       587    0  920  940 1745 1188  713 1858 1737  597;
      1212  920    0  879  831 1726 1631  949 1021 1494;
       701  940  879    0 1374  968 1420 1645 1891 1220;
      1936 1745  831 1374    0 2339 2451  347  959 2300;
       604 1188 1726  968 2339    0 1092 2594 2734  923;
       748  713 1631 1420 2451 1092    0 2571 2408  205;
      2139 1858  949 1645  347 2594 2571    0  678 2442;
      2182 1737 1021 1891  959 2734 2408  678    0 2329;
       543  597 1494 1220 2300  923  205 2442 2329    0];
[Y,eigvals] = cmdscale(D);
Next, look at the eigenvalues returned by cmdscale. Some of these are
negative, indicating that the original distances are not Euclidean. This is
because of the curvature of the earth.
format short g
[eigvals eigvals/max(abs(eigvals))]
ans =
  9.5821e+006            1
  1.6868e+006      0.17604
       8157.3    0.0008513
       1432.9   0.00014954
       508.67  5.3085e-005
       25.143   2.624e-006
  5.3394e-010  5.5722e-017
       -897.7 -9.3685e-005
      -5467.6   -0.0005706
       -35479   -0.0037026
However, in this case, the two largest positive eigenvalues are much larger
in magnitude than the remaining eigenvalues. So, despite the negative
eigenvalues, the first two coordinates of Y are sufficient for a reasonable
reproduction of D.
Dtriu = D(find(tril(ones(10),-1)))';
maxrelerr = max(abs(Dtriu-pdist(Y(:,1:2))))./max(Dtriu)
maxrelerr =
0.0075371
Here is a plot of the reconstructed city locations as a map. The orientation of
the reconstruction is arbitrary. In this case, it happens to be close to, although
not exactly, the correct orientation.
plot(Y(:,1),Y(:,2),'.')
text(Y(:,1)+25,Y(:,2),cities)
xlabel('Miles')
ylabel('Miles')
Nonclassical Multidimensional Scaling
The function mdscale performs nonclassical multidimensional scaling. As
with cmdscale, you use mdscale either to visualize dissimilarity data for which
no “locations” exist, or to visualize high-dimensional data by reducing its
dimensionality. Both functions take a matrix of dissimilarities as an input
and produce a configuration of points. However, mdscale offers a choice of
different criteria to construct the configuration, and allows missing data and
weights.
For example, the cereal data include measurements on 10 variables describing
breakfast cereals. You can use mdscale to visualize these data in two
dimensions. First, load the data. For clarity, this example code selects a
subset of 22 of the observations.
load cereal.mat
X = [Calories Protein Fat Sodium Fiber ...
Carbo Sugars Shelf Potass Vitamins];
% Take a subset from a single manufacturer
mfg1 = strcmp('G',cellstr(Mfg));
X = X(mfg1,:);
size(X)
ans =
22 10
Then use pdist to transform the 10-dimensional data into dissimilarities.
The output from pdist is a symmetric dissimilarity matrix, stored as a vector
containing only the (22*21/2) = 231 elements in its upper triangle.
dissimilarities = pdist(zscore(X),'cityblock');
size(dissimilarities)
ans =
     1   231
This example code first standardizes the cereal data, and then uses city block
distance as a dissimilarity. The choice of transformation to dissimilarities is
application-dependent, and the choice here is only for simplicity. In some
applications, the original data are already in the form of dissimilarities.
Next, use mdscale to perform metric MDS. Unlike cmdscale, you must
specify the desired number of dimensions, and the method to use to construct
the output configuration. For this example, use two dimensions. The metric
STRESS criterion is a common method for computing the output; for other
choices, see the mdscale reference page in the online documentation. The
second output from mdscale is the value of that criterion evaluated for the
output configuration. It measures the how well the inter-point distances of
the output configuration approximate the original input dissimilarities:
[Y,stress] =...
mdscale(dissimilarities,2,'criterion','metricstress');
stress
stress =
0.1856
A scatterplot of the output from mdscale represents the original
10-dimensional data in two dimensions, and you can use the gname function to
label selected points:
plot(Y(:,1),Y(:,2),'o','LineWidth',2);
gname(Name(mfg1))
Nonmetric Multidimensional Scaling
Metric multidimensional scaling creates a configuration of points whose
inter-point distances approximate the given dissimilarities. This is sometimes
too strict a requirement, and non-metric scaling is designed to relax it a bit.
Instead of trying to approximate the dissimilarities themselves, non-metric
scaling approximates a nonlinear, but monotonic, transformation of them.
Because of the monotonicity, larger or smaller distances on a plot of the
output will correspond to larger or smaller dissimilarities, respectively.
However, the nonlinearity implies that mdscale only attempts to preserve the
ordering of dissimilarities. Thus, there may be contractions or expansions of
distances at different scales.
You use mdscale to perform nonmetric MDS in much the same way as for
metric scaling. The nonmetric STRESS criterion is a common method for
computing the output; for more choices, see the mdscale reference page in
the online documentation. As with metric scaling, the second output from
mdscale is the value of that criterion evaluated for the output configuration.
For nonmetric scaling, however, it measures how well the inter-point
distances of the output configuration approximate the disparities. The
disparities are returned in the third output. They are the transformed values
of the original dissimilarities:
[Y,stress,disparities] = ...
mdscale(dissimilarities,2,'criterion','stress');
stress
stress =
0.1562
To check the fit of the output configuration to the dissimilarities, and to
understand the disparities, it helps to make a Shepard plot:
distances = pdist(Y);
[dum,ord] = sortrows([disparities(:) dissimilarities(:)]);
plot(dissimilarities,distances,'bo', ...
dissimilarities(ord),disparities(ord),'r.-', ...
[0 25],[0 25],'k-')
xlabel('Dissimilarities')
ylabel('Distances/Disparities')
legend({'Distances' 'Disparities' '1:1 Line'},...
'Location','NorthWest');
This plot shows that mdscale has found a configuration of points in two
dimensions whose inter-point distances approximate the disparities, which
in turn are a nonlinear transformation of the original dissimilarities. The
concave shape of the disparities as a function of the dissimilarities indicates
that the fit tends to contract small distances relative to the corresponding
dissimilarities. This may be perfectly acceptable in practice.
mdscale uses an iterative algorithm to find the output configuration, and
the results can often depend on the starting point. By default, mdscale
uses cmdscale to construct an initial configuration, and this choice often
leads to a globally best solution. However, it is possible for mdscale to
stop at a configuration that is a local minimum of the criterion. Such
cases can be diagnosed and often overcome by running mdscale multiple
times with different starting points. You can do this using the 'start'
and 'replicates' parameters. The following code runs five replicates of
MDS, each starting at a different randomly-chosen initial configuration.
The criterion value is printed out for each replication; mdscale returns the
configuration with the best fit.
opts = statset('Display','final');
[Y,stress] =...
mdscale(dissimilarities,2,'criterion','stress',...
'start','random','replicates',5,'Options',opts);
35 iterations, Final stress criterion = 0.156209
31 iterations, Final stress criterion = 0.156209
48 iterations, Final stress criterion = 0.171209
33 iterations, Final stress criterion = 0.175341
32 iterations, Final stress criterion = 0.185881
Notice that mdscale finds several different local solutions, some of which
do not have as low a stress value as the solution found with the cmdscale
starting point.
Procrustes Analysis
In this section...
“Comparing Landmark Data” on page 10-14
“Data Input” on page 10-14
“Preprocessing Data for Accurate Results” on page 10-15
“Example: Comparing Handwritten Shapes” on page 10-16
Comparing Landmark Data
The procrustes function analyzes the distribution of a set of shapes using
Procrustes analysis. This analysis method matches landmark data (geometric
locations representing significant features in a given shape) to calculate the
best shape-preserving Euclidean transformations. These transformations
minimize the differences in location between compared landmark data.
Procrustes analysis is also useful in conjunction with multidimensional
scaling. In “Example: Multidimensional Scaling” on page 10-6 there is an
observation that the orientation of the reconstructed points is arbitrary. Two
different applications of multidimensional scaling could produce reconstructed
points that are very similar in principle, but that look different because they
have different orientations. The procrustes function transforms one set of
points to make them more comparable to the other.
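For instance, given two configurations Y1 and Y2 (hypothetical outputs of separate cmdscale or mdscale runs on related data), a sketch of this alignment is:

% Sketch: rotate, reflect, scale, and translate Y2 to best match Y1,
% then plot the two configurations on common axes
[d,Y2aligned] = procrustes(Y1,Y2);
plot(Y1(:,1),Y1(:,2),'bo', Y2aligned(:,1),Y2aligned(:,2),'r+')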
Data Input
The procrustes function takes two matrices as input:
• The target shape matrix X has dimension n × p, where n is the number
of landmarks in the shape and p is the number of measurements per
landmark.
• The comparison shape matrix Y has dimension n × q with q ≤ p. If there
are fewer measurements per landmark for the comparison shape than
the target shape (q < p), the function adds columns of zeros to Y, yielding
an n × p matrix.
The equation to obtain the transformed shape, Z, is
Z = bYT + c
(10-1)
where:
• b is a scaling factor that stretches (b > 1) or shrinks (b < 1) the points.
• T is the orthogonal rotation and reflection matrix.
• c is a matrix with constant values in each column, used to shift the points.
The procrustes function chooses b, T, and c to minimize the distance between
the target shape X and the transformed shape Z as measured by the least
squares criterion:
\[
\sum_{i=1}^{n} \sum_{j=1}^{p} \left( X_{ij} - Z_{ij} \right)^{2}
\]
Preprocessing Data for Accurate Results
Procrustes analysis is appropriate when all p measurement dimensions have
similar scales. The analysis would be inaccurate, for example, if the columns
of Z had different scales:
• The first column is measured in milliliters ranging from 2,000 to 6,000.
• The second column is measured in degrees Celsius ranging from 10 to 25.
• The third column is measured in kilograms ranging from 50 to 230.
In such cases, standardize your variables by:
1 Subtracting the sample mean from each variable.
2 Dividing each resultant variable by its sample standard deviation.
Use the zscore function to perform this standardization.
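For example, if A is a hypothetical n-by-p matrix of raw landmark measurements with mixed units, the standardization is:

% Standardize each column of A to zero mean and unit standard deviation
Astd = zscore(A);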
Example: Comparing Handwritten Shapes
In this example, use Procrustes analysis to compare two handwritten number
threes. Visually and analytically explore the effects of forcing size and
reflection changes as follows:
• “Step 1: Load and Display the Original Data” on page 10-16
• “Step 2: Calculate the Best Transformation” on page 10-17
• “Step 3: Examine the Similarity of the Two Shapes” on page 10-18
• “Step 4: Restrict the Form of the Transformations” on page 10-20
Step 1: Load and Display the Original Data
Input landmark data for two handwritten number threes:
A = [11 39;17 42;25 42;25 40;23 36;19 35;30 34;35 29;...
30 20;18 19];
B = [15 31;20 37;30 40;29 35;25 29;29 31;31 31;35 20;...
29 10;25 18];
Create X and Y from A and B, moving B to the side to make each shape more
visible:
X = A;
Y = B + repmat([25 0], 10,1);
Plot the shapes, using letters to designate the landmark points. Lines in the
figure join the points to indicate the drawing path of each shape.
plot(X(:,1), X(:,2),'r-', Y(:,1), Y(:,2),'b-');
text(X(:,1), X(:,2),('abcdefghij')')
text(Y(:,1), Y(:,2),('abcdefghij')')
legend('X = Target','Y = Comparison','location','SE')
set(gca,'YLim',[0 55],'XLim',[0 65]);
Step 2: Calculate the Best Transformation
Use Procrustes analysis to find the transformation that minimizes distances
between landmark data points.
Call procrustes as follows:
[d, Z, tr] = procrustes(X,Y);
The outputs of the function are:
• d – A standardized dissimilarity measure.
• Z – A matrix of the transformed landmarks.
• tr – A structure array of the computed transformation with fields T, b, and
c which correspond to the transformation equation, Equation 10-1.
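As a check, you can apply the returned transformation to Y yourself; the result should reproduce Z. This sketch assumes the procrustes call above has already been made:

% Reconstruct the transformed shape from the transformation structure.
% tr.c is returned with one (identical) row per landmark, so it can be
% added directly.
Zcheck = tr.b * Y * tr.T + tr.c;
max(abs(Zcheck(:) - Z(:)))   % should be near zero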
Visualize the transformed shape, Z, using a dashed blue line:
plot(X(:,1), X(:,2),'r-', Y(:,1), Y(:,2),'b-',...
Z(:,1),Z(:,2),'b:');
text(X(:,1), X(:,2),('abcdefghij')')
text(Y(:,1), Y(:,2),('abcdefghij')')
text(Z(:,1), Z(:,2),('abcdefghij')')
legend('X = Target','Y = Comparison',...
'Z = Transformed','location','SW')
set(gca,'YLim',[0 55],'XLim',[0 65]);
Step 3: Examine the Similarity of the Two Shapes
Use two different numerical values to assess the similarity of the target shape
and the transformed shape.
Dissimilarity Measure d. The dissimilarity measure d gives a number
between 0 and 1 describing the difference between the target shape and the
transformed shape. Values near 0 imply more similar shapes, while values
near 1 imply dissimilarity. For this example:
d =
0.1502
The small value of d in this case shows that the two shapes are similar.
procrustes calculates d by comparing the sum of squared deviations between
the set of points with the sum of squared deviations of the original points from
their column means:
numerator = sum(sum((X-Z).^2))
numerator =
166.5321
denominator = sum(sum(bsxfun(@minus,X,mean(X)).^2))
denominator =
1.1085e+003
ratio = numerator/denominator
ratio =
0.1502
Note The resulting measure d is independent of the scale of the size of
the shapes and takes into account only the similarity of landmark data.
“Examining the Scaling Measure b” on page 10-19 shows how to examine the
size similarity of the shapes.
Examining the Scaling Measure b. The target and comparison threes in
the previous figure visually show that the two numbers are of a similar size.
The closeness of the calculated value of the scaling factor b to 1 supports this
observation as well:
tr.b
ans =
0.9291
The sizes of the target and comparison shapes appear similar. This visual
impression is reinforced by the value of b = 0.93, which implies that the best
transformation scales the comparison shape by a factor of 0.93, that is,
shrinks it by only about 7%.
Step 4: Restrict the Form of the Transformations
Explore the effects of manually adjusting the scaling and reflection
coefficients.
Fixing the Scaling Factor b = 1. Force b to equal 1 (set 'Scaling' to
false) to examine the amount of dissimilarity in size of the target and
transformed figures:
ds = procrustes(X,Y,'Scaling',false)
ds =
0.1552
In this case, setting 'Scaling' to false increases the calculated value of
d by only 0.0049, which further supports the similarity in the size of the two
number threes. A larger increase in d would have indicated a greater size
discrepancy.
Forcing a Reflection in the Transformation. This example requires only a
rotation, not a reflection, to align the shapes. You can show this by observing
that the determinant of the matrix T is 1 in this analysis:
det(tr.T)
ans =
1.0000
If you need a reflection in the transformation, the determinant of T is -1. You
can force a reflection into the transformation as follows:
[dr,Zr,trr] = procrustes(X,Y,'Reflection',true);
dr
dr =
0.8130
The d value increases dramatically, indicating that a forced reflection leads
to a poor transformation of the landmark points. A plot of the transformed
shape shows a similar result:
• The landmark data points are now further away from their target
counterparts.
• The transformed three is now an undesirable mirror image of the target
three.
plot(X(:,1), X(:,2),'r-', Y(:,1), Y(:,2),'b-',...
Zr(:,1),Zr(:,2),'b:');
text(X(:,1), X(:,2),('abcdefghij')')
text(Y(:,1), Y(:,2),('abcdefghij')')
text(Zr(:,1), Zr(:,2),('abcdefghij')')
legend('X = Target','Y = Comparison',...
'Z = Transformed','location','SW')
set(gca,'YLim',[0 55],'XLim',[0 65]);
It appears that the shapes might be better matched if you flipped the
transformed shape upside down. Flipping the shapes would make the
transformation even worse, however, because the landmark data points
would be further away from their target counterparts. From this example,
it is clear that manually adjusting the scaling and reflection parameters is
generally not optimal.
Feature Selection
In this section...
“Introduction” on page 10-23
“Sequential Feature Selection” on page 10-23
Introduction
Feature selection reduces the dimensionality of data by selecting only a subset
of measured features (predictor variables) to create a model. Selection criteria
usually involve the minimization of a specific measure of predictive error for
models fit to different subsets. Algorithms search for a subset of predictors
that optimally model measured responses, subject to constraints such as
required or excluded features and the size of the subset.
Feature selection is preferable to feature transformation when the original
units and meaning of features are important and the modeling goal is to
identify an influential subset. When categorical features are present, and
numerical transformations are inappropriate, feature selection becomes the
primary means of dimension reduction.
Sequential Feature Selection
• “Introduction” on page 10-23
• “Example: Sequential Feature Selection” on page 10-24
Introduction
A common method of feature selection is sequential feature selection. This
method has two components:
• An objective function, called the criterion, which the method seeks to
minimize over all feasible feature subsets. Common criteria are mean
squared error (for regression models) and misclassification rate (for
classification models).
• A sequential search algorithm, which adds or removes features from a
candidate subset while evaluating the criterion. Since an exhaustive
comparison of the criterion value at all 2^n subsets of an n-feature data set
is typically infeasible (depending on the size of n and the cost of objective
calls), sequential searches move in only one direction, always growing or
always shrinking the candidate set.
The method has two variants:
• Sequential forward selection (SFS), in which features are sequentially
added to an empty candidate set until the addition of further features does
not decrease the criterion.
• Sequential backward selection (SBS), in which features are sequentially
removed from a full candidate set until the removal of further features
increases the criterion.
Stepwise regression is a sequential feature selection technique designed
specifically for least-squares fitting. The functions stepwise and stepwisefit
make use of optimizations that are only possible with least-squares criteria.
Unlike generalized sequential feature selection, stepwise regression may
remove features that have been added or add features that have been removed.
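For example, a least-squares stepwise fit on a hypothetical predictor matrix X and response vector y looks like the following sketch:

% Sketch: stepwise regression; inmodel flags the selected predictors
[b,se,pval,inmodel] = stepwisefit(X,y);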
The Statistics Toolbox function sequentialfs carries out sequential feature
selection. Input arguments include predictor and response data and a function
handle to a file implementing the criterion function. Optional inputs allow
you to specify SFS or SBS, required or excluded features, and the size of the
feature subset. The function calls cvpartition and crossval to evaluate the
criterion at different candidate sets.
Example: Sequential Feature Selection
For example, consider a data set with 100 observations of 10 predictors.
As described in “Example: Generalized Linear Models” on page 9-53, the
following generates random data from a logistic model, with a binomial
distribution of responses at each set of values for the predictors. Some
coefficients are set to zero so that not all of the predictors affect the response:
n = 100;
m = 10;
X = rand(n,m);
b = [1 0 0 2 .5 0 0 0.1 0 1];
Xb = X*b';
p = 1./(1+exp(-Xb));
N = 50;
y = binornd(N,p);
The glmfit function fits a logistic model to the data:
Y = [y N*ones(size(y))];
[b0,dev0,stats0] = glmfit(X,Y,'binomial');
% Display coefficient estimates and their standard errors:
model0 = [b0 stats0.se]
model0 =
    0.3115    0.2596
    0.9614    0.1656
   -0.1100    0.1651
   -0.2165    0.1683
    1.9519    0.1809
    0.5683    0.2018
   -0.0062    0.1740
    0.0651    0.1641
   -0.1034    0.1685
    0.0017    0.1815
    0.7979    0.1806
% Display the deviance of the fit:
dev0
dev0 =
101.2594
This is the full model, using all of the features (and an initial constant term).
Sequential feature selection searches for a subset of the features in the full
model with comparative predictive power.
First, you must specify a criterion for selecting the features. The following
function, which calls glmfit and returns the deviance of the fit (a
generalization of the residual sum of squares) is a useful criterion in this case:
function dev = critfun(X,Y)
[b,dev] = glmfit(X,Y,'binomial');
You should create this function as a file on the MATLAB path.
The function sequentialfs performs feature selection, calling the criterion
function via a function handle:
maxdev = chi2inv(.95,1);
opt = statset('display','iter',...
'TolFun',maxdev,...
'TolTypeFun','abs');
inmodel = sequentialfs(@critfun,X,Y,...
'cv','none',...
'nullmodel',true,...
'options',opt,...
'direction','forward');
Start forward sequential feature selection:
Initial columns included: none
Columns that can not be included: none
Step 1, used initial columns, criterion value 309.118
Step 2, added column 4, criterion value 180.732
Step 3, added column 1, criterion value 138.862
Step 4, added column 10, criterion value 114.238
Step 5, added column 5, criterion value 103.503
Final columns included: 1 4 5 10
The iterative display shows a decrease in the criterion value as each new
feature is added to the model. The final result is a reduced model with only
four of the original ten features: columns 1, 4, 5, and 10 of X. These features
are indicated in the logical vector inmodel returned by sequentialfs.
The deviance of the reduced model is higher than for the full model, but
the addition of any other single feature would not decrease the criterion
by more than the absolute tolerance, maxdev, set in the options structure.
Adding a feature with no effect reduces the deviance by an amount that has
a chi-square distribution with one degree of freedom. Adding a significant
feature results in a larger change. By setting maxdev to chi2inv(.95,1), you
instruct sequentialfs to continue adding features so long as the change in
deviance is more than would be expected by random chance.
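For reference, the threshold used here is the 95th percentile of the chi-square distribution with one degree of freedom:

maxdev = chi2inv(.95,1)
% maxdev =
%     3.8415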
The reduced model (also with an initial constant term) is:
[b,dev,stats] = glmfit(X(:,inmodel),Y,'binomial');
% Display coefficient estimates and their standard errors:
model = [b stats.se]
model =
    0.0784    0.1642
    1.0040    0.1592
    1.9459    0.1789
    0.6134    0.1872
    0.8245    0.1730
Feature Transformation
In this section...
“Introduction” on page 10-28
“Nonnegative Matrix Factorization” on page 10-28
“Principal Component Analysis (PCA)” on page 10-31
“Factor Analysis” on page 10-45
Introduction
Feature transformation is a group of methods that create new features
(predictor variables). The methods are useful for dimension reduction when
the transformed features have a descriptive power that is more easily ordered
than the original features. In this case, less descriptive features can be
dropped from consideration when building models.
Feature transformation methods are contrasted with the methods presented
in “Feature Selection” on page 10-23, where dimension reduction is achieved
by computing an optimal subset of predictive features measured in the
original data.
The methods presented in this section share some common methodology.
Their goals, however, are essentially different:
• Nonnegative matrix factorization is used when model terms must represent
nonnegative quantities, such as physical quantities.
• Principal component analysis is used to summarize data in fewer
dimensions, for example, to visualize it.
• Factor analysis is used to build explanatory models of data correlations.
Nonnegative Matrix Factorization
• “Introduction” on page 10-29
• “Example: Nonnegative Matrix Factorization” on page 10-29
Introduction
Nonnegative matrix factorization (NMF) is a dimension-reduction technique
based on a low-rank approximation of the feature space. Besides providing
a reduction in the number of features, NMF guarantees that the features
are nonnegative, producing additive models that respect, for example, the
nonnegativity of physical quantities.
Given a nonnegative m-by-n matrix X and a positive integer k < min(m,n),
NMF finds nonnegative m-by-k and k-by-n matrices W and H, respectively,
that minimize the norm of the difference X – WH. W and H are thus
approximate nonnegative factors of X.
The k columns of W represent transformations of the variables in X; the k
rows of H represent the coefficients of the linear combinations of the original
n variables in X that produce the transformed variables in W. Since k is
generally smaller than the rank of X, the product WH provides a compressed
approximation of the data in X. A range of possible values for k is often
suggested by the modeling context.
The Statistics Toolbox function nnmf carries out nonnegative matrix
factorization. nnmf uses one of two iterative algorithms that begin with
random initial values for W and H. Because the norm of the residual X
– WH may have local minima, repeated calls to nnmf may yield different
factorizations. Sometimes the algorithm converges to a solution of lower rank
than k, which may indicate that the result is not optimal.
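A minimal sketch of one way to compare repeated factorizations by their residual norms (the matrix A and the variable names here are illustrative placeholders, not the example data used below):

A = abs(randn(100,8));          % any nonnegative matrix
[W1,H1] = nnmf(A,3);            % two factorizations from different
[W2,H2] = nnmf(A,3);            % random starting values
norm(A - W1*H1,'fro')           % the residual norms may differ
norm(A - W2*H2,'fro')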
Example: Nonnegative Matrix Factorization
For example, consider the five predictors of biochemical oxygen demand in the
data set moore.mat:
load moore
X = moore(:,1:5);
The following uses nnmf to compute a rank-two approximation of X with a
multiplicative update algorithm that begins from five random initial values
for W and H:
opt = statset('MaxIter',10,'Display','final');
[W0,H0] = nnmf(X,2,'replicates',5,...
                   'options',opt,...
                   'algorithm','mult');

    rep    iteration    rms resid      |delta x|
      1           10      358.296     0.00190554
      2           10      78.3556    0.000351747
      3           10      230.962      0.0172839
      4           10      326.347     0.00739552
      5           10      361.547     0.00705539
Final root mean square residual = 78.3556
The 'mult' algorithm is sensitive to initial values, which makes it a good
choice when using 'replicates' to find W and H from multiple random
starting values.
Now perform the factorization using an alternating least-squares algorithm,
which converges faster and more consistently. Run 100 times more iterations,
beginning from the initial W0 and H0 identified above:
opt = statset('Maxiter',1000,'Display','final');
[W,H] = nnmf(X,2,'w0',W0,'h0',H0,...
'options',opt,...
'algorithm','als');
    rep    iteration    rms resid      |delta x|
      1            3      77.5315   3.52673e-005
Final root mean square residual = 77.5315
The two columns of W are the transformed predictors. The two rows of H give
the relative contributions of each of the five predictors in X to the predictors
in W:
H
H =
    0.0835    0.0190    0.1782    0.0072    0.9802
    0.0558    0.0250    0.9969    0.0085    0.0497
The fifth predictor in X (weight 0.9802) strongly influences the first predictor
in W. The third predictor in X (weight 0.9969) strongly influences the second
predictor in W.
Visualize the relative contributions of the predictors in X with a biplot,
showing the data and original variables in the column space of W:
biplot(H','scores',W,'varlabels',{'','','v3','','v5'});
axis([0 1.1 0 1.1])
xlabel('Column 1')
ylabel('Column 2')
Principal Component Analysis (PCA)
• “Introduction” on page 10-31
• “Example: Principal Component Analysis” on page 10-33
Introduction
One of the difficulties inherent in multivariate statistics is the problem of
visualizing data that has many variables. The MATLAB function plot
displays a graph of the relationship between two variables. The plot3
and surf commands display different three-dimensional views. But when
there are more than three variables, it is more difficult to visualize their
relationships.
Fortunately, in data sets with many variables, groups of variables often
move together. One reason for this is that more than one variable might be
measuring the same driving principle governing the behavior of the system.
In many systems there are only a few such driving forces. But an abundance
of instrumentation enables you to measure dozens of system variables. When
this happens, you can take advantage of this redundancy of information.
You can simplify the problem by replacing a group of variables with a single
new variable.
Principal component analysis is a quantitatively rigorous method for achieving
this simplification. The method generates a new set of variables, called
principal components. Each principal component is a linear combination of
the original variables. All the principal components are orthogonal to each
other, so there is no redundant information. The principal components as a
whole form an orthogonal basis for the space of the data.
There are an infinite number of ways to construct an orthogonal basis for
several columns of data. What is so special about the principal component
basis?
The first principal component is a single axis in space. When you project
each observation on that axis, the resulting values form a new variable. And
the variance of this variable is the maximum among all possible choices of
the first axis.
The second principal component is another axis in space, perpendicular to
the first. Projecting the observations on this axis generates another new
variable. The variance of this variable is the maximum among all possible
choices of this second axis.
The full set of principal components is as large as the original set of variables.
But it is commonplace for the sum of the variances of the first few principal
components to exceed 80% of the total variance of the original data. By
examining plots of these few new variables, researchers often develop a
deeper understanding of the driving forces that generated the original data.
You can use the function princomp to find the principal components. To use
princomp, you need to have the actual measured data you want to analyze.
However, if you lack the actual data, but have the sample covariance or
correlation matrix for the data, you can still use the function pcacov to
perform a principal components analysis. See the reference page for pcacov
for a description of its inputs and outputs.
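As a minimal sketch (the correlation matrix R below is an illustrative placeholder, not data from this chapter), pcacov accepts the matrix directly:

R = [1.0  0.6  0.3
     0.6  1.0  0.2
     0.3  0.2  1.0];            % a sample correlation matrix
[coefs2,variances2,explained2] = pcacov(R)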
Example: Principal Component Analysis
• “Computing Components” on page 10-33
• “Component Coefficients” on page 10-36
• “Component Scores” on page 10-36
• “Component Variances” on page 10-40
• “Hotelling’s T2” on page 10-42
• “Visualizing the Results” on page 10-42
Computing Components. Consider a sample application that uses nine
different indices of the quality of life in 329 U.S. cities. These are climate,
housing, health, crime, transportation, education, arts, recreation, and
economics. For each index, higher is better. For example, a higher index
for crime means a lower crime rate.
Start by loading the data in cities.mat.
load cities
whos
  Name            Size      Bytes  Class

  categories      9x14        252  char array
  names           329x43    28294  char array
  ratings         329x9     23688  double array
The whos command generates a table of information about all the variables
in the workspace.
The cities data set contains three variables:
• categories, a string matrix containing the names of the indices
• names, a string matrix containing the 329 city names
• ratings, the data matrix with 329 rows and 9 columns
The categories variable has the following values:
categories
categories =
climate
housing
health
crime
transportation
education
arts
recreation
economics
The first five rows of names are
first5 = names(1:5,:)
first5 =
Abilene, TX
Akron, OH
Albany, GA
Albany-Troy, NY
Albuquerque, NM
To get a quick impression of the ratings data, make a box plot.
boxplot(ratings,'orientation','horizontal','labels',categories)
This command generates the plot below. Note that there is substantially
more variability in the ratings of the arts and housing than in the ratings
of crime and climate.
Ordinarily you might also graph pairs of the original variables, but there are
36 two-variable plots. Perhaps principal components analysis can reduce the
number of variables you need to consider.
Sometimes it makes sense to compute principal components for raw data. This
is appropriate when all the variables are in the same units. Standardizing the
data is often preferable when the variables are in different units or when the
variance of the different columns is substantial (as in this case).
You can standardize the data by dividing each column by its standard
deviation.
stdr = std(ratings);
sr = ratings./repmat(stdr,329,1);
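An equivalent standardization, offered here only as an aside, is the zscore function, which also subtracts each column mean; because princomp centers the data internally, it should produce the same principal components either way:

sr2 = zscore(ratings);   % columns of sr2 have mean 0 and standard deviation 1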
Now you are ready to find the principal components.
[coefs,scores,variances,t2] = princomp(sr);
The following sections explain the four outputs from princomp.
Component Coefficients. The first output of the princomp function, coefs,
contains the coefficients of the linear combinations of the original variables
that generate the principal components. The coefficients are also known as
loadings.
The first three principal component coefficient vectors are:
c3 = coefs(:,1:3)
c3 =
    0.2064    0.2178   -0.6900
    0.3565    0.2506   -0.2082
    0.4602   -0.2995   -0.0073
    0.2813    0.3553    0.1851
    0.3512   -0.1796    0.1464
    0.2753   -0.4834    0.2297
    0.4631   -0.1948   -0.0265
    0.3279    0.3845   -0.0509
    0.1354    0.4713    0.6073
The largest coefficients in the first column (first principal component) are
the third and seventh elements, corresponding to the variables health and
arts. All the coefficients of the first principal component have the same sign,
making it a weighted average of all the original variables.
The principal components are unit length and orthogonal:
I = c3'*c3
I =
    1.0000   -0.0000   -0.0000
   -0.0000    1.0000   -0.0000
   -0.0000   -0.0000    1.0000
Component Scores. The second output, scores, contains the coordinates
of the original data in the new coordinate system defined by the principal
components. This output is the same size as the input data matrix.
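As an informal check, not part of the original analysis, the scores and coefficients together reproduce the centered data, because the coefficients form an orthonormal basis:

centered = sr - repmat(mean(sr),size(sr,1),1);
max(max(abs(centered - scores*coefs')))   % should be near zero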
A plot of the first two columns of scores shows the ratings data projected
onto the first two principal components. princomp computes the scores to
have mean zero.
plot(scores(:,1),scores(:,2),'+')
xlabel('1st Principal Component')
ylabel('2nd Principal Component')
Note the outlying points in the right half of the plot.
While it is possible to create a three-dimensional plot using three columns
of scores, the examples in this section create two-dimensional plots, which
are easier to describe.
The function gname is useful for graphically identifying a few points in a plot
like this. You can call gname with a string matrix containing as many case
labels as points in the plot. The string matrix names works for labeling points
with the city names.
gname(names)
Move your cursor over the plot and click once near each point in the right
half. As you click each point, it is labeled with the proper row from the names
string matrix. Here is the plot after a few clicks:
When you are finished labeling points, press the Return key.
The labeled cities are some of the biggest population centers in the United
States. They are definitely different from the remainder of the data, so
perhaps they should be considered separately. To remove the labeled cities
from the data, first identify their corresponding row numbers as follows:
1 Close the plot window.
2 Redraw the plot by entering
plot(scores(:,1),scores(:,2),'+')
xlabel('1st Principal Component');
ylabel('2nd Principal Component');
3 Enter gname without any arguments.
4 Click near the points you labeled in the preceding figure. This labels the
points by their row numbers, as shown in the following figure.
Then you can create an index variable containing the row numbers of all
the metropolitan areas you choose.
metro = [43 65 179 213 234 270 314];
names(metro,:)
ans =
Boston, MA
Chicago, IL
Los Angeles, Long Beach, CA
New York, NY
Philadelphia, PA-NJ
San Francisco, CA
Washington, DC-MD-VA
To remove these rows from the ratings matrix, enter the following.
rsubset = ratings;
nsubset = names;
nsubset(metro,:) = [];
rsubset(metro,:) = [];
size(rsubset)
ans =
   322     9
Component Variances. The third output, variances, is a vector containing
the variance explained by the corresponding principal component. Each
column of scores has a sample variance equal to the corresponding element
of variances.
variances
variances =
3.4083
1.2140
1.1415
0.9209
0.7533
0.6306
0.4930
0.3180
0.1204
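You can verify this relationship directly (an illustrative check):

[var(scores)' variances]   % the two columns should agree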
You can easily calculate the percent of the total variability explained by each
principal component.
percent_explained = 100*variances/sum(variances)
percent_explained =
37.8699
13.4886
12.6831
10.2324
8.3698
7.0062
5.4783
3.5338
1.3378
Use the pareto function to make a scree plot of the percent variability
explained by each principal component.
pareto(percent_explained)
xlabel('Principal Component')
ylabel('Variance Explained (%)')
The preceding figure shows that the only clear break in the amount of
variance accounted for by each component is between the first and second
components. However, that component by itself explains less than 40% of the
variance, so more components are probably needed. You can see that the first
three principal components explain roughly two-thirds of the total variability
in the standardized ratings, so that might be a reasonable way to reduce the
dimensions in order to visualize the data.
Hotelling’s T2. The last output of the princomp function, t2, is Hotelling’s T2,
a statistical measure of the multivariate distance of each observation from
the center of the data set. This is an analytical way to find the most extreme
points in the data.
[st2, index] = sort(t2,'descend'); % Sort in descending order.
extreme = index(1)
extreme =
213
names(extreme,:)
ans =
New York, NY
It is not surprising that the ratings for New York are the furthest from the
average U.S. town.
Visualizing the Results. Use the biplot function to help visualize both
the principal component coefficients for each variable and the principal
component scores for each observation in a single plot. For example, the
following command plots the results from the principal components analysis
on the cities and labels each of the variables.
biplot(coefs(:,1:2), 'scores',scores(:,1:2),...
'varlabels',categories);
axis([-.26 1 -.51 .51]);
Each of the nine variables is represented in this plot by a vector, and the
direction and length of the vector indicates how each variable contributes to
the two principal components in the plot. For example, you have seen that the
first principal component, represented in this biplot by the horizontal axis,
has positive coefficients for all nine variables. That corresponds to the nine
vectors directed into the right half of the plot. You have also seen that the
second principal component, represented by the vertical axis, has positive
coefficients for the variables education, health, arts, and transportation, and
negative coefficients for the remaining five variables. That corresponds to
vectors directed into the top and bottom halves of the plot, respectively. This
indicates that this component distinguishes between cities that have high
values for the first set of variables and low for the second, and cities that
have the opposite.
The variable labels in this figure are somewhat crowded. You could either
leave out the VarLabels parameter when making the plot, or simply select
and drag some of the labels to better positions using the Edit Plot tool from
the figure window toolbar.
Each of the 329 observations is represented in this plot by a point, and
their locations indicate the score of each observation for the two principal
components in the plot. For example, points near the left edge of this plot
have the lowest scores for the first principal component. The points are
scaled to fit within the unit square, so only their relative locations may be
determined from the plot.
You can use the Data Cursor, in the Tools menu in the figure window, to
identify the items in this plot. By clicking on a variable (vector), you can read
off that variable’s coefficients for each principal component. By clicking on
an observation (point), you can read off that observation’s scores for each
principal component.
You can also make a biplot in three dimensions. This can be useful if the first
two principal coordinates do not explain enough of the variance in your data.
Selecting Rotate 3D in the Tools menu enables you to rotate the figure to
see it from different angles.
biplot(coefs(:,1:3), 'scores',scores(:,1:3),...
'obslabels',names);
axis([-.26 1 -.51 .51 -.61 .81]);
view([30 40]);
Factor Analysis
• “Introduction” on page 10-45
• “Example: Factor Analysis” on page 10-46
Introduction
Multivariate data often includes a large number of measured variables, and
sometimes those variables overlap, in the sense that groups of them might be
dependent. For example, in a decathlon, each athlete competes in 10 events,
but several of them can be thought of as speed events, while others can be
thought of as strength events, etc. Thus, you can think of a competitor’s 10
event scores as largely dependent on a smaller set of three or four types of
athletic ability.
Factor analysis is a way to fit a model to multivariate data to estimate just this
sort of interdependence. In a factor analysis model, the measured variables
depend on a smaller number of unobserved (latent) factors. Because each
factor might affect several variables in common, they are known as common
factors. Each variable is assumed to be dependent on a linear combination
of the common factors, and the coefficients are known as loadings. Each
measured variable also includes a component due to independent random
variability, known as specific variance because it is specific to one variable.
Specifically, factor analysis assumes that the covariance matrix of your data
is of the form
Σx = ΛΛᵀ + Ψ
where Λ is the matrix of loadings, and the elements of the diagonal matrix
Ψ are the specific variances. The function factoran fits the Factor Analysis
model using maximum likelihood.
Example: Factor Analysis
• “Factor Loadings” on page 10-46
• “Factor Rotation” on page 10-48
• “Factor Scores” on page 10-50
• “Visualizing the Results” on page 10-52
Factor Loadings. Over the course of 100 weeks, the percent change in stock
prices for ten companies has been recorded. Of the ten companies, the first
four can be classified as primarily technology, the next three as financial, and
the last three as retail. It seems reasonable that the stock prices for companies
that are in the same sector might vary together as economic conditions
change. Factor Analysis can provide quantitative evidence that companies
within each sector do experience similar week-to-week changes in stock price.
In this example, you first load the data, and then call factoran, specifying a
model fit with three common factors. By default, factoran computes rotated
estimates of the loadings to try and make their interpretation simpler. But in
this example, you specify an unrotated solution.
load stockreturns
[Loadings,specificVar,T,stats] = ...
factoran(stocks,3,'rotate','none');
The first two factoran return arguments are the estimated loadings and the
estimated specific variances. Each row of the loadings matrix represents one
of the ten stocks, and each column corresponds to a common factor. With
unrotated estimates, interpretation of the factors in this fit is difficult because
most of the stocks contain fairly large coefficients for two or more factors.
Loadings
Loadings =
    0.8885    0.2367   -0.2354
    0.7126    0.3862    0.0034
    0.3351    0.2784   -0.0211
    0.3088    0.1113   -0.1905
    0.6277   -0.6643    0.1478
    0.4726   -0.6383    0.0133
    0.1133   -0.5416    0.0322
    0.6403    0.1669    0.4960
    0.2363    0.5293    0.5770
    0.1105    0.1680    0.5524
Note “Factor Rotation” on page 10-48 helps to simplify the structure in the
Loadings matrix, to make it easier to assign meaningful interpretations to
the factors.
From the estimated specific variances, you can see that the model indicates
that a particular stock price varies quite a lot beyond the variation due to
the common factors.
specificVar
specificVar =
0.0991
0.3431
0.8097
0.8559
0.1429
0.3691
0.6928
0.3162
0.3311
0.6544
A specific variance of 1 would indicate that there is no common factor
component in that variable, while a specific variance of 0 would indicate that
the variable is entirely determined by common factors. These data seem to
fall somewhere in between.
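As an informal check, not shown in the original example, the estimated loadings and specific variances should approximately reproduce the sample correlation matrix, because factoran fits the correlation structure of the data by default:

Rhat = Loadings*Loadings' + diag(specificVar);
R = corrcoef(stocks);
max(max(abs(R - Rhat)))   % typically small when the model fits adequately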
The p value returned in the stats structure fails to reject the null hypothesis
of three common factors, suggesting that this model provides a satisfactory
explanation of the covariation in these data.
stats.p
ans =
0.8144
To determine whether fewer than three factors can provide an acceptable fit,
you can try a model with two common factors. The p value for this second fit
is highly significant, and rejects the hypothesis of two factors, indicating that
the simpler model is not sufficient to explain the pattern in these data.
[Loadings2,specificVar2,T2,stats2] = ...
factoran(stocks, 2,'rotate','none');
stats2.p
ans =
3.5610e-006
Factor Rotation. As the results illustrate, the estimated loadings from an
unrotated factor analysis fit can have a complicated structure. The goal of
factor rotation is to find a parameterization in which each variable has only a
small number of large loadings. That is, each variable is affected by a small
number of factors, preferably only one. This can often make it easier to
interpret what the factors represent.
If you think of each row of the loadings matrix as coordinates of a point
in M-dimensional space, then each factor corresponds to a coordinate axis.
Factor rotation is equivalent to rotating those axes and computing new
loadings in the rotated coordinate system. There are various ways to do this.
Some methods leave the axes orthogonal, while others are oblique methods
that change the angles between them. For this example, you can rotate the
estimated loadings by using the promax criterion, a common oblique method.
[LoadingsPM,specVarPM] = factoran(stocks,3,'rotate','promax');
LoadingsPM
LoadingsPM =
    0.9452    0.1214   -0.0617
    0.7064   -0.0178    0.2058
    0.3885   -0.0994    0.0975
    0.4162   -0.0148   -0.1298
    0.1021    0.9019    0.0768
    0.0873    0.7709   -0.0821
   -0.1616    0.5320   -0.0888
    0.2169    0.2844    0.6635
    0.0016   -0.1881    0.7849
   -0.2289    0.0636    0.6475
Promax rotation creates a simpler structure in the loadings, one in which
most of the stocks have a large loading on only one factor. To see this
structure more clearly, you can use the biplot function to plot each stock
using its factor loadings as coordinates.
biplot(LoadingsPM,'varlabels',num2str((1:10)'));
axis square
view(155,27);
This plot shows that promax has rotated the factor loadings to a simpler
structure. Each stock depends primarily on only one factor, and it is possible
to describe each factor in terms of the stocks that it affects. Based on which
companies are near which axes, you could reasonably conclude that the first
factor axis represents the financial sector, the second retail, and the third
technology. The original conjecture, that stocks vary primarily within sector,
is apparently supported by the data.
Factor Scores. Sometimes, it is useful to be able to classify an observation
based on its factor scores. For example, if you accepted the three-factor model
and the interpretation of the rotated factors, you might want to categorize
each week in terms of how favorable it was for each of the three stock sectors,
based on the data from the 10 observed stocks.
Because the data in this example are the raw stock price changes, and not
just their correlation matrix, you can have factoran return estimates of the
value of each of the three rotated common factors for each week. You can
then plot the estimated scores to see how the different stock sectors were
affected during each week.
[LoadingsPM,specVarPM,TPM,stats,F] = ...
factoran(stocks, 3,'rotate','promax');
plot3(F(:,1),F(:,2),F(:,3),'b.')
line([-4 4 NaN 0 0 NaN 0 0], [0 0 NaN -4 4 NaN 0 0],...
[0 0 NaN 0 0 NaN -4 4], 'Color','black')
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
grid on
axis square
view(-22.5, 8)
Oblique rotation often creates factors that are correlated. This plot shows
some evidence of correlation between the first and third factors, and you can
investigate further by computing the estimated factor correlation matrix.
inv(TPM'*TPM)
ans =
    1.0000    0.1559    0.4082
    0.1559    1.0000   -0.0559
    0.4082   -0.0559    1.0000
Visualizing the Results. You can use the biplot function to help visualize
both the factor loadings for each variable and the factor scores for each
observation in a single plot. For example, the following command plots the
results from the factor analysis on the stock data and labels each of the 10
stocks.
biplot(LoadingsPM,'scores',F,'varlabels',num2str((1:10)'))
xlabel('Financial Sector')
ylabel('Retail Sector')
zlabel('Technology Sector')
axis square
view(155,27)
In this case, the factor analysis includes three factors, and so the biplot is
three-dimensional. Each of the 10 stocks is represented in this plot by a vector,
and the direction and length of the vector indicates how each stock depends
on the underlying factors. For example, you have seen that after promax
rotation, the first four stocks have positive loadings on the first factor, and
unimportant loadings on the other two factors. That first factor, interpreted
as a financial sector effect, is represented in this biplot as one of the horizontal
axes. The dependence of those four stocks on that factor corresponds to the
four vectors directed approximately along that axis. Similarly, the dependence
of stocks 5, 6, and 7 primarily on the second factor, interpreted as a retail
sector effect, is represented by vectors directed approximately along that axis.
Each of the 100 observations is represented in this plot by a point, and their
locations indicate the score of each observation for the three factors. For
example, points near the top of this plot have the highest scores for the
technology sector factor. The points are scaled to fit within the unit square, so
only their relative locations can be determined from the plot.
You can use the Data Cursor tool from the Tools menu in the figure window
to identify the items in this plot. By clicking a stock (vector), you can read off
that stock’s loadings for each factor. By clicking an observation (point), you
can read off that observation’s scores for each factor.
11  Cluster Analysis
• “Introduction” on page 11-2
• “Hierarchical Clustering” on page 11-3
• “K-Means Clustering” on page 11-21
• “Gaussian Mixture Models” on page 11-28
Introduction
Cluster analysis, also called segmentation analysis or taxonomy analysis,
creates groups, or clusters, of data. Clusters are formed in such a way that
objects in the same cluster are very similar and objects in different clusters
are very distinct. Measures of similarity depend on the application.
“Hierarchical Clustering” on page 11-3 groups data over a variety of scales by
creating a cluster tree or dendrogram. The tree is not a single set of clusters,
but rather a multilevel hierarchy, where clusters at one level are joined
as clusters at the next level. This allows you to decide the level or scale
of clustering that is most appropriate for your application. The Statistics
Toolbox function clusterdata performs all of the necessary steps for you.
It incorporates the pdist, linkage, and cluster functions, which may be
used separately for more detailed analysis. The dendrogram function plots
the cluster tree.
“K-Means Clustering” on page 11-21 is a partitioning method. The function
kmeans partitions data into k mutually exclusive clusters, and returns
the index of the cluster to which it has assigned each observation. Unlike
hierarchical clustering, k-means clustering operates on actual observations
(rather than the larger set of dissimilarity measures), and creates a single
level of clusters. The distinctions mean that k-means clustering is often more
suitable than hierarchical clustering for large amounts of data.
“Gaussian Mixture Models” on page 11-28 form clusters by representing the
probability density function of observed variables as a mixture of multivariate
normal densities. Mixture models of the gmdistribution class use an
expectation maximization (EM) algorithm to fit data, which assigns posterior
probabilities to each component density with respect to each observation.
Clusters are assigned by selecting the component that maximizes the
posterior probability. Clustering using Gaussian mixture models is sometimes
considered a soft clustering method. The posterior probabilities for each
point indicate that each data point has some probability of belonging to
each cluster. Like k-means clustering, Gaussian mixture modeling uses an
iterative algorithm that converges to a local optimum. Gaussian mixture
modeling may be more appropriate than k-means clustering when clusters
have different sizes and correlation within them.
Hierarchical Clustering
In this section...
“Introduction” on page 11-3
“Algorithm Description” on page 11-3
“Similarity Measures” on page 11-4
“Linkages” on page 11-6
“Dendrograms” on page 11-8
“Verifying the Cluster Tree” on page 11-10
“Creating Clusters” on page 11-16
Introduction
Hierarchical clustering groups data over a variety of scales by creating a
cluster tree or dendrogram. The tree is not a single set of clusters, but rather
a multilevel hierarchy, where clusters at one level are joined as clusters at
the next level. This allows you to decide the level or scale of clustering that
is most appropriate for your application. The Statistics Toolbox function
clusterdata supports agglomerative clustering and performs all of the
necessary steps for you. It incorporates the pdist, linkage, and cluster
functions, which you can use separately for more detailed analysis. The
dendrogram function plots the cluster tree.
Algorithm Description
To perform agglomerative hierarchical cluster analysis on a data set using
Statistics Toolbox functions, follow this procedure:
1 Find the similarity or dissimilarity between every pair of objects
in the data set. In this step, you calculate the distance between objects
using the pdist function. The pdist function supports many different
ways to compute this measurement. See “Similarity Measures” on page
11-4 for more information.
2 Group the objects into a binary, hierarchical cluster tree. In this
step, you link pairs of objects that are in close proximity using the linkage
function. The linkage function uses the distance information generated in
step 1 to determine the proximity of objects to each other. As objects are
paired into binary clusters, the newly formed clusters are grouped into
larger clusters until a hierarchical tree is formed. See “Linkages” on page
11-6 for more information.
3 Determine where to cut the hierarchical tree into clusters. In this
step, you use the cluster function to prune branches off the bottom of
the hierarchical tree, and assign all the objects below each cut to a single
cluster. This creates a partition of the data. The cluster function can
create these clusters by detecting natural groupings in the hierarchical tree
or by cutting off the hierarchical tree at an arbitrary point.
The following sections provide more information about each of these steps.
Note The Statistics Toolbox function clusterdata performs all of the
necessary steps for you. You do not need to execute the pdist, linkage, or
cluster functions separately.
Similarity Measures
You use the pdist function to calculate the distance between every pair of
objects in a data set. For a data set made up of m objects, there are m*(m –
1)/2 pairs in the data set. The result of this computation is commonly known
as a distance or dissimilarity matrix.
There are many ways to calculate this distance information. By default, the
pdist function calculates the Euclidean distance between objects; however,
you can specify one of several other options. See pdist for more information.
Note You can optionally normalize the values in the data set before
calculating the distance information. In a real world data set, variables can
be measured against different scales. For example, one variable can measure
Intelligence Quotient (IQ) test scores and another variable can measure head
circumference. These discrepancies can distort the proximity calculations.
Using the zscore function, you can convert all the values in the data set to
use the same proportional scale. See zscore for more information.
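A minimal sketch of this preprocessing step (Xraw is an illustrative placeholder for your own data matrix):

Xn = zscore(Xraw);   % each column rescaled to mean 0, standard deviation 1
Y = pdist(Xn);       % distances computed on the standardized scale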
For example, consider a data set, X, made up of five objects where each object
is a set of x,y coordinates.
• Object 1: 1, 2
• Object 2: 2.5, 4.5
• Object 3: 2, 2
• Object 4: 4, 1.5
• Object 5: 4, 2.5
You can define this data set as a matrix
X = [1 2;2.5 4.5;2 2;4 1.5;4 2.5]
and pass it to pdist. The pdist function calculates the distance between
object 1 and object 2, object 1 and object 3, and so on until the distances
between all the pairs have been calculated. The following figure plots these
objects in a graph. The Euclidean distance between object 2 and object 3 is
shown to illustrate one interpretation of distance.
Distance Information
The pdist function returns this distance information in a vector, Y, where
each element contains the distance between a pair of objects.
Y = pdist(X)
Y =
  Columns 1 through 5
    2.9155    1.0000    3.0414    3.0414    2.5495
  Columns 6 through 10
    3.3541    2.5000    2.0616    2.0616    1.0000
To make it easier to see the relationship between the distance information
generated by pdist and the objects in the original data set, you can reformat
the distance vector into a matrix using the squareform function. In this
matrix, element i,j corresponds to the distance between object i and object j in
the original data set. In the following example, element 1,1 represents the
distance between object 1 and itself (which is zero). Element 1,2 represents
the distance between object 1 and object 2, and so on.
squareform(Y)
ans =
         0    2.9155    1.0000    3.0414    3.0414
    2.9155         0    2.5495    3.3541    2.5000
    1.0000    2.5495         0    2.0616    2.0616
    3.0414    3.3541    2.0616         0    1.0000
    3.0414    2.5000    2.0616    1.0000         0
Linkages
Once the proximity between objects in the data set has been computed, you
can determine how objects in the data set should be grouped into clusters,
using the linkage function. The linkage function takes the distance
information generated by pdist and links pairs of objects that are close
together into binary clusters (clusters made up of two objects). The linkage
function then links these newly formed clusters to each other and to other
objects to create bigger clusters until all the objects in the original data set
are linked together in a hierarchical tree.
For example, given the distance vector Y generated by pdist from the sample
data set of x- and y-coordinates, the linkage function generates a hierarchical
cluster tree, returning the linkage information in a matrix, Z.
Z = linkage(Y)
Z =
    4.0000    5.0000    1.0000
    1.0000    3.0000    1.0000
    6.0000    7.0000    2.0616
    2.0000    8.0000    2.5000
In this output, each row identifies a link between objects or clusters. The first
two columns identify the objects that have been linked. The third column
contains the distance between these objects. For the sample data set of x- and y-coordinates, the linkage function begins by grouping objects 4 and 5,
which have the closest proximity (distance value = 1.0000). The linkage
function continues by grouping objects 1 and 3, which also have a distance
value of 1.0000.
The third row indicates that the linkage function grouped objects 6 and 7. If
the original sample data set contained only five objects, what are objects 6
and 7? Object 6 is the newly formed binary cluster created by the grouping
of objects 4 and 5. When the linkage function groups two objects into a
new cluster, it must assign the cluster a unique index value, starting with
the value m+1, where m is the number of objects in the original data set.
(Values 1 through m are already used by the original data set.) Similarly,
object 7 is the cluster formed by grouping objects 1 and 3.
linkage uses distances to determine the order in which it clusters objects.
The distance vector Y contains the distances between the original objects 1
through 5. But linkage must also be able to determine distances involving
clusters that it creates, such as objects 6 and 7. By default, linkage uses a
method known as single linkage. However, there are a number of different
methods available. See the linkage reference page for more information.
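For instance, a minimal sketch of requesting a different method (using the distance vector Y from above; 'complete' is just one of the documented choices):

Zc = linkage(Y,'complete');   % furthest-distance (complete) linkage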
As the final cluster, the linkage function grouped object 8, the newly formed
cluster made up of objects 6 and 7, with object 2 from the original data set.
The following figure graphically illustrates the way linkage groups the
objects into a hierarchy of clusters.
Dendrograms
The hierarchical, binary cluster tree created by the linkage function is most
easily understood when viewed graphically. The Statistics Toolbox function
dendrogram plots the tree, as follows:
dendrogram(Z)
[Dendrogram of Z, with the leaves ordered 4, 5, 1, 3, 2 along the horizontal axis and link heights from 1 to 2.5 on the vertical axis.]
In the figure, the numbers along the horizontal axis represent the indices of
the objects in the original data set. The links between objects are represented
as upside-down U-shaped lines. The height of the U indicates the distance
between the objects. For example, the link representing the cluster containing
objects 1 and 3 has a height of 1. The link representing the cluster that groups
object 2 together with objects 1, 3, 4, and 5, (which are already clustered as
object 8) has a height of 2.5. The height represents the distance linkage
computes between objects 2 and 8. For more information about creating a
dendrogram diagram, see the dendrogram reference page.
Verifying the Cluster Tree
After linking the objects in a data set into a hierarchical cluster tree, you
might want to verify that the distances (that is, heights) in the tree reflect
the original distances accurately. In addition, you might want to investigate
natural divisions that exist among links between objects. Statistics Toolbox
functions are available for both of these tasks, as described in the following
sections:
• “Verifying Dissimilarity” on page 11-10
• “Verifying Consistency” on page 11-11
Verifying Dissimilarity
In a hierarchical cluster tree, any two objects in the original data set are
eventually linked together at some level. The height of the link represents
the distance between the two clusters that contain those two objects. This
height is known as the cophenetic distance between the two objects. One
way to measure how well the cluster tree generated by the linkage function
reflects your data is to compare the cophenetic distances with the original
distance data generated by the pdist function. If the clustering is valid, the
linking of objects in the cluster tree should have a strong correlation with
the distances between objects in the distance vector. The cophenet function
compares these two sets of values and computes their correlation, returning a
value called the cophenetic correlation coefficient. The closer the value of the
cophenetic correlation coefficient is to 1, the more accurately the clustering
solution reflects your data.
You can use the cophenetic correlation coefficient to compare the results of
clustering the same data set using different distance calculation methods or
clustering algorithms. For example, you can use the cophenet function to
evaluate the clusters created for the sample data set
c = cophenet(Z,Y)
c =
0.8615
where Z is the matrix output by the linkage function and Y is the distance
vector output by the pdist function.
Execute pdist again on the same data set, this time specifying the city block
metric. After running the linkage function on this new pdist output using
the average linkage method, call cophenet to evaluate the clustering solution.
Y = pdist(X,'cityblock');
Z = linkage(Y,'average');
c = cophenet(Z,Y)
c =
    0.9047
The cophenetic correlation coefficient shows that using a different distance
and linkage method creates a tree that represents the original distances
slightly better.
Verifying Consistency
One way to determine the natural cluster divisions in a data set is to compare
the height of each link in a cluster tree with the heights of neighboring links
below it in the tree.
A link that is approximately the same height as the links below it indicates
that there are no distinct divisions between the objects joined at this level of
the hierarchy. These links are said to exhibit a high level of consistency,
because the distance between the objects being joined is approximately the
same as the distances between the objects they contain.
On the other hand, a link whose height differs noticeably from the height of
the links below it indicates that the objects joined at this level in the cluster
tree are much farther apart from each other than their components were when
they were joined. This link is said to be inconsistent with the links below it.
In cluster analysis, inconsistent links can indicate the border of a natural
division in a data set. The cluster function uses a quantitative measure of
inconsistency to determine where to partition your data set into clusters.
The following dendrogram illustrates inconsistent links. Note how the objects
in the dendrogram fall into two groups that are connected by links at a much
higher level in the tree. These links are inconsistent when compared with the
links below them in the hierarchy.
[Dendrogram annotated to contrast links that show inconsistency when compared to the links below them with links that show consistency.]
The relative consistency of each link in a hierarchical cluster tree can be
quantified and expressed as the inconsistency coefficient. This value compares
the height of a link in a cluster hierarchy with the average height of links
below it. Links that join distinct clusters have a high inconsistency coefficient;
links that join indistinct clusters have a low inconsistency coefficient.
To generate a listing of the inconsistency coefficient for each link in the
cluster tree, use the inconsistent function. By default, the inconsistent
function compares each link in the cluster hierarchy with adjacent links that
are less than two levels below it in the cluster hierarchy. This is called the
depth of the comparison. You can also specify other depths. The objects at
the bottom of the cluster tree, called leaf nodes, that have no further objects
below them, have an inconsistency coefficient of zero. Clusters that join two
leaves also have a zero inconsistency coefficient.
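The depth is supplied as a second input argument; a minimal sketch that compares each link with links up to three levels below it:

I3 = inconsistent(Z,3);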
For example, you can use the inconsistent function to calculate the
inconsistency values for the links created by the linkage function in
“Linkages” on page 11-6.
I = inconsistent(Z)
I =
    1.0000         0    1.0000         0
    1.0000         0    1.0000         0
    1.3539    0.6129    3.0000    1.1547
    2.2808    0.3100    2.0000    0.7071
The inconsistent function returns data about the links in an (m-1)-by-4
matrix, whose columns are described in the following table.
Column    Description
  1       Mean of the heights of all the links included in the calculation
  2       Standard deviation of all the links included in the calculation
  3       Number of links included in the calculation
  4       Inconsistency coefficient
In the sample output, the first row represents the link between objects 4
and 5. This cluster is assigned the index 6 by the linkage function. Because
both 4 and 5 are leaf nodes, the inconsistency coefficient for the cluster is zero.
The second row represents the link between objects 1 and 3, both of which are
also leaf nodes. This cluster is assigned the index 7 by the linkage function.
The third row evaluates the link that connects these two clusters, objects 6
and 7. (This new cluster is assigned index 8 in the linkage output). Column 3
indicates that three links are considered in the calculation: the link itself and
the two links directly below it in the hierarchy. Column 1 represents the mean
of the heights of these links. The inconsistent function uses the height
information output by the linkage function to calculate the mean. Column 2
represents the standard deviation between the links. The last column contains
the inconsistency value for these links, 1.1547. It is the difference between
the current link height and the mean, normalized by the standard deviation:
(2.0616 - 1.3539) / .6129
ans =
1.1547
The following figure illustrates the links and heights included in this
calculation.
Note In the preceding figure, the lower limit on the y-axis is set to 0 to show
the heights of the links. To set the lower limit to 0, select Axes Properties
from the Edit menu, click the Y Axis tab, and enter 0 in the field immediately
to the right of Y Limits.
Row 4 in the output matrix describes the link between object 8 and object 2.
Column 3 indicates that two links are included in this calculation: the link
itself and the link directly below it in the hierarchy. The inconsistency
coefficient for this link is 0.7071.
The following figure illustrates the links and heights included in this
calculation.
Creating Clusters
After you create the hierarchical tree of binary clusters, you can prune the
tree to partition your data into clusters using the cluster function. The
cluster function lets you create clusters in two ways, as discussed in the
following sections:
• “Finding Natural Divisions in Data” on page 11-17
• “Specifying Arbitrary Clusters” on page 11-18
Finding Natural Divisions in Data
The hierarchical cluster tree may naturally divide the data into distinct,
well-separated clusters. This can be particularly evident in a dendrogram
diagram created from data where groups of objects are densely packed in
certain areas and not in others. The inconsistency coefficient of the links in
the cluster tree can identify these divisions where the similarities between
objects change abruptly. (See “Verifying the Cluster Tree” on page 11-10 for
more information about the inconsistency coefficient.) You can use this value
to determine where the cluster function creates cluster boundaries.
For example, if you use the cluster function to group the sample data set
into clusters, specifying an inconsistency coefficient threshold of 1.2 as the
value of the cutoff argument, the cluster function groups all the objects
in the sample data set into one cluster. In this case, none of the links in the
cluster hierarchy had an inconsistency coefficient greater than 1.2.
T = cluster(Z,'cutoff',1.2)
T =
1
1
1
1
1
The cluster function outputs a vector, T, that is the same size as the original
data set. Each element in this vector contains the number of the cluster into
which the corresponding object from the original data set was placed.
If you lower the inconsistency coefficient threshold to 0.8, the cluster
function divides the sample data set into three separate clusters.
T = cluster(Z,'cutoff',0.8)
T =
3
2
3
1
1
This output indicates that objects 1 and 3 were placed in cluster 1, objects 4
and 5 were placed in cluster 2, and object 2 was placed in cluster 3.
When clusters are formed in this way, the cutoff value is applied to the
inconsistency coefficient. These clusters may, but do not necessarily,
correspond to a horizontal slice across the dendrogram at a certain height.
If you want clusters corresponding to a horizontal slice of the dendrogram,
you can either use the criterion option to specify that the cutoff should be
based on distance rather than inconsistency, or you can specify the number of
clusters directly as described in the following section.
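A minimal sketch of the distance-based cutoff (the threshold value 1.5 is illustrative):

T = cluster(Z,'cutoff',1.5,'criterion','distance')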
Specifying Arbitrary Clusters
Instead of letting the cluster function create clusters determined by the
natural divisions in the data set, you can specify the number of clusters you
want created.
For example, you can specify that you want the cluster function to partition
the sample data set into two clusters. In this case, the cluster function
creates one cluster containing objects 1, 3, 4, and 5 and another cluster
containing object 2.
T = cluster(Z,'maxclust',2)
T =
2
1
2
2
2
To help you visualize how the cluster function determines these clusters, the
following figure shows the dendrogram of the hierarchical cluster tree. The
horizontal dashed line intersects two lines of the dendrogram, corresponding
to setting 'maxclust' to 2. These two lines partition the objects into two
clusters: the objects below the left-hand line, namely 1, 3, 4, and 5, belong to
one cluster, while the object below the right-hand line, namely 2, belongs to
the other cluster.
[Dendrogram with a horizontal dashed line corresponding to maxclust = 2.]
On the other hand, if you set 'maxclust' to 3, the cluster function groups
objects 4 and 5 in one cluster, objects 1 and 3 in a second cluster, and object 2
in a third cluster. The following command illustrates this.
T = cluster(Z,'maxclust',3)
T =
1
3
1
2
2
This time, the cluster function cuts off the hierarchy at a lower point,
corresponding to the horizontal line that intersects three lines of the
dendrogram in the following figure.
[Dendrogram with a horizontal dashed line corresponding to maxclust = 3.]
K-Means Clustering
In this section...
“Introduction” on page 11-21
“Creating Clusters and Determining Separation” on page 11-22
“Determining the Correct Number of Clusters” on page 11-23
“Avoiding Local Minima” on page 11-26
Introduction
K-means clustering is a partitioning method. The function kmeans partitions
data into k mutually exclusive clusters, and returns the index of the cluster
to which it has assigned each observation. Unlike hierarchical clustering,
k-means clustering operates on actual observations (rather than the larger
set of dissimilarity measures), and creates a single level of clusters. The
distinctions mean that k-means clustering is often more suitable than
hierarchical clustering for large amounts of data.
kmeans treats each observation in your data as an object having a location in
space. It finds a partition in which objects within each cluster are as close to
each other as possible, and as far from objects in other clusters as possible.
You can choose from five different distance measures, depending on the kind
of data you are clustering.
Each cluster in the partition is defined by its member objects and by its
centroid, or center. The centroid for each cluster is the point to which the sum
of distances from all objects in that cluster is minimized. kmeans computes
cluster centroids differently for each distance measure, to minimize the sum
with respect to the measure that you specify.
kmeans uses an iterative algorithm that minimizes the sum of distances from
each object to its cluster centroid, over all clusters. This algorithm moves
objects between clusters until the sum cannot be decreased further. The
result is a set of clusters that are as compact and well-separated as possible.
You can control the details of the minimization using several optional input
parameters to kmeans, including ones for the initial values of the cluster
centroids, and for the maximum number of iterations.
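A minimal sketch of supplying such options (X, k, and the starting-centroid matrix C0 are illustrative placeholders):

opts = statset('MaxIter',200);
idx = kmeans(X,k,'start',C0,'options',opts);   % C0 is k-by-p, one row per centroid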
Creating Clusters and Determining Separation
The following example explores possible clustering in four-dimensional data
by analyzing the results of partitioning the points into three, four, and five
clusters.
Note Because each part of this example generates random numbers
sequentially, i.e., without setting a new state, you must perform all steps
in sequence to duplicate the results shown. If you perform the steps out of
sequence, the answers will be essentially the same, but the intermediate
results, number of iterations, or ordering of the silhouette plots may differ.
First, load some data:
load kmeansdata;
size(X)
ans =
   560     4
Even though these data are four-dimensional, and cannot be easily visualized,
kmeans enables you to investigate whether a group structure exists in them.
Call kmeans with k, the desired number of clusters, equal to 3. For this
example, specify the city block distance measure, and use the default starting
method of initializing centroids from randomly selected data points:
idx3 = kmeans(X,3,'distance','city');
To get an idea of how well-separated the resulting clusters are, you can make
a silhouette plot using the cluster indices output from kmeans. The silhouette
plot displays a measure of how close each point in one cluster is to points in
the neighboring clusters. This measure ranges from +1, indicating points that
are very distant from neighboring clusters, through 0, indicating points that
are not distinctly in one cluster or another, to -1, indicating points that are
probably assigned to the wrong cluster. silhouette returns these values in
its first output:
[silh3,h] = silhouette(X,idx3,'city');
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('Silhouette Value')
ylabel('Cluster')
From the silhouette plot, you can see that most points in the third cluster
have a large silhouette value, greater than 0.6, indicating that the cluster is
somewhat separated from neighboring clusters. However, the first cluster
contains many points with low silhouette values, and the second contains a
few points with negative values, indicating that those two clusters are not
well separated.
Determining the Correct Number of Clusters
Increase the number of clusters to see if kmeans can find a better grouping
of the data. This time, use the optional 'display' parameter to print
information about each iteration:
idx4 = kmeans(X,4, 'dist','city', 'display','iter');
  iter  phase     num          sum
     1      1     560      2897.56
     2      1      53      2736.67
     3      1      50      2476.78
     4      1     102      1779.68
     5      1       5       1771.1
     6      2       0       1771.1
6 iterations, total sum of distances = 1771.1
Notice that the total sum of distances decreases at each iteration as kmeans
reassigns points between clusters and recomputes cluster centroids. In this
case, the second phase of the algorithm did not make any reassignments,
indicating that the first phase reached a minimum after five iterations. In
some problems, the first phase might not reach a minimum, but the second
phase always will.
A silhouette plot for this solution indicates that these four clusters are better
separated than the three in the previous solution:
[silh4,h] = silhouette(X,idx4,'city');
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('Silhouette Value')
ylabel('Cluster')
A more quantitative way to compare the two solutions is to look at the average
silhouette values for the two cases:
mean(silh3)
ans =
0.52594
mean(silh4)
ans =
0.63997
Finally, try clustering the data using five clusters:
idx5 = kmeans(X,5,'dist','city','replicates',5);
[silh5,h] = silhouette(X,idx5,'city');
set(get(gca,'Children'),'FaceColor',[.8 .8 1])
xlabel('Silhouette Value')
ylabel('Cluster')
mean(silh5)
ans =
0.52657
This silhouette plot indicates that this is probably not the right number of
clusters, since two of the clusters contain points with mostly low silhouette
values. Without some knowledge of how many clusters are really in the data,
it is a good idea to experiment with a range of values for k.
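One illustrative way to do that, not part of the original example, is to compute the average silhouette value over a range of k:

avgsilh = zeros(4,1);
for k = 2:5
    idx = kmeans(X,k,'dist','city','replicates',5);
    s = silhouette(X,idx,'city');   % returns the values without plotting
    avgsilh(k-1) = mean(s);
end
avgsilh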
Avoiding Local Minima
Like many other types of numerical minimizations, the solution that kmeans
reaches often depends on the starting points. It is possible for kmeans to
reach a local minimum, where reassigning any one point to a new cluster
would increase the total sum of point-to-centroid distances, but where a
better solution does exist. However, you can use the optional 'replicates'
parameter to overcome that problem.
For four clusters, specify five replicates, and use the 'display' parameter to
print out the final sum of distances for each of the solutions.
[idx4,cent4,sumdist] = kmeans(X,4,'dist','city',...
'display','final','replicates',5);
17 iterations, total sum of distances = 2303.36
5 iterations, total sum of distances = 1771.1
6 iterations, total sum of distances = 1771.1
5 iterations, total sum of distances = 1771.1
8 iterations, total sum of distances = 2303.36
The output shows that, even for this relatively simple problem, non-global
minima do exist. Each of these five replicates began from a different randomly
selected set of initial centroids, and kmeans found two different local minima.
However, the final solution that kmeans returns is the one with the lowest
total sum of distances, over all replicates.
sum(sumdist)
ans =
1771.1
Gaussian Mixture Models
In this section...
“Introduction” on page 11-28
“Clustering with Gaussian Mixtures” on page 11-28
Introduction
Gaussian mixture models are formed by combining multivariate normal
density components. For information on individual multivariate normal
densities, see “Multivariate Normal Distribution” on page B-58 and related
distribution functions listed under “Multivariate Distributions” on page 5-8.
In Statistics Toolbox software, use the gmdistribution class to fit data
using an expectation maximization (EM) algorithm, which assigns posterior
probabilities to each component density with respect to each observation.
Gaussian mixture models are often used for data clustering. Clusters are
assigned by selecting the component that maximizes the posterior probability.
Like k-means clustering, Gaussian mixture modeling uses an iterative
algorithm that converges to a local optimum. Gaussian mixture modeling may
be more appropriate than k-means clustering when clusters have different
sizes and correlation within them. Clustering using Gaussian mixture models
is sometimes considered a soft clustering method. The posterior probabilities
for each point indicate that each data point has some probability of belonging
to each cluster.
Creation of Gaussian mixture models is described in the “Gaussian Mixture
Models” on page 5-99 section of Chapter 5, “Probability Distributions”. This
section describes their application in cluster analysis.
Clustering with Gaussian Mixtures
Gaussian mixture distributions can be used for clustering data, by realizing
that the multivariate normal components of the fitted model can represent
clusters.
1 To demonstrate the process, first generate some simulated data from a
mixture of two bivariate Gaussian distributions using the mvnrnd function:
mu1 = [1 2];
sigma1 = [3 .2; .2 2];
mu2 = [-1 -2];
sigma2 = [2 0; 0 1];
X = [mvnrnd(mu1,sigma1,200);mvnrnd(mu2,sigma2,100)];
scatter(X(:,1),X(:,2),10,'ko')
2 Fit a two-component Gaussian mixture distribution. Here, you know
the correct number of components to use. In practice, with real data,
this decision would require comparing models with different numbers of
components.
options = statset('Display','final');
gm = gmdistribution.fit(X,2,'Options',options);
This displays
49 iterations, log-likelihood = -1207.91
3 Plot the estimated probability density contours for the two-component
mixture distribution. The two bivariate normal components overlap, but
their peaks are distinct. This suggests that the data could reasonably be
divided into two clusters:
hold on
ezcontour(@(x,y)pdf(gm,[x y]),[-8 6],[-8 6]);
hold off
4 Partition the data into clusters using the cluster method for the fitted
mixture distribution. The cluster method assigns each point to one of the
two components in the mixture distribution.
idx = cluster(gm,X);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(X(cluster1,1),X(cluster1,2),10,'r+');
hold on
scatter(X(cluster2,1),X(cluster2,2),10,'bo');
hold off
legend('Cluster 1','Cluster 2','Location','NW')
Each cluster corresponds to one of the bivariate normal components in
the mixture distribution. cluster assigns points to clusters based on the
estimated posterior probability that a point came from a component; each
point is assigned to the cluster corresponding to the highest posterior
probability. The posterior method returns those posterior probabilities.
For example, plot the posterior probability of the first component for each
point:
P = posterior(gm,X);
scatter(X(cluster1,1),X(cluster1,2),10,P(cluster1,1),'+')
hold on
scatter(X(cluster2,1),X(cluster2,2),10,P(cluster2,1),'o')
hold off
legend('Cluster 1','Cluster 2','Location','NW')
clrmap = jet(80); colormap(clrmap(9:72,:))
ylabel(colorbar,'Component 1 Posterior Probability')
Soft Clustering Using Gaussian Mixture Distributions
An alternative to the previous example is to use the posterior probabilities for
"soft clustering". Each point is assigned a membership score to each cluster.
Membership scores are simply the posterior probabilities, and describe
how similar each point is to each cluster’s archetype, i.e., the mean of the
corresponding component. The points can be ranked by their membership
score in a given cluster:
[~,order] = sort(P(:,1));
plot(1:size(X,1),P(order,1),'r-',1:size(X,1),P(order,2),'b-');
legend({'Cluster 1 Score' 'Cluster 2 Score'},'location','NW');
ylabel('Cluster Membership Score');
xlabel('Point Ranking');
Although a clear separation is hard to see in a scatter plot of the data, plotting
the membership scores indicates that the fitted distribution does a good job of
separating the data into groups. Very few points have scores close to 0.5.
Soft clustering using a Gaussian mixture distribution is similar to fuzzy
K-means clustering, which also assigns each point to each cluster with a
membership score. The fuzzy K-means algorithm assumes that clusters are
roughly spherical in shape, and all of roughly equal size. This is comparable
to a Gaussian mixture distribution with a single covariance matrix that is
shared across all components, and is a multiple of the identity matrix. In
contrast, gmdistribution allows you to specify different covariance options.
The default is to estimate a separate, unconstrained covariance matrix for
each component. A more restricted option, closer to K-means, would be to
estimate a shared, diagonal covariance matrix:
gm2 = gmdistribution.fit(X,2,'CovType','Diagonal',...
'SharedCov',true);
This covariance option is similar to fuzzy K-means clustering, but provides
more flexibility by allowing unequal variances for different variables.
You can compute the soft cluster membership scores without computing hard
cluster assignments, using posterior, or as part of hard clustering, as the
second output from cluster:
P2 = posterior(gm2,X); % equivalently [idx,P2] = cluster(gm2,X)
[~,order] = sort(P2(:,1));
plot(1:size(X,1),P2(order,1),'r-',1:size(X,1),P2(order,2),'b-');
legend({'Cluster 1 Score' 'Cluster 2 Score'},'location','NW');
ylabel('Cluster Membership Score');
xlabel('Point Ranking');
Assigning New Data to Clusters
In the previous example, fitting the mixture distribution to data using fit,
and clustering those data using cluster, are separate steps. However, the
same data are used in both steps. You can also use the cluster method to
assign new data points to the clusters (mixture components) found in the
original data.
1 Given a data set X, first fit a Gaussian mixture distribution. The previous
code has already done that.
gm

gm =
Gaussian mixture distribution with 2 components in 2 dimensions
Component 1:
Mixing proportion: 0.312592
Mean:   -0.9082   -2.1109

Component 2:
Mixing proportion: 0.687408
Mean:    0.9532    1.8940
2 You can then use cluster to assign each point in a new data set, Y, to one
of the clusters defined for the original data:
Y = [mvnrnd(mu1,sigma1,50);mvnrnd(mu2,sigma2,25)];
idx = cluster(gm,Y);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(Y(cluster1,1),Y(cluster1,2),10,'r+');
hold on
scatter(Y(cluster2,1),Y(cluster2,2),10,'bo');
hold off
legend('Class 1','Class 2','Location','NW')
As with the previous example, the posterior probabilities for each point can
be treated as membership scores rather than determining "hard" cluster
assignments.
For cluster to provide meaningful results with new data, Y should come
from the same population as X, the original data used to create the mixture
distribution. In particular, the estimated mixing probabilities for the
Gaussian mixture distribution fitted to X are used when computing the
posterior probabilities for Y.
12
Parametric Classification
• “Introduction” on page 12-2
• “Discriminant Analysis” on page 12-3
• “Naive Bayes Classification” on page 12-6
• “Performance Curves” on page 12-9
Introduction
Models of data with a categorical response are called classifiers. A classifier is
built from training data, for which classifications are known. The classifier
assigns new test data to one of the categorical levels of the response.
Parametric methods, like “Discriminant Analysis” on page 12-3, fit a
parametric model to the training data and interpolate to classify test data.
Nonparametric methods, like “Classification Trees and Regression Trees”
on page 13-25, use other means to determine classifications. In this sense,
classification methods are analogous to the methods discussed in “Nonlinear
Regression” on page 9-58.
Discriminant Analysis
In this section...
“Introduction” on page 12-3
“Example: Discriminant Analysis” on page 12-3
Introduction
Discriminant analysis uses training data to estimate the parameters of
discriminant functions of the predictor variables. Discriminant functions
determine boundaries in predictor space between various classes. The
resulting classifier discriminates among the classes (the categorical levels of
the response) based on the predictor data.
The Statistics Toolbox function classify performs discriminant analysis.
Example: Discriminant Analysis
1 For training data, use Fisher’s sepal measurements for iris versicolor and
virginica:
load fisheriris
SL = meas(51:end,1);
SW = meas(51:end,2);
group = species(51:end);
h1 = gscatter(SL,SW,group,'rb','v^',[],'off');
set(h1,'LineWidth',2)
legend('Fisher versicolor','Fisher virginica',...
'Location','NW')
2 Classify a grid of measurements on the same scale, using classify:
[X,Y] = meshgrid(linspace(4.5,8),linspace(2,4));
X = X(:); Y = Y(:);
[C,err,P,logp,coeff] = classify([X Y],[SL SW],...
group,'quadratic');
3 Visualize the classification:
hold on;
gscatter(X,Y,C,'rb','.',1,'off');
K = coeff(1,2).const;
L = coeff(1,2).linear;
Q = coeff(1,2).quadratic;
% Plot the curve K + [x,y]*L + [x,y]*Q*[x,y]' = 0:
f = @(x,y) K + L(1)*x + L(2)*y + Q(1,1)*x.^2 + ...
(Q(1,2)+Q(2,1))*x.*y + Q(2,2)*y.^2
h2 = ezplot(f,[4.5 8 2 4]);
set(h2,'Color','m','LineWidth',2)
axis([4.5 8 2 4])
xlabel('Sepal Length')
ylabel('Sepal Width')
title('{\bf Classification with Fisher Training Data}')
Naive Bayes Classification
The Naive Bayes classifier is designed for use when features are independent
of one another within each class, but it appears to work well in practice
even when that independence assumption is not valid. It classifies data in
two steps:
1 Training step: Using the training samples, the method estimates
the parameters of a probability distribution, assuming features are
conditionally independent given the class.
2 Prediction step: For any unseen test sample, the method computes the
posterior probability of that sample belonging to each class. The method
then classifies the test sample according to the largest posterior probability.
The class-conditional independence assumption greatly simplifies the training
step since you can estimate the one-dimensional class-conditional density
for each feature individually. While the class-conditional independence
between features is not true in general, research shows that this optimistic
assumption works well in practice. This assumption of class independence
allows the Naive Bayes classifier to better estimate the parameters required
for accurate classification while using less training data than many other
classifiers. This makes it particularly effective for datasets containing many
predictors or features.
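As a minimal sketch of these two steps (not part of the original example set), the following code fits a Naive Bayes classifier to the Fisher iris measurements with the default normal distribution for each feature, and then predicts class labels; in practice you would predict on held-out test data rather than the training set:

load fisheriris                        % meas: 150-by-4 features, species: class labels
nb = NaiveBayes.fit(meas,species);     % training step: one normal density per feature and class
cpred = predict(nb,meas);              % prediction step: pick the class with the largest posterior
resubErr = sum(~strcmp(cpred,species))/numel(species)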
Supported Distributions
Naive Bayes classification is based on estimating P(X|Y), the probability or
probability density of features X given class Y. The Naive Bayes classification
object NaiveBayes provides support for normal (Gaussian), kernel,
multinomial, and multivariate multinomial distributions. It is possible to use
different distributions for different features.
Normal (Gaussian) Distribution
The 'normal' distribution is appropriate for features that have normal
distributions in each class. For each feature you model with a normal
distribution, the Naive Bayes classifier estimates a separate normal
distribution for each class by computing the mean and standard deviation of
the training data in that class. For more information on normal distributions,
see “Normal Distribution” on page B-83.
Kernel Distribution
The 'kernel' distribution is appropriate for features that have a continuous
distribution. It does not require a strong assumption such as a normal
distribution and you can use it in cases where the distribution of a feature may
be skewed or have multiple peaks or modes. It requires more computing time
and more memory than the normal distribution. For each feature you model
with a kernel distribution, the Naive Bayes classifier computes a separate
kernel density estimate for each class based on the training data for that class.
By default the kernel is the normal kernel, and the classifier selects a width
automatically for each class and feature. It is possible to specify different
kernels for each feature, and different widths for each feature or class.
Multinomial Distribution
The multinomial distribution (specify with the 'mn' keyword) is appropriate
when all features represent counts of a set of words or tokens. This is
sometimes called the "bag of words" model. For example, an e-mail spam
classifier might be based on features that count the number of occurrences
of various tokens in an e-mail. One feature might count the number of
exclamation points, another might count the number of times the word
"money" appears, and another might count the number of times the recipient’s
name appears. This is a Naive Bayes model under the further assumption
that the total number of tokens (or the total document length) is independent
of response class.
For the multinomial option, each feature represents the count of one token.
The classifier counts the set of relative token probabilities separately for
each class. The classifier defines the multinomial distribution for each row
by the vector of probabilities for the corresponding class, and by N, the total
token count for that row.
Classification is based on the relative frequencies of the tokens. For a row in
which no token appears, N is 0 and no classification is possible. This classifier
is not appropriate when the total number of tokens provides information
about the response class.
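As an illustrative sketch only (the token counts and labels below are made up, not toolbox data), you could fit a multinomial Naive Bayes model by passing the 'mn' keyword through the 'Distribution' parameter:

Xtok = [5 0 1; 0 3 4; 6 1 0; 1 4 3];           % hypothetical documents: rows of token counts
grp  = {'spam';'ham';'spam';'ham'};            % hypothetical class labels
nbmn = NaiveBayes.fit(Xtok,grp,'Distribution','mn');
predict(nbmn,[4 0 2])                          % classify a new document by its token counts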
Multivariate Multinomial Distribution
The multivariate multinomial distribution (specify with the 'mvmn' keyword)
is appropriate for categorical features. For example, you could fit a feature
describing the weather in categories such as rain/sun/snow/clouds using the
multivariate multinomial model. The feature categories are sometimes called
the feature levels, and differ from the class levels for the response variable.
For each feature you model with a multivariate multinomial distribution, the
Naive Bayes classifier computes a separate set of probabilities for the set of
feature levels for each class.
Performance Curves
In this section...
“Introduction” on page 12-9
“What are ROC Curves?” on page 12-9
“Evaluating Classifier Performance Using perfcurve” on page 12-9
Introduction
After a classification algorithm such as NaiveBayes or TreeBagger has
trained on data, you may want to examine the performance of the algorithm
on a specific test dataset. One common way of doing this would be to compute
a gross measure of performance, such as quadratic loss or accuracy, averaged
over the entire test dataset.
What are ROC Curves?
You may want to inspect the classifier performance more closely, for
example, by plotting a Receiver Operating Characteristic (ROC) curve. By
definition, a ROC curve [1,2] shows true positive rate versus false positive
rate (equivalently, sensitivity versus 1–specificity) for different thresholds of
the classifier output. You can use it, for example, to find the threshold that
maximizes the classification accuracy or to assess, in more broad terms, how
the classifier performs in the regions of high sensitivity and high specificity.
Evaluating Classifier Performance Using perfcurve
perfcurve computes measures for a plot of classifier performance. You can
use this utility to evaluate classifier performance on test data after you train
the classifier. Various measures such as mean squared error, classification
error, or exponential loss can summarize the predictive power of a classifier
in a single number. However, a performance curve offers more information
as it lets you explore the classifier performance across a range of thresholds
on its output.
You can use perfcurve with any classifier or, more broadly, with any method
that returns a numeric score for an instance of input data. By convention
adopted here,
• A high score returned by a classifier for any given instance signifies that
the instance is likely from the positive class.
• A low score signifies that the instance is likely from the negative classes.
For some classifiers, you can interpret the score as the posterior probability
of observing an instance of the positive class at point X. An example of such
a score is the fraction of positive observations in a leaf of a decision tree. In
this case, scores fall into the range from 0 to 1 and scores from positive and
negative classes add up to unity. Other methods can return scores ranging
between minus and plus infinity, without any obvious mapping from the
score to the posterior class probability.
perfcurve does not impose any requirements on the input score range.
Because of this lack of normalization, you can use perfcurve to process scores
returned by any classification, regression, or fit method. perfcurve does
not make any assumptions about the nature of input scores or relationships
between the scores for different classes. As an example, consider a problem
with three classes, A, B, and C, and assume that the scores returned by some
classifier for two instances are as follows:
                 A      B      C
instance 1     0.4    0.5    0.1
instance 2     0.4    0.1    0.5
If you want to compute a performance curve for separation of classes A and B,
with C ignored, you need to address the ambiguity in selecting A over B. You
could opt to use the score ratio, s(A)/s(B), or score difference, s(A)-s(B);
this choice could depend on the nature of these scores and their normalization.
perfcurve always takes one score per instance. If you only supply scores for
class A, perfcurve does not distinguish between observations 1 and 2. The
performance curve in this case may not be optimal.
perfcurve is intended for use with classifiers that return scores, not those
that return only predicted classes. As a counter-example, consider a decision
tree that returns only hard classification labels, 0 or 1, for data with two
classes. In this case, the performance curve reduces to a single point because
classified instances can be split into positive and negative categories in one
way only.
For input, perfcurve takes true class labels for some data and scores assigned
by a classifier to these data. By default, this utility computes a Receiver
Operating Characteristic (ROC) curve and returns values of 1–specificity,
or false positive rate, for X and sensitivity, or true positive rate, for Y. You
can choose other criteria for X and Y by selecting one out of several provided
criteria or specifying an arbitrary criterion through an anonymous function.
You can display the computed performance curve using plot(X,Y).
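For example, the following sketch scores the ionosphere data with a classification tree and plots the resulting ROC curve; the classifier, the use of resubstitution scores, and the choice of 'g' as the positive class are illustrative choices, not requirements:

load ionosphere                                  % X: predictors, Y: class labels 'b' or 'g'
tree = ClassificationTree.fit(X,Y);
[~,score] = predict(tree,X);                     % score columns follow the order of tree.ClassNames
[Xroc,Yroc,T,AUC] = perfcurve(Y,score(:,2),'g'); % column 2 holds the scores for class 'g'
plot(Xroc,Yroc)
xlabel('False positive rate'); ylabel('True positive rate')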
perfcurve can compute values for various criteria to plot either on the x- or
the y-axis. All such criteria are described by a 2-by-2 confusion matrix, a
2-by-2 cost matrix, and a 2-by-1 vector of scales applied to class counts.
The confusion matrix, C, is defined as

C = \begin{pmatrix} TP & FN \\ FP & TN \end{pmatrix}

where
• P stands for "positive".
• N stands for "negative".
• T stands for "true".
• F stands for "false".
For example, the first row of the confusion matrix defines how the classifier
identifies instances of the positive class: C(1,1) is the count of correctly
identified positive instances and C(1,2) is the count of positive instances
misidentified as negative.
The cost matrix defines the cost of misclassification for each category:
\begin{pmatrix} Cost(P|P) & Cost(N|P) \\ Cost(P|N) & Cost(N|N) \end{pmatrix}
where Cost(I|J) is the cost of assigning an instance of class J to class I.
Usually Cost(I|J)=0 for I=J. For flexibility, perfcurve allows you to specify
nonzero costs for correct classification as well.
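Continuing the earlier ROC sketch, one way to pass such a cost matrix is through the 'Cost' parameter, here combined with the expected-cost criterion on the y-axis; the particular cost values are arbitrary and purely illustrative:

cost = [0 2; 1 0];     % [Cost(P|P) Cost(N|P); Cost(P|N) Cost(N|N)]: false negatives cost double
[Xfpr,Yec] = perfcurve(Y,score(:,2),'g','Cost',cost,'YCrit','ecost');
plot(Xfpr,Yec)
xlabel('False positive rate'); ylabel('Expected cost')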
The two scales include prior information about class probabilities.
perfcurve computes these scales by taking scale(P)=prior(P)*N and
scale(N)=prior(N)*P and normalizing the sum scale(P)+scale(N)
to 1. P=TP+FN and N=TN+FP are the total instance counts in the positive
and negative class, respectively. The function then applies the scales as
multiplicative factors to the counts from the corresponding class: perfcurve
multiplies counts from the positive class by scale(P) and counts from the
negative class by scale(N). Consider, for example, computation of positive
predictive value, PPV = TP/(TP+FP). TP counts come from the positive class
and FP counts come from the negative class. Therefore, you need to scale TP
by scale(P) and FP by scale(N), and the modified formula for PPV with prior
probabilities taken into account is now:
PPV = \frac{scale(P) \cdot TP}{scale(P) \cdot TP + scale(N) \cdot FP}
If all scores in the data are above a certain threshold, perfcurve classifies all
instances as 'positive'. This means that TP is the total number of instances
in the positive class and FP is the total number of instances in the negative
class. In this case, PPV is simply given by the prior:
PPV = \frac{prior(P)}{prior(P) + prior(N)}
The perfcurve function returns two vectors, X and Y, of performance
measures. Each measure is some function of confusion, cost, and scale
values. You can request specific measures by name or provide a function
handle to compute a custom measure. The function you provide should take
confusion, cost, and scale as its three inputs and return a vector of output
values.
The criterion for X must be a monotone function of the positive classification
count, or equivalently, threshold for the supplied scores. If perfcurve cannot
perform a one-to-one mapping between values of the X criterion and score
thresholds, it exits with an error message.
By default, perfcurve computes values of the X and Y criteria for all possible
score thresholds. Alternatively, it can compute a reduced number of specific X
values supplied as an input argument. In either case, for M requested values,
perfcurve computes M+1 values for X and Y. The first value out of these M+1
values is special. perfcurve computes it by setting the TP instance count
to zero and setting TN to the total count in the negative class. This value
corresponds to the 'reject all' threshold. On a standard ROC curve, this
translates into an extra point placed at (0,0).
If there are NaN values among input scores, perfcurve can process them
in either of two ways:
• It can discard rows with NaN scores.
• It can add them to false classification counts in the respective class.
That is, for any threshold, instances with NaN scores from the positive class
are counted as false negative (FN), and instances with NaN scores from the
negative class are counted as false positive (FP). In this case, the first value
of X or Y is computed by setting TP to zero and setting TN to the total count
minus the NaN count in the negative class. For illustration, consider an
example with two rows in the positive and two rows in the negative class,
each pair having a NaN score:
Class       Score
Negative    0.2
Negative    NaN
Positive    0.7
Positive    NaN
If you discard rows with NaN scores, then as the score cutoff varies, perfcurve
computes performance measures as in the following table. For example, a
cutoff of 0.5 corresponds to the middle row where rows 1 and 3 are classified
correctly, and rows 2 and 4 are omitted.
TP    FN    FP    TN
 0     1     0     1
 1     0     0     1
 1     0     1     0
If you add rows with NaN scores to the false category in their respective
classes, perfcurve computes performance measures as in the following table.
For example, a cutoff of 0.5 corresponds to the middle row where now rows
2 and 4 are counted as incorrectly classified. Notice that only the FN and FP
columns differ between these two tables.
TP    FN    FP    TN
 0     2     1     1
 1     1     1     1
 1     1     2     0
For data with three or more classes, perfcurve takes one positive class and a
list of negative classes for input. The function computes the X and Y values
using counts in the positive class to estimate TP and FN, and using counts in
all negative classes to estimate TN and FP. perfcurve can optionally compute
Y values for each negative class separately and, in addition to Y, return a
matrix of size M-by-C, where M is the number of elements in X or Y and C is
the number of negative classes. You can use this functionality to monitor
components of the negative class contribution. For example, you can plot TP
counts on the X-axis and FP counts on the Y-axis. In this case, the returned
matrix shows how the FP component is split across negative classes.
You can also use perfcurve to estimate confidence intervals. perfcurve
computes confidence bounds using either cross-validation or bootstrap. If you
supply cell arrays for labels and scores, perfcurve uses cross-validation
and treats elements in the cell arrays as cross-validation folds. If you set
input parameter NBoot to a positive integer, perfcurve generates nboot
bootstrap replicas to compute pointwise confidence bounds.
perfcurve estimates the confidence bounds using one of two methods:
• Vertical averaging (VA) — estimate confidence bounds on Y and T at
fixed values of X. Use the XVals input parameter to use this method for
computing confidence bounds.
• Threshold averaging (TA) — estimate confidence bounds for X and Y at
fixed thresholds for the positive class score. Use the TVals input parameter
to use this method for computing confidence bounds.
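A hedged sketch of bootstrap confidence bounds, again reusing the earlier tree scores: with 'NBoot' set and the X values fixed (vertical averaging), the returned Y carries the pointwise estimate together with its lower and upper bounds:

[Xb,Yb] = perfcurve(Y,score(:,2),'g','NBoot',1000,'XVals',0:0.05:1);
plot(Xb,Yb)            % ROC estimate plus lower and upper pointwise confidence bounds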
To use observation weights instead of observation counts, you can use
the 'Weights' parameter in your call to perfcurve. When you use this
parameter, to compute X, Y and T or to compute confidence bounds by
cross-validation, perfcurve uses your supplied observation weights instead of
observation counts. To compute confidence bounds by bootstrap, perfcurve
samples N out of N with replacement using your weights as multinomial
sampling probabilities.
13
Supervised Learning
• “Supervised Learning (Machine Learning) Workflow and Algorithms” on
page 13-2
• “Classification Using Nearest Neighbors” on page 13-8
• “Classification Trees and Regression Trees” on page 13-25
• “Ensemble Methods” on page 13-50
• “Bibliography” on page 13-130
Supervised Learning (Machine Learning) Workflow and
Algorithms
In this section...
“Steps in Supervised Learning (Machine Learning)” on page 13-2
“Characteristics of Algorithms” on page 13-6
Steps in Supervised Learning (Machine Learning)
Supervised learning (machine learning) takes a known set of input data
and known responses to the data, and seeks to build a predictor model that
generates reasonable predictions for the response to new data.
[Figure: supervised learning workflow. Step 1: known data and known responses are used to build a model. Step 2: the model is applied to new data to produce predicted responses.]
For example, suppose you want to predict if someone will have a heart attack
within a year. You have a set of data on previous people, including their
ages, weight, height, blood pressure, etc. You know if the previous people had
heart attacks within a year of their data measurements. So the problem is
combining all the existing data into a model that can predict whether a new
person will have a heart attack within a year.
Supervised learning splits into two broad categories:
• Classification for responses that can have just a few known values, such
as 'true' or 'false'. Classification algorithms apply to nominal, not
ordinal response values.
• Regression for responses that are a real number, such as miles per gallon
for a particular car.
You can have trouble deciding whether you have a classification problem or a
regression problem. In that case, create a regression model first—regression
models are often more computationally efficient.
While there are many Statistics Toolbox algorithms for supervised learning,
most use the same basic workflow for obtaining a predictor model:
1 “Prepare Data” on page 13-3
2 “Choose an Algorithm” on page 13-4
3 “Fit a Model” on page 13-4
4 “Choose a Validation Method” on page 13-5
5 “Examine Fit; Update Until Satisfied” on page 13-5
6 “Use Fitted Model for Predictions” on page 13-6
Prepare Data
All supervised learning methods start with an input data matrix, usually
called X in this documentation. Each row of X represents one observation.
Each column of X represents one variable, or predictor. Represent missing
entries with NaN values in X. Statistics Toolbox supervised learning algorithms
can handle NaN values, either by ignoring them or by ignoring any row with
a NaN value.
You can use various data types for response data Y. Each element in Y
represents the response to the corresponding row of X. Observations with
missing Y data are ignored.
• For regression, Y must be a numeric vector with the same number of
elements as the number of rows of X.
• For classification, Y can be any of these data types. The table also contains
the method of including missing entries.
Data Type                 Missing Entry
Numeric vector            NaN
Categorical vector        <undefined>
Character array           Row of spaces
Cell array of strings     ''
Logical vector            (not possible to represent)
Choose an Algorithm
There are tradeoffs between several characteristics of algorithms, such as:
• Speed of training
• Memory utilization
• Predictive accuracy on new data
• Transparency or interpretability, meaning how easily you can understand
the reasons an algorithm makes its predictions
Details of the algorithms appear in “Characteristics of Algorithms” on page
13-6. More detail about ensemble algorithms is in “Choose an Applicable
Ensemble Method” on page 13-53.
Fit a Model
The fitting function you use depends on the algorithm you choose.
• For classification trees or regression trees, use ClassificationTree.fit
or RegressionTree.fit.
• For classification or regression trees using an older toolbox function, use
classregtree.
• For classification or regression ensembles, use fitensemble.
• For classification or regression ensembles in parallel, or to use specialized
TreeBagger functionality such as outlier detection, use TreeBagger.
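For instance, here is a brief sketch of two of these calls on the ionosphere data; the ensemble method ('AdaBoostM1') and the number of learners (100) are illustrative choices:

load ionosphere                                     % X: predictors, Y: class labels
ctree = ClassificationTree.fit(X,Y);                % a single classification tree
ens   = fitensemble(X,Y,'AdaBoostM1',100,'Tree');   % a boosted ensemble of 100 trees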
Choose a Validation Method
The three main methods for examining the accuracy of the resulting fitted
model are:
• Examine resubstitution error. For examples, see:
  - “Example: Resubstitution Error of a Classification Tree” on page 13-33
  - “Example: Cross Validating a Regression Tree” on page 13-34
  - “Example: Test Ensemble Quality” on page 13-59
• Examine the cross-validation error. For examples, see:
  - “Example: Cross Validating a Regression Tree” on page 13-34
  - “Example: Test Ensemble Quality” on page 13-59
  - “Example: Classification with Many Categorical Levels” on page 13-71
• Examine the out-of-bag error for bagged decision trees. For examples, see:
  - “Example: Test Ensemble Quality” on page 13-59
  - “Workflow Example: Classifying Radar Returns for Ionosphere Data with TreeBagger” on page 13-106
  - “Workflow Example: Regression of Insurance Risk Rating for Car Imports with TreeBagger” on page 13-97
Examine Fit; Update Until Satisfied
After validating the model, you might want to change it for better accuracy,
better speed, or to use less memory.
• Change fitting parameters to try to get a more accurate model. For examples, see:
  - “Example: Tuning RobustBoost” on page 13-92
  - “Example: Unequal Classification Costs” on page 13-66
• Change fitting parameters to try to get a smaller model. This sometimes gives a model with more accuracy. For examples, see:
  - “Example: Selecting Appropriate Tree Depth” on page 13-35
  - “Example: Pruning a Classification Tree” on page 13-38
  - “Example: Surrogate Splits” on page 13-76
  - “Workflow Example: Classifying Radar Returns for Ionosphere Data with TreeBagger” on page 13-106
  - “Example: Regularizing a Regression Ensemble” on page 13-82
  - “Workflow Example: Regression of Insurance Risk Rating for Car Imports with TreeBagger” on page 13-97
• Try a different algorithm. For applicable choices, see:
  - “Characteristics of Algorithms” on page 13-6
  - “Choose an Applicable Ensemble Method” on page 13-53
When you are satisfied with the model, you can trim it using the appropriate
compact method (compact for classification trees, compact for classification
ensembles, compact for regression trees, compact for regression ensembles).
compact removes training data and pruning information, so the model uses
less memory.
Use Fitted Model for Predictions
To predict classification or regression response for most fitted models, use
the predict method:
Ypredicted = predict(obj,Xnew)
• obj is the fitted model object.
• Xnew is the new input data.
• Ypredicted is the predicted response, either classification or regression.
For classregtree, use the eval method instead of predict.
Characteristics of Algorithms
This table shows typical characteristics of the various supervised learning
algorithms. The characteristics in any particular case can vary from the listed
ones. Use the table as a guide for your initial choice of algorithms, but be
aware that the table can be inaccurate for some problems. SVM is available if
you have a Bioinformatics Toolbox™ license.
Characteristics of Supervised Learning Algorithms

Algorithm          Predictive   Fitting    Prediction   Memory   Easy to     Handles Categorical
                   Accuracy     Speed      Speed        Usage    Interpret   Predictors
Trees              Low          Fast       Fast         Low      Yes         Yes
Boosted Trees      High         Medium     Medium       Medium   No          Yes
Bagged Trees       High         Slow       Slow         High     No          Yes
SVM                High         Medium     *            *        *           No
Naive Bayes        Low          **         **           **       Yes         Yes
Nearest Neighbor   ***          Fast***    Medium       High     No          Yes***
* — SVM prediction speed and memory usage are good if there are few
support vectors, but can be poor if there are many support vectors. When you
use a kernel function, it can be difficult to interpret how SVM classifies data,
though the default linear scheme is easy to interpret.
** — Naive Bayes speed and memory usage are good for simple distributions,
but can be poor for kernel distributions and large data sets.
*** — Nearest Neighbor usually has good predictions in low dimensions, but
can have poor predictions in high dimensions. For linear search, Nearest
Neighbor does not perform any fitting. For kd-trees, Nearest Neighbor does
perform fitting. Nearest Neighbor can have either continuous or categorical
predictors, but not both.
Classification Using Nearest Neighbors
In this section...
“Pairwise Distance” on page 13-8
“k-Nearest Neighbor Search” on page 13-11
Pairwise Distance
Categorizing query points based on their distance to points in a training
dataset can be a simple yet effective way of classifying new points. You can
use various metrics to determine the distance, described next. Use pdist2 to
find the distance between a set of data and query points.
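For example, this small sketch (with made-up coordinates) computes all pairwise distances between three data points and two query points under two of the metrics defined below:

Xtrain = [0 0; 1 0; 0 2];                     % 3-by-2 data matrix (illustrative values)
Xquery = [1 1; -1 0];                         % 2-by-2 matrix of query points
Deuc  = pdist2(Xtrain,Xquery,'euclidean')     % 3-by-2 matrix of Euclidean distances
Dcity = pdist2(Xtrain,Xquery,'cityblock')     % the same pairs under the city block metric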
Distance Metrics
Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors
x1, x2, ..., xmx, and an my-by-n data matrix Y, which is treated as my (1-by-n)
row vectors y1, y2, ..., ymy, the various distances between the vectors xs and yt
are defined as follows:
• Euclidean distance

  d_{st}^2 = (x_s - y_t)(x_s - y_t)'

  The Euclidean distance is a special case of the Minkowski metric, where p = 2.

• Standardized Euclidean distance

  d_{st}^2 = (x_s - y_t) V^{-1} (x_s - y_t)'

  where V is the n-by-n diagonal matrix whose jth diagonal element is S(j)^2, where S is the vector containing the inverse weights.

• Mahalanobis distance

  d_{st}^2 = (x_s - y_t) C^{-1} (x_s - y_t)'

  where C is the covariance matrix.

• City block metric

  d_{st} = \sum_{j=1}^{n} |x_{sj} - y_{tj}|

  The city block distance is a special case of the Minkowski metric, where p = 1.

• Minkowski metric

  d_{st} = \left( \sum_{j=1}^{n} |x_{sj} - y_{tj}|^p \right)^{1/p}

  For the special case of p = 1, the Minkowski metric gives the city block metric, for the special case of p = 2, the Minkowski metric gives the Euclidean distance, and for the special case of p = ∞, the Minkowski metric gives the Chebychev distance.

• Chebychev distance

  d_{st} = \max_j \{ |x_{sj} - y_{tj}| \}

  The Chebychev distance is a special case of the Minkowski metric, where p = ∞.

• Cosine distance

  d_{st} = 1 - \frac{x_s y_t'}{\sqrt{(x_s x_s')(y_t y_t')}}

• Correlation distance

  d_{st} = 1 - \frac{(x_s - \bar{x}_s)(y_t - \bar{y}_t)'}{\sqrt{(x_s - \bar{x}_s)(x_s - \bar{x}_s)'} \sqrt{(y_t - \bar{y}_t)(y_t - \bar{y}_t)'}}

  where \bar{x}_s = \frac{1}{n} \sum_j x_{sj} and \bar{y}_t = \frac{1}{n} \sum_j y_{tj}

• Hamming distance

  d_{st} = \#(x_{sj} \neq y_{tj}) / n

• Jaccard distance

  d_{st} = \frac{\#\left[ (x_{sj} \neq y_{tj}) \cap \left( (x_{sj} \neq 0) \cup (y_{tj} \neq 0) \right) \right]}{\#\left[ (x_{sj} \neq 0) \cup (y_{tj} \neq 0) \right]}

• Spearman distance

  d_{st} = 1 - \frac{(r_s - \bar{r}_s)(r_t - \bar{r}_t)'}{\sqrt{(r_s - \bar{r}_s)(r_s - \bar{r}_s)'} \sqrt{(r_t - \bar{r}_t)(r_t - \bar{r}_t)'}}

  where

  - r_{sj} is the rank of x_{sj} taken over x_{1j}, x_{2j}, ..., x_{mx,j}, as computed by tiedrank.
  - r_{tj} is the rank of y_{tj} taken over y_{1j}, y_{2j}, ..., y_{my,j}, as computed by tiedrank.
  - r_s and r_t are the coordinate-wise rank vectors of x_s and y_t, i.e., r_s = (r_{s1}, r_{s2}, ..., r_{sn}) and r_t = (r_{t1}, r_{t2}, ..., r_{tn}).
  - \bar{r}_s = \frac{1}{n} \sum_j r_{sj} = \frac{n+1}{2}
  - \bar{r}_t = \frac{1}{n} \sum_j r_{tj} = \frac{n+1}{2}
k-Nearest Neighbor Search
Given a set X of n points and a distance function D, k-nearest neighbor
(kNN) search lets you find the k closest points in X to a query point or set of
points. The kNN search technique and kNN-based algorithms are widely
used as benchmark learning rules—the relative simplicity of the kNN search
technique makes it easy to compare the results from other classification
techniques to kNN results. They have been used in various areas such as
bioinformatics, image processing and data compression, document retrieval,
computer vision, multimedia database, and marketing data analysis. You
can use kNN search for other machine learning algorithms, such as kNN
classification, local weighted regression, missing data imputation and
interpolation, and density estimation. You can also use kNN search with
many distance-based learning functions, such as K-means clustering.
k-Nearest Neighbor Search Using Exhaustive Search
When your input data meets any of the following criteria, knnsearch uses the
exhaustive search method by default to find the k-nearest neighbors:
• The number of columns of X is more than 10.
• X is sparse.
• The distance measure is one of the following:
  - 'seuclidean'
  - 'mahalanobis'
  - 'cosine'
  - 'correlation'
  - 'spearman'
  - 'hamming'
  - 'jaccard'
  - A custom distance function
knnsearch also uses the exhaustive search method if your search object is
an ExhaustiveSearcher object. The exhaustive search method finds the
distance from each query point to every point in X, ranks them in ascending
order, and returns the k points with the smallest distances. For example, this
diagram shows the k = 3 nearest neighbors.
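As a sketch of forcing an exhaustive search, the following code builds an ExhaustiveSearcher for the Fisher iris measurements with the 'cosine' metric (one of the metrics that rules out a kd-tree) and finds the three nearest neighbors of a few query rows; the query choice is arbitrary:

load fisheriris
queries = meas(1:5,:);                                          % a few rows reused as query points
nsE = createns(meas,'NSMethod','exhaustive','Distance','cosine');
[idx,dist] = knnsearch(nsE,queries,'K',3);                      % indices and distances of the 3 nearest points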
k-Nearest Neighbor Search Using a kd-Tree
When your input data meets all of the following criteria, knnsearch creates a
kd-tree by default to find the k-nearest neighbors:
• The number of columns of X is less than 10.
• X is not sparse.
• The distance measure is one of the following:
  - 'euclidean' (default)
  - 'cityblock'
  - 'minkowski'
  - 'chebychev'
knnsearch also uses a kd-tree if your search object is a KDTreeSearcher object.
kd-trees divide your data into nodes with at most BucketSize (default is
50) points per node, based on coordinates (as opposed to categories). The
following diagrams illustrate this concept using patch objects to color code
the different “buckets.”
When you want to find the k-nearest neighbors to a given query point,
knnsearch does the following:
1 Determines the node to which the query point belongs. In the following
example, the query point (32,90) belongs to Node 4.
2 Finds the closest k points within that node and their distances to the query
point. In the following example, the points in red circles are equidistant
from the query point, and are the closest points to the query point within
Node 4.
3 Chooses all other nodes having any area that is within the same distance,
in any direction, from the query point to the kth closest point. In this
example, only Node 3 overlaps the solid black circle centered at the query
point with radius equal to the distance to the closest points within Node 4.
4 Searches nodes within that range for any points closer to the query point.
In the following example, the point in a red square is slightly closer to the
query point than those within Node 4.
Using a kd-tree for large datasets with fewer than 10 dimensions (columns)
can be much more efficient than using the exhaustive search method, as
knnsearch needs to calculate only a subset of the distances. To maximize the
efficiency of kd-trees, use a KDTreeSearcher object.
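A minimal sketch of this approach: build the kd-tree once with createns, then reuse it for any number of queries. The 'BucketSize' value shown is an illustrative override of the default 50:

load fisheriris
x = meas(:,3:4);                                      % two columns, a good fit for a kd-tree
nsK = createns(x,'NSMethod','kdtree','BucketSize',20);
[idx,dist] = knnsearch(nsK,[5 1.45; 6 2],'K',10);     % 10 nearest neighbors of two query points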
What Are Search Objects?
Basically, objects are a convenient way of storing information. Classes of
related objects (for example, all search objects) have the same properties
with values and types relevant to a specified search method. In addition to
storing information within objects, you can perform certain actions (called
methods) on objects.
All search objects have a knnsearch method specific to that class. This lets
you efficiently perform a k-nearest neighbors search on your object for that
specific object type. In addition, there is a generic knnsearch function that
searches without creating or using an object.
To determine which type of object and search method is best for your data,
consider the following:
• Does your data have many columns, say more than 10? The
ExhaustiveSearcher object may perform better.
• Is your data sparse? Use the ExhaustiveSearcher object.
• Do you want to use one of these distance measures to find the nearest
neighbors? Use the ExhaustiveSearcher object.
  - 'seuclidean'
  - 'mahalanobis'
  - 'cosine'
  - 'correlation'
  - 'spearman'
  - 'hamming'
  - 'jaccard'
  - A custom distance function
• Is your dataset huge (but with fewer than 10 columns)? Use the
KDTreeSearcher object.
• Are you searching for the nearest neighbors for a large number of query
points? Use the KDTreeSearcher object.
For more detailed information on object-oriented programming in MATLAB,
see Object-Oriented Programming.
Example: Classifying Query Data Using knnsearch
1 Classify a new point based on the last two columns of the Fisher iris data.
Using only the last two columns makes it easier to plot:
load fisheriris
x = meas(:,3:4);
gscatter(x(:,1),x(:,2),species)
set(legend,'location','best')
2 Plot the new point:
newpoint = [5 1.45];
line(newpoint(1),newpoint(2),'marker','x','color','k',...
'markersize',10,'linewidth',2)
3 Find the 10 sample points closest to the new point:
[n,d] = knnsearch(x,newpoint,'k',10)
line(x(n,1),x(n,2),'color',[.5 .5 .5],'marker','o',...
'linestyle','none','markersize',10)
4 It appears that knnsearch has found only the nearest eight neighbors. In
fact, this particular dataset contains duplicate values:
x(n,:)
ans =
    5.0000    1.5000
    4.9000    1.5000
    4.9000    1.5000
    5.1000    1.5000
    5.1000    1.6000
    4.8000    1.4000
    5.0000    1.7000
    4.7000    1.4000
    4.7000    1.4000
    4.7000    1.5000
5 To make duplicate values visible on the plot, use the following code:
% jitter to make repeated points visible
xj = x + .05*(rand(150,2)-.5);
gscatter(xj(:,1),xj(:,2),species)
The jittered points do not affect any analysis of the data, only the
visualization. This example does not jitter the points.
6 Make the axes equal so the calculated distances correspond to the apparent
distances on the plot, and zoom in to see the neighbors better:
set(gca,'xlim',[4.5 5.5],'ylim',[1 2]); axis square
7 Find the species of the 10 neighbors:
tabulate(species(n))
       Value    Count   Percent
   virginica        2    20.00%
  versicolor        8    80.00%
Using a rule based on the majority vote of the 10 nearest neighbors, you
can classify this new point as a versicolor.
8 Visually identify the neighbors by drawing a circle around the group of
them:
% Define the center and diameter of a circle, based on the
% location of the new point:
ctr = newpoint - d(end);
diameter = 2*d(end);
% Draw a circle around the 10 nearest neighbors:
h = rectangle('position',[ctr,diameter,diameter],...
'curvature',[1 1]);
set(h,'linestyle',':')
9 Using the same dataset, find the 10 nearest neighbors to three new points:
figure
newpoint2 = [5 1.45;6 2;2.75 .75];
gscatter(x(:,1),x(:,2),species)
legend('location','best')
[n2,d2] = knnsearch(x,newpoint2,'k',10);
line(x(n2,1),x(n2,2),'color',[.5 .5 .5],'marker','o',...
'linestyle','none','markersize',10)
line(newpoint2(:,1),newpoint2(:,2),'marker','x','color','k',...
'markersize',10,'linewidth',2,'linestyle','none')
10 Find the species of the 10 nearest neighbors for each new point:
tabulate(species(n2(1,:)))
       Value    Count   Percent
   virginica        2    20.00%
  versicolor        8    80.00%

tabulate(species(n2(2,:)))
       Value    Count   Percent
   virginica       10   100.00%

tabulate(species(n2(3,:)))
       Value    Count   Percent
  versicolor        7    70.00%
      setosa        3    30.00%
For further examples using the knnsearch methods and the knnsearch function,
see the individual reference pages.
Classification Trees and Regression Trees
In this section...
“What Are Classification Trees and Regression Trees?” on page 13-25
“Creating Classification Trees and Regression Trees” on page 13-26
“Predicting Responses With Classification Trees and Regression Trees” on
page 13-32
“Improving Classification Trees and Regression Trees” on page 13-33
“Alternative: classregtree” on page 13-42
What Are Classification Trees and Regression Trees?
Classification trees and regression trees predict responses to data. To predict
a response, follow the decisions in the tree from the root (beginning) node
down to a leaf node. The leaf node contains the response. Classification trees
give responses that are nominal, such as 'true' or 'false'. Regression
trees give numeric responses.
Statistics Toolbox trees are binary. Each step in a prediction involves
checking the value of one predictor (variable). For example, here is a simple
classification tree:
This tree predicts classifications based on two predictors, x1 and x2. To
predict, start at the top node, represented by a triangle (Δ). The first decision
is whether x1 is smaller than 0.5. If so, follow the left branch, and see that
the tree classifies the data as type 0.
If, however, x1 exceeds 0.5, then follow the right branch to the lower-right
triangle node. Here the tree asks if x2 is smaller than 0.5. If so, then follow
the left branch to see that the tree classifies the data as type 0. If not, then
follow the right branch to see that the tree classifies the data as
type 1.
Creating Classification Trees and Regression Trees
1 Collect your known input data into a matrix X. Each row of X represents
one observation. Each column of X represents one variable (also called a
predictor). Use NaN to represent a missing value.
2 Collect the responses to X in a response variable Y. Each entry in Y
represents the response to the corresponding row of X. Represent missing
values as shown in Response Data Types on page 13-26.
• For regression, Y must be a numeric vector with the same number of
elements as the number of rows of X.
• For classification, Y can be any of the following data types; the table also
contains the method of including missing entries:
Response Data Types

Data Type                 Missing Entry
Numeric vector            NaN
Categorical vector        <undefined>
Character array           Row of spaces
Cell array of strings     ''
Logical vector            (not possible to represent)
For example, suppose your response data consists of three observations in
this order: true, false, true. You could express Y as:
• [1;0;1] (numeric vector)
• nominal({'true','false','true'}) (categorical vector)
• [true;false;true] (logical vector)
• ['true ';'false';'true '] (character array, padded with spaces so
each row has the same length)
• {'true','false','true'} (cell array of strings)
Use whichever data type is most convenient.
3 Create a tree using one of these methods:
• For a classification tree, use ClassificationTree.fit:
tree = ClassificationTree.fit(X,Y);
• For a regression tree, use RegressionTree.fit:
tree = RegressionTree.fit(X,Y);
Example: Creating a Classification Tree
To create a classification tree for the ionosphere data:
load ionosphere % contains X and Y variables
ctree = ClassificationTree.fit(X,Y)
ctree =
  ClassificationTree:
           PredictorNames: {1x34 cell}
    CategoricalPredictors: []
             ResponseName: 'Y'
               ClassNames: {'b' 'g'}
           ScoreTransform: 'none'
            NObservations: 351
Example: Creating a Regression Tree
To create a regression tree for the carsmall data based on the Horsepower
and Weight vectors for data, and MPG vector for response:
load carsmall % contains Horsepower, Weight, MPG
X = [Horsepower Weight];
rtree = RegressionTree.fit(X,MPG)
rtree =
  RegressionTree:
           PredictorNames: {'x1' 'x2'}
    CategoricalPredictors: []
             ResponseName: 'Y'
        ResponseTransform: 'none'
            NObservations: 94
Viewing a Tree
There are two ways to view a tree:
• view(tree) returns a text description of the tree.
• view(tree,'mode','graph') returns a graphic description of the tree.
“Example: Creating a Classification Tree” on page 13-27 has the following
two views:
load fisheriris
ctree = ClassificationTree.fit(meas,species);
view(ctree)
Decision tree for classification
1  if x3<2.45 then node 2 elseif x3>=2.45 then node 3 else setosa
2  class = setosa
3  if x4<1.75 then node 4 elseif x4>=1.75 then node 5 else versicolor
4  if x3<4.95 then node 6 elseif x3>=4.95 then node 7 else versicolor
5  class = virginica
6  if x4<1.65 then node 8 elseif x4>=1.65 then node 9 else versicolor
7  class = virginica
8  class = versicolor
9  class = virginica

view(ctree,'mode','graph')
Similarly, “Example: Creating a Regression Tree” on page 13-27 has the
following two views:
load carsmall % contains Horsepower, Weight, MPG
X = [Horsepower Weight];
rtree = RegressionTree.fit(X,MPG,'MinParent',30);
view(rtree)
Decision tree for regression
1 if x2<3085.5 then node 2 elseif x2>=3085.5 then node 3 else 23.7181
2 if x1<89 then node 4 elseif x1>=89 then node 5 else 28.7931
3 if x1<115 then node 6 elseif x1>=115 then node 7 else 15.5417
4 if x2<2162 then node 8 elseif x2>=2162 then node 9 else 30.9375
5 fit = 24.0882
6 fit = 19.625
7 fit = 14.375
8 fit = 33.3056
9 fit = 29
view(rtree,'mode','graph')
How the Fit Methods Create Trees
The ClassificationTree.fit and RegressionTree.fit methods perform
the following steps to create decision trees:
1 Start with all input data, and examine all possible binary splits on every
predictor.
2 Select a split with best optimization criterion.
• If the split leads to a child node having too few observations (less
than the MinLeaf parameter), select a split with the best optimization
criterion subject to the MinLeaf constraint.
3 Impose the split.
4 Repeat recursively for the two child nodes.
The explanation requires two more items: description of the optimization
criterion, and stopping rule.
Stopping rule: Stop splitting when any of the following hold:
• The node is pure.
  - For classification, a node is pure if it contains only observations of one class.
  - For regression, a node is pure if the mean squared error (MSE) for the observed response in this node drops below the MSE for the observed response in the entire data multiplied by the tolerance on quadratic error per node (qetoler parameter).
• There are fewer than MinParent observations in this node.
• Any split imposed on this node would produce children with fewer than
MinLeaf observations.
Optimization criterion:
• Regression: mean-squared error (MSE). Choose a split to minimize the
MSE of predictions compared to the training data.
• Classification: One of three measures, depending on the setting of the
SplitCriterion name-value pair:
  - 'gdi' (Gini’s diversity index, the default)
  - 'twoing'
  - 'deviance'
For details, see ClassificationTree “Definitions” on page 20-203.
For a continuous predictor, a tree can split halfway between any two adjacent
unique values found for this predictor. For a categorical predictor with L
levels, a classification tree needs to consider 2^(L–1) – 1 splits. To obtain this
formula, observe that you can assign L distinct values to the left and right
nodes in 2^L ways. Two out of these 2^L configurations would leave either the left or the
right node empty, and therefore should be discarded. Now divide by 2 because
left and right can be swapped. A classification tree can thus process only
categorical predictors with a moderate number of levels. A regression tree
employs a computational shortcut: it sorts the levels by the observed mean
response, and considers only the L–1 splits between the sorted levels.
Predicting Responses With Classification Trees and
Regression Trees
After creating a tree, you can easily predict responses for new data. Suppose
Xnew is new data that has the same number of columns as the original data
X. To predict the classification or regression based on the tree and the new
data, enter
Ynew = predict(tree,Xnew);
For each row of data in Xnew, predict runs through the decisions in tree and
gives the resulting prediction in the corresponding element of Ynew. For more
information for classification, see the classification predict reference page;
for regression, see the regression predict reference page.
For example, to find the predicted classification of a point at the mean of
the ionosphere data:
load ionosphere % contains X and Y variables
ctree = ClassificationTree.fit(X,Y);
Ynew = predict(ctree,mean(X))
Ynew =
'g'
To find the predicted MPG of a point at the mean of the carsmall data:
load carsmall % contains Horsepower, Weight, MPG
X = [Horsepower Weight];
rtree = RegressionTree.fit(X,MPG);
Ynew = predict(rtree,mean(X))
Ynew =
28.7931
Improving Classification Trees and Regression Trees
You can tune trees by setting name-value pairs in ClassificationTree.fit
and RegressionTree.fit. The remainder of this section describes how to
determine the quality of a tree, how to decide which name-value pairs to set,
and how to control the size of a tree:
• “Examining Resubstitution Error” on page 13-33
• “Cross Validation” on page 13-34
• “Control Depth or “Leafiness”” on page 13-34
• “Pruning” on page 13-38
Examining Resubstitution Error
Resubstitution error is the difference between the response training data and
the predictions the tree makes of the response based on the input training
data. If the resubstitution error is high, you cannot expect the predictions
of the tree to be good. However, having low resubstitution error does not
guarantee good predictions for new data. Resubstitution error is often an
overly optimistic estimate of the predictive error on new data.
Example: Resubstitution Error of a Classification Tree. Examine the
resubstitution error of a default classification tree for the Fisher iris data:
load fisheriris
ctree = ClassificationTree.fit(meas,species);
resuberror = resubLoss(ctree)
resuberror =
0.0200
The tree classifies nearly all the Fisher iris data correctly.
Cross Validation
To get a better sense of the predictive accuracy of your tree for new data,
cross validate the tree. By default, cross validation splits the training data
into 10 parts at random. It trains 10 new trees, each one on nine parts of
the data. It then examines the predictive accuracy of each new tree on the
data not included in training that tree. This method gives a good estimate
of the predictive accuracy of the resulting tree, since it tests the new trees
on new data.
Example: Cross Validating a Regression Tree. Examine the
resubstitution and cross-validation accuracy of a regression tree for predicting
mileage based on the carsmall data:
load carsmall
X = [Acceleration Displacement Horsepower Weight];
rtree = RegressionTree.fit(X,MPG);
resuberror = resubLoss(rtree)
resuberror =
4.7188
The resubstitution loss for a regression tree is the mean-squared error. The
resulting value indicates that a typical predictive error for the tree is about
the square root of 4.7, or a bit over 2.
Now calculate the error by cross validating the tree:
cvrtree = crossval(rtree);
cvloss = kfoldLoss(cvrtree)
cvloss =
23.4808
The cross-validated loss is almost 25, meaning a typical predictive error for
the tree on new data is about 5. This demonstrates that cross-validated loss
is usually higher than simple resubstitution loss.
Control Depth or “Leafiness”
When you grow a decision tree, consider its simplicity and predictive power. A
deep tree with many leaves is usually highly accurate on the training data.
However, the tree is not guaranteed to show a comparable accuracy on an
independent test set. A leafy tree tends to overtrain, and its test accuracy
is often far less than its training (resubstitution) accuracy. In contrast, a
shallow tree does not attain high training accuracy. But a shallow tree can be
more robust — its training accuracy could be close to that of a representative
test set. Also, a shallow tree is easy to interpret.
If you do not have enough data for training and test, estimate tree accuracy
by cross validation.
For an alternative method of controlling the tree depth, see “Pruning” on
page 13-38.
Example: Selecting Appropriate Tree Depth. This example shows how to
control the depth of a decision tree, and how to choose an appropriate depth.
1 Load the ionosphere data:
load ionosphere
2 Generate minimum leaf occupancies for classification trees from 10 to 100,
spaced exponentially apart:
leafs = logspace(1,2,10);
3 Create cross validated classification trees for the ionosphere data with
minimum leaf occupancies from leafs:
N = numel(leafs);
err = zeros(N,1);
for n=1:N
t = ClassificationTree.fit(X,Y,'crossval','on',...
'minleaf',leafs(n));
err(n) = kfoldLoss(t);
end
plot(leafs,err);
xlabel('Min Leaf Size');
ylabel('cross-validated error');
The best leaf size is between about 20 and 50 observations per leaf.
4 Compare the near-optimal tree with at least 40 observations per leaf
with the default tree, which uses 10 observations per parent node and 1
observation per leaf.
DefaultTree = ClassificationTree.fit(X,Y);
view(DefaultTree,'mode','graph')
OptimalTree = ClassificationTree.fit(X,Y,'minleaf',40);
view(OptimalTree,'mode','graph')
resubOpt = resubLoss(OptimalTree);
lossOpt = kfoldLoss(crossval(OptimalTree));
resubDefault = resubLoss(DefaultTree);
lossDefault = kfoldLoss(crossval(DefaultTree));
resubOpt,resubDefault,lossOpt,lossDefault
resubOpt =
0.0883
resubDefault =
0.0114
lossOpt =
0.1054
lossDefault =
0.1026
The near-optimal tree is much smaller and gives a much higher
resubstitution error. Yet it gives similar accuracy for cross-validated data.
Pruning
Pruning optimizes tree depth (leafiness) by merging leaves on the same tree
branch. “Control Depth or “Leafiness”” on page 13-34 describes one method
for selecting the optimal depth for a tree. Unlike in that section, you do not
need to grow a new tree for every node size. Instead, grow a deep tree, and
prune it to the level you choose.
Prune a tree at the command line using the prune method (classification) or
prune method (regression). Alternatively, prune a tree interactively with
the tree viewer:
view(tree,'mode','graph')
To prune a tree, the tree must contain a pruning sequence. By default, both
ClassificationTree.fit and RegressionTree.fit calculate a pruning
sequence for a tree during construction. If you construct a tree with the
'Prune' name-value pair set to 'off', or if you prune a tree to a smaller level,
the tree does not contain the full pruning sequence. Generate the full pruning
sequence with the prune method (classification) or prune method (regression).
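For instance, here is a minimal sketch using the Fisher iris data (the 'Prune','off' setting is chosen only to illustrate regenerating the sequence; in ordinary use the sequence is computed automatically):

load fisheriris
t = ClassificationTree.fit(meas,species,'Prune','off'); % tree without a pruning sequence
t = prune(t);            % fill in the full pruning sequence
view(t,'mode','graph')   % the viewer can now prune interactively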
Example: Pruning a Classification Tree. This example creates a
classification tree for the ionosphere data, and prunes it to a good level.
1 Load the ionosphere data:
load ionosphere
2 Construct a default classification tree for the data:
tree = ClassificationTree.fit(X,Y);
3 View the tree in the interactive viewer:
view(tree,'mode','graph')
4 Find the optimal pruning level by minimizing cross-validated loss:
[~,~,~,bestlevel] = cvLoss(tree,...
'subtrees','all','treesize','min')
bestlevel =
6
5 Prune the tree to level 6 in the interactive viewer:
The pruned tree is the same as the near-optimal tree in “Example:
Selecting Appropriate Tree Depth” on page 13-35.
6 Set 'treesize' to 'se' (default) to find the maximal pruning level for
which the tree error does not exceed the error from the best level plus one
standard deviation:
[~,~,~,bestlevel] = cvLoss(tree,'subtrees','all')
bestlevel =
6
In this case the level is the same for either setting of 'treesize'.
7 Prune the tree to use it for other purposes:
tree = prune(tree,'Level',6);
view(tree,'mode','graph')
Alternative: classregtree
The ClassificationTree and RegressionTree classes are new in MATLAB
R2011a. Previously, you represented both classification trees and regression
trees with a classregtree object. The new classes provide all the
functionality of the classregtree class, and are more convenient when used
in conjunction with “Ensemble Methods” on page 13-50.
Before the classregtree class, there were treefit, treedisp, treeval,
treeprune, and treetest functions. Statistics Toolbox software maintains
these only for backward compatibility.
Example: Creating Classification Trees Using classregtree
This example uses Fisher’s iris data in fisheriris.mat to create a
classification tree for predicting species using measurements of sepal length,
sepal width, petal length, and petal width as predictors. Here, the predictors
are continuous and the response is categorical.
1 Load the data and use the classregtree constructor of the classregtree
class to create the classification tree:
load fisheriris
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1  if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2  class = setosa
3  if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4  if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5  class = virginica
6  if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7  class = virginica
8  class = versicolor
9  class = virginica
t is a classregtree object and can be operated on with any class method.
2 Use the type method of the classregtree class to show the type of the tree:
treetype = type(t)
treetype =
classification
classregtree creates a classification tree because species is a cell array
of strings, and the response is assumed to be categorical.
3 To view the tree, use the view method of the classregtree class:
view(t)
The tree predicts the response values at the circular leaf nodes based on a
series of questions about the iris at the triangular branching nodes. A true
answer to any question follows the branch to the left. A false follows the
branch to the right.
4 The tree does not use sepal measurements for predicting species. These
can go unmeasured in new data, and you can enter them as NaN values for
predictions. For example, to use the tree to predict the species of an iris
with petal length 4.8 and petal width 1.6, type:
predicted = t([NaN NaN 4.8 1.6])
predicted =
'versicolor'
The object allows for functional evaluation, of the form t(X). This is a
shorthand way of calling the eval method of the classregtree class.
The predicted species is the left leaf node at the bottom of the tree in the
previous view.
5 You can use a variety of methods of the classregtree class, such as cutvar
and cuttype to get more information about the split at node 6 that makes
the final distinction between versicolor and virginica:
var6 = cutvar(t,6) % What variable determines the split?
var6 =
'PW'
type6 = cuttype(t,6) % What type of split is it?
type6 =
'continuous'
6 Classification trees fit the original (training) data well, but can do a poor
job of classifying new values. Lower branches, especially, can be strongly
affected by outliers. A simpler tree often avoids overfitting. You can use
the prune method of the classregtree class to find the next largest tree
from an optimal pruning sequence:
pruned = prune(t,'level',1)
pruned =
Decision tree for classification
1  if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2  class = setosa
3  if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4  if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5  class = virginica
6  class = versicolor
7  class = virginica
view(pruned)
To find the best classification tree, employing the techniques of resubstitution
and cross validation, use the test method of the classregtree class.
Example: Creating Regression Trees Using classregtree
This example uses the data on cars in carsmall.mat to create a regression
tree for predicting mileage using measurements of weight and the number of
cylinders as predictors. Here, one predictor (weight) is continuous and the
other (cylinders) is categorical. The response (mileage) is continuous.
1 Load the data and use the classregtree constructor of the classregtree
class to create the regression tree:
load carsmall
t = classregtree([Weight, Cylinders],MPG,...
'cat',2,'splitmin',20,...
'names',{'W','C'})
t =
Decision tree for regression
1   if W<3085.5 then node 2 elseif W>=3085.5 then node 3 else 23.7181
2   if W<2371 then node 4 elseif W>=2371 then node 5 else 28.7931
3   if C=8 then node 6 elseif C in {4 6} then node 7 else 15.5417
4   if W<2162 then node 8 elseif W>=2162 then node 9 else 32.0741
5   if C=6 then node 10 elseif C=4 then node 11 else 25.9355
6   if W<4381 then node 12 elseif W>=4381 then node 13 else 14.2963
7   fit = 19.2778
8   fit = 33.3056
9   fit = 29.6111
10  fit = 23.25
11  if W<2827.5 then node 14 elseif W>=2827.5 then node 15 else 27.2143
12  if W<3533.5 then node 16 elseif W>=3533.5 then node 17 else 14.8696
13  fit = 11
14  fit = 27.6389
15  fit = 24.6667
16  fit = 16.6
17  fit = 14.3889
t is a classregtree object and can be operated on with any of the methods
of the class.
2 Use the type method of the classregtree class to show the type of the tree:
treetype = type(t)
treetype =
regression
classregtree creates a regression tree because MPG is a numerical vector,
and the response is assumed to be continuous.
3 To view the tree, use the view method of the classregtree class:
view(t)
The tree predicts the response values at the circular leaf nodes based on a
series of questions about the car at the triangular branching nodes. A true
answer to any question follows the branch to the left; a false follows the
branch to the right.
4 Use the tree to predict the mileage for a 2000-pound car with either 4,
6, or 8 cylinders:
mileage2K = t([2000 4; 2000 6; 2000 8])
mileage2K =
33.3056
33.3056
33.3056
The object allows for functional evaluation, of the form t(X). This is a
shorthand way of calling the eval method of the classregtree class.
5 The predicted responses computed above are all the same. This is because
they follow a series of splits in the tree that depend only on weight,
terminating at the left-most leaf node in the view above. A 4000-pound
car, following the right branch from the top of the tree, leads to different
predicted responses:
mileage4K = t([4000 4; 4000 6; 4000 8])
mileage4K =
19.2778
19.2778
14.3889
6 You can use a variety of other methods of the classregtree class, such as
cutvar, cuttype, and cutcategories, to get more information about the
split at node 3 that distinguishes the 8-cylinder car:
var3 = cutvar(t,3) % What variable determines the split?
var3 =
'C'
type3 = cuttype(t,3) % What type of split is it?
type3 =
'categorical'
c = cutcategories(t,3) % Which classes are sent to the left
% child node, and which to the right?
c =
[8]
[1x2 double]
c{1}
ans =
8
c{2}
ans =
4
6
Regression trees fit the original (training) data well, but may do a poor
job of predicting new values. Lower branches, especially, may be strongly
affected by outliers. A simpler tree often avoids over-fitting. To find the
best regression tree, employing the techniques of resubstitution and cross
validation, use the test method of the classregtree class.
Ensemble Methods
In this section...
“Framework for Ensemble Learning” on page 13-50
“Basic Ensemble Examples” on page 13-57
“Test Ensemble Quality” on page 13-59
“Classification: Imbalanced Data or Unequal Misclassification Costs” on
page 13-64
“Example: Classification with Many Categorical Levels” on page 13-71
“Example: Surrogate Splits” on page 13-76
“Ensemble Regularization” on page 13-81
“Example: Tuning RobustBoost” on page 13-92
“TreeBagger Examples” on page 13-96
“Ensemble Algorithms” on page 13-118
Framework for Ensemble Learning
You have several methods for melding results from many weak learners into
one high-quality ensemble predictor. These methods follow, as closely as
possible, the same syntax, so you can try different methods with only minor
changes in your commands.
Create an ensemble with the fitensemble function. The syntax of
fitensemble is
ens = fitensemble(X,Y,model,numberens,learners)
• X is the matrix of data. Each row contains one observation, and each
column contains one predictor variable.
• Y is the responses, with the same number of observations as rows in X.
• model is a string naming the type of ensemble.
• numberens is the number of weak learners in ens from each element of
learners. So the number of elements in ens is numberens times the
number of elements in learners.
• learners is a string naming a weak learner, a weak learner template,
or a cell array of such templates.
The information you need to create an ensemble is the data matrix X, the
responses Y, the ensemble method, the number of weak learners in the
ensemble, and the weak learner(s); fitensemble combines these inputs into
an ensemble.
For all classification or nonlinear regression problems, follow these steps
to create an ensemble:
1 “Put Predictor Data in a Matrix” on page 13-51
2 “Prepare Response Data” on page 13-52
3 “Choose an Applicable Ensemble Method” on page 13-53
4 “Set the Number of Ensemble Members” on page 13-54
5 “Prepare the Weak Learners” on page 13-54
6 “Call fitensemble” on page 13-55
Put Predictor Data in a Matrix
All supervised learning methods start with a data matrix, usually called X in
this documentation. Each row of X represents one observation. Each column
of X represents one variable, or predictor.
Currently, you can use only decision trees as learners for ensembles. Decision
trees can handle NaN values in X. Such values are called “missing.” If you have
some missing values in a row of X, a decision tree finds optimal splits using
nonmissing values only. If an entire row consists of NaN, fitensemble ignores
that row. If you have data with a large fraction of missing values in X, use
surrogate decision splits. For examples of surrogate splits, see “Example:
Unequal Classification Costs” on page 13-66 and “Example: Surrogate Splits”
on page 13-76.
Prepare Response Data
You can use a wide variety of data types for response data.
• For regression ensembles, Y must be a numeric vector with the same
number of elements as the number of rows of X.
• For classification ensembles, Y can be any of the following data types. The
table also contains the method of including missing entries.
Data Type                 Missing Entry
Numeric vector            NaN
Categorical vector        <undefined>
Character array           Row of spaces
Cell array of strings     ''
Logical vector            (not possible to represent)
fitensemble ignores missing values in Y when creating an ensemble.
For example, suppose your response data consists of three observations in the
following order: true, false, true. You could express Y as:
• [1;0;1] (numeric vector)
• nominal({'true','false','true'}) (categorical vector)
• [true;false;true] (logical vector)
• ['true ';'false';'true '] (character array, padded with spaces so
each row has the same length)
• {'true','false','true'} (cell array of strings)
Use whichever data type is most convenient. Since you cannot represent
missing values with logical entries, do not use logical entries when you have
missing values in Y.
Choose an Applicable Ensemble Method
fitensemble uses one of these algorithms to create an ensemble.
• For classification with two classes:
  - 'AdaBoostM1'
  - 'LogitBoost'
  - 'GentleBoost'
  - 'RobustBoost'
  - 'Bag'
• For classification with three or more classes:
  - 'AdaBoostM2'
  - 'Bag'
• For regression:
  - 'LSBoost'
  - 'Bag'
Since 'Bag' applies to all methods, indicate whether you want a classifier
or regressor with the type name-value pair set to 'classification' or
'regression'.
For descriptions of the various algorithms, and aid in choosing which applies
to your data, see “Ensemble Algorithms” on page 13-118. The following table
gives characteristics of the various algorithms. In the table titles:
• Regress. — Regression
• Classif. — Classification
• Preds. — Predictors
• Estim. — Estimate
• Gen. — Generalization
• Pred. — Prediction
• Mem. — Memory usage
Algorithm     Regress.  Binary    Binary Classif.      Classif.    Auto Estim.  Fast   Fast   Low
                        Classif.  Multi-Level Preds.   3+ Classes  Gen. Error   Train  Pred.  Mem.
Bag              ×         ×                               ×            ×
AdaBoostM1                 ×                                                      ×      ×     ×
AdaBoostM2                                                 ×                      ×      ×     ×
LogitBoost                 ×            ×                                         ×      ×     ×
GentleBoost                ×            ×                                         ×      ×     ×
RobustBoost                ×                                                             ×     ×
LSBoost          ×                                                                ×      ×     ×
Set the Number of Ensemble Members
Choosing the size of an ensemble involves balancing speed and accuracy.
• Larger ensembles take longer to train and to generate predictions.
• Some ensemble algorithms can become overtrained (inaccurate) when too
large.
To set an appropriate size, consider starting with several dozen to several
hundred members in an ensemble, training the ensemble, and then checking
the ensemble quality, as in “Example: Test Ensemble Quality” on page 13-59.
If it appears that you need more members, add them using the resume method
(classification) or the resume method (regression). Repeat until adding more
members does not improve ensemble quality.
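For instance, a sketch of this workflow (using the carsmall data; the ensemble sizes are arbitrary):

load carsmall
X = [Horsepower Weight];
ens = fitensemble(X,MPG,'LSBoost',100,'Tree');  % start with 100 trees
% ... check ensemble quality, for example with crossval and kfoldLoss ...
ens = resume(ens,50);                           % grow 50 additional weak learners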
Prepare the Weak Learners
Currently there is one built-in weak learner type: 'Tree'. To create an
ensemble with the default tree options, pass in 'Tree' as the weak learner.
To set a nondefault classification tree learner, create a classification tree
template with the ClassificationTree.template method.
Similarly, to set a nondefault regression tree learner, create a regression tree
template with the RegressionTree.template method.
While you can give fitensemble a cell array of learner templates, the most
common usage is to give just one weak learner template.
For examples using a template, see “Example: Unequal Classification Costs”
on page 13-66 and “Example: Surrogate Splits” on page 13-76.
Common Settings for Weak Learners.
• The depth of the weak learner tree makes a difference for training time,
memory usage, and predictive accuracy. You control the depth with two
parameters:
-
MinLeaf — Each leaf has at least MinLeaf observations. Set small
values of MinLeaf to get a deep tree.
-
MinParent — Each branch node in the tree has at least MinParent
observations. Set small values of MinParent to get a deep tree.
If you supply both MinParent and MinLeaf, the learner uses the setting
that gives larger leaves:
MinParent = max(MinParent,2*MinLeaf)
• Surrogate — Grow decision trees with surrogate splits when Surrogate is
'on'. Use surrogate splits when your data has missing values.
Note Surrogate splits cause training to be slower and use more memory.
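For example, a template that requests deep trees with surrogate splits might look like this (a sketch; the MinLeaf value is illustrative):

t = ClassificationTree.template('MinLeaf',1,'Surrogate','on');

Pass the template to fitensemble as the learners argument.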
Call fitensemble
The syntax of fitensemble is
ens = fitensemble(X,Y,model,numberens,learners)
• X is the matrix of data. Each row contains one observation, and each
column contains one predictor variable.
• Y is the responses, with the same number of observations as rows in X.
• model is a string naming the type of ensemble.
• numberens is the number of weak learners in ens from each element of
learners. So the number of elements in ens is numberens times the
number of elements in learners.
• learners is a string naming a weak learner, a weak learner template, or a
cell array of such strings and templates.
The result of fitensemble is an ensemble object, suitable for making
predictions on new data. For a basic example of creating a classification
ensemble, see “Creating a Classification Ensemble” on page 13-57. For a
basic example of creating a regression ensemble, see “Creating a Regression
Ensemble” on page 13-58.
Where to Set Name-Value Pairs. There are several name-value pairs
you can pass to fitensemble, and several that apply to the weak learners
(ClassificationTree.template and RegressionTree.template). To
decide whether an option (name-value pair) applies to the ensemble or to
the weak learners:
• Use template name-value pairs to control the characteristics of the weak
learners.
• Use fitensemble name-value pairs to control the ensemble as a whole,
either for algorithms or for structure.
For example, to have an ensemble of boosted classification trees with each tree
deeper than the default, set the ClassificationTree.template name-value
pairs (MinLeaf and MinParent) to smaller values than the defaults. This
causes the trees to be leafier (deeper).
To name the predictors in the ensemble (part of the structure of the ensemble),
use the PredictorNames name-value pair in fitensemble.
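For instance, a sketch that combines the two kinds of options (using the Fisher iris data; the specific parameter values and predictor names are arbitrary):

load fisheriris
t = ClassificationTree.template('MinLeaf',2,'MinParent',4); % weak-learner options
ens = fitensemble(meas,species,'AdaBoostM2',100,t,...
    'PredictorNames',{'SL' 'SW' 'PL' 'PW'});                % ensemble options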
Basic Ensemble Examples
Creating a Classification Ensemble
Create a classification ensemble for the Fisher iris data, and use it to predict
the classification of a flower with average measurements.
1 Load the data:
load fisheriris
2 The predictor data X is the meas matrix.
3 The response data Y is the species cell array.
4 The only boosted classification ensemble for three or more classes is
'AdaBoostM2'.
5 For this example, arbitrarily take an ensemble of 100 trees.
6 Use a default tree template.
7 Create the ensemble:
ens = fitensemble(meas,species,'AdaBoostM2',100,'Tree')
ens =
classreg.learning.classif.ClassificationEnsemble:
PredictorNames: {'x1' 'x2' 'x3' 'x4'}
CategoricalPredictors: []
ResponseName: 'Y'
ClassNames: {'setosa' 'versicolor' 'virginica'}
ScoreTransform: 'none'
NObservations: 150
NTrained: 100
Method: 'AdaBoostM2'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: [100x1 double]
FitInfoDescription: [2x83 char]
8 Predict the classification of a flower with average measurements:
flower = predict(ens,mean(meas))
flower =
'versicolor'
Creating a Regression Ensemble
Create a regression ensemble to predict mileage of cars based on their
horsepower and weight, trained on the carsmall data. Use the resulting
ensemble to predict the mileage of a car with 150 horsepower weighing 2750
lbs.
1 Load the data:
load carsmall
2 Prepare the input data.
X = [Horsepower Weight];
3 The response data Y is MPG.
4 The only boosted regression ensemble type is 'LSBoost'.
5 For this example, arbitrarily take an ensemble of 100 trees.
6 Use a default tree template.
7 Create the ensemble:
ens = fitensemble(X,MPG,'LSBoost',100,'Tree')
ens =
classreg.learning.regr.RegressionEnsemble:
PredictorNames: {'x1' 'x2'}
CategoricalPredictors: []
ResponseName: 'Y'
ResponseTransform: 'none'
NObservations: 94
NTrained: 100
Method: 'LSBoost'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: [100x1 double]
FitInfoDescription: [2x83 char]
Regularization: []
8 Predict the mileage of a car with 150 horsepower weighing 2750 lbs:
mileage = ens.predict([150 2750])
mileage =
22.6735
Test Ensemble Quality
Usually you cannot evaluate the predictive quality of an ensemble based on
its performance on training data. Ensembles tend to “overtrain,” meaning
they produce overly optimistic estimates of their predictive power. This
means the result of resubLoss for classification (resubLoss for regression)
usually indicates lower error than you get on new data.
To obtain a better idea of the quality of an ensemble, use one of these methods:
• Evaluate the ensemble on an independent test set (useful when you have a
lot of training data).
• Evaluate the ensemble by cross validation (useful when you don’t have a
lot of training data).
• Evaluate the ensemble on out-of-bag data (useful when you create a bagged
ensemble with fitensemble).
Example: Test Ensemble Quality
This example uses a bagged ensemble so it can use all three methods of
evaluating ensemble quality.
1 Generate an artificial dataset with 20 predictors. Each entry is a random
number from 0 to 1. The initial classification:
Y = 1 when X(1) + X(2) + X(3) + X(4) + X(5) > 2.5
Y = 0 otherwise.
rng(1,'twister') % for reproducibility
X = rand(2000,20);
Y = sum(X(:,1:5),2) > 2.5;
In addition, to add noise to the results, randomly switch 10% of the
classifications:
idx = randsample(2000,200);
Y(idx) = ~Y(idx);
2 Independent Test Set
Create independent training and test sets of data. Use 70% of the data for
a training set by calling cvpartition with the holdout option:
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
3 Create a bagged classification ensemble of 200 trees from the training data:
bag = fitensemble(Xtrain,Ytrain,'Bag',200,'Tree',...
'type','classification')
bag =
classreg.learning.classif.ClassificationBaggedEnsemble:
PredictorNames: {1x20 cell}
CategoricalPredictors: []
ResponseName: 'Y'
ClassNames: [0 1]
ScoreTransform: 'none'
NObservations: 1400
NTrained: 200
Method: 'Bag'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: []
FitInfoDescription: 'None'
FResample: 1
Replace: 1
UseObsForLearner: [1400x200 logical]
4 Plot the loss (misclassification) of the test data as a function of the number
of trained trees in the ensemble:
figure;
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
xlabel('Number of trees');
ylabel('Test classification error');
5 Cross validation
Generate a five-fold cross-validated bagged ensemble:
cv = fitensemble(X,Y,'Bag',200,'Tree',...
'type','classification','kfold',5)
cv =
classreg.learning.partition.ClassificationPartitionedEnsemble:
CrossValidatedModel: 'Bag'
PredictorNames: {1x20 cell}
CategoricalPredictors: []
ResponseName: 'Y'
NObservations: 2000
KFold: 5
Partition: [1x1 cvpartition]
NTrainedPerFold: [200 200 200 200 200]
ClassNames: [0 1]
ScoreTransform: 'none'
6 Examine the cross validation loss as a function of the number of trees in
the ensemble:
figure;
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
hold
plot(kfoldLoss(cv,'mode','cumulative'),'r.');
hold off;
xlabel('Number of trees');
ylabel('Classification error');
legend('Test','Cross-validation','Location','NE');
Cross validating gives comparable estimates to those of the independent
set.
7 Out-of-Bag Estimates
Generate the loss curve for out-of-bag estimates, and plot it along with
the other curves:
figure;
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
hold
plot(kfoldLoss(cv,'mode','cumulative'),'r.');
plot(oobLoss(bag,'mode','cumulative'),'k--');
hold off;
xlabel('Number of trees');
ylabel('Classification error');
legend('Test','Cross-validation','Out of bag','Location','NE');
The out-of-bag estimates are again comparable to those of the other
methods.
Classification: Imbalanced Data or Unequal
Misclassification Costs
In many real-world applications, you might prefer to treat classes in your
data asymmetrically. For example, you might have data with many more
observations of one class than of any other. Or you might work on a problem in
which misclassifying observations of one class has more severe consequences
than misclassifying observations of another class. In such situations, you can
use two optional parameters for fitensemble: prior and cost.
By using prior, you set prior class probabilities (that is, class probabilities
used for training). Use this option if some classes are under- or
overrepresented in your training set. For example, you might obtain your
training data by simulation. Because simulating class A is more expensive
than class B, you opt to generate fewer observations of class A and more
observations of class B. You expect, however, that class A and class B are mixed
in a different proportion in the real world. In this case, set prior probabilities
for class A and B approximately to the values you expect to observe in the real
world. fitensemble normalizes prior probabilities to make them add up to 1;
multiplying all prior probabilities by the same positive factor does not affect
the result of classification.
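For instance, a sketch of passing prior class probabilities (the data and the probability values are purely illustrative):

rng(0,'twister')                    % for reproducibility
X = rand(100,3);
Y = X(:,1) > 0.3;                   % imbalanced two-class labels
ens = fitensemble(X,Y,'AdaBoostM1',50,'Tree',...
    'prior',[0.5 0.5]);             % train as if the classes were balanced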
If classes are adequately represented in the training data but you want to
treat them asymmetrically, use the cost parameter. Suppose you want to
classify benign and malignant tumors in cancer patients. Failure to identify
a malignant tumor (false negative) has far more severe consequences than
misidentifying benign as malignant (false positive). You should assign high
cost to misidentifying malignant as benign and low cost to misidentifying
benign as malignant.
You must pass misclassification costs as a square matrix with nonnegative
elements. Element C(i,j) of this matrix is the cost of classifying an
observation into class j if the true class is i. The diagonal elements C(i,i) of
the cost matrix must be 0. For the example above, you can choose malignant
tumor to be class 1 and benign tumor to be class 2. Then you can set the
cost matrix to
    [ 0  c
      1  0 ]
where c > 1 is the cost of misidentifying a malignant tumor as benign. Costs
are relative—multiplying all costs by the same positive factor does not affect
the result of classification.
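For example, a sketch of such a cost matrix with c = 10 (an arbitrary value), using the class order malignant = class 1, benign = class 2:

cost = [0 10; 1 0];   % cost of classifying malignant (class 1) as benign (class 2) is 10
% Pass the matrix when training, for example:
% ens = fitensemble(X,Y,'AdaBoostM1',100,'Tree','cost',cost);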
If you have only two classes, fitensemble adjusts their prior probabilities:
the adjusted prior for class i is Cij·Pi for i = 1,2 and j ≠ i, where Pi is the
prior probability either passed into fitensemble or computed from class
frequencies in the training data. Then fitensemble uses the default cost matrix

    [ 0  1
      1  0 ]
and these adjusted probabilities for training its weak learners. Manipulating
the cost matrix is thus equivalent to manipulating the prior probabilities.
If you have three or more classes, fitensemble also converts input costs
into adjusted prior probabilities. This conversion is more complex. First,
fitensemble attempts to solve a matrix equation described in Zhou and
Liu [15]. If it fails to find a solution, fitensemble applies the “average
cost” adjustment described in Breiman et al. [5]. For more information, see
Zadrozny, Langford, and Abe [14].
Example: Unequal Classification Costs
This example uses data on patients with hepatitis to see if they
live or die as a result of the disease. The data is described at
http://archive.ics.uci.edu/ml/datasets/Hepatitis.
1 Load the data into a file named hepatitis.txt:
s = urlread(['http://archive.ics.uci.edu/ml/' ...
'machine-learning-databases/hepatitis/hepatitis.data']);
fid = fopen('hepatitis.txt','w');
fwrite(fid,s);
fclose(fid);
2 Load the data hepatitis.txt into a dataset, with variable names
describing the fields in the data:
VarNames = {'die_or_live' 'age' 'sex' 'steroid' 'antivirals' 'fatigue' ...
'malaise' 'anorexia' 'liver_big' 'liver_firm' 'spleen_palpable' ...
'spiders' 'ascites' 'varices' 'bilirubin' 'alk_phosphate' 'sgot' ...
'albumin' 'protime' 'histology'};
ds = dataset('file','hepatitis.txt','VarNames',VarNames,...
'Delimiter',',','ReadVarNames',false,'TreatAsEmpty','?',...
'Format','%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f');
ds is a dataset with 155 observations and 20 variables:
size(ds)
ans =
   155    20
3 Convert the data in the dataset to the format for ensembles: a numeric
matrix of predictors, and a cell array with outcome names: 'Die' or
'Live'. The first field in the dataset has the outcomes.
X = double(ds(:,2:end));
ClassNames = {'Die' 'Live'};
Y = ClassNames(ds.die_or_live);
4 Inspect the data for missing values:
figure;
bar(sum(isnan(X),1)/size(X,1));
xlabel('Predictor');
ylabel('Fraction of missing values');
Most predictors have missing values, and one has nearly 45% of missing
values. Therefore, use decision trees with surrogate splits for better
accuracy. Because the dataset is small, training time with surrogate splits
should be tolerable.
5 Create a classification tree template that uses surrogate splits:
rng(0,'twister') % for reproducibility
t = ClassificationTree.template('surrogate','on');
6 Examine the data or the description of the data to see which predictors
are categorical:
X(1:5,:)
ans =
    [first five rows of X, displayed in column groups 1-6, 7-12, 13-18, and 19;
     columns 2 through 13 and column 19 contain only the values 1, 2, or NaN,
     while the remaining columns contain continuous measurements, some of them
     missing (NaN)]
It appears that predictors 2 through 13 are categorical, as well as predictor
19. You can confirm this inference with the dataset description at
http://archive.ics.uci.edu/ml/datasets/Hepatitis.
7 List the categorical variables:
ncat = [2:13,19];
8 Create a cross-validated ensemble using 200 learners and the GentleBoost
algorithm:
a = fitensemble(X,Y,'GentleBoost',200,t,...
'PredictorNames',VarNames(2:end),'LearnRate',0.1,...
'CategoricalPredictors',ncat,'kfold',5);
figure;
plot(kfoldLoss(a,'mode','cumulative','lossfun','exponential'));
xlabel('Number of trees');
ylabel('Cross-validated exponential loss');
9 Inspect the confusion matrix to see which people the ensemble predicts
correctly:
[Yfit,Sfit] = kfoldPredict(a);
confusionmat(Y,Yfit,'order',ClassNames)
ans =
    16    16
    10   113
Of the 123 people who live, the ensemble predicts correctly that 113 will
live. But for the 32 people who die of hepatitis, the ensemble only predicts
correctly that half will die of hepatitis.
10 There are two types of error in the predictions of the ensemble:
• Predicting that the patient lives, but the patient dies
• Predicting that the patient dies, but the patient lives
Suppose you believe that the first error is five times worse than the second.
Make a new classification cost matrix that reflects this belief:
cost.ClassNames = ClassNames;
cost.ClassificationCosts = [0 5; 1 0];
11 Create a new cross-validated ensemble using cost as misclassification
cost, and inspect the resulting confusion matrix:
aC = fitensemble(X,Y,'GentleBoost',200,t,...
'PredictorNames',VarNames(2:end),'LearnRate',0.1,...
'CategoricalPredictors',ncat,'kfold',5,...
'cost',cost);
[YfitC,SfitC] = kfoldPredict(aC);
confusionmat(Y,YfitC,'order',ClassNames)
ans =
    19    13
     9   114
As expected, the new ensemble does a better job classifying the people
who die. Somewhat surprisingly, the new ensemble also does a better
job classifying the people who live, though the result is not statistically
significantly better. The results of the cross validation are random, so this
result is simply a statistical fluctuation. The result seems to indicate that
the classification of people who live is not very sensitive to the cost.
Example: Classification with Many Categorical Levels
Generally, you cannot use classification with more than 31 levels in any
categorical predictor. However, two boosting algorithms can classify data
with many categorical levels: LogitBoost and GentleBoost. For details, see
“LogitBoost” on page 13-125 and “GentleBoost” on page 13-126.
This example uses demographic data from the U.S. Census, available at
http://archive.ics.uci.edu/ml/machine-learning-databases/adult/.
The objective of the researchers who posted the data is predicting whether an
individual makes more than $50,000/year, based on a set of characteristics.
You can see details of the data, including predictor names, in the adult.names
file at the site.
1 Load the 'adult.data' file from the UCI Machine Learning Repository:
s = urlread(['http://archive.ics.uci.edu/ml/' ...
'machine-learning-databases/adult/adult.data']);
2 'adult.data' represents missing data as '?'. Replace instances of
missing data with the blank string '':
s = strrep(s,'?','');
3 Put the data into a MATLAB dataset array:
fid = fopen('adult.txt','w');
fwrite(fid,s);
fclose(fid);
clear s;
VarNames = {'age' 'workclass' 'fnlwgt' 'education' 'education_num' ...
'marital_status' 'occupation' 'relationship' 'race' ...
'sex' 'capital_gain' 'capital_loss' ...
'hours_per_week' 'native_country' 'income'};
ds = dataset('file','adult.txt','VarNames',VarNames,...
'Delimiter',',','ReadVarNames',false,'Format',...
'%u%s%u%s%u%s%s%s%s%s%u%u%u%s%s');
cat = ~datasetfun(@isnumeric,ds(:,1:end-1)); % Logical indices of categorical variables
catcol = find(cat); % indices of categorical variables
4 Many predictors in the data are categorical. Convert those fields in the
dataset array to nominal:
ds.workclass = nominal(ds.workclass);
ds.education = nominal(ds.education);
ds.marital_status = nominal(ds.marital_status);
ds.occupation = nominal(ds.occupation);
ds.relationship = nominal(ds.relationship);
ds.race = nominal(ds.race);
ds.sex = nominal(ds.sex);
ds.native_country = nominal(ds.native_country);
ds.income = nominal(ds.income);
5 Convert the dataset array into numerical variables for fitensemble:
X = double(ds(:,1:end-1));
Y = ds.income;
6 Some variables have many levels. Plot the number of levels of each
predictor:
ncat = zeros(1,numel(catcol));
for c=1:numel(catcol)
[~,gn] = grp2idx(X(:,catcol(c)));
ncat(c) = numel(gn);
end
figure;
bar(catcol,ncat);
xlabel('Predictor');
ylabel('Number of categories');
Predictor 14 ('native_country') has more than 40 categorical levels. This
is too many levels for any method except LogitBoost and GentleBoost.
7 Create classification ensembles using both LogitBoost and GentleBoost:
lb = fitensemble(X,Y,'LogitBoost',300,'Tree','CategoricalPredictors',cat,...
'PredictorNames',VarNames(1:end-1),'ResponseName','income');
gb = fitensemble(X,Y,'GentleBoost',300,'Tree','CategoricalPredictors',cat,...
'PredictorNames',VarNames(1:end-1),'ResponseName','income');
8 Examine the resubstitution error for the two ensembles:
figure;
plot(resubLoss(lb,'mode','cumulative'));
hold on
plot(resubLoss(gb,'mode','cumulative'),'r--');
hold off
xlabel('Number of trees');
ylabel('Resubstitution error');
legend('LogitBoost','GentleBoost','Location','NE');
The algorithms have similar resubstitution error.
9 Estimate the generalization error for the two algorithms by cross validation.
lbcv = crossval(lb,'kfold',5);
gbcv = crossval(gb,'kfold',5);
figure;
plot(kfoldLoss(lbcv,'mode','cumulative'));
hold on
plot(kfoldLoss(gbcv,'mode','cumulative'),'r--');
hold off
xlabel('Number of trees');
ylabel('Cross-validated error');
legend('LogitBoost','GentleBoost','Location','NE');
The cross-validated loss is nearly the same as the resubstitution error.
Example: Surrogate Splits
When you have missing data, trees and ensembles of trees give better
predictions when they include surrogate splits. Furthermore, estimates of
predictor importance are often different with surrogate splits. Eliminating
unimportant predictors can save time and memory for predictions, and can
make predictions easier to understand.
This example shows the effects of surrogate splits for predictions for data
containing missing entries in both training and test sets. There is a redundant
predictor in the data, which the surrogate split uses to infer missing values.
While the example is artificial, it shows the value of surrogate splits with
missing data.
1 Generate and plot two different normally-distributed populations, one with
5000 members, one with 10,000 members:
rng(1,'twister') % for reproducibility
N = 5000;
N1 = 2*N; % number in population 1
N2 = N; % number in population 2
mu1 = [-1 -1]/2; % mean of population 1
mu2 = [1 1]/2; % mean of population 2
S1 = [3 -2.5;...
      -2.5 3]; % variance of population 1
S2 = [3 2.5;...
      2.5 3]; % variance of population 2
X1 = mvnrnd(mu1,S1,N1); % population 1
X2 = mvnrnd(mu2,S2,N2); % population 2
X = [X1; X2]; % total population
Y = ones(N1+N2,1); % label population 1
Y(N1+1:end) = 2; % label population 2
figure
plot(X1(:,1),X1(:,2),'k.','MarkerSize',2)
hold on
plot(X2(:,1),X2(:,2),'rx','MarkerSize',3);
hold off
axis square
There is a good deal of overlap between the data points. You cannot expect
perfect classification of this data.
2 Make a third predictor that is the same as the first component of X:
X = [X X(:,1)];
3 Remove half the values of predictor 1 at random:
X(rand(size(X(:,1))) < 0.5,1) = NaN;
4 Partition the data into a training set and a test set:
cv = cvpartition(Y,'holdout',0.3); % 30% test data
Xtrain = X(training(cv),:);
Ytrain = Y(training(cv));
Xtest = X(test(cv),:);
Ytest = Y(test(cv));
5 Create two Bag ensembles: one with surrogate splits, one without. First
create the template for surrogate splits, then train both ensembles:
templS = ClassificationTree.template('surrogate','on');
bag = fitensemble(Xtrain,Ytrain,'Bag',50,'Tree',...
'type','class','nprint',10);
Training Bag...
Grown weak learners: 10
Grown weak learners: 20
Grown weak learners: 30
Grown weak learners: 40
Grown weak learners: 50
bagS = fitensemble(Xtrain,Ytrain,'Bag',50,templS,...
'type','class','nprint',10);
Training Bag...
Grown weak learners: 10
Grown weak learners: 20
Grown weak learners: 30
Grown weak learners: 40
Grown weak learners: 50
6 Examine the accuracy of the two ensembles for predicting the test data:
figure
plot(loss(bag,Xtest,Ytest,'mode','cumulative'));
hold on
plot(loss(bagS,Xtest,Ytest,'mode','cumulative'),'r--');
hold off;
legend('Without surrogate splits','With surrogate splits');
xlabel('Number of trees');
ylabel('Test classification error');
The ensemble with surrogate splits is obviously more accurate than the
ensemble without surrogate splits.
7 Check the statistical significance of the difference in results with the
McNemar test:
Yfit = predict(bag,Xtest);
YfitS = predict(bagS,Xtest);
N10 = sum(Yfit==Ytest & YfitS~=Ytest);
N01 = sum(Yfit~=Ytest & YfitS==Ytest);
mcnemar = (abs(N10-N01) - 1)^2/(N10+N01);
pval = 1 - chi2cdf(mcnemar,1)
pval =
0
The extremely low p-value indicates that the ensemble with surrogate
splits is better in a statistically significant manner.
Ensemble Regularization
Regularization is a process of choosing fewer weak learners for an ensemble
in a way that does not diminish predictive performance. Currently you can
regularize regression ensembles.
The regularize method finds an optimal set of learner weights αt that
minimize
    Σ_{n=1..N} wn · g( Σ_{t=1..T} αt ht(xn), yn )  +  λ · Σ_{t=1..T} |αt|
Here
• λ ≥ 0 is a parameter you provide, called the lasso parameter.
• ht is a weak learner in the ensemble trained on N observations with
predictors xn, responses yn, and weights wn.
• g(f,y) = (f – y)² is the squared error.
The ensemble is regularized on the same (xn,yn,wn) data used for training, so
    Σ_{n=1..N} wn · g( Σ_{t=1..T} αt ht(xn), yn )
is the ensemble resubstitution error (MSE).
If you use λ = 0, regularize finds the weak learner weights by minimizing
the resubstitution MSE. Ensembles tend to overtrain. In other words, the
resubstitution error is typically smaller than the true generalization error.
By making the resubstitution error even smaller, you are likely to make
the ensemble accuracy worse instead of improving it. On the other hand,
positive values of λ push the magnitude of the αt coefficients to 0. This often
improves the generalization error. Of course, if you choose λ too large, all the
optimal coefficients are 0, and the ensemble does not have any accuracy.
Usually you can find an optimal range for λ in which the accuracy of the
regularized ensemble is better or comparable to that of the full ensemble
without regularization.
A nice feature of lasso regularization is its ability to drive the optimized
coefficients precisely to zero. If a learner’s weight αt is 0, this learner can be
excluded from the regularized ensemble. In the end, you get an ensemble with
improved accuracy and fewer learners.
Example: Regularizing a Regression Ensemble
This example uses data for predicting the insurance risk of a car based on
its many attributes.
1 Load the imports-85 data into the MATLAB workspace:
load imports-85;
2 Look at a description of the data to find the categorical variables and
predictor names:
Description
Description =
1985 Auto Imports Database from the UCI repository
http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
Variables have been reordered to place variables with numeric values (referred
to as "continuous" on the UCI site) to the left and categorical values to the
right. Specifically, variables 1:16 are: symboling, normalized-losses,
wheel-base, length, width, height, curb-weight, engine-size, bore, stroke,
compression-ratio, horsepower, peak-rpm, city-mpg, highway-mpg, and price.
Variables 17:26 are: make, fuel-type, aspiration, num-of-doors, body-style,
drive-wheels, engine-location, engine-type, num-of-cylinders, and fuel-system.
The objective of this process is to predict the “symboling,” the first variable
in the data, from the other predictors. “symboling” is an integer from
-3 (good insurance risk) to 3 (poor insurance risk). You could use a
classification ensemble to predict this risk instead of a regression ensemble.
As stated in “Steps in Supervised Learning (Machine Learning)” on page
13-2, when you have a choice between regression and classification,
you should try regression first. Furthermore, this example is to show
regularization, which currently works only for regression.
3 Prepare the data for ensemble fitting:
Y = X(:,1);
X(:,1) = [];
VarNames = {'normalized-losses' 'wheel-base' 'length' 'width' 'height' ...
'curb-weight' 'engine-size' 'bore' 'stroke' 'compression-ratio' ...
'horsepower' 'peak-rpm' 'city-mpg' 'highway-mpg' 'price' 'make' ...
'fuel-type' 'aspiration' 'num-of-doors' 'body-style' 'drive-wheels' ...
'engine-location' 'engine-type' 'num-of-cylinders' 'fuel-system'};
catidx = 16:25; % indices of categorical predictors
4 Create a regression ensemble from the data using 300 default trees:
ls = fitensemble(X,Y,'LSBoost',300,'Tree','LearnRate',0.1,...
'PredictorNames',VarNames,'ResponseName','symboling',...
'CategoricalPredictors',catidx)
ls =
classreg.learning.regr.RegressionEnsemble:
PredictorNames: {1x25 cell}
CategoricalPredictors: [16 17 18 19 20 21 22 23 24 25]
ResponseName: 'symboling'
ResponseTransform: 'none'
NObservations: 205
NTrained: 300
Method: 'LSBoost'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: [300x1 double]
FitInfoDescription: [2x83 char]
Regularization: []
The final line, Regularization, is empty ([]). To regularize the ensemble,
you have to use the regularize method.
5 Cross validate the ensemble, and inspect its loss curve.
cv = crossval(ls,'kfold',5);
figure;
plot(kfoldLoss(cv,'mode','cumulative'));
xlabel('Number of trees');
ylabel('Cross-validated MSE');
It appears you might obtain satisfactory performance from a smaller
ensemble, perhaps one containing from 50 to 100 trees.
6 Call the regularize method to try to find trees that you can remove from
the ensemble. By default, regularize examines 10 values of the lasso
(Lambda) parameter spaced exponentially.
ls = regularize(ls)
ls =
classreg.learning.regr.RegressionEnsemble:
PredictorNames: {1x25 cell}
CategoricalPredictors: [16 17 18 19 20 21 22 23 24 25]
ResponseName: 'symboling'
ResponseTransform: 'none'
NObservations: 205
NTrained: 300
Method: 'LSBoost'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: [300x1 double]
FitInfoDescription: [2x83 char]
Regularization: [1x1 struct]
The Regularization property is no longer empty.
7 Plot the resubstitution mean-squared error (MSE) and number of learners
with nonzero weights against the lasso parameter. Separately plot the
value at Lambda=0. Use a logarithmic scale since the values of Lambda are
exponentially spaced.
figure;
semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE);
line([1e-3 1e-3],[ls.Regularization.ResubstitutionMSE(1) ...
ls.Regularization.ResubstitutionMSE(1)],...
'marker','x','markersize',12,'color','b');
r0 = resubLoss(ls);
line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],...
[r0 r0],'color','r','LineStyle','--');
xlabel('Lambda');
ylabel('Resubstitution MSE');
annotation('textbox',[0.5 0.22 0.5 0.05],'String','unregularized ensemble',...
'color','r','FontSize',14,'LineStyle','none');
figure;
loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1));
line([1e-3 1e-3],...
[sum(ls.Regularization.TrainedWeights(:,1)>0) ...
sum(ls.Regularization.TrainedWeights(:,1)>0)],...
'marker','x','markersize',12,'color','b');
line([ls.Regularization.Lambda(2) ls.Regularization.Lambda(end)],...
[ls.NTrained ls.NTrained],...
'color','r','LineStyle','--');
xlabel('Lambda');
ylabel('Number of learners');
annotation('textbox',[0.3 0.8 0.5 0.05],'String','unregularized ensemble',...
'color','r','FontSize',14,'LineStyle','none');
8 The resubstitution MSE values are likely to be overly optimistic. To obtain
more reliable estimates of the error associated with various values of
Lambda, cross validate the ensemble using cvshrink. Plot the resulting
cross validation loss (MSE) and number of learners against Lambda.
rng(0,'Twister') % for reproducibility
[mse,nlearn] = cvshrink(ls,'lambda',ls.Regularization.Lambda,'kfold',5);
figure;
semilogx(ls.Regularization.Lambda,ls.Regularization.ResubstitutionMSE);
hold;
semilogx(ls.Regularization.Lambda,mse,'r--');
hold off;
xlabel('Lambda');
ylabel('Mean squared error');
legend('resubstitution','cross-validation','Location','NW');
line([1e-3 1e-3],[ls.Regularization.ResubstitutionMSE(1) ...
ls.Regularization.ResubstitutionMSE(1)],...
'marker','x','markersize',12,'color','b');
line([1e-3 1e-3],[mse(1) mse(1)],'marker','o',...
'markersize',12,'color','r','LineStyle','--');
figure;
loglog(ls.Regularization.Lambda,sum(ls.Regularization.TrainedWeights>0,1));
hold;
loglog(ls.Regularization.Lambda,nlearn,'r--');
hold off;
xlabel('Lambda');
ylabel('Number of learners');
legend('resubstitution','cross-validation','Location','NE');
line([1e-3 1e-3],...
[sum(ls.Regularization.TrainedWeights(:,1)>0) ...
sum(ls.Regularization.TrainedWeights(:,1)>0)],...
'marker','x','markersize',12,'color','b');
line([1e-3 1e-3],[nlearn(1) nlearn(1)],'marker','o',...
'markersize',12,'color','r','LineStyle','--');
Examining the cross-validated error shows that the cross-validation MSE
is almost flat for Lambda up to a bit over 1e-2.
9 Examine ls.Regularization.Lambda to find the highest value that gives
MSE in the flat region (up to a bit over 1e-2):
jj = 1:length(ls.Regularization.Lambda);
[jj;ls.Regularization.Lambda]
ans =
  Columns 1 through 6
    1.0000    2.0000    3.0000    4.0000    5.0000    6.0000
         0    0.0014    0.0033    0.0077    0.0183    0.0435
  Columns 7 through 10
    7.0000    8.0000    9.0000   10.0000
    0.1031    0.2446    0.5800    1.3754
Element 5 of ls.Regularization.Lambda has value 0.0183, the largest
in the flat range.
10 Reduce the ensemble size using the shrink method. shrink returns a
compact ensemble with no training data. The generalization error for the
new compact ensemble was already estimated by cross validation in mse(5).
cmp = shrink(ls,'weightcolumn',5)
cmp =
classreg.learning.regr.CompactRegressionEnsemble:
PredictorNames: {1x25 cell}
CategoricalPredictors: [16 17 18 19 20 21 22 23 24 25]
ResponseName: 'symboling'
ResponseTransform: 'none'
NTrained: 18
There are only 18 trees in the new ensemble, notably reduced from the
300 in ls.
11 Compare the sizes of the ensembles:
sz(1) = whos('cmp'); sz(2) = whos('ls');
[sz(1).bytes sz(2).bytes]
ans =
      162270     2791024
The reduced ensemble is about 6% the size of the original.
12 Compare the MSE of the reduced ensemble to that of the original ensemble:
figure;
plot(kfoldLoss(cv,'mode','cumulative'));
hold on
plot(cmp.NTrained,mse(5),'ro','MarkerSize',12);
xlabel('Number of trees');
ylabel('Cross-validated MSE');
legend('unregularized ensemble','regularized ensemble',...
'Location','NE');
hold off
The reduced ensemble gives low loss while using many fewer trees.
Example: Tuning RobustBoost
The RobustBoost algorithm can make good classification predictions even
when the training data has noise. However, the default RobustBoost
parameters can produce an ensemble that does not predict well. This example
shows one way of tuning the parameters for better predictive accuracy.
1 Generate data with label noise. This example has twenty uniform random
numbers per observation, and classifies the observation as 1 if the sum
of the first five numbers exceeds 2.5 (so is larger than average), and 0
otherwise:
rng(0,'twister') % for reproducibility
Xtrain = rand(2000,20);
Ytrain = sum(Xtrain(:,1:5),2) > 2.5;
2 To add noise, randomly switch 10% of the classifications:
idx = randsample(2000,200);
Ytrain(idx) = ~Ytrain(idx);
3 Create an ensemble with AdaBoostM1 for comparison purposes:
ada = fitensemble(Xtrain,Ytrain,'AdaBoostM1',...
300,'Tree','LearnRate',0.1);
4 Create an ensemble with RobustBoost. Since the data has 10% incorrect
classification, perhaps an error goal of 15% is reasonable.
rb1 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,...
'Tree','RobustErrorGoal',0.15,'RobustMaxMargin',1);
5 Try setting a high value of the error goal, 0.6. You get an error:
rb2 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,'Tree','RobustErrorGoal',0.6)
??? Error using ==> RobustBoost>RobustBoost.RobustBoost at 33
For the chosen values of 'RobustMaxMargin' and 'RobustMarginSigma', you must set
'RobustErrorGoal' to a value between 0 and 0.5.
6 Create an ensemble with an error goal in the allowed range, 0.4:
rb2 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,...
'Tree','RobustErrorGoal',0.4);
7 Create an ensemble with very optimistic error goal, 0.01:
rb3 = fitensemble(Xtrain,Ytrain,'RobustBoost',300,...
'Tree','RobustErrorGoal',0.01);
8 Compare the resubstitution error of the four ensembles:
figure
plot(resubLoss(rb1,'mode','cumulative'));
hold on
plot(resubLoss(rb2,'mode','cumulative'),'r--');
plot(resubLoss(rb3,'mode','cumulative'),'k-.');
plot(resubLoss(ada,'mode','cumulative'),'g.');
hold off;
xlabel('Number of trees');
ylabel('Resubstitution error');
legend('ErrorGoal=0.15','ErrorGoal=0.4','ErrorGoal=0.01',...
'AdaBoostM1','Location','NE');
All the RobustBoost curves show lower resubstitution error than the
AdaBoostM1 curve. The error goal of 0.15 curve shows the lowest
resubstitution error over most of the range. However, its error is rising in
the latter half of the plot, while the other curves are still descending.
9 Generate test data to see the predictive power of the ensembles. Test the
four ensembles:
Xtest = rand(2000,20);
Ytest = sum(Xtest(:,1:5),2) > 2.5;
idx = randsample(2000,200);
Ytest(idx) = ~Ytest(idx);
figure;
plot(loss(rb1,Xtest,Ytest,'mode','cumulative'));
hold on
plot(loss(rb2,Xtest,Ytest,'mode','cumulative'),'r--');
plot(loss(rb3,Xtest,Ytest,'mode','cumulative'),'k-.');
plot(loss(ada,Xtest,Ytest,'mode','cumulative'),'g.');
hold off;
xlabel('Number of trees');
ylabel('Test error');
legend('ErrorGoal=0.15','ErrorGoal=0.4','ErrorGoal=0.01',...
'AdaBoostM1','Location','NE');
The error curve for error goal 0.15 is lowest (best) in the plotted range. The
curve for error goal 0.4 seems to be converging to a similar value for a large
number of trees, but more slowly. AdaBoostM1 has higher error than the
curve for error goal 0.15. The curve for the too-optimistic error goal 0.01
remains substantially higher (worse) than the other algorithms for most
of the plotted range.
TreeBagger Examples
TreeBagger ensembles have more functionality than those constructed with
fitensemble; see “TreeBagger Features Not in fitensemble” on page 13-120.
Also, some property and method names differ from their fitensemble
counterparts. This section contains regression and classification workflow
examples that use this extra TreeBagger functionality.
Workflow Example: Regression of Insurance Risk Rating for
Car Imports with TreeBagger
In this example, use a database of 1985 car imports with 205 observations,
25 input variables, and one response variable, insurance risk rating, or
“symboling.” The first 15 variables are numeric and the last 10 are categorical.
The symboling index takes integer values from –3 to 3.
1 Load the dataset and split it into predictor and response arrays:
load imports-85;
Y = X(:,1);
X = X(:,2:end);
2 Because bagging uses randomized data drawings, its exact outcome
depends on the initial random seed. To reproduce the exact results in this
example, use the random stream settings:
rng(1945,'twister')
Finding the Optimal Leaf Size. For regression, the general rule is to
set leaf size to 5 and select one third of input features for decision splits at
random. In the following step, verify the optimal leaf size by comparing
mean-squared errors obtained by regression for various leaf sizes. oobError
computes MSE versus the number of grown trees. You must set oobpred to
'on' to obtain out-of-bag predictions later.
leaf = [1 5 10 20 50 100];
col = 'rgbcmy';
figure(1);
for i=1:length(leaf)
b = TreeBagger(50,X,Y,'method','r','oobpred','on',...
'cat',16:25,'minleaf',leaf(i));
plot(oobError(b),col(i));
hold on;
end
xlabel('Number of Grown Trees');
ylabel('Mean Squared Error');
legend({'1' '5' '10' '20' '50' '100'},'Location','NorthEast');
hold off;
The red (leaf size 1) curve gives the lowest MSE values.
Estimating Feature Importance.
1 In practical applications, you typically grow ensembles with hundreds of
trees. Only 50 trees were used in “Finding the Optimal Leaf Size” on page
13-97 for faster processing. Now that you have estimated the optimal leaf
size, grow a larger ensemble with 100 trees and use it for estimation of
feature importance:
b = TreeBagger(100,X,Y,'method','r','oobvarimp','on',...
'cat',16:25,'minleaf',1);
2 Inspect the error curve again to make sure nothing went wrong during
training:
figure(2);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Squared Error');
Prediction ability should depend more on important features and less on
unimportant features. You can use this idea to measure feature importance.
For each feature, you can permute the values of this feature across all of the
observations in the data set and measure how much worse the mean-squared
error (MSE) becomes after the permutation. You can repeat this for each
feature.
1 Using the following code, plot the increase in MSE due to
permuting out-of-bag observations across each input variable. The
OOBPermutedVarDeltaError array stores the increase in MSE averaged
over all trees in the ensemble and divided by the standard deviation taken
over the trees, for each variable. The larger this value, the more important
the variable. Imposing an arbitrary cutoff at 0.65, you can select the five
most important features.
figure(3);
bar(b.OOBPermutedVarDeltaError);
xlabel('Feature Number');
ylabel('Out-Of-Bag Feature Importance');
idxvar = find(b.OOBPermutedVarDeltaError>0.65)
idxvar =
     1     2     4    16    19
2 The OOBIndices property of TreeBagger tracks which observations are out
of bag for what trees. Using this property, you can monitor the fraction of
observations in the training data that are in bag for all trees. The curve
starts at approximately 2/3, the fraction of unique observations selected by
one bootstrap replica, and goes down to 0 at approximately 10 trees.
finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
finbag = finbag / size(X,1);
figure(4);
plot(finbag);
xlabel('Number of Grown Trees');
ylabel('Fraction of in-Bag Observations');
Growing Trees on a Reduced Set of Features. Using just the five most
powerful features selected in “Estimating Feature Importance” on page 13-98,
determine if it is possible to obtain a similar predictive power. To begin, grow
100 trees on these features only. The first three of the five selected features
are numeric and the last two are categorical.
b5v = TreeBagger(100,X(:,idxvar),Y,'method','r',...
'oobvarimp','on','cat',4:5,'minleaf',1);
figure(5);
plot(oobError(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Squared Error');
figure(6);
bar(b5v.OOBPermutedVarDeltaError);
xlabel('Feature Index');
ylabel('Out-of-Bag Feature Importance');
These five most powerful features give the same MSE as the full set, and
the ensemble trained on the reduced set ranks these features similarly to
each other. Features 1 and 2 from the reduced set perhaps could be removed
without a significant loss in the predictive power.
Finding Outliers. To find outliers in the training data, compute the
proximity matrix using fillProximities:
b5v = fillProximities(b5v);
The method normalizes this measure by subtracting the mean outlier measure
for the entire sample, taking the magnitude of this difference and dividing the
result by the median absolute deviation for the entire sample:
figure(7);
hist(b5v.OutlierMeasure);
xlabel('Outlier Measure');
ylabel('Number of Observations');
Discovering Clusters in the Data. By applying multidimensional scaling
to the computed matrix of proximities, you can inspect the structure of the
input data and look for possible clusters of observations. The mdsProx method
returns scaled coordinates and eigenvalues for the computed proximity
matrix. If run with the colors option, this method makes a scatter plot of
two scaled coordinates, first and second by default.
figure(8);
[~,e] = mdsProx(b5v,'colors','k');
xlabel('1st Scaled Coordinate');
ylabel('2nd Scaled Coordinate');
Assess the relative importance of the scaled axes by plotting the first 20
eigenvalues:
figure(9);
bar(e(1:20));
xlabel('Scaled Coordinate Index');
ylabel('Eigenvalue');
Saving the Ensemble Configuration for Future Use. To use the trained
ensemble for predicting the response on unseen data, store the ensemble
to disk and retrieve it later. If you do not want to compute predictions for
out-of-bag data or reuse training data in any other way, there is no need to
store the ensemble object itself. Saving the compact version of the ensemble
would be enough in this case. Extract the compact object from the ensemble:
c = compact(b5v)
c =
Ensemble with 100 decision trees:
     Method: regression
      Nvars: 5
This object can now be saved in a *.mat file as usual.
Workflow Example: Classifying Radar Returns for Ionosphere
Data with TreeBagger
You can also use ensembles of decision trees for classification. For this
example, use ionosphere data with 351 observations and 34 real-valued
predictors. The response variable is categorical with two levels:
• 'g' for good radar returns
• 'b' for bad radar returns
The goal is to predict good or bad returns using a set of 34 measurements. The
workflow resembles that for “Workflow Example: Regression of Insurance
Risk Rating for Car Imports with TreeBagger” on page 13-97.
1 Fix the initial random seed, grow 50 trees, inspect how the ensemble error
changes with accumulation of trees, and estimate feature importance. For
classification, it is best to set the minimal leaf size to 1 and select the square
root of the total number of features for each decision split at random. These
are the default settings for a TreeBagger used for classification.
load ionosphere;
rng(1945,'twister')
b = TreeBagger(50,X,Y,'oobvarimp','on');
figure(10);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Classification Error');
2 When the ensemble contains only a few trees, some observations can be in
bag for all trees. For such observations, it is impossible to compute the true
out-of-bag prediction, and TreeBagger returns the most probable class
for classification and the sample mean for regression. You can change
the default value returned for in-bag observations using the DefaultYfit
property. If you set the default value to an empty string for classification,
the method excludes in-bag observations from computation of the out-of-bag
error. In this case, the curve is more variable when the number of trees
is small, either because some observations are never out of bag (and are
therefore excluded) or because their predictions are based on few trees.
b.DefaultYfit = '';
figure(11);
plot(oobError(b));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Error Excluding in-Bag Observations');
3 The OOBIndices property of TreeBagger tracks which observations are out
of bag for what trees. Using this property, you can monitor the fraction of
observations in the training data that are in bag for all trees. The curve
starts at approximately 2/3, the fraction of unique observations selected by
one bootstrap replica, and goes down to 0 at approximately 10 trees.
finbag = zeros(1,b.NTrees);
for t=1:b.NTrees
finbag(t) = sum(all(~b.OOBIndices(:,1:t),2));
end
finbag = finbag / size(X,1);
figure(12);
plot(finbag);
xlabel('Number of Grown Trees');
ylabel('Fraction of in-Bag Observations');
4 Estimate feature importance:
figure(13);
bar(b.OOBPermutedVarDeltaError);
xlabel('Feature Index');
ylabel('Out-of-Bag Feature Importance');
idxvar = find(b.OOBPermutedVarDeltaError>0.8)
idxvar =
     3     4     5     7     8
5 Having selected the five most important features, grow a larger ensemble
on the reduced feature set. Save time by not permuting out-of-bag
observations to obtain new estimates of feature importance for the reduced
feature set (set oobvarimp to 'off'). You would still be interested in
obtaining out-of-bag estimates of classification error (set oobpred to 'on').
b5v = TreeBagger(100,X(:,idxvar),Y,'oobpred','on');
figure(14);
plot(oobError(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Classification Error');
6 For classification ensembles, in addition to classification error (fraction of
misclassified observations), you can also monitor the average classification
margin. For each observation, the margin is defined as the difference
between the score for the true class and the maximal score for other classes
predicted by this tree. The cumulative classification margin uses the scores
averaged over all trees and the mean cumulative classification margin is
the cumulative margin averaged over all observations. The oobMeanMargin
method with the 'mode' argument set to 'cumulative' (default) shows how
the mean cumulative margin changes as the ensemble grows: every new
element in the returned array represents the cumulative margin obtained
by including a new tree in the ensemble. If training is successful, you would
expect to see a gradual increase in the mean classification margin.
For decision trees, a classification score is the probability of observing an
instance of this class in this tree leaf. For example, if the leaf of a grown
decision tree has five 'good' and three 'bad' training observations in
it, the scores returned by this decision tree for any observation that falls on
this leaf are 5/8 for the 'good' class and 3/8 for the 'bad' class. These
probabilities are called 'scores' for consistency with other classifiers that
might not have an obvious interpretation for numeric values of returned
predictions.
figure(15);
plot(oobMeanMargin(b5v));
xlabel('Number of Grown Trees');
ylabel('Out-of-Bag Mean Classification Margin');
7 Compute the matrix of proximities and look at the distribution of outlier
measures. Unlike regression, outlier measures for classification ensembles
are computed within each class separately.
b5v = fillProximities(b5v);
figure(16);
hist(b5v.OutlierMeasure);
xlabel('Outlier Measure');
ylabel('Number of Observations');
8 All extreme outliers for this dataset come from the 'good' class:
b5v.Y(b5v.OutlierMeasure>40)
ans =
    'g'
    'g'
    'g'
    'g'
    'g'
9 As for regression, you can plot scaled coordinates, displaying the two classes
in different colors using the colors argument of mdsProx. This argument
takes a string in which every character represents a color. To find the order
of classes used by the ensemble, look at the ClassNames property:
b5v.ClassNames
ans =
'g'
'b'
The 'good' class is first and the 'bad' class is second. Display scaled
coordinates using red for 'good' and blue for 'bad' observations:
figure(17);
[s,e] = mdsProx(b5v,'colors','rb');
xlabel('1st Scaled Coordinate');
ylabel('2nd Scaled Coordinate');
10 Plot the first 20 eigenvalues obtained by scaling. The first eigenvalue
in this case clearly dominates and the first scaled coordinate is most
important.
figure(18);
bar(e(1:20));
xlabel('Scaled Coordinate Index');
ylabel('Eigenvalue');
Plotting a Classification Performance Curve. Another way of exploring
the performance of a classification ensemble is to plot its Receiver Operating
Characteristic (ROC) curve or another performance curve suitable for the
current problem. First, obtain predictions for out-of-bag observations. For
a classification ensemble, the oobPredict method returns a cell array of
classification labels ('g' or 'b' for ionosphere data) as the first output
argument and a numeric array of scores as the second output argument.
The returned array of scores has two columns, one for each class. In this
case, the first column is for the 'good' class and the second column is for the
'bad' class. One column in the score matrix is redundant because the scores
represent class probabilities in tree leaves and by definition add up to 1.
[Yfit,Sfit] = oobPredict(b5v);
Use the perfcurve utility (see “Performance Curves” on page 12-9) to
compute a performance curve. By default, perfcurve returns the standard
ROC curve, which is the true positive rate versus the false positive rate.
perfcurve requires true class labels, scores, and the positive class label for
input. In this case, choose the 'good' class as positive. The scores for this
class are in the first column of Sfit.
[fpr,tpr] = perfcurve(b5v.Y,Sfit(:,1),'g');
figure(19);
plot(fpr,tpr);
xlabel('False Positive Rate');
ylabel('True Positive Rate');
Instead of the standard ROC curve, you might want to plot, for example,
ensemble accuracy versus threshold on the score for the 'good' class. The
ycrit input argument of perfcurve lets you specify the criterion for the
y-axis, and the third output argument of perfcurve returns an array of
thresholds for the positive class score. Accuracy is the fraction of correctly
classified observations, or equivalently, 1 minus the classification error.
[fpr,accu,thre] = perfcurve(b5v.Y,Sfit(:,1),'g','ycrit','accu');
figure(20);
plot(thre,accu);
xlabel('Threshold for ''good'' Returns');
ylabel('Classification Accuracy');
The curve shows a flat region indicating that any threshold from 0.2 to 0.6
is a reasonable choice. By default, the function assigns classification labels
using 0.5 as the boundary between the two classes. You can find exactly
what accuracy this corresponds to:
i50 = find(accu>=0.50,1,'first')
accu(abs(thre-0.5)<eps)
returns
i50 =
2
ans =
0.9430
The maximal accuracy is a little higher than the default one:
[maxaccu,iaccu] = max(accu)
returns
maxaccu =
0.9459
iaccu =
91
The optimal threshold is therefore:
thre(iaccu)
ans =
0.5056
Ensemble Algorithms
• “Bagging” on page 13-118
• “AdaBoostM1” on page 13-122
• “AdaBoostM2” on page 13-124
• “LogitBoost” on page 13-125
• “GentleBoost” on page 13-126
• “RobustBoost” on page 13-127
• “LSBoost” on page 13-128
Bagging
Bagging, which stands for “bootstrap aggregation”, is a type of ensemble
learning. To bag a weak learner such as a decision tree on a dataset, generate
many bootstrap replicas of this dataset and grow decision trees on these
replicas. Obtain each bootstrap replica by randomly selecting N observations
out of N with replacement, where N is the dataset size. To find the predicted
response of a trained ensemble, take an average over predictions from
individual trees.
Bagged decision trees were introduced in MATLAB R2009a as TreeBagger.
The fitensemble function lets you bag in a manner consistent with boosting.
An ensemble of bagged trees, either ClassificationBaggedEnsemble or
RegressionBaggedEnsemble, returned by fitensemble offers almost the
same functionality as TreeBagger. Discrepancies between TreeBagger and
the new framework are described in detail in “TreeBagger Features Not in
fitensemble” on page 13-120.
Bagging works by training learners on resampled versions of the data. This
resampling is usually done by bootstrapping observations, that is, selecting N
out of N observations with replacement for every new learner. In addition,
every tree in the ensemble can randomly select predictors for decision
splits—a technique known to improve the accuracy of bagged trees.
By default, the minimal leaf sizes for bagged trees are set to 1 for classification
and 5 for regression. Trees grown with the default leaf size are usually
very deep. These settings are close to optimal for the predictive power of
an ensemble. Often you can grow trees with larger leaves without losing
predictive power. Doing so reduces training and prediction time, as well as
memory usage for the trained ensemble.
Another important parameter is the number of predictors selected at random
for every decision split. This random selection is made for every split, and
every deep tree involves many splits. By default, this parameter is set to
a square root of the number of predictors for classification, and one third
of predictors for regression.
Several features of bagged decision trees make them a unique algorithm.
Drawing N out of N observations with replacement omits on average 37% of
observations for each decision tree. These are “out-of-bag” observations. You
can use them to estimate the predictive power and feature importance. For
each observation, you can estimate the out-of-bag prediction by averaging
over predictions from all trees in the ensemble for which this observation
is out of bag. You can then compare the computed prediction against the
observed response for this observation. By comparing the out-of-bag predicted
responses against the observed responses for all observations used for
training, you can estimate the average out-of-bag error. This out-of-bag
average is an unbiased estimator of the true ensemble error. You can also
obtain out-of-bag estimates of feature importance by randomly permuting
out-of-bag data across one variable or column at a time and estimating the
increase in the out-of-bag error due to this permutation. The larger the
increase, the more important the feature. Thus, you need not supply test
data for bagged ensembles because you obtain reliable estimates of the
predictive power and feature importance in the process of training, which
is an attractive feature of bagging.
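The same out-of-bag machinery is available for bagged ensembles returned by
fitensemble. The following minimal sketch, which uses the ionosphere data as
a stand-in for your own predictors X and labels Y, trains a bagged classification
ensemble and plots its cumulative out-of-bag error without setting aside any
test data:
% Minimal sketch of out-of-bag error estimation for a bagged ensemble
load ionosphere
bag = fitensemble(X,Y,'Bag',100,'Tree','type','classification');
figure;
plot(oobLoss(bag,'mode','cumulative'));   % out-of-bag error vs. number of trees
xlabel('Number of trees');
ylabel('Out-of-bag classification error');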
Another attractive feature of bagged decision trees is the proximity matrix.
Every time two observations land on the same leaf of a tree, their proximity
increases by 1. For normalization, sum these proximities over all trees in
the ensemble and divide by the number of trees. The resulting matrix is
symmetric with diagonal elements equal to 1 and off-diagonal elements
ranging from 0 to 1. You can use this matrix for finding outlier observations
and discovering clusters in the data through multidimensional scaling.
For examples using bagging, see:
• “Example: Test Ensemble Quality” on page 13-59
• “Example: Surrogate Splits” on page 13-76
• “Workflow Example: Regression of Insurance Risk Rating for Car Imports
with TreeBagger” on page 13-97
• “Workflow Example: Classifying Radar Returns for Ionosphere Data with
TreeBagger” on page 13-106
For references related to bagging, see Breiman [2], [3], and [4].
Comparison of TreeBagger and Bagged Ensembles. fitensemble
produces bagged ensembles that have most, but not all, of the functionality of
TreeBagger objects. Additionally, some functionality has different names in
the new bagged ensembles.
TreeBagger Features Not in fitensemble

Feature                                TreeBagger Property               TreeBagger Method
Computation of proximity matrix        Proximity                         fillProximities, mdsProx
Computation of outliers                OutlierMeasure
Out-of-bag estimates of predictor      OOBPermutedVarDeltaError,
importance                             OOBPermutedVarDeltaMeanMargin,
                                       OOBPermutedVarCountRaiseMargin
Merging two ensembles trained                                            append
separately
Parallel computation for creating                                        Set the UseParallel name-value pair
ensemble                                                                 to 'always'; see Chapter 17,
                                                                         “Parallel Statistics”

Differing Names Between TreeBagger and Bagged Ensembles

Feature                                TreeBagger                        Bagged Ensembles
Split criterion contributions for      DeltaCritDecisionSplit property   First output of predictorImportance
each predictor                                                           (classification) or predictorImportance
                                                                         (regression)
Predictor associations                 VarAssoc property                 Second output of predictorImportance
                                                                         (classification) or predictorImportance
                                                                         (regression)
Error (misclassification probability   error and oobError methods        loss and oobLoss methods
or mean-squared error)                                                   (classification); loss and oobLoss
                                                                         methods (regression)
Train additional trees and add to      growTrees method                  resume method (classification);
ensemble                                                                 resume method (regression)
Mean classification margin per tree    meanMargin and oobMeanMargin      edge and oobEdge methods
                                       methods                           (classification)
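The following minimal sketch shows two of these naming pairs side by side;
the ionosphere data stands in for your own predictors X and class labels Y:
% Equivalent out-of-bag error computations through the two interfaces
load ionosphere
tb = TreeBagger(50,X,Y,'oobpred','on');                           % TreeBagger
bag = fitensemble(X,Y,'Bag',50,'Tree','type','classification');   % new framework
errTB = oobError(tb);                       % TreeBagger method
errBag = oobLoss(bag,'mode','cumulative');  % bagged-ensemble counterpart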
In addition, two important changes were made to training and prediction
for bagged classification ensembles:
• If you pass a misclassification cost matrix to TreeBagger, it passes this
matrix along to the trees. If you pass a misclassification cost matrix to
fitensemble, it uses this matrix to adjust the class prior probabilities.
fitensemble then passes the adjusted prior probabilities and the default
cost matrix to the trees. The default cost matrix is ones(K)-eye(K) for K
classes.
• Unlike the loss and edge methods in the new framework, the TreeBagger
error and meanMargin methods do not normalize input observation
weights by the prior probabilities in the respective class.
AdaBoostM1
AdaBoostM1 is a very popular boosting algorithm for binary classification.
The algorithm trains learners sequentially. For every learner with index t,
AdaBoostM1 computes the weighted classification error
\varepsilon_t = \sum_{n=1}^{N} d_n^{(t)} \, I\left( y_n \neq h_t(x_n) \right)

where
• x_n is a vector of predictor values for observation n.
• y_n is the true class label.
• h_t is the prediction of learner (hypothesis) with index t.
• I is the indicator function.
• d_n^{(t)} is the weight of observation n at step t.
AdaBoostM1 then increases weights for observations misclassified by learner
t and reduces weights for observations correctly classified by learner t. The
next learner t + 1 is then trained on the data with updated weights d_n^{(t+1)}.

After training finishes, AdaBoostM1 computes the prediction for new data using

f(x) = \sum_{t=1}^{T} \alpha_t h_t(x),

where
\alpha_t = \frac{1}{2} \log \frac{1 - \varepsilon_t}{\varepsilon_t}

are the weights of the weak hypotheses in the ensemble.
Training by AdaBoostM1 can be viewed as stagewise minimization of the
exponential loss:
\sum_{n=1}^{N} w_n \exp\left( -y_n f(x_n) \right),

where
• y_n ∈ {–1,+1} is the true class label.
• w_n are observation weights normalized to add up to 1.
• f(x_n) ∈ (–∞,+∞) is the predicted classification score.
The observation weights wn are the original observation weights you passed
to fitensemble.
The second output from the predict method of an AdaBoostM1 classification
ensemble is an N-by-2 matrix of classification scores for the two classes and
N observations. predict returns two scores for consistency with multiclass
models, but the second column is redundant because it is always equal to
minus the first column.
Most often AdaBoostM1 is used with decision stumps (default) or shallow
trees. If boosted stumps give poor performance, try setting the minimal
parent node size to one quarter of the training data.
By default, the learning rate for boosting algorithms is 1. If you set the
learning rate to a lower number, the ensemble learns at a slower rate, but can
converge to a better solution. 0.1 is a popular choice for the learning rate.
Learning at a rate less than 1 is often called “shrinkage”.
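For instance, the following minimal sketch, which uses the ionosphere data as
a stand-in for your own binary-classification data, boosts 200 trees at a
learning rate of 0.1 and retrieves the two-column score matrix described above:
load ionosphere
ada = fitensemble(X,Y,'AdaBoostM1',200,'Tree','LearnRate',0.1);
[labels,scores] = predict(ada,X);   % scores is N-by-2; column 2 equals -column 1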
For examples using AdaBoostM1, see “Example: Tuning RobustBoost” on
page 13-92.
For references related to AdaBoostM1, see Freund and Schapire [8], Schapire
et al. [13], Friedman, Hastie, and Tibshirani [10], and Friedman [9].
AdaBoostM2
AdaBoostM2 is an extension of AdaBoostM1 for multiple classes. Instead of
weighted classification error, AdaBoostM2 uses weighted pseudo-loss for N
observations and K classes:
\varepsilon_t = \frac{1}{2} \sum_{n=1}^{N} \sum_{k \neq y_n} d_{n,k}^{(t)} \left( 1 - h_t(x_n, y_n) + h_t(x_n, k) \right),

where
• h_t(x_n,k) is the confidence of prediction by the learner at step t into class k,
ranging from 0 (not at all confident) to 1 (highly confident).
• d_{n,k}^{(t)} are observation weights at step t for class k.
• y_n is the true class label taking one of the K values.
• The second sum is over all classes other than the true class y_n.
Interpreting the pseudo-loss is harder than classification error, but the idea is
the same. Pseudo-loss can be used as a measure of the classification accuracy
from any learner in an ensemble. Pseudo-loss typically exhibits the same
behavior as a weighted classification error for AdaBoostM1: the first few
learners in a boosted ensemble give low pseudo-loss values. After the first
few training steps, the ensemble begins to learn at a slower pace, and the
pseudo-loss value approaches 0.5 from below.
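As a minimal sketch of multiclass boosting, the following uses the Fisher iris
data as a stand-in for your own data, grows an AdaBoostM2 ensemble, and plots
its cumulative resubstitution error:
load fisheriris
ens = fitensemble(meas,species,'AdaBoostM2',100,'Tree');
figure;
plot(resubLoss(ens,'mode','cumulative'));
xlabel('Number of trees');
ylabel('Resubstitution error');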
For examples using AdaBoostM2, see “Creating a Classification Ensemble” on
page 13-57.
For references related to AdaBoostM2, see Freund and Schapire [8].
LogitBoost
LogitBoost is another popular algorithm for binary classification.
LogitBoost works similarly to AdaBoostM1, except it minimizes binomial
deviance
\sum_{n=1}^{N} w_n \log\left( 1 + \exp\left( -2 y_n f(x_n) \right) \right),

where
• y_n ∈ {–1,+1} is the true class label.
• w_n are observation weights normalized to add up to 1.
• f(x_n) ∈ (–∞,+∞) is the predicted classification score.
Binomial deviance assigns less weight to badly misclassified observations
(observations with large negative values of ynf(xn)). LogitBoost can give
better average accuracy than AdaBoostM1 for data with poorly separable
classes.
Learner t in a LogitBoost ensemble fits a regression model to response values
\tilde{y}_n = \frac{y_n^* - p_t(x_n)}{p_t(x_n)\left(1 - p_t(x_n)\right)},

where
• y_n^* ∈ {0,+1} are relabeled classes (0 instead of –1).
• p_t(x_n) is the current ensemble estimate of the probability for observation x_n
to be of class 1.
Fitting a regression model at each boosting step gives a great
computational advantage for data with multilevel categorical predictors.
Take a categorical predictor with L levels. To find the optimal decision split
on such a predictor, a classification tree needs to consider 2^(L–1) – 1 splits. A
regression tree needs to consider only L – 1 splits, so the processing time
can be much shorter. LogitBoost is recommended for categorical predictors
with many levels.
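The following minimal sketch illustrates this recommendation on simulated
data with several many-level categorical predictors. The data, the response
rule, and the choice of 20 levels are all hypothetical, and the sketch assumes
that fitensemble accepts the CategoricalPredictors name-value pair (it also
appears as an ensemble property earlier in this chapter):
rng(1,'twister')        % for reproducibility
X = randi(20,[1000 5]); % five categorical predictors, 20 levels each (hypothetical)
Y = mod(sum(X,2),2);    % hypothetical binary response
lb = fitensemble(X,Y,'LogitBoost',100,'Tree','CategoricalPredictors',1:5);
figure;
plot(resubLoss(lb,'mode','cumulative'));
xlabel('Number of trees');
ylabel('Resubstitution error');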
fitensemble computes and stores the mean-squared error
\sum_{n=1}^{N} d_n^{(t)} \left( \tilde{y}_n - h_t(x_n) \right)^2

in the FitInfo property of the ensemble object. Here
• d_n^{(t)} are observation weights at step t (the weights add up to 1).
• h_t(x_n) are predictions of the regression model h_t fitted to response values \tilde{y}_n.

Values \tilde{y}_n can range from –∞ to +∞, so the mean-squared error does not have
well-defined bounds.
For examples using LogitBoost, see “Example: Classification with Many
Categorical Levels” on page 13-71.
For references related to LogitBoost, see Friedman, Hastie, and Tibshirani
[10].
GentleBoost
GentleBoost (also known as Gentle AdaBoost) combines features of
AdaBoostM1 and LogitBoost. Like AdaBoostM1, GentleBoost minimizes the
exponential loss. But its numeric optimization is set up differently. Like
LogitBoost, every weak learner fits a regression model to response values
yn {–1,+1}. This makes GentleBoost another good candidate for binary
classification of data with multilevel categorical predictors.
fitensemble computes and stores the mean-squared error
\sum_{n=1}^{N} d_n^{(t)} \left( y_n - h_t(x_n) \right)^2

in the FitInfo property of the ensemble object, where
• d_n^{(t)} are observation weights at step t (the weights add up to 1).
• h_t(x_n) are predictions of the regression model h_t fitted to response values y_n.
As the strength of individual learners weakens, the weighted mean-squared
error approaches 1.
For examples using GentleBoost, see “Example: Unequal Classification
Costs” on page 13-66 and “Example: Classification with Many Categorical
Levels” on page 13-71.
For references related to GentleBoost, see Friedman, Hastie, and Tibshirani
[10].
RobustBoost
Boosting algorithms such as AdaBoostM1 and LogitBoost increase weights for
misclassified observations at every boosting step. These weights can become
very large. If this happens, the boosting algorithm sometimes concentrates on
a few misclassified observations and neglects the majority of training data.
Consequently the average classification accuracy suffers.
In this situation, you can try using RobustBoost. This algorithm does not
concentrate almost all of the data weight on badly misclassified observations,
so it can produce better average classification accuracy.
Unlike AdaBoostM1 and LogitBoost, RobustBoost does not minimize a
specific loss function. Instead, it maximizes the number of observations with
the classification margin above a certain threshold.
RobustBoost trains based on time evolution. The algorithm starts at t = 0.
At every step, RobustBoost solves an optimization problem to find a positive
step in time Δt and a corresponding positive change in the average margin
for training data Δm. RobustBoost stops training and exits if at least one of
these three conditions is true:
• Time t reaches 1.
• RobustBoost cannot find a solution to the optimization problem with
positive updates Δt and Δm.
• RobustBoost grows as many learners as you requested.
Results from RobustBoost are usable under any of these termination
conditions. Estimate the classification accuracy by cross validation or by
using an independent test set.
To get better classification accuracy from RobustBoost, you can adjust three
parameters in fitensemble: RobustErrorGoal, RobustMaxMargin, and
RobustMarginSigma. Start by varying values for RobustErrorGoal from 0 to
1. The maximal allowed value for RobustErrorGoal depends on the two other
parameters. If you pass a value that is too high, fitensemble produces an
error message showing the allowed range for RobustErrorGoal.
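For instance, reusing the Xtrain and Ytrain training data from “Example:
Tuning RobustBoost” on page 13-92, the following minimal sketch probes an
out-of-range error goal and reads the allowed range from the error message:
try
    rb = fitensemble(Xtrain,Ytrain,'RobustBoost',300,...
        'Tree','RobustErrorGoal',0.6);
catch err
    disp(err.message)   % the message states the allowed range for RobustErrorGoal
end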
For examples using RobustBoost, see “Example: Tuning RobustBoost” on
page 13-92.
For references related to RobustBoost, see Freund [7].
LSBoost
You can use least squares boosting (LSBoost) to fit regression ensembles.
At every step, the ensemble fits a new learner to the difference between
the observed response and the aggregated prediction of all learners grown
previously. The ensemble fits to minimize mean-squared error.
You can use LSBoost with shrinkage by passing in the LearnRate parameter.
By default this parameter is set to 1, and the ensemble learns at the maximal
speed. If you set LearnRate to a value from 0 to 1, the ensemble fits every
new learner to y_n – ηf(x_n) (see the sketch after this list), where
• yn is the observed response.
• f(xn) is the aggregated prediction from all weak learners grown so far for
observation xn.
• η is the learning rate.
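The following minimal sketch compares the default learning rate with
shrinkage at 0.1; the carsmall data stands in for your own regression data:
load carsmall
X = [Weight Horsepower];
ok = ~any(isnan(X),2) & ~isnan(MPG);      % keep complete cases only
ls1 = fitensemble(X(ok,:),MPG(ok),'LSBoost',200,'Tree');                 % LearnRate = 1
ls2 = fitensemble(X(ok,:),MPG(ok),'LSBoost',200,'Tree','LearnRate',0.1); % shrinkage
figure;
plot(resubLoss(ls1,'mode','cumulative'));
hold on
plot(resubLoss(ls2,'mode','cumulative'),'r--');
hold off
xlabel('Number of trees');
ylabel('Resubstitution MSE');
legend('LearnRate = 1','LearnRate = 0.1');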
For examples using LSBoost, see “Creating a Regression Ensemble” on page
13-58 and “Example: Regularizing a Regression Ensemble” on page 13-82.
For references related to LSBoost, see Hastie, Tibshirani, and Friedman [11],
Chapters 7 (Model Assessment and Selection) and 15 (Random Forests).
Bibliography
[1] Bottou, L., and Chih-Jen Lin. Support Vector Machine Solvers. Available at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.4209
&rep=rep1&type=pdf.
[2] Breiman, L. Bagging Predictors. Machine Learning 26, pp. 123–140, 1996.
[3] Breiman, L. Random Forests. Machine Learning 45, pp. 5–32, 2001.
[4] Breiman, L.
http://www.stat.berkeley.edu/~breiman/RandomForests/
[5] Breiman, L., et al. Classification and Regression Trees. Chapman & Hall,
Boca Raton, 1993.
[6] Christianini, N., and J. Shawe-Taylor. An Introduction to Support Vector
Machines and Other Kernel-Based Learning Methods. Cambridge University
Press, Cambridge, UK, 2000.
[7] Freund, Y. A more robust boosting algorithm. arXiv:0905.2138v1, 2009.
[8] Freund, Y. and R. E. Schapire. A Decision-Theoretic Generalization of
On-Line Learning and an Application to Boosting. J. of Computer and System
Sciences, Vol. 55, pp. 119–139, 1997.
[9] Friedman, J. Greedy function approximation: A gradient boosting machine.
Annals of Statistics, Vol. 29, No. 5, pp. 1189–1232, 2001.
[10] Friedman, J., T. Hastie, and R. Tibshirani. Additive logistic regression: A
statistical view of boosting. Annals of Statistics, Vol. 28, No. 2, pp. 337–407,
2000.
[11] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical
Learning, second edition. Springer, New York, 2008.
[12] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin. A
Practical Guide to Support Vector Classification. Available at
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
[13] Schapire, R. E. et al. Boosting the margin: A new explanation for the
effectiveness of voting methods. Annals of Statistics, Vol. 26, No. 5, pp.
1651–1686, 1998.
[14] Zadrozny, B., J. Langford, and N. Abe. Cost-Sensitive Learning
by Cost-Proportionate Example Weighting. CiteSeerX. [Online] 2003.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.9780
[15] Zhou, Z.-H. and X.-Y. Liu. On Multi-Class
Cost-Sensitive Learning. CiteSeerX. [Online] 2006.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.92.9999
14
Markov Models
• “Introduction” on page 14-2
• “Markov Chains” on page 14-3
• “Hidden Markov Models (HMM)” on page 14-5
Introduction
Markov processes are examples of stochastic processes—processes that
generate random sequences of outcomes or states according to certain
probabilities. Markov processes are distinguished by being memoryless—their
next state depends only on their current state, not on the history that led them
there. Models of Markov processes are used in a wide variety of applications,
from daily stock prices to the positions of genes in a chromosome.
Markov Chains
Markov Chains
A Markov model is given visual representation with a state diagram, such
as the one below.
State Diagram for a Markov Model
The rectangles in the diagram represent the possible states of the process you
are trying to model, and the arrows represent transitions between states.
The label on each arrow represents the probability of that transition. At
each step of the process, the model may generate an output, or emission,
depending on which state it is in, and then make a transition to another
state. An important characteristic of Markov models is that the next state
depends only on the current state, and not on the history of transitions that
led to the current state.
For example, for a sequence of coin tosses the two states are heads and tails.
The most recent coin toss determines the current state of the model and each
subsequent toss determines the transition to the next state. If the coin is fair,
the transition probabilities are all 1/2. The emission might simply be the
current state. In more complicated models, random processes at each state
will generate emissions. You could, for example, roll a die to determine the
emission at any step.
Markov chains are mathematical descriptions of Markov models with a
discrete set of states. Markov chains are characterized by:
• A set of states {1, 2, ..., M}
• An M-by-M transition matrix T whose i,j entry is the probability of a
transition from state i to state j. The sum of the entries in each row of
T must be 1, because this is the sum of the probabilities of making a
transition from a given state to each of the other states.
• A set of possible outputs, or emissions, {s1, s2, ... , sN}. By default, the set of
emissions is {1, 2, ... , N}, where N is the number of possible emissions, but
you can choose a different set of numbers or symbols.
• An M-by-N emission matrix E whose i,k entry gives the probability of
emitting symbol sk given that the model is in state i.
Markov chains begin in an initial state i_0 at step 0. The chain then transitions
to state i_1 with probability T_{1 i_1}, and emits an output s_{k_1} with probability
E_{i_1 k_1}. Consequently, the probability of observing the sequence of states
i_1 i_2 ... i_r and the sequence of emissions s_{k_1} s_{k_2} ... s_{k_r} in the first r steps is

T_{1 i_1} E_{i_1 k_1} T_{i_1 i_2} E_{i_2 k_2} \cdots T_{i_{r-1} i_r} E_{i_r k_r}
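As a minimal numerical check of this formula, the following sketch computes
the probability of a short, hypothetical sequence of states and emissions for a
two-state chain; the matrices and the sequences are illustrative only:
T = [0.9 0.1; 0.05 0.95];                  % hypothetical transition matrix
E = [1/6*ones(1,6); 7/12 1/12*ones(1,5)];  % hypothetical emission matrix
states    = [1 1 2];                       % i1, i2, i3
emissions = [3 1 1];                       % k1, k2, k3
p = T(1,states(1)) * E(states(1),emissions(1));
for r = 2:numel(states)
    p = p * T(states(r-1),states(r)) * E(states(r),emissions(r));
end
p   % probability of these states and emissions in the first three steps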
Hidden Markov Models (HMM)
In this section...
“Introduction” on page 14-5
“Analyzing Hidden Markov Models” on page 14-7
Introduction
A hidden Markov model (HMM) is one in which you observe a sequence of
emissions, but do not know the sequence of states the model went through to
generate the emissions. Analyses of hidden Markov models seek to recover
the sequence of states from the observed data.
As an example, consider a Markov model with two states and six possible
emissions. The model uses:
• A red die, having six sides, labeled 1 through 6.
• A green die, having twelve sides, five of which are labeled 2 through 6,
while the remaining seven sides are labeled 1.
• A weighted red coin, for which the probability of heads is .9 and the
probability of tails is .1.
• A weighted green coin, for which the probability of heads is .95 and the
probability of tails is .05.
The model creates a sequence of numbers from the set {1, 2, 3, 4, 5, 6} with the
following rules:
• Begin by rolling the red die and writing down the number that comes up,
which is the emission.
• Toss the red coin and do one of the following:
  - If the result is heads, roll the red die and write down the result.
  - If the result is tails, roll the green die and write down the result.
• At each subsequent step, you flip the coin that has the same color as the die
you rolled in the previous step. If the coin comes up heads, roll the same die
as in the previous step. If the coin comes up tails, switch to the other die.
The state diagram for this model has two states, red and green, as shown in
the following figure.
You determine the emission from a state by rolling the die with the same color
as the state. You determine the transition to the next state by flipping the
coin with the same color as the state.
The transition matrix is:
T = \begin{bmatrix} 0.9 & 0.1 \\ 0.05 & 0.95 \end{bmatrix}

The emissions matrix is:

E = \begin{bmatrix} \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} \\ \tfrac{7}{12} & \tfrac{1}{12} & \tfrac{1}{12} & \tfrac{1}{12} & \tfrac{1}{12} & \tfrac{1}{12} \end{bmatrix}
The model is not hidden because you know the sequence of states from the
colors of the coins and dice. Suppose, however, that someone else is generating
the emissions without showing you the dice or the coins. All you see is the
sequence of emissions. If you start seeing more 1s than other numbers, you
might suspect that the model is in the green state, but you cannot be sure
because you cannot see the color of the die being rolled.
Hidden Markov models raise the following questions:
• Given a sequence of emissions, what is the most likely state path?
• Given a sequence of emissions, how can you estimate transition and
emission probabilities of the model?
• What is the forward probability that the model generates a given sequence?
• What is the posterior probability that the model is in a particular state at
any point in the sequence?
Analyzing Hidden Markov Models
• “Generating a Test Sequence” on page 14-8
• “Estimating the State Sequence” on page 14-8
• “Estimating Transition and Emission Matrices” on page 14-9
• “Estimating Posterior State Probabilities” on page 14-11
• “Changing the Initial State Distribution” on page 14-12
Statistics Toolbox functions related to hidden Markov models are:
• hmmgenerate — Generates a sequence of states and emissions from a
Markov model
• hmmestimate — Calculates maximum likelihood estimates of transition
and emission probabilities from a sequence of emissions and a known
sequence of states
• hmmtrain — Calculates maximum likelihood estimates of transition and
emission probabilities from a sequence of emissions
• hmmviterbi — Calculates the most probable state path for a hidden
Markov model
• hmmdecode — Calculates the posterior state probabilities of a sequence
of emissions
This section shows how to use these functions to analyze hidden Markov
models.
Generating a Test Sequence
The following commands create the transition and emission matrices for the
model described in the “Introduction” on page 14-5:
TRANS = [.9 .1; .05 .95;];
EMIS = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6;...
7/12, 1/12, 1/12, 1/12, 1/12, 1/12];
To generate a random sequence of states and emissions from the model, use
hmmgenerate:
[seq,states] = hmmgenerate(1000,TRANS,EMIS);
The output seq is the sequence of emissions and the output states is the
sequence of states.
hmmgenerate begins in state 1 at step 0, makes the transition to state i1 at
step 1, and returns i1 as the first entry in states. To change the initial state,
see “Changing the Initial State Distribution” on page 14-12.
Estimating the State Sequence
Given the transition and emission matrices TRANS and EMIS, the function
hmmviterbi uses the Viterbi algorithm to compute the most likely sequence
of states the model would go through to generate a given sequence seq of
emissions:
likelystates = hmmviterbi(seq, TRANS, EMIS);
likelystates is a sequence the same length as seq.
To test the accuracy of hmmviterbi, compute the percentage of the actual
sequence states that agrees with the sequence likelystates.
sum(states==likelystates)/1000
ans =
0.8200
In this case, the most likely sequence of states agrees with the random
sequence 82% of the time.
Estimating Transition and Emission Matrices
• “Using hmmestimate” on page 14-9
• “Using hmmtrain” on page 14-10
The functions hmmestimate and hmmtrain estimate the transition and
emission matrices TRANS and EMIS given a sequence seq of emissions.
Using hmmestimate. The function hmmestimate requires that you know
the sequence of states states that the model went through to generate seq.
The following takes the emission and state sequences and returns estimates
of the transition and emission matrices:
[TRANS_EST, EMIS_EST] = hmmestimate(seq, states)
TRANS_EST =
    0.8989    0.1011
    0.0585    0.9415

EMIS_EST =
    0.1721    0.1721    0.1749    0.1612    0.1803    0.1393
    0.5836    0.0741    0.0804    0.0789    0.0726    0.1104
You can compare the outputs with the original transition and emission
matrices, TRANS and EMIS:
TRANS

TRANS =
    0.9000    0.1000
    0.0500    0.9500

EMIS

EMIS =
    0.1667    0.1667    0.1667    0.1667    0.1667    0.1667
    0.5833    0.0833    0.0833    0.0833    0.0833    0.0833
Using hmmtrain. If you do not know the sequence of states states, but you
have initial guesses for TRANS and EMIS, you can still estimate TRANS and
EMIS using hmmtrain.
Suppose you have the following initial guesses for TRANS and EMIS.
TRANS_GUESS = [.85 .15; .1 .9];
EMIS_GUESS = [.17 .16 .17 .16 .17 .17; .6 .08 .08 .08 .08 .08];
You estimate TRANS and EMIS as follows:
[TRANS_EST2, EMIS_EST2] = hmmtrain(seq, TRANS_GUESS, EMIS_GUESS)
TRANS_EST2 =
    0.2286    0.7714
    0.0032    0.9968

EMIS_EST2 =
    0.1436    0.2348    0.1837    0.1963    0.2350    0.0066
    0.4355    0.1089    0.1144    0.1082    0.1109    0.1220
hmmtrain uses an iterative algorithm that alters the matrices TRANS_GUESS
and EMIS_GUESS so that at each step the adjusted matrices are more likely to
generate the observed sequence, seq. The algorithm halts when the matrices
in two successive iterations are within a small tolerance of each other.
If the algorithm fails to reach this tolerance within a maximum number of
iterations, whose default value is 100, the algorithm halts. In this case,
hmmtrain returns the last values of TRANS_EST and EMIS_EST and issues a
warning that the tolerance was not reached.
If the algorithm fails to reach the desired tolerance, increase the default value
of the maximum number of iterations with the command:
hmmtrain(seq,TRANS_GUESS,EMIS_GUESS,'maxiterations',maxiter)
where maxiter is the maximum number of steps the algorithm executes.
Change the default value of the tolerance with the command:
hmmtrain(seq, TRANS_GUESS, EMIS_GUESS, 'tolerance', tol)
where tol is the desired value of the tolerance. Increasing the value of tol
makes the algorithm halt sooner, but the results are less accurate.
Two factors reduce the reliability of the output matrices of hmmtrain:
• The algorithm converges to a local maximum that does not represent the
true transition and emission matrices. If you suspect this, use different
initial guesses for the matrices TRANS_GUESS and EMIS_GUESS.
• The sequence seq may be too short to properly train the matrices. If you
suspect this, use a longer sequence for seq.
Estimating Posterior State Probabilities
The posterior state probabilities of an emission sequence seq are the
conditional probabilities that the model is in a particular state when it
generates a symbol in seq, given that seq is emitted. You compute the
posterior state probabilities with hmmdecode:
PSTATES = hmmdecode(seq,TRANS,EMIS)
The output PSTATES is an M-by-L matrix, where M is the number of states
and L is the length of seq. PSTATES(i,j) is the conditional probability that
the model is in state i when it generates the jth symbol of seq, given that
seq is emitted.
hmmdecode begins with the model in state 1 at step 0, prior to the first
emission. PSTATES(i,1) is the probability that the model is in state i at step 1,
when it generates the first symbol of seq. To change the initial state, see “Changing the Initial State
Distribution” on page 14-12.
To return the logarithm of the probability of the sequence seq, use the second
output argument of hmmdecode:
[PSTATES,logpseq] = hmmdecode(seq,TRANS,EMIS)
The probability of a sequence tends to 0 as the length of the sequence
increases, and the probability of a sufficiently long sequence becomes less
than the smallest positive number your computer can represent. hmmdecode
returns the logarithm of the probability to avoid this problem.
Changing the Initial State Distribution
By default, Statistics Toolbox hidden Markov model functions begin in state 1.
In other words, the distribution of initial states has all of its probability mass
concentrated at state 1. To assign a different distribution of probabilities, p =
[p1, p2, ..., pM], to the M initial states, do the following:
1 Create an M+1-by-M+1 augmented transition matrix, T̂ of the following
form:
\hat{T} = \begin{bmatrix} 0 & p \\ \mathbf{0} & T \end{bmatrix}
where T is the true transition matrix. The first column of T̂ contains M+1
zeros. p must sum to 1.
2 Create an M+1-by-N augmented emission matrix, Ê , that has the
following form:
\hat{E} = \begin{bmatrix} \mathbf{0} \\ E \end{bmatrix}
If the transition and emission matrices are TRANS and EMIS, respectively, you
create the augmented matrices with the following commands:
TRANS_HAT = [0 p; zeros(size(TRANS,1),1) TRANS];
EMIS_HAT = [zeros(1,size(EMIS,2)); EMIS];
15
Design of Experiments
• “Introduction” on page 15-2
• “Full Factorial Designs” on page 15-3
• “Fractional Factorial Designs” on page 15-5
• “Response Surface Designs” on page 15-9
• “D-Optimal Designs” on page 15-15
Introduction
Passive data collection leads to a number of problems in statistical modeling.
Observed changes in a response variable may be correlated with, but
not caused by, observed changes in individual factors (process variables).
Simultaneous changes in multiple factors may produce interactions that are
difficult to separate into individual effects. Observations may be dependent,
while a model of the data considers them to be independent.
Designed experiments address these problems. In a designed experiment,
the data-producing process is actively manipulated to improve the quality
of information and to eliminate redundant data. A common goal of all
experimental designs is to collect data as parsimoniously as possible while
providing sufficient information to accurately estimate model parameters.
For example, a simple model of a response y in an experiment with two
controlled factors x1 and x2 might look like this:
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon
Here ε includes both experimental error and the effects of any uncontrolled
factors in the experiment. The terms β1x1 and β2x2 are main effects and the
term β3x1x2 is a two-way interaction effect. A designed experiment would
systematically manipulate x1 and x2 while measuring y, with the objective of
accurately estimating β0, β1, β2, and β3.
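As a minimal sketch of how such an experiment is analyzed, the following codes
a two-level, two-factor full factorial design as ±1, builds the model matrix for
the two main effects and their interaction, and estimates the coefficients by
least squares; the response values here are hypothetical:
d = fullfact([2 2]);                            % four runs of a two-level, two-factor design
x = 2*(d - 1.5);                                % code the levels as -1/+1
X = [ones(4,1) x(:,1) x(:,2) x(:,1).*x(:,2)];   % columns: intercept, x1, x2, x1*x2
y = [1.2; 2.3; 2.1; 4.4];                       % hypothetical measured responses
b = X\y                                         % least-squares estimates of beta0 ... beta3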
Full Factorial Designs
In this section...
“Multilevel Designs” on page 15-3
“Two-Level Designs” on page 15-4
Multilevel Designs
To systematically vary experimental factors, assign each factor a discrete
set of levels. Full factorial designs measure response variables using every
treatment (combination of the factor levels). A full factorial design for n
factors with N1, ..., Nn levels requires N1 × ... × Nn experimental runs—one for
each treatment. While advantageous for separating individual effects, full
factorial designs can make large demands on data collection.
As an example, suppose a machine shop has three machines and four
operators. If the same operator always uses the same machine, it is
impossible to determine if a machine or an operator is the cause of variation
in production. By allowing every operator to use every machine, effects are
separated. A full factorial list of treatments is generated by the Statistics
Toolbox function fullfact:
dFF = fullfact([3,4])
dFF =
     1     1
     2     1
     3     1
     1     2
     2     2
     3     2
     1     3
     2     3
     3     3
     1     4
     2     4
     3     4
Each of the 3×4 = 12 rows of dFF represents one machine/operator combination.
Two-Level Designs
Many experiments can be conducted with two-level factors, using two-level
designs. For example, suppose the machine shop in the previous example
always keeps the same operator on the same machine, but wants to measure
production effects that depend on the composition of the day and night
shifts. The Statistics Toolbox function ff2n generates a full factorial list of
treatments:
dFF2 = ff2n(4)
dFF2 =
     0     0     0     0
     0     0     0     1
     0     0     1     0
     0     0     1     1
     0     1     0     0
     0     1     0     1
     0     1     1     0
     0     1     1     1
     1     0     0     0
     1     0     0     1
     1     0     1     0
     1     0     1     1
     1     1     0     0
     1     1     0     1
     1     1     1     0
     1     1     1     1
Each of the 2^4 = 16 rows of dFF2 represents one schedule of operators for the
day (0) and night (1) shifts.
Fractional Factorial Designs
In this section...
“Introduction” on page 15-5
“Plackett-Burman Designs” on page 15-5
“General Fractional Designs” on page 15-6
Introduction
Two-level designs are sufficient for evaluating many production processes.
Factor levels of ±1 can indicate categorical factors, normalized factor extremes,
or simply “up” and “down” from current factor settings. Experimenters
evaluating process changes are interested primarily in the factor directions
that lead to process improvement.
For experiments with many factors, two-level full factorial designs can lead to
large amounts of data. For example, a two-level full factorial design with 10
factors requires 2^10 = 1024 runs. Often, however, individual factors or their
interactions have no distinguishable effects on a response. This is especially
true of higher order interactions. As a result, a well-designed experiment can
use fewer runs for estimating model parameters.
Fractional factorial designs use a fraction of the runs required by full
factorial designs. A subset of experimental treatments is selected based on
an evaluation (or assumption) of which factors and interactions have the
most significant effects. Once this selection is made, the experimental design
must separate these effects. In particular, significant effects should not
be confounded, that is, the measurement of one should not depend on the
measurement of another.
Plackett-Burman Designs
Plackett-Burman designs are used when only main effects are considered
significant. Two-level Plackett-Burman designs require a number of
experimental runs that are a multiple of 4 rather than a power of 2. The
MATLAB function hadamard generates these designs:
dPB = hadamard(8)
dPB =
     1     1     1     1     1     1     1     1
     1    -1     1    -1     1    -1     1    -1
     1     1    -1    -1     1     1    -1    -1
     1    -1    -1     1     1    -1    -1     1
     1     1     1     1    -1    -1    -1    -1
     1    -1     1    -1    -1     1    -1     1
     1     1    -1    -1    -1    -1     1     1
     1    -1    -1     1    -1     1     1    -1
Binary factor levels are indicated by ±1. The design is for eight runs (the rows
of dPB) manipulating seven two-level factors (the last seven columns of dPB).
The number of runs is a fraction 8/2^7 = 0.0625 of the runs required by a full
factorial design. Economy is achieved at the expense of confounding main
effects with any two-way interactions.
General Fractional Designs
At the cost of a larger fractional design, you can specify which interactions
you wish to consider significant. A design of resolution R is one in which no
n-factor interaction is confounded with any other effect containing less than
R – n factors. Thus, a resolution III design does not confound main effects
with one another but may confound them with two-way interactions (as in
“Plackett-Burman Designs” on page 15-5), while a resolution IV design does
not confound either main effects or two-way interactions but may confound
two-way interactions with each other.
Specify general fractional factorial designs using a full factorial design for
a selected subset of basic factors and generators for the remaining factors.
Generators are products of the basic factors, giving the levels for the
remaining factors. Use the Statistics Toolbox function fracfact to generate
these designs:
dfF = fracfact('a b c d bcd acd')
dfF =
    -1    -1    -1    -1    -1    -1
    -1    -1    -1     1     1     1
    -1    -1     1    -1     1     1
    -1    -1     1     1    -1    -1
    -1     1    -1    -1     1    -1
    -1     1    -1     1    -1     1
    -1     1     1    -1    -1     1
    -1     1     1     1     1    -1
     1    -1    -1    -1    -1     1
     1    -1    -1     1     1    -1
     1    -1     1    -1     1    -1
     1    -1     1     1    -1     1
     1     1    -1    -1     1     1
     1     1    -1     1    -1    -1
     1     1     1    -1    -1    -1
     1     1     1     1     1     1
This is a six-factor design in which four two-level basic factors (a, b, c, and
d in the first four columns of dfF) are measured in every combination of
levels, while the two remaining factors (in the last two columns of dfF) are
measured only at levels defined by the generators bcd and acd, respectively.
Levels in the generated columns are products of corresponding levels in the
columns that make up the generator.
The challenge of creating a fractional factorial design is to choose basic factors
and generators so that the design achieves a specified resolution in a specified
number of runs. Use the Statistics Toolbox function fracfactgen to find
appropriate generators:
generators = fracfactgen('a b c d e f',4,4)
generators =
'a'
'b'
'c'
'd'
'bcd'
'acd'
These are generators for a six-factor design with factors a through f, using 2^4
= 16 runs to achieve resolution IV. The fracfactgen function uses an efficient
search algorithm to find generators that meet the requirements.
An optional output from fracfact displays the confounding pattern of the
design:
[dfF,confounding] = fracfact(generators);
confounding
confounding =
    'Term'       'Generator'    'Confounding'
    'X1'         'a'            'X1'
    'X2'         'b'            'X2'
    'X3'         'c'            'X3'
    'X4'         'd'            'X4'
    'X5'         'bcd'          'X5'
    'X6'         'acd'          'X6'
    'X1*X2'      'ab'           'X1*X2 + X5*X6'
    'X1*X3'      'ac'           'X1*X3 + X4*X6'
    'X1*X4'      'ad'           'X1*X4 + X3*X6'
    'X1*X5'      'abcd'         'X1*X5 + X2*X6'
    'X1*X6'      'cd'           'X1*X6 + X2*X5 + X3*X4'
    'X2*X3'      'bc'           'X2*X3 + X4*X5'
    'X2*X4'      'bd'           'X2*X4 + X3*X5'
    'X2*X5'      'cd'           'X1*X6 + X2*X5 + X3*X4'
    'X2*X6'      'abcd'         'X1*X5 + X2*X6'
    'X3*X4'      'cd'           'X1*X6 + X2*X5 + X3*X4'
    'X3*X5'      'bd'           'X2*X4 + X3*X5'
    'X3*X6'      'ad'           'X1*X4 + X3*X6'
    'X4*X5'      'bc'           'X2*X3 + X4*X5'
    'X4*X6'      'ac'           'X1*X3 + X4*X6'
    'X5*X6'      'ab'           'X1*X2 + X5*X6'
The confounding pattern shows that main effects are effectively separated
by the design, but two-way interactions are confounded with various other
two-way interactions.
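As a sketch of how to confirm this numerically, the cross-products of the design columns show that the six main-effect columns are mutually orthogonal:
dfF = fracfact(generators);
dfF'*dfF      % a diagonal matrix (16*eye(6)); main effects are not confounded with each other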
Response Surface Designs
In this section...
“Introduction” on page 15-9
“Central Composite Designs” on page 15-9
“Box-Behnken Designs” on page 15-13
Introduction
As discussed in “Response Surface Models” on page 9-45, quadratic response
surfaces are simple models that provide a maximum or minimum without
making additional assumptions about the form of the response. Quadratic
models can be calibrated using full factorial designs with three or more levels
for each factor, but these designs generally require more runs than necessary
to accurately estimate model parameters. This section discusses designs for
calibrating quadratic models that are much more efficient, using three or five
levels for each factor, but not using all combinations of levels.
Central Composite Designs
Central composite designs (CCDs), also known as Box-Wilson designs, are
appropriate for calibrating the full quadratic models described in “Response
Surface Models” on page 9-45. There are three types of CCDs—circumscribed,
inscribed, and faced—pictured below:
[Figure: circumscribed, inscribed, and faced central composite designs]
Each design consists of a factorial design (the corners of a cube) together with
center and star points that allow for estimation of second-order effects. For
a full quadratic model with n factors, CCDs have enough design points to
estimate all (n+2)(n+1)/2 coefficients.
The type of CCD used (the position of the factorial and star points) is
determined by the number of factors and by the desired properties of the
design. The following table summarizes some important properties. A design
is rotatable if the prediction variance depends only on the distance of the
design point from the center of the design.
Design                 Rotatable   Factor Levels   Uses Points Outside ±1   Accuracy of Estimates
Circumscribed (CCC)    Yes         5               Yes                      Good over entire design space
Inscribed (CCI)        Yes         5               No                       Good over central subset of design space
Faced (CCF)            No          3               No                       Fair over entire design space; poor for pure quadratic coefficients
Generate CCDs with the Statistics Toolbox function ccdesign:
dCC = ccdesign(3,'type','circumscribed')
dCC =
   -1.0000   -1.0000   -1.0000
   -1.0000   -1.0000    1.0000
   -1.0000    1.0000   -1.0000
   -1.0000    1.0000    1.0000
    1.0000   -1.0000   -1.0000
    1.0000   -1.0000    1.0000
    1.0000    1.0000   -1.0000
    1.0000    1.0000    1.0000
   -1.6818         0         0
    1.6818         0         0
         0   -1.6818         0
         0    1.6818         0
         0         0   -1.6818
         0         0    1.6818
         0         0         0
         0         0         0
         0         0         0
         0         0         0
         0         0         0
         0         0         0
         0         0         0
         0         0         0
         0         0         0
         0         0         0
The repeated center point runs allow for a more uniform estimate of the
prediction variance over the entire design space.
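If you want a different number of center points, ccdesign accepts a 'center' parameter (a sketch; the run counts shown assume the circumscribed type used above):
dCC4 = ccdesign(3,'type','circumscribed','center',4);
size(dCC4,1)     % 18 runs: 8 factorial + 6 star + 4 center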
Box-Behnken Designs
Like the designs described in “Central Composite Designs” on page
15-9, Box-Behnken designs are used to calibrate full quadratic models.
Box-Behnken designs are rotatable and, for a small number of factors (four or
fewer), require fewer runs than CCDs. By avoiding the corners of the design
space, they allow experimenters to work around extreme factor combinations.
As with an inscribed CCD, however, the response at those extremes is then poorly estimated.
The geometry of a Box-Behnken design is pictured in the following figure.
Design points are at the midpoints of edges of the design space and at the
center, and do not contain an embedded factorial design.
[Figure: Box-Behnken design geometry]
Generate Box-Behnken designs with the Statistics Toolbox function bbdesign:
dBB = bbdesign(3)
dBB =
    -1    -1     0
    -1     1     0
     1    -1     0
     1     1     0
    -1     0    -1
    -1     0     1
     1     0    -1
     1     0     1
     0    -1    -1
     0    -1     1
     0     1    -1
     0     1     1
     0     0     0
     0     0     0
     0     0     0
Again, the repeated center point runs allow for a more uniform estimate of
the prediction variance over the entire design space.
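A quick comparison of design sizes for three factors (a sketch using the designs generated above):
size(bbdesign(3),1)    % 15 runs for the Box-Behnken design
size(dCC,1)            % 24 runs for the circumscribed CCD generated earlier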
D-Optimal Designs
In this section...
“Introduction” on page 15-15
“Generating D-Optimal Designs” on page 15-16
“Augmenting D-Optimal Designs” on page 15-19
“Specifying Fixed Covariate Factors” on page 15-20
“Specifying Categorical Factors” on page 15-21
“Specifying Candidate Sets” on page 15-21
Introduction
Traditional experimental designs (“Full Factorial Designs” on page 15-3,
“Fractional Factorial Designs” on page 15-5, and “Response Surface Designs”
on page 15-9) are appropriate for calibrating linear models in experimental
settings where factors are relatively unconstrained in the region of interest.
In some cases, however, models are necessarily nonlinear. In other cases,
certain treatments (combinations of factor levels) may be expensive or
infeasible to measure. D-optimal designs are model-specific designs that
address these limitations of traditional designs.
A D-optimal design is generated by an iterative search algorithm and seeks
to minimize the covariance of the parameter estimates for a specified model.
This is equivalent to maximizing the determinant D = |X^T X|, where X is the
design matrix of model terms (the columns) evaluated at specific treatments
in the design space (the rows). Unlike traditional designs, D-optimal designs
do not require orthogonal design matrices, and as a result, parameter
estimates may be correlated. The designs these searches return may also be
locally, but not globally, D-optimal.
There are several Statistics Toolbox functions for generating D-optimal
designs:
candexch - Uses a row-exchange algorithm to generate a D-optimal design with a specified number of runs for a specified model and a specified candidate set. This is the second component of the algorithm used by rowexch.
candgen - Generates a candidate set for a specified model. This is the first component of the algorithm used by rowexch.
cordexch - Uses a coordinate-exchange algorithm to generate a D-optimal design with a specified number of runs for a specified model.
daugment - Uses a coordinate-exchange algorithm to augment an existing D-optimal design with additional runs to estimate additional model terms.
dcovary - Uses a coordinate-exchange algorithm to generate a D-optimal design with fixed covariate factors.
rowexch - Uses a row-exchange algorithm to generate a D-optimal design with a specified number of runs for a specified model. The algorithm calls candgen and then candexch. (Call candexch separately to specify a candidate set.)
The following sections explain how to use these functions to generate
D-optimal designs.
Note The Statistics Toolbox function rsmdemo generates simulated data for
experimental settings specified by either the user or by a D-optimal design
generated by cordexch. It uses the rstool interface to visualize response
surface models fit to the data, and it uses the nlintool interface to visualize
a nonlinear model fit to the data.
Generating D-Optimal Designs
Two Statistics Toolbox algorithms generate D-optimal designs:
• The cordexch function uses a coordinate-exchange algorithm
• The rowexch function uses a row-exchange algorithm
Both cordexch and rowexch use iterative search algorithms. They operate by
incrementally changing an initial design matrix X to increase D = |X^T X| at
each step. In both algorithms, there is randomness built into the selection of
the initial design and into the choice of the incremental changes. As a result,
both algorithms may return locally, but not globally, D-optimal designs. Run
each algorithm multiple times and select the best result for your final design.
Both functions have a 'tries' parameter that automates this repetition
and comparison.
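A sketch of how the criterion can be inspected directly: the model matrix X returned by either function gives D as a determinant, so designs from different runs can be compared.
[dCE,X] = cordexch(3,7,'interaction','tries',10);
D = det(X'*X)    % the quantity the search attempts to maximize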
At each step, the row-exchange algorithm exchanges an entire row of X with a
row from a design matrix C evaluated at a candidate set of feasible treatments.
The rowexch function automatically generates a C appropriate for a specified
model, operating in two steps by calling the candgen and candexch functions
in sequence. Provide your own C by calling candexch directly. In either case,
if C is large, its static presence in memory can affect computation.
The coordinate-exchange algorithm, by contrast, does not use a candidate
set. (Or rather, the candidate set is the entire design space.) At each step,
the coordinate-exchange algorithm exchanges a single element of X with a
new element evaluated at a neighboring point in design space. The absence
of a candidate set reduces demands on memory, but the smaller scale of the
search means that the coordinate-exchange algorithm is more likely to become
trapped in a local minimum than the row-exchange algorithm.
For example, suppose you want a design to estimate the parameters in the
following three-factor, seven-term interaction model:
y = β0 + β1x1 + β2x2 + β3x3 + β12x1x2 + β13x1x3 + β23x2x3 + ε
Use cordexch to generate a D-optimal design with seven runs:
nfactors = 3;
nruns = 7;
[dCE,X] = cordexch(nfactors,nruns,'interaction','tries',10)
dCE =
    -1     1     1
    -1    -1    -1
     1     1     1
    -1     1    -1
     1    -1     1
     1    -1    -1
    -1    -1     1
X =
     1    -1     1     1    -1    -1     1
     1    -1    -1    -1     1     1     1
     1     1     1     1     1     1     1
     1    -1     1    -1    -1     1    -1
     1     1    -1     1    -1     1    -1
     1     1    -1    -1    -1    -1     1
     1    -1    -1     1     1    -1    -1
Columns of the design matrix X are the model terms evaluated at each row of
the design dCE. The terms appear in order from left to right:
1 Constant term
2 Linear terms (x1, x2, x3)
3 Interaction terms (x1x2, x1x3, x2x3)
Use X to fit the model, as described in “Linear Regression” on page 9-3, to
response data measured at the design points in dCE.
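For example, a minimal sketch with simulated responses (the vector y below is hypothetical, standing in for measurements taken at the runs in dCE):
y = randn(7,1);      % hypothetical measured responses, one per run
b = regress(y,X)     % estimates of the seven model coefficients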
Use rowexch in a similar fashion to generate an equivalent design:
[dRE,X] = rowexch(nfactors,nruns,'interaction','tries',10)
dRE =
    -1    -1     1
     1    -1     1
     1    -1    -1
     1     1     1
    -1    -1    -1
    -1     1    -1
    -1     1     1
X =
     1    -1    -1     1     1    -1    -1
     1     1    -1     1    -1     1    -1
     1     1    -1    -1    -1    -1     1
     1     1     1     1     1     1     1
     1    -1    -1    -1     1     1     1
     1    -1     1    -1    -1     1    -1
     1    -1     1     1    -1    -1     1
Augmenting D-Optimal Designs
In practice, you may want to add runs to a completed experiment to learn
more about a process and estimate additional model coefficients. The
daugment function uses a coordinate-exchange algorithm to augment an
existing D-optimal design.
For example, the following eight-run design is adequate for estimating main
effects in a four-factor model:
dCEmain = cordexch(4,8)
dCEmain =
     1    -1    -1     1
    -1    -1     1     1
    -1     1    -1     1
     1     1     1    -1
     1     1     1     1
    -1     1    -1    -1
     1    -1    -1    -1
    -1    -1     1    -1
To estimate the six interaction terms in the model, augment the design with
eight additional runs:
dCEinteraction = daugment(dCEmain,8,'interaction')
dCEinteraction =
     1    -1    -1     1
    -1    -1     1     1
    -1     1    -1     1
     1     1     1    -1
     1     1     1     1
    -1     1    -1    -1
     1    -1    -1    -1
    -1    -1     1    -1
    -1     1     1     1
    -1    -1    -1    -1
     1    -1     1    -1
     1     1    -1     1
    -1     1     1    -1
     1     1    -1    -1
     1    -1     1     1
    -1    -1    -1     1
The augmented design is full factorial, with the original eight runs in the
first eight rows.
The 'start' parameter of the candexch function provides the same
functionality as daugment, but uses a row exchange algorithm rather than a
coordinate-exchange algorithm.
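A sketch of a quick consistency check on the augmented design (it relies only on the fact, noted above, that the original runs occupy the first eight rows):
isequal(dCEinteraction(1:8,:), dCEmain)   % returns 1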
Specifying Fixed Covariate Factors
In many experimental settings, certain factors and their covariates are
constrained to a fixed set of levels or combinations of levels. These cannot be
varied when searching for an optimal design. The dcovary function allows
you to specify fixed covariate factors in the coordinate exchange algorithm.
For example, suppose you want a design to estimate the parameters in a
three-factor linear additive model, with eight runs that necessarily occur at
different times. If the process experiences temporal linear drift, you may
want to include the run time as a variable in the model. Produce the design as
follows:
time = linspace(-1,1,8)';
[dCV,X] = dcovary(3,time,'linear')
dCV =
   -1.0000    1.0000    1.0000   -1.0000
    1.0000   -1.0000   -1.0000   -0.7143
   -1.0000   -1.0000   -1.0000   -0.4286
    1.0000   -1.0000    1.0000   -0.1429
    1.0000    1.0000   -1.0000    0.1429
   -1.0000    1.0000   -1.0000    0.4286
    1.0000    1.0000    1.0000    0.7143
   -1.0000   -1.0000    1.0000    1.0000
X =
    1.0000   -1.0000    1.0000    1.0000   -1.0000
    1.0000    1.0000   -1.0000   -1.0000   -0.7143
    1.0000   -1.0000   -1.0000   -1.0000   -0.4286
    1.0000    1.0000   -1.0000    1.0000   -0.1429
    1.0000    1.0000    1.0000   -1.0000    0.1429
    1.0000   -1.0000    1.0000   -1.0000    0.4286
    1.0000    1.0000    1.0000    1.0000    0.7143
    1.0000   -1.0000   -1.0000    1.0000    1.0000
The column vector time is a fixed factor, normalized to values between ±1.
The number of rows in the fixed factor specifies the number of runs in the
design. The resulting design dCV gives factor settings for the three controlled
model factors at each time.
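If your run times are recorded on their natural scale, normalize them before calling dcovary (a sketch with hypothetical clock times):
t = (1:8)';                                    % e.g., runs performed at hours 1 through 8
time = 2*(t - min(t))./(max(t) - min(t)) - 1;  % rescale to the coded range [-1,1]
[dCV,X] = dcovary(3,time,'linear');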
Specifying Categorical Factors
Categorical factors take values in a discrete set of levels. Both cordexch and
rowexch have a 'categorical' parameter that allows you to specify the
indices of categorical factors and a 'levels' parameter that allows you to
specify a number of levels for each factor.
For example, the following eight-run design is for a linear additive model with
five factors in which the final factor is categorical with three levels:
dCEcat = cordexch(5,8,'linear','categorical',5,'levels',3)
dCEcat =
    -1    -1     1     1     2
    -1    -1    -1    -1     3
     1     1     1     1     3
     1     1    -1    -1     2
     1    -1    -1     1     3
    -1     1    -1     1     1
    -1     1     1    -1     3
     1    -1     1    -1     1
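One possible follow-up, not part of the original example: expand the three-level categorical factor into indicator columns before fitting a linear model (a sketch using dummyvar):
G = dummyvar(dCEcat(:,5));                   % 8-by-3 indicator matrix for the 3 levels
Xcat = [ones(8,1) dCEcat(:,1:4) G(:,2:3)];   % drop one indicator to avoid collinearity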
Specifying Candidate Sets
The row-exchange algorithm exchanges rows of an initial design matrix X
with rows from a design matrix C evaluated at a candidate set of feasible
treatments. The rowexch function automatically generates a C appropriate for
a specified model, operating in two steps by calling the candgen and candexch
functions in sequence. Provide your own C by calling candexch directly.
For example, the following uses rowexch to generate a five-run design for
a two-factor pure quadratic model using a candidate set that is produced
internally:
dRE1 = rowexch(2,5,'purequadratic','tries',10)
dRE1 =
    -1     1
     0     0
     1    -1
     1     0
     1     1
The same thing can be done using candgen and candexch in sequence:
[dC,C] = candgen(2,'purequadratic') % Candidate set, C
dC =
    -1    -1
     0    -1
     1    -1
    -1     0
     0     0
     1     0
    -1     1
     0     1
     1     1
C =
     1    -1    -1     1     1
     1     0    -1     0     1
     1     1    -1     1     1
     1    -1     0     1     0
     1     0     0     0     0
     1     1     0     1     0
     1    -1     1     1     1
     1     0     1     0     1
     1     1     1     1     1
treatments = candexch(C,5,'tries',10) % D-opt subset
treatments =
     2
     1
     7
     3
     4
dRE2 = dC(treatments,:) % Display design
dRE2 =
     0    -1
    -1    -1
    -1     1
     1    -1
    -1     0
You can replace C in this example with a design matrix evaluated at your own
candidate set. For example, suppose your experiment is constrained so that
the two factors cannot have extreme settings simultaneously. The following
produces a restricted candidate set:
constraint = sum(abs(dC),2) < 2; % Feasible treatments
my_dC = dC(constraint,:)
my_dC =
     0    -1
    -1     0
     0     0
     1     0
     0     1
Use the x2fx function to convert the candidate set to a design matrix:
my_C = x2fx(my_dC,'purequadratic')
my_C =
     1     0    -1     0     1
     1    -1     0     1     0
     1     0     0     0     0
     1     1     0     1     0
     1     0     1     0     1
Find the required design in the same manner:
my_treatments = candexch(my_C,5,'tries',10) % D-opt subset
my_treatments =
     2
     4
     5
     1
     3
my_dRE = my_dC(my_treatments,:) % Display design
my_dRE =
    -1     0
     1     0
     0     1
     0    -1
     0     0
16
Statistical Process Control
• “Introduction” on page 16-2
• “Control Charts” on page 16-3
• “Capability Studies” on page 16-6
Introduction
Statistical process control (SPC) refers to a number of different methods for
monitoring and assessing the quality of manufactured goods. Combined
with methods from the Chapter 15, “Design of Experiments”, SPC is used in
programs that define, measure, analyze, improve, and control development
and production processes. These programs are often implemented using
“Design for Six Sigma” methodologies.
Control Charts
A control chart displays measurements of process samples over time. The
measurements are plotted together with user-defined specification limits and
process-defined control limits. The process can then be compared with its
specifications—to see if it is in control or out of control.
The chart is just a monitoring tool. Control activity might occur if the chart
indicates an undesirable, systematic change in the process. The control
chart is used to discover the variation, so that the process can be adjusted
to reduce it.
Control charts are created with the controlchart function. Any of the
following chart types may be specified:
• Xbar or mean
• Standard deviation
• Range
• Exponentially weighted moving average
• Individual observation
• Moving range of individual observations
• Moving average of individual observations
• Proportion defective
• Number of defectives
• Defects per unit
• Count of defects
Control rules are specified with the controlrules function.
For example, the following commands create an xbar chart, using the
“Western Electric 2” rule (2 of 3 points at least 2 standard errors above the
center line) to mark out of control measurements:
load parts;
st = controlchart(runout,'rules','we2');
x = st.mean;
cl = st.mu;
se = st.sigma./sqrt(st.n);
hold on
plot(cl+2*se,'m')
Measurements that violate the control rule can then be identified:
R = controlrules('we2',x,cl,se);
I = find(R)
I =
21
23
24
25
26
27
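A possible follow-up sketch (not part of the original example): circle the violating points on the chart created above, using the indices in I:
plot(I, x(I), 'ro')   % mark the out-of-control subgroup means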
Capability Studies
Before going into production, many manufacturers run a capability study to
determine if their process will run within specifications enough of the time.
Capability indices produced by such a study are used to estimate expected
percentages of defective parts.
Capability studies are conducted with the capability function. The following
capability indices are produced:
• mu — Sample mean
• sigma — Sample standard deviation
• P — Estimated probability of being within the lower (L) and upper (U)
specification limits
• Pl — Estimated probability of being below L
• Pu — Estimated probability of being above U
• Cp — (U-L)/(6*sigma)
• Cpl — (mu-L)./(3.*sigma)
• Cpu — (U-mu)./(3.*sigma)
• Cpk — min(Cpl,Cpu)
As an example, simulate a sample from a process with a mean of 3 and a
standard deviation of 0.005:
data = normrnd(3,0.005,100,1);
Compute capability indices if the process has an upper specification limit of
3.01 and a lower specification limit of 2.99:
S = capability(data,[2.99 3.01])
S =
       mu: 3.0006
    sigma: 0.0047
        P: 0.9669
       Pl: 0.0116
       Pu: 0.0215
       Cp: 0.7156
      Cpl: 0.7567
      Cpu: 0.6744
      Cpk: 0.6744
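As a sketch, the reported indices can be checked against the definitions listed above, for example:
Cp_check = (3.01 - 2.99)/(6*S.sigma)   % reproduces S.Cp, approximately 0.7156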
Visualize the specification and process widths:
capaplot(data,[2.99 3.01]);
grid on
17
Parallel Statistics
• “Quick Start Parallel Computing for Statistics Toolbox” on page 17-2
• “Concepts of Parallel Computing in Statistics Toolbox” on page 17-7
• “When to Run Statistical Functions in Parallel” on page 17-8
• “Working with parfor” on page 17-10
• “Reproducibility in Parallel Statistical Computations” on page 17-13
• “Examples of Parallel Statistical Functions” on page 17-18
Quick Start Parallel Computing for Statistics Toolbox
Note To use parallel computing as described in this chapter, you must have
a Parallel Computing Toolbox™ license.
In this section...
“What Is Parallel Statistics Functionality?” on page 17-2
“How To Compute in Parallel” on page 17-3
“Example: Parallel Treebagger” on page 17-5
What Is Parallel Statistics Functionality?
You can use any of the Statistics Toolbox functions with Parallel Computing
Toolbox constructs such as parfor and spmd. However, some functions,
such as those with interactive displays, can lose functionality in parallel. In
particular, displays and interactive usage are not effective on workers (see
“Vocabulary for Parallel Computation” on page 17-7).
Additionally, the following functions are enhanced to use parallel computing
internally. These functions use parfor internally to parallelize calculations.
• bootci
• bootstrp
• candexch
• cordexch
• crossval
• daugment
• dcovary
• jackknife
• nnmf
• plsregress
• rowexch
• sequentialfs
• TreeBagger
• TreeBagger.growTrees
This chapter gives the simplest way to use these enhanced functions in
parallel. For more advanced topics, including the issues of reproducibility and
nested parfor loops, see the other sections in this chapter.
For information on parallel statistical computing at the command line, enter
help parallelstats
How To Compute in Parallel
To have a function compute in parallel:
1 “Open matlabpool” on page 17-3
2 “Set the UseParallel Option to 'always'” on page 17-5
3 “Call the Function Using the Options Structure” on page 17-5
Open matlabpool
To run a statistical computation in parallel, first set up a parallel environment.
Note Setting up a parallel environment can take several seconds.
Multicore. For a multicore machine, enter the following at the MATLAB
command line:
matlabpool open n
n is the number of workers you want to use.
Network. If you have multiple processors on a network, use Parallel
Computing Toolbox functions and MATLAB® Distributed Computing Server™
software to establish parallel computation. Make sure that your system
is configured properly for parallel computing. Check with your system
administrator, or refer to the Parallel Computing Toolbox documentation, or
the Administrator Guide documentation for MATLAB Distributed Computing
Server.
Many parallel statistical functions call a function that can be one you define
in a file. For example, jackknife calls a function (jackfun) that can be a
built-in MATLAB function such as corr, but can also be a function you define.
Built-in functions are available to all workers. However, you must take extra
steps to enable workers to access a function file that you define.
To place a function file on the path of all workers, and check that it is
accessible:
1 At the command line, enter
matlabpool open conf
or
matlabpool open conf n
where conf is your configuration, and n is the number of processors you
want to use.
2 If network_file_path is the network path to your function file, enter
pctRunOnAll('addpath network_file_path')
so the worker processors can access your function file.
3 Check whether the file is on the path of every worker by entering:
pctRunOnAll('which filename')
If any worker does not have a path to the file, it reports:
filename not found.
Set the UseParallel Option to 'always'
Create an options structure with the statset function. To run in parallel, set
the UseParallel option to 'always':
paroptions = statset('UseParallel','always');
Call the Function Using the Options Structure
Call your function with syntax that uses the options structure. For example:
% Run crossval in parallel
cvMse = crossval('mse',x,y,'predfun',regf,'Options',paroptions);
% Run bootstrp in parallel
sts = bootstrp(100,@(x)[mean(x) std(x)],y,'Options',paroptions);
% Run TreeBagger in parallel
b = TreeBagger(50,meas,spec,'OOBPred','on','Options',paroptions);
For more complete examples of parallel statistical functions, see “Example:
Parallel Treebagger” on page 17-5 and “Examples of Parallel Statistical
Functions” on page 17-18.
After you have finished computing in parallel, close the parallel environment:
matlabpool close
Tip To save time, keep the pool open if you expect to compute in parallel
again soon.
Example: Parallel Treebagger
To run the example “Workflow Example: Regression of Insurance Risk Rating
for Car Imports with TreeBagger” on page 13-97 in parallel:
1 Set up the parallel environment to use two cores:
matlabpool open 2
Starting matlabpool using the 'local' configuration ...
connected to 2 labs.
2 Set the options to use parallel processing:
paroptions = statset('UseParallel','always');
3 Load the problem data and separate it into input and response:
load imports-85;
Y = X(:,1);
X = X(:,2:end);
4 Estimate feature importance using leaf size 1 and 1000 trees in parallel.
Time the function for comparison purposes:
tic
b = TreeBagger(1000,X,Y,'Method','r','OOBVarImp','on',...
'cat',16:25,'MinLeaf',1,'Options',paroptions);
toc
Elapsed time is 37.357930 seconds.
5 Perform the same computation in serial for timing comparison:
tic
b = TreeBagger(1000,X,Y,'Method','r','OOBVarImp','on',...
'cat',16:25,'MinLeaf',1); % No options gives serial
toc
Elapsed time is 63.921864 seconds.
Computing in parallel took less than 60% of the time of computing serially.
Concepts of Parallel Computing in Statistics Toolbox
In this section...
“Subtleties in Parallel Computing” on page 17-7
“Vocabulary for Parallel Computation” on page 17-7
Subtleties in Parallel Computing
There are two main subtleties in parallel computations:
• Nested parallel evaluations (see “No Nested parfor Loops” on page 17-11).
Only the outermost parfor loop runs in parallel, the others run serially.
• Reproducible results when using random numbers (see “Reproducibility
in Parallel Statistical Computations” on page 17-13). How can you get
exactly the same results when repeatedly running a parallel computation
that uses random numbers?
Vocabulary for Parallel Computation
• worker — An independent MATLAB session that runs code distributed
by the client.
• client — The MATLAB session with which you interact, and that distributes
jobs to workers.
• parfor — A Parallel Computing Toolbox function that distributes
independent code segments to workers (see “Working with parfor” on page
17-10).
• random stream — A pseudorandom number generator, and the sequence
of values it generates. MATLAB implements random streams with the
RandStream class.
• reproducible computation — A computation that can be exactly replicated,
even in the presence of random numbers (see “Reproducibility in Parallel
Statistical Computations” on page 17-13).
When to Run Statistical Functions in Parallel
In this section...
“Why Run in Parallel?” on page 17-8
“Factors Affecting Speed” on page 17-8
“Factors Affecting Results” on page 17-9
Why Run in Parallel?
The main reason to run statistical computations in parallel is to gain speed,
meaning to reduce the execution time of your program or functions. “Factors
Affecting Speed” on page 17-8 discusses the main items affecting the speed
of programs or functions. “Factors Affecting Results” on page 17-9 discusses
details that can cause a parallel run to give different results than a serial run.
Factors Affecting Speed
Some factors that can affect the speed of execution of parallel processing are:
• Parallel environment setup. It takes time to run matlabpool to begin
computing in parallel. If your computation is fast, the setup time can
exceed any time saved by computing in parallel.
• Parallel overhead. There is overhead in communication and coordination
when running in parallel. If function evaluations are fast, this overhead
could be an appreciable part of the total computation time. Thus, solving
a problem in parallel can be slower than solving the problem serially.
For an example, see Improving Optimization Performance with Parallel
Computing in MATLAB Digest, March 2009.
• No nested parfor loops. This is described in “Working with parfor” on
page 17-10. parfor does not work in parallel when called from within
another parfor loop. If you have programmed your custom functions to
take advantage of parallel processing, the limitation of no nested parfor
loops can cause a parallel function to run slower than expected.
• When executing serially, parfor loops run slightly slower than for loops.
• Passing parameters. Parameters are automatically passed to worker
sessions during the execution of parallel computations. If there are many
parameters, or they take a large amount of memory, passing parameters
can slow the execution of your computation.
• Contention for resources: network and computing. If the pool of workers
has low bandwidth or high latency, parallel computation can be slow.
Factors Affecting Results
Some factors can affect results when using parallel processing. There are
several caveats related to parfor listed in “Limitations” in the Parallel
Computing Toolbox documentation. Some important factors are:
• Persistent or global variables. If any functions use persistent or global
variables, these variables can take different values on different worker
processors. Furthermore, they might not be cleared properly on the worker
processors.
• Accessing external files. External files can be accessed unpredictably
during a parallel computation. The order of computations is not guaranteed
during parallel processing, so external files can be accessed in unpredictable
order, leading to unpredictable results. Furthermore, if multiple processors
try to read an external file simultaneously, the file can become locked,
leading to a read error, and halting function execution.
• Noncomputational functions, such as input, plot, and keyboard, can
behave badly when used in your custom functions. When called in a parfor
loop, these functions are executed on worker machines. This can cause a
worker to become nonresponsive, since it is waiting for input.
• parfor does not allow break or return statements.
• The random numbers you use can affect the results of your computations.
See “Reproducibility in Parallel Statistical Computations” on page 17-13.
Working with parfor
In this section...
“How Statistical Functions Use parfor” on page 17-10
“Characteristics of parfor” on page 17-11
How Statistical Functions Use parfor
parfor is a Parallel Computing Toolbox function similar to a for loop.
Parallel statistical functions call parfor internally. parfor distributes
computations to worker processors.
[Figure: the client executes lines of code from top to bottom; at parfor i = 1:n, the loop body is distributed to Worker 1 through Worker n, and the results are returned to the client.]
Characteristics of parfor
More caveats related to parfor appear in “Limitations” in the Parallel
Computing Toolbox documentation.
No Nested parfor Loops
parfor does not work in parallel when called from within another parfor
loop, or from an spmd block. Parallelization occurs only at the outermost level.
Suppose, for example, you want to apply jackknife to your function userfcn,
which calls parfor, and you want to call jackknife in a loop. The following
figure shows three cases:
1 The outermost loop is parfor. Only that loop runs in parallel.
2 The outermost parfor loop is in jackknife. Only jackknife runs in
parallel.
3 The outermost parfor loop is in userfcn. userfcn uses parfor in parallel.
[Figure: When parfor Runs in Parallel]
Reproducibility in Parallel Statistical Computations
In this section...
“Issues and Considerations in Reproducing Parallel Computations” on
page 17-13
“Running Reproducible Parallel Computations” on page 17-13
“Subtleties in Parallel Statistical Computation Using Random Numbers” on
page 17-14
Issues and Considerations in Reproducing Parallel
Computations
A reproducible computation is one that gives the same results every time it
runs. Reproducibility is important for:
• Debugging — To correct an anomalous result, you need to reproduce the
result.
• Confidence — When you can reproduce results, you can investigate and
understand them.
• Modifying existing code — When you change existing code, you want to
ensure that you do not break anything.
Generally, you do not need to ensure reproducibility for your computation.
Often, when you want reproducibility, the simplest technique is to run in
serial instead of in parallel. In serial computation you can simply execute
reset(RandStream.getDefaultStream) at the command line. A subsequent
call to your computation delivers the same results.
This section addresses the case when your function uses random numbers,
and you want reproducible results in parallel. This section also addresses the
case when you want the same results in parallel as in serial.
Running Reproducible Parallel Computations
To run a Statistics Toolbox function reproducibly:
1 Set the UseSubstreams option to 'always'.
2 Set the Streams option to a type that supports substreams: 'mlfg6331_64'
or 'mrg32k3a'. For information on these streams, see “Choosing a Random
Number Generator” in the MATLAB Mathematics documentation.
3 To compute in parallel, set the UseParallel option to 'always'.
4 Call the function with the options structure.
5 To reproduce the computation, reset the stream, then call the function
again.
To understand why this technique gives reproducibility, see “How Substreams
Enable Reproducible Parallel Computations” on page 17-15.
For example, to use the 'mlfg6331_64' stream for reproducible computation:
1 Create an appropriate options structure:
s = RandStream('mlfg6331_64');
options = statset('UseParallel','always', ...
'Streams',s,'UseSubstreams','always');
2 Run your parallel computation. For instructions, see “Quick Start Parallel
Computing for Statistics Toolbox” on page 17-2.
3 Reset the random stream:
reset(s);
4 Rerun your parallel computation. You obtain identical results.
For an example of a parallel computation run this reproducible way, see
“Reproducible Parallel Bootstrap” on page 17-22.
Subtleties in Parallel Statistical Computation Using
Random Numbers
What Are Substreams?
A substream is a portion of a random stream that RandStream can access
quickly. There is a number M such that, for any positive integer k, RandStream
can go directly to the kMth pseudorandom number in the stream. From that point,
RandStream can generate the subsequent entries in the stream. Currently,
RandStream has M = 2^72, about 5e21, or more.
[Figure: a random stream partitioned at positions M, 2M, 3M, ... into Substream 1, Substream 2, Substream 3, ...]
The entries in different substreams have good statistical properties, similar to
the properties of entries in a single stream: independence, and lack of k-way
correlation at various lags. The substreams are so long that you can view the
substreams as being independent streams, as in the following picture.
[Figure: Substream 1, Substream 2, and Substream 3 shown as separate, effectively independent sequences of random numbers]
Two RandStream stream types support substreams: 'mlfg6331_64' and
'mrg32k3a'.
How Substreams Enable Reproducible Parallel Computations
When MATLAB performs computations in parallel with parfor, each worker
receives loop iterations in an unpredictable order. Therefore, you cannot
predict which worker gets which iteration, so you cannot determine the random
numbers associated with each iteration.
Substreams allow MATLAB to tie each iteration to a particular sequence
of random numbers. parfor gives each iteration an index. The iteration
uses the index as the substream number. Since the random numbers are
associated with the iterations, not with the workers, the entire computation
is reproducible.
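A sketch of the underlying mechanism: setting the Substream property of a supported stream positions it at the start of that substream, so the same substream index always yields the same values.
s = RandStream('mlfg6331_64');
s.Substream = 3;            % jump to the start of substream 3
r = rand(s,1,5);
s.Substream = 3;            % return to the same point
isequal(r, rand(s,1,5))     % returns 1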
To obtain reproducible results, simply reset the stream, and all the
substreams generate identical random numbers when called again. This
method succeeds when all the workers use the same stream, and the stream
supports substreams. This concludes the discussion of how the procedure
in “Running Reproducible Parallel Computations” on page 17-13 gives
reproducible parallel results.
Random Numbers on the Client or Workers
A few functions generate random numbers on the client before distributing
them to parallel workers. The workers do not use random numbers, so
operate purely deterministically. For these functions, you can run a parallel
computation reproducibly using any random stream type.
The functions that operate this way include:
• crossval
• plsregress
• sequentialfs
To obtain identical results, reset the random stream on the client, or the
random stream you pass to the client. For example:
s = RandStream.getDefaultStream;
reset(s)
% run the statistical function
reset(s)
% run the statistical function again, obtain identical results
While this method enables you to run reproducibly in parallel, the results can
differ from a serial computation. The reason for the difference is that parfor loops
run in reverse order from for loops. Therefore, a serial computation can
generate random numbers in a different order than a parallel computation.
For unequivocal reproducibility, use the technique in “Running Reproducible
Parallel Computations” on page 17-13.
Distributing Streams Explicitly
For testing or comparison using particular random number algorithms, you
must set the random number generators. How do you set these generators
in parallel, or initialize streams on each worker in a particular way? Or
you might want to run a computation using a different sequence of random
numbers than any other you have run. How can you ensure the sequence
you use is statistically independent?
Parallel Statistics Toolbox functions allow you to set random streams on
each worker explicitly. For information on creating multiple streams, enter
help RandStream/create at the command line. To create four independent
streams using the 'mrg32k3a' generator:
s = RandStream.create('mrg32k3a','NumStreams',4,...
'CellOutput',true);
Pass these streams to a statistical function using the Streams option. For
example:
matlabpool open 4 % if you have at least 4 cores
s = RandStream.create('mrg32k3a','NumStreams',4,...
'CellOutput',true); % create 4 independent streams
paroptions = statset('UseParallel','always',...
'Streams',s); % set the 4 different streams
x = [randn(700,1); 4 + 2*randn(300,1)];
latt = -4:0.01:12;
myfun = @(X) ksdensity(X,latt);
pdfestimate = myfun(x);
B = bootstrp(200,myfun,x,'Options',paroptions);
See “Example: Parallel Bootstrap” on page 17-20 for a plot of the results
of this computation.
This method of distributing streams gives each worker a different stream for
the computation. However, it does not allow for a reproducible computation,
because the workers perform the 200 bootstraps in an unpredictable order. If
you want to perform a reproducible computation, use substreams as described
in “Running Reproducible Parallel Computations” on page 17-13.
If you set the UseSubstreams option to 'always', then set the Streams
option to a single random stream of the type that supports substreams
('mlfg6331_64' or 'mrg32k3a'). This setting gives reproducible
computations.
Examples of Parallel Statistical Functions
In this section...
“Example: Parallel Jackknife” on page 17-18
“Example: Parallel Cross Validation” on page 17-19
“Example: Parallel Bootstrap” on page 17-20
Example: Parallel Jackknife
This example is from the jackknife function reference page, but runs in
parallel.
matlabpool open
opts = statset('UseParallel','always');
sigma = 5;
y = normrnd(0,sigma,100,1);
m = jackknife(@var, y,1,'Options',opts);
n = length(y);
bias = -sigma^2 / n % known bias formula
jbias = (n - 1)*(mean(m)-var(y,1)) % jackknife bias estimate
bias =
-0.2500
jbias =
-0.2698
This simple example is not a good candidate for parallel computation:
% How long to compute in serial?
tic;m = jackknife(@var,y,1);toc
Elapsed time is 0.023852 seconds.
% How long to compute in parallel?
tic;m = jackknife(@var,y,1,'Options',opts);toc
Elapsed time is 1.911936 seconds.
jackknife does not use random numbers, so gives the same results every
time, whether run in parallel or serial.
Example: Parallel Cross Validation
• “Simple Parallel Cross Validation” on page 17-19
• “Reproducible Parallel Cross Validation” on page 17-19
Simple Parallel Cross Validation
This example is the same as the first in the crossval function reference page,
but runs in parallel.
matlabpool open
opts = statset('UseParallel','always');
load('fisheriris');
y = meas(:,1);
X = [ones(size(y,1),1),meas(:,2:4)];
regf = @(XTRAIN,ytrain,XTEST)(XTEST*regress(ytrain,XTRAIN));
cvMse = crossval('mse',X,y,'Predfun',regf,'Options',opts)
cvMse =
0.0999
This simple example is not a good candidate for parallel computation:
% How long to compute in serial?
tic;cvMse = crossval('mse',X,y,'Predfun',regf);toc
Elapsed time is 0.046005 seconds.
% How long to compute in parallel?
tic;cvMse = crossval('mse',X,y,'Predfun',regf,...
'Options',opts);toc
Elapsed time is 1.333021 seconds.
Reproducible Parallel Cross Validation
To run crossval in parallel in a reproducible fashion, set the options and
reset the random stream appropriately (see “Running Reproducible Parallel
Computations” on page 17-13).
matlabpool open
s = RandStream('mlfg6331_64');
opts = statset('UseParallel','always',...
'Streams',s,'UseSubstreams','always');
load('fisheriris');
y = meas(:,1);
X = [ones(size(y,1),1),meas(:,2:4)];
regf = @(XTRAIN,ytrain,XTEST)(XTEST*regress(ytrain,XTRAIN));
cvMse = crossval('mse',X,y,'Predfun',regf,'Options',opts)
cvMse =
0.1020
Reset the stream and the result is identical:
reset(s)
cvMse = crossval('mse',X,y,'Predfun',regf,'Options',opts)
cvMse =
0.1020
Example: Parallel Bootstrap
• “Bootstrap in Serial and Parallel” on page 17-20
• “Reproducible Parallel Bootstrap” on page 17-22
Bootstrap in Serial and Parallel
Here is an example timing a bootstrap in parallel versus in serial. The
example generates data from a mixture of two Gaussians, constructs a
nonparametric estimate of the resulting data, and uses a bootstrap to get
a sense of the sampling variability.
1 Generate the data:
% Generate a random sample of size 1000,
% from a mixture of two Gaussian distributions
x = [randn(700,1); 4 + 2*randn(300,1)];
2 Construct a nonparametric estimate of the density from the data:
latt = -4:0.01:12;
myfun = @(X) ksdensity(X,latt);
pdfestimate = myfun(x);
3 Bootstrap the estimate to get a sense of its sampling variability. Run the
bootstrap in serial for timing comparison.
tic;B = bootstrp(200,myfun,x);toc
Elapsed time is 17.455586 seconds.
4 Run the bootstrap in parallel for timing comparison:
matlabpool open
Starting matlabpool using the 'local' configuration ...
connected to 2 labs.
opt = statset('UseParallel','always');
tic;B = bootstrp(200,myfun,x,'Options',opt);toc
Elapsed time is 9.984345 seconds.
Computing in parallel is nearly twice as fast as computing in serial for
this example.
Overlay the ksdensity density estimate with the 200 bootstrapped estimates
obtained in the parallel bootstrap. You can get a sense of how to assess the
accuracy of the density estimate from this plot.
hold on
for i=1:size(B,1),
plot(latt,B(i,:),'c:')
end
plot(latt,pdfestimate);
xlabel('x');ylabel('Density estimate')
Reproducible Parallel Bootstrap
To run the example in parallel in a reproducible fashion, set the options
appropriately (see “Running Reproducible Parallel Computations” on page
17-13). First set up the problem and parallel environment as in “Bootstrap in
Serial and Parallel” on page 17-20. Then set the options to use substreams
along with a stream that supports substreams.
s = RandStream('mlfg6331_64'); % has substreams
opts = statset('UseParallel','always',...
'Streams',s,'UseSubstreams','always');
B2 = bootstrp(200,myfun,x,'Options',opts);
To rerun the bootstrap and get the same result:
reset(s) % set the stream to initial state
B3 = bootstrp(200,myfun,x,'Options',opts);
isequal(B2,B3) % check if same results
ans =
1
18
Function Reference
File I/O (p. 18-2) - Data file input/output
Data Organization (p. 18-3) - Data arrays and groups
Descriptive Statistics (p. 18-8) - Data summaries
Statistical Visualization (p. 18-11) - Data patterns and trends
Probability Distributions (p. 18-15) - Modeling data frequency
Hypothesis Tests (p. 18-31) - Inferences from data
Analysis of Variance (p. 18-32) - Modeling data variance
Parametric Regression Analysis (p. 18-33) - Continuous data models
Multivariate Methods (p. 18-36) - Visualization and reduction
Cluster Analysis (p. 18-38) - Identifying data categories
Model Assessment (p. 18-39) - Identifying data categories
Parametric Classification (p. 18-40) - Categorical data models
Supervised Learning (p. 18-42) - Classification and regression via trees, bagging, boosting, and more
Hidden Markov Models (p. 18-53) - Stochastic data models
Design of Experiments (p. 18-54) - Systematic data collection
Statistical Process Control (p. 18-58) - Production monitoring
GUIs (p. 18-59) - Interactive tools
Utilities (p. 18-60) - General purpose
File I/O
caseread - Read case names from file
casewrite - Write case names to file
tblread - Read tabular data from file
tblwrite - Write tabular data to file
tdfread - Read tab-delimited file
xptread - Create dataset array from data stored in SAS XPORT format file
Data Organization
Categorical Arrays (p. 18-3)
Dataset Arrays (p. 18-6)
Grouped Data (p. 18-7)
Categorical Arrays
addlevels (categorical) - Add levels to categorical array
cat (categorical) - Concatenate categorical arrays
categorical - Create categorical array
cellstr (categorical) - Convert categorical array to cell array of strings
char (categorical) - Convert categorical array to character array
circshift (categorical) - Shift categorical array circularly
ctranspose (categorical) - Transpose categorical matrix
double (categorical) - Convert categorical array to double array
droplevels (categorical) - Drop levels
end (categorical) - Last index in indexing expression for categorical array
flipdim (categorical) - Flip categorical array along specified dimension
fliplr (categorical) - Flip categorical matrix in left/right direction
flipud (categorical) - Flip categorical matrix in up/down direction
getlabels (categorical) - Access categorical array labels
getlevels (categorical) - Get categorical array levels
hist (categorical) - Plot histogram of categorical data
horzcat (categorical) - Horizontal concatenation for categorical arrays
int16 (categorical) - Convert categorical array to signed 16-bit integer array
int32 (categorical) - Convert categorical array to signed 32-bit integer array
int64 (categorical) - Convert categorical array to signed 64-bit integer array
int8 (categorical) - Convert categorical array to signed 8-bit integer array
intersect (categorical) - Set intersection for categorical arrays
ipermute (categorical) - Inverse permute dimensions of categorical array
isempty (categorical) - True for empty categorical array
isequal (categorical) - True if categorical arrays are equal
islevel (categorical) - Test for levels
ismember (categorical) - True for elements of categorical array in set
ismember (ordinal) - Test for membership
isscalar (categorical) - True if categorical array is scalar
isundefined (categorical) - Test for undefined elements
isvector (categorical) - True if categorical array is vector
length (categorical) - Length of categorical array
levelcounts (categorical) - Element counts by level
mergelevels (ordinal) - Merge levels
ndims (categorical) - Number of dimensions of categorical array
nominal - Construct nominal categorical array
numel (categorical) - Number of elements in categorical array
ordinal - Construct ordinal categorical array
permute (categorical) - Permute dimensions of categorical array
reorderlevels (categorical) - Reorder levels
repmat (categorical) - Replicate and tile categorical array
reshape (categorical) - Resize categorical array
rot90 (categorical) - Rotate categorical matrix 90 degrees
setdiff (categorical) - Set difference for categorical arrays
setlabels (categorical) - Label levels
setxor (categorical) - Set exclusive-or for categorical arrays
shiftdim (categorical) - Shift dimensions of categorical array
single (categorical) - Convert categorical array to single array
size (categorical) - Size of categorical array
sort (ordinal) - Sort elements of ordinal array
sortrows (ordinal) - Sort rows
squeeze (categorical) - Squeeze singleton dimensions from categorical array
summary (categorical) - Summary statistics for categorical array
times (categorical) - Product of categorical arrays
transpose (categorical) - Transpose categorical matrix
uint16 (categorical) - Convert categorical array to unsigned 16-bit integers
uint32 (categorical) - Convert categorical array to unsigned 32-bit integers
uint64 (categorical) - Convert categorical array to unsigned 64-bit integers
uint8 (categorical) - Convert categorical array to unsigned 8-bit integers
union (categorical) - Set union for categorical arrays
unique (categorical) - Unique values in categorical array
vertcat (categorical) - Vertical concatenation for categorical arrays
Dataset Arrays
cat (dataset) - Concatenate dataset arrays
cellstr (dataset) - Create cell array of strings from dataset array
dataset - Construct dataset array
datasetfun (dataset) - Apply function to dataset array variables
double (dataset) - Convert dataset variables to double array
end (dataset) - Last index in indexing expression for dataset array
export (dataset) - Write dataset array to file
get (dataset) - Access dataset array properties
grpstats (dataset) - Summary statistics by group for dataset arrays
horzcat (dataset) - Horizontal concatenation for dataset arrays
isempty (dataset) - True for empty dataset array
join (dataset) - Merge observations
length (dataset) - Length of dataset array
ndims (dataset) - Number of dimensions of dataset array
numel (dataset) - Number of elements in dataset array
replacedata (dataset) - Replace dataset variables
set (dataset) - Set and display properties
single (dataset) - Convert dataset variables to single array
size (dataset) - Size of dataset array
sortrows (dataset) - Sort rows of dataset array
stack (dataset) - Stack data from multiple variables into single variable
summary (dataset) - Print summary of dataset array
unique (dataset) - Unique observations in dataset array
unstack (dataset) - Unstack data from single variable into multiple variables
vertcat (dataset) - Vertical concatenation for dataset arrays
Grouped Data
gplotmatrix - Matrix of scatter plots by group
grp2idx - Create index vector from grouping variable
grpstats - Summary statistics by group
gscatter - Scatter plot by group
Descriptive Statistics
Summaries (p. 18-8)
Measures of Central Tendency (p. 18-8)
Measures of Dispersion (p. 18-8)
Measures of Shape (p. 18-9)
Statistics Resampling (p. 18-9)
Data with Missing Values (p. 18-9)
Data Correlation (p. 18-10)
Summaries
crosstab - Cross-tabulation
grpstats - Summary statistics by group
summary (categorical) - Summary statistics for categorical array
tabulate - Frequency table
Measures of Central Tendency
geomean - Geometric mean
harmmean - Harmonic mean
trimmean - Mean excluding outliers
Measures of Dispersion
iqr - Interquartile range
mad - Mean or median absolute deviation
moment - Central moments
range - Range of values
Measures of Shape
kurtosis - Kurtosis
moment - Central moments
prctile - Calculate percentile values
quantile - Quantiles
skewness - Skewness
zscore - Standardized z-scores
Statistics Resampling
bootci - Bootstrap confidence interval
bootstrp - Bootstrap sampling
jackknife - Jackknife sampling
Data with Missing Values
nancov - Covariance ignoring NaN values
nanmax - Maximum ignoring NaN values
nanmean - Mean ignoring NaN values
nanmedian - Median ignoring NaN values
nanmin - Minimum ignoring NaN values
nanstd - Standard deviation ignoring NaN values
nansum - Sum ignoring NaN values
nanvar - Variance, ignoring NaN values
Data Correlation
canoncorr - Canonical correlation
cholcov - Cholesky-like covariance decomposition
cophenet - Cophenetic correlation coefficient
corr - Linear or rank correlation
corrcov - Convert covariance matrix to correlation matrix
partialcorr - Linear or rank partial correlation coefficients
tiedrank - Rank adjusted for ties
Statistical Visualization
Distribution Plots (p. 18-11)
Scatter Plots (p. 18-12)
ANOVA Plots (p. 18-12)
Regression Plots (p. 18-13)
Multivariate Plots (p. 18-13)
Cluster Plots (p. 18-13)
Classification Plots (p. 18-14)
DOE Plots (p. 18-14)
SPC Plots (p. 18-14)
Distribution Plots
boxplot - Box plot
cdfplot - Empirical cumulative distribution function plot
dfittool - Interactive distribution fitting
disttool - Interactive density and distribution plots
ecdfhist - Empirical cumulative distribution function histogram
fsurfht - Interactive contour plot
hist3 - Bivariate histogram
histfit - Histogram with normal fit
normplot - Normal probability plot
normspec - Normal density plot between specifications
pareto - Pareto chart
probplot - Probability plots
qqplot - Quantile-quantile plot
randtool - Interactive random number generation
scatterhist - Scatter plot with marginal histograms
surfht - Interactive contour plot
wblplot - Weibull probability plot
Scatter Plots
gline - Interactively add line to plot
gname - Add case names to plot
gplotmatrix - Matrix of scatter plots by group
gscatter - Scatter plot by group
lsline - Add least-squares line to scatter plot
refcurve - Add reference curve to plot
refline - Add reference line to plot
scatterhist - Scatter plot with marginal histograms
ANOVA Plots
anova1 - One-way analysis of variance
aoctool - Interactive analysis of covariance
manovacluster - Dendrogram of group mean clusters following MANOVA
multcompare - Multiple comparison test
Regression Plots
addedvarplot - Added-variable plot
gline - Interactively add line to plot
lsline - Add least-squares line to scatter plot
polytool - Interactive polynomial fitting
rcoplot - Residual case order plot
refcurve - Add reference curve to plot
refline - Add reference line to plot
robustdemo - Interactive robust regression
rsmdemo - Interactive response surface demonstration
rstool - Interactive response surface modeling
view (classregtree) - Plot tree
Multivariate Plots
andrewsplot - Andrews plot
biplot - Biplot
glyphplot - Glyph plot
parallelcoords - Parallel coordinates plot
Cluster Plots
dendrogram - Dendrogram plot
manovacluster - Dendrogram of group mean clusters following MANOVA
silhouette - Silhouette plot
Classification Plots
perfcurve - Compute Receiver Operating Characteristic (ROC) curve or other performance curve for classifier output
view (classregtree) - Plot tree
DOE Plots
interactionplot - Interaction plot for grouped data
maineffectsplot - Main effects plot for grouped data
multivarichart - Multivari chart for grouped data
rsmdemo - Interactive response surface demonstration
rstool - Interactive response surface modeling
SPC Plots
capaplot - Process capability plot
controlchart - Shewhart control charts
histfit - Histogram with normal fit
normspec - Normal density plot between specifications
Probability Distributions
Distribution Objects (p. 18-15)
Distribution Plots (p. 18-16)
Probability Density (p. 18-17)
Cumulative Distribution (p. 18-19)
Inverse Cumulative Distribution (p. 18-21)
Distribution Statistics (p. 18-23)
Distribution Fitting (p. 18-24)
Negative Log-Likelihood (p. 18-26)
Random Number Generators (p. 18-26)
Quasi-Random Numbers (p. 18-28)
Piecewise Distributions (p. 18-29)
Distribution Objects
cdf (ProbDist) - Return cumulative distribution function (CDF) for ProbDist object
fitdist - Fit probability distribution to data
icdf (ProbDistUnivKernel) - Return inverse cumulative distribution function (ICDF) for ProbDistUnivKernel object
icdf (ProbDistUnivParam) - Return inverse cumulative distribution function (ICDF) for ProbDistUnivParam object
iqr (ProbDistUnivKernel) - Return interquartile range (IQR) for ProbDistUnivKernel object
iqr (ProbDistUnivParam) - Return interquartile range (IQR) for ProbDistUnivParam object
mean (ProbDistUnivParam) - Return mean of ProbDistUnivParam object
median (ProbDistUnivKernel) - Return median of ProbDistUnivKernel object
median (ProbDistUnivParam) - Return median of ProbDistUnivParam object
paramci (ProbDistUnivParam) - Return parameter confidence intervals of ProbDistUnivParam object
pdf (ProbDist) - Return probability density function (PDF) for ProbDist object
ProbDistUnivKernel - Construct ProbDistUnivKernel object
ProbDistUnivParam - Construct ProbDistUnivParam object
random (ProbDist) - Generate random number drawn from ProbDist object
std (ProbDistUnivParam) - Return standard deviation of ProbDistUnivParam object
var (ProbDistUnivParam) - Return variance of ProbDistUnivParam object
Distribution Plots
boxplot
Box plot
cdfplot
Empirical cumulative distribution
function plot
dfittool
Interactive distribution fitting
disttool
Interactive density and distribution
plots
ecdfhist
Empirical cumulative distribution
function histogram
fsurfht
Interactive contour plot
hist3
Bivariate histogram
histfit
Histogram with normal fit
normplot
Normal probability plot
normspec
Normal density plot between
specifications
pareto
Pareto chart
probplot
Probability plots
qqplot
Quantile-quantile plot
randtool
Interactive random number
generation
scatterhist
Scatter plot with marginal
histograms
surfht
Interactive contour plot
wblplot
Weibull probability plot
Probability Density
betapdf
Beta probability density function
binopdf
Binomial probability density
function
chi2pdf
Chi-square probability density
function
copulapdf
Copula probability density function
disttool
Interactive density and distribution
plots
evpdf
Extreme value probability density
function
exppdf
Exponential probability density
function
fpdf
F probability density function
gampdf
Gamma probability density function
geopdf
Geometric probability density
function
gevpdf
Generalized extreme value
probability density function
gppdf
Generalized Pareto probability
density function
hygepdf
Hypergeometric probability density
function
ksdensity
Kernel smoothing density estimate
lognpdf
Lognormal probability density
function
mnpdf
Multinomial probability density
function
mvnpdf
Multivariate normal probability
density function
mvtpdf
Multivariate t probability density
function
nbinpdf
Negative binomial probability
density function
ncfpdf
Noncentral F probability density
function
nctpdf
Noncentral t probability density
function
ncx2pdf
Noncentral chi-square probability
density function
normpdf
Normal probability density function
pdf
Probability density functions
pdf (gmdistribution)
Probability density function for
Gaussian mixture distribution
pdf (piecewisedistribution)
Probability density function for
piecewise distribution
poisspdf
Poisson probability density function
random (piecewisedistribution)
Random numbers from piecewise
distribution
raylpdf
Rayleigh probability density function
tpdf
Student’s t probability density
function
unidpdf
Discrete uniform probability density
function
unifpdf
Continuous uniform probability
density function
wblpdf
Weibull probability density function
Cumulative Distribution
betacdf
Beta cumulative distribution
function
binocdf
Binomial cumulative distribution
function
cdf
Cumulative distribution functions
cdf (gmdistribution)
Cumulative distribution function for
Gaussian mixture distribution
cdf (piecewisedistribution)
Cumulative distribution function for
piecewise distribution
cdfplot
Empirical cumulative distribution
function plot
chi2cdf
Chi-square cumulative distribution
function
copulacdf
Copula cumulative distribution
function
disttool
Interactive density and distribution
plots
ecdf
Empirical cumulative distribution
function
ecdfhist
Empirical cumulative distribution
function histogram
evcdf
Extreme value cumulative
distribution function
expcdf
Exponential cumulative distribution
function
fcdf
F cumulative distribution function
gamcdf
Gamma cumulative distribution
function
geocdf
Geometric cumulative distribution
function
gevcdf
Generalized extreme value
cumulative distribution function
gpcdf
Generalized Pareto cumulative
distribution function
hygecdf
Hypergeometric cumulative
distribution function
logncdf
Lognormal cumulative distribution
function
mvncdf
Multivariate normal cumulative
distribution function
mvtcdf
Multivariate t cumulative
distribution function
ncfcdf
Noncentral F cumulative
distribution function
nctcdf
Noncentral t cumulative distribution
function
ncx2cdf
Noncentral chi-square cumulative
distribution function
normcdf
Normal cumulative distribution
function
poisscdf
Poisson cumulative distribution
function
raylcdf
Rayleigh cumulative distribution
function
tcdf
Student’s t cumulative distribution
function
unidcdf
Discrete uniform cumulative
distribution function
unifcdf
Continuous uniform cumulative
distribution function
wblcdf
Weibull cumulative distribution
function
Inverse Cumulative Distribution
betainv
Beta inverse cumulative distribution
function
binoinv
Binomial inverse cumulative
distribution function
chi2inv
Chi-square inverse cumulative
distribution function
evinv
Extreme value inverse cumulative
distribution function
expinv
Exponential inverse cumulative
distribution function
finv
F inverse cumulative distribution
function
gaminv
Gamma inverse cumulative
distribution function
geoinv
Geometric inverse cumulative
distribution function
gevinv
Generalized extreme value inverse
cumulative distribution function
gpinv
Generalized Pareto inverse
cumulative distribution function
hygeinv
Hypergeometric inverse cumulative
distribution function
icdf
Inverse cumulative distribution
functions
icdf (piecewisedistribution)
Inverse cumulative distribution
function for piecewise distribution
logninv
Lognormal inverse cumulative
distribution function
nbininv
Negative binomial inverse
cumulative distribution function
ncfinv
Noncentral F inverse cumulative
distribution function
nctinv
Noncentral t inverse cumulative
distribution function
ncx2inv
Noncentral chi-square inverse
cumulative distribution function
norminv
Normal inverse cumulative
distribution function
poissinv
Poisson inverse cumulative
distribution function
raylinv
Rayleigh inverse cumulative
distribution function
tinv
Student’s t inverse cumulative
distribution function
unidinv
Discrete uniform inverse cumulative
distribution function
unifinv
Continuous uniform inverse
cumulative distribution function
wblinv
Weibull inverse cumulative
distribution function
Distribution Statistics
betastat
Beta mean and variance
binostat
Binomial mean and variance
chi2stat
Chi-square mean and variance
copulastat
Copula rank correlation
evstat
Extreme value mean and variance
expstat
Exponential mean and variance
fstat
F mean and variance
gamstat
Gamma mean and variance
geostat
Geometric mean and variance
gevstat
Generalized extreme value mean
and variance
gpstat
Generalized Pareto mean and
variance
hygestat
Hypergeometric mean and variance
lognstat
Lognormal mean and variance
nbinstat
Negative binomial mean and
variance
ncfstat
Noncentral F mean and variance
nctstat
Noncentral t mean and variance
ncx2stat
Noncentral chi-square mean and
variance
normstat
Normal mean and variance
poisstat
Poisson mean and variance
raylstat
Rayleigh mean and variance
tstat
Student’s t mean and variance
unidstat
Discrete uniform mean and variance
unifstat
Continuous uniform mean and
variance
wblstat
Weibull mean and variance
Distribution Fitting
Supported Distributions (p. 18-24)
Piecewise Distributions (p. 18-25)
Supported Distributions
betafit
Beta parameter estimates
binofit
Binomial parameter estimates
copulafit
Fit copula to data
copulaparam
Copula parameters as function of
rank correlation
dfittool
Interactive distribution fitting
evfit
Extreme value parameter estimates
expfit
Exponential parameter estimates
fit (gmdistribution)
Gaussian mixture parameter
estimates
gamfit
Gamma parameter estimates
gevfit
Generalized extreme value
parameter estimates
gpfit
Generalized Pareto parameter
estimates
histfit
Histogram with normal fit
johnsrnd
Johnson system random numbers
lognfit
Lognormal parameter estimates
mle
Maximum likelihood estimates
mlecov
Asymptotic covariance of maximum
likelihood estimators
nbinfit
Negative binomial parameter
estimates
normfit
Normal parameter estimates
normplot
Normal probability plot
pearsrnd
Pearson system random numbers
poissfit
Poisson parameter estimates
raylfit
Rayleigh parameter estimates
unifit
Continuous uniform parameter
estimates
wblfit
Weibull parameter estimates
wblplot
Weibull probability plot
Piecewise Distributions
boundary (piecewisedistribution)
Piecewise distribution boundaries
lowerparams (paretotails)
Lower Pareto tails parameters
nsegments (piecewisedistribution)
Number of segments
paretotails
Construct Pareto tails object
piecewisedistribution
Create piecewise distribution object
segment (piecewisedistribution)
Segments containing values
upperparams (paretotails)
Upper Pareto tails parameters
Negative Log-Likelihood
betalike
Beta negative log-likelihood
evlike
Extreme value negative
log-likelihood
explike
Exponential negative log-likelihood
gamlike
Gamma negative log-likelihood
gevlike
Generalized extreme value negative
log-likelihood
gplike
Generalized Pareto negative
log-likelihood
lognlike
Lognormal negative log-likelihood
mvregresslike
Negative log-likelihood for
multivariate regression
normlike
Normal negative log-likelihood
wbllike
Weibull negative log-likelihood
Random Number Generators
betarnd
Beta random numbers
binornd
Binomial random numbers
chi2rnd
Chi-square random numbers
copularnd
Copula random numbers
evrnd
Extreme value random numbers
exprnd
Exponential random numbers
frnd
F random numbers
gamrnd
Gamma random numbers
geornd
Geometric random numbers
gevrnd
Generalized extreme value random
numbers
gprnd
Generalized Pareto random numbers
hygernd
Hypergeometric random numbers
iwishrnd
Inverse Wishart random numbers
johnsrnd
Johnson system random numbers
lhsdesign
Latin hypercube sample
lhsnorm
Latin hypercube sample from normal
distribution
lognrnd
Lognormal random numbers
mhsample
Metropolis-Hastings sample
mnrnd
Multinomial random numbers
mvnrnd
Multivariate normal random
numbers
mvtrnd
Multivariate t random numbers
nbinrnd
Negative binomial random numbers
ncfrnd
Noncentral F random numbers
nctrnd
Noncentral t random numbers
ncx2rnd
Noncentral chi-square random
numbers
normrnd
Normal random numbers
pearsrnd
Pearson system random numbers
poissrnd
Poisson random numbers
randg
Gamma random numbers
random
Random numbers
random (gmdistribution)
Random numbers from Gaussian
mixture distribution
random (piecewisedistribution)
Random numbers from piecewise
distribution
randsample
Random sample
randtool
Interactive random number
generation
raylrnd
Rayleigh random numbers
slicesample
Slice sampler
trnd
Student’s t random numbers
unidrnd
Discrete uniform random numbers
unifrnd
Continuous uniform random
numbers
wblrnd
Weibull random numbers
wishrnd
Wishart random numbers
Quasi-Random Numbers
addlistener (qrandstream)
Add listener for event
delete (qrandstream)
Delete handle object
end (qrandset)
Last index in indexing expression for
point set
eq (qrandstream)
Test handle equality
findobj (qrandstream)
Find objects matching specified
conditions
findprop (qrandstream)
Find property of MATLAB handle
object
ge (qrandstream)
Greater than or equal relation for
handles
gt (qrandstream)
Greater than relation for handles
haltonset
Construct Halton quasi-random
point set
isvalid (qrandstream)
Test handle validity
le (qrandstream)
Less than or equal relation for
handles
length (qrandset)
Length of point set
lt (qrandstream)
Less than relation for handles
ndims (qrandset)
Number of dimensions in matrix
ne (qrandstream)
Not equal relation for handles
net (qrandset)
Generate quasi-random point set
notify (qrandstream)
Notify listeners of event
qrand (qrandstream)
Generate quasi-random points from
stream
qrandset
Abstract quasi-random point set
class
qrandstream
Construct quasi-random number
stream
rand (qrandstream)
Generate quasi-random points from
stream
reset (qrandstream)
Reset state
scramble (qrandset)
Scramble quasi-random point set
size (qrandset)
Number of dimensions in matrix
sobolset
Construct Sobol quasi-random point
set
Piecewise Distributions
boundary (piecewisedistribution)
Piecewise distribution boundaries
cdf (piecewisedistribution)
Cumulative distribution function for
piecewise distribution
icdf (piecewisedistribution)
Inverse cumulative distribution
function for piecewise distribution
lowerparams (paretotails)
Lower Pareto tails parameters
nsegments (piecewisedistribution)
Number of segments
paretotails
Construct Pareto tails object
pdf (piecewisedistribution)
Probability density function for
piecewise distribution
piecewisedistribution
Create piecewise distribution object
random (piecewisedistribution)
Random numbers from piecewise
distribution
segment (piecewisedistribution)
Segments containing values
upperparams (paretotails)
Upper Pareto tails parameters
Hypothesis Tests
ansaribradley
Ansari-Bradley test
barttest
Bartlett’s test
canoncorr
Canonical correlation
chi2gof
Chi-square goodness-of-fit test
dwtest
Durbin-Watson test
friedman
Friedman’s test
jbtest
Jarque-Bera test
kruskalwallis
Kruskal-Wallis test
kstest
One-sample Kolmogorov-Smirnov
test
kstest2
Two-sample Kolmogorov-Smirnov
test
lillietest
Lilliefors test
linhyptest
Linear hypothesis test
ranksum
Wilcoxon rank sum test
runstest
Run test for randomness
sampsizepwr
Sample size and power of test
signrank
Wilcoxon signed rank test
signtest
Sign test
ttest
One-sample and paired-sample t-test
ttest2
Two-sample t-test
vartest
Chi-square variance test
vartest2
Two-sample F-test for equal
variances
vartestn
Bartlett multiple-sample test for
equal variances
zscore
Standardized z-scores
ztest
z-test
Analysis of Variance
ANOVA Plots (p. 18-32)
ANOVA Operations (p. 18-32)
ANOVA Plots
anova1
One-way analysis of variance
aoctool
Interactive analysis of covariance
manovacluster
Dendrogram of group mean clusters
following MANOVA
multcompare
Multiple comparison test
ANOVA Operations
anova1
One-way analysis of variance
anova2
Two-way analysis of variance
anovan
N-way analysis of variance
aoctool
Interactive analysis of covariance
dummyvar
Create dummy variables
friedman
Friedman’s test
kruskalwallis
Kruskal-Wallis test
manova1
One-way multivariate analysis of
variance
manovacluster
Dendrogram of group mean clusters
following MANOVA
multcompare
Multiple comparison test
Parametric Regression Analysis
Regression Plots (p. 18-33)
Linear Regression (p. 18-34)
Nonlinear Regression (p. 18-35)
Regression Plots
addedvarplot
Added-variable plot
gline
Interactively add line to plot
lsline
Add least-squares line to scatter plot
polytool
Interactive polynomial fitting
rcoplot
Residual case order plot
refcurve
Add reference curve to plot
refline
Add reference line to plot
robustdemo
Interactive robust regression
rsmdemo
Interactive response surface
demonstration
rstool
Interactive response surface
modeling
view (classregtree)
Plot tree
Linear Regression
coxphfit
Cox proportional hazards regression
dummyvar
Create dummy variables
glmfit
Generalized linear model regression
glmval
Generalized linear model values
invpred
Inverse prediction
leverage
Leverage
mnrfit
Multinomial logistic regression
mnrval
Multinomial logistic regression
values
mvregress
Multivariate linear regression
mvregresslike
Negative log-likelihood for
multivariate regression
plsregress
Partial least-squares regression
polyconf
Polynomial confidence intervals
polytool
Interactive polynomial fitting
regress
Multiple linear regression
regstats
Regression diagnostics
ridge
Ridge regression
robustdemo
Interactive robust regression
robustfit
Robust regression
rsmdemo
Interactive response surface
demonstration
rstool
Interactive response surface
modeling
stepwise
Interactive stepwise regression
stepwisefit
Stepwise regression
x2fx
Convert predictor matrix to design
matrix
Nonlinear Regression
dummyvar
Create dummy variables
hougen
Hougen-Watson model
nlinfit
Nonlinear regression
nlintool
Interactive nonlinear regression
nlmefit
Nonlinear mixed-effects estimation
nlmefitsa
Fit nonlinear mixed effects model
with stochastic EM algorithm
nlparci
Nonlinear regression parameter
confidence intervals
nlpredci
Nonlinear regression prediction
confidence intervals
Multivariate Methods
Multivariate Plots (p. 18-36)
Multidimensional Scaling (p. 18-36)
Procrustes Analysis (p. 18-36)
Feature Selection (p. 18-37)
Feature Transformation (p. 18-37)
Multivariate Plots
andrewsplot
Andrews plot
biplot
Biplot
glyphplot
Glyph plot
parallelcoords
Parallel coordinates plot
Multidimensional Scaling
cmdscale
Classical multidimensional scaling
mahal
Mahalanobis distance
mdscale
Nonclassical multidimensional
scaling
pdist
Pairwise distance between pairs of
objects
squareform
Format distance matrix
Procrustes Analysis
procrustes
Procrustes analysis
Feature Selection
sequentialfs
Sequential feature selection
Feature Transformation
Nonnegative Matrix Factorization
(p. 18-37)
Principal Component Analysis
(p. 18-37)
Factor Analysis (p. 18-37)
Nonnegative Matrix Factorization
nnmf
Nonnegative matrix factorization
Principal Component Analysis
barttest
Bartlett’s test
pareto
Pareto chart
pcacov
Principal component analysis on
covariance matrix
pcares
Residuals from principal component
analysis
princomp
Principal component analysis (PCA)
on data
Factor Analysis
factoran
Factor analysis
Cluster Analysis
Cluster Plots (p. 18-38)
Hierarchical Clustering (p. 18-38)
K-Means Clustering (p. 18-39)
Gaussian Mixture Models (p. 18-39)
Cluster Plots
dendrogram
Dendrogram plot
manovacluster
Dendrogram of group mean clusters
following MANOVA
silhouette
Silhouette plot
Hierarchical Clustering
cluster
Construct agglomerative clusters
from linkages
clusterdata
Agglomerative clusters from data
cophenet
Cophenetic correlation coefficient
inconsistent
Inconsistency coefficient
linkage
Agglomerative hierarchical cluster
tree
pdist
Pairwise distance between pairs of
objects
squareform
Format distance matrix
K-Means Clustering
kmeans
K-means clustering
mahal
Mahalanobis distance
Gaussian Mixture Models
cdf (gmdistribution)
Cumulative distribution function for
Gaussian mixture distribution
cluster (gmdistribution)
Construct clusters from Gaussian
mixture distribution
fit (gmdistribution)
Gaussian mixture parameter
estimates
gmdistribution
Construct Gaussian mixture
distribution
mahal (gmdistribution)
Mahalanobis distance to component
means
pdf (gmdistribution)
Probability density function for
Gaussian mixture distribution
posterior (gmdistribution)
Posterior probabilities of components
random (gmdistribution)
Random numbers from Gaussian
mixture distribution
Model Assessment
confusionmat
Confusion matrix
crossval
Loss estimate using cross-validation
cvpartition
Create cross-validation partition for
data
repartition (cvpartition)
Repartition data for cross-validation
test (cvpartition)
Test indices for cross-validation
training (cvpartition)
Training indices for cross-validation
Parametric Classification
Classification Plots (p. 18-40)
Discriminant Analysis (p. 18-40)
Naive Bayes Classification (p. 18-40)
Distance Computation and Nearest
Neighbor Search (p. 18-41)
Classification Plots
perfcurve
Compute Receiver Operating
Characteristic (ROC) curve or other
performance curve for classifier
output
view (classregtree)
Plot tree
Discriminant Analysis
classify
Discriminant analysis
mahal
Mahalanobis distance
Naive Bayes Classification
fit (NaiveBayes)
Create Naive Bayes classifier object
by fitting training data
NaiveBayes
Create NaiveBayes object
posterior (NaiveBayes)
Compute posterior probability of
each class for test data
predict (NaiveBayes)
Predict class label for test data
Distance Computation and Nearest Neighbor Search
createns
Create object to use in k-nearest
neighbors search
knnsearch
Find k-nearest neighbors using data
knnsearch (ExhaustiveSearcher)
Find k-nearest neighbors using
ExhaustiveSearcher object
knnsearch (KDTreeSearcher)
Find k-nearest neighbors using
KDTreeSearcher object
pdist
Pairwise distance between pairs of
objects
pdist2
Pairwise distance between two sets
of observations
relieff
Importance of attributes (predictors)
using ReliefF algorithm
Supervised Learning
Classification Trees
catsplit (classregtree)
Categorical splits used for branches
in decision tree
children (classregtree)
Child nodes
classcount (classregtree)
Class counts
ClassificationPartitionedModel
Cross-validated classification model
ClassificationTree
Binary decision tree for classification
classname (classregtree)
Class names for classification
decision tree
classprob (classregtree)
Class probabilities
classregtree
Construct classification and
regression trees
classregtree
Classification and regression trees
compact (ClassificationTree)
Compact tree
CompactClassificationTree
Compact classification tree
crossval (ClassificationTree)
Cross-validated decision tree
cutcategories (classregtree)
Cut categories
cutpoint (classregtree)
Decision tree cut point values
cuttype (classregtree)
Cut types
cutvar (classregtree)
Cut variable names
cvloss (ClassificationTree)
Classification error by cross
validation
edge (CompactClassificationTree)
Classification edge
eval (classregtree)
Predicted responses
fit (ClassificationTree)
Fit classification tree
isbranch (classregtree)
Test node for branch
kfoldEdge
(ClassificationPartitionedModel)
Classification edge for observations
not used for training
kfoldfun
(ClassificationPartitionedModel)
Cross validate function
kfoldLoss
(ClassificationPartitionedModel)
Classification loss for observations
not used for training
kfoldMargin
(ClassificationPartitionedModel)
Classification margins for
observations not used for training
kfoldPredict
(ClassificationPartitionedModel)
Predict response for observations not
used for training
loss (CompactClassificationTree)
Classification error
margin (CompactClassificationTree)
Classification margins
meansurrvarassoc (classregtree)
Mean predictive measure of
association for surrogate splits in
decision tree
meanSurrVarAssoc
(CompactClassificationTree)
Mean predictive measure of
association for surrogate splits in
decision tree
nodeclass (classregtree)
Class values of nodes of classification
tree
nodeerr (classregtree)
Return vector of node errors
nodeprob (classregtree)
Node probabilities
nodesize (classregtree)
Return node size
numnodes (classregtree)
Number of nodes
parent (classregtree)
Parent node
predict (CompactClassificationTree)
Predict classification
predictorImportance
(CompactClassificationTree)
Estimates of predictor importance
prune (ClassificationTree)
Produce sequence of subtrees by
pruning
prune (classregtree)
Prune tree
prunelist (classregtree)
Pruning levels for decision tree
nodes
resubEdge (ClassificationTree)
Classification edge by resubstitution
resubLoss (ClassificationTree)
Classification error by resubstitution
resubMargin (ClassificationTree)
Classification margins by
resubstitution
resubPredict (ClassificationTree)
Predict resubstitution response of
tree
risk (classregtree)
Node risks
surrcutcategories (classregtree)
Categories used for surrogate splits
in decision tree
surrcutflip (classregtree)
Numeric cutpoint assignments used
for surrogate splits in decision tree
surrcutpoint (classregtree)
Cutpoints used for surrogate splits
in decision tree
surrcuttype (classregtree)
Types of surrogate splits used at
branches in decision tree
surrcutvar (classregtree)
Variables used for surrogate splits
in decision tree
surrvarassoc (classregtree)
Predictive measure of association for
surrogate splits in decision tree
template (ClassificationTree)
Create classification template
test (classregtree)
Error rate
type (classregtree)
Tree type
varimportance (classregtree)
Compute embedded estimates of
input feature importance
view (classregtree)
Plot tree
view (CompactClassificationTree)
View tree
Regression Trees
catsplit (classregtree)
Categorical splits used for branches
in decision tree
children (classregtree)
Child nodes
classregtree
Construct classification and
regression trees
classregtree
Classification and regression trees
compact (RegressionTree)
Compact regression tree
CompactRegressionTree
Compact regression tree
crossval (RegressionTree)
Cross-validated decision tree
cutcategories (classregtree)
Cut categories
cutpoint (classregtree)
Decision tree cut point values
cuttype (classregtree)
Cut types
cutvar (classregtree)
Cut variable names
cvloss (RegressionTree)
Regression error by cross validation
eval (classregtree)
Predicted responses
fit (RegressionTree)
Binary decision tree for regression
isbranch (classregtree)
Test node for branch
kfoldfun
(RegressionPartitionedModel)
Cross validate function
kfoldLoss
(RegressionPartitionedModel)
Cross-validation loss of partitioned
regression model
kfoldPredict
(RegressionPartitionedModel)
Predict response for observations not
used for training.
loss (CompactRegressionTree)
Regression error
meansurrvarassoc (classregtree)
Mean predictive measure of
association for surrogate splits in
decision tree
meanSurrVarAssoc
(CompactRegressionTree)
Mean predictive measure of
association for surrogate splits in
decision tree
nodeerr (classregtree)
Return vector of node errors
nodemean (classregtree)
Mean values of nodes of regression
tree
nodeprob (classregtree)
Node probabilities
nodesize (classregtree)
Return node size
numnodes (classregtree)
Number of nodes
parent (classregtree)
Parent node
predict (CompactRegressionTree)
Predict response of regression tree
predictorImportance
(CompactRegressionTree)
Estimates of predictor importance
prune (classregtree)
Prune tree
prune (RegressionTree)
Produce sequence of subtrees by
pruning
prunelist (classregtree)
Pruning levels for decision tree
nodes
RegressionPartitionedModel
Cross-validated regression model
RegressionTree
Regression tree
resubLoss (RegressionTree)
Regression error by resubstitution
resubPredict (RegressionTree)
Predict resubstitution response of
tree
risk (classregtree)
Node risks
surrcutcategories (classregtree)
Categories used for surrogate splits
in decision tree
surrcutflip (classregtree)
Numeric cutpoint assignments used
for surrogate splits in decision tree
surrcutpoint (classregtree)
Cutpoints used for surrogate splits
in decision tree
surrcuttype (classregtree)
Types of surrogate splits used at
branches in decision tree
surrcutvar (classregtree)
Variables used for surrogate splits
in decision tree
surrvarassoc (classregtree)
Predictive measure of association for
surrogate splits in decision tree
template (RegressionTree)
Create regression template
test (classregtree)
Error rate
type (classregtree)
Tree type
varimportance (classregtree)
Compute embedded estimates of
input feature importance
view (classregtree)
Plot tree
view (CompactRegressionTree)
View tree
Ensemble Methods — Classification
append (TreeBagger)
Append new trees to ensemble
ClassificationBaggedEnsemble
Classification ensemble grown by
resampling
ClassificationEnsemble
Ensemble classifier
ClassificationPartitionedEnsemble
Cross-validated classification
ensemble
combine (CompactTreeBagger)
Combine two ensembles
compact (ClassificationEnsemble)
Compact classification ensemble
compact (TreeBagger)
Compact ensemble of decision trees
CompactClassificationEnsemble
Compact classification ensemble
class
CompactTreeBagger
Compact ensemble of decision trees
grown by bootstrap aggregation
crossval (ClassificationEnsemble)
Cross validate ensemble
edge
(CompactClassificationEnsemble)
Classification edge
error (CompactTreeBagger)
Error (misclassification probability
or MSE)
error (TreeBagger)
Error (misclassification probability
or MSE)
fillProximities (TreeBagger)
Proximity matrix for training data
fitensemble
Fitted ensemble for classification or
regression
growTrees (TreeBagger)
Train additional trees and add to
ensemble
kfoldEdge
(ClassificationPartitionedEnsemble)
Classification edge for observations
not used for training
kfoldfun
(ClassificationPartitionedModel)
Cross validate function
kfoldLoss
(ClassificationPartitionedEnsemble)
Classification loss for observations
not used for training
kfoldMargin
(ClassificationPartitionedModel)
Classification margins for
observations not used for training
kfoldPredict
(ClassificationPartitionedModel)
Predict response for observations not
used for training
loss
(CompactClassificationEnsemble)
Classification error
margin
(CompactClassificationEnsemble)
Classification margins
margin (CompactTreeBagger)
Classification margin
margin (TreeBagger)
Classification margin
mdsProx (CompactTreeBagger)
Multidimensional scaling of
proximity matrix
mdsProx (TreeBagger)
Multidimensional scaling of
proximity matrix
meanMargin (CompactTreeBagger)
Mean classification margin
meanMargin (TreeBagger)
Mean classification margin
oobEdge
(ClassificationBaggedEnsemble)
Out-of-bag classification edge
oobError (TreeBagger)
Out-of-bag error
oobLoss
(ClassificationBaggedEnsemble)
Out-of-bag classification error
oobMargin
(ClassificationBaggedEnsemble)
Out-of-bag classification margins
oobMargin (TreeBagger)
Out-of-bag margins
oobMeanMargin (TreeBagger)
Out-of-bag mean margins
oobPredict
(ClassificationBaggedEnsemble)
Predict out-of-bag response of
ensemble
oobPredict (TreeBagger)
Ensemble predictions for out-of-bag
observations
outlierMeasure
(CompactTreeBagger)
Outlier measure for data
predict
(CompactClassificationEnsemble)
Predict classification
predict (CompactTreeBagger)
Predict response
predict (TreeBagger)
Predict response
predictorImportance
(CompactClassificationEnsemble)
Estimates of predictor importance
proximity (CompactTreeBagger)
Proximity matrix for data
resubEdge (ClassificationEnsemble)
Classification edge by resubstitution
resubLoss (ClassificationEnsemble)
Classification error by resubstitution
resubMargin
(ClassificationEnsemble)
Classification margins by
resubstitution
resubPredict
(ClassificationEnsemble)
Predict ensemble response by
resubstitution
resume (ClassificationEnsemble)
Resume training ensemble
resume
(ClassificationPartitionedEnsemble)
Resume training learners on
cross-validation folds
SetDefaultYfit
(CompactTreeBagger)
Set default value for predict
TreeBagger
Bootstrap aggregation for ensemble
of decision trees
Ensemble Methods — Regression
append (TreeBagger)
Append new trees to ensemble
combine (CompactTreeBagger)
Combine two ensembles
compact (RegressionEnsemble)
Create compact regression ensemble
compact (TreeBagger)
Compact ensemble of decision trees
CompactRegressionEnsemble
Compact regression ensemble class
CompactTreeBagger
Compact ensemble of decision trees
grown by bootstrap aggregation
crossval (RegressionEnsemble)
Cross validate ensemble
cvshrink (RegressionEnsemble)
Cross validate shrinking (pruning)
ensemble
error (CompactTreeBagger)
Error (misclassification probability
or MSE)
error (TreeBagger)
Error (misclassification probability
or MSE)
fillProximities (TreeBagger)
Proximity matrix for training data
fitensemble
Fitted ensemble for classification or
regression
growTrees (TreeBagger)
Train additional trees and add to
ensemble
kfoldfun
(RegressionPartitionedModel)
Cross validate function
kfoldLoss
(RegressionPartitionedEnsemble)
Cross-validation loss of partitioned
regression ensemble
kfoldPredict
(RegressionPartitionedModel)
Predict response for observations not
used for training.
loss (CompactRegressionEnsemble)
Regression error
mdsProx (CompactTreeBagger)
Multidimensional scaling of
proximity matrix
mdsProx (TreeBagger)
Multidimensional scaling of
proximity matrix
meanMargin (CompactTreeBagger)
Mean classification margin
oobError (TreeBagger)
Out-of-bag error
oobLoss
(RegressionBaggedEnsemble)
Out-of-bag regression error
oobPredict
(RegressionBaggedEnsemble)
Predict out-of-bag response of
ensemble
oobPredict (TreeBagger)
Ensemble predictions for out-of-bag
observations
outlierMeasure
(CompactTreeBagger)
Outlier measure for data
predict
(CompactRegressionEnsemble)
Predict response of ensemble
predict (CompactTreeBagger)
Predict response
predict (TreeBagger)
Predict response
predictorImportance
(CompactRegressionEnsemble)
Estimates of predictor importance
proximity (CompactTreeBagger)
Proximity matrix for data
RegressionBaggedEnsemble
Regression ensemble grown by
resampling
RegressionEnsemble
Ensemble regression
RegressionPartitionedEnsemble
Cross-validated regression ensemble
regularize (RegressionEnsemble)
Find weights to minimize
resubstitution error plus penalty
term
resubLoss (RegressionEnsemble)
Regression error by resubstitution
resubPredict (RegressionEnsemble)
Predict response of ensemble by
resubstitution
resume (RegressionEnsemble)
Resume training ensemble
resume
(RegressionPartitionedEnsemble)
Resume training ensemble
SetDefaultYfit
(CompactTreeBagger)
Set default value for predict
shrink (RegressionEnsemble)
Prune ensemble
TreeBagger
Bootstrap aggregation for ensemble
of decision trees
Hidden Markov Models
hmmdecode
Hidden Markov model posterior
state probabilities
hmmestimate
Hidden Markov model parameter
estimates from emissions and states
hmmgenerate
Hidden Markov model states and
emissions
hmmtrain
Hidden Markov model parameter
estimates from emissions
hmmviterbi
Hidden Markov model most probable
state path
Design of Experiments
DOE Plots (p. 18-54)
Full Factorial Designs (p. 18-54)
Fractional Factorial Designs
(p. 18-55)
Response Surface Designs (p. 18-55)
D-Optimal Designs (p. 18-55)
Latin Hypercube Designs (p. 18-55)
Quasi-Random Designs (p. 18-56)
DOE Plots
interactionplot
Interaction plot for grouped data
maineffectsplot
Main effects plot for grouped data
multivarichart
Multivari chart for grouped data
rsmdemo
Interactive response surface
demonstration
rstool
Interactive response surface
modeling
Full Factorial Designs
ff2n
Two-level full factorial design
fullfact
Full factorial design
Fractional Factorial Designs
fracfact
Fractional factorial design
fracfactgen
Fractional factorial design
generators
Response Surface Designs
bbdesign
Box-Behnken design
ccdesign
Central composite design
D-Optimal Designs
candexch
Candidate set row exchange
candgen
Candidate set generation
cordexch
Coordinate exchange
daugment
D-optimal augmentation
dcovary
D-optimal design with fixed
covariates
rowexch
Row exchange
rsmdemo
Interactive response surface
demonstration
Latin Hypercube Designs
lhsdesign
Latin hypercube sample
lhsnorm
Latin hypercube sample from normal
distribution
Quasi-Random Designs
addlistener (qrandstream)
Add listener for event
delete (qrandstream)
Delete handle object
end (qrandset)
Last index in indexing expression for
point set
eq (qrandstream)
Test handle equality
findobj (qrandstream)
Find objects matching specified
conditions
findprop (qrandstream)
Find property of MATLAB handle
object
ge (qrandstream)
Greater than or equal relation for
handles
gt (qrandstream)
Greater than relation for handles
haltonset
Construct Halton quasi-random
point set
isvalid (qrandstream)
Test handle validity
le (qrandstream)
Less than or equal relation for
handles
length (qrandset)
Length of point set
lt (qrandstream)
Less than relation for handles
ndims (qrandset)
Number of dimensions in matrix
ne (qrandstream)
Not equal relation for handles
net (qrandset)
Generate quasi-random point set
notify (qrandstream)
Notify listeners of event
qrand (qrandstream)
Generate quasi-random points from
stream
qrandset
Abstract quasi-random point set
class
qrandstream
Construct quasi-random number
stream
rand (qrandstream)
Generate quasi-random points from
stream
reset (qrandstream)
Reset state
scramble (qrandset)
Scramble quasi-random point set
size (qrandset)
Number of dimensions in matrix
sobolset
Construct Sobol quasi-random point
set
Statistical Process Control
SPC Plots (p. 18-58)
SPC Functions (p. 18-58)
SPC Plots
capaplot
Process capability plot
controlchart
Shewhart control charts
histfit
Histogram with normal fit
normspec
Normal density plot between
specifications
SPC Functions
capability
Process capability indices
controlrules
Western Electric and Nelson control
rules
gagerr
Gage repeatability and
reproducibility study
GUIs
aoctool
Interactive analysis of covariance
dfittool
Interactive distribution fitting
disttool
Interactive density and distribution
plots
fsurfht
Interactive contour plot
polytool
Interactive polynomial fitting
randtool
Interactive random number
generation
regstats
Regression diagnostics
robustdemo
Interactive robust regression
rsmdemo
Interactive response surface
demonstration
rstool
Interactive response surface
modeling
surfht
Interactive contour plot
Utilities
combnk
Enumeration of combinations
perms
Enumeration of permutations
statget
Access values in statistics options
structure
statset
Create statistics options structure
zscore
Standardized z-scores
19
Class Reference
• “Data Organization” on page 19-2
• “Probability Distributions” on page 19-3
• “Gaussian Mixture Models” on page 19-4
• “Model Assessment” on page 19-4
• “Parametric Classification” on page 19-5
• “Supervised Learning” on page 19-5
• “Quasi-Random Design of Experiments” on page 19-8
Data Organization
In this section...
“Categorical Arrays” on page 19-2
“Dataset Arrays” on page 19-2
Categorical Arrays
categorical
Arrays for categorical data
nominal
Arrays for nominal categorical data
ordinal
Arrays for ordinal categorical data
Dataset Arrays
dataset
Arrays for statistical data
Probability Distributions
In this section...
“Distribution Objects” on page 19-3
“Quasi-Random Numbers” on page 19-3
“Piecewise Distributions” on page 19-4
Distribution Objects
ProbDist
Object representing probability
distribution
ProbDistKernel
Object representing nonparametric
probability distribution defined by
kernel smoothing
ProbDistParametric
Object representing parametric
probability distribution
ProbDistUnivKernel
Object representing univariate
kernel probability distribution
ProbDistUnivParam
Object representing univariate
parametric probability distribution
Quasi-Random Numbers
haltonset
Halton quasi-random point sets
qrandset
Quasi-random point sets
qrandstream
Quasi-random number streams
sobolset
Sobol quasi-random point sets
Piecewise Distributions
paretotails
Empirical distributions with Pareto
tails
piecewisedistribution
Piecewise-defined distributions
Gaussian Mixture Models
gmdistribution
Gaussian mixture models
Model Assessment
cvpartition
Data partitions for cross-validation
Parametric Classification
In this section...
“Naive Bayes Classification” on page 19-5
“Distance Classifiers” on page 19-5
Naive Bayes Classification
NaiveBayes
Naive Bayes classifier
Distance Classifiers
ExhaustiveSearcher
Nearest neighbors search using
exhaustive search
KDTreeSearcher
Nearest neighbors search using
kd-tree
NeighborSearcher
Nearest neighbor search object
Supervised Learning
In this section...
“Classification Trees” on page 19-6
“Classification Ensemble Classes” on page 19-6
“Regression Trees” on page 19-6
“Regression Ensemble Classes” on page 19-7
Classification Trees
ClassificationPartitionedModel
Cross-validated classification model
ClassificationTree
Binary decision tree for classification
classregtree
Classification and regression trees
CompactClassificationTree
Compact classification tree
Classification Ensemble Classes
ClassificationBaggedEnsemble
Classification ensemble grown by
resampling
ClassificationEnsemble
Ensemble classifier
ClassificationPartitionedEnsemble
Cross-validated classification
ensemble
CompactClassificationEnsemble
Compact classification ensemble
class
CompactTreeBagger
Compact ensemble of decision trees
grown by bootstrap aggregation
TreeBagger
Bootstrap aggregation for ensemble
of decision trees
Regression Trees
classregtree
Classification and regression trees
CompactRegressionTree
Compact regression tree
RegressionPartitionedModel
Cross-validated regression model
RegressionTree
Regression tree
Regression Ensemble Classes
CompactRegressionEnsemble
Compact regression ensemble class
CompactTreeBagger
Compact ensemble of decision trees
grown by bootstrap aggregation
RegressionBaggedEnsemble
Regression ensemble grown by
resampling
RegressionEnsemble
Ensemble regression
RegressionPartitionedEnsemble
Cross-validated regression ensemble
TreeBagger
Bootstrap aggregation for ensemble
of decision trees
Quasi-Random Design of Experiments
haltonset
Halton quasi-random point sets
qrandset
Quasi-random point sets
qrandstream
Quasi-random number streams
sobolset
Sobol quasi-random point sets
20
Functions — Alphabetical List
addedvarplot
Purpose
Added-variable plot
Syntax
addedvarplot(X,y,num,inmodel)
addedvarplot(X,y,num,inmodel,stats)
Description
addedvarplot(X,y,num,inmodel) displays an added variable plot
using the predictive terms in X, the response values in y, the added
term in column num of X, and the model with current terms specified by
inmodel. X is an n-by-p matrix of n observations of p predictive terms.
y is a vector of n response values. num is a scalar index specifying the
column of X with the term to be added. inmodel is a logical vector of p
elements specifying the columns of X in the current model. By default,
all elements of inmodel are false.
Note addedvarplot automatically includes a constant term in all
models. Do not enter a column of 1s directly into X.
addedvarplot(X,y,num,inmodel,stats) uses the stats output from
the stepwisefit function to improve the efficiency of repeated calls to
addedvarplot. Otherwise, this syntax is equivalent to the previous
syntax.
Added variable plots are used to determine the unique effect of adding
a new term to a multilinear model. The plot shows the relationship
between the part of the response unexplained by terms already in the
model and the part of the new term unexplained by terms already in
the model. The “unexplained” parts are measured by the residuals of
the respective regressions. A scatter of the residuals from the two
regressions forms the added variable plot.
In addition to the scatter of residuals, the plot produced by
addedvarplot shows 95% confidence intervals on predictions from the
fitted line. The fitted line has intercept zero because, under typical
linear model assumptions, both of the plotted variables have mean zero.
The slope of the fitted line is the coefficient that the new term would
have if it were added to the model with terms inmodel.
Added variable plots are sometimes known as partial regression
leverage plots.
Examples
Load the data in hald.mat, which contains observations of the heat of
reaction of various cement mixtures:
load hald
whos
  Name           Size      Bytes  Class     Attributes

  Description    22x58      2552  char
  hald           13x5        520  double
  heat           13x1        104  double
  ingredients    13x4        416  double
Create an added variable plot to investigate the addition of the third
column of ingredients to a model consisting of the first two columns:
inmodel = [true true false false];
addedvarplot(ingredients,heat,3,inmodel)
The wide scatter and the low slope of the fitted line are evidence against
the statistical significance of adding the third column to the model.
See Also
stepwisefit | stepwise
categorical.addlevels
Purpose
Add levels to categorical array
Syntax
B = addlevels(A,newlevels)
Description
B = addlevels(A,newlevels) adds new levels to the categorical array
A. newlevels is a cell array of strings or a 2-D character matrix that
specifies the levels to add. addlevels adds the new levels at the end of
the list of possible categorical levels in A, but does not modify the value
of any element. B does not contain elements at the new levels.
Examples
Example 1
Add levels for additional species in Fisher’s iris data:
load fisheriris
species = nominal(species,...
{'Species1','Species2','Species3'},...
{'setosa','versicolor','virginica'});
species = addlevels(species,{'Species4','Species5'});
getlabels(species)
ans =
'Species1' 'Species2' 'Species3' 'Species4' 'Species5'
Example 2
1 Load patient data from the CSV file hospital.dat and store the
information in a dataset array with observation names given by the
first column in the data (patient identification):
patients = dataset('file','hospital.dat',...
'delimiter',',',...
'ReadObsNames',true);
2 Make the {0,1}-valued variable smoke nominal, and change the labels
to 'No' and 'Yes':
patients.smoke = nominal(patients.smoke,{'No','Yes'});
3 Add new levels to smoke as placeholders for more detailed histories
of smokers:
patients.smoke = addlevels(patients.smoke,...
{'0-5 Years','5-10 Years','LongTerm'});
4 Assuming the nonsmokers have never smoked, relabel the 'No' level:
patients.smoke = setlabels(patients.smoke,'Never','No');
5 Drop the undifferentiated 'Yes' level from smoke:
patients.smoke = droplevels(patients.smoke,'Yes');
Warning: OLDLEVELS contains categorical levels that
were present in A, caused some array elements to have
undefined levels.
Note that smokers now have an undefined level.
6 Set each smoker to one of the new levels, by observation name:
patients.smoke('YPL-320') = '5-10 Years';
See Also
droplevels | getlabels | islevel | mergelevels | reorderlevels
qrandstream.addlistener
Purpose
Add listener for event
Syntax
el = addlistener(hsource,'eventname',callback)
el = addlistener(hsource,property,'eventname',callback)
Description
el = addlistener(hsource,'eventname',callback) creates a
listener for the event named eventname, the source of which is handle
object hsource. If hsource is an array of source handles, the listener
responds to the named event on any handle in the array. callback is a
function handle that is invoked when the event is triggered.
el = addlistener(hsource,property,'eventname',callback)
adds a listener for a property event. eventname must be one of the
strings 'PreGet', 'PostGet', 'PreSet', and 'PostSet'. property
must be either a property name or cell array of property names, or a
meta.property or array of meta.property. The properties must belong
to the class of hsource. If hsource is scalar, property can include
dynamic properties.
For all forms, addlistener returns an event.listener. To remove
a listener, delete the object returned by addlistener. For example,
delete(el) calls the handle class delete method to remove the listener
and delete it from the workspace.
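For instance, a minimal sketch of typical usage (the stream dimension and the callback message are arbitrary choices here; ObjectBeingDestroyed is the standard event that every handle object, including a qrandstream, defines):

% Create a 2-D Halton stream and listen for its destruction
q = qrandstream('halton',2);
el = addlistener(q,'ObjectBeingDestroyed', ...
    @(src,evt) disp('qrandstream deleted'));
X = qrand(q,5);    % draw 5 quasi-random points from the stream
delete(q)          % destroying the stream triggers the listener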
See Also
delete | dynamicprops | event.listener | events | meta.property
| notify | qrandstream | reset
gmdistribution.AIC property
Purpose
Akaike Information Criterion
Description
The Akaike Information Criterion: 2*NlogL+2*m, where m is the number
of estimated parameters.
Note This property applies only to gmdistribution objects constructed
with fit.
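As a brief illustrative sketch (the sample data and the range of component counts are arbitrary), the property can be read after calling fit to compare mixtures with different numbers of components:

% Compare AIC for one- and two-component Gaussian mixture fits
X = [randn(200,1); 4 + randn(200,1)];   % bimodal sample data
AIC = zeros(1,2);
for k = 1:2
    obj = gmdistribution.fit(X,k);
    AIC(k) = obj.AIC;                   % 2*NlogL + 2*m for this fit
end
[minAIC,bestK] = min(AIC)               % smaller AIC suggests the better model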
andrewsplot
Purpose
Andrews plot
Syntax
andrewsplot(X)
andrewsplot(X,...,'Standardize',standopt)
andrewsplot(X,...,'Quantile',alpha)
andrewsplot(X,...,'Group',group)
andrewsplot(X,...,'PropName',PropVal,...)
h = andrewsplot(X,...)
Description
andrewsplot(X) creates an Andrews plot of the multivariate data in
the matrix X. The rows of X correspond to observations, the columns to
variables. Andrews plots represent each observation by a function f(t) of
a continuous dummy variable t over the interval [0,1]. f(t) is defined for
the i th observation in X as
f(t) = X(i,1)/√2 + X(i,2) sin(2πt) + X(i,3) cos(2πt) + ...
andrewsplot treats NaN values in X as missing values and ignores the
corresponding rows.
andrewsplot(X,...,'Standardize',standopt) creates an Andrews
plot where standopt is one of the following:
• 'on' — scales each column of X to have mean 0 and standard
deviation 1 before making the plot.
• 'PCA' — creates an Andrews plot from the principal component
scores of X, in order of decreasing eigenvalue. (See princomp.)
• 'PCAStd' — creates an Andrews plot using the standardized
principal component scores. (See princomp.)
andrewsplot(X,...,'Quantile',alpha) plots only the median and
the alpha and (1 – alpha) quantiles of f(t) at each value of t. This is
useful if X contains many observations.
andrewsplot(X,...,'Group',group) plots the data in different groups
with different colors. Groups are defined by group, a numeric array
containing a group index for each observation. group can also be a
categorical array, character matrix, or cell array of strings containing a
group name for each observation. (See “Grouped Data” on page 2-34.)
andrewsplot(X,...,'PropName',PropVal,...) sets optional
lineseries object properties to the specified values for all lineseries
objects created by andrewsplot. (See Lineseries Properties.)
h = andrewsplot(X,...) returns a column vector of handles to the
lineseries objects created by andrewsplot, one handle per row of X. If
you use the 'Quantile' input parameter, h contains one handle for each
of the three lineseries objects created. If you use both the 'Quantile'
and the 'Group' input parameters, h contains three handles for each
group.
Examples
Make a grouped plot of the Fisher iris data:
load fisheriris
andrewsplot(meas,'group',species)
Plot only the median and quartiles of each group:
andrewsplot(meas,'group',species,'quantile',.25)
See Also
parallelcoords | glyphplot
How To
• “Grouped Data” on page 2-34
anova1
Purpose
One-way analysis of variance
Syntax
p = anova1(X)
p = anova1(X,group)
p = anova1(X,group,displayopt)
[p,table] = anova1(...)
[p,table,stats] = anova1(...)
Description
p = anova1(X) performs balanced one-way ANOVA for comparing
the means of two or more columns of data in the matrix X, where
each column represents an independent sample containing mutually
independent observations. The function returns the p value under the
null hypothesis that all samples in X are drawn from populations with
the same mean.
If p is near zero, it casts doubt on the null hypothesis and suggests
that at least one sample mean is significantly different than the other
sample means. Common significance levels are 0.05 or 0.01.
The anova1 function displays two figures, the standard ANOVA table
and a box plot of the columns of X.
The standard ANOVA table divides the variability of the data into two
parts:
• Variability due to the differences among the column means
(variability between groups)
• Variability due to the differences between the data in each column
and the column mean (variability within groups)
The standard ANOVA table has six columns:
1 The source of the variability.
2 The sum of squares (SS) due to each source.
3 The degrees of freedom (df) associated with each source.
4 The mean squares (MS) for each source, which is the ratio SS/df.
5 The F-statistic, which is the ratio of the mean squares.
6 The p value, which is derived from the cdf of F.
The box plot of the columns of X suggests the size of the F-statistic and
the p value. Large differences in the center lines of the boxes correspond
to large values of F and correspondingly small values of p.
anova1 treats NaN values as missing, and disregards them.
p = anova1(X,group) performs ANOVA by group. For more
information on grouping variables, see “Grouped Data” on page 2-34.
If X is a matrix, anova1 treats each column as a separate group, and
evaluates whether the population means of the columns are equal. This
form of anova1 is appropriate when each group has the same number of
elements (balanced ANOVA). group can be a character array or a cell
array of strings, with one row per column of X, containing group names.
Enter an empty array ([]) or omit this argument if you do not want to
specify group names.
If X is a vector, group must be a categorical variable, vector, string
array, or cell array of strings with one name for each element of X. X
values corresponding to the same value of group are placed in the same
group. This form of anova1 is appropriate when groups have different
numbers of elements (unbalanced ANOVA).
If group contains empty or NaN-valued cells or strings, the corresponding
observations in X are disregarded.
p = anova1(X,group,displayopt) enables the ANOVA table and box
plot displays when displayopt is 'on' (default) and suppresses the
displays when displayopt is 'off'. Notches in the boxplot provide a
test of group medians (see boxplot) different from the F test for means
in the ANOVA table.
[p,table] = anova1(...) returns the ANOVA table (including
column and row labels) in the cell array table. Copy a text version of
the ANOVA table to the clipboard using the Copy Text item on the
Edit menu.
[p,table,stats] = anova1(...) returns a structure stats used
to perform a follow-up multiple comparison test. anova1 evaluates
the hypothesis that the samples all have the same mean against the
alternative that the means are not all the same. Sometimes it is
preferable to perform a test to determine which pairs of means are
significantly different, and which are not. Use the multcompare function
to perform such tests by supplying the stats structure as input.
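For example, a minimal sketch (with X and group as described above):

[p,table,stats] = anova1(X,group,'off');  % suppress the figure displays
c = multcompare(stats);                   % pairwise comparisons of group means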
Assumptions
The ANOVA test makes the following assumptions about the data in X:
• All sample populations are normally distributed.
• All sample populations have equal variance.
• All observations are mutually independent.
The ANOVA test is known to be robust with respect to modest violations
of the first two assumptions.
Examples
Example 1
Create X with columns that are constants plus random normal
disturbances with mean zero and standard deviation one:
X = meshgrid(1:5)
X =
     1     2     3     4     5
     1     2     3     4     5
     1     2     3     4     5
     1     2     3     4     5
     1     2     3     4     5

X = X + normrnd(0,1,5,5)
X =
    1.3550    2.0662    2.4688    5.9447    5.4897
    2.0693    1.7611    1.4864    4.8826    6.3222
    2.1919    0.7276    3.1905    4.8768    4.6841
    2.7620    1.8179    3.9506    4.4678    4.9291
   -0.3626    1.1685    3.5742    2.1945    5.9465

Perform one-way ANOVA:

p = anova1(X)
p =
   7.9370e-006
The very small p value indicates that differences between column
means are highly significant. The probability of this outcome under the
null hypothesis (that samples drawn from the same population would
have means differing by the amounts seen in X) is equal to the p value.
Example 2
The following example is from a study of the strength of structural
beams in Hogg. The vector strength measures deflections of beams in
thousandths of an inch under 3,000 pounds of force. The vector alloy
identifies each beam as steel ('st'), alloy 1 ('al1'), or alloy 2 ('al2').
(Although alloy is sorted in this example, grouping variables do not
need to be sorted.) The null hypothesis is that steel beams are equal in
strength to beams made of the two more expensive alloys.
strength = [82 86 79 83 84 85 86 87 74 82 ...
78 75 76 77 79 79 77 78 82 79];
alloy = {'st','st','st','st','st','st','st','st',...
'al1','al1','al1','al1','al1','al1',...
'al2','al2','al2','al2','al2','al2'};
p = anova1(strength,alloy)
p =
1.5264e-004
The p value suggests rejection of the null hypothesis. The box plot
shows that steel beams deflect more than beams made of the more
expensive alloys.
References
[1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
See Also
anova2 | anovan | boxplot | manova1 | multcompare
How To
• “Grouped Data” on page 2-34
anova2
Purpose
Two-way analysis of variance
Syntax
p = anova2(X,reps)
p = anova2(X,reps,displayopt)
[p,table] = anova2(...)
[p,table,stats] = anova2(...)
Description
p = anova2(X,reps) performs a balanced two-way ANOVA for
comparing the means of two or more columns and two or more rows of
the observations in X. The data in different columns represent changes
in factor A. The data in different rows represent changes in factor B.
If there is more than one observation for each combination of factors,
input reps indicates the number of replicates in each position, which
must be constant. (For unbalanced designs, use anovan.)
The matrix below shows the format for a set-up where column factor
A has two levels, row factor B has three levels, and there are two
replications (reps = 2). The subscripts indicate row, column, and
replicate, respectively.
                        A = 1     A = 2
            B = 1     [ x111      x121  ]
                      [ x112      x122  ]
            B = 2     [ x211      x221  ]
                      [ x212      x222  ]
            B = 3     [ x311      x321  ]
                      [ x312      x322  ]
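For instance, a minimal sketch of a call on data arranged this way (the numbers here are hypothetical) is:

% hypothetical 6-by-2 matrix: rows 1-2 are B=1, rows 3-4 are B=2,
% rows 5-6 are B=3; the columns are the two levels of A; reps = 2
X = [6.1  7.0
     5.9  7.2
     6.8  8.1
     7.0  7.9
     7.5  8.8
     7.3  9.0];
p = anova2(X,2)   % p(1): factor A (columns), p(2): factor B (rows), p(3): interaction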
When reps is 1 (default), anova2 returns two p-values in vector p:
1 The p value for the null hypothesis, H0A, that all samples from factor
A (i.e., all column-samples in X) are drawn from the same population
2 The p value for the null hypothesis, H0B, that all samples from factor
B (i.e., all row-samples in X) are drawn from the same population
When reps is greater than 1, anova2 returns a third p value in vector
p:
3 The p value for the null hypothesis, H0AB, that the effects due to
factors A and B are additive (i.e., that there is no interaction between
factors A and B)
If any p value is near zero, this casts doubt on the associated null
hypothesis. A sufficiently small p value for H0A suggests that at least
one column-sample mean is significantly different from the other
column-sample means; i.e., there is a main effect due to factor A. A
sufficiently small p value for H0B suggests that at least one row-sample
mean is significantly different than the other row-sample means; i.e.,
there is a main effect due to factor B. A sufficiently small p value for
H0AB suggests that there is an interaction between factors A and B.
The choice of a limit for the p value to determine whether a result
is “statistically significant” is left to the researcher. It is common to
declare a result significant if the p value is less than 0.05 or 0.01.
anova2 also displays a figure showing the standard ANOVA table,
which divides the variability of the data in X into three or four parts
depending on the value of reps:
• The variability due to the differences among the column means
• The variability due to the differences among the row means
• The variability due to the interaction between rows and columns (if
reps is greater than its default value of one)
• The remaining variability not explained by any systematic source
The ANOVA table has five columns:
• The first shows the source of the variability.
• The second shows the Sum of Squares (SS) due to each source.
• The third shows the degrees of freedom (df) associated with each
source.
• The fourth shows the Mean Squares (MS), which is the ratio SS/df.
• The fifth shows the F statistics, which are the ratios of the mean
squares.
p = anova2(X,reps,displayopt) enables the ANOVA table display
when displayopt is 'on' (default) and suppresses the display when
displayopt is 'off'.
[p,table] = anova2(...) returns the ANOVA table (including
column and row labels) in cell array table. (Copy a text version of the
ANOVA table to the clipboard by using the Copy Text item on the Edit
menu.)
[p,table,stats] = anova2(...) returns a stats structure that you
can use to perform a follow-up multiple comparison test.
The anova2 test evaluates the hypothesis that the row, column, and
interaction effects are all the same, against the alternative that they
are not all the same. Sometimes it is preferable to perform a test
to determine which pairs of effects are significantly different, and
which are not. Use the multcompare function to perform such tests by
supplying the stats structure as input.
Examples
The data below come from a study of popcorn brands and popper type
(Hogg 1987). The columns of the matrix popcorn are brands (Gourmet,
National, and Generic). The rows are popper type (Oil and Air.) The
study popped a batch of each brand three times with each popper. The
values are the yield in cups of popped popcorn.
load popcorn
popcorn
popcorn =
    5.5000    4.5000    3.5000
    5.5000    4.5000    4.0000
    6.0000    4.0000    3.0000
    6.5000    5.0000    4.0000
    7.0000    5.5000    5.0000
    7.0000    5.0000    4.5000
p = anova2(popcorn,3)
p =
0.0000 0.0001 0.7462
The vector p shows the p-values for the three brands of popcorn, 0.0000,
the two popper types, 0.0001, and the interaction between brand and
popper type, 0.7462. These values indicate that both popcorn brand and
popper type affect the yield of popcorn, but there is no evidence of a
synergistic (interaction) effect of the two.
The conclusion is that you can get the greatest yield using the Gourmet
brand and an Air popper (the three values popcorn(4:6,1)).
References
[1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
See Also
anova1 | anovan
anovan
Purpose
N-way analysis of variance
Syntax
p = anovan(y,group)
p = anovan(y,group,param,val)
[p,table] = anovan(y,group,param,val)
[p,table,stats] = anovan(y,group,param,val)
[p,table,stats,terms] = anovan(y,group,param,val)
Description
p = anovan(y,group) performs multiway (n-way) analysis of variance
(ANOVA) for testing the effects of multiple factors on the mean of
the vector y. (See “Grouped Data”.) This test compares the variance
explained by factors to the leftover variance that cannot be explained.
The factors and factor levels of the observations in y are assigned by
the cell array group. Each of the cells in the cell array group contains
a list of factor levels identifying the observations in y with respect to
one of the factors. The list within each cell can be a categorical array,
numeric vector, character matrix, or single-column cell array of strings,
and must have the same number of elements as y. The fitted ANOVA
model includes the main effects of each grouping variable. All grouping
variables are treated as fixed effects by default. The result p is a vector
of p-values, one per term. For an example, see “Example of Three-Way
ANOVA” on page 20-28.
p = anovan(y,group,param,val) specifies one or more of the
parameter name/value pairs described in the following table.
Parameter       Value

'alpha'         A number between 0 and 1 requesting 100(1 – alpha)%
                confidence bounds (default 0.05 for 95% confidence)

'continuous'    A vector of indices indicating which grouping variables
                should be treated as continuous predictors rather than
                as categorical predictors

'display'       'on' displays an ANOVA table (the default); 'off' omits
                the display

'model'         The type of model used. See “Model Type” on page 20-26
                for a description of this parameter.

'nested'        A matrix M of 0’s and 1’s specifying the nesting
                relationships among the grouping variables. M(i,j) is 1
                if variable i is nested in variable j.

'random'        A vector of indices indicating which grouping variables
                are random effects (all are fixed by default). See
                “ANOVA with Random Effects” on page 8-19 for an example
                of how to use 'random'.

'sstype'        1, 2, 3 (default), or h, specifying the type of sum of
                squares. See “Sum of Squares” on page 20-27 for a
                description of this parameter.

'varnames'      A character matrix or a cell array of strings specifying
                names of grouping variables, one per grouping variable.
                When you do not specify 'varnames', the default labels
                'X1', 'X2', 'X3', ..., 'XN' are used. See “ANOVA with
                Random Effects” on page 8-19 for an example of how to
                use 'varnames'.
[p,table] = anovan(y,group,param,val) returns the ANOVA table
(including factor labels) in cell array table. (Copy a text version of
the ANOVA table to the clipboard by using the Copy Text item on the
Edit menu.)
[p,table,stats] = anovan(y,group,param,val) returns a stats
structure that you can use to perform a follow-up multiple comparison
test with the multcompare function. See “The Stats Structure” on page
20-31 for more information.
[p,table,stats,terms] = anovan(y,group,param,val) returns the
main and interaction terms used in the ANOVA computations. The
terms are encoded in the output matrix terms using the same format
described above for input 'model'. When you specify 'model' itself in
this matrix format, the matrix returned in terms is identical.
Model Type
This section explains how to use the argument 'model' with the syntax:
[...] = anovan(y,group,'model',modeltype)
The argument modeltype, which specifies the type of model the function
uses, can be any one of the following:
• 'linear' — The default 'linear' model computes only the p-values
for the null hypotheses on the N main effects.
• 'interaction' — The 'interaction' model computes the p-values
for null hypotheses on the N main effects and the $\binom{N}{2}$
two-factor interactions.
• 'full' — The 'full' model computes the p-values for null
hypotheses on the N main effects and interactions at all levels.
• An integer — For an integer value of modeltype, k (k ≤ N),
anovan computes all interaction levels through the kth level. For
example, the value 3 means main effects plus two- and three-factor
interactions. The values k = 1 and k = 2 are equivalent to the
'linear' and 'interaction' specifications, respectively, while the
value k = N is equivalent to the 'full' specification.
• A matrix of term definitions having the same form as the input to the
x2fx function. All entries must be 0 or 1 (no higher powers).
For more precise control over the main and interaction terms that
anovan computes, modeltype can specify a matrix containing one row
for each main or interaction term to include in the ANOVA model. Each
row defines one term using a vector of N zeros and ones. The table
below illustrates the coding for a 3-factor ANOVA.
Matrix Row    ANOVA Term

[1 0 0]       Main term A
[0 1 0]       Main term B
[0 0 1]       Main term C
[1 1 0]       Interaction term AB
[1 0 1]       Interaction term AC
[0 1 1]       Interaction term BC
[1 1 1]       Interaction term ABC
For example, if modeltype is the matrix [0 1 0;0 0 1;0 1 1], the
output vector p contains the p-values for the null hypotheses on the
main effects B and C and the interaction effect BC, in that order. A
simple way to generate the modeltype matrix is to modify the terms
output, which codes the terms in the current model using the format
described above. If anovan returns [0 1 0;0 0 1;0 1 1] for terms, for
example, and there is no significant result for interaction BC, you can
recompute the ANOVA on just the main effects B and C by specifying
[0 1 0;0 0 1] for modeltype.
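A minimal sketch of this kind of call (reusing the y, g1, g2, and g3 variables from "Example of Three-Way ANOVA" below) is:

p = anovan(y,{g1 g2 g3},'model',[0 1 0;0 0 1;0 1 1])   % B, C, and the B*C interaction only
p = anovan(y,{g1 g2 g3},'model',[0 1 0;0 0 1])         % drop the B*C interaction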
Sum of Squares
This section explains how to use the argument 'sstype' with the
syntax:
[...] = anovan(y,group,'sstype',type)
This syntax computes the ANOVA using the type of sum of squares
specified by type, which can be 1, 2, 3, or h. While the numbers 1 – 3
designate Type 1, Type 2, or Type 3 sum of squares, respectively, h
represents a hierarchical model similar to type 2, but with continuous
as well as categorical factors used to determine the hierarchy of
terms. The default value is 3. For a model containing main effects
but no interactions, the value of type only influences computations
on unbalanced data.
The sum of squares for any term is determined by comparing two
models. The Type 1 sum of squares for a term is the reduction in
residual sum of squares obtained by adding that term to a fit that
already includes the terms listed before it. The Type 2 sum of squares is
the reduction in residual sum of squares obtained by adding that term
to a model consisting of all other terms that do not contain the term
in question. The Type 3 sum of squares is the reduction in residual
sum of squares obtained by adding that term to a model containing all
other terms, but with their effects constrained to obey the usual “sigma
restrictions” that make models estimable.
Suppose you are fitting a model with two factors and their interaction,
and that the terms appear in the order A, B, AB. Let R(·) represent the
residual sum of squares for a model, so for example R(A, B, AB) is the
residual sum of squares fitting the whole model, R(A) is the residual
sum of squares fitting just the main effect of A, and R(1) is the residual
sum of squares fitting just the mean. The three types of sums of squares
are as follows:
Term    Type 1 Sum of Squares    Type 2 Sum of Squares    Type 3 Sum of Squares

A       R(1) – R(A)              R(B) – R(A, B)           R(B, AB) – R(A, B, AB)
B       R(A) – R(A, B)           R(A) – R(A, B)           R(A, AB) – R(A, B, AB)
AB      R(A, B) – R(A, B, AB)    R(A, B) – R(A, B, AB)    R(A, B) – R(A, B, AB)
The models for Type 3 sum of squares have sigma restrictions imposed.
This means, for example, that in fitting R(B, AB), the array of AB
effects is constrained to sum to 0 over A for each value of B, and over B
for each value of A.
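A minimal sketch of specifying the sum-of-squares type (y, g1, and g2 here stand for any response vector and two grouping variables, typically from an unbalanced design) is:

p1 = anovan(y,{g1 g2},'model','interaction','sstype',1);
p3 = anovan(y,{g1 g2},'model','interaction','sstype',3);   % the default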
Example of Three-Way ANOVA
As an example of three-way ANOVA, consider the vector y and group
inputs below.
y = [52.7 57.5 45.9 44.5 53.0 57.0 45.9 44.0]';
g1 = [1 2 1 2 1 2 1 2];
g2 = {'hi';'hi';'lo';'lo';'hi';'hi';'lo';'lo'};
g3 = {'may';'may';'may';'may';'june';'june';'june';'june'};
This defines a three-way ANOVA with two levels of each factor. Every
observation in y is identified by a combination of factor levels. If the
factors are A, B, and C, then observation y(1) is associated with
• Level 1 of factor A
• Level 'hi' of factor B
• Level 'may' of factor C
Similarly, observation y(6) is associated with
• Level 2 of factor A
• Level 'hi' of factor B
• Level 'june' of factor C
To compute the ANOVA, enter
p = anovan(y,{g1 g2 g3})
p =
0.4174
0.0028
0.9140
Output vector p contains p-values for the null hypotheses on the N main
effects. Element p(1) contains the p value for the null hypothesis,
H0A, that samples at all levels of factor A are drawn from the same
population; element p(2) contains the p value for the null hypothesis,
H0B, that samples at all levels of factor B are drawn from the same
population; and so on.
If any p value is near zero, this casts doubt on the associated null
hypothesis. For example, a sufficiently small p value for H0A suggests
that at least one A-sample mean is significantly different from the other
A-sample means; that is, there is a main effect due to factor A. You
need to choose a bound for the p value to determine whether a result is
statistically significant. It is common to declare a result significant if
the p value is less than 0.05 or 0.01.
anovan also displays a figure showing the standard ANOVA table,
which by default divides the variability of the data in y into
• The variability due to differences between the levels of each factor
accounted for in the model (one row for each factor)
• The remaining variability not explained by any systematic source
The ANOVA table has six columns:
• The first shows the source of the variability.
• The second shows the sum of squares (SS) due to each source.
• The third shows the degrees of freedom (df) associated with each
source.
• The fourth shows the mean squares (MS), which is the ratio SS/df.
• The fifth shows the F statistics, which are the ratios of the mean
squares.
• The sixth shows the p-values for the F statistics.
The table is shown in the following figure:
Two-Factor Interactions
By default, anovan computes p-values just for the three main effects.
To also compute p-values for the two-factor interactions, X1*X2, X1*X3,
and X2*X3, add the name/value pair 'model', 'interaction' as input
arguments.
p = anovan(y,{g1 g2 g3},'model','interaction')
p =
0.0347
0.0048
0.2578
0.0158
0.1444
0.5000
The first three entries of p are the p-values for the main effects. The
last three entries are the p-values for the two-factor interactions. You
can determine the order in which the two-factor interactions occur from
the ANOVAN table shown in the following figure.
The Stats Structure
The anovan test evaluates the hypothesis that the different levels of a
factor (or more generally, a term) have the same effect, against the
alternative that they do not all have the same effect. Sometimes it is
preferable to perform a test to determine which pairs of levels are
significantly different, and which are not. Use the multcompare function
to perform such tests by supplying the stats structure as input.
The stats structure contains the fields listed below, in addition to a
number of other fields required for doing multiple comparisons using
the multcompare function:
Field         Description

coeffs        Estimated coefficients
coeffnames    Name of term for each coefficient
vars          Matrix of grouping variable values for each term
resid         Residuals from the fitted model
The stats structure also contains the following fields if there are
random effects:
Field      Description

ems        Expected mean squares
denom      Denominator definition
rtnames    Names of random terms
varest     Variance component estimates (one per random term)
varci      Confidence intervals for variance components

Examples
“Example: Two-Way ANOVA” on page 8-10 shows how to use anova2 to
analyze the effects of two factors on a response in a balanced design.
For a design that is not balanced, use anovan instead.
The data in carbig.mat give measurements on 406 cars. Use anovan
to study how the mileage depends on where and when the cars were
made:
load carbig
p = anovan(MPG,{org when},'model',2,'sstype',3,...
'varnames',{'Origin';'Mfg date'})
p =
0
0
0.3059
The p value for the interaction term is not small, indicating little
evidence that the effect of the year of manufacture (when) depends on
where the car was made (org). The linear effects of those two factors,
however, are significant.
References
[1] Hogg, R. V., and J. Ledolter. Engineering Statistics. New York:
MacMillan, 1987.
See Also
anova1 | anova2 | multcompare
How To
• “Grouped Data” on page 2-34
ansaribradley
Purpose
Ansari-Bradley test
Syntax
h = ansaribradley(x,y)
h = ansaribradley(x,y,alpha)
h = ansaribradley(x,y,alpha,tail)
[h,p] = ansaribradley(...)
[h,p,stats] = ansaribradley(...)
[...] = ansaribradley(x,y,alpha,tail,exact)
[...] = ansaribradley(x,y,alpha,tail,exact,dim)
Description
h = ansaribradley(x,y) performs an Ansari-Bradley test of the
hypothesis that two independent samples, in the vectors x and y, come
from the same distribution, against the alternative that they come
from distributions that have the same median and shape but different
dispersions (e.g. variances). The result is h = 0 if the null hypothesis of
identical distributions cannot be rejected at the 5% significance level,
or h = 1 if the null hypothesis can be rejected at the 5% level. x and y
can have different lengths.
x and y can also be matrices or N-dimensional arrays. For matrices,
ansaribradley performs separate tests along each column, and returns
a vector of results. x and y must have the same number of columns.
For N-dimensional arrays, ansaribradley works along the first
nonsingleton dimension. x and y must have the same size along all
the remaining dimensions.
h = ansaribradley(x,y,alpha) performs the test at the significance
level (100*alpha)%, where alpha is a scalar.
h = ansaribradley(x,y,alpha,tail) performs the test against the
alternative hypothesis specified by the string tail. tail is one of:
• 'both' — Two-tailed test (dispersion parameters are not equal)
• 'right' — Right-tailed test (dispersion of X is greater than
dispersion of Y)
• 'left' — Left-tailed test (dispersion of X is less than dispersion of Y)
[h,p] = ansaribradley(...) returns the p value, i.e., the probability
of observing the given result, or one more extreme, by chance if the
null hypothesis is true. Small values of p cast doubt on the validity of
the null hypothesis.
[h,p,stats] = ansaribradley(...) returns a structure stats with
the following fields:
• W — Value of the test statistic W, which is the sum of the
Ansari-Bradley ranks for the X sample
• Wstar — Approximate normal statistic W*
[...] = ansaribradley(x,y,alpha,tail,exact) computes p using
an exact calculation of the distribution of W with exact = 'on'. This
can be time-consuming for large samples. exact = 'off' computes p
using a normal approximation for the distribution of W*. The default if
exact is empty is to use the exact calculation if N, the total number of
rows in x and y, is 25 or less, and to use the normal approximation if N
> 25. Pass in [] for alpha and tail to use their default values while
specifying a value for exact. Note that N is computed before any NaN
values (representing missing data) are removed.
[...] = ansaribradley(x,y,alpha,tail,exact,dim) works along
dimension dim of x and y.
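For example, a minimal sketch of using the tail and exact arguments (the samples here are simulated for illustration) is:

x = normrnd(0,1,20,1);                    % simulated samples (assumed)
y = normrnd(0,2,20,1);
[h,p] = ansaribradley(x,y,0.05,'left')    % left-tailed: dispersion of x less than dispersion of y
[h,p] = ansaribradley(x,y,[],[],'on')     % default alpha and tail, exact calculation of p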
The Ansari-Bradley test is a nonparametric alternative to the
two-sample F test of equal variances. It does not require the assumption
that x and y come from normal distributions. The dispersion of a
distribution is generally measured by its variance or standard deviation,
but the Ansari-Bradley test can be used with samples from distributions
that do not have finite variances.
The theory behind the Ansari-Bradley test requires that the groups
have equal medians. Under that assumption and if the distributions
in each group are continuous and identical, the test does not depend
on the distributions in each group. If the groups do not have the
same medians, the results may be misleading. Ansari and Bradley
recommend subtracting the median in that case, but the distribution of
the resulting test, under the null hypothesis, is no longer independent
of the common distribution of x and y. If you want to perform the tests
with medians subtracted, you should subtract the medians from x and
y before calling ansaribradley.
Examples
Is the dispersion significantly different for two model years?
load carsmall
[h,p,stats] = ansaribradley(MPG(Model_Year==82),MPG(Model_Year==76))
h =
0
p =
0.8426
stats =
W: 526.9000
Wstar: 0.1986
See Also
vartest | vartestn | ttest2
aoctool
Purpose
Interactive analysis of covariance
Syntax
aoctool(x,y,group)
aoctool(x,y,group,alpha)
aoctool(x,y,group,alpha,xname,yname,gname)
aoctool(x,y,group,alpha,xname,yname,gname,displayopt)
aoctool(x,y,group,alpha,xname,yname,gname,displayopt,model)
h = aoctool(...)
[h,atab,ctab] = aoctool(...)
[h,atab,ctab,stats] = aoctool(...)
Description
aoctool(x,y,group) fits a separate line to the column vectors, x and y,
for each group defined by the values in the array group. group may be
a categorical variable, vector, character array, or cell array of strings.
(See “Grouped Data” on page 2-34.) These types of models are known
as one-way analysis of covariance (ANOCOVA) models. The output
consists of three figures:
• An interactive graph of the data and prediction curves
• An ANOVA table
• A table of parameter estimates
You can use the figures to change models and to test different parts
of the model. More information about interactive use of the aoctool
function appears in “Analysis of Covariance Tool” on page 8-27.
aoctool(x,y,group,alpha) determines the confidence levels of the
prediction intervals. The confidence level is 100(1-alpha)%. The
default value of alpha is 0.05.
aoctool(x,y,group,alpha,xname,yname,gname) specifies the name
to use for the x, y, and g variables in the graph and tables. If you
enter simple variable names for the x, y, and g arguments, the aoctool
function uses those names. If you enter an expression for one of these
arguments, you can specify a name to use in place of that expression by
supplying these arguments. For example, if you enter m(:,2) as the x
argument, you might choose to enter 'Col 2' as the xname argument.
aoctool(x,y,group,alpha,xname,yname,gname,displayopt) enables
the graph and table displays when displayopt is 'on' (default) and
suppresses those displays when displayopt is 'off'.
aoctool(x,y,group,alpha,xname,yname,gname,displayopt,model)
specifies the initial model to fit. The value of model can be any of the
following:
• 'same mean' — Fit a single mean, ignoring grouping
• 'separate means' — Fit a separate mean to each group
• 'same line' — Fit a single line, ignoring grouping
• 'parallel lines' — Fit a separate line to each group, but constrain
the lines to be parallel
• 'separate lines' — Fit a separate line to each group, with no
constraints
h = aoctool(...) returns a vector of handles to the line objects in
the plot.
[h,atab,ctab] = aoctool(...) returns cell arrays containing the
entries in ANOVA table (atab) and the table of coefficient estimates
(ctab). (You can copy a text version of either table to the clipboard by
using the Copy Text item on the Edit menu.)
[h,atab,ctab,stats] = aoctool(...) returns a stats structure
that you can use to perform a follow-up multiple comparison test. The
ANOVA table output includes tests of the hypotheses that the slopes
or intercepts are all the same, against a general alternative that they
are not all the same. Sometimes it is preferable to perform a test to
determine which pairs of values are significantly different, and which
are not. You can use the multcompare function to perform such tests by
supplying the stats structure as input. You can test either the slopes,
the intercepts, or population marginal means (the heights of the curves
at the mean x value).
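For example, a minimal sketch of such a follow-up test (assuming x, y, and group are the inputs described above, and using the default separate-lines model) is:

[h,atab,ctab,stats] = aoctool(x,y,group,0.05,'','','','off');
multcompare(stats,'estimate','slope')    % compare the group slopes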
Examples
This example illustrates how to fit different models non-interactively.
After loading the smaller car data set and fitting a separate-slopes
model, you can examine the coefficient estimates.
load carsmall
[h,a,c,s] = aoctool(Weight,MPG,Model_Year,0.05,...
'','','','off','separate lines');
c(:,1:2)
ans =
    'Term'         'Estimate'
    'Intercept'    [45.97983716833132]
    ' 70'          [-8.58050531454973]
    ' 76'          [-3.89017396094922]
    ' 82'          [12.47067927549897]
    'Slope'        [-0.00780212907455]
    ' 70'          [ 0.00195840368824]
    ' 76'          [ 0.00113831038418]
    ' 82'          [-0.00309671407243]
Roughly speaking, the lines relating MPG to Weight have an intercept
close to 45.98 and a slope close to -0.0078. Each group’s coefficients are
offset from these values somewhat. For instance, the intercept for the
cars made in 1970 is 45.98-8.58 = 37.40.
Next, try a fit using parallel lines. (The ANOVA table shows that the
parallel-lines fit is significantly worse than the separate-lines fit.)
[h,a,c,s] = aoctool(Weight,MPG,Model_Year,0.05,...
'','','','off','parallel lines');
c(:,1:2)
ans =
    'Term'         'Estimate'
    'Intercept'    [43.38984085130596]
    ' 70'          [-3.27948192983761]
    ' 76'          [-1.35036234809006]
    ' 82'          [ 4.62984427792768]
    'Slope'        [-0.00664751826198]
Again, there are different intercepts for each group, but this time the
slopes are constrained to be the same.
See Also
anova1 | multcompare | polytool
How To
• “Grouped Data” on page 2-34
TreeBagger.append
Purpose
Append new trees to ensemble
Syntax
B = append(B,other)
Description
B = append(B,other) appends the trees from the other ensemble to
those in B. This method checks for consistency of the X and Y properties
of the two ensembles, as well as consistency of their compact objects and
out-of-bag indices, before appending the trees. The output ensemble B
takes training parameters such as FBoot, Prior, Cost, and other from
the B input. There is no attempt to check if these training parameters
are consistent between the two objects.
See Also
CompactTreeBagger.combine
ProbDistKernel.BandWidth property
Purpose
Read-only value specifying bandwidth of kernel smoothing function
for ProbDistKernel object
Description
BandWidth is a read-only property of the ProbDistKernel class.
BandWidth is a value specifying the width of the kernel smoothing
function used to compute a nonparametric estimate of the probability
distribution when creating a ProbDistKernel object.
Values
For a distribution specified to cover only the positive numbers or only
a finite interval, the data are transformed before the kernel density is
applied, and the bandwidth is on the scale of the transformed data.
Use this information to view and compare the width of the kernel
smoothing function used to create distributions.
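For example, a minimal sketch of viewing this property (using fitdist on simulated data) is:

x = normrnd(0,1,100,1);       % simulated data (assumed)
pd = fitdist(x,'kernel');     % returns a ProbDistUnivKernel object
pd.BandWidth                  % width of the kernel smoothing function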
See Also
ksdensity
barttest
Purpose
Bartlett’s test
Syntax
ndim = barttest(X,alpha)
[ndim,prob,chisquare] = barttest(X,alpha)
Description
ndim = barttest(X,alpha) returns the number of dimensions
necessary to explain the nonrandom variation in the data matrix X,
using the significance probability alpha. The dimension is determined
by a series of hypothesis tests. The test for ndim=1 tests the hypothesis
that the variances of the data values along each principal component
are equal, the test for ndim=2 tests the hypothesis that the variances
along the second through last components are equal, and so on.
[ndim,prob,chisquare] = barttest(X,alpha) returns the number of
dimensions, the significance values for the hypothesis tests, and the χ2
values associated with the tests.
Examples
X = mvnrnd([0 0],[1 0.99; 0.99 1],20);
X(:,3:4) = mvnrnd([0 0],[1 0.99; 0.99 1],20);
X(:,5:6) = mvnrnd([0 0],[1 0.99; 0.99 1],20);
[ndim, prob] = barttest(X,0.05)
ndim =
3
prob =
0
0
0
0.5081
0.6618
See Also
princomp | pcacov | pcares
bbdesign
Purpose
Box-Behnken design
Syntax
dBB = bbdesign(n)
[dBB,blocks] = bbdesign(n)
[...] = bbdesign(n,param,val)
Description
dBB = bbdesign(n) generates a Box-Behnken design for n factors. n
must be an integer 3 or larger. The output matrix dBB is m-by-n, where
m is the number of runs in the design. Each row represents one run,
with settings for all factors represented in the columns. Factor values
are normalized so that the cube points take values between -1 and 1.
[dBB,blocks] = bbdesign(n) requests a blocked design. The output
blocks is an m-by-1 vector of block numbers for each run. Blocks
indicate runs that are to be measured under similar conditions
to minimize the effect of inter-block differences on the parameter
estimates.
[...] = bbdesign(n,param,val) specifies one or more optional
parameter/value pairs for the design. The following table lists valid
parameter/value pairs.

Parameter      Description                           Values

'center'       Number of center points               Integer. The default depends on n.
'blocksize'    Maximum number of points per block    Integer. The default is Inf.

Examples
The following creates a 3-factor Box-Behnken design:
dBB = bbdesign(3)
dBB =
    -1    -1     0
    -1     1     0
     1    -1     0
     1     1     0
    -1     0    -1
    -1     0     1
     1     0    -1
     1     0     1
     0    -1    -1
     0    -1     1
     0     1    -1
     0     1     1
     0     0     0
     0     0     0
     0     0     0
The center point is run 3 times to allow for a more uniform estimate of
the prediction variance over the entire design space.
Visualize the design as follows:
plot3(dBB(:,1),dBB(:,2),dBB(:,3),'ro',...
'MarkerFaceColor','b')
X = [1 -1 -1 -1 1 -1 -1 -1 1 1 -1 -1; ...
1 1 1 -1 1 1 1 -1 1 1 -1 -1];
Y = [-1 -1 1 -1 -1 -1 1 -1 1 -1 1 -1; ...
1 -1 1 1 1 -1 1 1 1 -1 1 -1];
Z = [1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1; ...
1 1 1 1 -1 -1 -1 -1 1 1 1 1];
line(X,Y,Z,'Color','b')
axis square equal
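A minimal sketch of the 'center' option, which here reduces the number of center points from the default of three to one, is:

dBB1 = bbdesign(3,'center',1);   % single center point
size(dBB1,1)                     % 13 runs instead of 15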
See Also
ccdesign
betacdf
Purpose
Beta cumulative distribution function
Syntax
p = betacdf(X,A,B)
Description
p = betacdf(X,A,B) returns the beta cdf at each of the values in X
using the corresponding parameters in A and B. X, A, and B can be
vectors, matrices, or multidimensional arrays that all have the same
size. A scalar input is expanded to a constant array with the same
dimensions as the other inputs. The parameters in A and B must all be
positive, and the values in X must lie on the interval [0,1].
The beta cdf for a given value x and given pair of parameters a and b is
$$p = F(x \mid a,b) = \frac{1}{B(a,b)} \int_0^x t^{a-1} (1-t)^{b-1} \, dt$$
where B( · ) is the Beta function.
Examples
x = 0.1:0.2:0.9;
a = 2;
b = 2;
p = betacdf(x,a,b)
p =
    0.0280    0.2160    0.5000    0.7840    0.9720
a = [1 2 3];
p = betacdf(0.5,a,a)
p =
0.5000 0.5000 0.5000
See Also
cdf | betapdf | betainv | betastat | betalike | betarnd | betafit
How To
• “Beta Distribution” on page B-4
betafit
Purpose
Beta parameter estimates
Syntax
phat = betafit(data)
[phat,pci] = betafit(data,alpha)
Description
phat = betafit(data) computes the maximum likelihood estimates
of the beta distribution parameters a and b from the data in the vector
data and returns a column vector containing the a and b estimates,
where the beta cdf is given by
$$F(x \mid a,b) = \frac{1}{B(a,b)} \int_0^x t^{a-1} (1-t)^{b-1} \, dt$$
and B( · ) is the Beta function. The elements of data must lie in the
open interval (0, 1), where the beta distribution is defined. However,
it is sometimes also necessary to fit a beta distribution to data that
include exact zeros or ones. For such data, the beta likelihood function
is unbounded, and standard maximum likelihood estimation is not
possible. In that case, betafit maximizes a modified likelihood that
incorporates the zeros or ones by treating them as if they were values
that have been left-censored at sqrt(realmin) or right-censored at
1-eps/2, respectively.
[phat,pci] = betafit(data,alpha) returns confidence intervals on
the a and b parameters in the 2-by-2 matrix pci. The first column of the
matrix contains the lower and upper confidence bounds for parameter
a, and the second column contains the confidence bounds for parameter
b. The optional input argument alpha is a value in the range [0, 1]
specifying the width of the confidence intervals. By default, alpha is
0.05, which corresponds to 95% confidence intervals. The confidence
intervals are based on a normal approximation for the distribution of
the logs of the parameter estimates.
Examples
This example generates 100 beta distributed observations. The true
a and b parameters are 4 and 3, respectively. Compare these to the
values returned in p by the beta fit. Note that the columns of ci both
bracket the true parameters.
data = betarnd(4,3,100,1);
[p,ci] = betafit(data,0.01)
p =
5.5328
3.8097
ci =
3.6538
2.6197
8.3781
5.5402
References
[1] Hahn, Gerald J., and S. S. Shapiro. Statistical Models in
Engineering. Hoboken, NJ: John Wiley & Sons, Inc., 1994, p. 95.
See Also
mle | betapdf | betainv | betastat | betalike | betarnd | betacdf
How To
• “Beta Distribution” on page B-4
betainv
Purpose
Beta inverse cumulative distribution function
Syntax
X = betainv(P,A,B)
Description
X = betainv(P,A,B) computes the inverse of the beta cdf with
parameters specified by A and B for the corresponding probabilities in P.
P, A, and B can be vectors, matrices, or multidimensional arrays that are
all the same size. A scalar input is expanded to a constant array with
the same dimensions as the other inputs. The parameters in A and B
must all be positive, and the values in P must lie on the interval [0, 1].
The inverse beta cdf for a given probability p and a given pair of
parameters a and b is
$$x = F^{-1}(p \mid a,b) = \{\, x : F(x \mid a,b) = p \,\}$$

where

$$p = F(x \mid a,b) = \frac{1}{B(a,b)} \int_0^x t^{a-1} (1-t)^{b-1} \, dt$$
and B( · ) is the Beta function. Each element of output X is the value
whose cumulative probability under the beta cdf defined by the
corresponding parameters in A and B is specified by the corresponding
value in P.
Algorithms
Examples
The betainv function uses Newton’s method with modifications to
constrain steps to the allowable range for x, i.e., [0 1].
p = [0.01 0.5 0.99];
x = betainv(p,10,5)
x =
0.3726 0.6742 0.8981
According to this result, for a beta cdf with a = 10 and b = 5, a value
less than or equal to 0.3726 occurs with probability 0.01. Similarly,
values less than or equal to 0.6742 and 0.8981 occur with respective
probabilities 0.5 and 0.99.
See Also
icdf | betapdf | betafit | betainv | betastat | betalike | betarnd
| betacdf
How To
• “Beta Distribution” on page B-4
betalike
Purpose
Beta negative log-likelihood
Syntax
nlogL = betalike(params,data)
[nlogL,AVAR] = betalike(params,data)
Description
nlogL = betalike(params,data) returns the negative of the beta
log-likelihood function for the beta parameters a and b specified in
vector params and the observations specified in the column vector data.
The elements of data must lie in the open interval (0, 1), where the beta
distribution is defined. However, it is sometimes also necessary to fit a
beta distribution to data that include exact zeros or ones. For such data,
the beta likelihood function is unbounded, and standard maximum
likelihood estimation is not possible. In that case, betalike computes a
modified likelihood that incorporates the zeros or ones by treating them
as if they were values that have been left-censored at sqrt(realmin) or
right-censored at 1-eps/2, respectively.
[nlogL,AVAR] = betalike(params,data) also returns AVAR, which is
the asymptotic variance-covariance matrix of the parameter estimates
if the values in params are the maximum likelihood estimates. AVAR is
the inverse of Fisher’s information matrix. The diagonal elements of
AVAR are the asymptotic variances of their respective parameters.
betalike is a utility function for maximum likelihood estimation of
the beta distribution. The likelihood assumes that all the elements in
the data sample are mutually independent. Since betalike returns
the negative beta log-likelihood function, minimizing betalike using
fminsearch is the same as maximizing the likelihood.
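For example, a minimal sketch of this use of fminsearch (the data here are simulated for illustration) is:

data = betarnd(4,3,100,1);                          % simulated beta sample (assumed)
start = [1 1];                                      % initial guess for [a b]
phat = fminsearch(@(p) betalike(p,data), start)     % minimize the negative log-likelihood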
Examples
This example continues the betafit example, which calculates
estimates of the beta parameters for some randomly generated beta
distributed data.
r = betarnd(4,3,100,1);
[nlogl,AVAR] = betalike(betafit(r),r)
nlogl =
-27.5996

AVAR =
    0.2783    0.1316
    0.1316    0.0867
See Also
betapdf | betafit | betainv | betastat | betarnd | betacdf
How To
• “Beta Distribution” on page B-4
betapdf
Purpose
Beta probability density function
Syntax
Y = betapdf(X,A,B)
Description
Y = betapdf(X,A,B) computes the beta pdf at each of the values in
X using the corresponding parameters in A and B. X, A, and B can be
vectors, matrices, or multidimensional arrays that all have the same
size. A scalar input is expanded to a constant array with the same
dimensions of the other inputs. The parameters in A and B must all be
positive, and the values in X must lie on the interval [0, 1].
The beta probability density function for a given value x and given pair
of parameters a and b is
$$y = f(x \mid a,b) = \frac{1}{B(a,b)} \, x^{a-1} (1-x)^{b-1} \, I_{(0,1)}(x)$$
where B( · ) is the Beta function. The indicator function I(0,1) ( x)
ensures that only values of x in the range (0 1) have nonzero probability.
The uniform distribution on (0 1) is a degenerate case of the beta pdf
where a = 1 and b = 1.
A likelihood function is the pdf viewed as a function of the parameters.
Maximum likelihood estimators (MLEs) are the values of the
parameters that maximize the likelihood function for a fixed value of x.
Examples
a = [0.5 1; 2 4]
a =
0.5000 1.0000
2.0000 4.0000
y = betapdf(0.5,a,a)
y =
0.6366 1.0000
1.5000 2.1875
See Also
pdf | betafit | betainv | betastat | betalike | betarnd | betacdf
How To
• “Beta Distribution” on page B-4
betarnd
Purpose
Beta random numbers
Syntax
R = betarnd(A,B)
R = betarnd(A,B,m,n,...)
R = betarnd(A,B,[m,n,...])
Description
R = betarnd(A,B) generates random numbers from the beta
distribution with parameters specified by A and B. A and B can be
vectors, matrices, or multidimensional arrays that have the same size,
which is also the size of R. A scalar input for A or B is expanded to a
constant array with the same dimensions as the other input.
R = betarnd(A,B,m,n,...) or R = betarnd(A,B,[m,n,...])
generates an m-by-n-by-... array containing random numbers from the
beta distribution with parameters A and B. A and B can each be scalars
or arrays of the same size as R.
Examples
a = [1 1;2 2];
b = [1 2;1 2];
r = betarnd(a,b)
r =
0.6987 0.6139
0.9102 0.8067
r = betarnd(10,10,[1 5])
r =
    0.5974    0.4777    0.5538    0.5465    0.6327
r = betarnd(4,2,2,3)
r =
0.3943 0.6101 0.5768
0.5990 0.2760 0.5474
See Also
random | betapdf | betafit | betainv | betastat | betalike |
betacdf
How To
• “Beta Distribution” on page B-4
betastat
Purpose
Beta mean and variance
Syntax
[M,V] = betastat(A,B)
Description
[M,V] = betastat(A,B), with A>0 and B>0, returns the mean of and
variance for the beta distribution with parameters specified by A and
B. A and B can be vectors, matrices, or multidimensional arrays that
have the same size, which is also the size of M and V. A scalar input
for A or B is expanded to a constant array with the same dimensions
as the other input.
The mean of the beta distribution with parameters a and b is a / (a + b)
and the variance is
$$\frac{ab}{(a+b+1)(a+b)^{2}}$$
Examples
If parameters a and b are equal, the mean is 1/2.
a = 1:6;
[m,v] = betastat(a,a)
m =
    0.5000    0.5000    0.5000    0.5000    0.5000    0.5000

v =
    0.0833    0.0500    0.0357    0.0278    0.0227    0.0192
See Also
betapdf | betafit | betainv | betalike | betarnd | betacdf
How To
• “Beta Distribution” on page B-4
gmdistribution.BIC property
Purpose
Bayes Information Criterion
Description
The Bayes Information Criterion: 2*NlogL+m*log(n), where n is the
number of observations and m is the number of estimated parameters.
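For example, a minimal sketch of inspecting this property (the data here are simulated for illustration) is:

X = [randn(100,2); randn(100,2)+3];    % simulated two-cluster data (assumed)
obj = gmdistribution.fit(X,2);         % fit a two-component Gaussian mixture
obj.BIC                                % Bayes Information Criterion for the fit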
binocdf
Purpose
Binomial cumulative distribution function
Syntax
Y = binocdf(X,N,P)
Description
Y = binocdf(X,N,P) computes a binomial cdf at each of the values
in X using the corresponding number of trials in N and probability of
success for each trial in P. X, N, and P can be vectors, matrices, or
multidimensional arrays that are all the same size. A scalar input is
expanded to a constant array with the same dimensions of the other
inputs. The values in N must all be positive integers, the values in X
must lie on the interval [0,N], and the values in P must lie on the
interval [0, 1].
The binomial cdf for a given value x and a given pair of parameters n
and p is
$$y = F(x \mid n,p) = \sum_{i=0}^{x} \binom{n}{i} p^{i} (1-p)^{(n-i)} I_{(0,1,\ldots,n)}(i)$$
The result, y, is the probability of observing up to x successes in n
independent trials, where the probability of success in any given trial is
p. The indicator function I(0,1,...,n) (i) ensures that x only adopts values
of 0,1,...,n.
Examples
If a baseball team plays 162 games in a season and has a 50-50 chance
of winning any game, then the probability of that team winning more
than 100 games in a season is:
1 - binocdf(100,162,0.5)
The result is 0.001 (i.e., 1-0.999). If a team wins 100 or more games
in a season, this result suggests that it is likely that the team’s true
probability of winning any game is greater than 0.5.
See Also
cdf | binopdf | binoinv | binostat | binofit | binornd
How To
• “Binomial Distribution” on page B-7
binofit
Purpose
Binomial parameter estimates
Syntax
phat = binofit(x,n)
[phat,pci] = binofit(x,n)
[phat,pci] = binofit(x,n,alpha)
Description
phat = binofit(x,n) returns a maximum likelihood estimate of the
probability of success in a given binomial trial based on the number of
successes, x, observed in n independent trials. If x = (x(1), x(2),
... x(k)) is a vector, binofit returns a vector of the same size as
x whose ith entry is the parameter estimate for x(i). All k estimates
are independent of each other. If n = (n(1), n(2), ..., n(k)) is a
vector of the same size as x, the binomial fit, binofit, returns a vector
whose ith entry is the parameter estimate based on the number of
successes x(i) in n(i) independent trials. A scalar value for x or n is
expanded to the same size as the other input.
[phat,pci] = binofit(x,n) returns the probability estimate,
phat, and the 95% confidence intervals, pci. binofit uses the
Clopper-Pearson method to calculate confidence intervals.
[phat,pci] = binofit(x,n,alpha) returns the 100(1 - alpha)%
confidence intervals. For example, alpha = 0.01 yields 99% confidence
intervals.
Note binofit behaves differently than other Statistics Toolbox
functions that compute parameter estimates, in that it returns
independent estimates for each entry of x. By comparison, expfit
returns a single parameter estimate based on all the entries of x.
Unlike most other distribution fitting functions, the binofit function
treats its input x vector as a collection of measurements from separate
samples. If you want to treat x as a single sample and compute a single
parameter estimate for it, you can use binofit(sum(x),sum(n)) when
n is a vector, and binofit(sum(x),n*length(x)) when n is a scalar.
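For example, a minimal sketch of the two approaches (with hypothetical counts) is:

x = [8 10 6 9];  n = [20 20 20 20];    % hypothetical success counts and trial counts
phat = binofit(x,n)                    % one estimate per entry of x
phat = binofit(sum(x),sum(n))          % single pooled estimate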
Examples
This example generates a binomial sample of 100 elements, where the
probability of success in a given trial is 0.6, and then estimates this
probability from the outcomes in the sample.
r = binornd(100,0.6);
[phat,pci] = binofit(r,100)
phat =
0.5800
pci =
0.4771 0.6780
The 95% confidence interval, pci, contains the true value, 0.6.
References
[1] Johnson, N. L., S. Kotz, and A. W. Kemp. Univariate Discrete
Distributions. Hoboken, NJ: Wiley-Interscience, 1993.
See Also
mle | binopdf | binocdf | binoinv | binostat | binornd
How To
• “Binomial Distribution” on page B-7
binoinv
Purpose
Binomial inverse cumulative distribution function
Syntax
X = binoinv(Y,N,P)
Description
X = binoinv(Y,N,P) returns the smallest integer X such that the
binomial cdf evaluated at X is equal to or exceeds Y. You can think of
Y as the probability of observing X successes in N independent trials
where P is the probability of success in each trial. Each X is a positive
integer less than or equal to N.
Y, N, and P can be vectors, matrices, or multidimensional arrays that
all have the same size. A scalar input is expanded to a constant array
with the same dimensions as the other inputs. The parameters in N
must be positive integers, and the values in both P and Y must lie on
the interval [0 1].
Examples
If a baseball team has a 50-50 chance of winning any game, what is a
reasonable range of games this team might win over a season of 162
games?
binoinv([0.05 0.95],162,0.5)
ans =
71
91
This result means that in 90% of baseball seasons, a .500 team should
win between 71 and 91 games.
See Also
icdf | binopdf | binocdf | binofit | binostat | binornd
How To
• “Binomial Distribution” on page B-7
binopdf
Purpose
Binomial probability density function
Syntax
Y = binopdf(X,N,P)
Description
Y = binopdf(X,N,P) computes the binomial pdf at each of the values
in X using the corresponding number of trials in N and probability of
success for each trial in P. Y, N, and P can be vectors, matrices, or
multidimensional arrays that all have the same size. A scalar input is
expanded to a constant array with the same dimensions of the other
inputs.
The parameters in N must be positive integers, and the values in P must
lie on the interval [0, 1].
The binomial probability density function for a given value x and given
pair of parameters n and p is
$$y = f(x \mid n,p) = \binom{n}{x} p^{x} q^{(n-x)} I_{(0,1,\ldots,n)}(x)$$
where q = 1 – p. The result, y, is the probability of observing x successes
in n independent trials, where the probability of success in any given
trial is p. The indicator function I(0,1,...,n)(x) ensures that x only adopts
values of 0, 1, ..., n.
Examples
A Quality Assurance inspector tests 200 circuit boards a day. If 2% of
the boards have defects, what is the probability that the inspector will
find no defective boards on any given day?
binopdf(0,200,0.02)
ans =
0.0176
What is the most likely number of defective boards the inspector will
find?
defects=0:200;
y = binopdf(defects,200,.02);
[x,i]=max(y);
defects(i)
ans =
4
See Also
pdf | binoinv | binocdf | binofit | binostat | binornd
How To
• “Binomial Distribution” on page B-7
binornd
Purpose
Binomial random numbers
Syntax
R = binornd(N,P)
R = binornd(N,P,m,n,...)
R = binornd(N,P,[m,n,...])
Description
R = binornd(N,P) generates random numbers from the binomial
distribution with parameters specified by the number of trials, N, and
probability of success for each trial, P. N and P can be vectors, matrices,
or multidimensional arrays that have the same size, which is also the
size of R. A scalar input for N or P is expanded to a constant array with
the same dimensions as the other input.
R = binornd(N,P,m,n,...) or R = binornd(N,P,[m,n,...])
generates an m-by-n-by-... array containing random numbers from the
binomial distribution with parameters N and P. N and P can each be
scalars or arrays of the same size as R.
Algorithms
Examples
The binornd function uses the direct method using the definition of the
binomial distribution as a sum of Bernoulli random variables.
n = 10:10:60;
r1 = binornd(n,1./n)
r1 =
     2     1     0     1     1     2

r2 = binornd(n,1./n,[1 6])
r2 =
     0     1     2     1     3     1

r3 = binornd(n,1./n,1,6)
r3 =
     0     1     1     1     0     3
See Also
random | binoinv | binocdf | binofit | binostat | binopdf
How To
• “Binomial Distribution” on page B-7
binostat
Purpose
Binomial mean and variance
Syntax
[M,V] = binostat(N,P)
Description
[M,V] = binostat(N,P) returns the mean of and variance for the
binomial distribution with parameters specified by the number of trials,
N, and probability of success for each trial, P. N and P can be vectors,
matrices, or multidimensional arrays that have the same size, which
is also the size of M and V. A scalar input for N or P is expanded to a
constant array with the same dimensions as the other input.
The mean of the binomial distribution with parameters n and p is np.
The variance is npq, where q = 1–p.
Examples
n = logspace(1,5,5)
n =
          10         100        1000       10000      100000

[m,v] = binostat(n,1./n)
m =
     1     1     1     1     1
v =
    0.9000    0.9900    0.9990    0.9999    1.0000

[m,v] = binostat(n,1/2)
m =
           5          50         500        5000       50000
v =
   1.0e+04 *
    0.0003    0.0025    0.0250    0.2500    2.5000
See Also
binoinv | binocdf | binofit | binornd | binopdf
How To
• “Binomial Distribution” on page B-7
biplot
Purpose
Biplot
Syntax
biplot(coefs)
h = biplot(coefs,'Name',Value)
Description
biplot(coefs) creates a biplot of the coefficients in the matrix coefs.
The biplot is 2-D if coefs has two columns or 3-D if it has three
columns. coefs usually contains principal component coefficients
created with princomp, pcacov, or factor loadings estimated with
factoran. The axes in the biplot represent the principal components or
latent factors (columns of coefs), and the observed variables (rows of
coefs) are represented as vectors.
A biplot allows you to visualize the magnitude and sign of each
variable’s contribution to the first two or three principal components,
and how each observation is represented in terms of those components.
biplot imposes a sign convention, forcing the element with largest
magnitude in each column of coefs to be positive. This flips some of the
vectors in coefs to the opposite direction, but often makes the plot easier
to read. Interpretation of the plot is unaffected, because changing the
sign of a coefficient vector does not change its meaning.
h = biplot(coefs,'Name',Value) specifies one or more name/value
input pairs and returns a column vector of handles to the graphics
objects created by biplot. The h contains, in order, handles
corresponding to variables (line handles, followed by marker handles,
followed by text handles), to observations (if present, marker handles
followed by text handles), and to the axis lines.
Input Arguments
Name-Value Pair Arguments
Scores
Plots both coefs and the scores in the matrix scores in the
biplot. scores usually contains principal component scores
created with princomp or factor scores estimated with factoran.
Each observation (row of scores) is represented as a point in the
biplot.
VarLabels
Labels each vector (variable) with the text in the character array
or cell array varlabels.
ObsLabels
Uses the text in the character array or cell array obslabels as
observation names when displaying data cursors.
Positive
• 'true' — restricts the biplot to the positive quadrant (in 2-D)
or octant (in 3-D).
• 'false' — makes the biplot over the range +/- max(coefs(:))
for all coordinates.
Default: false
PropertyName
Specifies optional property name/value pairs for all line graphics
objects created by biplot.
Examples
Perform a principal component analysis of the data in carsmall.mat:
load carsmall
x = [Acceleration Displacement Horsepower MPG Weight];
x = x(all(~isnan(x),2),:);
[coefs,score] = princomp(zscore(x));
View the data and the original variables in the space of the first three
principal components:
vbls = {'Accel','Disp','HP','MPG','Wgt'};
biplot(coefs(:,1:3),'scores',score(:,1:3),...
'varlabels',vbls);
See Also
factoran | nnmf | princomp | pcacov | rotatefactors
bootci
Purpose
Bootstrap confidence interval
Syntax
ci = bootci(nboot,bootfun,...)
ci = bootci(nboot,{bootfun,...},'alpha',alpha)
ci = bootci(nboot,{bootfun,...},...,'type',type)
ci = bootci(nboot,{bootfun,...},...,'type','student','nbootstd',nbootstd)
ci = bootci(nboot,{bootfun,...},...,'type','student','stderr',stderr)
ci = bootci(nboot,{bootfun,...},...,'Weights',weights)
ci = bootci(nboot,{bootfun,...},...,'Options',options)
[ci,bootstat] = bootci(...)
Description
ci = bootci(nboot,bootfun,...) computes the 95% bootstrap
confidence interval of the statistic computed by the function bootfun.
nboot is a positive integer indicating the number of bootstrap samples
used in the computation. bootfun is a function handle specified with
@, and must return a scalar. The third and later input arguments to
bootci are data (scalars, column vectors, or matrices) that are used
to create inputs to bootfun. bootci creates each bootstrap sample
by sampling with replacement from the rows of the non-scalar data
arguments (these must have the same number of rows). Scalar data
are passed to bootfun unchanged.
If bootfun returns a scalar, ci is a vector containing the lower and
upper bounds of the confidence interval. If bootfun returns a vector of
length m, ci is an array of size 2-by-m, where ci(1,:) are lower bounds
and ci(2,:) are upper bounds. If bootfun returns an array of size
m-by-n-by-p-by-..., ci is an array of size 2-by-m-by-n-by-p-by-..., where
ci(1,:,:,:,...) is an array of lower bounds and ci(2,:,:,:,...) is
an array of upper bounds.
ci = bootci(nboot,{bootfun,...},'alpha',alpha) computes the
100*(1-alpha) bootstrap confidence interval of the statistic defined by
the function bootfun. bootfun and the data that bootci passes to it
are contained in a single cell array. alpha is a scalar between 0 and 1.
The default value of alpha is 0.05.
ci = bootci(nboot,{bootfun,...},...,'type',type) computes the
bootstrap confidence interval of the statistic defined by the function
bootfun. type is the confidence interval type, chosen from among the
following strings:
• 'norm' or 'normal' — Normal approximated interval with
bootstrapped bias and standard error.
• 'per' or 'percentile' — Basic percentile method.
• 'cper' or 'corrected percentile' — Bias corrected percentile
method.
• 'bca' — Bias corrected and accelerated percentile method. This
is the default.
• 'stud' or 'student' — Studentized confidence interval.
ci =
bootci(nboot,{bootfun,...},...,'type','student','nbootstd',nbootstd)
computes the studentized bootstrap confidence interval of the statistic
defined by the function bootfun. The standard error of the
bootstrap statistics is estimated using bootstrap, with nbootstd
bootstrap data samples. nbootstd is a positive integer value.
The default value of nbootstd is 100.
ci =
bootci(nboot,{bootfun,...},...,'type','student','stderr',stderr)
computes the studentized bootstrap confidence interval of statistics
defined by the function bootfun. The standard error of the bootstrap
statistics is evaluated by the function stderr. stderr is a function
handle. stderr takes the same arguments as bootfun and returns the
standard error of the statistic computed by bootfun.
ci = bootci(nboot,{bootfun,...},...,'Weights',weights)
specifies observation weights. weights must be a vector of non-negative
numbers with at least one positive element. The number of elements
in weights must be equal to the number of rows in non-scalar
input arguments to bootfun. To obtain one bootstrap replicate,
bootstrp samples N out of N with replacement using these weights as
multinomial sampling probabilities.
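For example, a minimal sketch of weighted resampling (the data and weights here are simulated for illustration) is:

y = normrnd(1,1,30,1);              % simulated data (assumed)
w = rand(30,1);                     % nonnegative observation weights
ci = bootci(2000,{@mean,y},'Weights',w)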
ci = bootci(nboot,{bootfun,...},...,'Options',options)
specifies options that govern the computation of bootstrap iterations.
One option requests that bootci perform bootstrap iterations using
multiple processors, if the Parallel Computing Toolbox is available. Two
options specify the random number streams to be used in bootstrap
resampling. This argument is a struct that you can create with a call to
statset. You can retrieve values of the individual fields with a call to
statget. Applicable statset parameters are:
• 'UseParallel' — If 'always' and if a matlabpool of the Parallel
Computing Toolbox is open, compute bootstrap iterations in parallel.
If the Parallel Computing Toolbox is not installed, or a matlabpool is
not open, computation occurs in serial mode. Default is 'never', or
serial computation.
• UseSubstreams — Set to 'always' to compute in parallel in a
reproducible fashion. Default is 'never'. To compute reproducibly,
set Streams to a type allowing substreams: 'mlfg6331_64' or
'mrg32k3a'.
• Streams — A RandStream object or cell array of such objects. If you
do not specify Streams, bootci uses the default stream or streams. If
you choose to specify Streams, use a single object except in the case:
- You have an open MATLAB pool
- UseParallel is 'always'
- UseSubstreams is 'never'
In that case, use a cell array the same size as the MATLAB pool.
For more information on using parallel computing, see Chapter 17,
“Parallel Statistics”.
[ci,bootstat] = bootci(...) also returns the bootstrapped statistic
computed for each of the nboot bootstrap replicate samples. Each row
of bootstat contains the results of applying bootfun to one bootstrap
sample. If bootfun returns a matrix or array, then this output is
converted to a row vector for storage in bootstat.
Examples
Compute the confidence interval for the capability index in statistical
process control:
y = normrnd(1,1,30,1);                 % Simulated process data
LSL = -3; USL = 3;                     % Process specifications
capable = @(x)(USL-LSL)./(6*std(x));   % Process capability
ci = bootci(2000,capable,y)            % BCa confidence interval
ci =
0.8122
1.2657
sci = bootci(2000,{capable,y},'type','student') % Studentized ci
sci =
0.7739
1.2707
See Also
bootstrp | jackknife | statget | statset | randsample | parfor
bootstrp
Purpose
Bootstrap sampling
Syntax
bootstat = bootstrp(nboot,bootfun,d1,...)
[bootstat,bootsam] = bootstrp(...)
bootstat = bootstrp(...,'Name',Value)
Description
bootstat = bootstrp(nboot,bootfun,d1,...) draws nboot
bootstrap data samples, computes statistics on each sample using
bootfun, and returns the results in the matrix bootstat. nboot must
be a positive integer. bootfun is a function handle specified with
@. Each row of bootstat contains the results of applying bootfun to
one bootstrap sample. If bootfun returns a matrix or array, then this
output is converted to a row vector for storage in bootstat.
The third and later input arguments (d1,...) are data (scalars, column
vectors, or matrices) used to create inputs to bootfun. bootstrp creates
each bootstrap sample by sampling with replacement from the rows of
the non-scalar data arguments (these must have the same number of
rows). bootfun accepts scalar data unchanged.
[bootstat,bootsam] = bootstrp(...) returns an n-by-nboot matrix
of bootstrap indices, bootsam. Each column in bootsam contains indices
of the values that were drawn from the original data sets to constitute
the corresponding bootstrap sample. For example, if d1,... each
contain 16 values, and nboot = 4, then bootsam is a 16-by-4 matrix.
The first column contains the indices of the 16 values drawn from
d1,..., for the first of the four bootstrap samples, the second column
contains the indices for the second of the four bootstrap samples, and so
on. (The bootstrap indices are the same for all input data sets.) To get
the output samples bootsam without applying a function, set bootfun
to empty ([]).
bootstat = bootstrp(...,'Name',Value) uses additional arguments
specified by one or more Name,Value pair arguments. The name-value
pairs must appear after the data arguments. The available name-value
pairs:
• 'Weights' — Observation weights. The weights value must be a
vector of nonnegative numbers with at least one positive element.
The number of elements in weights must be equal to the number
of rows in non-scalar input arguments to bootstrp. To obtain one
bootstrap replicate, bootstrp samples N out of N with replacement
using these weights as multinomial sampling probabilities.
• 'Options' — The value is a structure that contains options
specifying whether to compute bootstrap iterations in parallel,
and specifying how to use random numbers during the bootstrap
sampling. Create the options structure with statset. Applicable
statset parameters are:
- 'UseParallel' — If 'always' and if a matlabpool of the Parallel
Computing Toolbox is open, compute bootstrap iterations in
parallel. If the Parallel Computing Toolbox is not installed, or
a matlabpool is not open, computation occurs in serial mode.
Default is 'never', meaning serial computation.
- 'UseSubstreams' — Set to 'always' to compute in parallel
in a reproducible fashion. Default is 'never'. To compute
reproducibly, set Streams to a type allowing substreams:
'mlfg6331_64' or 'mrg32k3a'.
- 'Streams' — A RandStream object or cell array of such objects. If
you do not specify Streams, bootstrp uses the default stream
or streams. If you choose to specify Streams, use a single object
except when all of the following are true:
• You have an open MATLAB pool
• UseParallel is 'always'
• UseSubstreams is 'never'
In that case, use a cell array the same size as the MATLAB pool.
For more information on using parallel computing, see Chapter 17,
“Parallel Statistics”.
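For example, assuming the Parallel Computing Toolbox is installed and a
matlabpool is already open, a sketch of parallel resampling might look
like this:

y = exprnd(5,100,1);                      % simulated data
opt = statset('UseParallel','always');    % request parallel bootstrap iterations
bootstat = bootstrp(1000,@mean,y,'Options',opt);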
Examples
Bootstrapping a Correlation Coefficient Standard Error
Load a data set containing the LSAT scores and law-school GPA for 15
students. These 15 data points are resampled to create 1000 different
data sets, and the correlation between the two variables is computed
for each data set.
load lawdata
[bootstat,bootsam] = bootstrp(1000,@corr,lsat,gpa);
Display the first 5 bootstrapped correlation coefficients.
bootstat(1:5,:)
ans =
0.6600
0.7969
0.5807
0.8766
0.9197
Display the indices of the data selected for the first 5 bootstrap samples.
bootsam(:,1:5)
ans =
     9     9    15    11    15
     8    10     6     7    14
    14     8    10     3    11
     7     1    11     9     2
     4     4    13     4    14
     6     1     5     2    10
     3     1     4     3    13
    10     3     1    15    11
    15    10     6    12     3
     4    14     5     2     8
     9     6    10     6     2
     4    13    15    10     8
     8    12    10     3     8
     5    12     1     2     4
     1     6     4     9     8
hist(bootstat)
The histogram shows the variation of the correlation coefficient across
all the bootstrap samples. The sample minimum is positive, indicating
that the relationship between LSAT score and GPA is not accidental.
Finally, compute a bootstrap standard error for the estimated
correlation coefficient.
se = std(bootstat)
se =
0.1327
Estimating the Density of Bootstrapped Statistic
Compute a sample of 100 bootstrapped means of random samples
taken from the vector Y, and plot an estimate of the density of these
bootstrapped means:
y = exprnd(5,100,1);
m = bootstrp(100,@mean,y);
[fi,xi] = ksdensity(m);
plot(xi,fi);
Bootstrapping More Than One Statistic
Compute a sample of 100 bootstrapped means and standard deviations
of random samples taken from the vector Y, and plot the bootstrap
estimate pairs:
y = exprnd(5,100,1);
stats = bootstrp(100,@(x)[mean(x) std(x)],y);
plot(stats(:,1),stats(:,2),'o')
Bootstrapping a Regression Model
Estimate the standard errors for a coefficient vector in a linear
regression by bootstrapping residuals:
load hald
x = [ones(size(heat)),ingredients];
y = heat;
b = regress(y,x);
yfit = x*b;
resid = y - yfit;
se = std(bootstrp(...
1000,@(bootr)regress(yfit+bootr,x),resid));
See Also
hist | bootci | ksdensity | parfor | random | randsample |
RandStream | statget | statset
boxplot
Purpose
Box plot
Syntax
boxplot(X)
boxplot(X,G)
boxplot(axes,X,...)
boxplot(...,'Name',value)
Description
boxplot(X) produces a box plot of the data in X. If X is a matrix, there is
one box per column; if X is a vector, there is just one box. On each box,
the central mark is the median, the edges of the box are the 25th and
75th percentiles, the whiskers extend to the most extreme data points
not considered outliers, and outliers are plotted individually.
boxplot(X,G) specifies one or more grouping variables G, producing
a separate box for each set of X values sharing the same G value or
values (see “Grouped Data” on page 2-34). Grouping variables must
have one row per element of X, or one row per column of X. Specify a
single grouping variable in G using a vector, a character array, a cell
array of strings, or a vector categorical array; specify multiple grouping
variables in G using a cell array of these variable types, such as {G1 G2
G3}, or by using a matrix. If multiple grouping variables are used, they
must all be the same length. Groups that contain a NaN value or an
empty string in a grouping variable are omitted, and are not counted in
the number of groups considered by other parameters.
By default, character and string grouping variables are sorted in the
order they initially appear in the data, categorical grouping variables
are sorted by the order of their levels, and numeric grouping variables
are sorted in numeric order. To control the order of groups, do one of
the following:
• Use categorical variables in G and specify the order of their levels.
• Use the 'grouporder' parameter described below.
• Pre-sort your data.
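For example, one way to fix the box order is to make the grouping
variable an ordinal array whose levels are declared in the desired order
(the data and labels below are illustrative):

x = randn(60,1);                                            % simulated data
g = ordinal(repmat((1:3)',20,1),{'low','medium','high'});   % levels in a fixed order
boxplot(x,g)                                                % boxes appear as low, medium, high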
boxplot(axes,X,...) creates the plot in the axes with handle axes.
boxplot(...,'Name',value) specifies one or more optional parameter
name/value pairs, as described in the following table. Specify Name in
single quotes.
Name
Value
plotstyle
• 'traditional' — Traditional box style.
This is the default.
• 'compact' — Box style designed for plots
with many groups. This style changes the
defaults for some other parameters, as
described in the following table.
boxstyle
• 'outline' — Draws an unfilled box with
dashed whiskers. This is the default.
• 'filled' — Draws a narrow filled box with
lines for whiskers.
colorgroup
One or more grouping variables, of the same
type as permitted for G, specifying that the
box color should change when the specified
variables change. The default is [] for no box
color change.
colors
Colors for boxes, specified as a single color
(such as 'r' or [1 0 0]) or multiple colors
(such as 'rgbm' or a three-column matrix of
RGB values). The sequence is replicated or
truncated as required, so for example 'rb'
gives boxes that alternate in color. The default
when no 'colorgroup' is specified is to use
the same color scheme for all boxes. The
default when 'colorgroup' is specified is a
modified hsv colormap.
datalim
A two-element vector containing lower and
upper limits, used by 'extrememode' to
determine which points are extreme. The
default is [-Inf Inf].
extrememode
• 'clip' — Moves data outside the datalim
limits to the limit. This is the default.
• 'compress' — Evenly distributes data
outside the datalim limits in a region just
outside the limit, retaining the relative
order of the points.
A dotted line marks the limit if any points
are outside it, and two gray lines mark
the compression region if any points are
compressed. Values at +/–Inf can be clipped
or compressed, but NaN values still do not
appear on the plot. Box notches are drawn to
scale and may extend beyond the bounds if the
median is inside the limit; they are not drawn
if the median is outside the limits.
factordirection
• 'data' — Arranges factors with the first
value next to the origin. This is the default.
• 'list' — Arranges factors left-to-right if on
the x axis or top-to-bottom if on the y axis.
• 'auto' — Uses 'data' for numeric
grouping variables and 'list' for strings.
fullfactors
• 'off' — One group for each unique row of
G. This is the default.
• 'on' — Create a group for each possible
combination of group variable values,
including combinations that do not appear
in the data.
factorseparator
Specifies which factors should have their
values separated by a grid line. The value
may be 'auto' or a vector of grouping variable
numbers. For example, [1 2] adds a separator
line when the first or second grouping variable
changes value. 'auto' is [] for one grouping
variable and [1] for two or more grouping
variables. The default is [].
factorgap
Specifies an extra gap to leave between boxes
when the corresponding grouping factor
changes value, expressed as a percentage of
the width of the plot. For example, with [3 1],
the gap is 3% of the width of the plot between
groups with different values of the first
grouping variable, and 1% between groups
with the same value of the first grouping
variable but different values for the second.
'auto' specifies that boxplot should choose a
gap automatically. The default is [].
grouporder
Order of groups for plotting, specified as a
cell array of strings. With multiple grouping
variables, separate values within each string
with a comma. Using categorical arrays as
grouping variables is an easier way to control
the order of the boxes. The default is [], which
does not reorder the boxes.
jitter
Maximum distance d to displace outliers along
the factor axis by a uniform random amount, in
order to make duplicate points visible. A d of
1 makes the jitter regions just touch between
the closest adjacent groups. The default is 0.
labels
A character array, cell array of strings, or
numeric vector of box labels. There may be
one label per group or one label per X value.
Multiple label variables may be specified via a
numeric matrix or a cell array containing any
of these types.
Tip To remove labels from a plot, use the
following command:
set(gca,'XTickLabel',{' '})
labelorientation
• 'inline' — Rotates the labels to be vertical.
This is the default when plotstyle is
'compact'.
• 'horizontal' — Leaves the labels
horizontal. This is the default when
plotstyle has the default value of
'traditional'.
When the labels are on the y axis, both settings
leave the labels horizontal.
labelverbosity
• 'all' — Displays every label. This is the
default.
• 'minor' — Displays a label for a factor only
when that factor has a different value from
the previous group.
• 'majorminor' — Displays a label for a
factor when that factor or any factor major
to it has a different value from the previous
group.
medianstyle
• 'line' — Draws a line for the median. This
is the default.
• 'target' — Draws a black dot inside a
white circle for the median.
notch
• 'on' — Draws comparison intervals using
notches when plotstyle is 'traditional',
or triangular markers when plotstyle is
'compact'.
• 'marker' — Draws comparison intervals
using triangular markers.
• 'off' — Omits notches. This is the default.
Two medians are significantly different at the
5% significance level if their intervals do not
overlap. Interval endpoints are the extremes
of the notches or the centers of the triangular
markers. When the sample size is small,
notches may extend beyond the end of the box.
orientation
• 'vertical' — Plots X on the y axis. This is
the default.
• 'horizontal' — Plots X on the x axis.
outliersize
Size of the marker used for outliers, in points.
The default is 6 (6/72 inch).
positions
Box positions specified as a numeric vector
with one entry per group or X value. The
default is 1:numGroups, where numGroups is
the number of groups.
symbol
Symbol and color to use for outliers, using
the same values as the LineSpec parameter
in plot. The default is 'r+'. If the symbol
is omitted then the outliers are invisible; if
the color is omitted then the outliers have the
same color as their corresponding box.
whisker
Maximum whisker length w. The default is a
w of 1.5. Points are drawn as outliers if they
are larger than q3 + w(q3 – q1) or smaller than
q1 – w(q3 – q1), where q1 and q3 are the 25th
and 75th percentiles, respectively. The default
of 1.5 corresponds to approximately ±2.7σ
and 99.3% coverage if the data are normally
distributed. The plotted whisker extends to
the adjacent value, which is the most extreme
data value that is not an outlier. Set whisker
to 0 to give no whiskers and to make every
point outside of q1 and q3 an outlier.
widths
A scalar or vector of box widths for when
boxstyle is 'outline'. The default is half
of the minimum separation between boxes,
which is 0.5 when the positions argument
takes its default value. The list of values is
replicated or truncated as necessary.
When the plotstyle parameter takes the value 'compact', the
following default values for other parameters apply.
Parameter            Default when plotstyle is compact
boxstyle             'filled'
factorseparator      'auto'
factorgap            'auto'
jitter               0.5
labelorientation     'inline'
labelverbosity       'majorminor'
medianstyle          'target'
outliersize          4
symbol               'o'
You can see data values and group names using the data cursor in the
figure window. The cursor shows the original values of any points
affected by the datalim parameter. You can label the group to which
an outlier belongs using the gname function.
To modify graphics properties of a box plot component, use findobj
with the Tag property to find the component’s handle. Tag values for
box plot components depend on parameter settings, and are listed in
the table below.
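For example, a minimal sketch (handle variable names are arbitrary) that
thickens the box outlines and recolors the outlier markers using two of
the tags from the table:

load carsmall
boxplot(MPG,Origin)
hBox = findobj(gca,'Tag','Box');        % handles to the box outlines
set(hBox,'LineWidth',2)
hOut = findobj(gca,'Tag','Outliers');   % handles to the outlier markers
set(hOut,'MarkerEdgeColor','k')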
Parameter settings and Tag values:

• All settings: 'Box', 'Outliers'
• When plotstyle is 'traditional': 'Median', 'Upper Whisker',
'Lower Whisker', 'Upper Adjacent Value', 'Lower Adjacent Value'
• When plotstyle is 'compact': 'Whisker', 'MedianOuter', 'MedianInner'
• When notch is 'marker': 'NotchLo', 'NotchHi'

Examples
Example 1
Create a box plot of car mileage, grouped by country:
load carsmall
boxplot(MPG,Origin)
Example 2
Create notched box plots for two groups of sample data:
x1 = normrnd(5,1,100,1);
x2 = normrnd(6,1,100,1);
boxplot([x1,x2],'notch','on')
The difference between the medians of the two groups is approximately
1. Since the notches in the box plot do not overlap, you can conclude,
with 95% confidence, that the true medians do differ.
The following figure shows the box plot for the same data with the
length of the whiskers specified as 1.0 times the interquartile range.
Points beyond the whiskers are displayed using +.
boxplot([x1,x2],'notch','on','whisker',1)
Example 3
A plotstyle of 'compact' is useful for large numbers of groups:
X = randn(100,25);
subplot(2,1,1)
boxplot(X)
subplot(2,1,2)
boxplot(X,'plotstyle','compact')
References
[1] McGill, R., J. W. Tukey, and W. A. Larsen. “Variations of Boxplots.”
The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
[2] Velleman, P.F., and D.C. Hoaglin. Applications, Basics, and
Computing of Exploratory Data Analysis. Pacific Grove, CA: Duxbury
Press, 1981.
[3] Nelson, L. S. “Evaluating Overlapping Confidence Intervals.”
Journal of Quality Technology. Vol. 21, 1989, pp. 140–141.
See Also
anova1 | axes | kruskalwallis | multcompare
How To
• “Grouped Data” on page 2-34
piecewisedistribution.boundary
Purpose
Piecewise distribution boundaries
Syntax
[p,q] = boundary(obj)
[p,q] = boundary(obj,i)
Description
[p,q] = boundary(obj) returns the boundary points between
segments of the piecewise distribution object, obj. p is a vector of
cumulative probabilities at each boundary. q is a vector of quantiles at
each boundary.
[p,q] = boundary(obj,i) returns p and q for the ith boundary.
Examples
Fit Pareto tails to a t distribution at cumulative probabilities 0.1 and
0.9:
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
[p,q] = boundary(obj)
p =
0.1000
0.9000
q =
-1.7766
1.8432
See Also
paretotails | cdf | icdf | nsegments
candexch
Purpose
Candidate set row exchange
Syntax
treatments = candexch(C,nruns)
treatments = candexch(C,nruns,'Name',value)
Description
treatments = candexch(C,nruns) uses a row-exchange algorithm
to select treatments from a candidate design matrix C to produce a
D-optimal design with nruns runs. The columns of C represent model
terms evaluated at candidate treatments. treatments is a vector of
length nruns giving indices of the rows in C used in the D-optimal
design. The function selects a starting design at random.
treatments = candexch(C,nruns,'Name',value) specifies one or
more additional name/value pairs for the design. Valid parameters and
their values are listed in the following table. Specify Name in single
quotes.
Parameter
Value
display
Either 'on' or 'off' to control display of the
iteration counter. The default is 'on'.
init
Initial design as an nruns-by-p matrix, where p is
the number of model terms. The default is a random
subset of the rows of C.
maxiter
Maximum number of iterations. The default is 10.
options
A structure that specifies whether to run in parallel,
and specifies the random stream or streams. Create
the options structure with statset. Option fields:
• UseParallel — Set to 'always' to compute in
parallel. Default is 'never'.
• UseSubstreams — Set to 'always' to compute
in parallel in a reproducible fashion. Default is
'never'. To compute reproducibly, set Streams
to a type allowing substreams: 'mlfg6331_64'
or 'mrg32k3a'.
• Streams — A RandStream object or cell array
of such objects. If you do not specify Streams,
candexch uses the default stream or streams. If
you choose to specify Streams, use a single object
except when all of the following are true:
- You have an open MATLAB pool
- UseParallel is 'always'
- UseSubstreams is 'never'
In that case, use a cell array the same size as the
MATLAB pool.
For more information on using parallel computing,
see Chapter 17, “Parallel Statistics”.
start
An nobs-by-p matrix of factor settings, specifying
a set of nobs fixed design points to include in the
design. candexch finds nruns additional rows to
add to the start design. The parameter provides
the same functionality as the daugment function,
using a row-exchange algorithm rather than a
coordinate-exchange algorithm.
tries
Number of times to try to generate a design from
a new starting point. The algorithm uses random
points for each try, except possibly the first. The
default is 1.
Note The rowexch function automatically generates a candidate set
using candgen, and then creates a D-optimal design from that candidate
set using candexch. Call candexch separately to specify your own
candidate set to the row-exchange algorithm.
Examples
The following example uses rowexch to generate a five-run design for a
two-factor pure quadratic model using a candidate set that is produced
internally:
dRE1 = rowexch(2,5,'purequadratic','tries',10)
dRE1 =
    -1     1
     0     0
     1    -1
     1     0
     1     1
The same thing can be done using candgen and candexch in sequence:
[dC,C] = candgen(2,'purequadratic') % Candidate set
dC =
    -1    -1
     0    -1
     1    -1
    -1     0
     0     0
     1     0
    -1     1
     0     1
     1     1
C =
     1    -1    -1     1     1
     1     0    -1     0     1
     1     1    -1     1     1
     1    -1     0     1     0
     1     0     0     0     0
     1     1     0     1     0
     1    -1     1     1     1
     1     0     1     0     1
     1     1     1     1     1
treatments = candexch(C,5,'tries',10) % D-opt subset
treatments =
2
1
7
3
4
dRE2 = dC(treatments,:) % Display design
dRE2 =
     0    -1
    -1    -1
    -1     1
     1    -1
    -1     0
You can replace C in this example with a design matrix evaluated
at your own candidate set. For example, suppose your experiment
is constrained so that the two factors cannot have extreme settings
simultaneously. The following produces a restricted candidate set:
constraint = sum(abs(dC),2) < 2; % Feasible treatments
my_dC = dC(constraint,:)
my_dC =
     0    -1
    -1     0
     0     0
     1     0
     0     1
Use the x2fx function to convert the candidate set to a design matrix:
my_C = x2fx(my_dC,'purequadratic')
my_C =
     1     0    -1     0     1
     1    -1     0     1     0
     1     0     0     0     0
     1     1     0     1     0
     1     0     1     0     1
Find the required design in the same manner:
my_treatments = candexch(my_C,5,'tries',10) % D-opt subset
my_treatments =
2
4
5
1
3
my_dRE = my_dC(my_treatments,:) % Display design
my_dRE =
    -1     0
     1     0
     0     1
     0    -1
     0     0
See Also
candgen | rowexch | cordexch | daugment | x2fx
candgen
Purpose
Candidate set generation
Syntax
dC = candgen(nfactors,'model')
[dC,C] = candgen(nfactors,'model')
[...] = candgen(nfactors,'model','Name',value)
Description
dC = candgen(nfactors,'model') generates a candidate set dC of
treatments appropriate for estimating the parameters in the model
with nfactors factors. dC has nfactors columns and one row for each
candidate treatment. model is one of the following strings, specified
inside single quotes:
• linear — Constant and linear terms. This is the default.
• interaction — Constant, linear, and interaction terms
• quadratic — Constant, linear, interaction, and squared terms
• purequadratic — Constant, linear, and squared terms
Alternatively, model can be a matrix specifying polynomial terms of
arbitrary order. In this case, model should have one column for each
factor and one row for each term in the model. The entries in any row
of model are powers for the factors in the columns. For example, if a
model has factors X1, X2, and X3, then a row [0 1 2] in model specifies
the term (X1.^0).*(X2.^1).*(X3.^2). A row of all zeros in model
specifies a constant term, which can be omitted.
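For example, a sketch of a custom two-factor model containing a
constant, both linear terms, the interaction, and the square of the first
factor (the term matrix is illustrative):

model = [0 0; 1 0; 0 1; 1 1; 2 0];   % 1, X1, X2, X1*X2, X1^2
[dC,C] = candgen(2,model);           % candidate set and design matrix for this model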
[dC,C] = candgen(nfactors,'model') also returns the design matrix
C evaluated at the treatments in dC. The order of the columns of C for a
full quadratic model with n terms is:
1 The constant term
2 The linear terms in order 1, 2, ..., n
3 The interaction terms in order (1, 2), (1, 3), ..., (1, n), (2, 3), ..., (n–1, n)
4 The squared terms in order 1, 2, ..., n
Other models use a subset of these terms, in the same order.
Pass C to candexch to generate a D-optimal design using a
coordinate-exchange algorithm.
[...] = candgen(nfactors,'model','Name',value) specifies one
or more optional name/value pairs for the design. Valid parameters
and their values are listed in the following table. Specify Name inside
single quotes.
Name
Value
bounds
Lower and upper bounds for each factor, specified as
a 2-by-nfactors matrix. Alternatively, this value
can be a cell array containing nfactors elements,
each element specifying the vector of allowable
values for the corresponding factor.
categorical
Indices of categorical predictors.
levels
Vector of number of levels for each factor.
Note The rowexch function automatically generates a candidate set
using candgen, and then creates a D-optimal design from that candidate
set using candexch. Call candexch separately to specify your own
candidate set to the row-exchange algorithm.
Examples
The following example uses rowexch to generate a five-run design for a
two-factor pure quadratic model using a candidate set that is produced
internally:
dRE1 = rowexch(2,5,'purequadratic','tries',10)
dRE1 =
    -1     1
     0     0
     1    -1
     1     0
     1     1
The same thing can be done using candgen and candexch in sequence:
[dC,C] = candgen(2,'purequadratic') % Candidate set, C
dC =
    -1    -1
     0    -1
     1    -1
    -1     0
     0     0
     1     0
    -1     1
     0     1
     1     1
C =
     1    -1    -1     1     1
     1     0    -1     0     1
     1     1    -1     1     1
     1    -1     0     1     0
     1     0     0     0     0
     1     1     0     1     0
     1    -1     1     1     1
     1     0     1     0     1
     1     1     1     1     1
treatments = candexch(C,5,'tries',10) % Find D-opt subset
treatments =
2
1
7
3
4
dRE2 = dC(treatments,:) % Display design
dRE2 =
     0    -1
    -1    -1
    -1     1
     1    -1
    -1     0

See Also
candexch | rowexch
canoncorr
Purpose
Canonical correlation
Syntax
[A,B] = canoncorr(X,Y)
[A,B,r] = canoncorr(X,Y)
[A,B,r,U,V] = canoncorr(X,Y)
[A,B,r,U,V,stats] = canoncorr(X,Y)
Description
[A,B] = canoncorr(X,Y) computes the sample canonical coefficients
for the n-by-d1 and n-by-d2 data matrices X and Y. X and Y must have
the same number of observations (rows) but can have different numbers
of variables (columns). A and B are d1-by-d and d2-by-d matrices, where
d = min(rank(X),rank(Y)). The jth columns of A and B contain the
canonical coefficients, i.e., the linear combination of variables making
up the jth canonical variable for X and Y, respectively. Columns of
A and B are scaled to make the covariance matrices of the canonical
variables the identity matrix (see U and V below). If X or Y is less than
full rank, canoncorr gives a warning and returns zeros in the rows of A
or B corresponding to dependent columns of X or Y.
[A,B,r] = canoncorr(X,Y) also returns a 1-by-d vector containing the
sample canonical correlations. The jth element of r is the correlation
between the jth columns of U and V (see below).
[A,B,r,U,V] = canoncorr(X,Y) also returns the canonical variables,
also known as scores. U and V are n-by-d matrices computed as
U = (X-repmat(mean(X),n,1))*A
V = (Y-repmat(mean(Y),n,1))*B
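As a quick check of these definitions, the jth canonical correlation
should match the sample correlation of the jth columns of U and V. A
sketch using the carbig data (as in the example below):

load carbig
X = [Displacement Horsepower Weight];
Y = [Acceleration MPG];
ok = ~any(isnan([X Y]),2);                 % drop rows with missing values
[A,B,r,U,V] = canoncorr(X(ok,:),Y(ok,:));
rCheck = diag(corr(U,V))'                  % should agree with r up to rounding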
[A,B,r,U,V,stats] = canoncorr(X,Y) also returns a structure
stats containing information relating to the sequence of d null
hypotheses H0(k) , that the (k+1)st through dth correlations are all zero,
for k = 0:(d-1). stats contains seven fields, each a 1-by-d vector with
elements corresponding to the values of k, as described in the following
table:
• Wilks: Wilks’ lambda (likelihood ratio) statistic
• chisq: Bartlett’s approximate chi-squared statistic for H0(k),
with Lawley’s modification
• pChisq: Right-tail significance level for chisq
• F: Rao’s approximate F statistic for H0(k)
• pF: Right-tail significance level for F
• df1: Degrees of freedom for the chi-squared statistic, and
the numerator degrees of freedom for the F statistic
• df2: Denominator degrees of freedom for the F statistic

Examples
load carbig;
X = [Displacement Horsepower Weight Acceleration MPG];
nans = sum(isnan(X),2) > 0;
[A B r U V] = canoncorr(X(~nans,1:3),X(~nans,4:5));
plot(U(:,1),V(:,1),'.')
xlabel('0.0025*Disp+0.020*HP-0.000025*Wgt')
ylabel('-0.17*Accel-0.092*MPG')
References
[1] Krzanowski, W. J. Principles of Multivariate Analysis: A User’s
Perspective. New York: Oxford University Press, 1988.
[2] Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley
& Sons, Inc., 1984.
See Also
manova1 | princomp
capability
Purpose
Process capability indices
Syntax
S = capability(data,specs)
Description
S = capability(data,specs) estimates capability indices for
measurements in data given the specifications in specs. data can be
either a vector or a matrix of measurements. If data is a matrix, indices
are computed for the columns. specs can be either a two-element vector
of the form [L,U] containing lower and upper specification limits, or (if
data is a matrix) a two-row matrix with the same number of columns as
data. If there is no lower bound, use -Inf as the first element of specs.
If there is no upper bound, use Inf as the second element of specs.
The output S is a structure with the following fields:
• mu — Sample mean
• sigma — Sample standard deviation
• P — Estimated probability of being within limits
• Pl — Estimated probability of being below L
• Pu — Estimated probability of being above U
• Cp — (U-L)/(6*sigma)
• Cpl — (mu-L)./(3.*sigma)
• Cpu — (U-mu)./(3.*sigma)
• Cpk — min(Cpl,Cpu)
Indices are computed under the assumption that data values are
independent samples from a normal population with constant mean
and variance.
Indices divide a “specification width” (between specification limits) by
a “process width” (between control limits). Higher ratios indicate a
process with fewer measurements outside of specification.
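As a quick check of these definitions, the following sketch recomputes
Cp and Cpk from the fields of S (simulated data, as in the example
below):

data = normrnd(3,0.005,100,1);
S = capability(data,[2.99 3.01]);
Cp  = (3.01-2.99)/(6*S.sigma);               % should match S.Cp
Cpk = min((S.mu-2.99)/(3*S.sigma), ...
          (3.01-S.mu)/(3*S.sigma));          % should match S.Cpk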
Examples
Simulate a sample from a process with a mean of 3 and a standard
deviation of 0.005:
data = normrnd(3,0.005,100,1);
Compute capability indices if the process has an upper specification
limit of 3.01 and a lower specification limit of 2.99:
S = capability(data,[2.99 3.01])
S =
mu: 3.0006
sigma: 0.0047
P: 0.9669
Pl: 0.0116
Pu: 0.0215
Cp: 0.7156
Cpl: 0.7567
Cpu: 0.6744
Cpk: 0.6744
Visualize the specification and process widths:
capaplot(data,[2.99 3.01]);
grid on
References
[1] Montgomery, D. Introduction to Statistical Quality Control.
Hoboken, NJ: John Wiley & Sons, 1991, pp. 369–374.
See Also
capaplot | histfit
capaplot
Purpose
Process capability plot
Syntax
p = capaplot(data,specs)
[p,h] = capaplot(data,specs)
Description
p = capaplot(data,specs) estimates the mean and variance of the
observations in input vector data, and plots the pdf of the resulting
T distribution. The observations in data are assumed to be normally
distributed. The output, p, is the probability that a new observation
from the estimated distribution will fall within the range specified by
the two-element vector specs. The portion of the distribution between
the lower and upper bounds specified in specs is shaded in the plot.
[p,h] = capaplot(data,specs) additionally returns handles to the
plot elements in h.
capaplot treats NaN values in data as missing, and ignores them.
Examples
Simulate a sample from a process with a mean of 3 and a standard
deviation of 0.005:
data = normrnd(3,0.005,100,1);
Compute capability indices if the process has an upper specification
limit of 3.01 and a lower specification limit of 2.99:
S = capability(data,[2.99 3.01])
S =
mu: 3.0006
sigma: 0.0047
P: 0.9669
Pl: 0.0116
Pu: 0.0215
Cp: 0.7156
Cpl: 0.7567
Cpu: 0.6744
Cpk: 0.6744
Visualize the specification and process widths:
capaplot(data,[2.99 3.01]);
grid on
See Also
capability | histfit
caseread
Purpose
Read case names from file
Syntax
names = caseread('filename')
names = caseread
Description
names = caseread('filename') reads the contents of filename and
returns a string matrix of names. filename is the name of a file in
the current folder, or the complete path name of any file elsewhere.
caseread treats each line as a separate case.
names = caseread displays the Select File to Open dialog box for
interactive selection of the input file.
Examples
Read the file months.dat created using the casewrite function.
type months.dat
January
February
March
April
May
names = caseread('months.dat')
names =
January
February
March
April
May
See Also
casewrite | gname | tdfread | tblread
casewrite
Purpose
Write case names to file
Syntax
casewrite(strmat,'filename')
casewrite(strmat)
Description
casewrite(strmat,'filename') writes the contents of string matrix
strmat to filename. Each row of strmat represents one case name.
filename is the name of a file in the current folder, or the complete
path name of any file elsewhere. casewrite writes each name to a
separate line in filename.
casewrite(strmat) displays the Select File to Write dialog box for
interactive specification of the output file.
Examples
strmat = char('January','February',...
'March','April','May')
strmat =
January
February
March
April
May
casewrite(strmat,'months.dat')
type months.dat
January
February
March
April
May
See Also
gname | caseread | tblwrite | tdfread
categorical.cat
Purpose
Concatenate categorical arrays
Syntax
c = cat(dim,A,B,...)
Description
c = cat(dim,A,B,...) concatenates the categorical arrays A,B,...
along dimension dim. All inputs must have the same size except along
dimension dim. The set of categorical levels for c is the sorted union of
the sets of levels of the inputs, as determined by their labels.
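A minimal sketch using nominal arrays (the labels are illustrative):

A = nominal({'red';'green'});
B = nominal({'blue';'red'});
C = cat(1,A,B)   % 4-by-1 nominal array; its levels are the sorted union blue, green, red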
See Also
cat | horzcat | vertcat
categorical
Purpose
Arrays for categorical data
Description
categorical is an abstract class, and you cannot create instances of it
directly. You must create nominal or ordinal arrays.
Categorical arrays store data with values in a discrete set of levels.
Each level is meant to capture a single, defining characteristic of an
observation. If you do not encode ordering in the levels, the data and
the array are nominal. If you do encode an ordering, the data and the
array are ordinal.
Construction
categorical
Create categorical array
Methods
addlevels
Add levels to categorical array
cat
Concatenate categorical arrays
cellstr
Convert categorical array to cell
array of strings
char
Convert categorical array to
character array
circshift
Shift categorical array circularly
ctranspose
Transpose categorical matrix
disp
Display categorical array
display
Display categorical array
double
Convert categorical array to
double array
droplevels
Drop levels
end
Last index in indexing expression
for categorical array
flipdim
Flip categorical array along
specified dimension
fliplr
Flip categorical matrix in
left/right direction
flipud
Flip categorical matrix in
up/down direction
getlabels
Access categorical array labels
getlevels
Get categorical array levels
hist
Plot histogram of categorical data
horzcat
Horizontal concatenation for
categorical arrays
int16
Convert categorical array to
signed 16-bit integer array
int32
Convert categorical array to
signed 32-bit integer array
int64
Convert categorical array to
signed 64-bit integer array
int8
Convert categorical array to
signed 8-bit integer array
intersect
Set intersection for categorical
arrays
ipermute
Inverse permute dimensions of
categorical array
isempty
True for empty categorical array
isequal
True if categorical arrays are
equal
islevel
Test for levels
ismember
True for elements of categorical
array in set
isscalar
True if categorical array is scalar
isundefined
Test for undefined elements
isvector
True if categorical array is vector
length
Length of categorical array
levelcounts
Element counts by level
ndims
Number of dimensions of
categorical array
numel
Number of elements in categorical
array
permute
Permute dimensions of
categorical array
reorderlevels
Reorder levels
repmat
Replicate and tile categorical
array
reshape
Resize categorical array
rot90
Rotate categorical matrix 90
degrees
setdiff
Set difference for categorical
arrays
setlabels
Label levels
setxor
Set exclusive-or for categorical
arrays
shiftdim
Shift dimensions of categorical
array
single
Convert categorical array to
single array
size
Size of categorical array
squeeze
Squeeze singleton dimensions
from categorical array
subsasgn
Subscripted assignment for
categorical array
subsindex
Subscript index for categorical
array
subsref
Subscripted reference for
categorical array
summary
Summary statistics for categorical
array
times
Product of categorical arrays
transpose
Transpose categorical matrix
uint16
Convert categorical array to
unsigned 16-bit integers
uint32
Convert categorical array to
unsigned 32-bit integers
uint64
Convert categorical array to
unsigned 64-bit integers
uint8
Convert categorical array to
unsigned 8-bit integers
union
Set union for categorical arrays
unique
Unique values in categorical
array
vertcat
Vertical concatenation for
categorical arrays
Properties
labels
Text labels for levels
undeflabel
Text label for undefined levels
Copy Semantics
Value. To learn how this affects your use of the class, see Comparing
Handle and Value Classes in the MATLAB Object-Oriented
Programming documentation.
How To
• “Categorical Arrays” on page 2-13
categorical
Purpose
Create categorical array
Description
categorical is an abstract class, and you cannot create instances of it
directly. You must create nominal or ordinal arrays.
See Also
nominal | ordinal
dataset.cat
Purpose
Concatenate dataset arrays
Syntax
ds = cat(dim, ds1, ds2, ...)
Description
ds = cat(dim, ds1, ds2, ...) concatenates the dataset arrays ds1,
ds2, ... along dimension dim by calling the dataset/horzcat or
dataset/vertcat method. dim must be 1 or 2.
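A minimal sketch (the variable name x is illustrative):

ds1 = dataset({(1:3)','x'});
ds2 = dataset({(4:6)','x'});
ds  = cat(1,ds1,ds2)    % same result as vertcat(ds1,ds2)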
See Also
horzcat | vertcat
classregtree.catsplit
Purpose
Categorical splits used for branches in decision tree
Syntax
v=catsplit(t)
v=catsplit(t,j)
Description
v=catsplit(t) returns an n-by-2 cell array v. Each row in v gives left
and right values for a categorical split. For each branch node j based on
a categorical predictor variable z, the left child is chosen if z is in v(j,1)
and the right child is chosen if z is in v(j,2). The splits are in the
same order as nodes of the tree. Nodes for these splits can be found by
running cuttype and selecting 'categorical' cuts from top to bottom.
v=catsplit(t,j) takes an array j of rows and returns the splits for
the specified rows.
See Also
classregtree
gmdistribution.cdf
Purpose
Cumulative distribution function for Gaussian mixture distribution
Syntax
y = cdf(obj,X)
Description
y = cdf(obj,X) returns a vector y of length n containing the values of
the cumulative distribution function (cdf) for the gmdistribution object
obj, evaluated at the n-by-d data matrix X, where n is the number of
observations and d is the dimension of the data. obj is an object created
by gmdistribution or fit. y(I) is the cdf of observation I.
Examples
Create a gmdistribution object defining a two-component mixture of
bivariate Gaussian distributions:
MU = [1 2;-3 -5];
SIGMA = cat(3,[2 0;0 .5],[1 0;0 1]);
p = ones(1,2)/2;
obj = gmdistribution(MU,SIGMA,p);
ezsurf(@(x,y)cdf(obj,[x y]),[-10 10],[-10 10])
See Also
gmdistribution | fit | pdf | mvncdf
ccdesign
Purpose
Central composite design
Syntax
dCC = ccdesign(n)
[dCC,blocks] = ccdesign(n)
[...] = ccdesign(n,'Name',value)
Description
dCC = ccdesign(n) generates a central composite design for n factors.
n must be an integer 2 or larger. The output matrix dCC is m-by-n, where
m is the number of runs in the design. Each row represents one run,
with settings for all factors represented in the columns. Factor values
are normalized so that the cube points take values between -1 and 1.
[dCC,blocks] = ccdesign(n) requests a blocked design. The output
blocks is an m-by-1 vector of block numbers for each run. Blocks
indicate runs that are to be measured under similar conditions
to minimize the effect of inter-block differences on the parameter
estimates.
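For example, a sketch of a blocked three-factor design (the block sizes
that result depend on the design):

[dCC,blocks] = ccdesign(3,'blocksize',8);
runsPerBlock = accumarray(blocks,1)    % number of runs assigned to each block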
[...] = ccdesign(n,'Name',value) specifies one or more optional
name/value pairs for the design. Valid parameters and their values are
listed in the following table. Specify Name in single quotes.
center: Number of center points.
• Integer — Number of center points to include.
• 'uniform' — Select number of center points to give uniform
precision.
• 'orthogonal' — Select number of center points to give an
orthogonal design. This is the default.

fraction: Fraction of full-factorial cube, expressed as an exponent
of 1/2.
• 0 — Whole design. This is the default.
• 1 — 1/2 fraction.
• 2 — 1/4 fraction.

type: Type of CCD.
• 'circumscribed' — Circumscribed (CCC). This is the default.
• 'inscribed' — Inscribed (CCI).
• 'faced' — Faced (CCF).

blocksize: Maximum number of points per block. Integer. The default
is Inf.

Examples
The following creates a 2-factor CCC:
dCC = ccdesign(2,'type','circumscribed')
dCC =
   -1.0000   -1.0000
   -1.0000    1.0000
    1.0000   -1.0000
    1.0000    1.0000
   -1.4142         0
    1.4142         0
         0   -1.4142
         0    1.4142
         0         0
         0         0
         0         0
         0         0
         0         0
         0         0
         0         0
         0         0
The center point is run 8 times to reduce the correlations among the
coefficient estimates.
Visualize the design as follows:
plot(dCC(:,1),dCC(:,2),'ro','MarkerFaceColor','b')
X = [1 -1 -1 -1; 1 1 1 -1];
Y = [-1 -1 1 -1; 1 -1 1 1];
line(X,Y,'Color','b')
axis square equal
See Also
bbdesign
cdf
Purpose
Cumulative distribution functions
Syntax
Y = cdf('name',X,A)
Y = cdf('name',X,A,B)
Y = cdf('name',X,A,B,C)
Description
Y = cdf('name',X,A) computes the cumulative distribution function
for the one-parameter family of distributions specified by name. A
contains parameter values for the distribution. The cumulative
distribution function is evaluated at the values in X and its values are
returned in Y.
If X and A are arrays, they must be the same size. If X is a scalar, it is
expanded to a constant matrix the same size as A. If A is a scalar, it is
expanded to a constant matrix the same size as X.
Y is the common size of X and A after any necessary scalar expansion.
Y = cdf('name',X,A,B) computes the cumulative distribution function
for two-parameter families of distributions, where parameter values
are given in A and B.
If X, A, and B are arrays, they must be the same size. If X is a scalar, it is
expanded to a constant matrix the same size as A and B. If either A or B
are scalars, they are expanded to constant matrices the same size as X.
Y is the common size of X, A, and B after any necessary scalar expansion.
Y = cdf('name',X,A,B,C) computes the cumulative distribution
function for three-parameter families of distributions, where parameter
values are given in A, B, and C.
If X, A, B, and C are arrays, they must be the same size. If X is a scalar,
it is expanded to a constant matrix the same size as A, B, and C. If
any of A, B or C are scalars, they are expanded to constant matrices
the same size as X.
Y is the common size of X, A, B, and C after any necessary scalar
expansion.
Acceptable strings for name (specified in single quotes) are:
• beta or Beta: “Beta Distribution” on page B-4. A: a. B: b.
• bino or Binomial: “Binomial Distribution” on page B-7. A: n, number
of trials. B: p, probability of success for each trial.
• chi2 or Chisquare: “Chi-Square Distribution” on page B-12. A: ν,
degrees of freedom.
• exp or Exponential: “Exponential Distribution” on page B-16. A: μ,
mean.
• ev or Extreme Value: “Extreme Value Distribution” on page B-19.
A: μ, location parameter. B: σ, scale parameter.
• f or F: “F Distribution” on page B-25. A: ν1, numerator degrees of
freedom. B: ν2, denominator degrees of freedom.
• gam or Gamma: “Gamma Distribution” on page B-27. A: a, shape
parameter. B: b, scale parameter.
• gev or Generalized Extreme Value: “Generalized Extreme Value
Distribution” on page B-32. A: K, shape parameter. B: σ, scale
parameter. C: μ, location parameter.
• gp or Generalized Pareto: “Generalized Pareto Distribution” on page
B-37. A: k, tail index (shape) parameter. B: σ, scale parameter.
C: μ, threshold (location) parameter.
• geo or Geometric: “Geometric Distribution” on page B-41. A: p,
probability parameter.
• hyge or Hypergeometric: “Hypergeometric Distribution” on page
B-43. A: M, size of the population. B: K, number of items with the
desired characteristic in the population. C: n, number of samples
drawn.
• logn or Lognormal: “Lognormal Distribution” on page B-51. A: μ.
B: σ.
• nbin or Negative Binomial: “Negative Binomial Distribution” on
page B-72. A: r, number of successes. B: p, probability of success
in a single trial.
• ncf or Noncentral F: “Noncentral F Distribution” on page B-78.
A: ν1, numerator degrees of freedom. B: ν2, denominator degrees of
freedom. C: δ, noncentrality parameter.
• nct or Noncentral t: “Noncentral t Distribution” on page B-80. A: ν,
degrees of freedom. B: δ, noncentrality parameter.
• ncx2 or Noncentral Chi-square: “Noncentral Chi-Square
Distribution” on page B-76. A: ν, degrees of freedom. B: δ,
noncentrality parameter.
• norm or Normal: “Normal Distribution” on page B-83. A: μ, mean.
B: σ, standard deviation.
• poiss or Poisson: “Poisson Distribution” on page B-89. A: λ, mean.
• rayl or Rayleigh: “Rayleigh Distribution” on page B-91. A: b, scale
parameter.
• t or T: “Student’s t Distribution” on page B-95. A: ν, degrees of
freedom.
• unif or Uniform: “Uniform Distribution (Continuous)” on page B-99.
A: a, lower endpoint (minimum). B: b, upper endpoint (maximum).
• unid or Discrete Uniform: “Uniform Distribution (Discrete)” on page
B-101. A: N, maximum observable value.
• wbl or Weibull: “Weibull Distribution” on page B-103. A: a, scale
parameter. B: b, shape parameter.

Examples
Compute the cdf of the normal distribution with mean 0 and standard
deviation 1 at inputs –2, –1, 0, 1, 2:
p1 = cdf('Normal',-2:2,0,1)
p1 =
    0.0228    0.1587    0.5000    0.8413    0.9772
The order of the parameters is the same as for normcdf.
Compute the cdfs of Poisson distributions with rate parameters 0, 1, ...,
4 at inputs 1, 2, ..., 5, respectively:
p2 = cdf('Poisson',0:4,1:5)
p2 =
    0.3679    0.4060    0.4232    0.4335    0.4405
The order of the parameters is the same as for poisscdf.
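A three-parameter sketch, using the generalized Pareto distribution (the
parameter values are illustrative):

k = 0.5; sigma = 1; theta = 0;            % shape, scale, threshold
p3 = cdf('gp',0:3,k,sigma,theta)

The order of the parameters is the same as for gpcdf.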
See Also
pdf | icdf
piecewisedistribution.cdf
Purpose
Cumulative distribution function for piecewise distribution
Syntax
P = cdf(obj,X)
Description
P = cdf(obj,X) returns an array P of values of the cumulative
distribution function for the piecewise distribution object obj, evaluated
at the values in the array X.
Examples
Fit Pareto tails to a t distribution at cumulative probabilities 0.1 and
0.9:
t = trnd(3,100,1);
obj = paretotails(t,0.1,0.9);
[p,q] = boundary(obj)
p =
0.1000
0.9000
q =
-1.7766
1.8432
cdf(obj,q)
ans =
0.1000
0.9000
See Also
paretotails | pdf | icdf
ProbDist.cdf
Purpose
Return cumulative distribution function (CDF) for ProbDist object
Syntax
Y = cdf(PD, X)
Description
Y = cdf(PD, X) returns Y, an array containing the cumulative
distribution function (CDF) for the ProbDist object PD, evaluated at
values in X.
Input Arguments

PD: An object of the class ProbDistUnivParam or ProbDistUnivKernel.
X: A numeric array of values where you want to evaluate the CDF.

Output Arguments

Y: An array containing the cumulative distribution function (CDF) for
the ProbDist object PD.

See Also
cdf
cdfplot
Purpose
Empirical cumulative distribution function plot
Syntax
cdfplot(X)
h = cdfplot(X)
[h,stats] = cdfplot(X)
Description
cdfplot(X) displays a plot of the empirical cumulative distribution
function (cdf) for the data in the vector X. The empirical cdf F(x) is
defined as the proportion of X values less than or equal to x.
This plot, like those produced by hist and normplot, is useful for
examining the distribution of a sample of data. You can overlay a
theoretical cdf on the same plot to compare the empirical distribution
of the sample to the theoretical distribution.
The kstest, kstest2, and lillietest functions compute test statistics
that are derived from the empirical cdf. You may find the empirical
cdf plot produced by cdfplot useful in helping you to understand the
output from those functions.
h = cdfplot(X) returns a handle to the cdf curve.
[h,stats] = cdfplot(X) also returns a stats structure with the
following fields.
stats.min: Minimum value
stats.max: Maximum value
stats.mean: Sample mean
stats.median: Sample median (50th percentile)
stats.std: Sample standard deviation

Examples
The following example compares the empirical cdf for a sample from
an extreme value distribution with a plot of the cdf for the sampling
distribution. In practice, the sampling distribution would be unknown,
and would be chosen to match the empirical cdf.
cdfplot
y = evrnd(0,3,100,1);
cdfplot(y)
hold on
x = -20:0.1:10;
f = evcdf(x,0,3);
plot(x,f,'m')
legend('Empirical','Theoretical','Location','NW')
See Also
ecdf
categorical.cellstr
Purpose
Convert categorical array to cell array of strings
Syntax
B = cellstr(A)
Description
B = cellstr(A) converts the categorical array A to a cell array of
strings. Each element of B contains the categorical level label for the
corresponding element of A.
See Also
char | getlabels
dataset.cellstr
Purpose
Create cell array of strings from dataset array
Syntax
B = cellstr(A)
B = cellstr(A,VARS)
Description
B = cellstr(A) returns the contents of the dataset A, converted to a
cell array of strings. The variables in the dataset must support the
conversion and must have compatible sizes.
B = cellstr(A,VARS) returns the contents of the dataset variables
specified by VARS. VARS is a positive integer, a vector of positive integers,
a variable name, a cell array containing one or more variable names, or
a logical vector.
See Also
dataset.double | dataset.replacedata
categorical.char
Purpose
Convert categorical array to character array
Syntax
B = char(A)
Description
B = char(A) converts the categorical array A to a 2-D character matrix.
char does not preserve the shape of A. B contains numel(A) rows, and
each row of B contains the categorical level label for the corresponding
element of A(:).
See Also
cellstr | getlabels
chi2cdf
Purpose
Chi-square cumulative distribution function
Syntax
P = chi2cdf(X,V)
Description
P = chi2cdf(X,V) computes the chi-square cdf at each of the values
in X using the corresponding degrees of freedom in V. X and V can
be vectors, matrices, or multidimensional arrays that have the same
size. A scalar input is expanded to a constant array with the same
dimensions as the other input.
The degrees of freedom parameters in V must be positive integers, and
the values in X must lie on the interval [0 Inf].
The χ2 cdf for a given value x and degrees-of-freedom ν is
p = F(x \mid \nu) = \int_0^x \frac{t^{(\nu-2)/2} e^{-t/2}}{2^{\nu/2}\,\Gamma(\nu/2)}\, dt
where Γ( · ) is the Gamma function.
The chi-square density function with ν degrees-of-freedom is the same
as the gamma density function with parameters ν/2 and 2.
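As a quick numerical check of this relationship (the values of x and v
are illustrative):

x = 0:0.5:5;  v = 4;
max(abs(chi2cdf(x,v) - gamcdf(x,v/2,2)))   % essentially zero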
Examples
probability = chi2cdf(5,1:5)
probability =
    0.9747    0.9179    0.8282    0.7127    0.5841
probability = chi2cdf(1:5,1:5)
probability =
    0.6827    0.6321    0.6084    0.5940    0.5841
See Also
cdf | chi2pdf | chi2inv | chi2stat | chi2rnd
How To
• “Chi-Square Distribution” on page B-12
chi2gof
Purpose
Chi-square goodness-of-fit test
Syntax
h = chi2gof(x)
[h,p] = chi2gof(...)
[h,p,stats] = chi2gof(...)
[...] = chi2gof(X,'Name',value)
Description
h = chi2gof(x) performs a chi-square goodness-of-fit test of the
default null hypothesis that the data in vector x are a random sample
from a normal distribution with mean and variance estimated from
x, against the alternative that the data are not normally distributed
with the estimated mean and variance. The result h is 1 if the null
hypothesis can be rejected at the 5% significance level. The result h is 0
if the null hypothesis cannot be rejected at the 5% significance level.
The null distribution can be changed from a normal distribution to
an arbitrary discrete or continuous distribution. See the syntax for
specifying optional argument name/value pairs below.
The test is performed by grouping the data into bins, calculating
the observed and expected counts for those bins, and computing the
chi-square test statistic
\chi^2 = \sum_{i=1}^{N} \frac{(O_i - E_i)^2}{E_i}
where Oi are the observed counts and Ei are the expected counts. The
statistic has an approximate chi-square distribution when the counts
are sufficiently large. Bins in either tail with an expected count less
than 5 are pooled with neighboring bins until the count in each extreme
bin is at least 5. If bins remain in the interior with counts less than 5,
chi2gof displays a warning. In this case, you should use fewer bins,
or provide bin centers or edges, to increase the expected counts in all
bins. (See the syntax for specifying optional argument name/value pairs
below.) chi2gof sets the number of bins, nbins, to 10 by default, and
compares the test statistic to a chi-square distribution with nbins – 3
degrees of freedom to take into account the two estimated parameters.
[h,p] = chi2gof(...) also returns the p value of the test, p. The p
value is the probability, under assumption of the null hypothesis, of
observing the given statistic or one more extreme.
[h,p,stats] = chi2gof(...) also returns a structure stats with
the following fields:
• chi2stat — The chi-square statistic
• df — Degrees of freedom
• edges — Vector of bin edges after pooling
• O — Observed count in each bin
• E — Expected count in each bin
[...] = chi2gof(X,'Name',value) specifies one or more optional
argument name/value pairs chosen from the following lists. Argument
names are case insensitive and partial matches are allowed. Specify
Name in single quotes.
The following name/value pairs control the initial binning of the data
before pooling. You should not specify more than one of these options.
• nbins — The number of bins to use. Default is 10.
• ctrs — A vector of bin centers
• edges — A vector of bin edges
The following name/value pairs determine the null distribution for the
test. Do not specify both cdf and expected.
• cdf — A fully specified cumulative distribution function. This can
be a function name, a function handle, or a ProbDist object of the
ProbDistUnivParam class or ProbDistUnivKernel class. When
'cdf' is a function name or handle, the distribution function must
take x as its only argument. Alternately, you can provide a cell array
whose first element is a function name or handle, and whose later
elements are parameter values, one per cell. The function must take
x as its first argument, and other parameters as later arguments.
• expected — A vector with one element per bin specifying the
expected counts for each bin.
• nparams — The number of estimated parameters; used to adjust
the degrees of freedom to be nbins – 1 – nparams, where nbins is
the number of bins.
If your cdf or expected input depends on estimated parameters, you
should use nparams to ensure that the degrees of freedom for the test is
correct. If cdf is a cell array, the default value of nparams is the number
of parameters in the array; otherwise the default is 0.
The following name/value pairs control other aspects of the test.
• emin — The minimum allowed expected value for a bin; any bin in
either tail having an expected value less than this amount is pooled
with a neighboring bin. Use the value 0 to prevent pooling. The
default is 5.
• frequency — A vector the same length as x containing the frequency
of the corresponding x values
• alpha — Significance level for the test. The default is 0.05.
Examples
Example 1
Equivalent ways to test against an unspecified normal distribution
with estimated parameters:
x = normrnd(50,5,100,1);
[h,p] = chi2gof(x)
h =
0
p =
0.7532
[h,p] = chi2gof(x,'cdf',@(z)normcdf(z,mean(x),std(x)),'nparams',2)
h =
0
p =
0.7532
[h,p] = chi2gof(x,'cdf',{@normcdf,mean(x),std(x)})
h =
0
p =
0.7532
Example 2
Test against the standard normal:
x = randn(100,1);
[h,p] = chi2gof(x,'cdf',@normcdf)
h =
0
p =
0.9443
Example 3
Test against the standard uniform:
x = rand(100,1);
n = length(x);
edges = linspace(0,1,11);
expectedCounts = n * diff(edges);
[h,p,st] = chi2gof(x,'edges',edges,...
'expected',expectedCounts)
h =
0
p =
0.3191
st =
    chi2stat: 10.4000
          df: 9
       edges: [1x11 double]
           O: [6 11 4 12 15 8 14 9 11 10]
           E: [1x10 double]
Example 4
Test against the Poisson distribution by specifying observed and
expected counts:
bins = 0:5;
obsCounts = [6 16 10 12 4 2];
n = sum(obsCounts);
lambdaHat = sum(bins.*obsCounts)/n;
expCounts = n*poisspdf(bins,lambdaHat);
[h,p,st] = chi2gof(bins,'ctrs',bins,...
'frequency',obsCounts, ...
'expected',expCounts,...
'nparams',1)
h =
0
p =
0.4654
st =
chi2stat: 2.5550
df: 3
edges: [1x6 double]
O: [6 16 10 12 6]
E: [7.0429 13.8041 13.5280 8.8383 6.0284]
See Also
crosstab | lillietest | kstest | chi2cdf | chi2pdf | chi2inv |
chi2stat | chi2rnd
How To
• “Chi-Square Distribution” on page B-12
chi2inv
Purpose
Chi-square inverse cumulative distribution function
Syntax
X = chi2inv(P,V)
Description
X = chi2inv(P,V) computes the inverse of the chi-square cdf with
degrees of freedom specified by V for the corresponding probabilities in
P. P and V can be vectors, matrices, or multidimensional arrays that
have the same size. A scalar input is expanded to a constant array with
the same dimensions as the other inputs.
The degrees of freedom parameters in V must be positive integers, and
the values in P must lie in the interval [0 1].
The inverse chi-square cdf for a given probability p and ν degrees of
freedom is

  x = F^{-1}(p \mid \nu) = \{ x : F(x \mid \nu) = p \}

where

  p = F(x \mid \nu) = \int_0^x \frac{t^{(\nu-2)/2} e^{-t/2}}{2^{\nu/2} \Gamma(\nu/2)} \, dt
and Γ( · ) is the Gamma function. Each element of output X is the value
whose cumulative probability under the chi-square cdf defined by the
corresponding degrees of freedom parameter in V is specified by the
corresponding value in P.
Examples
Find a value that exceeds 95% of the samples from a chi-square
distribution with 10 degrees of freedom.
x = chi2inv(0.95,10)
x =
18.3070
You would observe values greater than 18.3 only 5% of the time by
chance.
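As a further illustration (this sketch is not part of the printed example; the data are simulated), chi-square quantiles give an approximate confidence interval for a normal variance:

x = normrnd(10,2,50,1);                        % simulated sample
n = numel(x);
s2 = var(x);                                   % sample variance
ci = (n-1)*s2 ./ chi2inv([0.975 0.025],n-1)    % approximate 95% interval for sigma^2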
See Also
icdf | chi2cdf | chi2pdf | chi2stat | chi2rnd
How To
• “Chi-Square Distribution” on page B-12
chi2pdf
Purpose
Chi-square probability density function
Syntax
Y = chi2pdf(X,V)
Description
Y = chi2pdf(X,V) computes the chi-square pdf at each of the values
in X using the corresponding degrees of freedom in V. X and V can be
vectors, matrices, or multidimensional arrays that have the same size,
which is also the size of the output Y. A scalar input is expanded to a
constant array with the same dimensions as the other input.
The degrees of freedom parameters in V must be positive integers, and
the values in X must lie on the interval [0 Inf].
The chi-square pdf for a given value x and ν degrees of freedom is
  y = f(x \mid \nu) = \frac{x^{(\nu-2)/2} e^{-x/2}}{2^{\nu/2} \Gamma(\nu/2)}
where Γ( · ) is the Gamma function.
If x is standard normal, then x2 is distributed chi-square with one
degree of freedom. If x1, x2, ..., xn are n independent standard normal
observations, then the sum of the squares of the x’s is distributed
chi-square with n degrees of freedom (and is equivalent to the gamma
density function with parameters ν/2 and 2).
Examples
nu = 1:6;
x = nu;
y = chi2pdf(x,nu)
y =
    0.2420    0.1839    0.1542    0.1353    0.1220    0.1120
The mean of the chi-square distribution is the value of the degrees of
freedom parameter, nu. The above example shows that the probability
density of the mean falls as nu increases.
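A brief sketch (illustrative, not from the printed page) visualizes this by plotting the pdf for several degrees of freedom; the peak flattens as nu grows:

x = 0:0.1:15;
plot(x,chi2pdf(x,2),'-',x,chi2pdf(x,4),'--',x,chi2pdf(x,8),':')
legend('\nu = 2','\nu = 4','\nu = 8')
xlabel('x'); ylabel('f(x|\nu)')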
See Also
pdf | chi2cdf | chi2inv | chi2stat | chi2rnd
How To
• “Chi-Square Distribution” on page B-12
chi2rnd
Purpose
Chi-square random numbers
Syntax
R = chi2rnd(V)
R = chi2rnd(V,m,n,...)
R = chi2rnd(V,[m,n,...])
Description
R = chi2rnd(V) generates random numbers from the chi-square
distribution with degrees of freedom parameters specified by V. V can be
a vector, a matrix, or a multidimensional array. R is the same size as V.
R = chi2rnd(V,m,n,...) or R = chi2rnd(V,[m,n,...]) generates
an m-by-n-by-... array containing random numbers from the chi-square
distribution with degrees of freedom parameter V. V can be a scalar
or an array of the same size as R.
Examples
Note that the first and third commands are the same, but are different
from the second command.
r = chi2rnd(1:6)
r =
    0.0037    3.0377    7.8142    0.9021    3.2019    9.0729

r = chi2rnd(6,[1 6])
r =
    6.5249    2.6226   12.2497    3.0388    6.3133    5.0388

r = chi2rnd(1:6,1,6)
r =
    0.7638    6.0955    0.8273    3.2506    1.5469   10.9197
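As a quick check (an illustrative sketch, not part of the printed example), the sample mean of a large chi-square sample is close to its degrees of freedom parameter:

r = chi2rnd(5,1e5,1);    % 100,000 draws with nu = 5
mean(r)                  % approximately 5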
See Also
random | chi2cdf | chi2pdf | chi2inv | chi2stat
How To
• “Chi-Square Distribution” on page B-12
chi2stat
Purpose
Chi-square mean and variance
Syntax
[M,V] = chi2stat(NU)
Description
[M,V] = chi2stat(NU) returns the mean M and variance V of the
chi-square distribution with degrees of freedom parameters specified
by NU.
The mean of the chi-square distribution is ν, the degrees of freedom
parameter, and the variance is 2ν.
Examples
nu = 1:10;
nu = nu'*nu;
[m,v] = chi2stat(nu)
m =
     1     2     3     4     5     6     7     8     9    10
     2     4     6     8    10    12    14    16    18    20
     3     6     9    12    15    18    21    24    27    30
     4     8    12    16    20    24    28    32    36    40
     5    10    15    20    25    30    35    40    45    50
     6    12    18    24    30    36    42    48    54    60
     7    14    21    28    35    42    49    56    63    70
     8    16    24    32    40    48    56    64    72    80
     9    18    27    36    45    54    63    72    81    90
    10    20    30    40    50    60    70    80    90   100
v =
     2     4     6     8    10    12    14    16    18    20
     4     8    12    16    20    24    28    32    36    40
     6    12    18    24    30    36    42    48    54    60
     8    16    24    32    40    48    56    64    72    80
    10    20    30    40    50    60    70    80    90   100
    12    24    36    48    60    72    84    96   108   120
    14    28    42    56    70    84    98   112   126   140
    16    32    48    64    80    96   112   128   144   160
    18    36    54    72    90   108   126   144   162   180
    20    40    60    80   100   120   140   160   180   200
See Also
chi2cdf | chi2pdf | chi2inv | chi2rnd
How To
• “Chi-Square Distribution” on page B-12
classregtree.children
Purpose
Child nodes
Syntax
C = children(t)
C = children(t,nodes)
Description
C = children(t) returns an n-by-2 array C containing the numbers
of the child nodes for each node in the tree t, where n is the number of
nodes. Leaf nodes have child node 0.
C = children(t,nodes) takes a vector nodes of node numbers and
returns the children for the specified nodes.
Examples
Create a classification tree for Fisher’s iris data:
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1  if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2  class = setosa
3  if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4  if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5  class = virginica
6  if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7  class = virginica
8  class = versicolor
9  class = virginica
view(t)
C = children(t)
C =
     2     3
     0     0
     4     5
     6     7
     0     0
     8     9
     0     0
     0     0
     0     0
References
[1] Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification
and Regression Trees. Boca Raton, FL: CRC Press, 1984.
See Also
classregtree | numnodes | parent
cholcov
Purpose
Cholesky-like covariance decomposition
Syntax
T = cholcov(SIGMA)
[T,num] = cholcov(SIGMA)
[T,num] = cholcov(SIGMA,0)
Description
T = cholcov(SIGMA) computes T such that SIGMA = T'*T. SIGMA
must be square, symmetric, and positive semi-definite. If SIGMA is
positive definite, then T is the square, upper triangular Cholesky factor.
If SIGMA is not positive definite, T is computed from an eigenvalue
decomposition of SIGMA. T is not necessarily triangular or square in this
case. Any eigenvectors whose corresponding eigenvalue is close to zero
(within a small tolerance) are omitted. If any remaining eigenvalues
are negative, T is empty.
[T,num] = cholcov(SIGMA) returns the number num of negative
eigenvalues of SIGMA, and T is empty if num is positive. If num is zero,
SIGMA is positive semi-definite. If SIGMA is not square and symmetric,
num is NaN and T is empty.
[T,num] = cholcov(SIGMA,0) returns num equal to zero if SIGMA
is positive definite, and T is the Cholesky factor. If SIGMA is not
positive definite, num is a positive integer and T is empty. [...] =
cholcov(SIGMA,1) is equivalent to [...] = cholcov(SIGMA).
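For instance (a minimal sketch, not from the printed example; the matrix is made up), the two-output form with the trailing 0 tests whether a matrix is a valid positive definite covariance matrix before using it:

SIGMA = [2 1; 1 2];
[T,num] = cholcov(SIGMA,0);
if num == 0
    disp('SIGMA is positive definite; T is its Cholesky factor.')
end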
Examples
The following 4-by-4 covariance matrix is rank-deficient:
C1 = [2 1 1 2;1 2 1 2;1 1 2 2;2 2 2 3]
C1 =
     2     1     1     2
     1     2     1     2
     1     1     2     2
     2     2     2     3
rank(C1)
ans =
3
Use cholcov to factor C1:
T = cholcov(C1)
T =
   -0.2113    0.7887   -0.5774         0
    0.7887   -0.2113   -0.5774         0
    1.1547    1.1547    1.1547    1.7321

C2 = T'*T
C2 =
    2.0000    1.0000    1.0000    2.0000
    1.0000    2.0000    1.0000    2.0000
    1.0000    1.0000    2.0000    2.0000
    2.0000    2.0000    2.0000    3.0000
Use T to generate random data with the specified covariance:
C3 = cov(randn(1e6,3)*T)
C3 =
    1.9973    0.9982    0.9995    1.9975
    0.9982    1.9962    0.9969    1.9956
    0.9995    0.9969    1.9980    1.9972
    1.9975    1.9956    1.9972    2.9951
See Also
chol | cov
categorical.circshift
Purpose
Shift categorical array circularly
Syntax
B = circshift(A,shiftsize)
Description
B = circshift(A,shiftsize) circularly shifts the values in the
categorical array A by shiftsize elements. shiftsize is a vector of
integer scalars where the n-th element specifies the shift amount for the
n-th dimension of array A. If an element in shiftsize is positive, the
values of A are shifted down (or to the right). If it is negative, the values
of A are shifted up (or to the left).
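A minimal sketch (not from the reference page; the labels and data are illustrative):

A = ordinal([1 2 3 2 1],{'low','med','high'})   % 1-by-5 ordinal array
B = circshift(A,[0 1])                          % shift one position along dimension 2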
See Also
permute | shiftdim
NaiveBayes.CIsNonEmpty property
Purpose
Flag for non-empty classes
Description
The CIsNonEmpty property is a logical vector of length NClasses
specifying which classes are not empty. When the grouping variable
is categorical, it may contain categorical levels that don’t appear in
the elements of the grouping variable. Those levels are empty and
NaiveBayes ignores them for the purposes of training the classifier.
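A minimal sketch (illustrative only; the added level name is hypothetical) shows an empty categorical level being flagged:

load fisheriris
g = nominal(species);
g = addlevels(g,{'unknown'});   % a categorical level with no observations
nb = NaiveBayes.fit(meas,g);
nb.CIsNonEmpty                  % false for the empty 'unknown' level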
classregtree.classcount
Purpose
Class counts
Syntax
P = classcount(t)
P = classcount(t,nodes)
Description
P = classcount(t) returns an n-by-m array P of class counts for the
nodes in the classification tree t, where n is the number of nodes and
m is the number of classes. For any node number i, the class counts
P(i,:) are counts of observations (from the data used in fitting the
tree) from each class satisfying the conditions for node i.
P = classcount(t,nodes) takes a vector nodes of node numbers and
returns the class counts for the specified nodes.
Examples
Create a classification tree for Fisher’s iris data:
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1  if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2  class = setosa
3  if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4  if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5  class = virginica
6  if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7  class = virginica
8  class = versicolor
9  class = virginica
view(t)
P = classcount(t)
P =
    50    50    50
    50     0     0
     0    50    50
     0    49     5
     0     1    45
     0    47     1
     0     2     4
     0    47     0
     0     0     1
References
[1] Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification
and Regression Trees. Boca Raton, FL: CRC Press, 1984.
See Also
classregtree | numnodes
ClassificationBaggedEnsemble
Purpose
Classification ensemble grown by resampling
Description
ClassificationBaggedEnsemble combines a set of trained weak
learner models and data on which these learners were trained. It can
predict ensemble response for new data by aggregating predictions from
its weak learners.
Construction
ens =
fitensemble(X,Y,'bag',nlearn,learners,'type','classification')
creates a bagged classification ensemble. For syntax details,
see the fitensemble reference page.
Properties
CategoricalPredictors
List of categorical predictors. CategoricalPredictors is a
numeric vector with indices from 1 to p, where p is the number of
columns of X.
CombineWeights
String describing how ens combines weak learner weights, either
'WeightedSum' or 'WeightedAverage'.
FitInfo
Numeric array of fit information. The FitInfoDescription
property describes the content of this array.
FitInfoDescription
String describing the meaning of the FitInfo array.
FResample
Numeric scalar between 0 and 1. FResample is the fraction of
training data fitensemble resampled at random for every weak
learner when constructing the ensemble.
Method
String describing the method that creates ens.
ModelParams
Parameters used in training ens.
NTrained
Number of trained weak learners in ens, a scalar.
PredictorNames
Cell array of names for the predictor variables, in the order in
which they appear in X.
ReasonForTermination
String describing the reason fitensemble stopped adding weak
learners to the ensemble.
Replace
Logical value indicating if the ensemble was trained with
replacement (true) or without replacement (false).
ResponseName
String with the name of the response variable Y.
ScoreTransform
Function handle for transforming scores, or string representing
a built-in transformation function. 'none' means no
transformation; equivalently, 'none' means @(x)x. For a list
of built-in transformation functions and the syntax of custom
transformation functions, see ClassificationTree.fit.
Add or change a ScoreTransform function by dot addressing:
ens.ScoreTransform = 'function'
or
ens.ScoreTransform = @function
Trained
Trained learners, a cell array of compact classification models.
TrainedWeights
Numeric vector of trained weights for the weak learners in ens.
TrainedWeights has T elements, where T is the number of weak
learners in learners.
UseObsForLearner
Logical matrix of size N-by-NTrained, where N is the number of
observations in the training data and NTrained is the number
of trained weak learners. UseObsForLearner(I,J) is true if
observation I was used for training learner J, and is false
otherwise.
W
Scaled weights, a vector with length n, the number of rows in X.
The sum of the elements of W is 1.
X
Matrix of predictor values that trained the ensemble. Each
column of X represents one variable, and each row represents one
observation.
Y
Numeric vector, vector of categorical variables (nominal or
ordinal), logical vector, character array, or cell array of strings.
Each row of Y represents the classification of the corresponding
row of X.
Methods
oobEdge       Out-of-bag classification edge
oobLoss       Out-of-bag classification error
oobMargin     Out-of-bag classification margins
oobPredict    Predict out-of-bag response of ensemble
Inherited Methods
compact                Compact classification ensemble
crossval               Cross validate ensemble
resubEdge              Classification edge by resubstitution
resubLoss              Classification error by resubstitution
resubMargin            Classification margins by resubstitution
resubPredict           Predict ensemble response by resubstitution
resume                 Resume training ensemble
edge                   Classification edge
loss                   Classification error
margin                 Classification margins
predict                Predict classification
predictorImportance    Estimates of predictor importance
Copy Semantics
Value. To learn how value classes affect copy operations, see Copying
Objects in the MATLAB Programming Fundamentals documentation.
Examples
Construct a bagged ensemble for the ionosphere data, and examine its
resubstitution loss:
load ionosphere
rng(0,'twister') % for reproducibility
ens = fitensemble(X,Y,'bag',100,'Tree',...
'type','classification');
L = resubLoss(ens)
L =
0
The ensemble does a perfect job classifying its training data.
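As a follow-up sketch (continuing the example above; the exact value depends on the random resampling), the out-of-bag loss gives a less optimistic estimate of generalization error than the resubstitution loss:

oobL = oobLoss(ens)   % out-of-bag classification error, typically small but nonzero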
See Also
ClassificationEnsemble | fitensemble
How To
• “Ensemble Methods” on page 13-50
ClassificationEnsemble
Superclasses
CompactClassificationEnsemble
Purpose
Ensemble classifier
Description
ClassificationEnsemble combines a set of trained weak learner
models and data on which these learners were trained. It can predict
ensemble response for new data by aggregating predictions from its
weak learners. It also stores data used for training and can compute
resubstitution predictions. It can resume training if desired.
Construction
ens = fitensemble(X,Y,method,nlearn,learners) returns an
ensemble model that can predict responses to data. The ensemble
consists of models listed in learners. For more information on the
syntax, see the fitensemble function reference page.
ens = fitensemble(X,Y,method,nlearn,learners,Name,Value)
returns an ensemble model with additional options specified by one or
more Name,Value pair arguments. For more information on the syntax,
see the fitensemble function reference page.
Properties
CategoricalPredictors
List of categorical predictors. CategoricalPredictors is a
numeric vector with indices from 1 to p, where p is the number of
columns of X.
ClassNames
List of the elements in Y with duplicates removed. ClassNames
can be a numeric vector, vector of categorical variables (nominal
or ordinal), logical vector, character array, or cell array of strings.
ClassNames has the same data type as the data in the argument Y.
CombineWeights
String describing how ens combines weak learner weights, either
'WeightedSum' or 'WeightedAverage'.
Cost
Square matrix where Cost(i,j) is the cost of classifying a point
into class j if its true class is i.
FitInfo
Numeric array of fit information. The FitInfoDescription
property describes the content of this array.
FitInfoDescription
String describing the meaning of the FitInfo array.
LearnerNames
Cell array of strings with names of weak learners in the ensemble.
The name of each learner appears just once. For example, if you
have an ensemble of 100 trees, LearnerNames is {'Tree'}.
Method
String describing the method that creates ens.
ModelParams
Parameters used in training ens.
NObservations
Numeric scalar containing the number of observations in the
training data.
NTrained
Number of trained weak learners in ens, a scalar.
PredictorNames
Cell array of names for the predictor variables, in the order in
which they appear in X.
Prior
Prior probabilities for each class. Prior is a numeric vector whose
entries relate to the corresponding ClassNames property.
ReasonForTermination
String describing the reason fitensemble stopped adding weak
learners to the ensemble.
ResponseName
String with the name of the response variable Y.
ScoreTransform
Function handle for transforming scores, or string representing
a built-in transformation function. 'none' means no
transformation; equivalently, 'none' means @(x)x. For a list
of built-in transformation functions and the syntax of custom
transformation functions, see ClassificationTree.fit.
Add or change a ScoreTransform function by dot addressing:
ens.ScoreTransform = 'function'
or
ens.ScoreTransform = @function
Trained
Trained learners, a cell array of compact classification models.
TrainedWeights
Numeric vector of trained weights for the weak learners in ens.
TrainedWeights has T elements, where T is the number of weak
learners in learners.
W
Scaled weights, a vector with length n, the number of rows in X.
The sum of the elements of W is 1.
X
Matrix of predictor values that trained the ensemble. Each
column of X represents one variable, and each row represents one
observation.
Y
Numeric vector, vector of categorical variables (nominal or
ordinal), logical vector, character array, or cell array of strings.
Each row of Y represents the classification of the corresponding
row of X.
Methods
compact         Compact classification ensemble
crossval        Cross validate ensemble
resubEdge       Classification edge by resubstitution
resubLoss       Classification error by resubstitution
resubMargin     Classification margins by resubstitution
resubPredict    Predict ensemble response by resubstitution
resume          Resume training ensemble
Inherited Methods
edge                   Classification edge
loss                   Classification error
margin                 Classification margins
predict                Predict classification
predictorImportance    Estimates of predictor importance
Copy Semantics
Value. To learn how value classes affect copy operations, see Copying
Objects in the MATLAB Programming Fundamentals documentation.
Examples
Construct a boosted classification ensemble for the ionosphere data,
using the AdaBoostM1 method:
load ionosphere
ens = fitensemble(X,Y,'AdaBoostM1',100,'Tree')
ens =
classreg.learning.classif.ClassificationEnsemble:
PredictorNames: {1x34 cell}
CategoricalPredictors: []
ResponseName: 'Response'
ClassNames: {'b' 'g'}
ScoreTransform: 'none'
NObservations: 351
NTrained: 100
Method: 'AdaBoostM1'
LearnerNames: {'Tree'}
ReasonForTermination: [1x77 char]
FitInfo: [100x1 double]
FitInfoDescription: [2x83 char]
Predict the classification of the mean of X:
ypredict = predict(ens,mean(X))
ypredict =
'g'
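Continuing the example (an illustrative follow-up, not part of the printed page), the resubstitution error of the boosted ensemble can be checked with an inherited method:

L = resubLoss(ens)   % classification error on the training data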
See Also
ClassificationTree | fitensemble | RegressionEnsemble |
CompactClassificationEnsemble
How To
• Chapter 13, “Supervised Learning”
ClassificationPartitionedEnsemble
Purpose
Cross-validated classification ensemble
Description
ClassificationPartitionedEnsemble is a set of classification
ensembles trained on cross-validated folds. Estimate the quality of
classification by cross validation using one or more “kfold” methods:
kfoldPredict, kfoldLoss, kfoldMargin, kfoldEdge, and kfoldfun.
Every “kfold” method uses models trained on in-fold observations to
predict response for out-of-fold observations. For example, suppose you
cross validate using five folds. In this case, every training fold contains
roughly 4/5 of the data and every test fold contains roughly 1/5 of the
data. The first model stored in Trained{1} was trained on X and Y
with the first 1/5 excluded, the second model stored in Trained{2} was
trained on X and Y with the second 1/5 excluded, and so on. When you
call kfoldPredict, it computes predictions for the first 1/5 of the data
using the first model, for the second 1/5 of data using the second model,
and so on. In short, response for every observation is computed by
kfoldPredict using the model trained without this observation.
Construction
cvens = crossval(ens) creates a cross-validated ensemble from ens,
a classification ensemble. For syntax details, see the crossval method
reference page.
cvens = fitensemble(X,Y,method,nlearn,learners,name,value)
creates a cross-validated ensemble when name is one of 'crossval',
'kfold', 'holdout', 'leaveout', or 'cvpartition'. For syntax
details, see the fitensemble function reference page.
Properties
CategoricalPredictors
List of categorical predictors. CategoricalPredictors is a
numeric vector with indices from 1 to p, where p is the number of
columns of X.
ClassNames
List of the elements in Y with duplicates removed. ClassNames
can be a numeric vector, vector of categorical variables (nominal
or ordinal), logical vector, character array, or cell array of strings.
ClassNames has the same data type as the data in the argument Y.
Combiner
Cell array of combiners across all folds.
Cost
Square matrix, where Cost(i,j) is the cost of classifying a point
into class j if its true class is i.
CrossValidatedModel
Name of the cross-validated model, a string.
Kfold
Number of folds used in a cross-validated ensemble, a positive
integer.
ModelParams
Object holding parameters of cvens.
NObservations
Number of data points used in training the ensemble, a positive
integer.
NTrainedPerFold
Number of data points used in training each fold of the ensemble,
a positive integer.
Partition
Partition of class cvpartition used in creating the cross-validated
ensemble.
PredictorNames
Cell array of names for the predictor variables, in the order in
which they appear in X.
Prior
Prior probabilities for each class. Prior is a numeric vector whose
entries relate to the corresponding ClassNames property.
ResponseName
Name of the response variable Y, a string.
ScoreTransform
Function handle for transforming scores, or string representing
a built-in transformation function. 'none' means no
transformation; equivalently, 'none' means @(x)x. For a list
of built-in transformation functions and the syntax of custom
transformation functions, see ClassificationTree.fit.
Add or change a ScoreTransform function by dot addressing:
ens.ScoreTransform = 'function'
or
ens.ScoreTransform = @function
Trainable
Cell array of ensembles trained on cross-validation folds. Every
ensemble is full, meaning it contains its training data and weights.
Trained
Cell array of compact ensembles trained on cross-validation folds.
W
Scaled weights, a vector with length n, the number of rows in X.
X
A matrix of predictor values. Each column of X represents one
variable, and each row represents one observation.
Y
A numeric column vector with the same number of rows as X. Each
entry in Y is the response to the data in the corresponding row of X.
Methods
kfoldEdge    Classification edge for observations not used for training
kfoldLoss    Classification loss for observations not used for training
resume       Resume training learners on cross-validation folds
Inherited Methods
kfoldEdge       Classification edge for observations not used for training
kfoldfun        Cross validate function
kfoldLoss       Classification loss for observations not used for training
kfoldMargin     Classification margins for observations not used for training
kfoldPredict    Predict response for observations not used for training
Copy Semantics
Value. To learn how value classes affect copy operations, see Copying
Objects in the MATLAB Programming Fundamentals documentation.
Examples
Evaluate the k-fold cross-validation error for a classification ensemble
that models the Fisher iris data:
load fisheriris
ens = fitensemble(meas,species,'AdaBoostM2',100,'Tree');
cvens = crossval(ens);
L = kfoldLoss(cvens)
L =
0.0533
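As an illustrative follow-up (not in the printed example; the value depends on the random partition), other "kfold" methods apply to the same cross-validated object:

e = kfoldEdge(cvens)   % mean classification margin on out-of-fold observations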
See Also
RegressionPartitionedEnsemble |
ClassificationPartitionedModel | ClassificationEnsemble
How To
• Chapter 13, “Supervised Learning”
ClassificationPartitionedModel
Purpose
Cross-validated classification model
Description
ClassificationPartitionedModel is a set of classification models
trained on cross-validated folds. Estimate the quality of classification
by cross validation using one or more “kfold” methods: kfoldPredict,
kfoldLoss, kfoldMargin, kfoldEdge, and kfoldfun.
Every “kfold” method uses models trained on in-fold observations to
predict response for out-of-fold observations. For example, suppose you
cross validate using five folds. In this case, every training fold contains
roughly 4/5 of the data and every test fold contains roughly 1/5 of the
data. The first model stored in Trained{1} was trained on X and Y
with the first 1/5 excluded, the second model stored in Trained{2} was
trained on X and Y with the second 1/5 excluded, and so on. When you
call kfoldPredict, it computes predictions for the first 1/5 of the data
using the first model, for the second 1/5 of data using the second model,
and so on. In short, response for every observation is computed by
kfoldPredict using the model trained without this observation.
Construction
cvmodel = crossval(tree) creates a cross-validated classification
model from a classification tree.
cvmodel = ClassificationTree.fit(X,Y,name,value) creates a
cross-validated model when name is one of 'crossval', 'kfold',
'holdout', 'leaveout', or 'cvpartition'. For syntax details, see the
ClassificationTree.fit function reference page.
Input Arguments
tree
A classification tree constructed with ClassificationTree.fit.
Properties
CategoricalPredictors
List of categorical predictors. CategoricalPredictors is a
numeric vector with indices from 1 to p, where p is the number of
columns of X.
ClassNames
List of the elements in Y with duplicates removed. ClassNames
can be a numeric vector, vector of categorical variables (nominal
or ordinal), logical vector, character array, or cell array of strings.
ClassNames has the same data type as the data in the argument Y.
Cost
Square matrix, where Cost(i,j) is the cost of classifying a point
into class j if its true class is i.
CrossValidatedModel
Name of the cross-validated model, a string.
Kfold
Number of folds used in cross-validated tree, a positive integer.
ModelParams
Object holding parameters of cvmodel.
Partition
The partition of class cvpartition used in creating the
cross-validated model.
PredictorNames
A cell array of names for the predictor variables, in the order in
which they appear in X.
Prior
Prior probabilities for each class. Prior is a numeric vector whose
entries relate to the corresponding ClassNames property.
ResponseName
Name of the response variable Y, a string.
ScoreTransform
Function handle for transforming scores, or string representing
a built-in transformation function. 'none' means no
transformation; equivalently, 'none' means @(x)x. For a list
of built-in transformation functions and the syntax of custom
transformation functions, see ClassificationTree.fit.
Add or change a ScoreTransform function by dot addressing:
cvmodel.ScoreTransform = 'function'
or
cvmodel.ScoreTransform = @function
Trained
The trained learners, a cell array of compact classification models.
W
The scaled weights, a vector with length n, the number of rows
in X.
X
A matrix of predictor values. Each column of X represents one
variable, and each row represents one observation.
Y
A numeric column vector with the same number of rows as X. Each
entry in Y is the response to the data in the corresponding row of X.
Methods
kfoldEdge       Classification edge for observations not used for training
kfoldfun        Cross validate function
kfoldLoss       Classification loss for observations not used for training
kfoldMargin     Classification margins for observations not used for training
kfoldPredict    Predict response for observations not used for training
Copy Semantics
Value. To learn how value classes affect copy operations, see Copying
Objects in the MATLAB Programming Fundamentals documentation.
Examples
Evaluate the k-fold cross-validation error for a classification model for
the Fisher iris data:
load fisheriris
tree = ClassificationTree.fit(meas,species);
cvtree = crossval(tree);
L = kfoldLoss(cvtree)
L =
0.0600
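A follow-up sketch (illustrative, not part of the printed example) compares the out-of-fold predictions with the true classes:

yhat = kfoldPredict(cvtree);             % predictions from models that did not see each observation
[cm,order] = confusionmat(species,yhat)  % rows: true class, columns: predicted class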
See Also
ClassificationPartitionedEnsemble |
RegressionPartitionedModel
How To
• Chapter 13, “Supervised Learning”
ClassificationTree
Superclasses
CompactClassificationTree
Purpose
Binary decision tree for classification
Description
A decision tree with binary splits for classification. An object of class
ClassificationTree can predict responses for new data with the
predict method. The object contains the data used for training, so can
compute resubstitution predictions.
Construction
tree = ClassificationTree.fit(X,Y) returns a classification tree
based on the input variables (also known as predictors, features, or
attributes) X and output (response) Y. tree is a binary tree, where each
branching node is split based on the values of a column of X.
tree = ClassificationTree.fit(X,Y,Name,Value) fits a tree
with additional options specified by one or more Name,Value pair
arguments. If you use one of the following five options, tree is of
class ClassificationPartitionedModel: 'crossval', 'kfold',
'holdout', 'leaveout', or 'cvpartition'. Otherwise, tree is of class
ClassificationTree.
Input Arguments
X
A matrix of numeric predictor values. Each column of X represents
one variable, and each row represents one observation.
NaN values in X are taken to be missing values. Observations with
all missing values for X are not used in the fit. Observations with
some missing values for X are used to find splits on variables for
which these observations have valid values.
Y
A numeric vector, vector of categorical variables (nominal or
ordinal), logical vector, character array, or cell array of strings.
Each row of Y represents the classification of the corresponding
row of X. For numeric Y, consider using RegressionTree.fit
instead of ClassificationTree.fit.
NaN values in Y are taken to be missing values. Observations with
missing values for Y are not used in the fit.
Name-Value Pair Arguments
Optional comma-separated pairs of Name,Value arguments, where Name
is the argument name and Value is the corresponding value. Name must
appear inside single quotes (''). You can specify several name-value
pair arguments in any order as Name1,Value1,...,NameN,ValueN.
CategoricalPredictors
List of categorical predictors. Pass CategoricalPredictors as
one of:
• A numeric vector with indices from 1 to p, where p is the
number of columns of X.
• A logical vector of length p, where a true entry means that the
corresponding column of X is a categorical variable.
• 'all', meaning all predictors are categorical.
• A cell array of strings, where each element in the array is the
name of a predictor variable. The names must match entries in
PredictorNames values.
• A character matrix, where each row of the matrix is a name
of a predictor variable. The names must match entries in
PredictorNames values. Pad the names with extra blanks so
each row of the character matrix has the same length.
Default: []
ClassNames
Array of class names. Use the data type that exists in Y.
Use ClassNames to order the classes or to select a subset of classes
for training.
Default: The class names that exist in Y
Cost
Square matrix, where Cost(i,j) is the cost of classifying a
point into class j if its true class is i. Alternatively, Cost can
be a structure S having two fields: S.ClassNames containing
the group names as a variable of the same type as Y, and
S.ClassificationCosts containing the cost matrix.
Default: Cost(i,j)=1 if i~=j, and Cost(i,j)=0 if i=j
crossval
If 'on', grows a cross-validated decision tree with 10 folds. You
can use 'kfold', 'holdout', 'leaveout', or 'cvpartition'
parameters to override this cross-validation setting. You can
only use one of these four parameters ('kfold', 'holdout',
'leaveout', or 'cvpartition') at a time when creating a
cross-validated tree.
Alternatively, cross validate tree later using the crossval
method.
Default: 'off'
cvpartition
Partition created with cvpartition to use in a cross-validated
tree. You can only use one of these four options at a time for
creating a cross-validated tree: 'kfold', 'holdout', 'leaveout',
or 'cvpartition'.
holdout
Holdout validation tests the specified fraction of the data, and
uses the rest of the data for training. Specify a numeric scalar
from 0 to 1. You can only use one of these four options at a time for
creating a cross-validated tree: 'kfold', 'holdout', 'leaveout',
or 'cvpartition'.
kfold
Number of folds to use in a cross-validated tree, a positive integer.
You can only use one of these four options at a time for creating
a cross-validated tree: 'kfold', 'holdout', 'leaveout', or
'cvpartition'.
Default: 10
leaveout
Use leave-one-out cross validation by setting to 'on'. You
can only use one of these four options at a time for creating
a cross-validated tree: 'kfold', 'holdout', 'leaveout', or
'cvpartition'.
MergeLeaves
When 'on', ClassificationTree.fit merges leaves that
originate from the same parent node, and that give a sum of risk
values greater than or equal to the risk associated with the parent node.
When 'off', ClassificationTree.fit does not merge leaves.
Default: 'on'
MinLeaf
Each leaf has at least MinLeaf observations per tree
leaf. If you supply both MinParent and MinLeaf,
ClassificationTree.fit uses the setting that gives larger
leaves: MinParent=max(MinParent,2*MinLeaf).
Default: 1
MinParent
Each branch node in the tree has at least MinParent
observations. If you supply both MinParent and MinLeaf,
ClassificationTree.fit uses the setting that gives larger
leaves: MinParent=max(MinParent,2*MinLeaf).
Default: 10
NVarToSample
Number of predictors to select at random for each split. Can
be a positive integer or 'all', which means use all available
predictors.
Default: 'all'
PredictorNames
A cell array of names for the predictor variables, in the order in
which they appear in X.
Default: {'x1','x2',...}
prior
Prior probabilities for each class. Specify as one of:
• A string:
  - 'empirical' determines class probabilities from class
    frequencies in Y. If you pass observation weights, they are
    used to compute the class probabilities.
  - 'uniform' sets all class probabilities equal.
• A vector (one scalar value for each class)
• A structure S with two fields:
  - S.ClassNames containing the class names as a variable of
    the same type as Y
  - S.ClassProbs containing a vector of corresponding
    probabilities
If you set values for both weights and prior, the weights are
renormalized to add up to the value of the prior probability in
the respective class.
Default: 'empirical'
Prune
When 'on', ClassificationTree.fit grows the classification
tree, and computes the optimal sequence of pruned subtrees.
When 'off' ClassificationTree.fit grows the classification
tree without pruning.
Default: 'on'
PruneCriterion
String with the pruning criterion, either 'error' or 'impurity'.
Default: 'error'
ResponseName
Name of the response variable Y, a string.
Default: 'Response'
ScoreTransform
Function handle for transforming scores, or string representing a
built-in transformation function.
String              Formula
'symmetric'         2x – 1
'invlogit'          log(x / (1 – x))
'ismax'             Set score for the class with the largest score
                    to 1, and scores for all other classes to 0.
'symmetricismax'    Set score for the class with the largest score
                    to 1, and scores for all other classes to –1.
'none'              x
'logit'             1/(1 + e^(–x))
'doublelogit'       1/(1 + e^(–2x))
'symmetriclogit'    2/(1 + e^(–x)) – 1
'sign'              –1 for x < 0
                    0 for x = 0
                    1 for x > 0
You can include your own function handle for transforming scores.
Your function should accept a matrix (the original scores) and
return a matrix of the same size (the transformed scores).
Default: 'none'
SplitCriterion
Criterion for choosing a split. One of 'gdi' (Gini’s diversity
index), 'twoing' for the twoing rule, or 'deviance' for maximum
deviance reduction (also known as cross entropy).
Default: 'gdi'
Surrogate
When 'on', ClassificationTree.fit finds surrogate splits
at each branch node. This setting improves the accuracy of
predictions for data with missing values. The setting also enables
you to compute measures of predictive association between
predictors. This setting can use much time and memory.
Default: 'off'
weights
Vector of observation weights. The length of weights is the
number of rows in X. ClassificationTree.fit normalizes the
weights in each class to add up to the value of the prior probability
of the class.
Default: ones(size(X,1),1)
Properties
CategoricalPredictors
List of categorical predictors, a numeric vector with indices from 1
to p, where p is the number of columns of X.
CatSplit
An n-by-2 cell array, where n is the number of nodes in tree.
Each row in CatSplit gives left and right values for a categorical
split. For each branch node j based on a categorical predictor
variable z, the left child is chosen if z is in CatSplit(j,1) and
the right child is chosen if z is in CatSplit(j,2). The splits are
in the same order as nodes of the tree. Find the nodes for these
splits by selecting 'categorical' cuts from top to bottom in the
CutType property.
Children
An n-by-2 array containing the numbers of the child nodes for
each node in tree, where n is the number of nodes. Leaf nodes
have child node 0.
ClassCount
An n-by-k array of class counts for the nodes in tree, where n
is the number of nodes and k is the number of classes. For any
node number i, the class counts ClassCount(i,:) are counts of
observations (from the data used in fitting the tree) from each
class satisfying the conditions for node i.
ClassNames
List of the elements in Y with duplicates removed. ClassNames
can be a numeric vector, vector of categorical variables (nominal
or ordinal), logical vector, character array, or cell array of strings.
ClassNames has the same data type as the data in the argument Y.
ClassProb
An n-by-k array of class probabilities for the nodes in tree, where
n is the number of nodes and k is the number of classes. For any
node number i, the class probabilities ClassProb(i,:) are the
estimated probabilities for each class for a point satisfying the
conditions for node i.
Cost
Square matrix, where Cost(i,j) is the cost of classifying a point
into class j if its true class is i.
CutCategories
An n-by-2 cell array of the categories used at branches in tree,
where n is the number of nodes. For each branch node i based on
a categorical predictor variable x, the left child is chosen if x is
among the categories listed in CutCategories{i,1}, and the right
child is chosen if x is among those listed in CutCategories{i,2}.
Both columns of CutCategories are empty for branch nodes based
on continuous predictors and for leaf nodes.
CutPoint contains the cut points for 'continuous' cuts, and
CutCategories contains the set of categories.
CutPoint
An n-element vector of the values used as cut points in tree,
where n is the number of nodes. For each branch node i based
on a continuous predictor variable x, the left child is chosen if
x<CutPoint(i) and the right child is chosen if x>=CutPoint(i).
CutPoint is NaN for branch nodes based on categorical predictors
and for leaf nodes.
CutPoint contains the cut points for 'continuous' cuts, and
CutCategories contains the set of categories.
CutType
An n-element cell array indicating the type of cut at each node
in tree, where n is the number of nodes. For each node i,
CutType{i} is:
• 'continuous' — If the cut is defined in the form x < v for
a variable x and cut point v.
• 'categorical' — If the cut is defined by whether a variable x
takes a value in a set of categories.
• '' — If i is a leaf node.
CutPoint contains the cut points for 'continuous' cuts, and
CutCategories contains the set of categories.
CutVar
An n-element cell array of the names of the variables used for
branching in each node in tree, where n is the number of nodes.
These variables are sometimes known as cut variables. For leaf
nodes, CutVar contains an empty string.
CutPoint contains the cut points for 'continuous' cuts, and
CutCategories contains the set of categories.
IsBranch
An n-element logical vector that is true for each branch node and
false for each leaf node of tree.
ModelParams
Parameters used in training tree.
NObservations
Number of observations in the training data, a numeric scalar.
NObservations can be less than the number of rows of input data
X when there are missing values in X or response Y.
NodeClass
An n-element cell array with the names of the most probable
classes in each node of tree, where n is the number of nodes in
the tree. Every element of this array is a string equal to one of
the class names in ClassNames.
NodeErr
An n-element vector of the errors of the nodes in tree, where
n is the number of nodes. NodeErr(i) is the misclassification
probability for node i.
NodeProb
An n-element vector of the probabilities of the nodes in tree,
where n is the number of nodes. The probability of a node is
computed as the proportion of observations from the original
data that satisfy the conditions for the node. This proportion is
adjusted for any prior probabilities assigned to each class.
NodeSize
An n-element vector of the sizes of the nodes in tree, where n is
the number of nodes. The size of a node is defined as the number
of observations from the data used to create the tree that satisfy
the conditions for the node.
NumNodes
The number of nodes in tree.
Parent
An n-element vector containing the number of the parent node for
each node in tree, where n is the number of nodes. The parent of
the root node is 0.
PredictorNames
A cell array of names for the predictor variables, in the order in
which they appear in X.
Prior
Prior probabilities for each class. Prior is a numeric vector whose
entries relate to the corresponding ClassNames property.
PruneList
An n-element numeric vector with the pruning levels in each node
of tree, where n is the number of nodes.
ResponseName
String describing the response variable Y.
Risk
An n-element vector of the risk of the nodes in tree, where n is
the number of nodes.
• If tree was grown with SplitCriterion set to either 'gdi'
(default) or 'deviance', then the risk at node i is the impurity
of node i times the node probability. See “Definitions” on page
20-203.
• Otherwise,
Risk(i) = NodeErr(i) * NodeProb(i).
ScoreTransform
Function handle for transforming scores, or string representing
a built-in transformation function. 'none' means no
transformation; equivalently, 'none' means @(x)x.
Add or change a ScoreTransform function by dot addressing:
tree.ScoreTransform = 'function'
or
tree.ScoreTransform = @function
SurrCutCategories
An n-element cell array of the categories used for surrogate
splits in tree, where n is the number of nodes in tree. For
each node k, SurrCutCategories{k} is a cell array. The
length of SurrCutCategories{k} is equal to the number of
surrogate predictors found at this node. Every element of
SurrCutCategories{k} is either an empty string for a continuous
surrogate predictor, or is a two-element cell array with categories
for a categorical surrogate predictor. The first element of this
two-element cell array lists categories assigned to the left child by
this surrogate split, and the second element of this two-element
cell array lists categories assigned to the right child by this
surrogate split. The order of the surrogate split variables at each
node is matched to the order of variables in SurrCutVar. The
optimal-split variable at this node does not appear. For nonbranch
(leaf) nodes, SurrCutCategories contains an empty cell.
SurrCutFlip
An n-element cell array of the numeric cut assignments used
for surrogate splits in tree, where n is the number of nodes in
tree. For each node k, SurrCutFlip{k} is a numeric vector. The
length of SurrCutFlip{k} is equal to the number of surrogate
predictors found at this node. Every element of SurrCutFlip{k}
is either zero for a categorical surrogate predictor, or a numeric
cut assignment for a continuous surrogate predictor. The numeric
cut assignment can be either –1 or +1. For every surrogate split
with a numeric cut C based on a continuous predictor variable Z,
the left child is chosen if Z<C and the cut assignment for this
surrogate split is +1, or if Z≥C and the cut assignment for this
surrogate split is –1. Similarly, the right child is chosen if Z≥C
and the cut assignment for this surrogate split is +1, or if Z<C
and the cut assignment for this surrogate split is –1. The order of
the surrogate split variables at each node is matched to the order
of variables in SurrCutVar. The optimal-split variable at this
node does not appear. For nonbranch (leaf) nodes, SurrCutFlip
contains an empty array.
SurrCutPoint
An n-element cell array of the numeric values used for surrogate
splits in tree, where n is the number of nodes in tree. For each
node k, SurrCutPoint{k} is a numeric vector. The length of
SurrCutPoint{k} is equal to the number of surrogate predictors
found at this node. Every element of SurrCutPoint{k} is either
NaN for a categorical surrogate predictor, or a numeric cut for a
continuous surrogate predictor. For every surrogate split with a
numeric cut C based on a continuous predictor variable Z, the left
child is chosen if Z<C and SurrCutFlip for this surrogate split is
–1. Similarly, the right child is chosen if Z≥C and SurrCutFlip
for this surrogate split is +1, or if Z<C and SurrCutFlip for this
surrogate split is –1. The order of the surrogate split variables
at each node is matched to the order of variables returned by
SurrCutVar. The optimal-split variable at this node does not
appear. For nonbranch (leaf) nodes, SurrCutPoint contains an
empty cell.
SurrCutType
An n-element cell array indicating types of surrogate splits at
each node in tree, where n is the number of nodes in tree. For
each node k, SurrCutType{k} is a cell array with the types of the
surrogate split variables at this node. The variables are sorted by
the predictive measure of association with the optimal predictor
in the descending order, and only variables with the positive
predictive measure are included. The order of the surrogate
split variables at each node is matched to the order of variables
in SurrCutVar. The optimal-split variable at this node does not
appear. For nonbranch (leaf) nodes, SurrCutType contains an
empty cell. A surrogate split type can be either 'continuous' if
the cut is defined in the form Z<V for a variable Z and cut point
V or 'categorical' if the cut is defined by whether Z takes a
value in a set of categories.
SurrCutVar
An n-element cell array of the names of the variables used for
surrogate splits in each node in tree, where n is the number
of nodes in tree. Every element of SurrCutVar is a cell array
with the names of the surrogate split variables at this node. The
variables are sorted by the predictive measure of association
with the optimal predictor in the descending order, and only
variables with the positive predictive measure are included. The
optimal-split variable at this node does not appear. For nonbranch
(leaf) nodes, SurrCutVar contains an empty cell.
SurrVarAssoc
An n-element cell array of the predictive measures of association
for surrogate splits in tree, where n is the number of nodes in
tree. For each node k, SurrVarAssoc{k} is a numeric vector. The
length of SurrVarAssoc{k} is equal to the number of surrogate
predictors found at this node. Every element of SurrVarAssoc{k}
gives the predictive measure of association between the optimal
split and this surrogate split. The order of the surrogate split
variables at each node is the order of variables in SurrCutVar.
The optimal-split variable at this node does not appear. For
nonbranch (leaf) nodes, SurrVarAssoc contains an empty cell.
W
The scaled weights, a vector with length n, the number of rows
in X.
X
A matrix of predictor values. Each column of X represents one
variable, and each row represents one observation.
Y
A numeric vector, vector of categorical variables (nominal or
ordinal), logical vector, character array, or cell array of strings.
Each row of Y represents the classification of the corresponding
row of X.
Methods
compact         Compact tree
crossval        Cross-validated decision tree
cvloss          Classification error by cross validation
fit             Fit classification tree
prune           Produce sequence of subtrees by pruning
resubEdge       Classification edge by resubstitution
resubLoss       Classification error by resubstitution
resubMargin     Classification margins by resubstitution
resubPredict    Predict resubstitution response of tree
template        Create classification template
Inherited Methods
edge                   Classification edge
loss                   Classification error
margin                 Classification margins
meanSurrVarAssoc       Mean predictive measure of association for
                       surrogate splits in decision tree
predict                Predict classification
predictorImportance    Estimates of predictor importance
view                   View tree
Definitions
Impurity and Node Error
ClassificationTree splits nodes based on either impurity or node
error. Impurity means one of several things, depending on your choice
of the SplitCriterion name-value pair:
• Gini’s Diversity Index (gdi) — The Gini index of a node is

  1 - \sum_i p^2(i),
where the sum is over the classes i at the node, and p(i) is the
observed fraction of classes with class i that reach the node. A node
with just one class (a pure node) has Gini index 0; otherwise the Gini
index is positive. So the Gini index is a measure of node impurity.
• Deviance ('deviance') — With p(i) defined as for the Gini index,
the deviance of a node is

  -\sum_i p(i) \log p(i).
A pure node has deviance 0; otherwise, the deviance is positive.
• Twoing rule ('twoing') — Twoing is not a purity measure of a node,
but is a different measure for deciding how to split a node. Let L(i)
denote the fraction of members of class i in the left child node after a
split, and R(i) denote the fraction of members of class i in the right
child node after a split. Choose the split criterion to maximize
  P(L) \, P(R) \left( \sum_i \left| L(i) - R(i) \right| \right)^2,
where P(L) and P(R) are the fractions of observations that split to the
left and right respectively. If the expression is large, the split made
each child node purer. Similarly, if the expression is small, the split
made each child node similar to each other, and hence similar to the
parent node, and so the split did not increase node purity.
• Node error — The node error is the fraction of misclassified classes at
a node. If j is the class with largest number of training samples at
a node, the node error is
1 – p(j).
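The following sketch (illustrative only; the class proportions are made up) evaluates these measures for a single node:

p = [0.7 0.2 0.1];             % observed class fractions at a node (sum to 1)
gini     = 1 - sum(p.^2)       % Gini's diversity index
deviance = -sum(p.*log(p))     % deviance (cross entropy)
nodeErr  = 1 - max(p)          % node error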
Copy Semantics
Value. To learn how value classes affect copy operations, see Copying
Objects in the MATLAB Programming Fundamentals documentation.
Examples
Construct a classification tree for the data in ionosphere.mat:
load ionosphere
tc = ClassificationTree.fit(X,Y)
tc =
ClassificationTree:
PredictorNames: {1x34 cell}
CategoricalPredictors: []
ResponseName: 'Response'
ClassNames: {'b'  'g'}
Cost: [2x2 double]
Prior: [0.3590 0.6410]
ScoreTransform: 'none'
X: [351x34 double]
Y: {351x1 cell}
W: [351x1 double]
ModelParams: [1x1 classreg.learning.modelparams.TreeParams]
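Continuing the example (an illustrative follow-up; the handle shown simply reproduces the built-in 'logit' option), a custom score transform can be attached by dot addressing and is applied by predict:

tc.ScoreTransform = @(s) 1./(1+exp(-s));   % custom handle, equivalent to 'logit'
[label,score] = predict(tc,X(1,:))         % transformed scores for one observation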
See Also
RegressionTree | ClassificationEnsemble |
ClassificationTree.fit | CompactClassificationTree | predict
How To
• “Classification Trees and Regression Trees” on page 13-25
classify
Purpose
Discriminant analysis
Syntax
class = classify(sample,training,group)
class = classify(sample,training,group,'type')
class = classify(sample,training,group,'type',prior)
[class,err] = classify(...)
[class,err,POSTERIOR] = classify(...)
[class,err,POSTERIOR,logp] = classify(...)
[class,err,POSTERIOR,logp,coeff] = classify(...)
Description
class = classify(sample,training,group) classifies each row of the
data in sample into one of the groups in training. (See “Grouped Data”
on page 2-34.) sample and training must be matrices with the same
number of columns. group is a grouping variable for training. Its
unique values define groups; each element defines the group to which
the corresponding row of training belongs. group can be a categorical
variable, a numeric vector, a string array, or a cell array of strings.
training and group must have the same number of rows. classify
treats NaNs or empty strings in group as missing values, and ignores
the corresponding rows of training. The output class indicates the
group to which each row of sample has been assigned, and is of the
same type as group.
class = classify(sample,training,group,'type') allows you to
specify the type of discriminant function. Specify type inside single
quotes. type is one of:
• linear — Fits a multivariate normal density to each group, with a
pooled estimate of covariance. This is the default.
• diaglinear — Similar to linear, but with a diagonal covariance
matrix estimate (naive Bayes classifiers).
• quadratic — Fits multivariate normal densities with covariance
estimates stratified by group.
• diagquadratic — Similar to quadratic, but with a diagonal
covariance matrix estimate (naive Bayes classifiers).
• mahalanobis — Uses Mahalanobis distances with stratified
covariance estimates.
class = classify(sample,training,group,'type',prior) allows
you to specify prior probabilities for the groups. prior is one of:
• A numeric vector the same length as the number of unique values
in group (or the number of levels defined for group, if group is
categorical). If group is numeric or categorical, the order of prior
must correspond to the ordered values in group, or, if group contains
strings, to the order of first occurrence of the values in group.
• A 1-by-1 structure with fields:
  - prob — A numeric vector.
  - group — Of the same type as group, containing unique values
    indicating the groups to which the elements of prob correspond.
  As a structure, prior can contain groups that do not appear in
  group. This can be useful if training is a subset of a larger training
  set. classify ignores any groups that appear in the structure but
  not in the group array.
• The string 'empirical', indicating that group prior probabilities
should be estimated from the group relative frequencies in training.
prior defaults to a numeric vector of equal probabilities, i.e., a uniform
distribution. prior is not used for discrimination by Mahalanobis
distance, except for error rate calculation.
[class,err] = classify(...) also returns an estimate err of the
misclassification error rate based on the training data. classify
returns the apparent error rate, i.e., the percentage of observations in
training that are misclassified, weighted by the prior probabilities
for the groups.
[class,err,POSTERIOR] = classify(...) also returns a matrix
POSTERIOR of estimates of the posterior probabilities that the jth
training group was the source of the ith sample observation, i.e.,
Pr(group j|obs i). POSTERIOR is not computed for Mahalanobis
discrimination.
[class,err,POSTERIOR,logp] = classify(...) also returns a
vector logp containing estimates of the logarithms of the unconditional
predictive probability density of the sample observations, p(obs i) =
∑p(obs i|group j)Pr(group j) over all groups. logp is not computed for
Mahalanobis discrimination.
[class,err,POSTERIOR,logp,coeff] = classify(...) also returns
a structure array coeff containing coefficients of the boundary
curves between pairs of groups. Each element coeff(I,J) contains
information for comparing group I to group J in the following fields:
• type — Type of discriminant function, from the type input.
• name1 — Name of the first group.
• name2 — Name of the second group.
• const — Constant term of the boundary equation (K)
• linear — Linear coefficients of the boundary equation (L)
• quadratic — Quadratic coefficient matrix of the boundary equation
(Q)
For the linear and diaglinear types, the quadratic field is absent,
and a row x from the sample array is classified into group I rather than
group J if 0 < K+x*L. For the other types, x is classified into group I if
0 < K+x*L+x*Q*x'.
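As a sketch of how these fields can be applied by hand (assuming coeff and a
sample matrix from a call with the 'quadratic' type, as in the example below):
K = coeff(1,2).const;
L = coeff(1,2).linear;
Q = coeff(1,2).quadratic;           % this field is absent for 'linear' and 'diaglinear'
x = sample(1,:);                    % one row of the sample matrix
inGroup1 = (K + x*L + x*Q*x') > 0   % true if x falls on group 1's side of the boundary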
Examples
For training data, use Fisher’s sepal measurements for iris versicolor
and virginica:
load fisheriris
SL = meas(51:end,1);
SW = meas(51:end,2);
group = species(51:end);
h1 = gscatter(SL,SW,group,'rb','v^',[],'off');
set(h1,'LineWidth',2)
legend('Fisher versicolor','Fisher virginica',...
'Location','NW')
Classify a grid of measurements on the same scale:
[X,Y] = meshgrid(linspace(4.5,8),linspace(2,4));
X = X(:); Y = Y(:);
[C,err,P,logp,coeff] = classify([X Y],[SL SW],...
group,'quadratic');
Visualize the classification:
hold on;
gscatter(X,Y,C,'rb','.',1,'off');
K = coeff(1,2).const;
L = coeff(1,2).linear;
Q = coeff(1,2).quadratic;
% Function to compute K + L*v + v'*Q*v for multiple vectors
% v=[x;y]. Accepts x and y as scalars or column vectors.
f = @(x,y) K + [x y]*L + sum(([x y]*Q) .* [x y], 2);
h2 = ezplot(f,[4.5 8 2 4]);
set(h2,'Color','m','LineWidth',2)
axis([4.5 8 2 4])
xlabel('Sepal Length')
ylabel('Sepal Width')
title('{\bf Classification with Fisher Training Data}')
References
[1] Krzanowski, W. J. Principles of Multivariate Analysis: A User’s
Perspective. New York: Oxford University Press, 1988.
[2] Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley
& Sons, Inc., 1984.
See Also
classregtree | mahal | NaiveBayes
How To
• “Grouped Data” on page 2-34
classregtree.classname
Purpose
Class names for classification decision tree
Syntax
CNAMES = classname(T)
CNAMES = classname(T,J)
Description
CNAMES = classname(T) returns a cell array of strings with class
names for this classification decision tree.
CNAMES = classname(T,J) takes an array J of class numbers and
returns the class names for the specified numbers.
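Examples
A minimal sketch (not from the original page): list the class names of a
classification tree grown on Fisher’s iris data, then query names by class
number.
load fisheriris
t = classregtree(meas,species);
CNAMES = classname(t)      % expected to list setosa, versicolor, and virginica
classname(t,[1 3])         % names for class numbers 1 and 3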
See Also
classregtree
TreeBagger.ClassNames property
Purpose
Names of classes
Description
The ClassNames property is a cell array containing the class names for
the response variable Y. This property is empty for regression trees.
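Examples
A minimal sketch (assuming a small bagged classification ensemble grown on
Fisher’s iris data):
load fisheriris
B = TreeBagger(5,meas,species,'Method','classification');
B.ClassNames               % cell array of the three species names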
classregtree.classprob
Purpose
Class probabilities
Syntax
P = classprob(t)
P = classprob(t,nodes)
Description
P = classprob(t) returns an n-by-m array P of class probabilities for
the nodes in the classification tree t, where n is the number of nodes
and m is the number of classes. For any node number i, the class
probabilities P(i,:) are the estimated probabilities for each class for a
point satisfying the conditions for node i.
P = classprob(t,nodes) takes a vector nodes of node numbers and
returns the class probabilities for the specified nodes.
Examples
Create a classification tree for Fisher’s iris data:
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1  if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2  class = setosa
3  if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4  if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5  class = virginica
6  if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7  class = virginica
8  class = versicolor
9  class = virginica
view(t)
P = classprob(t)
P =
    0.3333    0.3333    0.3333
    1.0000         0         0
         0    0.5000    0.5000
         0    0.9074    0.0926
         0    0.0217    0.9783
         0    0.9792    0.0208
         0    0.3333    0.6667
         0    1.0000         0
         0         0    1.0000
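A brief sketch of the second syntax, querying only selected nodes:
classprob(t,[4 5])     % class probabilities for nodes 4 and 5 (rows 4 and 5 of P above)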
References
[1] Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification
and Regression Trees. Boca Raton, FL: CRC Press, 1984.
See Also
classregtree | numnodes
classregtree
Purpose
Classification and regression trees
Construction
classregtree
Construct classification and regression trees
Methods
catsplit
Categorical splits used for branches in decision tree
children
Child nodes
classcount
Class counts
classname
Class names for classification decision tree
classprob
Class probabilities
cutcategories
Cut categories
cutpoint
Decision tree cut point values
cuttype
Cut types
cutvar
Cut variable names
disp
Display classregtree object
display
Display classregtree object
eval
Predicted responses
isbranch
Test node for branch
meansurrvarassoc
Mean predictive measure of association for surrogate splits in decision tree
nodeclass
Class values of nodes of classification tree
nodeerr
Return vector of node errors
nodemean
Mean values of nodes of regression tree
nodeprob
Node probabilities
nodesize
Return node size
numnodes
Number of nodes
parent
Parent node
prune
Prune tree
prunelist
Pruning levels for decision tree nodes
risk
Node risks
subsasgn
Subscripted assignment for classregtree object
subsref
Subscripted reference for classregtree object
surrcutcategories
Categories used for surrogate splits in decision tree
surrcutflip
Numeric cutpoint assignments used for surrogate splits in decision tree
surrcutpoint
Cutpoints used for surrogate splits in decision tree
surrcuttype
Types of surrogate splits used at branches in decision tree
surrcutvar
Variables used for surrogate splits in decision tree
surrvarassoc
Predictive measure of association for surrogate splits in decision tree
test
Error rate
type
Tree type
varimportance
Compute embedded estimates of input feature importance
view
Plot tree
Properties
Objects of the classregtree class have no properties accessible by dot
indexing, get methods, or set methods. To obtain information about a
classregtree object, use the appropriate method.
Copy
Semantics
Value. To learn how this affects your use of the class, see Comparing
Handle and Value Classes in the MATLAB Object-Oriented
Programming documentation.
How To
• “Ensemble Methods” on page 13-50
• “Classification Trees and Regression Trees” on page 13-25
• “Grouped Data” on page 2-34
classregtree
Purpose
Construct classification and regression trees
Syntax
t = classregtree(X,y)
t = classregtree(X,y,'Name',value)
Description
t = classregtree(X,y) creates a decision tree t for predicting the
response y as a function of the predictors in the columns of X. X is an
n-by-m matrix of predictor values. If y is a vector of n response values,
classregtree performs regression. If y is a categorical variable,
character array, or cell array of strings, classregtree performs
classification. Either way, t is a binary tree where each branching node
is split based on the values of a column of X. NaN values in X or y are
taken to be missing values. Observations with all missing values for
X or missing values for y are not used in the fit. Observations with
some missing values for X are used to find splits on variables for which
these observations have valid values.
t = classregtree(X,y,'Name',value) specifies one or more optional
parameter name/value pairs. Specify Name in single quotes. The
following options are available:
For all trees:
• categorical — Vector of indices of the columns of X that are to be
treated as unordered categorical variables
• method — Either 'classification' (default if y is text or a
categorical variable) or 'regression' (default if y is numeric).
• names — A cell array of names for the predictor variables, in the
order in which they appear in the X from which the tree was created.
• prune — 'on' (default) to compute the full tree and the optimal
sequence of pruned subtrees, or 'off' for the full tree without
pruning.
• minparent — A number k such that impure nodes must have k or
more observations to be split (default is 10).
• minleaf — A minimal number of observations per tree leaf (default
is 1). If you supply both 'minparent' and 'minleaf', classregtree
uses the setting which results in larger leaves: minparent =
max(minparent,2*minleaf)
• mergeleaves — 'on' (default) to merge leaves that originate from
the same parent node and whose combined risk is greater than or equal
to the risk associated with the parent node. If 'off', classregtree
does not merge leaves.
• nvartosample — Number of predictor variables randomly selected
for each split. By default all variables are considered for each
decision split.
• stream — Random number stream. Default is the MATLAB default
random number stream.
• surrogate — 'on' to find surrogate splits at each branch node.
Default is 'off'. If you set this parameter to 'on', classregtree can
run significantly slower and consume significantly more memory.
• weights — Vector of observation weights. By default the weight of
every observation is 1. The length of this vector must be equal to
the number of rows in X.
For regression trees only:
• qetoler — Defines tolerance on quadratic error per node for
regression trees. Splitting nodes stops when quadratic error per
node drops below qetoler*qed, where qed is the quadratic error for
the entire data computed before the decision tree is grown: qed =
norm(y-ybar) with ybar estimated as the average of the input array
Y. Default value is 1e-6.
For classification trees only:
• cost — Square matrix C, where C(i,j) is the cost of classifying a
point into class j if its true class is i (default has C(i,j)=1 if i~=j,
and C(i,j)=0 if i=j). Alternatively, this value can be a structure
S having two fields: S.group containing the group names as a
categorical variable, character array, or cell array of strings; and
S.cost containing the cost matrix C.
• splitcriterion — Criterion for choosing a split. One of 'gdi'
(default) for Gini’s diversity index, 'twoing' for the twoing rule, or
'deviance' for maximum deviance reduction.
• priorprob — Prior probabilities for each class, specified as a string
('empirical' or 'equal'), as a vector (one value for each distinct
group name), or as a structure S with two fields:
  - S.group containing the group names as a categorical variable,
    character array, or cell array of strings
  - S.prob containing a vector of corresponding probabilities.
If the input value is 'empirical' (default), class probabilities are
determined from class frequencies in Y. If the input value is 'equal',
all class probabilities are set equal. If both observation weights and
class prior probabilities are supplied, the weights are renormalized to
add up to the value of the prior probability in the respective class.
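As a brief sketch of how several of the options above can be combined (the cost
matrix values here are assumptions chosen only for illustration; the row and
column order of a plain cost matrix is assumed to follow the order of the class
names):
load fisheriris
C = ones(3) - eye(3);              % assumed misclassification costs
C(3,2) = 2;                        % make one type of error twice as costly
t = classregtree(meas,species,'names',{'SL' 'SW' 'PL' 'PW'},...
    'minparent',20,'prune','on',...
    'cost',C,'priorprob','equal');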
Examples
Create a classification tree for Fisher’s iris data:
load fisheriris;
t = classregtree(meas,species,...
'names',{'SL' 'SW' 'PL' 'PW'})
t =
Decision tree for classification
1  if PL<2.45 then node 2 elseif PL>=2.45 then node 3 else setosa
2  class = setosa
3  if PW<1.75 then node 4 elseif PW>=1.75 then node 5 else versicolor
4  if PL<4.95 then node 6 elseif PL>=4.95 then node 7 else versicolor
5  class = virginica
6  if PW<1.65 then node 8 elseif PW>=1.65 then node 9 else versicolor
7  class = virginica
8  class = versicolor
9  class = virginica
view(t)
References
[1] Breiman, L., J. Friedman, R. Olshen, and C. Stone. Classification
and Regression Trees. Boca Raton, FL: CRC Press, 1984.
See Also
eval | prune | test | view
How To
• “Grouped Data” on page 2-34
• “Ensemble Methods” on page 13-50
NaiveBayes.CLevels property
Purpose
Class levels
Description
The CLevels property is a vector of the same type as the grouping
variable, containing the unique levels of the grouping variable.
cluster
Purpose
Construct agglomerative clusters from linkages
Syntax
T = cluster(Z,'cutoff',c)
T = cluster(Z,'cutoff',c,'depth',d)
T = cluster(Z,'cutoff',c,'criterion',criterion)
T = cluster(Z,'maxclust',n)
Description
T = cluster(Z,'cutoff',c) constructs clusters from the
agglomerative hierarchical cluster tree, Z, as generated by the linkage
function. Z is a matrix of size (m – 1)-by-3, where m is the number of
observations in the original data. c is a threshold for cutting Z into
clusters. Clusters are formed when a node and all of its subnodes have
inconsistent values less than c. All leaves at or below the node are
grouped into a cluster. T is a vector of size m containing the cluster
assignments of each observation.
If c is a vector, T is a matrix of cluster assignments with one column
per cutoff value.
T = cluster(Z,'cutoff',c,'depth',d) evaluates inconsistent values
by looking to a depth d below each node. The default depth is 2.
T = cluster(Z,'cutoff',c,'criterion',criterion) uses the
specified criterion for forming clusters, where criterion is one of the
strings 'inconsistent' (default) or 'distance'. The 'distance'
criterion uses the distance between the two subnodes merged at a node
to measure node height. All leaves at or below a node with height less
than c are grouped into a cluster.
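For instance, a minimal sketch (with random data and an assumed height
threshold) of cutting by distance rather than by inconsistency:
X = rand(30,2);
Z = linkage(pdist(X));
T = cluster(Z,'cutoff',0.4,'criterion','distance');   % group leaves below height 0.4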
T = cluster(Z,'maxclust',n) constructs a maximum of n clusters
using the 'distance' criterion. cluster finds the smallest height at
which a horizontal cut through the tree leaves n or fewer clusters.
If n is a vector, T is a matrix of cluster assignments with one column
per maximum value.
Examples
Compare clusters from Fisher iris data with species:
load fisheriris
d = pdist(meas);
Z = linkage(d);
c = cluster(Z,'maxclust',3:5);
crosstab(c(:,1),species)
ans =
     0     0     2
     0    50    48
    50     0     0
crosstab(c(:,2),species)
ans =
     0     0     1
     0    50    47
     0     0     2
    50     0     0
crosstab(c(:,3),species)
ans =
     0     4     0
     0    46    47
     0     0     1
     0     0     2
    50     0     0
See Also
clusterdata | cophenet | inconsistent | linkage | pdist
gmdistribution.cluster
Purpose
Construct clusters from Gaussian mixture distribution
Syntax
idx = cluster(obj,X)
[idx,nlogl] = cluster(obj,X)
[idx,nlogl,P] = cluster(obj,X)
[idx,nlogl,P,logpdf] = cluster(obj,X)
[idx,nlogl,P,logpdf,M] = cluster(obj,X)
Description
idx = cluster(obj,X) partitions data in the n-by-d matrix X, where n
is the number of observations and d is the dimension of the data, into
k clusters determined by the k components of the Gaussian mixture
distribution defined by obj. obj is an object created by gmdistribution
or fit. idx is an n-by-1 vector, where idx(I) is the cluster index of
observation I. The cluster index gives the component with the largest
posterior probability for the observation, weighted by the component
probability.
Note The data in X is typically the same as the data used to create
the Gaussian mixture distribution defined by obj. Clustering with
cluster is treated as a separate step, apart from density estimation.
For cluster to provide meaningful clustering with new data, X should
come from the same population as the data used to create obj.
cluster treats NaN values as missing data. Rows of X with NaN values
are excluded from the partition.
[idx,nlogl] = cluster(obj,X) also returns nlogl, the negative
log-likelihood of the data.
[idx,nlogl,P] = cluster(obj,X) also returns the posterior
probabilities of each component for each observation in the n-by-k
matrix P. P(I,J) is the probability of component J given observation I.
[idx,nlogl,P,logpdf] = cluster(obj,X) also returns the n-by-1
vector logpdf containing the logarithm of the estimated probability
density function for each observation. The density estimate for
observation I is a sum over all components of the component density at
I times the component probability.
[idx,nlogl,P,logpdf,M] = cluster(obj,X) also returns an n-by-k
matrix M containing Mahalanobis distances in squared units. M(I,J) is
the Mahalanobis distance of observation I from the mean of component
J.
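As a small sketch (assuming obj and X as constructed in the Examples below),
all five outputs can be requested at once and inspected for a single
observation:
[idx,nlogl,P,logpdf,M] = cluster(obj,X);
P(1,:)            % posterior probability of each component for observation 1
M(1,idx(1))       % squared Mahalanobis distance to the assigned component's mean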
Examples
Generate data from a mixture of two bivariate Gaussian distributions
using the mvnrnd function:
MU1 = [1 2];
SIGMA1 = [2 0; 0 .5];
MU2 = [-3 -5];
SIGMA2 = [1 0; 0 1];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
scatter(X(:,1),X(:,2),10,'.')
hold on
Fit a two-component Gaussian mixture model:
obj = gmdistribution.fit(X,2);
h = ezcontour(@(x,y)pdf(obj,[x y]),[-8 6],[-8 6]);
Use the fit to cluster the data:
idx = cluster(obj,X);
cluster1 = X(idx == 1,:);
cluster2 = X(idx == 2,:);
delete(h)
h1 = scatter(cluster1(:,1),cluster1(:,2),10,'r.');
h2 = scatter(cluster2(:,1),cluster2(:,2),10,'g.');
legend([h1 h2],'Cluster 1','Cluster 2','Location','NW')
See Also
fit | gmdistribution | mahal | posterior
clusterdata
Purpose
Agglomerative clusters from data
Syntax
T = clusterdata(X,cutoff)
T = clusterdata(X,Name,Value)
Description
T = clusterdata(X,cutoff) forms clusters from the data X, using cutoff
either as an inconsistency threshold or as the maximum number of clusters
(see the cutoff input argument and the output argument T below).
T = clusterdata(X,Name,Value) clusters with additional options
specified by one or more Name,Value pair arguments.
Tips
• The centroid and median methods can produce a cluster tree that
is not monotonic. This occurs when the distance from the union
of two clusters, r and s, to a third cluster is less than the distance
between r and s. In this case, in a dendrogram drawn with the
default orientation, the path from a leaf to the root node takes some
downward steps. To avoid this, use another method. The following
image shows a nonmonotonic cluster tree.
In this case, cluster 1 and cluster 3 are joined into a new cluster,
while the distance between this new cluster and cluster 2 is less
than the distance between cluster 1 and cluster 3. This leads to a
nonmonotonic tree.
• You can provide the output T to other functions including
dendrogram to display the tree, cluster to assign points to clusters,
inconsistent to compute inconsistent measures, and cophenet to
compute the cophenetic correlation coefficient.
Input
Arguments
X
Matrix with two or more rows. The rows represent observations,
the columns represent categories or dimensions.
cutoff
When 0 < cutoff < 2, clusterdata forms clusters when
inconsistent values are greater than cutoff (see inconsistent).
When cutoff is an integer ≥ 2, clusterdata interprets cutoff as
the maximum number of clusters to keep in the hierarchical tree
generated by linkage.
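A minimal sketch (with random data) of the two interpretations of cutoff:
X = rand(50,3);
T1 = clusterdata(X,1.15);   % 0 < cutoff < 2: inconsistency threshold
T2 = clusterdata(X,4);      % integer cutoff >= 2: at most four clusters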
Name-Value Pair Arguments
Optional comma-separated pairs of Name,Value arguments, where Name
is the argument name and Value is the corresponding value. Name must
appear inside single quotes (''). You can specify several name-value
pair arguments in any order as Name1,Value1,...,NameN,ValueN.
criterion
Either 'inconsistent' or 'distance'.
cutoff
Cutoff for inconsistent or distance measure, a positive scalar.
When 0 < cutoff < 2, clusterdata forms clusters when
inconsistent values are greater than cutoff (see inconsistent).
When cutoff is an integer ≥ 2, clusterdata interprets cutoff as
the maximum number of clusters to keep in the hierarchical tree
generated by linkage.
depth
Depth for computing inconsistent values, a positive integer.
distance
Any of the distance metric names allowed by pdist (follow the
'minkowski' option by the value of the exponent p):
'euclidean'
Euclidean distance (default).
'seuclidean'
Standardized Euclidean distance. Each coordinate difference between
rows in X is scaled by dividing by the corresponding element of the
standard deviation S=nanstd(X). To specify another value for S, use
D=pdist(X,'seuclidean',S).
'cityblock'
City block metric.
'minkowski'
Minkowski distance. The default exponent is 2. To specify a different
exponent, use D = pdist(X,'minkowski',P), where P is a scalar
positive value of the exponent.
'chebychev'
Chebychev distance (maximum coordinate difference).
'mahalanobis'
Mahalanobis distance, using the sample covariance of X as computed
by nancov. To compute the distance with a different covariance, use
D = pdist(X,'mahalanobis',C), where the matrix C is symmetric
and positive definite.
'cosine'
One minus the cosine of the included angle between points (treated
as vectors).
'correlation'
One minus the sample correlation between points (treated as
sequences of values).
'spearman'
One minus the sample Spearman’s rank correlation between
observations (treated as sequences of values).
'hamming'
Hamming distance, which is the percentage of coordinates that differ.
'jaccard'
One minus the Jaccard coefficient, which is the percentage of nonzero
coordinates that differ.
custom distance function
A distance function specified using @:
D = pdist(X,@distfun)
A distance function must be of form
d2 = distfun(XI,XJ)
taking as arguments a 1-by-n vector XI, corresponding to a single
row of X, and an m2-by-n matrix XJ, corresponding to multiple rows
of X. distfun must accept a matrix XJ with an arbitrary number of
rows. distfun must return an m2-by-1 vector of distances d2, whose
kth element is the distance between XI and XJ(k,:).
linkage
Any of the linkage methods allowed by the linkage function:
• 'average'
• 'centroid'
• 'complete'
• 'median'
• 'single'
• 'ward'
• 'weighted'
For details, see the definitions in the linkage function reference
page.
maxclust
Maximum number of clusters to form, a positive integer.
savememory
A string, either 'on' or 'off'. When applicable, the 'on' setting
causes clusterdata to construct clusters without computing the
distance matrix. savememory is applicable when:
• linkage is 'centroid', 'median', or 'ward'
• distance is 'euclidean' (default)
When savememory is 'on', linkage run time is proportional
to the number of dimensions (number of columns of X).
When savememory is 'off', linkage memory requirement is
proportional to N², where N is the number of observations. So
choosing the best (least-time) setting for savememory depends
on the problem dimensions, number of observations, and
available memory. The default savememory setting is a rough
approximation of an optimal setting.
Default: 'on' when X has 20 columns or fewer, or the computer
does not have enough memory to store the distance matrix;
otherwise 'off'
Output
Arguments
T
T is a vector of size m containing a cluster number for each
observation.
• When 0 < cutoff < 2, T = clusterdata(X,cutoff) is
equivalent to:
Y = pdist(X,'euclid');
Z = linkage(Y,'single');
T = cluster(Z,'cutoff',cutoff);
• When cutoff is an integer ≥ 2, T = clusterdata(X,cutoff)
is equivalent to:
Y = pdist(X,'euclid');
Z = linkage(Y,'single');
T = cluster(Z,'maxclust',cutoff);
Examples
The example first creates a sample data set of random numbers. It then
uses clusterdata to compute the distances between items in the data
set and create a hierarchical cluster tree from the data set. Finally,
the clusterdata function groups the items in the data set into three
clusters. The example uses the find function to list all the items in
cluster 2, and the scatter3 function to plot the data with each cluster
shown in a different color.
X = [gallery('uniformdata',[10 3],12);...
gallery('uniformdata',[10 3],13)+1.2;...
gallery('uniformdata',[10 3],14)+2.5];
T = clusterdata(X,'maxclust',3);
find(T==2)
ans =
11
12
13
14
15
16
17
18
19
20
scatter3(X(:,1),X(:,2),X(:,3),100,T,'filled')
Create a hierarchical cluster tree for data with 20,000 observations
using Ward’s linkage. If you set savememory to 'off', you can get an
out-of-memory error if your machine doesn’t have enough memory to
hold the distance matrix.
X = rand(20000,3);
c = clusterdata(X,'linkage','ward','savememory','on',...
'maxclust',4);
scatter3(X(:,1),X(:,2),X(:,3),10,c)
See Also
cluster | inconsistent | kmeans | linkage | pdist
cmdscale
Purpose
Classical multidimensional scaling
Syntax
Y = cmdscale(D)
[Y,e] = cmdscale(D)
Description
Y = cmdscale(D) takes an n-by-n distance matrix D, and returns an
n-by-p configuration matrix Y. Rows of Y are the coordinates of n points
in p-dimensional space for some p < n. When D is a Euclidean distance
matrix, the distances between those points are given by D. p is the
dimension of the smallest space in which the n points whose inter-point
distances are given by D can be embedded.
[Y,e] = cmdscale(D) also returns the eigenvalues of Y*Y'. When D is
Euclidean, the first p elements of e are positive, the rest zero. If the first
k elements of e are much larger than the remaining (n-k), then you can
use the first k columns of Y as k-dimensional points whose inter-point
distances approximate D. This can provide a useful dimension reduction
for visualization, e.g., for k = 2.
D need not be a Euclidean distance matrix. If it is non-Euclidean or a
more general dissimilarity matrix, then some elements of e are negative,
and cmdscale chooses p as the number of positive eigenvalues. In this
case, the reduction to p or fewer dimensions provides a reasonable
approximation to D only if the negative elements of e are small in
magnitude.
You can specify D as either a full dissimilarity matrix, or in upper
triangle vector form such as is output by pdist. A full dissimilarity
matrix must be real and symmetric, and have zeros along the diagonal
and positive elements everywhere else. A dissimilarity matrix in upper
triangle form must have real, positive entries. You can also specify D
as a full similarity matrix, with ones along the diagonal and all other
elements less than one. cmdscale transforms a similarity matrix to a
dissimilarity matrix in such a way that distances between the points
returned in Y equal or approximate sqrt(1-D). To use a different
transformation, you must transform the similarities prior to calling
cmdscale.
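For example, a minimal sketch (with assumed random data) of passing a
similarity matrix, here an absolute correlation matrix, which has ones on its
diagonal and all other entries less than one:
X = randn(100,5);
S = abs(corr(X));                 % treat absolute correlations between the 5 variables as similarities
[Y,e] = cmdscale(S);
% Distances between the rows of Y approximate sqrt(1-S):
max(max(abs(squareform(pdist(Y)) - sqrt(1-S))))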
Examples
Generate some points in 4-D space, but close to 3-D space, then reduce
them to distances only.
X = [normrnd(0,1,10,3) normrnd(0,.1,10,1)];
D = pdist(X,'euclidean');
Find a configuration with those inter-point distances.
[Y,e] = cmdscale(D);
% Four, but fourth one small
dim = sum(e > eps^(3/4))
% Poor reconstruction
maxerr2 = max(abs(pdist(X)-pdist(Y(:,1:2))))
% Good reconstruction
maxerr3 = max(abs(pdist(X)-pdist(Y(:,1:3))))
% Exact reconstruction
maxerr4 = max(abs(pdist(X)-pdist(Y)))
% D is now non-Euclidean
D = pdist(X,'cityblock');
[Y,e] = cmdscale(D);
% One is large negative
min(e)
% Poor reconstruction
maxerr = max(abs(pdist(X)-pdist(Y)))
References
[1] Seber, G. A. F. Multivariate Observations. Hoboken, NJ: John Wiley
& Sons, Inc., 1984.
See Also
mdscale | pdist | procrustes
NaiveBayes.CNames property
Purpose
Class names
Description
The CNames property is an NClasses-by-1 cell array containing the
group names, where NClasses is the number of groups in the grouping
variable used to create the Naive Bayes classifier.
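Examples
A minimal sketch: group names of a classifier trained on Fisher’s iris data.
load fisheriris
nb = NaiveBayes.fit(meas,species);
nb.CNames                  % 3-by-1 cell array of the species names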
CompactTreeBagger.combine
Purpose
Combine two ensembles
Syntax
B1 = combine(B1,B2)
Description
B1 = combine(B1,B2) appends decision trees from ensemble B2 to
those stored in B1 and returns ensemble B1. This method requires that
the class and variable names be identical in both ensembles.
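Examples
A minimal sketch: grow two small bagged ensembles on the same data, compact
them, and combine the compact versions.
load fisheriris
b1 = TreeBagger(10,meas,species,'Method','classification');
b2 = TreeBagger(10,meas,species,'Method','classification');
c12 = combine(compact(b1),compact(b2));
c12.NTrees                 % 20 trees after combining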
See Also
TreeBagger.append
combnk
Purpose
Enumeration of combinations
Syntax
C = combnk(v,k)
Description
C = combnk(v,k) returns all combinations of the n elements in v taken
k at a time.
C = combnk(v,k) produces a matrix C with k columns and n!/(k!(n–k)!)
rows, where each row contains k of the elements in the vector v.
It is not practical to use this function if v has more than about 15
elements.
Examples
Combinations of characters from a string.
C = combnk('tendril',4);
last5 = C(31:35,:)
last5 =
tedr
tenl
teni
tenr
tend
Combinations of elements from a numeric vector.
c = combnk(1:4,2)
c =
     3     4
     2     4
     2     3
     1     4
     1     3
     1     2
See Also
perms
ClassificationEnsemble.compact
Purpose
Compact classification ensemble
Syntax
cens = compact(ens)
Description
cens = compact(ens) creates a compact version of ens. You can
predict classifications using cens exactly as you can using ens.
However, since cens does not contain training data, you cannot perform
some actions, such as cross validation.
Input
Arguments
ens
A classification ensemble created with fitensemble.
Output
Arguments
cens
A compact classification ensemble. cens has class
CompactClassificationEnsemble.
Examples
Compare the size of a classification ensemble for Fisher’s iris data to
the compact version of the ensemble:
load fisheriris
ens = fitensemble(meas,species,'AdaBoostM2',100,'Tree');
cens = compact(ens);
b = whos('ens'); % b.bytes = size of ens
c = whos('cens'); % c.bytes = size of cens
[b.bytes c.bytes] % shows cens uses less memory
ans =
      571727      532476
See Also
ClassificationTree | fitensemble
How To
• “Ensemble Methods” on page 13-50
ClassificationTree.compact
Purpose
Compact tree
Syntax
ctree = compact(tree)
Description
ctree = compact(tree) creates a compact version of tree.
Input
Arguments
tree
A classification tree created using ClassificationTree.fit.
Output
Arguments
ctree
A compact decision tree. ctree has class
CompactClassificationTree. You can predict classifications
using ctree exactly as you can using tree. However, since ctree
does not contain training data, you cannot perform some actions,
such as cross validation.
Examples
Compare the size of the classification tree for Fisher’s iris data to the
compact version of the tree:
load fisheriris
fulltree = ClassificationTree.fit(meas,species);
ctree = compact(fulltree);
b = whos('fulltree'); % b.bytes = size of fulltree
c = whos('ctree'); % c.bytes = size of ctree
[b.bytes c.bytes] % shows ctree uses half the memory
ans =
       13913        6818
See Also
CompactClassificationTree | ClassificationTree | predict
How To
• Chapter 13, “Supervised Learning”
RegressionEnsemble.compact
Purpose
Create compact regression ensemble
Syntax
cens = compact(ens)
Description
cens = compact(ens) creates a compact version of ens. You can
predict regressions using cens exactly as you can using ens. However,
since cens does not contain training data, you cannot perform some
actions, such as cross validation.
Input
Arguments
ens
A regression ensemble created with fitensemble.
Output
Arguments
cens
A compact regression ensemble. cens is of class
CompactRegressionEnsemble.
Examples
Compare the size of a regression ensemble for the carsmall data to the
compact version of the ensemble:
load carsmall
X = [Acceleration Cylinders Displacement Horsepower Weight];
ens = fitensemble(X,MPG,'LSBoost',100,'Tree');
cens = compact(ens);
b = whos('ens'); % b.bytes = size of ens
c = whos('cens'); % c.bytes = size of cens
[b.bytes c.bytes] % shows cens uses less memory
ans =
      311789      287368
See Also
RegressionEnsemble | CompactRegressionEnsemble
How To
• Chapter 13, “Supervised Learning”
RegressionTree.compact
Purpose
Compact regression tree
Syntax
ctree = compact(tree)
Description
ctree = compact(tree) creates a compact version of tree.
Input
Arguments
tree
A regression tree created using RegressionTree.fit.
Output
Arguments
ctree
A compact regression tree. ctree has class
CompactRegressionTree. You can predict regressions
using ctree exactly as you can using tree. However, since ctree
does not contain training data, you cannot perform some actions,
such as cross validation.
Examples
Compare the size of a regression tree for the carsmall data to the
compact version of the tree:
load carsmall
X = [Acceleration Cylinders Displacement Horsepower Weight];
fulltree = RegressionTree.fit(X,MPG);
ctree = compact(fulltree);
b = whos('fulltree'); % b.bytes = size of fulltree
c = whos('ctree'); % c.bytes = size of ctree
[b.bytes c.bytes] % shows ctree uses 2/3 the memory
ans =
       15715       10258
See Also
CompactRegressionTree | RegressionTree | predict
How To
• Chapter 13, “Supervised Learning”