Multivariate Methods
Version 12.1

“The real voyage of discovery consists not in seeking new landscapes, but in having new eyes.”
Marcel Proust

JMP, A Business Unit of SAS
SAS Campus Drive
Cary, NC 27513
The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. JMP® 12 Multivariate Methods. Cary, NC: SAS Institute Inc.
JMP® 12 Multivariate Methods
Copyright © 2015, SAS Institute Inc., Cary, NC, USA
ISBN 978‐1‐62959‐458‐3 (Hardcopy)
ISBN 978‐1‐62959‐460‐6 (EPUB)
ISBN 978‐1‐62959‐461‐3 (MOBI)
ISBN 978‐1‐62959‐459‐0 (PDF)
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202‐1(a), DFAR 227.7202‐3(a) and DFAR 227.7202‐4 and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227‐19 (DEC 2007). If FAR 52.227‐19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government’s rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513‐2414.
March 2015
July 2015
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Technology License Notices
• Scintilla - Copyright © 1998-2014 by Neil Hodgson <neilh@scintilla.org>.
All Rights Reserved.
Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation.
NEIL HODGSON DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL NEIL HODGSON BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
• Telerik RadControls: Copyright © 2002-2012, Telerik. Usage of the included Telerik RadControls outside of JMP is not permitted.
• ZLIB Compression Library - Copyright © 1995-2005, Jean-Loup Gailly and Mark Adler.
• Made with Natural Earth. Free vector and raster map data @ naturalearthdata.com.
• Packages - Copyright © 2009-2010, Stéphane Sudre (s.sudre.free.fr). All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the WhiteBox nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
• iODBC software - Copyright © 1995-2006, OpenLink Software Inc and Ke Jin (www.iodbc.org). All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
‒ Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
‒ Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
‒ Neither the name of OpenLink Software Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL OPENLINK OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
• bzip2, the associated library “libbzip2”, and all documentation, are Copyright © 1996-2010, Julian R Seward. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
• R software is Copyright © 1999-2012, R Foundation for Statistical Computing.
• MATLAB software is Copyright © 1984-2012, The MathWorks, Inc. Protected by U.S. and international patents. See www.mathworks.com/patents. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.
• libopc is Copyright © 2011, Florian Reuter. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
‒ Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
‒ Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and / or other materials provided with the distribution.
‒ Neither the name of Florian Reuter nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
• libxml2 - Except where otherwise noted in the source code (e.g. the files hash.c, list.c and the trio files, which are covered by a similar licence but with different Copyright notices) all the files are:
Copyright © 1998 ‐ 2003 Daniel Veillard. All Rights Reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.IN NO EVENT SHALL THE DANIEL VEILLARD BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Except as contained in this notice, the name of Daniel Veillard shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization from him.
Get the Most from JMP®
Whether you are a first‐time or a long‐time user, there is always something to learn about JMP.
Visit JMP.com to find the following:
• live and recorded webcasts about how to get started with JMP
• video demos and webcasts of new features and advanced techniques
• details on registering for JMP training
• schedules for seminars being held in your area
• success stories showing how others use JMP
• a blog with tips, tricks, and stories from JMP staff
• a forum to discuss JMP with other users
http://www.jmp.com/getstarted/
Contents
1  Learn about JMP
Documentation and Additional Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Formatting Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
JMP Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
JMP Documentation Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
JMP Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Additional Resources for Learning JMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Sample Data Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Learn about Statistical and JSL Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Learn JMP Tips and Tricks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Tooltips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
JMP User Community . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
JMPer Cable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
JMP Books by Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
The JMP Starter Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2  Introduction to Multivariate Analysis
Overview of Multivariate Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3  Correlations and Multivariate Techniques
Explore the Multidimensional Behavior of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Launch the Multivariate Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
The Multivariate Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Multivariate Platform Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Nonparametric Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Scatterplot Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Outlier Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Item Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Impute Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Example of Item Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Computations and Statistical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Pearson Product‐Moment Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Nonparametric Measures of Association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Inverse Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Cronbach’s α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4  Cluster Analysis
Identify and Explore Groups of Similar Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Clustering Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Example of Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Launch the Cluster Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Hierarchical Cluster Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Hierarchical Cluster Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
K‐Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
K‐Means Control Panel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
K‐Means Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Normal Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Robust Normal Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Platform Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Self Organizing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Additional Examples of Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Example of Self‐Organizing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Statistical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Statistical Details for Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Statistical Details for Robust Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5  Principal Components
Reduce the Dimensionality of Your Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Overview of Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Example of Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Launch the Principal Components Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Principal Components Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Principal Components Report Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Principal Components Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Wide Principal Components Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Cluster Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6  Discriminant Analysis
Predict Classifications Based on Continuous Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Discriminant Analysis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Example of Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Discriminant Launch Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Stepwise Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Discriminant Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Shrink Covariances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
The Discriminant Analysis Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Canonical Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Discriminant Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Score Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Discriminant Analysis Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Score Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Canonical Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Example of a Canonical 3D Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Specify Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Consider New Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Save Discrim Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Scatterplot Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Validation in JMP and JMP Pro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Description of the Wide Linear Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Saved Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Between Groups Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7  Partial Least Squares Models
Develop Models Using Correlations between Ys and Xs . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Overview of the Partial Least Squares Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Example of Partial Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Launch the Partial Least Squares Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Centering and Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Standardize X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Model Launch Control Panel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Partial Least Squares Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Model Comparison Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
<Cross Validation Method> and Method = <Method Specification> . . . . . . . . . . . . . . . . . 151
Model Fit Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Partial Least Squares Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Model Fit Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Variable Importance Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
VIP vs Coefficients Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Save Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Statistical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Partial Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
van der Voet T2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
T2 Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Confidence Ellipses for X Score Scatterplot Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Standard Error of Prediction and Confidence Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Standardized Scores and Loadings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
PLS Discriminant Analysis (PLS‐DA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
A  References
B  Statistical Details
Multivariate Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Wide Linear Methods and the Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 171
The Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
The SVD and the Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
The SVD and the Inverse Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Calculating the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Multivariate Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Approximate F‐Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Index
Multivariate Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Chapter 1
Learn about JMP
Documentation and Additional Resources
This chapter includes the following information:
• book conventions
• JMP documentation
• JMP Help
• additional resources, such as the following:
  ‒ other JMP documentation
  ‒ tutorials
  ‒ indexes
  ‒ Web resources
Figure 1.1 The JMP Help Home Window on Windows
Contents
Formatting Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
JMP Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
JMP Documentation Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
JMP Help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Additional Resources for Learning JMP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Sample Data Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Learn about Statistical and JSL Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Learn JMP Tips and Tricks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Tooltips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
JMP User Community . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
JMPer Cable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
JMP Books by Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
The JMP Starter Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Formatting Conventions
The following conventions help you relate written material to information that you see on your screen.
• Sample data table names, column names, pathnames, filenames, file extensions, and folders appear in Helvetica font.
• Code appears in Lucida Sans Typewriter font.
• Code output appears in Lucida Sans Typewriter italic font and is indented farther than the preceding code.
• Helvetica bold formatting indicates items that you select to complete a task:
  ‒ buttons
  ‒ check boxes
  ‒ commands
  ‒ list names that are selectable
  ‒ menus
  ‒ options
  ‒ tab names
  ‒ text boxes
• The following items appear in italics:
  ‒ words or phrases that are important or have definitions specific to JMP
  ‒ book titles
  ‒ variables
  ‒ script output
• Features that are for JMP Pro only are noted with the JMP Pro icon. For an overview of JMP Pro features, visit http://www.jmp.com/software/pro/.
Note: Special information and limitations appear within a Note.
Tip: Helpful information appears within a Tip.
JMP Documentation
JMP offers documentation in various formats, from print books and Portable Document Format (PDF) to electronic books (e‐books).
• Open the PDF versions from the Help > Books menu.
• All books are also combined into one PDF file, called JMP Documentation Library, for convenient searching. Open the JMP Documentation Library PDF file from the Help > Books menu.
• You can also purchase printed documentation and e-books on the SAS website: http://www.sas.com/store/search.ep?keyWords=JMP
JMP Documentation Library
The following descriptions cover the purpose and content of each book in the JMP documentation library.

Discovering JMP
Purpose: If you are not familiar with JMP, start here.
Content: Introduces you to JMP and gets you started creating and analyzing data.

Using JMP
Purpose: Learn about JMP data tables and how to perform basic operations.
Content: Covers general JMP concepts and features that span all of JMP, including importing data, modifying column properties, sorting data, and connecting to SAS.

Basic Analysis
Purpose: Perform basic analysis using this document.
Content: Describes these Analyze menu platforms:
• Distribution
• Fit Y by X
• Matched Pairs
• Tabulate
The book also covers how to approximate sampling distributions using bootstrapping, and describes the modeling utilities.
Essential Graphing
Purpose: Find the ideal graph for your data.
Content: Describes these Graph menu platforms:
• Graph Builder
• Overlay Plot
• Scatterplot 3D
• Contour Plot
• Bubble Plot
• Parallel Plot
• Cell Plot
• Treemap
• Scatterplot Matrix
• Ternary Plot
• Chart
The book also covers how to create background and custom maps.
Profilers
Purpose: Learn how to use interactive profiling tools, which enable you to view cross-sections of any response surface.
Content: Covers all profilers listed in the Graph menu. Analyzing noise factors is included, along with running simulations using random inputs.

Design of Experiments Guide
Purpose: Learn how to design experiments and determine appropriate sample sizes.
Content: Covers all topics in the DOE menu and the Screening menu item in the Analyze > Modeling menu.
Fitting Linear Models
Purpose: Learn about the Fit Model platform and many of its personalities.
Content: Describes these personalities, all available within the Analyze menu Fit Model platform:
• Standard Least Squares
• Stepwise
• Generalized Regression
• Mixed Model
• MANOVA
• Loglinear Variance
• Nominal Logistic
• Ordinal Logistic
• Generalized Linear Model

Specialized Models
Purpose: Learn about additional modeling techniques.
Content: Describes these Analyze > Modeling menu platforms:
• Partition
• Neural
• Model Comparison
• Nonlinear
• Gaussian Process
• Time Series
• Response Screening
The Screening platform in the Analyze > Modeling menu is described in the Design of Experiments Guide.
Multivariate Methods
Purpose: Read about techniques for analyzing several variables simultaneously.
Content: Describes these Analyze > Multivariate Methods menu platforms:
• Multivariate
• Cluster
• Principal Components
• Discriminant
• Partial Least Squares
Quality and Process Methods
Purpose: Read about tools for evaluating and improving processes.
Content: Describes these Analyze > Quality and Process menu platforms:
• Control Chart Builder and individual control charts
• Measurement Systems Analysis
• Variability / Attribute Gauge Charts
• Process Capability
• Pareto Plot
• Diagram

Reliability and Survival Methods
Purpose: Learn to evaluate and improve reliability in a product or system and analyze survival data for people and products.
Content: Describes these Analyze > Reliability and Survival menu platforms:
• Life Distribution
• Fit Life by X
• Recurrence Analysis
• Degradation and Destructive Degradation
• Reliability Forecast
• Reliability Growth
• Reliability Block Diagram
• Survival
• Fit Parametric Survival
• Fit Proportional Hazards

Consumer Research
Purpose: Learn about methods for studying consumer preferences and using that insight to create better products and services.
Content: Describes these Analyze > Consumer Research menu platforms:
• Categorical
• Multiple Correspondence Analysis
• Factor Analysis
• Choice
• Uplift
• Item Analysis
Scripting Guide
Purpose: Learn about taking advantage of the powerful JMP Scripting Language (JSL).
Content: Covers a variety of topics, such as writing and debugging scripts, manipulating data tables, constructing display boxes, and creating JMP applications.

JSL Syntax Reference
Purpose: Read about JSL functions, their arguments, and the messages that you send to objects and display boxes.
Content: Includes syntax, examples, and notes for JSL commands.
Note: The Books menu also contains two reference cards that can be printed: The Menu Card describes JMP menus, and the Quick Reference describes JMP keyboard shortcuts.
JMP Help
JMP Help is an abbreviated version of the documentation library that provides targeted information. You can open JMP Help in several ways:
• On Windows, press the F1 key to open the Help system window.
• Get help on a specific part of a data table or report window. Select the Help tool from the Tools menu and then click anywhere in a data table or report window to see the Help for that area.
• Within a JMP window, click the Help button.
• Search and view JMP Help on Windows using the Help > Help Contents, Search Help, and Help Index options. On Mac, select Help > JMP Help.
• Search the Help at http://jmp.com/support/help/ (English only).
Additional Resources for Learning JMP
In addition to JMP documentation and JMP Help, you can also learn about JMP using the following resources:
• Tutorials (see “Tutorials” on page 23)
• Sample data (see “Sample Data Tables” on page 23)
• Indexes (see “Learn about Statistical and JSL Terms” on page 23)
• Tip of the Day (see “Learn JMP Tips and Tricks” on page 24)
• Web resources (see “JMP User Community” on page 24)
• JMPer Cable technical publication (see “JMPer Cable” on page 24)
• Books about JMP (see “JMP Books by Users” on page 25)
• JMP Starter (see “The JMP Starter Window” on page 25)
Tutorials
You can access JMP tutorials by selecting Help > Tutorials. The first item on the Tutorials menu is Tutorials Directory. This opens a new window with all the tutorials grouped by category.
If you are not familiar with JMP, then start with the Beginners Tutorial. It steps you through the JMP interface and explains the basics of using JMP.
The rest of the tutorials help you with specific aspects of JMP, such as creating a pie chart, using Graph Builder, and so on.
Sample Data Tables
All of the examples in the JMP documentation suite use sample data. Select Help > Sample Data Library to open the sample data directory. To view an alphabetized list of sample data tables or view sample data within categories, select Help > Sample Data.
Sample data tables are installed in the following directory:
On Windows: C:\Program Files\SAS\JMP\<version_number>\Samples\Data
On Macintosh: /Library/Application Support/JMP/<version_number>/Samples/Data
In JMP Pro, sample data is installed in the JMPPRO (rather than JMP) directory. In JMP Shrinkwrap, sample data is installed in the JMPSW directory.
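If you work in JSL, the $SAMPLE_DATA path variable points to this directory, so a sample table can be opened from a script without typing the full path. A minimal sketch (the table name is just one example from the sample library):

    Open( "$SAMPLE_DATA/Big Class.jmp" );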
Learn about Statistical and JSL Terms
The Help menu contains the following indexes:
Statistics Index Provides definitions of statistical terms.
Scripting Index Lets you search for information about JSL functions, objects, and display boxes. You can also edit and run sample scripts from the Scripting Index.
Learn JMP Tips and Tricks
When you first start JMP, you see the Tip of the Day window. This window provides tips for using JMP.
To turn off the Tip of the Day, clear the Show tips at startup check box. To view it again, select Help > Tip of the Day. Or, you can turn it off using the Preferences window. See the Using JMP book for details.
Tooltips
JMP provides descriptive tooltips when you place your cursor over items, such as the following:
• Menu or toolbar options
• Labels in graphs
• Text results in the report window (move your cursor in a circle to reveal)
• Files or windows in the Home Window
• Code in the Script Editor
Tip: You can hide tooltips in the JMP Preferences. Select File > Preferences > General (or JMP > Preferences > General on Macintosh) and then deselect Show menu tips.
JMP User Community
The JMP User Community provides a range of options to help you learn more about JMP and connect with other JMP users. The learning library of one‐page guides, tutorials, and demos is a good place to start. And you can continue your education by registering for a variety of JMP training courses.
Other resources include a discussion forum, sample data and script file exchange, webcasts, and social networking groups.
To access JMP resources on the website, select Help > JMP User Community or visit https://community.jmp.com/.
JMPer Cable
The JMPer Cable is a yearly technical publication targeted to users of JMP. The JMPer Cable is available on the JMP website:
http://www.jmp.com/about/newsletters/jmpercable/
JMP Books by Users
Additional books about using JMP that are written by JMP users are available on the JMP website:
http://www.jmp.com/en_us/software/books.html
The JMP Starter Window
The JMP Starter window is a good place to begin if you are not familiar with JMP or data analysis. Options are categorized and described, and you launch them by clicking a button. The JMP Starter window covers many of the options found in the Analyze, Graph, Tables, and File menus.
• To open the JMP Starter window, select View (Window on the Macintosh) > JMP Starter.
• To display the JMP Starter automatically when you open JMP on Windows, select File > Preferences > General, and then select JMP Starter from the Initial JMP Window list. On Macintosh, select JMP > Preferences > Initial JMP Starter Window.
Chapter 2
Introduction to Multivariate Analysis
Overview of Multivariate Techniques
This book describes the following techniques for analyzing several variables simultaneously:
• The Multivariate platform examines multiple variables to see how they relate to each other. See Chapter 3, “Correlations and Multivariate Techniques”.
• The Cluster platform groups together rows that share similar values across a number of variables. It is a useful exploratory technique to help you understand the clumping structure of your data. See Chapter 4, “Cluster Analysis”.
• The Principal Components platform derives a small number of independent linear combinations (principal components) of a set of measured variables that capture as much of the variability in the original variables as possible. It is a useful exploratory technique and can help you create predictive models. See Chapter 5, “Principal Components”.
• The Discriminant platform seeks to predict a classification (X) variable (nominal or ordinal) based on known continuous responses (Y). It can be regarded as inverse prediction from a multivariate analysis of variance (MANOVA). See Chapter 6, “Discriminant Analysis”.
• The Partial Least Squares platform fits linear models based on factors, namely, linear combinations of the explanatory variables (Xs). PLS exploits the correlations between the Xs and the Ys to reveal underlying latent structures. See Chapter 7, “Partial Least Squares Models”.
Chapter 3
Correlations and Multivariate Techniques
Explore the Multidimensional Behavior of Variables
Use the Multivariate platform to explore how multiple variables relate to one another. The word multivariate simply means involving many variables instead of one (univariate) or two (bivariate). From the Multivariate report, you can:
• summarize the strength of the linear relationships between each pair of response variables using the Correlations table
• identify dependencies, outliers, and clusters using the Scatterplot Matrix
• use other techniques to examine multiple variables, such as partial, inverse, and pairwise correlations, covariance matrices, principal components, and more
Figure 3.1 Example of a Multivariate Report
Contents
Launch the Multivariate Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
The Multivariate Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Multivariate Platform Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Nonparametric Correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Scatterplot Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Outlier Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Item Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Impute Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Example of Item Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Computations and Statistical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Pearson Product‐Moment Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Nonparametric Measures of Association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Inverse Correlation Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Cronbach’s α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Launch the Multivariate Platform
Launch the Multivariate platform by selecting Analyze > Multivariate Methods > Multivariate.
Figure 3.2 The Multivariate Launch Window
Table 3.1 Description of the Multivariate Launch Window
Y, Columns Defines one or more response columns.
Weight (Optional) Identifies one column whose numeric values assign a weight to each row in the analysis.
Freq (Optional) Identifies one column whose numeric values assign a frequency to each row in the analysis.
By (Optional) Performs a separate multivariate analysis for each level of the By variable.
Estimation Method Select from one of several estimation methods for the correlations. With the Default option, Row-wise is used for data tables with no missing values. Pairwise is used for data tables that have more than 10 columns or more than 5000 rows, and that have missing values. Otherwise, the default estimation method is REML. For details, see “Estimation Methods” on page 32.
Matrix Format Select a format option for the Scatterplot Matrix. The Square option displays plots for all ordered combinations of columns. Lower Triangular displays plots on and below the diagonal, with the first n – 1 columns on the horizontal axis. Upper Triangular displays plots on and above the diagonal, with the first n – 1 columns on the vertical axis.
Estimation Methods
Several estimation methods for the correlations are available, to provide flexibility and to accommodate personal preferences. REML and Pairwise are the methods used most frequently. You can also estimate missing values by using the estimated covariance matrix and then using the Impute Missing Data command. See “Impute Missing Data” on page 44.
Default
The Default option uses the Row-wise, Pairwise, or REML method:
• Row-wise is used for data tables with no missing values.
• Pairwise is used in these circumstances:
  ‒ the data table has more than 10 columns or more than 5000 rows and has missing values
  ‒ the data table has more columns than rows and has missing values
• REML is used otherwise.
REML
REML (restricted maximum likelihood) estimates are less biased than ML (maximum likelihood) estimates. The REML method maximizes marginal likelihoods based upon error contrasts and is often used for estimating variances and covariances. The REML method in the Multivariate platform is the same as the REML estimation of mixed models for repeated measures data with an unstructured covariance matrix. See the documentation for SAS PROC MIXED about REML estimation of mixed models. REML uses all of your data, even if missing cells are present, and is most useful for smaller data sets. Because of the bias-correction factor, this method is slow if your data set is large and there are many missing values. If there are no missing cells in the data, the REML estimate is equivalent to the sample covariance matrix.
ML
The maximum likelihood estimation method (ML) is useful for large data tables with missing cells. The ML estimates are similar to the REML estimates, but the ML estimates are generated faster. Observations with missing values are not excluded. For small data tables, REML is preferred over ML because REML’s variance and covariance estimates are less biased.
Robust
Note: If you select Robust, and your data table contains more columns than rows, JMP switches the Estimation Method to Row‐wise.
Robust estimation is useful for data tables that might have outliers. For statistical details, see “Robust” on page 45.
Row-wise
Rowwise estimation does not use observations containing missing cells. This method is useful in the following situations:
• checking compatibility with JMP versions earlier than JMP 8 (row-wise estimation was the only estimation method available before JMP 8)
• excluding any observations that have missing data
Pairwise
Pairwise estimation performs correlations for all rows for each pair of columns with nonmissing values.
The Multivariate Report
The default multivariate report shows the standard correlation matrix and the scatterplot matrix. The platform menu lists additional correlation options and other techniques for looking at multiple variables. See “Multivariate Platform Options” on page 35.
Figure 3.3 Example of a Multivariate Report
To Produce the Report in Figure 3.3
1. Select Help > Sample Data Library and open Solubility.jmp.
2. Select Analyze > Multivariate Methods > Multivariate.
3. Select all columns except Labels and click Y, Columns.
4. Click OK.
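If you prefer to script this analysis, the launch can also be written in JSL. The sketch below is illustrative only: it assumes the column names in Solubility.jmp (every column except Labels goes in the Y list), and the option names generally mirror the menu labels, so check the Scripting Index if your copy differs.

    // Minimal JSL sketch, assuming the Solubility.jmp sample table and its column names
    dt = Open( "$SAMPLE_DATA/Solubility.jmp" );
    dt << Multivariate(
    	Y( :Ether, :Name( "1-Octanol" ), :Chloroform, :Benzene,
    	   :Name( "Carbon Tetrachloride" ), :Hexane )
    );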
About Missing Values
In most of the analysis options, a missing value in an observation does not cause the entire observation to be deleted. However, the Pairwise Correlations option excludes rows that are missing for either of the variables under consideration. The Simple Statistics > Univariate option calculates its statistics column by column, without regard to missing values in other columns.
Multivariate Platform Options
Correlations Multivariate
Shows or hides the Correlations table, which is a matrix of correlation coefficients that summarizes the strength of the linear relationships between each pair of response (Y) variables. This option is on by default. See “Pearson Product‐Moment Correlation” on page 46.
This correlation matrix is calculated by the method that you select in the launch window.
Correlation Probability
Shows the Correlation Probability report, which is a matrix of p‐values. Each p‐value corresponds to a test of the null hypothesis that the true correlation between the variables is zero. This is a test of no linear relationship between the two response variables. The test is the usual test for significance of the Pearson correlation coefficient.
CI of Correlation
Shows the two‐tailed confidence intervals of the correlations. This option is off by default.
The default confidence coefficient is 95%. Use the Set α Level option to change the confidence coefficient.
Inverse Correlations
Shows or hides the inverse correlation matrix (Inverse Corr table). This option is off by default.
The diagonal elements of the matrix are a function of how closely the variable is a linear function of the other variables. In the inverse correlation, the diagonal is 1/(1 – R²) for the fit of that variable by all the other variables. If the multiple correlation is zero, the diagonal inverse element is 1. If the multiple correlation is 1, then the inverse element becomes infinite and is reported missing.
For statistical details about inverse correlations, see the “Inverse Correlation Matrix” on page 47.
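As a quick numeric illustration (not output from the platform), take two standardized variables with correlation 0.8, so R² = 0.64 for the fit of either variable by the other. In JSL matrix form:

    // Illustration only: the diagonal of the inverse correlation matrix is 1/(1 - R^2)
    R = [1 0.8, 0.8 1];
    Show( Inv( R ) );  // diagonal elements equal 1/(1 - 0.64), about 2.78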
Partial Correlations
Shows or hides the partial correlation table (Partial Corr), which shows the measure of the relationship between a pair of variables after adjusting for the effects of all the other variables. This option is off by default.
This table is the negative of the inverse correlation matrix, scaled to unit diagonal.
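To make that relationship concrete, the sketch below uses a made-up 3×3 correlation matrix and rescales the negative inverse correlation matrix to unit diagonal, which reproduces the partial correlations by hand:

    // Illustrative values only; the off-diagonal entries of P are the partial correlations
    R = [1 0.6 0.5, 0.6 1 0.4, 0.5 0.4 1];
    A = Inv( R );
    nvars = N Row( A );
    P = J( nvars, nvars, 0 );
    For( i = 1, i <= nvars, i++,
    	For( j = 1, j <= nvars, j++,
    		P[i, j] = -A[i, j] / Sqrt( A[i, i] * A[j, j] )
    	)
    );
    Show( P );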
Covariance Matrix
Shows or hides the covariance matrix, which measures the degree to which a pair of variables change together. This option is off by default.
Pairwise Correlations
Shows or hides the Pairwise Correlations table, which lists the Pearson product‐moment correlations for each pair of Y variables. This option is off by default.
The correlations are calculated by the pairwise deletion method. The count values differ if any pair has a missing value for either variable. The Pairwise Correlations report also shows significance probabilities and compares the correlations in a bar chart. All results are based on the pairwise method.
Hotelling’s T2 Test
Allows you to conduct a one‐sample test for the mean of the multivariate distribution of the variables that you entered as Y. Specify the mean vector under the null hypothesis in the window that appears by entering a hypothesized mean for each variable. The test assumes multivariate normality of the Y variables.
The Hotelling’s T2 Test report gives the following:
Variable Lists the variables entered as Y.
Mean Gives the sample mean for each variable.
Hypothesized Mean Shows the null hypothesis means that you specified.
Test Statistic Gives the value of Hotelling’s T2 statistic.
F Ratio Gives the value of the test statistic. If you have n rows and k variables, the F ratio is given as follows:
F = \frac{n - k}{k(n - 1)} \, T^2
Prob > F The p‐value for the test. Under the null hypothesis, the F ratio has an F distribution with k and n – k degrees of freedom.
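As an illustration of the calculation only (not JMP’s internal implementation), the following Python sketch computes the one‐sample Hotelling’s T2 statistic, the F ratio above, and its p‐value for a data matrix Y and a hypothesized mean vector mu0. The function and variable names are placeholders.

    import numpy as np
    from scipy import stats

    def hotelling_t2_one_sample(Y, mu0):
        # Y: n x k data matrix; mu0: hypothesized mean vector of length k
        Y = np.asarray(Y, dtype=float)
        n, k = Y.shape
        diff = Y.mean(axis=0) - np.asarray(mu0, dtype=float)
        S = np.cov(Y, rowvar=False)              # sample covariance matrix
        t2 = n * diff @ np.linalg.solve(S, diff)
        f_ratio = (n - k) / (k * (n - 1)) * t2   # F ratio from the formula above
        p_value = stats.f.sf(f_ratio, k, n - k)  # F distribution with k and n - k df
        return t2, f_ratio, p_value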
Simple Statistics
This menu contains two options that each show or hide simple statistics (mean, standard deviation, and so on) for each column. The univariate and multivariate simple statistics can differ when there are missing values present, or when the Robust method is used.
Univariate Simple Statistics Shows statistics that are calculated on each column, regardless of values in other columns. These values match those produced by the Distribution platform.
Multivariate Simple Statistics Shows statistics that correspond to the estimation method selected in the launch window. If the REML, ML, or Robust method is selected, the mean vector and covariance matrix are estimated by that selected method. If the Row-wise method is selected, all rows with at least one missing value are excluded from the calculation of means and variances. If the Pairwise method is selected, the mean and variance are calculated for each column.
These options are off by default.
Nonparametric Correlations
This menu contains three nonparametric measures: Spearman’s Rho, Kendall’s Tau, and Hoeffding’s D. These options are off by default.
For details, see “Nonparametric Correlations” on page 39.
Set α Level
You can specify any alpha value for the correlation confidence intervals.
Four alpha values are listed: 0.01, 0.05, 0.10, and 0.50. Select Other to enter any other value.
Scatterplot Matrix
Shows or hides a scatterplot matrix of each pair of response variables. This option is on by default.
For details, see “Scatterplot Matrix” on page 40.
Color Maps
The Color Map menu contains three types of color maps.
Color Map On Correlations Produces a cell plot that shows the correlations among variables on a scale from red (+1) to blue (‐1).
Color Map On p-values Produces a cell plot that shows the significance of the correlations on a scale from p = 0 (red) to p = 1 (blue).
Cluster the Correlations Produces a cell plot that clusters together similar variables. The correlations are the same as for Color Map on Correlations, but the positioning of the variables may be different.
These options are off by default.
Parallel Coord Plot
Shows or hides a parallel coordinate plot of the variables. This option is off by default.
Ellipsoid 3D Plot
Shows or hides a 95% confidence ellipsoid for three variables that you are asked to specify. This option is off by default.
Principal Components
This menu contains options to show or hide a principal components report. You can select correlations, covariances, or unscaled. Selecting one of these options when another of the reports is shown changes the report to the new option. Select None to remove the report. This option is off by default.
Principal component analysis forms linear combinations of the original variables. The first principal component has maximum variation, the second principal component has the next most variation subject to being orthogonal to the first, and so on. For details, see the chapter “Principal Components” on page 81.
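For readers who want to see the computation outside of JMP, the following Python sketch extracts principal components from the correlation matrix (the on Correlations choice). The function name is a placeholder, and this is not JMP’s implementation.

    import numpy as np

    def principal_components_on_correlations(X):
        # X: n x p data matrix; components come from the correlation matrix
        X = np.asarray(X, dtype=float)
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardized data
        eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
        order = np.argsort(eigvals)[::-1]                   # largest variance first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        scores = Z @ eigvecs                                # principal component scores
        return eigvals, eigvecs, scores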
Outlier Analysis
This menu contains options that show or hide plots that measure distance in the multivariate sense using one of these methods: the Mahalanobis distance, jackknife distances, and the T2 statistic.
For details, see “Outlier Analysis” on page 42.
Item Reliability
This menu contains options that each show or hide an item reliability report. The reports indicate how consistently a set of instruments measures an overall response, using either Cronbach’s α or standardized α. These options are off by default.
For details, see “Item Reliability” on page 43.
Impute Missing Data
Produces a new data table that duplicates your data table and replaces all missing values with estimated values. This option is available only if your data table contains missing values.
For details, see “Impute Missing Data” on page 44.
Save Imputed Formula
For columns that contain missing values, saves new columns to the data table that contain the formulas used to estimate the missing values. The new columns are called Imputed_<Column Name>.
Script
Contains options that are available to all platforms. See the Using JMP book.
Nonparametric Correlations
The Nonparametric Correlations menu offers three nonparametric measures:
Spearman’s Rho is a correlation coefficient computed on the ranks of the data values instead of on the values themselves.
Kendall’s Tau is based on the number of concordant and discordant pairs of observations. A pair is concordant if the observation with the larger value of X also has the larger value of Y. A pair is discordant if the observation with the larger value of X has the smaller value of Y. There is a correction for tied pairs (pairs of observations that have equal values of X or equal values of Y).
Hoeffding’s D is a statistic that ranges from –0.5 to 1, with large positive values indicating dependence. The statistic approximates a weighted sum over observations of chi‐square statistics for two‐by‐two classification tables. The two‐by‐two tables are made by setting each data value as the threshold. This statistic detects more general departures from independence.
The Nonparametric Measures of Association report also shows significance probabilities for all measures and compares them with a bar chart.
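Outside of JMP, Spearman’s Rho and Kendall’s Tau can be checked with SciPy, as in the hedged sketch below. The x and y values are made up for illustration; SciPy’s kendalltau returns the tie‐corrected tau‐b, and SciPy does not ship Hoeffding’s D.

    from scipy import stats

    x = [86, 71, 77, 68, 91, 70, 77, 61, 80, 58]   # made-up example values
    y = [88, 77, 76, 64, 96, 65, 82, 59, 76, 63]

    rho, rho_p = stats.spearmanr(x, y)    # Spearman's Rho and its p-value
    tau, tau_p = stats.kendalltau(x, y)   # Kendall's Tau (tau-b, tie corrected)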
Note: The nonparametric correlations are always calculated by the Pairwise method, even if other methods were selected in the launch window.
For statistical details about these three methods, see “Nonparametric Measures of Association” on page 46.
Scatterplot Matrix
A scatterplot matrix helps you visualize the correlations between each pair of response variables. The scatterplot matrix is shown by default, and can be hidden or shown by selecting Scatterplot Matrix from the red triangle menu for Multivariate.
Figure 3.4 Clusters of Correlations
By default, a 95% bivariate normal density ellipse is shown in each scatterplot. Assuming that each pair of variables has a bivariate normal distribution, this ellipse encloses approximately 95% of the points. The narrowness of the ellipse reflects the degree of correlation of the variables. If the ellipse is fairly round and is not diagonally oriented, the variables are uncorrelated. If the ellipse is narrow and diagonally oriented, the variables are correlated.
Working with the Scatterplot Matrix
Re‐sizing any cell resizes all the cells.
Drag a label cell to another label cell to reorder the matrix.
When you look for patterns in the scatterplot matrix, you can see the variables cluster into groups based on their correlations. Figure 3.4 shows two clusters of correlations: the first two variables (top, left), and the next four (bottom, right).
Options for Scatterplot Matrix
The red triangle menu for the Scatterplot Matrix lets you tailor the matrix with color and density ellipses and by setting the α-level.
Table 3.2 Options for the Scatterplot Matrix
Show Points
Shows or hides the points in the scatterplots.
Fit Line
Shows or hides the regression line and 95% level confidence curves for the fitted regression line.
Density Ellipses
Shows or hides the 95% density ellipses in the scatterplots. Use the Ellipse α menu to change the α-level.
Shaded Ellipses
Colors each ellipse. Use the Ellipses Transparency and Ellipse Color menus to change the transparency and color.
Show Correlations
Shows or hides the correlation of each pair of variables in the upper left corner of each scatterplot.
Show Histogram
Shows either horizontal or vertical histograms in the label cells. Once histograms have been added, select Show Counts to label each bar of the histogram with its count. Select Horizontal or Vertical to either change the orientation of the histograms or remove the histograms.
Ellipse α
Sets the α-level used for the ellipses. Select one of the standard α-levels in the menu, or select Other to enter a different one.
Ellipses Transparency
Sets the transparency of the ellipses if they are colored. Select one of the default levels, or select Other to enter a different one. The default value is 0.2.
Ellipse Color
Sets the color of the ellipses if they are colored. Select one of the colors in the palette, or select Other to use another color. The default value is red.
Nonpar Density
Shows or hides shaded density contours based on a smooth nonparametric bivariate surface that describes the density of data points. Contours for the 10% and 50% quantiles of the nonparametric surface are shown.
Outlier Analysis
The Outlier Analysis menu contains options that show or hide plots that measure distance in the multivariate sense using one of these methods:
• Mahalanobis distance
• jackknife distances
• T2 statistic
These methods all measure distance in the multivariate sense, with respect to the correlation structure. Testing is done at the alpha level that appears at the bottom of the plot.
In Figure 3.5, Point A is an outlier because it is outside the correlation structure rather than because it is an outlier in any of the coordinate directions.
Figure 3.5 Example of an Outlier
Mahalanobis Distance
The Mahalanobis Outlier Distance plot shows the Mahalanobis distance of each point from the multivariate mean (centroid). The standard Mahalanobis distance depends on estimates of the mean, standard deviation, and correlation for the data. The distance is plotted for each observation number. Extreme multivariate outliers can be identified by highlighting the points with the largest distance values. See “Mahalanobis Distance Measures” on page 48 for more information.
Jackknife Distances
The Jackknife Distances plot shows distances that are calculated using a jackknife technique. The distance for each observation is calculated with estimates of the mean, standard deviation, and correlation matrix that do not include the observation itself. The jackknifed distances are useful when there is an outlier. In this case, the Mahalanobis distance is distorted and tends to disguise the outlier or make other points look more outlying than they are. See “Jackknife Distance Measures” on page 49 for more information.
T2 Statistic
The T2 plot shows distances that are the square of the Mahalanobis distance. This plot is preferred for multivariate control charts. The plot includes the value of the calculated T2 statistic, as well as its upper control limit. Values that fall outside this limit might be outliers. See “T2 Distance Measures” on page 49 for more information.
Saving Distances and Values
You can save any of the distances to the data table by selecting the Save option from the red triangle menu for the plot.
Note: There is no formula saved with the jackknife distance column. This means that the distance is not recomputed if you modify the data table. If you add or delete columns, or change values in the data table, select Analyze > Multivariate Methods > Multivariate again to compute new jackknife distances.
In addition to saving the distance values for each row, a column property is created that holds the upper control limit (UCL) value for the Outlier Analysis type specified.
Item Reliability
Item reliability indicates how consistently a set of instruments measures an overall response. Cronbach’s α (Cronbach 1951) is one measure of reliability. Two primary applications for Cronbach’s α are industrial instrument reliability and questionnaire analysis.
Cronbach’s α is based on the average correlation of items in a measurement scale. It is equivalent to computing the average of all split‐half correlations in the data table. The Standardized α can be requested if the items have variances that vary widely.
Note: Cronbach’s α is not related to a significance level α. Also, item reliability is unrelated to survival time reliability analysis.
To look at the influence of an individual item, JMP excludes it from the computations and shows the effect on the Cronbach’s α value. If α increases when you exclude a variable (item), that variable is not highly correlated with the other variables. If the α decreases, you can conclude that the variable is correlated with the other items in the scale. Nunnally (1979) suggests a Cronbach’s α of 0.7 as a rule‐of‐thumb acceptable level of agreement.
See “Cronbach’s α” on page 50 for details about computations.
Impute Missing Data
To impute missing data, select Impute Missing Data from the red triangle menu for Multivariate. A new data table is created that duplicates your data table and replaces all missing values with estimated values.
Imputed values are expectations conditional on the nonmissing values for each row. The mean vector and covariance matrix, which are estimated by the method chosen in the launch window, are used for the imputation calculation. All multivariate tests and options are then available for the imputed data set.
This option is available only if your data table contains missing values.
Example of Item Reliability
This example uses the Danger.jmp data in the sample data folder. This table lists 30 items having some level of inherent danger. Three groups of people (students, nonstudents, and experts) ranked the items according to perceived level of danger. Note that Nuclear power is rated as very dangerous (1) by both students and nonstudents, but is ranked low (20) by experts. On the other hand, motorcycles are ranked either fifth or sixth by all three judging groups.
You can use Cronbach’s  to evaluate the agreement in the perceived way the groups ranked the items. Note that in this type of example, where the values are the same set of ranks for each group, standardizing the data has no effect.
1. Select Help > Sample Data Library and open Danger.jmp.
2. Select Analyze > Multivariate Methods > Multivariate.
3. Select all the columns except for Activity and click Y, Columns.
4. Click OK.
5. From the red triangle menu for Multivariate, select Item Reliability > Cronbach’s α.
6. (Optional) From the red triangle menu for Multivariate, select Scatterplot Matrix to hide that plot.
Figure 3.6 Cronbach’s α Report
The Cronbach’s α results in Figure 3.6 show an overall α of 0.8666, which indicates a high correlation of the ranked values among the three groups. Further, when you remove the experts from the analysis, the Nonstudents and Students ranked the dangers nearly the same, with Cronbach’s α scores of 0.7785 and 0.7448, respectively.
Computations and Statistical Details
Estimation Methods
Robust
This method essentially ignores any outlying values by substantially down‐weighting them. A sequence of iteratively reweighted fits of the data is done using the weight:
wi = 1.0 if Q < K and wi = K/Q otherwise,
where K is a constant equal to the 0.75 quantile of a chi‐square distribution with the degrees of freedom equal to the number of columns in the data table, and
Q = (y_i - \mu)^T (S^2)^{-1} (y_i - \mu)
where y_i = the response for the ith observation, μ = the current estimate of the mean vector, S^2 = the current estimate of the covariance matrix, and T = the transpose matrix operation. The final step is a bias reduction of the variance matrix.
The tradeoff of this method is that you can have higher variance estimates when the data do not have many outliers, but can have a much more precise estimate of the variances when the data do have outliers.
Pearson Product-Moment Correlation
The Pearson product‐moment correlation coefficient measures the strength of the linear relationship between two variables. For response variables X and Y, it is denoted as r and computed as
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}
If there is an exact linear relationship between two variables, the correlation is 1 or –1, depending on whether the variables are positively or negatively related. If there is no linear relationship, the correlation tends toward zero.
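The formula translates directly into code. The following Python sketch (a placeholder function, not JMP’s implementation) computes r for two numeric vectors.

    import numpy as np

    def pearson_r(x, y):
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        dx, dy = x - x.mean(), y - y.mean()
        return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))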
Nonparametric Measures of Association
For the Spearman, Kendall, or Hoeffding correlations, the data are first ranked. Computations are then performed on the ranks of the data values. Average ranks are used in case of ties.
Spearman’s ρ (rho) Coefficients
Spearman’s ρ correlation coefficient is computed on the ranks of the data using the formula for the Pearson correlation previously described.
Kendall’s τb Coefficients
Kendall’s τb coefficients are based on the number of concordant and discordant pairs. A pair of rows for two variables is concordant if they agree in which variable is greater. Otherwise they are discordant, or tied.
Kendall’s τb is computed as
\tau_b = \frac{\sum_{i<j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j)}{\sqrt{(T_0 - T_1)(T_0 - T_2)}}
where:
T_0 = n(n - 1)/2
T_1 = \sum_i t_i (t_i - 1)/2
T_2 = \sum_i u_i (u_i - 1)/2
Note the following:
• sgn(z) is equal to 1 if z > 0, 0 if z = 0, and –1 if z < 0.
• The t_i (the u_i) are the number of tied x (respectively y) values in the ith group of tied x (respectively y) values.
• The n is the number of observations.
• Kendall’s τb ranges from –1 to 1. If a weight variable is specified, it is ignored.
Computations proceed in the following way:
• Observations are ranked in order according to the value of the first variable.
• The observations are then re‐ranked according to the values of the second variable.
• The number of interchanges of the first variable is used to compute Kendall’s τb.
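The following Python sketch evaluates the τb formula directly by enumerating pairs. It is an O(n²) illustration of the formula rather than the interchange‐counting algorithm described above, and the names are placeholders.

    import numpy as np
    from collections import Counter

    def kendall_tau_b(x, y):
        x, y = list(x), list(y)
        n = len(x)
        s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
                for i in range(n) for j in range(i + 1, n))
        T0 = n * (n - 1) / 2
        T1 = sum(t * (t - 1) / 2 for t in Counter(x).values())   # ties in x
        T2 = sum(u * (u - 1) / 2 for u in Counter(y).values())   # ties in y
        return s / np.sqrt((T0 - T1) * (T0 - T2))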
Hoeffding’s D Statistic
The formula for Hoeffding’s D (1948) is
D = 30 \, \frac{(n - 2)(n - 3) D_1 + D_2 - 2(n - 2) D_3}{n(n - 1)(n - 2)(n - 3)(n - 4)}
where:
D_1 = \sum_i (Q_i - 1)(Q_i - 2)
D_2 = \sum_i (R_i - 1)(R_i - 2)(S_i - 1)(S_i - 2)
D_3 = \sum_i (R_i - 2)(S_i - 2)(Q_i - 1)
Note the following:
• The R_i and S_i are ranks of the x and y values.
• The Q_i (sometimes called bivariate ranks) are one plus the number of points that have both x and y values less than the ith point.
• A point that is tied on its x value or y value, but not on both, contributes 1/2 to Q_i if the other value is less than the corresponding value for the ith point. A point tied on both x and y contributes 1/4 to Q_i.
When there are no ties among observations, the D statistic has values between –0.5 and 1, with 1 indicating complete dependence. If a weight variable is specified, it is ignored.
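As a rough check of the formula, the following Python sketch computes D for data with no ties; it omits the 1/2 and 1/4 tie adjustments described above, and the function name is a placeholder.

    import numpy as np
    from scipy.stats import rankdata

    def hoeffding_d(x, y):
        # Assumes no tied x or y values
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        n = len(x)
        R, S = rankdata(x), rankdata(y)
        Q = np.array([1 + np.sum((x < x[i]) & (y < y[i])) for i in range(n)])
        D1 = np.sum((Q - 1) * (Q - 2))
        D2 = np.sum((R - 1) * (R - 2) * (S - 1) * (S - 2))
        D3 = np.sum((R - 2) * (S - 2) * (Q - 1))
        return 30 * ((n - 2) * (n - 3) * D1 + D2 - 2 * (n - 2) * D3) / (
            n * (n - 1) * (n - 2) * (n - 3) * (n - 4))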
Inverse Correlation Matrix
The inverse correlation matrix provides useful multivariate information. The diagonal elements of the inverse correlation matrix, sometimes called the variance inflation factors (VIF), are a function of how closely the variable is a linear function of the other variables. Specifically, if the correlation matrix is denoted R and the inverse correlation matrix is denoted R^{-1}, the diagonal element is denoted r^{ii} and is computed as
r^{ii} = \mathrm{VIF}_i = \frac{1}{1 - R_i^2}
where R_i^2 is the coefficient of determination from the model regressing the ith explanatory variable on the other explanatory variables. Thus, a large r^{ii} indicates that the ith variable is highly correlated with one or more of the other variables.
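The relationship between the inverse correlation matrix and the VIFs can be verified numerically. The Python sketch below (placeholder names, not JMP code) inverts the correlation matrix and recovers each R_i^2 from the diagonal.

    import numpy as np

    def inverse_correlation_diagnostics(X):
        R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
        R_inv = np.linalg.inv(R)
        vif = np.diag(R_inv)          # r^ii = 1 / (1 - R_i^2)
        r_squared = 1 - 1 / vif       # multiple R^2 of each variable on the others
        return R_inv, vif, r_squared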
Distance Measures
The Outlier Analysis plots show the specified distance measure for each point in the data table.
Mahalanobis Distance Measures
The Mahalanobis distance takes into account the correlation structure of the data and the individual scales. For each value, the Mahalanobis distance is denoted Mi and is computed as
M_i = \sqrt{(Y_i - \bar{Y})' \, S^{-1} \, (Y_i - \bar{Y})}
where:
Y_i is the data for the ith row
\bar{Y} is the row of means
S is the estimated covariance matrix for the data
The UCL reference line (Mason and Young, 2002) drawn on the Mahalanobis Distances plot is computed as
UCL_{Mahalanobis} = \sqrt{\frac{(n - 1)^2}{n} \, \beta_{1-\alpha;\; p/2,\, (n - p - 1)/2}}
where:
n = number of observations
p = number of variables (columns)
α = the significance level
\beta_{1-\alpha;\; p/2,\, (n - p - 1)/2} = the (1 – α)th quantile of a Beta(p/2, (n – p – 1)/2) distribution
If a variable is an exact linear combination of other variables, then the correlation matrix is singular and the row and the column for that variable are zeroed out. The generalized inverse that results is still valid for forming the distances.
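The following Python sketch computes the Mahalanobis distances and the Beta‐quantile UCL described above. It uses a pseudo‐inverse so that an exactly collinear variable does not stop the calculation; the function name and the default α are placeholders, and this is not JMP’s implementation.

    import numpy as np
    from scipy import stats

    def mahalanobis_distances(Y, alpha=0.05):
        Y = np.asarray(Y, dtype=float)
        n, p = Y.shape
        diff = Y - Y.mean(axis=0)
        S_inv = np.linalg.pinv(np.cov(Y, rowvar=False))   # generalized inverse
        M = np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))
        beta_q = stats.beta.ppf(1 - alpha, p / 2, (n - p - 1) / 2)
        ucl = np.sqrt((n - 1) ** 2 / n * beta_q)
        return M, ucl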
Jackknife Distance Measures
The jackknife distance is calculated with estimates of the mean, standard deviation, and correlation matrix that do not include the observation itself. For each value, the jackknife distance is computed as
J_i^2 = \frac{n^2 (n - 2)}{(n - 1)^3} \cdot \frac{M_i^2}{1 - \dfrac{n \, M_i^2}{(n - 1)^2}}
where:
n = number of observations
p = number of variables (columns)
Mi = Mahalanobis distance for the ith observation
The UCL reference line (Penny, 1996) drawn on the Jackknife Distances plot is calculated as
UCL_{Jackknife}^2 = \frac{n^2 (n - 2)}{(n - 1)^3} \cdot \frac{UCL_{Mahalanobis}^2}{1 - \dfrac{n \, UCL_{Mahalanobis}^2}{(n - 1)^2}}
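Rather than relying on the closed‐form expression, the jackknife distances can also be computed directly by leaving each observation out in turn, as in the Python sketch below. The names are placeholders, and the results should closely track the formula above, up to the exact estimator conventions used.

    import numpy as np

    def jackknife_distances(Y):
        Y = np.asarray(Y, dtype=float)
        n = Y.shape[0]
        J = np.empty(n)
        for i in range(n):
            rest = np.delete(Y, i, axis=0)        # drop the ith observation
            diff = Y[i] - rest.mean(axis=0)
            S = np.cov(rest, rowvar=False)        # covariance without row i
            J[i] = np.sqrt(diff @ np.linalg.solve(S, diff))
        return J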
T2 Distance Measures
The T2 distance is the square of the Mahalanobis distance, so Ti2 = Mi2.
The UCL on the T2 distance is:
UCL_{T^2} = \frac{(n - 1)^2}{n} \, \beta_{1-\alpha;\; p/2,\, (n - p - 1)/2} = (UCL_{Mahalanobis})^2
where:
n = number of observations
p = number of variables (columns)
α = the significance level
\beta_{1-\alpha;\; p/2,\, (n - p - 1)/2} = the (1 – α)th quantile of a Beta(p/2, (n – p – 1)/2) distribution
Multivariate distances are useful for spotting outliers in many dimensions. However, if the variables are highly correlated in a multivariate sense, then a point can be seen as an outlier in multivariate space without looking unusual along any subset of dimensions. In other words, when the values are correlated, it is possible for a point to be unremarkable when seen along one or two axes but still be an outlier by violating the correlation.
Cronbach’s α
Cronbach’s α is defined as
\alpha = \frac{k \bar{c}}{\bar{v} + (k - 1)\bar{c}}
where:
k = the number of items in the scale
\bar{c} = the average covariance between items
\bar{v} = the average variance of the items
If the items are standardized to have a constant variance, the formula becomes
\alpha = \frac{k \bar{r}}{1 + (k - 1)\bar{r}}
where \bar{r} = the average correlation between items.
The larger the overall α coefficient, the more confident you can feel that your items contribute to a reliable scale or test. The coefficient can approach 1.0 if you have many highly correlated items.
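Both forms of the coefficient follow directly from the covariance or correlation matrix of the items. The Python sketch below (placeholder function name, not JMP code) implements them.

    import numpy as np

    def cronbach_alpha(X, standardize=False):
        # X: n x k matrix of item scores, one column per item
        X = np.asarray(X, dtype=float)
        k = X.shape[1]
        off_diag = np.triu_indices(k, 1)
        if standardize:
            r_bar = np.corrcoef(X, rowvar=False)[off_diag].mean()
            return k * r_bar / (1 + (k - 1) * r_bar)
        C = np.cov(X, rowvar=False)
        v_bar = np.diag(C).mean()        # average item variance
        c_bar = C[off_diag].mean()       # average covariance between items
        return k * c_bar / (v_bar + (k - 1) * c_bar)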
Chapter 4
Cluster Analysis
Identify and Explore Groups of Similar Objects
Clustering is the technique of grouping rows together that share similar values across a number of variables. It is a wonderful exploratory technique to help you understand the clumping structure of your data. JMP provides three different clustering methods: hierarchical, k‐means, and normal mixtures.
Figure 4.1 Example of a Cluster Analysis
Contents
Clustering Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Example of Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Launch the Cluster Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Hierarchical Cluster Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Hierarchical Cluster Options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
K‐Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
K‐Means Control Panel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
K‐Means Report. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Normal Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Robust Normal Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Platform Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Self Organizing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Additional Examples of Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Example of Self‐Organizing Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Statistical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Clustering Overview
Clustering is a multivariate technique of grouping rows together that share similar values. It can use any number of variables. The variables must be numeric variables for which numerical differences make sense. The common situation is that data are not scattered evenly through n‐dimensional space, but rather they form clumps, locally dense areas, modes, or clusters. The identification of these clusters goes a long way toward characterizing the distribution of values.
JMP provides three approaches to clustering:
• Hierarchical clustering is appropriate for small tables, up to several thousand rows. It combines rows in a hierarchical sequence portrayed as a tree. In JMP, the tree, also called a dendrogram, is a dynamic, responding graph. You can choose the number of clusters that you like after the tree is built.
• K‐means clustering is appropriate for larger tables, up to hundreds of thousands of rows. It makes a fairly good guess at cluster seed points. It then starts an iteration of alternately assigning points to clusters and recalculating cluster centers. You have to specify the number of clusters before you start the process.
• Normal mixtures are appropriate when the data are assumed to come from a mixture of multivariate normal distributions that overlap. Maximum likelihood is used to estimate the mixture proportions and the means, standard deviations, and correlations jointly. This approach is particularly good at estimating the total counts in each group. However, each point, rather than being classified into one group, is assigned a probability of being in each group. The EM algorithm is used to obtain estimates.
Hierarchical clustering is also called agglomerative clustering because it is a combining process. The method starts with each point (row) as its own cluster. At each step the clustering process calculates the distance between each pair of clusters and combines the two clusters that are closest together. This combining continues until all the points are in one final cluster. The user then chooses the number of clusters that seems right and cuts the clustering tree at that point. The combining record is portrayed as a tree, called a dendrogram. The single points are leaves, the final single cluster of all points is the trunk, and the intermediate cluster combinations are branches. Since the process starts with n(n + 1)/2 distances for n points, this method becomes too expensive in memory and time when n is large.
Hierarchical clustering also supports character columns. If the column is ordinal, then the data value used for clustering is just the index of the ordered category, treated as if it were continuous data. If the column is nominal, then the categories must match to contribute a distance of zero. They contribute a distance of 1 otherwise.
JMP offers five rules for defining distances between clusters: Average, Centroid, Ward, Single, and Complete. Each rule can generate a different sequence of clusters.
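Outside of JMP, the same agglomerative process can be sketched with SciPy. The example below standardizes the columns, builds a Ward tree, and cuts it at four clusters; the data and the choice of four clusters are placeholders.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.stats import zscore

    X = np.random.default_rng(1).normal(size=(30, 2))    # placeholder data
    Z = zscore(X, ddof=1)                                # standardize each column
    tree = linkage(Z, method='ward')   # also 'average', 'centroid', 'single', 'complete'
    labels = fcluster(tree, t=4, criterion='maxclust')   # cut the tree at 4 clusters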
K‐means clustering is an iterative follow‐the‐leader strategy. First, the user must specify the number of clusters, k. Then a search algorithm goes out and finds k points in the data, called seeds, that are not close to each other. Each seed is then treated as a cluster center. The routine goes through the points (rows) and assigns each point to the closest cluster. For each cluster, a new cluster center is formed as the means (centroid) of the points currently in the cluster. This process continues as an alternation between assigning points to clusters and recalculating cluster centers until the clusters become stable.
Normal mixtures clustering, like k‐means clustering, begins with a user‐defined number of clusters and then selects distance seeds. JMP uses the cluster centers chosen by k‐means as seeds. However, each point, rather than being classified into one group, is assigned a probability of being in each group.
SOMs are a variation on k‐means where the cluster centers are laid out on a grid. Clusters and points close together on the grid are meant to be close together in the multivariate space. See “Self Organizing Maps” on page 72.
K‐means, normal mixtures, and SOM clustering are doubly iterative processes. The clustering process iterates between two steps in a particular implementation of the EM algorithm:
• The expectation step of mixture clustering assigns each observation a probability of belonging to each cluster.
• For each cluster, a new center is formed using every observation with its probability of membership as a weight. This is the maximization step.
This process continues alternating between the expectation and maximization steps until the clusters become stable.
Example of Clustering
In this example, we group together countries by their 1976 crude birth and death rates per 100,000 people.
1. Select Help > Sample Data Library and open Birth Death Subset.jmp
2. Select Analyze > Multivariate Methods > Cluster.
3. Assign columns birth and death to Y, Columns.
4. Select country and click Label.
5. Click OK.
The Hierarchical Clustering platform report consists of a Clustering History table, a dendrogram tree diagram, and a plot of the distances between the clusters. Each observation is identified by the label that you assigned, country.
Figure 4.2 Hierarchical Clustering Report
The clustering sequence is easily visualized with the help of the dendrogram, shown in Figure 4.2. Clustering occurs from left to right in the diagram, with each step consisting of the two closest clusters combining into a single cluster.
The scree plot beneath the dendrogram has a point for each join. The ordinate is the distance that was bridged to join the clusters at each step. There is a natural break in the scree plot between three and four clusters, suggesting that four is a good choice for the number of clusters. Note that there is also a break between seven and eight clusters. For practical purposes, however, four is the best choice.
6. From the Hierarchical Clustering red triangle menu, select Color Clusters.
7. From the Hierarchical Clustering red triangle menu, select Constellation Plot.
Figure 4.3 Constellation Plot
This constellation plot arranges the countries as endpoints and each cluster join as a new point, with lines drawn that represent membership. We can see that the cluster that contains Afghanistan and Zaire is the most dissimilar cluster to the others.
Launch the Cluster Platform
Launch the Cluster platform by selecting Analyze > Multivariate Methods > Cluster. The Cluster Launch dialog shown in Figure 4.4 appears. The data table used is Birth Death Subset.jmp.
Figure 4.4 Hierarchical Cluster Launch Dialog
You can specify as many Y variables as you want by selecting the variables in the Select Columns list and clicking Y, Columns.
K‐Means clustering only supports numeric columns. Hierarchical clustering supports character columns as follows.
• For Ordinal columns, the data value used for clustering is just the index of the ordered category, treated as if it were continuous data. These data values are standardized like continuous columns.
• For Nominal columns, the categories must either match to contribute a distance of zero, or contribute a standardized distance of 1.
For Hierarchical clustering, select Hierarchical from the Options list. For K‐Means Clustering, Normal Mixtures, or Self Organizing Maps, select KMeans from the Options list.
Hierarchical Clustering
The Hierarchical option groups the points (rows) of a JMP table into clusters whose values are close to each other relative to those of other clusters. Hierarchical clustering is a process that starts with each point in its own cluster. At each step, the two clusters that are closest together are combined into a single cluster. This process continues until there is only one cluster containing all the points. This type of clustering is good for smaller data sets (a few hundred observations).
Hierarchical clustering enables you to sort clusters by their mean value by specifying an Ordering column. One way to use this feature is to complete a Principal Components analysis (using Multivariate) and save the first principal component to use as an Ordering column. The clusters are then sorted by these values.
For Hierarchical clustering, select Hierarchical from the Options list on the platform launch window and then select one of the clustering distance options: Average, Centroid, Ward, Single, Complete, or Fast Ward. The clustering methods differ in how the distance between two clusters is computed. These clustering methods are discussed under “Statistical Details for Hierarchical Clustering” on page 76.
The following options determine the form of the data that is used in calculating multivariate distances.
Data as usual Select this option if you have typical, rectangular data.
Data as summarized Select this option if you have data that is summarized by Object ID. The Data as summarized option calculates group means and treats them as input data.
Data is distance matrix Select this option if you have a data table of distances instead of raw data. If your raw data consists of n observations, the distance table should have n rows and n columns, with the values being the distances between the observations. The distance table needs to have an additional column giving a unique identifier (such as row number) that matches the column names of the other n columns. The diagonal elements of the table should be zero, since the distance between a point and itself is zero. The table can be square (both upper and lower elements), or it can be upper or lower triangular. If using a square table, the platform gives a warning if the table is not symmetric. For an example of what the distance table should look like, use the option “Save Distance Matrix” on page 62.
Data is stacked Select this option if you have data that is stacked. For example, data for one object that spans multiple rows is considered stacked. Stacked data is identified by Attribute ID and Object ID. The Standardize Data option is not appropriate for stacked data.
Standardize Data By default, data in each column are first standardized by subtracting the column mean and dividing by the column standard deviation. Uncheck the Standardize Data check box if you do not want the cluster distances computed on standardized values.
Standardize Robustly The Standardize Robustly option reduces the influence of outliers on estimates of the mean and standard deviation. Outliers in a column inflate the standard deviation, thereby deflating standardized data values and giving them less influence in determining multivariate distances.
The Standardize Robustly option uses Huber M‐estimates of the mean and standard deviation (Huber, 1964, Huber, 1973, and Huber and Ronchetti, 2009). For columns with outliers, this option gives the standardized values greater representation in determining multivariate distances. The option can result in isolated clusters of outliers.
Note: If both Standardize Data and Standardize Robustly are checked, each column is standardized by subtracting its robust column mean and dividing by its robust standard deviation. This is useful when columns represent different measurement scales or when observations tend to be outliers in only specific dimensions.
If Standardize Data is unchecked and Standardize Robustly is checked, the robust mean and standard deviation for the data in all columns combined are used to standardize each column. This can be useful when columns all represent the same measurement scale and when observations tend to be outliers in all dimensions.
Missing value imputation Use this option to impute missing values. Missing value imputation is done assuming that there are no clusters, that the data come from a single multivariate normal distribution, and that the values are missing completely at random. These assumptions are usually not reasonable in practice. Thus, this feature must be used with caution, but it can produce more informative results than discarding most of your data.
Using the Pairwise method, a single covariance matrix is formed for all the data. Then each missing value is imputed by a method that is equivalent to regression prediction using all the nonmissing variables as predictors. If you have categorical variables, the algorithm uses the category indices as dummy variables. If regression prediction fails due to a non‐positive‐definite covariance for the nonmissing values, JMP uses univariate means.
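Conceptually, the regression prediction described above is the conditional mean of a multivariate normal distribution. The Python sketch below imputes one row from a given mean vector and covariance matrix; the names are placeholders, and this is not JMP’s implementation.

    import numpy as np

    def impute_row(x, mu, sigma):
        # x: one row with NaNs; mu, sigma: estimated mean vector and covariance matrix
        x = np.asarray(x, dtype=float).copy()
        miss = np.isnan(x)
        if miss.any():
            obs = ~miss
            S_oo = sigma[np.ix_(obs, obs)]
            S_mo = sigma[np.ix_(miss, obs)]
            # E[x_miss | x_obs] = mu_miss + S_mo S_oo^{-1} (x_obs - mu_obs)
            x[miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mu[obs])
        return x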
Add Spatial Measures Use the Add Spatial Measures option when your data is stacked and contains two attributes that correspond to spatial coordinates (X and Y, for example). This option adds circle, pie, and streak spatial measures to aid in clustering defect patterns.
Hierarchical Cluster Report
The Hierarchical Cluster report displays the method used, a dendrogram tree diagram, and the Clustering History table. If you assigned a label in the launch window, its values identify each observation in the dendrogram.
The dendrogram is a tree diagram that lists each observation and shows which cluster it is in and when it entered the cluster. You can drag the small diamond‐shaped handle at either the top or bottom of the dendrogram to identify a given number of clusters. If you click on any cluster stem, all the members of the cluster highlight in the dendrogram and in the data table.
The scree plot beneath the dendrogram has a point for each cluster join. The ordinate is the distance that was bridged to join the clusters at each step. Often there is a natural break where the distance jumps up suddenly. These breaks suggest natural cutting points to determine the number of clusters.
The Clustering History table contains the history of the cluster, from each data point in its own cluster to all points in one cluster. The order of the clusters at each join is unimportant, essentially an accident of how the data was sorted.
Hierarchical Cluster Options
The Hierarchical Cluster red triangle menu includes the following commands.
Table 4.1 Description of the Hierarchical Cluster Control Panel
Color Clusters
Assigns colors to the rows of the data table corresponding to the cluster the row belongs to. Also colors the dendrogram according to the clusters. The colors automatically update if you change the number of clusters. Deselecting this option disconnects the number of clusters, but does not change the colors.
Mark Clusters
Assigns markers to the rows of the data table corresponding to the cluster the row belongs to. The markers automatically update if you change the number of clusters. Deselecting this option disconnects the number of clusters, but does not change the markers.
Number of Clusters
Prompts you to enter a number of clusters and positions the dendrogram slider to that number.
Cluster Criterion
Gives the Cubic Clustering Criterion for a range of numbers of clusters.
Show Dendrogram
Shows or hides the Dendrogram report.
Dendrogram Scale
Contains options for scaling the dendrogram. Distance Scale shows the actual joining distance between each join point, and is the same scale used on the plot produced by the Distance Graph command. Even Spacing shows the distance between each join point as equal. Geometric Spacing is useful when there are many clusters and you want the clusters near the top of the tree to be more visible than those at the bottom. (This option is the default for more than 256 rows.)
Distance Graph
Shows or hides the scree plot at the bottom of the dendrogram.
Show NCluster Handle
Shows or hides the handles on the dendrogram used to manually change the number of clusters.
Zoom to Selected Rows
Is used to zoom the dendrogram to a particular cluster after selecting the cluster on the dendrogram. Alternatively, you can double‐click on a cluster to zoom in on it.
Release Zoom
Returns the dendrogram to original view after zooming.
Pivot on Selected Cluster
Reverses the order of the two sub‐clusters of the currently selected cluster.
Color Map
Gives the option to add a color map showing the values of all the data colored across its value range. There are several color theme choices in a submenu. Another term for this feature is heat map.
Two way clustering
Adds clustering by column. A color map is automatically added with the column dendrogram at its base. The columns must be measured on the same scale.
Positioning
Provides options for changing the positions of dendrograms and labels.
Legend
Shows or hides a legend for the colors used in a color map. This option is available only if a color map is enabled.
More Color Map Columns
Adds a color map for specified columns.
Constellation Plot
Arranges the individuals as endpoints and each cluster join as a new point, with lines drawn that represent membership. The longer lines represent greater distance between clusters. To turn off the displayed labels, right‐click inside the Constellation Plot and select Show Labels.
Save Constellation Coordinates
Saves the coordinates of the constellation plot to the data table.
Save Clusters
Creates a data table column containing the cluster number.
Save Formula for Closest Cluster
Creates a data table column containing a formula that assigns each row to its closest cluster. This option calculates the squared Euclidean distance to each cluster’s centroid and selects the cluster that is closest. Note that this formula does not always reproduce the cluster assignment given by Hierarchical Clustering since the clusters are determined differently. However, the cluster assignment is very similar.
Save Display Order
Creates a data table column containing the order in which the row is presented in the dendrogram.
Save Cluster Hierarchy
Saves information needed if you are going to do a custom dendrogram with scripting. For each clustering, it outputs three rows, the joiner, the leader, and the result, with the cluster centers, size, and other information.
Save Cluster Tree
Saves information needed if you are going to compare cluster trees between JMP and SAS. For each clustering, it outputs one row for each new cluster, with the cluster’s size and other information.
Save Distance Matrix
Makes a new data table containing the distances between the observations.
Save Cluster Means
Creates a new data table containing the number of rows and the means of each column in each cluster.
Cluster Summary
Displays a table of cluster means, a graph of means by cluster for each column, and a table of RSquare values of each column against the current clusters.
Scatterplot Matrix
Creates a scatterplot matrix using all the variables.
Parallel Coord Plots
Creates a parallel coordinate plot for each cluster. For details about the plots, see the Basic Analysis book.
Script
Contains options that are available to all platforms. See Using JMP.
K-Means Clustering
The k‐means approach to clustering performs an iterative alternating fitting process to form the number of specified clusters. The k‐means method first selects a set of k points called cluster seeds as a first guess of the means of the clusters. Each observation is assigned to the nearest seed to form a set of temporary clusters. The seeds are then replaced by the cluster means, the points are reassigned, and the process continues until no further changes occur in the clusters. When the clustering process is finished, you see tables showing brief summaries of the clusters. The k‐means approach is a special case of a general approach called the EM algorithm; E stands for Expectation (the cluster means in this case), and M stands for maximization, which means assigning points to closest clusters in this case.
The k‐means method is intended for use with larger data tables, from approximately 200 to 100,000 observations. With smaller data tables, the results can be highly sensitive to the order of the observations in the data table.
K‐Means clustering only supports numeric columns. K‐Means clustering ignores model types (nominal and ordinal) and treats all numeric columns as continuous columns.
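For comparison outside of JMP, a minimal k‐means sketch with scikit‐learn is shown below. The data, the choice of three clusters, and the scaling step are placeholders, and the random seed only makes the example repeatable.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    X = np.random.default_rng(7).normal(size=(500, 4))   # placeholder data
    Z = StandardScaler().fit_transform(X)                # scale columns individually
    km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(Z)
    centers, labels = km.cluster_centers_, km.labels_    # cluster means and assignments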
To see the KMeans cluster launch dialog (see Figure 4.5), select KMeans from the Options menu on the platform launch dialog. The figure uses the Cytometry.jmp data table.
Figure 4.5 KMeans Launch Dialog
The dialog has the following options:
Columns Scaled Individually is used when variables do not share a common measurement scale, and you do not want one variable to dominate the clustering process. For example, one variable might have values that are between 0‐1000, and another variable might have values between 0‐10. In this situation, you can use the option so that the clustering process is not dominated by the first variable.
Johnson Transform balances highly skewed variables or brings outliers closer to the center of the rest of the values.
K-Means Control Panel
As an example of KMeans clustering, use the Cytometry.jmp sample data table. Add the variables CD3 and CD8 as Y, Columns variables. Select the KMeans option. Click OK. The Control Panel appears, and is shown in Figure 4.6.
Figure 4.6 Iterative Clustering Control Panel
The Iterative Clustering red‐triangle menu has the Save Transformed option. This saves the Johnson transformed variables to the data table. This option is available only if the Johnson Transform option is selected on the launch dialog (Figure 4.5).
The Control Panel has these options:
Table 4.2 Description of K‐Means Clustering Control Panel Options
Declutter
Locates outliers in the multivariate sense. Plots are produced giving distances between each point and that point’s nearest neighbor, the second nearest neighbor, up to the kth nearest neighbor. You are prompted to enter k. Beneath the plots are options to create a scatterplot matrix, save the distances to the data table, or not include rows that you have excluded in the clustering procedure. If an outlier is identified, you might want to exclude the row from the clustering process.
Method
Chooses the Clustering Method. The available methods are:
• KMeans Clustering is described in this section.
• Normal Mixtures is described in “Normal Mixtures” on page 68.
• Robust Normal Mixtures is described in “Robust Normal Mixtures” on page 70.
• Self Organizing Map is described in “Self Organizing Maps” on page 72.
Number of Clusters
Designates the number of clusters to form.
Optional range of clusters
Provides an upper bound for the number of clusters to form. If a number is entered here, the platform creates separate analyses for every integer between Number of Clusters and this one.
Single Step
Enables you to step through the clustering process one iteration at a time using a Step button, or automate the process using a Go button.
Use within-cluster std deviations
If you do not use this option, all distances are scaled by an overall estimate of the standard deviation of each variable. If you use this option, distances are scaled by the standard deviation estimated for each cluster.
Shift distances using sampling rates
Assumes that you have a mix of unequally sized clusters, and that points should give preference to being assigned to larger clusters because there is a greater prior probability that a point is from a larger cluster. This option is an advanced feature. The calculations for this option are implied, but not shown, for normal mixtures.
K-Means Report
Clicking Go in the Control Panel in Figure 4.6 produces the K‐Means report, shown in Figure 4.7.
Figure 4.7 K‐Means Report
The report gives summary statistics for each cluster:
• count of the number of observations
• means for each variable
• standard deviations for each variable
The Cluster Comparison report gives fit statistics to compare different numbers of clusters. For KMeans Clustering and Self Organizing Maps, the fit statistic is CCC (Cubic Clustering Criterion). For Normal Mixtures, the fit statistic is BIC or AICc. Robust Normal Mixtures does not provide a fit statistic.
K-Means Platform Options
These options are accessed from the red‐triangle menus, and apply to KMeans, Normal Mixtures, Robust Normal Mixtures, and Self‐Organizing Map methods.
Table 4.3 Descriptions of K‐Means Platform Options
Biplot
Shows a plot of the points and clusters in the first two principal components of the data. Circles are drawn around the cluster centers. The size of the circles is proportional to the count inside the cluster. The shaded area is the 90% density contour around the mean. Therefore, the shaded area indicates where 90% of the observations in that cluster would fall. Below the plot is an option to save the cluster colors to the data table.
Biplot Options
Contains options for controlling the Biplot.
• Show Biplot Rays enables you to show or hide the biplot rays.
• Biplot Ray Position enables you to position the biplot ray display. This is possible because biplot rays only signify the directions of the original variables in canonical space, and there is no special significance to where they are placed in the graph.
• Mark Clusters assigns markers to the rows of the data table corresponding to the clusters.
Biplot 3D
Shows a three‐dimensional biplot of the data. Three variables are needed to use this option.
Parallel Coord Plots
Creates a parallel coordinate plot for each cluster. For details about the plots, see the Basic Analysis book. The plot report has options for showing and hiding the data and means.
Scatterplot Matrix
Creates a scatterplot matrix using all the variables.
Save Colors to Table
Colors each row with a color corresponding to the cluster that it is in.
Save Clusters
Creates a new column with the cluster number that each row is assigned to. For normal mixtures, this is the cluster that is most likely.
Save Cluster Formula
Creates a new column with a formula to evaluate which cluster the row belongs to.
Save Mixture Probabilities
Creates a column for each cluster and saves the probability an observation belongs to that cluster in the column. This is available for Normal Mixtures and Robust Normal Mixtures clustering only.
Save Mixture Formulas
Creates columns with mixture probabilities, but stores their formulas in the column and needs additional columns to hold intermediate results for the formulas. Use this feature if you want to score probabilities for excluded data, or data that you add to the table. This is available for Normal Mixtures and Robust Normal Mixtures clustering only.
Save Density Formula
Saves the density formula in the data table. This is available for Normal Mixtures clustering only.
Simulate Clusters
Creates a new data table containing simulated clusters using the mixing probabilities, means, and standard deviations.
Remove
Removes the clustering report.
Normal Mixtures
Normal mixtures is an iterative technique, but rather than being a clustering method to group rows, it is more of an estimation method to characterize the cluster groups. Rather than classifying each row into a cluster, it estimates the probability that a row is in each cluster. See McLachlan and Krishnan (1997).
The normal mixtures approach to clustering predicts the proportion of responses expected within each cluster. The assumption is that the joint probability distribution of the measurement columns can be approximated using a mixture of multivariate normal distributions, which represent different clusters. The distributions have mean vectors and covariance matrices for each cluster.
Hierarchical and k‐means clustering methods work well when clusters are well separated, but when clusters overlap, assigning each point to one cluster is problematic. In the overlap areas, there are points from several clusters sharing the same space. It is especially important to use normal mixtures rather than k‐means clustering if you want an accurate estimate of the total population in each group, because it is based on membership probabilities, rather than arbitrary cluster assignments based on borders.
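The same idea can be sketched outside of JMP with scikit‐learn’s GaussianMixture, which fits the mixture by EM and returns membership probabilities rather than hard assignments. The data and the choice of three components are placeholders.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.default_rng(3).normal(size=(300, 4))   # placeholder data
    gm = GaussianMixture(n_components=3, covariance_type='full', random_state=3).fit(X)
    probs = gm.predict_proba(X)    # probability that each row belongs to each cluster
    hard = gm.predict(X)           # most likely cluster, if a single label is needed
    bic = gm.bic(X)                # compare fits with different numbers of clusters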
To perform Normal Mixtures, select that option on the Method menu of the Iterative Clustering Control Panel (Figure 4.6). After selecting Normal Mixtures, the control panel looks like Figure 4.8.
Figure 4.8 Normal Mixtures Control Panel
Some of the options on the panel are described in “K‐Means Control Panel” on page 64. The other options are described below:
Diagonal Variance is used to constrain the off‐diagonal elements of the covariance matrix to zero. In this case, the platform fits multivariate normal distributions that have no correlations between the variables.
This is sometimes necessary to avoid obtaining a singular covariance matrix when there are fewer observations than columns.
Outlier Cluster is used to fit a Uniform cluster to catch any outliers that do not fall into any of the Normal clusters. If this cluster is created, it is designated cluster 0.
Tours is the number of independent restarts of the estimation process, each with different starting values. This helps to guard against finding local solutions.
Maximum Iterations is the maximum number of iterations of the convergence stage of the EM algorithm.
Converge Criteria is the difference in the likelihood at which the EM iterations stop.
For an example of Normal Mixtures, open the Iris.jmp sample data table. This data set was first introduced by Fisher (1936) and includes four measurements (sepal length, sepal width, petal length, and petal width) made on samples of 50 flowers from each of three species of iris.
Note: Your results may not exactly match these results due to the random selection of initial centers.
On the Cluster launch dialog, assign all four variables to the Y, Columns role, select KMeans from the Method menu, and click OK. Select Normal Mixtures from the Method menu, specify 3 for the Number of Clusters, and click Go. The report is shown in Figure 4.9.
Figure 4.9 Normal Mixtures Report
The report gives summary statistics for each cluster:
• counts of observations and proportions
• means for each variable
• standard deviations for each variable
• correlations between variables
The Cluster Comparison report gives fit statistics to compare different numbers of clusters. For KMeans Clustering and Self Organizing Maps, the fit statistic is CCC (Cubic Clustering Criterion). For Normal Mixtures, the fit statistic is BIC or AICc. Robust Normal Mixtures does not provide a fit statistic.
Robust Normal Mixtures
The Robust Normal Mixtures option is available if you suspect you may have outliers in the multivariate sense. Since regular Normal Mixtures is sensitive to outliers, the Robust Normal Mixtures option uses a more robust method for estimating the parameters. For details, see “Statistical Details for Robust Estimation Methods” on page 78.
To perform Robust Normal Mixtures, select that option on the Method menu of the Iterative Clustering Control Panel (Figure 4.6). After selecting Robust Normal Mixtures, the control panel looks like Figure 4.10.
Figure 4.10 Robust Normal Mixtures Control Panel
Some of the options on the panel are described in “K‐Means Control Panel” on page 64. The other options are described below:
Diagonal Variance is used to constrain the off‐diagonal elements of the covariance matrix to zero. In this case, the platform fits multivariate normal distributions that have no correlations between the variables.
This is sometimes necessary to avoid obtaining a singular covariance matrix when there are fewer observations than columns.
Huber Coverage is a number between 0 and 1. Robust Normal Mixtures protects against outliers by downweighting them. Huber Coverage can be loosely thought of as the proportion of the data that is not considered outliers, and not downweighted. Values closer to 1 result in a larger proportion of the data not being downweighted. In other words, values closer to 1 protect only against the most extreme outliers. Values closer to 0 result in a smaller proportion of the data not being downweighted, and may falsely consider less extreme data points to be outliers.
Complete Tours is the number of times to restart the estimation process. This helps guard against the process finding a local solution.
Initial Guesses is the number of random starts within each tour. Random starting values for the parameters are used for each new start.
Max Iterations is the maximum number of iterations during the convergence stage. The convergence stage starts after all tours are complete. It begins at the optimal result out of all the starts and tours, and from there converges to a final solution.
Platform Options
For details about the red‐triangle options for Normal Mixtures and Robust Normal Mixtures, see “K‐Means Platform Options” on page 66.
Self Organizing Maps
The Self‐Organizing Maps (SOMs) technique was developed by Teuvo Kohonen (1989) and further extended by a number of other neural network enthusiasts and statisticians. The original SOM was cast as a learning process, like the original neural net algorithms, but the version implemented here is done in a much more straightforward way as a simple variation on k‐means clustering. In the SOM literature, this would be called a batch algorithm using a locally weighted linear smoother.
The goal of a SOM is not only to form clusters, but to form them in a particular layout on a cluster grid, such that points in clusters that are near each other in the SOM grid are also near each other in multivariate space. In classical k‐means clustering, the structure of the clusters is arbitrary, but in SOMs the clusters have the grid structure. This grid structure helps interpret the clusters in two dimensions: clusters that are close are more similar than distant clusters.
To create a Self Organizing Map, select that option on the Method menu of the Iterative Clustering Control Panel (Figure 4.6). After selecting Self Organizing Map, the control panel looks like Figure 4.11.
Figure 4.11 Self Organizing Map Control Panel
Some of the options on the panel are described in “K‐Means Control Panel” on page 64. The other options are described below:
N Rows is the number of rows in the cluster grid.
N Columns is the number of columns in the cluster grid.
Bandwidth determines the effect of neighboring clusters for predicting centroids. A higher bandwidth results in a more detailed fitting of the data.
Figure 4.12 Self Organizing Map Report
The report gives summary statistics for each cluster:
• counts of observations
• means for each variable
• standard deviations for each variable
The Cluster Comparison report gives fit statistics to compare different numbers of clusters. For KMeans Clustering and Self Organizing Maps, the fit statistic is CCC (Cubic Clustering Criterion). For Normal Mixtures, the fit statistic is BIC or AICc. Robust Normal Mixtures does not provide a fit statistic.
For details about the red‐triangle options for Self Organizing Maps, see “K‐Means Platform Options” on page 66.
Implementation Technical Details
The SOM implementation in JMP proceeds as follows:
• The first step is to obtain good initial cluster seeds that provide a good coverage of the multidimensional space. JMP uses principal components to determine the two directions that capture the most variation in the data.
• JMP then lays out a grid in this principal component space with its edges 2.5 standard deviations from the middle in each direction. The cluster seeds are formed by translating this grid back into the original space of the variables.
• The cluster assignment proceeds as with k‐means, with each point assigned to the cluster closest to it.
• The means are estimated for each cluster as in k‐means. JMP then uses these means to set up a weighted regression with each variable as the response in the regression, and the SOM grid coordinates as the regressors. The weighting function uses a ‘kernel’ function that gives large weight to the cluster whose center is being estimated, with smaller weights given to clusters farther away from the cluster in the SOM grid. The new cluster means are the predicted values from this regression.
• These iterations proceed until the process has converged. (A sketch of one batch iteration appears after this list.)
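The following is a minimal numpy sketch of one batch iteration of this scheme. It is an illustration under stated assumptions, not JMP's code: a Gaussian kernel in grid space is assumed for the weighting function, the bandwidth handling is simplified, and the function names are illustrative.

    # Sketch of one batch SOM iteration: k-means-style assignment, then a locally
    # weighted regression of the cluster means on the SOM grid coordinates.
    import numpy as np

    def som_batch_step(X, centers, grid, bandwidth=1.0):
        # X: (n, p) data; centers: (k, p) current cluster means; grid: (k, 2) grid coordinates
        k = centers.shape[0]
        # 1. Assign each point to the nearest cluster center, as in k-means.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        counts = np.bincount(assign, minlength=k)
        # 2. Compute the mean of the points assigned to each cluster.
        means = np.array([X[assign == j].mean(axis=0) if counts[j] > 0 else centers[j]
                          for j in range(k)])
        # 3. Smooth the means over the grid: for each cluster, fit a weighted regression of
        #    the means on the grid coordinates and take the prediction at that grid cell.
        G = np.column_stack([np.ones(k), grid])          # intercept plus grid coordinates
        new_centers = np.empty_like(centers, dtype=float)
        for j in range(k):
            gdist2 = ((grid - grid[j]) ** 2).sum(axis=1)
            w = counts * np.exp(-gdist2 / (2.0 * bandwidth ** 2))   # assumed kernel weights
            sw = np.sqrt(w)[:, None]
            beta, *_ = np.linalg.lstsq(sw * G, sw * means, rcond=None)
            new_centers[j] = G[j] @ beta
        return assign, new_centers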
Additional Examples of Cluster Analysis
Example of Self-Organizing Maps
This example uses the Iris.jmp sample data table, which includes measurements of sepal length, sepal width, petal length, and petal width for three species of irises.
1. Select Help > Sample Data Library and open Iris.jmp.
2. Select Analyze > Multivariate Methods > Cluster.
3. Assign Sepal length, Sepal width, Petal length, and Petal width as Y, Column variables.
4. Select K Means on the Options menu.
5. Uncheck Columns Scaled Individually.
6. Click OK.
7. Select Self Organizing Map from the Method menu on the Control Panel.
8. Even though we know the data consists of three species, set Optional range of clusters equal to 10.
9. Set N Rows equal to 1 and N Columns equal to 3.
10. Click Go.
The results are displayed in Figure 4.13. Notice that the number of clusters that gives the largest CCC is 3, which is the number of species. We can see that the classification was not perfect; each cluster should correspond to one species, with 50 rows in each.
Figure 4.13 Self Organizing Map Report Window
11. In the data table, select the Species column and select Rows > Color or Mark by Column.
12. Select the Classic option under Markers.
13. Click Go.
14. From the SOM Grid 3 by 1 red triangle menu, select Biplot.
Figure 4.14 Biplot of Iris Self Organizing Map
We can see that all rows from Cluster 3 are correctly identified as the setosa species. The other two species, virginica and versicolor, overlap slightly and can be mistaken for each other.
15. From the SOM Grid 3 by 1 red triangle menu, select Parallel Coord Plots.
Figure 4.15 Parallel Coordinate Plot for Iris Data
We can see from the Parallel Coordinate Plot in Figure 4.15 that clusters 1 and 2 (species virginica and versicolor, respectively) are similar to each other in characteristics. These similarities can make it hard to distinguish between the species. However, the SOM did a relatively good job identifying and classifying these three species.
Statistical Details
Statistical Details for Hierarchical Clustering
The following description of hierarchical clustering methods gives distance formulas that use the following notation. Lowercase symbols generally pertain to observations and uppercase symbols to clusters.
n is the number of observations
v is the number of variables
x_i is the ith observation
C_K is the Kth cluster, a subset of {1, 2, ..., n}
N_K is the number of observations in C_K
\bar{x} is the sample mean vector
\bar{x}_K is the mean vector for cluster C_K
\|x\| is the square root of the sum of the squares of the elements of x (the Euclidean length of the vector x)
d(x_i, x_j) = \|x_i - x_j\|^2
Average Linkage In average linkage, the distance between two clusters is the average distance between pairs of observations, one in each cluster. Average linkage tends to join clusters with small variances and is slightly biased toward producing clusters with the same variance. See Sokal and Michener (1958).
Distance for the average linkage cluster method is
D_{KL} = \sum_{i \in C_K} \sum_{j \in C_L} \frac{d(x_i, x_j)}{N_K N_L}
Centroid Method In the centroid method, the distance between two clusters is defined as the squared Euclidean distance between their means. The centroid method is more robust to outliers than most other hierarchical methods but in other respects might not perform as well as Ward’s method or average linkage. See Milligan (1980).
Distance for the centroid method of clustering is
D_{KL} = \| \bar{x}_K - \bar{x}_L \|^2
Ward’s In Ward’s minimum variance method, the distance between two clusters is the ANOVA sum of squares between the two clusters added up over all the variables. At each generation, the within‐cluster sum of squares is minimized over all partitions obtainable by merging two clusters from the previous generation. The sums of squares are easier to interpret when they are divided by the total sum of squares to give the proportions of variance (squared semipartial correlations).
Ward’s method joins clusters to maximize the likelihood at each level of the hierarchy under the assumptions of multivariate normal mixtures, spherical covariance matrices, and equal sampling probabilities.
Ward’s method tends to join clusters with a small number of observations and is strongly biased toward producing clusters with approximately the same number of observations. It is also very sensitive to outliers. See Milligan (1980).
Distance for Ward’s method is
D_{KL} = \frac{\| \bar{x}_K - \bar{x}_L \|^2}{\frac{1}{N_K} + \frac{1}{N_L}}
Single Linkage In single linkage, the distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster. Single linkage has many desirable theoretical properties. See Jardine and Sibson (1976), Fisher and Van Ness (1971), and Hartigan (1981). Single linkage has, however, fared poorly in Monte Carlo studies. See Milligan (1980). By imposing no constraints on the shape of clusters, single linkage sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters. Single linkage tends to chop off the tails of distributions before separating the main clusters. See Hartigan (1981). Single linkage was originated by Florek et al. (1951a, 1951b) and later reinvented by McQuitty (1957) and Sneath (1957).
Distance for the single linkage cluster method is
D_{KL} = \min_{i \in C_K} \, \min_{j \in C_L} \, d(x_i, x_j)
Complete Linkage In complete linkage, the distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. Complete linkage is strongly biased toward producing clusters with approximately equal diameters and can be severely distorted by moderate outliers. See Milligan (1980).
Distance for the Complete linkage cluster method is
D_{KL} = \max_{i \in C_K} \, \max_{j \in C_L} \, d(x_i, x_j)
Fast Ward is a way of applying Ward’s method more quickly for large numbers of rows. It is used automatically whenever there are more than 2000 rows.
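As a concrete check of these formulas, the following numpy sketch evaluates the five distances for two clusters of points, with d(x_i, x_j) taken as squared Euclidean distance to match the notation above. It is an illustration only, not the platform's internal code, and the function names are illustrative.

    # Sketch: evaluate the hierarchical cluster distances above for two clusters of points.
    import numpy as np

    def pairwise_sq_dist(A, B):
        # d(x_i, x_j) as squared Euclidean distance for every pair (one point from each cluster)
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

    def cluster_distances(CK, CL):
        d = pairwise_sq_dist(CK, CL)
        nK, nL = len(CK), len(CL)
        xbarK, xbarL = CK.mean(axis=0), CL.mean(axis=0)
        return {
            "average":  d.sum() / (nK * nL),
            "centroid": ((xbarK - xbarL) ** 2).sum(),
            "ward":     ((xbarK - xbarL) ** 2).sum() / (1.0 / nK + 1.0 / nL),
            "single":   d.min(),
            "complete": d.max(),
        }

    # Example usage:
    # CK = np.array([[0.0, 0.0], [1.0, 0.0]]); CL = np.array([[4.0, 3.0], [5.0, 3.0]])
    # print(cluster_distances(CK, CL))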
Statistical Details for Robust Estimation Methods
Normal Mixtures uses the EM algorithm to do the fitting because it is more stable than the Newton‐Raphson algorithm. Additionally, JMP uses a Bayesian regularized version of the EM algorithm, which smoothly handles cases where the covariance matrix is singular. Since the estimates are heavily dependent on initial guesses, the platform goes through a number of tours, each with randomly selected points as initial centers.
Doing multiple tours makes the estimation process somewhat expensive, so considerable patience is required for large problems. Controls enable you to specify the tour and iteration limits.
Additional Details for Robust Normal Mixtures
Because Normal Mixtures is sensitive to outliers, JMP offers an outlier robust alternative called Robust Normal Mixtures. This uses a robust method of estimating the normal parameters. JMP computes the estimates via maximum likelihood with respect to a mixture of Huberized normal distributions (a class of modified normal distributions that was tailor‐made to be more outlier resistant than the normal distribution).
The Huberized Gaussian distribution has pdf \phi_k(x), defined as

\phi_k(x) = \frac{\exp(-\rho(x))}{c_k}

where

\rho(x) = \begin{cases} x^2 / 2 & \text{if } |x| \le k \\ k|x| - k^2 / 2 & \text{if } |x| > k \end{cases}

and

c_k = \sqrt{2\pi} \left[ \Phi(k) - \Phi(-k) \right] + \frac{2 \exp(-k^2/2)}{k}

Here \Phi denotes the standard normal cumulative distribution function. So, in the limit as k becomes arbitrarily large, \phi_k(x) tends toward the normal PDF. As k \to 0, \phi_k(x) tends toward the exponential (Laplace) distribution.
The regularization parameter k is set so that P(Normal(x) < k) = Huber Coverage, where Normal(x) indicates a multivariate normal variate. Huber Coverage is a user field, which defaults to 0.90.
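A minimal Python sketch of these quantities follows. It is an illustration only (the function names are illustrative): the standard normal CDF is computed from the error function, and k is found from Huber Coverage by bisection, following the relation given above.

    # Sketch of the Huberized Gaussian density and the choice of k from Huber Coverage.
    import math

    def normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def rho(x, k):
        # Quadratic in the center, linear in the tails.
        return x * x / 2.0 if abs(x) <= k else k * abs(x) - k * k / 2.0

    def c_k(k):
        # Normalizing constant of the Huberized density.
        return math.sqrt(2.0 * math.pi) * (normal_cdf(k) - normal_cdf(-k)) \
               + 2.0 * math.exp(-k * k / 2.0) / k

    def phi_k(x, k):
        return math.exp(-rho(x, k)) / c_k(k)

    def k_from_coverage(coverage=0.90, lo=1e-6, hi=10.0):
        # k satisfies P(Normal < k) = Huber Coverage; solve by bisection on the normal CDF.
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if normal_cdf(mid) < coverage else (lo, mid)
        return 0.5 * (lo + hi)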
Chapter 5
Principal Components
Reduce the Dimensionality of Your Data
The purpose of principal component analysis is to derive a small number of independent linear combinations (principal components) of a set of measured variables that capture as much of the variability in the original variables as possible. Principal component analysis is a dimension‐reduction technique. It can be used as an exploratory data analysis tool, but is also useful for constructing predictive models, as in principal components analysis regression (also known as PCA regression or PCR).
For data with a very large number of variables, the Principal Components platform provides an estimation method called the Wide method that enables you to calculate principal components in short computing times. These principal components can then be used in PCA regression.
The Principal Components platform also supports factor analysis. JMP offers several types of orthogonal and oblique factor analysis‐style rotations to help interpret the extracted components. For factor analysis, see the Factor Analysis chapter in the Consumer Research book.
Figure 5.1 Example of Principal Components
Contents
Overview of Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Example of Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Launch the Principal Components Platform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Principal Components Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Principal Components Report Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Principal Components Options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Wide Principal Components Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Cluster Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Overview of Principal Component Analysis
A principal component analysis models the variation in a set of variables in terms of a smaller number of independent linear combinations (principal components) of those variables.
If you want to see the arrangement of points across many correlated variables, you can use principal component analysis to show the most prominent directions of the high‐dimensional data. Using principal component analysis reduces the dimensionality of a set of data. Principal components representation is important in visualizing multivariate data by reducing it to graphable dimensions. Principal components is a way to picture the structure of the data as completely as possible by using as few variables as possible.
For p variables, p principal components are formed as follows:
• The first principal component is the linear combination of the standardized original variables that has the greatest possible variance.
• Each subsequent principal component is the linear combination of the variables that has the greatest possible variance and is uncorrelated with all previously defined components.
Each principal component is a linear combination of the original variables, with coefficients given by an eigenvector of the correlation matrix (or the covariance matrix, or the sum of squares and cross products matrix). The eigenvalues represent the variance of each component.
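The following numpy sketch illustrates this construction for the default extraction on correlations. It is a demonstration of how the eigenvalues, eigenvectors, and scores relate, not the platform's code, and the function name is illustrative.

    # Sketch: principal components extracted from the correlation matrix.
    import numpy as np

    def principal_components_on_correlations(X):
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # center and scale each variable
        R = np.corrcoef(X, rowvar=False)                   # correlation matrix
        eigvals, eigvecs = np.linalg.eigh(R)
        order = np.argsort(eigvals)[::-1]                  # largest variance first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        scores = Z @ eigvecs                               # principal component scores
        pct = 100.0 * eigvals / eigvals.sum()              # percent of variation per component
        return eigvals, eigvecs, scores, pct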
The Principal Components platform allows you to conduct your analysis on the correlation matrix, the covariance matrix, or the unscaled data. You can also conduct Factor Analysis within the Principal Components platform. See the Factor Analysis chapter in the Consumer Research book for details.
Example of Principal Component Analysis
To view an example Principal Component Analysis report for a data table for two factors:
1. Select Help > Sample Data Library and open Solubility.jmp.
2. Select Analyze > Multivariate Methods > Principal Components.
The Principal Components launch window appears.
3. Select all of the continuous columns and click Y, Columns.
4. Keep the default Estimation Method.
5. Click OK.
The Principal Components on Correlations report appears.
Figure 5.2 Principal Components on Correlations Report
The report gives the eigenvalues and a bar chart of the percent of the variation accounted for by each principal component. There is a Score Plot and a Loadings Plot as well. See “Principal Components Report” on page 88 for details.
Launch the Principal Components Platform
Launch the Principal Components platform by selecting Analyze > Multivariate Methods > Principal Components. Principal Component analysis is also available using the Multivariate and the Scatterplot 3D platforms.
The example described in “Example of Principal Component Analysis” on page 83 uses all of the continuous variables from the Solubility.jmp sample data table.
Figure 5.3 Principal Components Launch Window
Y, Columns Lists the variables to analyze for components.
Weight and Freq Enables you to weight the analysis to account for pre‐summarized data.
By
Creates a Principal Component report for each value specified by the By column so that you can perform separate analyses for each group.
Estimation Method Lists different methods for calculating the correlations. Several of these methods address the treatment of missing data. See “Estimation Methods” on page 85.
Estimation Methods
Use the estimation method that addresses your specific needs. Methods are available to handle missing values, outliers, and wide data.
You can also estimate missing values in the following ways:
• Use the Impute Missing Data option found under Multivariate Methods > Multivariate. See “Impute Missing Data” on page 44 in the “Correlations and Multivariate Techniques” chapter.
• Use the Multivariate Normal Imputation or Multivariate SVD Imputation utilities found under Cols > Modeling Utilities > Explore Missing Values. See the Basic Analysis book for details.
Default
The Default option uses either the Row‐wise, Pairwise, or REML methods:
• Row-wise is used for data tables with no missing values.
• Pairwise is used in the following circumstances:
‒ the data table has more than 10 columns or more than 5,000 rows and has missing values
‒ the data table has more columns than rows and has missing values
• REML is used otherwise.
REML
REML (restricted maximum likelihood) estimates are less biased than the ML (maximum likelihood) estimation method. The REML method maximizes marginal likelihoods based upon error contrasts. The REML method is often used for estimating variances and covariances. The REML method in the Multivariate platform is the same as the REML estimation of mixed models for repeated measures data with an unstructured covariance matrix. See the documentation for SAS PROC MIXED about REML estimation of mixed models.
REML uses all of your data, even if missing values are present, and is most useful for smaller datasets. Because of the bias‐correction factor, this method is slow if your dataset is large and
there are many missing data values. If there are no missing cells in the data, then the REML estimate is equivalent to the sample covariance matrix.
Note: If you select REML and your data table contains more columns than rows, JMP switches the Estimation Method. If there are no missing values, the Estimation Method switches to Row‐wise. If there are missing values, then the Estimation Method switches to Pairwise.
ML
The maximum likelihood estimation method (ML) is useful for large data tables with missing cells. The ML estimates are similar to the REML estimates, but the ML estimates are generated faster. Observations with missing values are not excluded. For small data tables, REML is preferred over ML because REML’s variance and covariance estimates are less biased.
Note: If you select ML and your data table contains more columns than rows, JMP switches the Estimation Method. If there are no missing values, the Estimation Method switches to Row‐wise. If there are missing values, then the Estimation Method switches to Pairwise.
Robust
Robust estimation is useful for data tables that might have outliers. For statistical details, see “Robust” on page 45.
Note: If you select Robust and your data table contains more columns than rows, JMP switches the Estimation Method. If there are no missing values, the Estimation Method switches to Row‐wise. If there are missing values, then the Estimation Method switches to Pairwise.
Row-wise
Row‐wise estimation does not use observations containing missing cells. This method is useful in the following situations:
• Checking compatibility with JMP versions earlier than JMP 8. Row‐wise estimation was the only estimation method available before JMP 8.
• Excluding any observations that have missing data.
Pairwise
Pairwise estimation performs correlations for all rows for each pair of columns with nonmissing values.
Wide Method
Note: The Wide method only extracts components based on the correlation structure. The On Covariance and On Unscaled options are not available.
The Wide method is useful when you have a very large number of columns in your data. It uses a computationally efficient algorithm that avoids calculating the covariance matrix. The algorithm is based on the singular value decomposition. For additional background, see “Wide Linear Methods and the Singular Value Decomposition” on page 171 in the “Statistical Details” appendix.
Consider the following notation:
• n = number of rows
• p = number of variables
• X = n by p matrix of data values
The number of nonzero eigenvalues, and consequently the number of principal components, equals the rank of the correlation matrix of X. The number of nonzero eigenvalues cannot exceed the smaller of n and p.
When you select the Wide method, the data are standardized. To standardize a value, subtract its mean and divide by its standard deviation. Denote the n by p matrix of standardized data values by Xs. Then the covariance matrix of the standardized data is the correlation matrix of X and it is given as follows:
Cov = X_s' X_s / (n − 1)
Using the singular value decomposition, X_s is written as U Diag(Λ) V'. This representation is used to obtain the eigenvectors and eigenvalues of X_s' X_s. The principal components, or scores, are given by X_s V.
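A minimal numpy sketch of this computation follows. It is an illustration of the idea rather than JMP's Wide implementation (the function name is illustrative): the singular values of the standardized data give the eigenvalues, and the right singular vectors give the eigenvectors, without ever forming the p-by-p correlation matrix.

    # Sketch of the Wide approach: principal components via the SVD of the standardized data.
    import numpy as np

    def wide_principal_components(X):
        n = X.shape[0]
        Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardized data
        U, s, Vt = np.linalg.svd(Xs, full_matrices=False)    # Xs = U Diag(s) V'
        eigvals = s ** 2 / (n - 1)                           # eigenvalues of the correlation matrix
        scores = Xs @ Vt.T                                   # principal components (scores) = Xs V
        return eigvals, Vt.T, scores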
Note: If there are missing values and you select the Wide method, then the rows that contain missing values are deleted and the Wide method is applied to the remaining rows.
Note: When you select the Default estimation method and enter more than 500 variables as Y, Columns, a JMP Alert recommends that you switch to the Wide estimation method. This is because computation time can be considerable when you use the other methods with a large number of columns. Click Wide to switch to the Wide method. Click Continue to use the method you originally selected.
Principal Components Report
The initial Principal Components report is for an analysis on Correlations. It summarizes the variation of the specified Y variables with principal components. See Figure 5.4. You can switch to an analysis based on the covariance matrix or unscaled data by selecting the Principal Components option from the red triangle menu.
Based on your selection, the principal components are derived from an eigenvalue decomposition of one of the following:
• the correlation matrix
• the covariance matrix
• the sum of squares and cross products matrix for the unscaled and uncentered data
The details in the report show how the principal components absorb the variation in the data. The principal component scores are linear combinations of the variables, with coefficients given by the eigenvectors.
Figure 5.4 Principal Components on Correlations Report
The report gives the eigenvalues and a bar chart of the percent of the variation accounted for by each principal component. There is a Score Plot and a Loadings Plot as well. The eigenvalues indicate the total number of components extracted based on the amount of variance contributed by each component.
The Score Plot graphs each component’s calculated values in relation to the other, adjusting each value for the mean and standard deviation.
The Loadings Plot graphs the unrotated loading matrix between the variables and the components. The closer the value is to 1, the greater the effect of the component on the variable.
Principal Components Report Options
The options you see in the red triangle menu depend upon the estimation method you chose in the launch window:
• If you selected any method other than Wide, the Principal Components: on Correlations report initially appears. (The title of this report changes if you select Principal Components > on Covariances or on Unscaled from the red triangle menu.) See “Principal Components Options” on page 89.
• If you selected the Wide method, the Wide Principal Components report appears. See “Wide Principal Components Options” on page 96.
Principal Components Options
For estimation methods other than Wide, the Principal Components red triangle menu contains the following options:
Principal Components Enables you to create the principal components based on Correlations, Covariances, or Unscaled.
Correlations Displays the matrix of correlations between the variables.
Note: The values on the diagonals are 1.0.
Figure 5.5 Correlations
Covariance Matrix Displays the covariances of the variables.
Figure 5.6 Covariance Matrix
Eigenvalues Lists the eigenvalue that corresponds to each principal component in order from largest to smallest. The eigenvalues represent a partition of the total variation in the multivariate sample.
The scaling of the eigenvalues depends on which matrix you select for extraction of principal components:
‒ For the on Correlations option, the eigenvalues are scaled to sum to the number of variables.
‒ For the on Covariances options, the eigenvalues are not scaled.
‒ For the on Unscaled option, the eigenvalues are divided by the total number of observations.
If you select the Bartlett Test option from the red triangle menu, hypothesis tests (Figure 5.9) are given for each eigenvalue (Jackson, 2003).
Figure 5.7 Eigenvalues
Eigenvectors Shows columns of values that correspond to the eigenvectors for each of the principal components, in order, from left to right. Using these coefficients to form a linear combination of the original variables produces the principal component variables. Following the standard convention, eigenvectors have norm 1.
Figure 5.8 Eigenvectors
Bartlett Test Shows the results of the homogeneity test (appended to the Eigenvalues table) to determine if the eigenvalues have the same variance by calculating the Chi‐square, degrees of freedom (DF), and the p‐value (prob > ChiSq) for the test. See Bartlett (1937, 1954).
Figure 5.9 Bartlett Test
Loading Matrix Shows columns corresponding to the loadings for each component. These values are graphed in the Loading Plot.
The scaling of the loadings depends on which matrix you select for extraction of principal components:
‒ For the on Correlations option, the ith column of loadings is the ith eigenvector multiplied by the square root of the ith eigenvalue. The i,jth loading is the correlation between the ith variable and the jth principal component.
‒ For the on Covariances option, the jth entry in the ith column of loadings is the ith eigenvector multiplied by the square root of the ith eigenvalue and divided by the standard deviation of the jth variable. The i,jth loading is the correlation between the ith variable and the jth principal component.
‒ For the on Unscaled option, the jth entry in the ith column of loadings is the ith eigenvector multiplied by the square root of the ith eigenvalue and divided by the standard error of the jth variable. The standard error of the jth variable is the jth diagonal entry of the sum of squares and cross products matrix divided by the number of rows (X’X/n).
Note: When you are analyzing the unscaled data, the i,jth loading is not the correlation between the ith variable and the jth principal component.
Figure 5.10 Loading Matrix
Note: The degree of transparency for the table values indicates the distance of the absolute loading value from zero. Absolute loading values that are closer to zero are more transparent than absolute loading values that are farther from zero.
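As a small illustration of the “on Correlations” scaling described above (a sketch, not the platform's code; the function name is illustrative), the loadings can be computed by multiplying each eigenvector by the square root of its eigenvalue; each resulting column then equals the correlations between the variables and that principal component.

    # Sketch of the "on Correlations" loading scaling.
    import numpy as np

    def loadings_on_correlations(X):
        R = np.corrcoef(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(R)
        order = np.argsort(eigvals)[::-1]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        # i-th column of loadings = i-th eigenvector times sqrt of the i-th eigenvalue
        return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))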
Formatted Loading Matrix Shows columns corresponding to the loadings for each component. The table is sorted so that the variables are listed in order of decreasing loadings on the first principal component.
Figure 5.11 Formatted Loading Matrix
Tip: Use the sliders to dim loadings whose absolute values fall below your selected value and to set the degree of transparency for the loadings.
Summary Plots Shows or hides the summary information produced in the initial report. See “Principal Components Report” on page 88.
Tip: Select the tips of arrows in the loading plot to select the corresponding columns in the data table. Hold down CTRL and click on an arrow tip to deselect the column.
Biplot Shows a plot that overlays the Score Plot and the Loading Plot for the specified number of components.
Figure 5.12 Biplot
Scree Plot Shows a graph of the eigenvalue for each component. This scree plot helps in visualizing the dimensionality of the data space.
Figure 5.13 Scree Plot
Score Plot Shows a matrix of scatterplots of the scores for pairs of principal components for the specified number of components. This plot is shown in Figure 5.4 (left‐most plot).
Loading Plot Shows a matrix of two‐dimensional representations of factor loadings for the specified number of components. The loading plot labels variables if the number of variables is 30 or fewer. If there are more than 30 variables, the labels are off by default. This information is shown in Figure 5.4 (right‐most plot).
Tip: Select the tips of arrows in the loading plot to select the corresponding columns in the data table. Hold down CTRL and click on an arrow tip to deselect the column.
Score Plot with Imputation Imputes any missing values and creates a score plot. This option is available only if there are missing values.
3D Score Plot Shows a 3D scatterplot of any principal component scores. When you first invoke the command, the first three principal components are presented.
Figure 5.14 Scatterplot 3D Score Plot
(Controls in the plot let you switch among Principal Components, Rotated Components, and Data Columns; select specific axis contents; and cycle among all axis content possibilities.)
The variables show as rays in the plot. These rays, called biplot rays, approximate the variables as a function of the principal components on the axes. If there are only two or three variables, the rays represent the variables exactly. The length of the ray corresponds to the eigenvalue or variance of the principal component.
Display Options Allows you to show or hide arrows on all plots that can display arrows.
Factor Analysis Performs factor analysis‐style rotations of the principal components, or factor analysis. See the Factor Analysis chapter in the Consumer Research book for details.
Cluster Variables Performs a cluster analysis on the variables by dividing the variables into non‐overlapping clusters. Variable clustering provides a method for grouping similar variables into representative groups. Each cluster can then be represented by a single component or variable. The component is a linear combination of all variables in the cluster. Alternatively, the cluster can be represented by the variable identified to be the most representative member in the cluster. See “Cluster Variables” on page 96.
Note: Cluster Variables uses correlation matrices for all calculations, even when you select the on Covariance or on Unscaled options.
Figure 5.15 Cluster Summary
Save Principal Components Saves the principal components to the data table with a formula for computing the components. The formula cannot evaluate rows with any missing values.
The calculation for the principal components depends on which matrix you select for extraction of principal components:
‒ For the on Correlations option, the ith principal component is a linear combination of the centered and scaled observations using the entries of the ith eigenvector as coefficients.
‒ For the on Covariances options, the ith principal component is a linear combination of the centered observations using the entries of the ith eigenvector as coefficients.
‒ For the on Unscaled option, the ith principal component is a linear combination of the raw observations using the entries of the ith eigenvector as coefficients.
Save Rotated Components Saves the rotated components to the data table, with a formula for computing the components. This option appears after the Factor Analysis option is used. The formula cannot evaluate rows with missing values.
Save Principal Components with Imputation Imputes missing values, and saves the principal components to the data table. The column contains a formula for doing the imputation and computing the principal components. This option is available only if there are missing values.
Save Rotated Components with Imputation Imputes missing values and saves the rotated components to the data table. The column contains a formula for doing the imputation and computing the rotated components. This option appears after the Factor Analysis option is used and if there are missing values.
Script Contains options that are available to all platforms. See the Using JMP book.
Wide Principal Components Options
For the Wide estimation method, the Wide Principal Components red triangle menu contains the following options:
Save Principal Components Saves principal components (scores) and their formulas in terms of the unstandardized variables to the data table. You are prompted to specify how many components to save.
The ith principal component is a linear combination of the centered and scaled observations using the entries of the ith eigenvector as coefficients.
In the data table, the principal components are given in columns called Prin<number>. The formulas depend on an additional saved column called Prin Data Matrix. This column contains the difference between the vector of the raw data, given by a Matrix expression, and the vector of means.
Note: The formulas cannot evaluate rows with missing values.
Save Principal Component Script Produces a script that allows you to score new data in a different data table. You are prompted to specify how many components to save. The script contains the formulas for the columns that were created using the Save Principal Components command.
Script Contains options that are available to all platforms. See the Using JMP book.
Cluster Variables
Note: Cluster Variables uses correlation matrices for all calculations, even when you select the on Covariance or on Unscaled options.
Principal components analysis constructs components that are linear combinations of all the variables in the analysis. In contrast, the Cluster Variables option constructs components that are linear combinations of variables in a cluster of similar variables. The entire set of variables is partitioned into clusters. For each cluster, a cluster component is constructed using the first principal component of the variables in that cluster. This is the linear combination that explains as much of the variation as possible among the variables in that cluster.
You can use the Cluster Variables option as a dimension‐reduction method. A substantial part of the variation in a large set of variables can often be represented by cluster components or by the most representative variable in the cluster. These new variables can then be used in predictive or other modeling techniques. The new cluster‐based variables are usually more interpretable than principal components based on all the variables.
Principal components constructed from a common set of variables are orthogonal. However, cluster components are not orthogonal because they are constructed from distinct sets of variables.
Variable Clustering Algorithm
The clustering algorithm iteratively splits clusters of variables and reassigns variables to clusters until no more splits are possible. The initial cluster consists of all variables. The algorithm was developed by SAS and is implemented in PROC VARCLUS (SAS Institute Inc., 2011).
The iterative steps in the algorithm are as follows:
1. For all clusters, do the following:
a. Compute the principal components for the variables in each cluster.
b. If the second eigenvalues for all of the clusters are less than one, then terminate the algorithm.
2. Partition the cluster whose second eigenvalue is the largest (and greater than 1) into two new clusters as follows:
a. Rotate the principal components for the variables in the current cluster using an orthoblique rotation.
b. Define one cluster to consist of the variables in the current cluster whose squared correlations to the first rotated principal component are higher than their squared correlations to the second principal component.
c. Define the other cluster to consist of the remaining variables in the original cluster. These are the variables that are more highly correlated with the second principal component.
d. Compute the principal components of the two new clusters.
3. Test to see if any variable in the data set should be assigned to a different cluster. For each variable, do the following:
a. Compute the variable’s squared correlation with the first principal component for each cluster.
b. Place the variable in the cluster for which its squared correlation is the largest.
Note: An orthoblique rotation is also known as a raw quartimax rotation. See Harris and Kaiser (1964).
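The following numpy sketch illustrates the reassignment test in step 3 (an illustration only; it omits the splitting and rotation machinery, and the function names are illustrative): each variable's squared correlation with every cluster's first principal component is computed, and the variable is placed in the cluster where that squared correlation is largest.

    # Sketch of the variable reassignment step in variable clustering.
    import numpy as np

    def first_pc_scores(Xc):
        # First principal component (on correlations) of the variables in one cluster.
        Z = (Xc - Xc.mean(axis=0)) / Xc.std(axis=0, ddof=1)
        if Xc.shape[1] == 1:
            return Z[:, 0]
        R = np.corrcoef(Xc, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(R)
        return Z @ eigvecs[:, np.argmax(eigvals)]

    def reassign_variables(X, clusters):
        # X: (n, p) data; clusters: list of lists of column indices
        comps = [first_pc_scores(X[:, idx]) for idx in clusters]
        new_clusters = [[] for _ in clusters]
        for j in range(X.shape[1]):
            r2 = [np.corrcoef(X[:, j], c)[0, 1] ** 2 for c in comps]
            new_clusters[int(np.argmax(r2))].append(j)
        return new_clusters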
Variable Clustering Options
Tip: In any of the Variable Clustering reports, select rows in order to select the corresponding columns in the data table. Hold down CTRL and click the row to deselect the column in the data table.
The red triangle menu next to Variable Clustering contains the following options:
Cluster Summary This report gives the following:
‒ Cluster is a cluster identifier.
‒ Number of Members is the number of variables in the cluster.
‒ Most Representative Variable is the cluster variable that has the largest squared correlation with its cluster component.
‒ Cluster Proportion of Variance Explained is the squared correlation of the Most Representative Variable with its cluster component.
‒ Total Proportion of Variation Explained is the squared correlation of the Most Representative Variable in the cluster with the first principal component for all variables in that cluster.
Cluster Members This report gives the following:
‒ Cluster is the cluster identifier.
‒ Members lists the variables included in the cluster.
‒ RSquare with Own Cluster is the squared correlation of the variable with its cluster component.
‒ RSquare with Next Closest is the squared correlation of the variable with the cluster component for its next closest cluster. The next closest cluster is the cluster for which the squared correlation of the variable with the cluster component is the second highest.
‒ 1 ‐ RSquare Ratio is a measure of the relative closeness between the cluster to which a variable belongs and its next closest cluster. It is defined as follows:
(1 ‐ RSquare with Own Cluster)/(1 ‐ RSquare with Next Closest)
Cluster Components Shows the Standardized Components report that gives the coefficients that define the cluster components. These are the eigenvectors of the first principal component within each cluster.
Save Cluster Components Saves columns called Cluster <i> Components to the data table. Each column is given by a formula that expresses the cluster component in terms of the uncentered and unscaled variables.
Launch Fit Model Opens a Model Specification window with the Most Representative Variables for each cluster entered in the Construct Model Effects list. Use this option to construct models based on the Most Representative Variables.
Chapter 6
Discriminant Analysis
Predict Classifications Based on Continuous Variables
Discriminant analysis predicts membership in a group or category based on observed values of several continuous variables. Specifically, discriminant analysis predicts a classification (X) variable (nominal or ordinal) based on known continuous responses (Y). The data for a discriminant analysis consist of a sample of observations with known group membership together with their values on the continuous variables.
For example, you might attempt to classify loan applicants into three credit risk categories (X): good, moderate, or bad. You might use continuous variables such as current salary, years in current job, age, and debt burden (Ys) to predict an individual’s credit risk category. You could build a predictive model to classify an individual into a credit risk category using discriminant analysis.
Features of the Discriminant platform include the following:
• A stepwise selection option to help choose variables that discriminate well.
• A choice of fitting methods: Linear, Quadratic, Regularized, and Wide Linear.
• A canonical plot and a misclassification summary.
• Discriminant scores and squared distances to each group.
• Options to save prediction distances and probabilities to the data table.
Figure 6.1 Canonical Plot
Contents
Discriminant Analysis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Example of Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Discriminant Launch Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Stepwise Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Discriminant Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Shrink Covariances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
The Discriminant Analysis Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Principal Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Canonical Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Discriminant Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Score Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Discriminant Analysis Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Score Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Canonical Options. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Example of a Canonical 3D Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Specify Priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Consider New Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Save Discrim Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Scatterplot Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Validation in JMP and JMP Pro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Description of the Wide Linear Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Saved Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Between Groups Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Discriminant Analysis Overview
Discriminant analysis attempts to classify observations described by values on continuous variables into groups. Group membership, defined by a categorical variable X, is predicted by the continuous variables. These variables are called covariates and are denoted by Y.
Discriminant analysis differs from logistic regression. In logistic regression, the classification variable is random and predicted by the continuous variables. In discriminant analysis, the classifications are fixed, and the covariates (Y) are realizations of random variables. However, in both techniques, the categorical value is predicted by the continuous variables.
The Discriminant platform provides four methods for fitting models. All methods estimate the distance from each observation to each group’s multivariate mean (centroid) using Mahalanobis distance. You can specify prior probabilities of group membership, and these are accounted for in the distance calculation. Observations are classified into the closest group (a sketch of this rule follows the list of fitting methods below).
Fitting methods include the following:
• Linear—Assumes that the within‐group covariance matrices are equal. The covariate means for the groups defined by X are assumed to differ.
• Quadratic—Assumes that the within‐group covariance matrices differ. This requires estimating more parameters than does the Linear method. If group sample sizes are small, you risk obtaining unstable estimates.
• Regularized—Provides two ways to impose stability on estimates when the within‐group covariance matrices differ. This is a useful option if group sample sizes are small.
• Wide Linear—Useful in fitting models based on a large number of covariates, where other methods can have computational difficulties. It assumes that all covariance matrices are equal.
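To make the classification rule concrete, here is a minimal numpy sketch of the Linear method's idea. It is a sketch under the equal-covariance assumption with the standard log-prior adjustment, not JMP's saved formulas, and the function names are illustrative: squared Mahalanobis distances to each group centroid are computed with the pooled within-group covariance, adjusted by the prior probability, and the observation is assigned to the group with the smallest value.

    # Sketch of a linear discriminant classification rule (pooled covariance, priors).
    import numpy as np

    def fit_linear_discriminant(X, groups):
        labels = np.unique(groups)
        means = {g: X[groups == g].mean(axis=0) for g in labels}
        # Pooled within-group covariance matrix.
        n = X.shape[0]
        pooled = sum((X[groups == g] - means[g]).T @ (X[groups == g] - means[g]) for g in labels)
        pooled = pooled / (n - len(labels))
        return labels, means, np.linalg.inv(pooled)

    def classify(x, labels, means, pooled_inv, priors=None):
        priors = priors or {g: 1.0 / len(labels) for g in labels}
        scores = {}
        for g in labels:
            diff = x - means[g]
            # Squared Mahalanobis distance, adjusted for the prior (smaller is closer).
            scores[g] = diff @ pooled_inv @ diff - 2.0 * np.log(priors[g])
        return min(scores, key=scores.get)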
Example of Discriminant Analysis
In Fisher’s Iris data set, four measurements are taken from a sample of Iris flowers consisting of three different species. The goal is to identify the species accurately using the values of the four measurements.
1. Select Help > Sample Data Library and open Iris.jmp.
2. Select Analyze > Multivariate Methods > Discriminant.
3. Select Sepal length, Sepal width, Petal length, and Petal width and click Y, Covariates.
4. Select Species and click X, Categories.
5. Click OK.
Figure 6.2 Discriminant Analysis Report Window
Because there are three classes for Species, there are two canonical variables. In the Canonical Plot, each observation is plotted against the two canonical coordinates. The plot shows that these two coordinates separate the three species. Since there was no validation set, the Score Summaries report shows a panel for the Training set only. When there is no validation set, the entire data set is considered the Training set. Of the 150 observations, only three are misclassified.
Discriminant Launch Window
Launch the Discriminant platform by selecting Analyze > Multivariate Methods > Discriminant.
Figure 6.3 Discriminant Launch Window for Iris.jmp
Note: The Validation button appears in JMP Pro only. In JMP, you can define a validation set using excluded rows. See “Validation in JMP and JMP Pro” on page 129.
Y, Covariates
Columns containing the continuous variables used to classify observations into categories.
X, Categories
A column containing the categories or groups into which observations are to be classified.
Weight
A column whose values assign a weight to each row for the analysis.
Freq
A column whose values assign a frequency to each row for the analysis. In general terms, the effect of a frequency column is to expand the data table, so that any row with integer frequency k is expanded to k rows. Row ordering is maintained. You can specify fractional frequencies.
Validation
A numeric column containing two or three distinct values:
• If there are two values, the smaller value defines the training set and the larger value defines the validation set.
• If there are three values, these values define the training, validation, and test sets in order of increasing size.
• If there are more than three values, all but the smallest three are ignored.
If you click the Validation button with no columns selected in the Select Columns list, you can add a validation column to your data table. For more information about the Make Validation Column utility, see Basic Analysis.
By
Performs a separate analysis for each level of the specified column.
Stepwise Variable Selection
Performs stepwise variable selection using analysis of covariance and p‐values. For details, see “Stepwise Variable Selection” on page 105.
If you have specified a validation set, statistics that have been calculated for the validation set appear.
Note: This option is not provided for the Wide Linear discriminant method.
Discriminant Method
Provides four methods for conducting discriminant analysis. See “Discriminant Methods” on page 108.
Shrink Covariances
Shrinks the off‐diagonal entries of the pooled within‐group covariance matrix and the within‐group covariance matrices. See “Shrink Covariances” on page 111.
Uncentered Canonical
Suppresses centering of canonical scores for compatibility with older versions of JMP.
Use Pseudoinverses
Uses Moore‐Penrose pseudoinverses in the analysis. The resulting scores involve all covariates. If left unchecked, the analysis drops covariates that are linear combinations of covariates that precede them in the list of Y, Covariates.
Stepwise Variable Selection
Note: Stepwise Variable Selection is not available for the Wide Linear method.
If you select the Stepwise Variable Selection option in the launch window, the Discriminant Analysis report opens, showing the Column Selection panel. Perform stepwise analysis by using the buttons to select variables, or select them manually with the Lock and Entered check boxes. Based on your selection, F ratios and p‐values are updated. For details about how these are updated, see “Updating the F Ratio and Prob>F” on page 105.
Figure 6.4 Column Selection Panel for Iris.jmp with a Validation Set
Note: The Go button only appears when you use excluded rows for validation in JMP or a validation column in JMP Pro.
Updating the F Ratio and Prob>F
When you enter or remove variables from the model, the F Ratio and Prob>F values are updated based on an analysis of covariance model with the following structure:
• The covariate under consideration is the response.
• The covariates already entered into the model are predictors.
• The group variable is a predictor.
The values for F Ratio and Prob>F given in the Stepwise report are the F ratio and p‐value for the analysis of covariance test for the group variable. The analysis of covariance test for the group variable is an indicator of its discriminatory power relative to the covariate under consideration.
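The following sketch illustrates this analysis of covariance computation outside of JMP. It is a minimal NumPy/SciPy illustration of the idea, not JMP's implementation; the function and variable names are hypothetical.

```python
import numpy as np
from scipy import stats

def group_f_test(candidate, entered, groups):
    """Partial F test for the group variable in an analysis of covariance:
    `candidate` (n,) is the covariate under consideration (the response),
    `entered` (n, k) holds the covariates already entered into the model,
    and `groups` (n,) holds the group labels."""
    n = len(candidate)
    # Reduced model: intercept plus the covariates already entered
    X_reduced = np.column_stack([np.ones(n), entered])
    # Full model: add indicator columns for all but one group level
    levels = np.unique(groups)
    indicators = np.column_stack([(groups == g).astype(float) for g in levels[1:]])
    X_full = np.column_stack([X_reduced, indicators])

    def rss(X):
        beta, *_ = np.linalg.lstsq(X, candidate, rcond=None)
        resid = candidate - X @ beta
        return resid @ resid

    df_num = X_full.shape[1] - X_reduced.shape[1]
    df_den = n - X_full.shape[1]
    f_ratio = ((rss(X_reduced) - rss(X_full)) / df_num) / (rss(X_full) / df_den)
    prob_f = stats.f.sf(f_ratio, df_num, df_den)   # Prob > F
    return f_ratio, prob_f
```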
Statistics
Columns In Number of columns currently selected for entry into the discriminant model.
Columns Out Number of columns currently available for entry into the discriminant model.
Smallest P to Enter Smallest p‐value among the p‐values for all covariates available to enter the model.
Largest P to Remove Largest p‐value among the p‐values for all covariates currently selected for entry into the model.
Validation Entropy RSquare Entropy RSquare for the validation set. See “Entropy RSquare” on page 118. Larger values indicate better fit. Available only if a validation set is used.
Note: It is possible for the Validation Entropy RSquare to be negative.
Validation Misclassification Rate Misclassification rate for the validation set. Smaller values indicate better classification. Available only if a validation set is used.
Buttons
Step Forward Enters the most significant covariate from the covariates not yet entered. If a validation set is used, the Prob>F values are based on the training set.
Step Backward Removes the least significant covariate from the covariates entered but not locked. If a validation set is used, Prob>F values are based on the training set.
Enter All Enters all covariates by checking all covariates that are not locked in the Entered column.
Remove All Removes all covariates that are not locked by deselecting them in the Entered column.
Apply this Model Produces a discriminant analysis report based on the covariates that are checked in the Entered columns. The Select Columns outline is closed and the Discriminant Analysis window is updated to show analysis results based on your selected Discriminant Method.
Tip: After you click Apply this Model, the columns that you select appear at the top of the Score Summaries report.
Go
Enters covariates in forward steps until the Validation Entropy RSquare begins to decrease. Entry terminates when two forward steps are taken without improving the Validation Entropy RSquare. Available only with excluded rows in JMP or a validation column in JMP Pro.
Columns
Lock Forces a covariate to stay in its current state regardless of any stepping using the buttons.
Note the following:
‒ If you enter a covariate and then select Lock for that covariate, it remains in the model regardless of selections made using the control buttons. The Entered box for the locked covariate shows a dimmed check mark to indicate that it is in the model.
‒ If you select Lock for a covariate that is not Entered, it is not entered into the model regardless of selections made using the control buttons.
Entered Indicates which columns are currently in the model. You can manually select columns in or out of the model. A dimmed check mark indicates a locked covariate that has been entered into the model.
Column Covariate of interest.
F Ratio F ratio for a test for the group variable obtained using an analysis of covariance model. For details, see “Updating the F Ratio and Prob>F” on page 105.
Prob > F p‐value for a test for the group variable obtained using an analysis of covariance model. For details, see “Updating the F Ratio and Prob>F” on page 105.
Stepwise Example
For an illustration of how to use Stepwise, consider the Iris.jmp sample data table.
1. Select Help > Sample Data Library and open Iris.jmp.
2. Select Analyze > Multivariate Methods > Discriminant.
3. Select Sepal length, Sepal width, Petal length, and Petal width and click Y, Covariates.
4. Select Species and click X, Categories.
5. Select Stepwise Variable Selection.
6. Click OK.
7. Click Step Forward three times.
Three covariates are entered into the model. The Smallest P to Enter appears in the top panel. It is 0.0103288, indicating that the remaining covariate, Sepal length, might also be valuable in a discriminant analysis model for Species.
Figure 6.5 Stepped Model for Iris.jmp
8. Click Apply This Model.
The Column Selection outline is closed. The window is updated to show reports for a fit based on the entered covariates and your selected discriminant method.
Note that the covariates that you selected for your model are listed at the top of the Score Summaries report.
Figure 6.6 Score Summaries Report Showing Selected Covariates
Discriminant Methods
JMP offers these methods for conducting Discriminant Analysis: Linear, Quadratic, Regularized, and Wide Linear. The first three methods differ in terms of the underlying model. The Wide Linear method is an efficient way to fit a Linear model when the number of covariates is large.
Note: When you enter more than 500 covariates, a JMP Alert recommends that you switch to the Wide Linear method. This is because computation time can be considerable when you use the other methods with a large number of columns. Click Wide Linear, Many Columns to switch to the Wide Linear method. Click Continue to use the method you originally selected.
Figure 6.7 Linear, Quadratic, and Regularized Discriminant Analysis (panels: Linear; Quadratic; Regularized with λ = 0.4 and γ = 0.4)
The Linear, Quadratic, and Regularized methods are illustrated in Figure 6.7. The methods are described here briefly. For technical details, see “Saved Formulas” on page 130.
Linear, Common Covariance Performs linear discriminant analysis. This method assumes that the within‐group covariance matrices are equal. See “Linear Discriminant Method” on page 131.
Quadratic, Different Covariances Performs quadratic discriminant analysis. This method assumes that the within‐group covariance matrices differ. This method requires estimating more parameters than the Linear method requires. If group sample sizes are small, you risk obtaining unstable estimates. See “Quadratic Discriminant Method” on page 132.
If a covariate is constant across a level of the X variable, then its related entries in the within‐group covariance matrix have zero covariances. To enable matrix inversion, the zero covariances are replaced with the corresponding pooled within covariances. When this is done, a note appears at the top of the report window identifying the problematic covariate and level of X.
Tip: A shortcoming of the quadratic method surfaces in small data sets. It can be difficult to construct invertible and stable covariance matrices. The Regularized method ameliorates these problems, still allowing for differences among groups.
Regularized, Compromise Method Provides two ways to impose stability on estimates when the within‐group covariance matrices differ. This is a useful option when group sample sizes are small. See “Regularized, Compromise Method” on page 110 and “Regularized Discriminant Method” on page 133.
Wide Linear, Many Columns Useful in fitting models based on a large number of covariates, where other methods can have computational difficulties. This method assumes that all within‐group covariance matrices are equal. This method uses a singular value decomposition approach to compute the inverse of the pooled within‐group covariance matrix. See “Description of the Wide Linear Algorithm” on page 130.
Note: When you use the Wide Linear option, a few of the features that normally appear for other discriminant methods are not available. This is because the algorithm does not explicitly calculate the very large pooled within‐group covariance matrix.
Regularized, Compromise Method
Regularized discriminant analysis is governed by two nonnegative parameters.
• The first parameter (Lambda, Shrinkage to Common Covariance) specifies how to mix the individual and group covariance matrices. For this parameter, 1 corresponds to Linear Discriminant Analysis and 0 corresponds to Quadratic Discriminant Analysis.
• The second parameter (Gamma, Shrinkage to Diagonal) is a multiplier that specifies how much deflation to apply to the non‐diagonal elements (the covariances across variables). If you choose 1, then the covariance matrix is forced to be diagonal.
Assigning 0 to each of these two parameters is identical to requesting quadratic discriminant analysis. Similarly, assigning 1 to Lambda and 0 to Gamma requests linear discriminant analysis. Use Table 6.1 to help you decide on the regularization. See Figure 6.7 for examples of linear, quadratic, and regularized discriminant analysis.
Table 6.1 Regularized Discriminant Analysis

Use Smaller Lambda         | Use Larger Lambda                 | Use Smaller Gamma        | Use Larger Gamma
Covariance matrices differ | Covariance matrices are identical | Variables are correlated | Variables are uncorrelated
Many rows                  | Few rows                          | Few variables            | Many variables
Shrink Covariances
In the Discriminant launch window, you can select the option to Shrink Covariances. This option is recommended when some groups have a small number of observations. Discriminant analysis requires inversion of the covariance matrices. Shrinking off‐diagonal entries improves their stability and reduces prediction variance. The Shrink Covariances option shrinks the off‐diagonal entries by a factor that is determined using the method described in Schafer and Strimmer, 2005.
Rather than selecting the option in the launch window, you can achieve an equivalent shrinkage of the covariance matrices by using the appropriate values with a Regularized discriminant method. When you select the Shrink Covariances option and run your analysis, the Shrinkage report gives you an Overall Shrinkage value and an Overall Lambda value. To obtain the same analysis using the Regularized method, in the Regularization Parameters window, enter 1 as Lambda and the Overall Lambda from the Shrinkage report as Gamma.
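As a rough sketch of what off‐diagonal shrinkage does, consider the function below. It is illustrative only: in JMP the shrinkage factor itself is estimated using the method of Schafer and Strimmer (2005), whereas here it is simply passed in, and the names are hypothetical.

```python
import numpy as np

def shrink_off_diagonal(S, delta):
    """Return a copy of the covariance matrix S with its off-diagonal entries
    shrunk toward zero by the factor delta (0 = no shrinkage, 1 = diagonal)."""
    D = np.diag(np.diag(S))            # the variances are left unchanged
    return (1.0 - delta) * S + delta * D
```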
The Discriminant Analysis Report
The Discriminant Analysis report provides discriminant results based on your selected Discriminant Method. The Discriminant Method and the Classification variable are shown at the top of the report. If you selected the Regularized method, its associated parameters are also shown.
You can change Discriminant Method by selecting the option from the red triangle menu. The results in the report update to reflect the selected method.
Figure 6.8 Example of a Discriminant Analysis Report
The default Discriminant Analysis report contains the following sections:
• When you select the Wide Linear discriminant method, a Principal Components report appears. See “Principal Components” on page 112.
• The Canonical Plot shows the points and multivariate means in the two dimensions that best separate the groups. See “Canonical Plot” on page 113.
• The Discriminant Scores report provides details about how each observation is classified. See “Discriminant Scores” on page 116.
• The Score Summaries report provides an overview of how well observations are classified. See “Discriminant Scores” on page 116.
Principal Components
This report only appears for the Wide Linear method. Consider the following notation:
• Denote the n by p matrix of covariates by X, where n is the number of observations and p is the number of covariates.
• For each observation in X, subtract the covariate mean and divide the difference by the pooled standard deviation for the covariate. Denote the resulting matrix by Xs.
The report gives the following:
Number The number of eigenvalues extracted. Eigenvalues are extracted until Cum Percent is at least 99.99%, indicating that 99.99% of the variation has been explained.
Eigenvalue The eigenvalues of the covariance matrix for Xs, namely (Xs’Xs)/(n ‐ p), arranged in decreasing order.
Cum Percent The cumulative sum of the eigenvalues as a percentage of the sum of all eigenvalues. The eigenvalues sum to the rank of Xs’Xs.
Singular Value The singular values of Xs arranged in decreasing order.
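A small sketch of the 99.99% stopping rule described above, stated in terms of the singular values of Xs. This is an illustration only, not JMP's implementation, and the function name is hypothetical.

```python
import numpy as np

def components_to_retain(singular_values, target=0.9999):
    """Return the number of eigenvalues needed for Cum Percent to reach the
    target; eigenvalues are proportional to the squared singular values."""
    eig = np.sort(np.asarray(singular_values) ** 2)[::-1]   # decreasing order
    cum = np.cumsum(eig) / eig.sum()
    return int(min(np.searchsorted(cum, target) + 1, len(eig)))
```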
Canonical Plot
The Canonical Plot is a biplot. Figure 6.9 shows the Canonical Plot for a linear discriminant analysis of the data table Iris.jmp. The points have been colored by Species.
Figure 6.9 Canonical Plot for Iris.jmp
The biplot axes are the first two canonical variables. These define the two dimensions that provide maximum separation among the groups. The biplot shows how each observation is represented in terms of canonical variables and how each covariate contributes to the canonical variables.
• The observations and the multivariate means of each group are represented as points on the biplot. They are expressed in terms of the first two canonical variables.
• The set of rays that appears in the plot represents the covariates. The rays show how each covariate loads onto the first two standardized canonical variables. The direction of a ray indicates the degree of association of that covariate with the first two canonical variables.
Notice these additional details about the Canonical Plot and its association with other parts of the Discriminant report:
• The point corresponding to each multivariate mean is denoted by a plus (“+”) marker.
• A 95% confidence level ellipse is plotted for each mean. If two groups differ significantly, the confidence ellipses tend not to intersect.
• Show or hide the 95% confidence ellipses by selecting Canonical Options > Show Means CL Ellipses from the red triangle menu.
• The labeled rays show the directions of the covariates in the canonical space. These rays emanate from the point (0,0), which represents the grand mean of the data in terms of the canonical variables. They represent the degree of association, or loading, of each covariate on each canonical variable.
• Obtain the values of the loadings by selecting Canonical Options > Show Canonical Details from the red triangle menu. At the bottom of the Canonical Details report, click Standardized Scoring Coefficients. See “Standardized Scoring Coefficients” on page 125 for details.
• Show or hide the rays by selecting Canonical Options > Show Biplot Rays from the red triangle menu.
• Drag the center of the biplot rays to other places in the graph. Specify their position and scaling by selecting Canonical Options > Biplot Ray Position from the red triangle menu. The default Radius Scaling shown in the Canonical Plot is 1.5, unless an adjustment is needed to make the rays visible.
• An ellipse denoting a 50% contour is plotted for each group. This depicts a region in the space of the first two canonical variables that contains approximately 50% of the observations, assuming normality.
• Show or hide the 50% contours by selecting Canonical Options > Show Normal 50% Contours from the red triangle menu.
• Color code the points to match the ellipses by selecting Canonical Options > Color Points from the red triangle menu.
Classification into Three or More Categories
For the Iris.jmp data, there are three Species, so there are only two canonical variables. The plot in Figure 6.9 shows good separation of the three groups using the two canonical variables.
The rays in the plot indicate the following:
• Petal length is positively associated with Canonical1 and negatively associated with Canonical2. It loads more heavily on Canonical1 than on Canonical2.
• Petal width is positively associated with both Canonical1 and Canonical2. It loads more heavily on Canonical2 than on Canonical1.
• Sepal width is negatively associated with Canonical1 and positively associated with Canonical2. It loads more heavily on Canonical2 than on Canonical1.
• Sepal length is weakly associated with Canonical1 and very weakly associated with Canonical2.
Classification into Two Categories
When the classification variable has only two levels, the points are plotted against the single canonical variable, denoted by Canonical1 in the plot. The covariates load on Canonical1 only. The rays are shown with a vertical component only in order to separate them. Project the rays onto the Canonical1 axis to compare their association with the single canonical variable.
Figure 6.10 shows a Canonical Plot for the sample data table Fitness.jmp. The seven continuous variates are used to classify an individual into the categories M (male) or F (female). Since the classification variable has only two categories, there is only one canonical variable.
Figure 6.10 Canonical Plot for Fitness.jmp
The points in Figure 6.10 have been colored by Sex. Note that the two groups are well separated by their values on Canonical1.
Although the rays corresponding to the seven covariates have a vertical component, in this case you must interpret the rays only in terms of their projection onto the Canonical1 axis. Note the following:
• MaxPulse, Runtime, and RunPulse have little association with Canonical1.
• Weight, RstPulse, and Age are positively associated with Canonical1. Weight has the highest degree of association. The covariates RstPulse and Age have a similar, but smaller, degree of association.
• Oxy is negatively associated with Canonical1.
Discriminant Scores
The Discriminant Scores report provides the predicted classification of each observation and supporting information.
Row Row of the observation in the data table.
Actual Classification of the observation as given in the data table.
SqDist(Actual) Value of the saved formula SqDist[<level>] for the classification of the observation given in the data table. For details, see “Score Options” on page 121.
Prob(Actual) Estimated probability of the observation’s actual classification.
-Log(Prob) Negative of the log of Prob(Actual). Large values of this negative log‐likelihood identify observations that are poorly predicted in terms of membership in their actual categories.
A plot of ‐Log(Prob) appears to the right of the ‐Log(Prob) values. A large bar indicates a poor prediction. An asterisk (*) indicates observations that are misclassified.
If you are using a validation or a test set, observations in the validation set are marked with a “v” and those in the test set are marked with a “t”.
Predicted Predicted classification of the observation. The predicted classification is the category with the highest predicted probability of membership.
Prob(Pred) Estimated probability of the observation’s predicted classification.
Others Lists other categories, if they exist, that have a predicted probability that exceeds 0.1.
Figure 6.11 shows the Discriminant Scores report for the Iris.jmp sample data table using the Linear discriminant method. The Score Options > Show Interesting Rows Only option is selected, showing only misclassified rows or rows with predicted probabilities between 0.05 and 0.95.
Figure 6.11 Show Interesting Rows Only
Score Summaries
The Score Summaries report provides an overview of the discriminant scores. The table in Figure 6.12 shows Actual and Predicted classifications. If all observations are correctly classified, the off‐diagonal counts are zero.
Figure 6.12 Score Summaries for Iris.jmp
The Score Summaries report provides the following information:
Columns If you used Stepwise Variable Selection to construct the model, the columns entered into the model are listed. See Figure 6.6.
Source If no validation is used, all observations comprise the Training set. If validation is used, a row is shown for the Training and Validation sets, or for the Training, Validation, and Test sets.
Number Misclassified Provides the number of observations in the specified set that are incorrectly classified.
Percent Misclassified Provides the percent of observations in the specified set that are incorrectly classified.
Entropy RSquare A measure of fit. Larger values indicate better fit. For details, see “Entropy RSquare” on page 118.
Note: It is possible for Entropy RSquare to be negative.
-2LogLikelihood Twice the negative log‐likelihood of the observations in the training set, based on the model. Smaller values indicate better fit. Provided for the training set only. For more details, see Fitting Linear Models.
Confusion Matrices Shows matrices of actual by predicted counts for each level of the categorical X. If you are using JMP Pro with validation, a matrix is given for each set of observations. If you are using JMP with excluded rows, the excluded rows are considered the validation set and a separate Validation matrix is given. For more information, see “Validation in JMP and JMP Pro” on page 129.
Entropy RSquare
The Entropy RSquare is a measure of fit. It is computed for the training set and for the validation and test sets if validation is used.
Entropy RSquare for the Training Set
For the training set, Entropy RSquare is computed as follows:
• A discriminant model is fit using the training set.
• Predicted probabilities based on the model are obtained.
• Using these predicted probabilities, the likelihood is computed for observations in the training set. Call this Likelihood_FullTraining.
• The reduced model (no predictors) is fit using the training set.
• The predicted probabilities for the levels of X from the reduced model are used to compute the likelihood for observations in the training set. Call this quantity Likelihood_ReducedTraining.
• The Entropy RSquare for the training set is:

\text{Entropy RSquare}_{\text{Training}} = 1 - \frac{\log(\text{Likelihood\_Full}_{\text{Training}})}{\log(\text{Likelihood\_Reduced}_{\text{Training}})}
Entropy RSquare for Validation and Test Sets
For the validation set, Entropy RSquare is computed as follows:
• A discriminant model is fit using only the training set.
• Predicted probabilities based on the training set model are obtained for all observations.
• Using these predicted probabilities, the likelihood is computed for observations in the validation set. Call this Likelihood_FullValidation.
• The reduced model (no predictors) is fit using only the training set.
• The predicted probabilities for the levels of X from the reduced model are used to compute the likelihood for observations in the validation set. Call this quantity Likelihood_ReducedValidation.
• The Validation Entropy RSquare is:

\text{Validation Entropy RSquare} = 1 - \frac{\log(\text{Likelihood\_Full}_{\text{Validation}})}{\log(\text{Likelihood\_Reduced}_{\text{Validation}})}
The Entropy RSquare for the test set is computed in a manner analogous to the Entropy RSquare for the Validation set.
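The sketch below shows how these definitions translate into a calculation from predicted probabilities. It is illustrative only and the array names are hypothetical; it is not JMP's internal code. The same function applies to the training, validation, or test rows by passing the corresponding subset.

```python
import numpy as np

def entropy_rsquare(actual, prob_full, prob_reduced):
    """Entropy RSquare for one set of rows. `actual` holds integer group indices,
    `prob_full[i, t]` is the full model's predicted probability that row i belongs
    to group t, and `prob_reduced[i, t]` is the corresponding probability from the
    reduced (no-predictor) model, both fit on the training set."""
    rows = np.arange(len(actual))
    loglik_full = np.sum(np.log(prob_full[rows, actual]))        # log(Likelihood_Full)
    loglik_reduced = np.sum(np.log(prob_reduced[rows, actual]))  # log(Likelihood_Reduced)
    return 1.0 - loglik_full / loglik_reduced
```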
Discriminant Analysis Options
The following commands are available from the Discriminant Analysis red triangle menu.
Command
Description
Stepwise Variable Selection
Selects or deselects the Stepwise control panel. See “Stepwise Variable Selection” on page 105.
Discriminant Method
Selects the discriminant method. See “Discriminant Methods” on page 108.
Discriminant Scores
Shows or hides the Discriminant Scores portion of the report.
Score Options
Provides several options connected with the scoring of the observations. In particular, you can save the scoring formulas. See “Score Options” on page 121.
Canonical Plot
Shows or hides the Canonical Plot. See “Canonical Plot” on page 113.
Canonical Options
Provides options that affect the Canonical Plot. See “Canonical Options” on page 122.
Canonical 3D Plot
Shows a three‐dimensional canonical plot. This option is available only when there are four or more levels of the categorical X. See “Example of a Canonical 3D Plot” on page 126.
Specify Priors
Enables you to specify prior probabilities for each level of the X variable. See “Specify Priors” on page 127.
Consider New Levels
Used when you have some points that might not fit any known group, but instead might be from an unscored new group. For details, see “Consider New Levels” on page 127.
Show Within Covariances
Shows or hides these reports:
• A Covariance Matrices report that gives the pooled‐within covariance and correlation matrices.
• For the Quadratic and Regularized methods, a Correlations for Each Group report that shows:
‒ the within‐group correlation matrices
‒ for each group, the log of the determinant of the within‐group covariance matrix
• For the Quadratic discriminant method, adds a Group Covariances outline to the Covariance Matrices report that shows the within‐group covariance matrices.
Not available for the Wide Linear discriminant method.
Show Group Means
Shows or hides a Group Means report that provides the means of each covariate. Means for each level of the X variable and overall means appear.
Save Discrim Matrices
Saves a script called Discrim Results to the data table. The script is a list of the following objects for use in JSL:
‒ a list of the covariates (Ys)
‒ the categorical variable X
‒ a list of the levels of X
‒ a matrix of the means of the covariates by the levels of X
‒ the pooled‐within covariance matrix
Not available for the Wide Linear discriminant method.
See “Save Discrim Matrices” on page 128.
Scatterplot Matrix
Opens a separate window containing a Scatterplot Matrix report that shows a matrix with a scatterplot for each pair of covariates. The option invokes the Scatterplot Matrix platform with shaded density ellipses for each group. The scatterplots include all observations in the data table, even if validation is used. See “Scatterplot Matrix” on page 128.
Not available for the Wide Linear discriminant method.
Script
Contains options that are available to all platforms. For more information, see the Using JMP book.
Score Options
Score Options provides the following selections that deal with scores:
Show Interesting Rows Only In the Discriminant Scores report, shows only rows that are misclassified and those with predicted probability between 0.05 and 0.95.
Show Classification Counts Shows or hides the confusion matrices, showing actual by predicted counts, in the Score Summaries report. By default, the Score Summaries report shows a confusion matrix for each level of the categorical X. If you are using JMP Pro with validation, a matrix is given for each set of observations. If you are using JMP with excluded rows, these rows are considered the validation set and a separate Validation matrix is given. For more information, see “Validation in JMP and JMP Pro” on page 129.
Show Distances to Each Group Adds a report called Squared Distances to Each Group that shows each observation’s squared Mahalanobis distance to each group mean.
Show Probabilities to Each Group Adds a report called Probabilities to Each Group that shows the probability that an observation belongs to each of the groups defined by the categorical X.
ROC Curve Appends a Receiver Operating Characteristic curve to the Score Summaries report. For details about the ROC Curve, see the Specialized Models book.
Select Misclassified Rows Selects the misclassified rows in the data table and in report windows that display a listing by Row.
Select Uncertain Rows Selects rows with uncertain classifications in the data table and in report windows that display a listing by Row. An uncertain row is one whose probability of group membership for any group is neither close to 0 nor close to 1.
When you select this option, a window opens where you can specify the range of predicted probabilities that reflect uncertainty. The default is to define as uncertain any row whose probability differs from 0 or 1 by more than 0.1. Therefore, the default selects rows with probabilities between 0.1 and 0.9.
Save Formulas Saves distance, probability, and predicted membership formulas to the data table. For details, see “Saved Formulas” on page 130.
‒ The distance formulas are SqDist[0] and SqDist[<level>], where <level> represents a level of X. The distance formulas produce intermediate values connected with the Mahalanobis distance calculations.
‒ The probability formulas are Prob[<level>], where <level> represents a level of X. Each probability column gives the posterior probability of an observation’s membership in that level of X. The Response Probability column property is saved to each probability column. For details about the Response Probability column property, see the Using JMP book.
‒ The predicted membership formula is Pred <X> and contains the “most likely level” classification rule.
‒ The Wide Linear method also saves a Discrim Data Matrix column containing the vector of covariates and a Discrim Prin Comp formula. See “Wide Linear Discriminant Method” on page 134.
Note: For any method other than Wide Linear, when you Save Formulas, a RowEdit Prob script is saved to the data table. This script selects uncertain rows in the data table. The script defines any row whose probability differs from 0 or 1 by more than 0.1 as uncertain. The script also opens a Row Editor window that enables you to examine the uncertain rows. If you fit a new model (other than Wide Linear) and select Save Formulas, any existing RowEdit Prob script is replaced with a script that applies to the new fit.
Make Scoring Script Creates a script that constructs the formula columns saved by the Save Formulas option. You can save this script and use it, perhaps with other data tables, to create the formula columns that calculate membership probabilities and predict group membership.
Canonical Options
The first options listed below relate to the appearance of the Canonical Plot or the Canonical 3D Plot. The remaining options provide detail on the calculations related to the plot.
Note: The Canonical 3D Plot is available only when there are three or more covariates and when the grouping variable has four or more categories.
Options Relating to Plot Appearance
Show Points Shows or hides the points in the Canonical Plot and Canonical 3D Plot.
Show Means CL Ellipses Shows or hides 95% confidence ellipses for the means on the canonical variables, assuming normality. Shows or hides 95% confidence ellipsoids in the Canonical 3D Plot.
Show Normal 50% Contours Shows or hides an ellipse or an ellipsoid that denotes a 50% contour for each group. In the Canonical Plot, each ellipse depicts a region in the space of the first two canonical variables that contains approximately 50% of the observations, assuming normality. In the Canonical 3D Plot, each ellipsoid depicts a region in the space of the first three canonical variables that contains approximately 50% of the observations, assuming normality.
Show Biplot Rays Shows or hides the biplot rays in the Canonical Plot and in the Canonical 3D Plot. The labeled rays show the directions of the covariates in the canonical space. They represent the degree of association, or loading, of each covariate on each canonical variable.
Biplot Ray Position Enables you to specify the position and radius scaling of the biplot rays in the Canonical Plot and in the Canonical 3D Plot.
‒ By default, the rays emanate from the point (0,0), which represents the grand mean of the data in terms of the canonical variables. In the Canonical Plot, you can drag the rays or use this option to specify coordinates.
‒ The default Radius Scaling in the canonical plots is 1.5, unless an adjustment is needed to make the rays visible. Radius Scaling is done relative to the Standardized Scoring Coefficients.
Color Points Colors the points in the Canonical Plot and the Canonical 3D Plot based on the levels of the X variable. Color markers are added to the rows in the data table. This option is equivalent to selecting Rows > Color or Mark by Column and selecting the X variable. It is also equivalent to right‐clicking the graph and selecting Row Legend, and then coloring by the classification column.
Options Relating to Calculations
Show Canonical Details Shows or hides the Canonical Details report. See “Show Canonical Details” on page 124.
Show Canonical Structure Shows or hides Canonical Structures report. See “Show Canonical Structure” on page 125. Not available for the Wide Linear discriminant method.
Save Canonical Scores Creates columns in the data table that contain canonical score formulas for each observation. The column for the kth canonical score is named Canon[<k>].
Tip: In a script, sending the scripting command Save to New Data Table to the Discriminant object saves the following to a new data table: group means on the canonical variables; the biplot rays with 1.5 Radius Scaling of the Standardized Scoring Coefficients; and the canonical scores. Not available for the Wide Linear discriminant method.
Show Canonical Details
The Canonical Details report shows tests that address the relationship between the covariates and the grouping variable X. Relevant matrices are presented at the bottom of the report.
Figure 6.13 Canonical Details for Iris.jmp
Note: The matrix used in computing the results in the report is the pooled within‐covariance matrix (given as the Within Matrix). This matrix is used as a basis for the Canonical Details report for all discriminant methods. The statistics and tests in the Canonical Details report are the same for all discriminant methods.
Statistics and Tests
The Canonical Details report lists eigenvalues and gives a likelihood ratio test for zero eigenvalues. Four tests are provided for the null hypothesis that the canonical correlations are zero.
Eigenvalue Eigenvalues of the product of the Between Matrix and the inverse of the Within Matrix. These are listed from largest to smallest. The size of an eigenvalue reflects the amount of variance explained by its associated discriminant function.
Percent Proportion of the sum of the eigenvalues represented by the given eigenvalue.
Cum Percent Cumulative sum of the proportions.
Canonical Corr Canonical correlations between the covariates and the groups defined by the categorical X. Suppose that you define numeric indicator variables to represent the groups defined by X. Then perform a canonical correlation analysis using the covariates as one set of variables and the indicator variables representing the groups in X as the other. The Canonical Corr values are the canonical correlation values that result from this analysis.
Likelihood Ratio Likelihood ratio statistic for a test of whether the population values of the corresponding canonical correlation and all smaller correlations are zero. The ratio equals the product of the values (1 – Canonical Corr²) for the given and all smaller canonical correlations.
Test Lists four standard tests for the null hypothesis that the means of the covariates are equal across groups: Wilks’ Lambda, Pillai’s Trace, Hotelling‐Lawley, and Roy’s Max Root. See “Multivariate Tests” on page 173 and “Approximate F‐Tests” on page 174 in the “Statistical Details” appendix.
Approx. F F value associated with the corresponding test. For certain tests, the F value is approximate or an upper bound. See “Approximate F‐Tests” on page 174 in the “Statistical Details” appendix.
NumDF Numerator degrees of freedom for the corresponding test.
DenDF Denominator degrees of freedom for the corresponding test.
Prob>F p‐value for the corresponding test.
Matrices
Four matrices that relate to the canonical structure are presented at the bottom of the report. To view a matrix, click the disclosure icon beside its name. To hide it, click the name of the matrix.
Within Matrix Pooled within‐covariance matrix.
Between Matrix Between groups covariance matrix, SB. See “Between Groups Covariance Matrix” on page 137.
Scoring Coefficients Coefficients used to compute canonical scores in terms of the raw data. These are the coefficients used for the option Canonical Options > Save Canonical Scores. For details about how these are computed, see “The CANDISC Procedure” in SAS Institute Inc. (2011).
Standardized Scoring Coefficients Coefficients used to compute canonical scores in terms of the standardized data. Often called loadings. For details about how these are computed, see “The CANDISC Procedure” in SAS Institute Inc. (2011).
Show Canonical Structure
The Canonical Structure report gives three matrices that provide correlations between the canonical variables and the covariates. Another matrix shows means across the levels of the group variable. To view a matrix, click the disclosure icon beside its name. To hide it, click the name of the matrix.
Figure 6.14 Canonical Structure for Iris.jmp Showing between Canonical Structure
Total Canonical Structure Correlations between the canonical variables and the covariates.
Between Canonical Structure Correlations between the group means on the canonical variables and the group means on the covariates.
Pooled Within Canonical Structure Partial correlations between the canonical variables and the covariates, adjusted for the group variable.
Class Means on Canonical Variables Provides means across the levels of the group variable for each canonical variable.
Example of a Canonical 3D Plot
1. Select Help > Sample Data Library and open Owl Diet.jmp.
2. Select rows 180 through 294.
These are the rows for which species is missing. You will hide and exclude these rows.
3. Select Rows > Hide and Exclude.
4. Select Rows > Color or Mark by Column.
5. Select species.
6. From the Colors menu, select JMP Dark.
7. Check Make Window with Legend.
8. Click OK.
A small Legend window appears. The rows in the data table are assigned colors by species.
9. Select Analyze > Multivariate Methods > Discriminant.
10. Specify skull length, teeth row, palatine foramen, and jaw length as Y, Covariates.
11. Specify species as X, Categories.
12. Click OK.
13. Select Canonical 3D Plot from the Discriminant Analysis red triangle menu.
Tip: Click on categories in the Legend to highlight those points in the Canonical 3D plot. Click and drag inside the 3D plot to rotate it.
Figure 6.15 Canonical 3D Plot with Legend Window
Specify Priors
The following options are available for specifying priors:
Equal Probabilities Assigns equal prior probabilities to all groups. This is the default.
Proportional to Occurrence Assigns prior probabilities to the groups that are proportional to their frequency in the observed data.
Other Enables you to specify custom prior probabilities.
Consider New Levels
Use the Consider New Levels option if you suspect that some of your observations are outliers with respect to the specified levels of the categorical variable. When you select the option, a menu asks you to specify the prior probability of the new level.
Observations that would be better fit using a new group are assigned to the new level, called “Other”. Probability of membership in the Other group assumes that these observations have the distribution of the entire set of observations where no group structure is assumed. This leads to correspondingly wide normal contours associated with the covariance structure. Distance calculations are adjusted by the specified prior probability.
Save Discrim Matrices
Save Discrim Matrices creates a global list (DiscrimResults) for use in the JMP scripting language. The list contains the following, calculated for the training set:
• YNames, a list of the covariates (Ys)
• XName, the categorical variable
• XValues, a list of the levels of X
• YMeans, a matrix of the means of the covariates by the levels of X
• YPartialCov, the within covariance matrix
Consider the analysis obtained using the Discriminant script in the Iris.jmp sample data table. If you select Save Discrim Matrices from the red triangle menu, the script Discrim Results is saved to the data table. The script is shown in Figure 6.16.
Figure 6.16 Discrim Results Table Script for Iris.jmp
Note: In a script, you can send the scripting command Get Discrim Matrices to the Discriminant platform object. This obtains the same values as Save Discrim Matrices, but does not store them in the data table.
Scatterplot Matrix
The Scatterplot Matrix command invokes the Scatterplot Matrix platform in a separate window containing a lower triangular scatterplot matrix for the covariates. Points are plotted for all observations in the data table.
Ellipses with 90% coverage are shown for each level of the categorical variable X. For the Linear discriminant method, these are based on the pooled within covariance matrix. Figure 6.17 shows the Scatterplot Matrix window for the Iris.jmp sample data table.
Figure 6.17 Scatterplot Matrix for Iris.jmp
The options in the report’s red triangle menu are described in the Essential Graphing book.
Validation in JMP and JMP Pro
In JMP, you can specify a validation set by excluding the rows that form the validation set. Select the rows that you want to use as your validation set and then select Rows > Exclude/Unexclude. The unexcluded rows are treated as the training set.
Note: In JMP Pro, you can specify a Validation column in the Discriminant launch window. A validation column must have a numeric data type and should contain at least two distinct values.
Notice the following:
• If the column contains two values, the smaller value defines the training set and the larger value defines the validation set.
• If the column contains three values, the values define the training, validation, and test sets in order of increasing size.
• If the column contains four or more distinct values, only the smallest three values and their associated observations are used to define the training, validation, and test sets, in that order.
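A minimal sketch of this mapping rule follows. It is illustrative only (the function name is hypothetical), not the platform's implementation.

```python
import numpy as np

def sets_from_validation_column(col):
    """Label each row Training, Validation, or Test according to the sorted
    distinct values of the validation column; any extra values are ignored."""
    names = ["Training", "Validation", "Test"]
    mapping = dict(zip(np.unique(col), names))      # np.unique returns sorted values
    return np.array([mapping.get(v, "Ignored") for v in col])
```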
When a validation set is specified, the Discriminant platform does the following:
• Models are fit using the training data.
• The Stepwise Variable Selection option gives the Validation Entropy RSquare and Validation Misclassification Rate statistics for the model. For details, see “Statistics” on page 105 and “Entropy RSquare for Validation and Test Sets” on page 118.
• The Discriminant Scores report shows an indicator identifying rows in the validation and test sets.
• The Score Summaries report shows actual by predicted classifications for the training, validation, and test sets.
Technical Details
Description of the Wide Linear Algorithm
Wide Linear discriminant analysis is performed as follows:
• The data are standardized by subtracting group means and dividing by pooled standard deviations.
• The singular value decomposition is used to obtain a principal component transformation matrix from the set of singular vectors.
• The number of components retained represents a minimum of 0.9999 of the sum of the squared singular values.
• A linear discriminant analysis is performed on the transformed data, where the data are not shifted by group means. This is a fast calculation because the pooled‐within covariance matrix is diagonal.
Saved Formulas
This section gives the derivation of formulas saved by Score Options > Save Formulas. The formulas depend on the Discriminant Method.
For each group defined by the categorical variable X, observations on the covariates are assumed to have a p‐dimensional multivariate normal distribution, where p is the number of covariates. The notation used in the formulas is given in Table 6.2.
Table 6.2 Notation for Formulas Given by Save Formulas Options
• p: number of covariates
• T: total number of groups (levels of X)
• t = 1, ..., T: subscript to distinguish groups defined by X
• n_t: number of observations in group t
• n = n_1 + n_2 + ... + n_T: total number of observations
• y: p by 1 vector of covariates for an observation
• y_{it} = (y_{i1t}, y_{i2t}, \ldots, y_{ipt})': ith observation in group t, consisting of a vector of p covariates
• \bar{y}_t: p by 1 vector of means of the covariates y for observations in group t
• \bar{y}: p by 1 vector of means for the covariates across all observations
• S_t: estimated (p by p) within‐group covariance matrix for group t, S_t = \frac{1}{n_t - 1} \sum_{i=1}^{n_t} (y_{it} - \bar{y}_t)(y_{it} - \bar{y}_t)'
• S_p: estimated (p by p) pooled within covariance matrix, S_p = \frac{1}{n - T} \sum_{t=1}^{T} (n_t - 1) S_t
• q_t: prior probability of membership for group t
• p(t | y): posterior probability that y belongs to group t
• |A|: determinant of a matrix A
Linear Discriminant Method
In linear discriminant analysis, all within‐group covariance matrices are assumed equal. The common covariance matrix is estimated by Sp. See Table 6.2 for notation.
The Mahalanobis distance from an observation y to group t is defined as follows:
d_t^2 = (y - \bar{y}_t)' S_p^{-1} (y - \bar{y}_t)
The likelihood for an observation y in group t is estimated as follows:
l_t(y) = (2\pi)^{-p/2} \, |S_p|^{-1/2} \exp\{-(y - \bar{y}_t)' S_p^{-1} (y - \bar{y}_t)/2\} = (2\pi)^{-p/2} \, |S_p|^{-1/2} \exp\{-d_t^2/2\}
Note that the number of parameters that must be estimated is p² for the pooled covariance matrix plus pT for the means.
The posterior probability of membership in group t is given as follows:
p(t \mid y) = \frac{q_t \, l_t(y)}{\sum_{u=1}^{T} q_u \, l_u(y)} = \frac{1}{1 + \sum_{u \neq t} \exp\{-[(d_u^2 - 2\log q_u) - (d_t^2 - 2\log q_t)]/2\}}
An observation y is assigned to the group for which its posterior probability is the largest.
The formulas saved by the Linear discriminant method are defined as follows:
SqDist[0]: y' S_p^{-1} y
SqDist[<group t>]: d_t^2 - 2\log(q_t)
Prob[<group t>]: p(t \mid y)
Pred <X>: the t for which p(t \mid y) is maximum, t = 1, \ldots, T
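As a sketch of how these saved formulas combine for a single observation, consider the NumPy function below. It is an illustration of the expressions above under the stated assumptions, not JMP's implementation, and the names are hypothetical.

```python
import numpy as np

def linear_discriminant_scores(y, group_means, S_pooled, priors):
    """Score one observation under the Linear method. `y` is a length-p vector,
    `group_means` is a T x p array of group means, `S_pooled` is the pooled
    within-group covariance matrix, and `priors` holds the priors q_t."""
    Sp_inv = np.linalg.inv(S_pooled)
    diffs = y - group_means                              # T x p matrix of (y - ybar_t)
    d2 = np.einsum("tp,pq,tq->t", diffs, Sp_inv, diffs)  # squared Mahalanobis distances
    sqdist = d2 - 2.0 * np.log(priors)                   # SqDist[<group t>]
    # Posterior probabilities follow from the expression above (softmax form);
    # subtracting the minimum only improves numerical stability.
    w = np.exp(-(sqdist - sqdist.min()) / 2.0)
    prob = w / w.sum()                                   # Prob[<group t>]
    return sqdist, prob, int(np.argmax(prob))            # Pred <X>
```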
Quadratic Discriminant Method
In quadratic discriminant analysis, the within‐group covariance matrices are not assumed equal. The within‐group covariance matrix for group t is estimated by St. This means that the total number of parameters to be estimated is Tp² + Tp: Tp² for the within‐group covariance matrices and Tp for the means.
When group sample sizes are small relative to p, the estimates of the within‐group covariance matrices tend to be highly variable. The discriminant score is heavily influenced by the smallest eigenvalues of the inverse of the within‐group covariance matrices. See Friedman, 1989. For this reason, if your group sample sizes are small compared to p, you might want to consider the Regularized method, described in “Regularized Discriminant Method” on page 133.
See Table 6.2 for notation. The Mahalanobis distance from an observation y to group t is defined as follows:
d_t^2 = (y - \bar{y}_t)' S_t^{-1} (y - \bar{y}_t)
The likelihood for an observation y in group t is estimated as follows:
l t  y  =  2 
–T  2
=  2 
–T  2
–1  2
St
St
–1
exp  –  y – y t S t  y – y t   2 
–1  2
2
exp  – d t  2 
The posterior probability of membership in group t is the following:
p(t \mid y) = \frac{q_t \, l_t(y)}{\sum_{u=1}^{T} q_u \, l_u(y)} = \frac{1}{1 + \sum_{u \neq t} \exp\{-[(d_u^2 + \log|S_u| - 2\log q_u) - (d_t^2 + \log|S_t| - 2\log q_t)]/2\}}
An observation y is assigned to the group for which its posterior probability is the largest.
The formulas saved by the Quadratic discriminant method are defined as follows:
SqDist[<group t>]: d_t^2 + \log|S_t| - 2\log(q_t)
Prob[<group t>]: p(t \mid y)
Pred <X>: the t for which p(t \mid y) is maximum, t = 1, \ldots, T
Note: SqDist[<group t>] can be negative.
Regularized Discriminant Method
Regularized discriminant analysis allows for two parameters: λ and γ.
• The parameter λ balances the weights assigned to the pooled covariance matrix and the within‐group covariance matrices, which are not assumed equal.
• The parameter γ determines the amount of shrinkage toward a diagonal matrix.
This method enables you to leverage two aspects of regularization to bring stability to estimates for quadratic discriminant analysis. See Friedman, 1989. See Table 6.2 for notation.
For the regularized method, the covariance matrix for group t is:
\Sigma_t = (1 - \gamma)\left[\lambda S_p + (1 - \lambda) S_t\right] + \gamma \, \mathrm{Diag}\left[\lambda S_p + (1 - \lambda) S_t\right]
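A direct transcription of this formula into NumPy is sketched below. It is illustrative only; the function and argument names are hypothetical.

```python
import numpy as np

def regularized_group_covariance(S_t, S_pooled, lam, gam):
    """Covariance matrix used for group t by the Regularized method:
    blend the pooled and group matrices with lambda, then shrink the
    blend toward its diagonal with gamma."""
    blend = lam * S_pooled + (1.0 - lam) * S_t
    return (1.0 - gam) * blend + gam * np.diag(np.diag(blend))
```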
The Mahalanobis distance from an observation y to group t is defined as follows:
d_t^2 = (y - \bar{y}_t)' \Sigma_t^{-1} (y - \bar{y}_t)
The likelihood for an observation y in group t is estimated as follows:
l t  y  =  2 
–T  2
=  2 
–T  2
–1  2
t
t
–1
exp  –  y – y t  t  y – y t   2 
–1  2
2
exp  – d t  2 
The posterior probability of membership in group t is given by the following:

p(t \mid y) = \frac{q_t \, l_t(y)}{\sum_{u=1}^{T} q_u \, l_u(y)} = \frac{1}{1 + \sum_{u \neq t} \exp\{-[(d_u^2 + \log|\Sigma_u| - 2\log q_u) - (d_t^2 + \log|\Sigma_t| - 2\log q_t)]/2\}}
An observation y is assigned to the group for which its posterior probability is the largest.
The formulas saved by the Regularized discriminant method are defined below:
SqDist[<group t>]: d_t^2 + \log|\Sigma_t| - 2\log(q_t)
Prob[<group t>]: p(t \mid y)
Pred <X>: the t for which p(t \mid y) is maximum, t = 1, \ldots, T
Note: SqDist[<group t>] can be negative.
Wide Linear Discriminant Method
The Wide Linear method is useful when you have a large number of covariates and, in particular, when the number of covariates exceeds the number of observations (p > n). This approach centers around an efficient calculation of the inverse of the pooled within‐covariance matrix Sp or of its transpose, if p > n. It uses a singular value decomposition approach to avoid inverting and allocating space for large covariance matrices.
The Wide Linear method assumes equal within‐group covariance matrices and is equivalent to the Linear method if the number of observations equals or exceeds the number of covariates.
Wide Linear Calculation
See Table 6.2 for notation. The steps in the Wide Linear calculation are as follows:
1. Compute the T by p matrix M of within‐group sample means. The (t,j)th entry of M, mtj, is the sample mean for members of group t on the jth covariate.
2. For each covariate j, calculate the pooled standard deviation across groups. Call this sjj.
3. Denote the diagonal matrix with diagonal entries sjj by Sdiag.
4. Center and scale values for each covariate as follows:
‒ Subtract the mean for the group to which the observation belongs.
‒ Divide the difference by the pooled standard deviation.
Using notation, for an observation i in group t, the group‐centered and scaled value for the jth covariate is:
y ij – m t  i j
*
y ij = -------------------------s jj
The notation t(i) indicates the group t to which observation i belongs.
5. Denote the matrix of y_{ij}^{*} values by Ys.
6. Denote the pooled within‐covariance matrix for the group‐centered and scaled covariates by R. The matrix R is given by the following:
R =  Y s Y s    n – T 
7. Apply the singular value decomposition to Ys:
Y_s = U D V'

where U and V are orthonormal and D is a diagonal matrix with positive entries (the singular values) on the diagonal. See “The Singular Value Decomposition” on page 171 in the “Statistical Details” appendix.
Then R can be written as follows:

R = \frac{Y_s' Y_s}{n - T} = \frac{V D^2 V'}{n - T}
8. If R is of full rank, obtain R^{-1/2} as follows:

R^{-1/2} = (V D^{-1} V') \sqrt{n - T}

where D^{-1} is the diagonal matrix whose diagonal entries are the inverses of the diagonal entries of D.
If R is not of full rank, define a pseudo‐inverse for R as follows:
R^{+} = (V D^{-2} V')(n - T)

Then define the inverse square root of R as follows:

(R^{+})^{1/2} = (V D^{-1} V') \sqrt{n - T}
9. If R is of full rank, it follows that R^{+} = R^{-1}. So, for completeness, the discussion continues using pseudo‐inverses.
Define a p by p matrix Ts as follows:

T_s = (S_{\mathrm{diag}}^{-1} V D^{+}) \sqrt{n - T}

Then:

T_s T_s' = (S_{\mathrm{diag}}^{-1} V (D^{+})^{2} V' S_{\mathrm{diag}}^{-1})(n - T) = S_{\mathrm{diag}}^{-1} R^{+} S_{\mathrm{diag}}^{-1} = S_p^{+}

where S_p^{+} is a generalized inverse of the pooled within‐covariance matrix for the original data that is calculated using the SVD.
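The steps above can be sketched in a few lines of NumPy. This is an illustration of the algebra under the stated assumptions, not the platform's implementation; the function and variable names are hypothetical.

```python
import numpy as np

def wide_linear_ts(Y, groups, tol=1e-8):
    """Return Ts such that Ts @ Ts.T is a generalized inverse of the pooled
    within-group covariance matrix, following steps 1-9 above."""
    levels, idx = np.unique(groups, return_inverse=True)
    n, p = Y.shape
    T = len(levels)
    # Steps 1-2: group means M and pooled standard deviations s_jj
    M = np.vstack([Y[idx == t].mean(axis=0) for t in range(T)])
    centered = Y - M[idx]
    s = np.sqrt((centered ** 2).sum(axis=0) / (n - T))
    # Steps 4-5: group-centered and scaled matrix Ys
    Ys = centered / s
    # Step 7: singular value decomposition of Ys
    U, d, Vt = np.linalg.svd(Ys, full_matrices=False)
    keep = d > tol                      # drop zero singular values (pseudo-inverse)
    V, d = Vt[keep].T, d[keep]
    # Step 9: Ts = S_diag^{-1} V D^{+} sqrt(n - T)
    return (V / s[:, None]) / d[None, :] * np.sqrt(n - T)
```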
Mahalanobis Distance
The formulas for the Mahalanobis distance, the likelihood, and the posterior probabilities are identical to those in “Linear Discriminant Method” on page 131. However, the inverse of Sp is replaced by a generalized inverse computed using the singular value decomposition.
When you save the formulas, the Mahalanobis distance is given in terms of the decomposition. For an observation y, the distance to group t is the following, where the last equality uses the notation seen in the saved formulas:
d_t^2 = (y - \bar{y}_t)' S_p^{+} (y - \bar{y}_t)
    = (y - \bar{y}_t)' T_s T_s' (y - \bar{y}_t)
    = [(y - \bar{y}) - (\bar{y}_t - \bar{y})]' T_s T_s' [(y - \bar{y}) - (\bar{y}_t - \bar{y})]
    = [T_s'(y - \bar{y})]'[T_s'(y - \bar{y})] - 2\,[T_s'(\bar{y}_t - \bar{y})]'[T_s'(y - \bar{y})] + [T_s'(\bar{y}_t - \bar{y})]'[T_s'(\bar{y}_t - \bar{y})]
    = \mathrm{SqDist}[0] - 2\,[T_s'(\bar{y}_t - \bar{y})]'\,(\mathrm{Discrim\ Prin\ Comp}) + [T_s'(\bar{y}_t - \bar{y})]'[T_s'(\bar{y}_t - \bar{y})]
Saved Formulas
The formulas saved by the Wide Linear discriminant method are defined as follows:
Discrim Data Matrix
Vector of observations on the covariates
Discrim Prin Comp
The data transformed by the principal component scoring matrix, which renders the data uncorrelated within groups. Given by T_s'(y - \bar{y}), where \bar{y} is the vector containing the overall means of the covariates.
SqDist[0]
Sum of squares of the entries in T_s'(y - \bar{y})
SqDist[<group t>]
The Mahalanobis distance from the observation to the group centroid. See “Mahalanobis Distance” on page 136.
Prob[<group t>]
p(t \mid y), given in “Linear Discriminant Method” on page 131
Pred <X>
The t for which p(t \mid y) is maximum, t = 1, \ldots, T
Between Groups Covariance Matrix
Using the notation in Table 6.2, this matrix is defined as follows:
S_B = \frac{1}{T - 1} \sum_{t=1}^{T} T \, \frac{n_t}{n} \, (\bar{y}_t - \bar{y})(\bar{y}_t - \bar{y})'
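In NumPy terms, a sketch of this definition is shown below; the function and argument names are illustrative assumptions, not part of the product.

```python
import numpy as np

def between_groups_covariance(group_means, group_sizes, overall_mean):
    """S_B as defined above. `group_means` is T x p, `group_sizes` has length T,
    and `overall_mean` has length p."""
    group_sizes = np.asarray(group_sizes, dtype=float)
    T, n = len(group_sizes), group_sizes.sum()
    diffs = group_means - overall_mean           # rows are ybar_t - ybar
    weights = T * group_sizes / n
    return (diffs.T * weights) @ diffs / (T - 1)
```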
Chapter 7
Partial Least Squares Models
Develop Models Using Correlations between Ys and Xs
The Partial Least Squares (PLS) platform fits linear models based on factors, namely, linear combinations of the explanatory variables (Xs). These factors are obtained in a way that attempts to maximize the covariance between the Xs and the response or responses (Ys). PLS exploits the correlations between the Xs and the Ys to reveal underlying latent structures.
JMP Pro provides additional functionality, allowing you to conduct PLS Discriminant Analysis (PLS‐DA), include a variety of model effects, utilize several validation methods, impute missing data, and obtain bootstrap estimates of the distributions of various statistics.
Partial least squares performs well in situations where ordinary least squares does not produce satisfactory results, such as the following: more X variables than observations, highly correlated X variables, a large number of X variables, or several Y variables and many X variables.
Figure 7.1 A Portion of a Partial Least Squares Report
Contents
Overview of the Partial Least Squares Platform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Example of Partial Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Launch the Partial Least Squares Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Model Launch Control Panel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Partial Least Squares Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Model Comparison Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
<Cross Validation Method> and Method = <Method Specification> . . . . . . . . . . . . . . . . . . 151
Model Fit Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Partial Least Squares Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Model Fit Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Variable Importance Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
VIP vs Coefficients Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Save Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Statistical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Partial Least Squares. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
van der Voet T2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
T2 Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Confidence Ellipses for X Score Scatterplot Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Standard Error of Prediction and Confidence Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Standardized Scores and Loadings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
PLS Discriminant Analysis (PLS‐DA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Overview of the Partial Least Squares Platform
In contrast to ordinary least squares, PLS can be used when the predictors outnumber the observations. PLS is used widely in modeling high‐dimensional data in areas such as spectroscopy, chemometrics, genomics, psychology, education, economics, political science, and environmental science.
The PLS approach to model fitting is particularly useful when there are more explanatory variables than observations or when the explanatory variables are highly correlated. You can use PLS to fit a single model to several responses simultaneously. See Garthwaite (1994), Wold (1995), Wold et al. (2001), Eriksson et al. (2006), and Cox and Gaudard (2013).
Two model fitting algorithms are available: nonlinear iterative partial least squares (NIPALS) and a “statistically inspired modification of PLS” (SIMPLS). (For NIPALS, see Wold, H., 1980; for SIMPLS, see De Jong, 1993. For a description of both methods, see Boulesteix and Strimmer, 2007). The SIMPLS algorithm was developed with the goal of solving a specific optimality problem. For a single response, both methods give the same model. For multiple responses, there are slight differences.
In JMP, the PLS platform is accessible only through Analyze > Multivariate Methods > Partial Least Squares. In JMP Pro, you can also access the Partial Least Squares personality through Analyze > Fit Model.
In JMP Pro, you can do the following:
•
Conduct PLS‐DA (PLS discriminant analysis) by fitting responses with a nominal modeling type, using the Partial Least Squares personality in Fit Model.
•
Fit polynomial, interaction, and categorical effects, using the Partial Least Squares personality in Fit Model.
•
Select among several validation and cross validation methods.
•
Impute missing data.
•
Obtain bootstrap estimates of the distributions of various statistics. Right‐click in the report of interest. For more details, see the Basic Analysis book.
Partial Least Squares uses the van der Voet T2 test and cross validation to help you choose the optimal number of factors to extract.
•
In JMP, the platform uses the leave‐one‐out method of cross validation. You can also choose not to use validation.
•
In JMP Pro, you can choose KFold, Leave‐One‐Out, or random holdback cross validation, or you can specify a validation column. You can also choose not to use validation.
Example of Partial Least Squares
This example is from spectrometric calibration, which is an area where partial least squares is very effective. Suppose you are researching pollution in the Baltic Sea. You would like to use the spectra of samples of sea water to determine the amounts of three compounds that are present in these samples.
The three compounds of interest are:
•
lignin sulfonate (ls), which is pulp industry pollution
•
humic acid (ha), which is a natural forest product
•
an optical whitener from detergent (dt)
The amounts of these compounds in each of the samples are the responses. The predictors are spectral emission intensities measured at a range of wavelengths (v1–v27).
For the purposes of calibrating the model, samples with known compositions are used. The calibration data consist of 16 samples of known concentrations of lignin sulfonate, humic acid, and detergent. Emission intensities are recorded at 27 equidistant wavelengths. Use the Partial Least Squares platform to build a model for predicting the amount of the compounds from the spectral emission intensities.
1. Select Help > Sample Data Library and open Baltic.jmp.
Note: The data in the Baltic.jmp data table are reported in Umetrics (1995). The original source is Lindberg, Persson, and Wold (1983).
2. Select Analyze > Multivariate Methods > Partial Least Squares.
3. Assign ls, ha, and dt to the Y, Response role.
4. Assign Intensities, which contains the 27 intensity variables v1 through v27, to the X, Factor role.
5. Click OK.
The Partial Least Squares Model Launch control panel appears.
6. Select Leave-One-Out as the Validation Method.
7. Click Go.
A portion of the report appears in Figure 7.2. Since the van der Voet test is a randomization test, your Prob > van der Voet T2 values can differ slightly from those in Figure 7.2.
Figure 7.2 Partial Least Squares Report
The Root Mean PRESS (predicted residual sum of squares) Plot shows that Root Mean PRESS is minimized when the number of factors is 7. This is stated in the note beneath the Root Mean PRESS Plot. A report called NIPALS Fit with 7 Factors is produced. A portion of that report is shown in Figure 7.3.
The van der Voet T2 statistic tests to determine whether a model with a different number of factors differs significantly from the model with the minimum PRESS value. A common practice is to extract the smallest number of factors for which the van der Voet significance level exceeds 0.10 (SAS Institute Inc, 2011 and Tobias, 1995). If you were to apply this thinking here, you would fit a new model by entering 6 as the Number of Factors in the Model Launch panel.
Figure 7.3 Seven Extracted Factors
8. Select Diagnostics Plots from the NIPALS Fit with 7 Factors red triangle menu.
This gives a report showing actual by predicted plots and three reports showing various residual plots. The Actual by Predicted Plot in Figure 7.4 shows the degree to which predicted compound amounts agree with actual amounts.
Figure 7.4 Diagnostics Plots
9. Select VIP vs Coefficients Plot from the NIPALS Fit with 7 Factors red triangle menu.
Figure 7.5 VIP vs Coefficients Plot
The VIP vs Coefficients plot helps identify variables that are influential relative to the fit for the various responses. For example, v23, v2, and v26 have VIP values that exceed 0.8 as well as relatively large coefficients.
Launch the Partial Least Squares Platform
There are two ways to launch the Partial Least Squares platform:
•
Select Analyze > Multivariate Methods > Partial Least Squares.
•
Select Analyze > Fit Model and select Partial Least Squares from the Personality menu. This approach enables you to do the following:
‒ Enter categorical variables as Ys or Xs. Conduct PLS‐DA by entering categorical Ys.
‒ Add interaction and polynomial terms to your model.
‒ Use the Standardize X option to construct higher‐order terms using centered and scaled columns.
‒ Save your model specification script.
Some features on the Fit Model launch window are not applicable for the Partial Least Squares personality:
•
Weight, Nest, Attributes, Transform, and No Intercept.
Tip: You can transform a variable by right‐clicking it in the Select Columns box and selecting a Transform option.
•
The following Macros: Mixture Response Surface, Scheffé Cubic, and Radial.
Figure 7.6 JMP Pro Partial Least Squares Launch Window (Imputation Method EM Selected)
The Partial Least Squares launch window contains the following options:
Y, Response Enter numeric response columns. If you enter multiple columns, they are modeled jointly.
In JMP Pro, you can enter nominal response columns in the Fit Model launch window to conduct PLS‐DA. For details, see “PLS Discriminant Analysis (PLS‐DA)” on page 164.
X, Factor Enter the predictor columns. The Partial Least Squares launch window only allows numeric predictors.
In JMP Pro, you can enter nominal and ordinal model effects in the Fit Model launch window. Ordinal effects are treated as nominal.
Freq If your data are summarized, enter the column whose values contain counts for each row.
Validation Enter an optional validation column. A validation column must contain only consecutive integer values. Note the following:
‒ If the validation column has two levels, the smaller value defines the training set and the larger value defines the validation set.
‒ If the validation column has three levels, the values define the training, validation, and test sets in order of increasing value.
‒ If the validation column has more than three levels, then KFold Cross Validation is used. For information about other validation options, see “Validation Method” on page 149.
Note: If you click the Validation button with no columns selected in the Select Columns list, you can add a validation column to your data table. For more information about the Make Validation Column utility, see Basic Analysis.
By
Enter a column that creates separate reports for each level of the variable.
Centering Centers all Y variables and model effects by subtracting the mean from each column. See “Centering and Scaling” on page 148.
Scaling Scales all Y variables and model effects by dividing each column by its standard deviation. See “Centering and Scaling” on page 148.
Standardize X (Fit Model launch window only) Select this option to center and scale all columns that are used in the construction of model effects. If this option is not selected, higher-order effects are constructed using the original data table columns. Then each higher-order effect is centered or scaled, based on the selected Centering and Scaling options. Note that Standardize X does not center or scale Y variables. See “Standardize X” on page 148.
Impute Missing Data Replaces missing data values in Ys or Xs with nonmissing values. Select the appropriate method from the Imputation Method list.
If Impute Missing Data is not selected, rows that are missing observations on any X variable are excluded from the analysis and no predictions are computed for these rows. Rows with no missing observations on X variables but with missing observations on Y variables are also excluded from the analysis, but predictions are computed.
Imputation Method (Appears only when Impute Missing Data is selected) Select from the following imputation methods:
‒ Mean: For each model effect or response column, replaces the missing value with the mean of the nonmissing values.
‒ EM: Uses an iterative Expectation‐Maximization (EM) approach to impute missing values. On the first iteration, the specified model is fit to the data with missing values for an effect or response replaced by their means. Predicted values from the model for Y and the model for X are used to impute the missing values. For subsequent iterations, the missing values are replaced by their predicted values, given the conditional distribution using the current estimates.
For the purpose of imputation, polynomial terms are treated as separate predictors. When a polynomial term is specified, that term is calculated from the original data, or, if Standardize X is checked, from the standardized column values. If a row has a missing value for a column involved in the definition of the polynomial term, then that entry is missing for the polynomial term. Imputation is conducted using polynomial terms defined in this way.
For more details about the EM approach, see Nelson, Taylor, and MacGregor (1996).
Max Iterations (Appears only when EM is selected as the Imputation Method) Enables you to set the maximum number of iterations used by the algorithm. The algorithm terminates early if the maximum difference between the current and previous estimates of missing values falls below 10^-8.
After completing the launch window and clicking OK, the Model Launch control panel appears. See “Model Launch Control Panel” on page 148.
Centering and Scaling
The Centering and Scaling options are selected by default. This means that predictors and responses are centered and scaled to have mean 0 and standard deviation 1. Centering the predictors and the responses places them on an equal footing relative to their variation. Without centering, both the variable’s mean and its variation around that mean are involved in constructing successive factors. To illustrate, suppose that Time and Temp are two of the predictors. Scaling them indicates that a change of one standard deviation in Time is approximately equivalent to a change of one standard deviation in Temp.
Standardize X
When the Partial Least Squares personality is selected in the Fit Model window, the Standardize X option is selected by default. This ensures that all columns entered as model effects, as well as all columns involved in an interaction or polynomial term, are standardized.
Suppose that you have two columns, X1 and X2, and you enter the interaction term X1*X2 as a model effect in the Fit Model window. When the Standardize X option is selected, both X1 and X2 are centered and scaled before forming the interaction term. The interaction term that is formed is calculated as follows:
[(X1 – mean(X1)) / std(X1)] * [(X2 – mean(X2)) / std(X2)]
All model effects are then centered or scaled, in accordance with your selections of the Centering and Scaling options, prior to inclusion in the model.
If the Standardize X option is not selected, and Centering and Scaling are both selected, then the term that is entered into the model is calculated as follows:
(X1*X2 – mean(X1*X2)) / std(X1*X2)
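The following NumPy sketch contrasts the two constructions just described. The columns X1 and X2 are simulated, and the use of the sample (n – 1) standard deviation in standardize is an assumption.

import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(5, 2, size=100)
X2 = rng.normal(10, 3, size=100)

def standardize(v):
    return (v - v.mean()) / v.std(ddof=1)

# Standardize X selected: center and scale X1 and X2 first, then form the interaction
interaction_standardize_x = standardize(X1) * standardize(X2)

# Standardize X not selected, Centering and Scaling selected: form X1*X2 from the
# raw columns, then center and scale the resulting interaction column
interaction_raw_then_scaled = standardize(X1 * X2)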
Model Launch Control Panel
After you click OK in the platform launch window (or Run in the Fit Model window), the Model Launch control panel appears.
Figure 7.7 Partial Least Squares Model Launch Control Panel
Note: The Validation Method portion of the Model Launch control panel appears differently in JMP Pro.
The Model Launch control panel contains the following selections:
Method Specification Select the type of model fitting algorithm. There are two algorithm choices: NIPALS and SIMPLS. The two methods produce the same coefficient estimates when there is only one response variable. See “Statistical Details” on page 159 for details about differences between the two algorithms.
Validation Method Select the validation method. Validation is used to determine the optimum number of factors to extract. For JMP Pro, if a validation column is specified on the platform launch window, these options do not appear.
Holdback Randomly selects the specified proportion of the data for a validation set, and uses the other portion of the data to fit the model.
KFold Partitions the data into K subsets, or folds. In turn, each fold is used to validate the model that is fit to the rest of the data, fitting a total of K models. This method is best for small data sets because it makes efficient use of limited amounts of data.
Leave-One-Out Performs leave‐one‐out cross validation.
None Does not use validation to choose the number of factors to extract. The number of factors is specified in the Factor Search Range.
Factor Search Range Specify how many latent factors to extract if not using validation. If validation is being used, this is the maximum number of factors the platform attempts to fit before choosing the optimum number of factors.
Factor Specification Appears once you click Go to fit an initial model. Specify a number of factors to be used in fitting a new model.
Partial Least Squares Report
The first time you click Go in the Model Launch control panel (Figure 7.7), the Validation Method panel is removed from the Model Launch window. If you specified a Validation column or if you selected Holdback in the Validation Method panel, all model fits in the report are based on the training data. Otherwise, all model fits are based on the entire data set.
If you used validation, three reports appear:
•
Model Comparison Summary
•
<Cross Validation Method> and Method = <Method Specification>
•
NIPALS (or SIMPLS) Fit with <N> Factors
If you selected None as the CV method, two reports appear:
•
Model Comparison Summary
•
NIPALS (or SIMPLS) Fit with <N> Factors
To fit additional models, specify the desired numbers of factors in the Model Launch panel.
Model Comparison Summary
The Model Comparison Summary shows summary results for each fitted model.
Figure 7.8 Model Comparison Summary
In Figure 7.8, models for 7 and then 6 factors have been fit. The report includes the following summary information:
Method Shows the analysis method that you specified in the Model Launch control panel.
Number of rows Shows the number of observations used in the training set.
Number of factors Shows the number of extracted factors.
Percent Variation Explained for Cumulative X Shows the percent of variation in X that is explained by the model.
Percent Variation Explained for Cumulative Y Shows the percent of variation in Y that is explained by the model.
Number of VIP>0.8 Shows the number of model effects with VIP (variable importance for projection) values greater than 0.8. The VIP score is a measure of a variable’s importance relative to modeling both X and Y (Wold, 1995 and Eriksson et al., 2006).
<Cross Validation Method> and Method = <Method Specification>
This report appears only when a form of cross validation is selected as a Validation Method in the Model Launch control panel. It shows summary statistics for models fit, using from 0 to the maximum number of extracted factors, as specified in the Model Launch control panel. The report also provides a plot of Root Mean PRESS values. See “Root Mean PRESS Plot” on page 153. An optimum number of factors is identified using the minimum Root Mean PRESS statistic.
Figure 7.9 Cross Validation Report
When the Standardize X option is selected, cross validation is applied once to the entire data table. It is not reapplied to the individual training sets. However, when any combination of the Centering or Scaling options are selected, this combination of selections is applied to each cross validation training set. Cross validation proceeds by using the training sets, which are individually centered and scaled if these options are selected.
The following statistics are shown in the report. If any form of validation or cross validation is used, the reported results are summaries of the training set statistics.
Number of Factors Number of factors used in fitting the model.
Root Mean PRESS Prediction error sum of squares. For details, see “Root Mean PRESS” on page 153.
van der Voet T2 Test statistic for the van der Voet test, which tests whether models with different numbers of extracted factors differ significantly from the optimum model. The null hypothesis for each van der Voet T2 test states that the model based on the corresponding number of factors does not differ from the optimum model. For more details, see “van der Voet T2” on page 161.
Prob > van der Voet T2 p‐value for the van der Voet T2 test. For more details, see “van der Voet T2” on page 161.
Q2 Dimensionless measure of predictive ability, defined as one minus the ratio of the PRESS value to the total sum of squares for Y:
1 – PRESS / SSY
For details see “Calculation of Q2” on page 153.
Cumulative Q2 Indicator of the predictive ability of models with the given number of factors or fewer. For a given number of factors, f, Cumulative Q2 is defined as follows:
1 – Π_{i=1}^{f} (1 – PRESS_i / SSY_i)
Here PRESSi and SSYi correspond to their values for i factors.
R2X
Percent of X variation explained by the model with the given number of factors. See “Calculation of R2X and R2Y When Validation Is Used” on page 153.
Cumulative R2X Sum of values R2X for i = 1 to the given number of factors.
R2Y Percent of Y variation explained by the model with the given number of factors. See “Calculation of R2X and R2Y When Validation Is Used” on page 153.
Cumulative R2Y Sum of values R2Y for i = 1 to the given number of factors.
Root Mean PRESS Plot
This bar chart shows the number of factors along the horizontal axis and the Root Mean PRESS values on the vertical axis. It is equivalent to the horizontal bar chart that appears to the right of the Root Mean PRESS column in the Cross Validation report. See Figure 7.9.
Root Mean PRESS
For a specified number of factors, a, Root Mean PRESS is calculated as follows:
1. Fit a model with a factors to each training set (with None as the Validation Method).
2. Apply the resulting prediction formula to the observations in the validation set.
3. For each Y:
‒ For each validation set, compute the squared difference between each observed validation set value and its predicted value (the squared prediction error).
‒ For each validation set, average these squared differences and divide the result by the variance for the entire response column.
‒ Sum these means and, in the case of more than one validation set, divide their sum by the number of validation sets minus one. This is the PRESS statistic for the given Y.
4. Root Mean PRESS for a factors is the square root of the average of the PRESS values across all responses.
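The NumPy/scikit-learn sketch below follows these steps for KFold validation on simulated data. It substitutes scikit-learn's PLSRegression for the platform's fit and takes the steps above literally, including the division by the number of validation sets minus one; details such as the variance divisor are assumptions.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n, n_x, n_y, a = 40, 12, 2, 3                           # a = number of factors being assessed
X = rng.normal(size=(n, n_x))
Y = X[:, :3] @ rng.normal(size=(3, n_y)) + 0.1 * rng.normal(size=(n, n_y))

folds = list(KFold(n_splits=5, shuffle=True, random_state=1).split(X))
press = np.zeros(n_y)
for train, valid in folds:
    model = PLSRegression(n_components=a).fit(X[train], Y[train])   # step 1
    err2 = (Y[valid] - model.predict(X[valid])) ** 2                # step 2: squared prediction errors
    press += err2.mean(axis=0) / Y.var(axis=0, ddof=1)              # step 3: fold mean / column variance
press /= len(folds) - 1                                  # step 3: sum of fold means / (sets - 1)
root_mean_press = np.sqrt(press.mean())                  # step 4: average over responses, then square root
print(root_mean_press)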
Calculation of Q2
The statistic Q2 is defined as 1 – PRESS / SSY. The PRESS statistic is the predicted error sum of squares across all responses for the model developed based on training data, but evaluated on the validation set. The value of SSY is the sum of squares for Y across all responses based on the observations in the validation set.
The statistic Q2 in the Cross Validation report is computed in the following ways, depending on the selected Validation Method:
Leave-One-Out Q2 is the average of the values 1 – PRESS / SSY computed for the validation sets based on the models constructed by leaving out one observation at a time.
KFold Q2 is the average of the values 1 – PRESS / SSY computed for the validation sets based on the K models constructed by leaving out each of the K folds.
Holdback or Validation Set Q2 is the value of 1 – PRESS / SSY computed for the validation set based on the model constructed using the single set of training data.
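As a concrete illustration, the following sketch computes the Holdback (single validation set) form of Q2 on simulated data using scikit-learn; taking SSY about the validation-set mean is an assumption of the sketch.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 10))
Y = X[:, :2] @ rng.normal(size=(2, 1)) + 0.2 * rng.normal(size=(50, 1))

train, valid = np.arange(35), np.arange(35, 50)        # single holdback split
model = PLSRegression(n_components=2).fit(X[train], Y[train])
resid = Y[valid] - model.predict(X[valid])             # predicted residuals on the validation set

press = (resid ** 2).sum()                             # PRESS over the validation set
ssy = ((Y[valid] - Y[valid].mean()) ** 2).sum()        # SSY over the validation set (about its mean)
q2 = 1 - press / ssy
print(q2)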
Calculation of R2X and R2Y When Validation Is Used
The statistics R2X and R2Y in the Cross Validation report are computed in the following ways, depending on the selected Validation Method:
Note: For all of these computations, R2Y is calculated analogously.
Leave-One-Out R2X is the average of the Percent Variation Explained for X Effects for the models constructed by leaving out one observation at a time.
KFold R2X is the average of the Percent Variation Explained for X Effects for the K models constructed by leaving out each fold.
Holdback or Validation Set R2X is the Percent Variation Explained for X Effects for the model constructed using the training data.
Model Fit Report
The Model Fit Report shows detailed results for each fitted model. The fit uses either the optimum number of factors based on cross validation, or the specified number of factors if no cross validation methods are specified. The report title indicates whether NIPALS or SIMPLS was used and gives the number of extracted factors.
Figure 7.10 Model Fit Report
The Model Fit report includes the following summary information:
X-Y Scores Plots Scatterplots of the X and Y scores for each extracted factor.
Percent Variation Explained Shows the percent variation and cumulative percent variation explained for both X and Y. Results are given for each extracted factor.
Model Coefficients for Centered and Scaled Data For each Y, shows the coefficients of the Xs for the model based on the centered and scaled data.
Partial Least Squares Options
The Partial Least Squares red triangle menu contains the following options:
Set Random Seed Sets the seed for the randomization process used for KFold and Holdback validation. This is useful if you want to reproduce an analysis. Set the seed to a positive value, save the script, and the seed is automatically saved in the script. Running the script always produces the same cross validation analysis. This option does not appear when Validation Method is set to None, or when a validation column is used.
Script Contains automation options that are available to all platforms. See the Using JMP book.
Model Fit Options
The Model Fit red triangle menu contains the following options:
Percent Variation Plots Adds two plots entitled Percent Variation Explained for X Effects and Percent Variation Explained for Y Effects. These show stacked bar charts representing the percent variation explained by each extracted factor for the Xs and Ys.
Variable Importance Plot Plots the VIP values for each X variable. VIP scores appear in the Variable Importance Table. See “Variable Importance Plot” on page 157.
VIP vs Coefficients Plots Plots the VIP statistics against the model coefficients. You can show only those points corresponding to your selected Ys. Additional labeling options are provided. There are plots for both the centered and scaled data and the original data. See “VIP vs Coefficients Plots” on page 157.
Set VIP Threshold Sets the threshold level for the Variable Importance Plot, the Variable Importance Table, and the VIP vs Coefficients Plots.
Coefficient Plots Plots the model coefficients for each response across the X variables. You can show only those points corresponding to your selected Ys. There are plots for both the centered and scaled data and the original data.
Loading Plots Plots X and Y loadings for each extracted factor. There are separate plots for the Xs and Ys.
Loading Scatterplot Matrices Shows scatterplot matrices of the X loadings and the Y loadings.
Correlation Loading Plot Shows either a single scatterplot or a scatterplot matrix of the X and Y loadings overlaid on the same plot. When you select the option, you specify how many factors you want to plot.
‒ If you specify two factors, a single correlation loading scatterplot appears. Select the two factors that define the axes beneath the plot. Click the right arrow button to successively display each combination of factors on the plot.
‒ If you specify more than two factors, a scatterplot matrix appears with a cell for each pair of factors up to the number that you selected.
In both cases, use check boxes to control labeling.
X-Y Score Plots Includes the following options:
Fit Line Shows or hides a fitted line through the points on the X‐Y Scores Plots.
Show Confidence Band Shows or hides 95% confidence bands for the fitted lines on the X‐Y Scores Plots. These should be used only for outlier detection.
Score Scatterplot Matrices Shows a scatterplot matrix of the X scores and a scatterplot matrix of the Y scores. Each X score scatterplot displays a 95% confidence ellipse, which can be used for outlier detection. For statistical details about the confidence ellipses, see “Confidence Ellipses for X Score Scatterplot Matrix” on page 162.
Distance Plots Shows plots of the following:
‒ the distance from each observation to the X model
‒ the distance from each observation to the Y model
‒ a scatterplot of distances to both the X and Y models
In a good model, both X and Y distances are small, so the points are close to the origin (0,0). Use the plots to look for outliers relative to either X or Y. If a group of points clusters together, then they might have a common feature and could be analyzed separately. When a validation set or a validation and test set are in use, separate reports are provided for these sets and for the training set.
T Square Plot Shows a plot of T2 statistics for each observation, along with a control limit. An observation’s T2 statistic is calculated based on that observation’s scores on the extracted factors. For details about the computation of T2 and the control limit, see “T2 Plot” on page 162.
Diagnostics Plots Shows diagnostic plots for assessing the model fit. Four plot types are available: Actual by Predicted Plot, Residual by Predicted Plot, Residual by Row Plot, and a Residual Normal Quantile Plot. Plots are provided for each response. When a validation set or a validation and test set are in use, separate reports are provided for these sets and for the training set.
Profiler Shows a profiler for each Y variable.
Spectral Profiler Shows a single profiler where all of the response variables appear in the first cell of the plot. This profiler is useful for visualizing the effect of changes in the X variables on the Y variables simultaneously.
Save Columns Includes options for saving various formulas and results. See “Save Columns” on page 158.
Remove Fit Removes the model report from the main platform report.
Make Model Using VIP Opens and populates a launch window with the appropriate responses entered as Ys and the variables whose VIPs exceed the specified threshold entered as Xs. Performs the same function as the button in the VIP vs Coefficients for Centered and Scaled Data report. See “VIP vs Coefficients Plots” on page 157.
Variable Importance Plot
The Variable Importance Plot graphs the VIP values for each X variable. The Variable Importance Table shows the VIP scores. A VIP score is a measure of a variable’s importance in modeling both X and Y. If a variable has a small coefficient and a small VIP, then it is a candidate for deletion from the model (Wold, 1995). A value of 0.8 is generally considered to be a small VIP (Eriksson et al., 2006), and a blue line is drawn on the plot at 0.8.
Figure 7.11 Variable Importance Plot
VIP vs Coefficients Plots
Two options to the right of the plot facilitate variable reduction and model building:
•
Make Model Using VIP opens and populates a launch window with the appropriate responses entered as Ys and the variables whose VIPs exceed the specified threshold entered as Xs.
•
Make Model Using Selection enables you to select Xs directly in the plot and then enters the Ys and only the selected Xs into a launch window.
To use another platform based on your current column selection, open the desired platform. The selections are retained in the launch window. Click a role button to populate it with the selected columns.
Figure 7.12 VIP vs Coefficients Plot for Centered and Scaled Data
Save Columns
Save Prediction Formula For each response, saves a column to the data table called Pred Formula <response> that contains the prediction formula.
Save Prediction as X Score Formula For each response, saves a column to the data table called Pred Formula <response> that contains the prediction formula in terms of the X scores.
Save Standard Errors of Prediction Formula For each response, saves a column to the data table called PredSE <response> that contains the standard error of the predicted mean. For details, see “Standard Error of Prediction and Confidence Limits” on page 162.
Save Mean Confidence Limit Formula For each response, saves two columns to the data table called Lower 95% Mean <response> and Upper 95% Mean <response>. These columns contain 95% confidence limits for the response mean. For details, see “Standard Error of Prediction and Confidence Limits” on page 162.
Save Individual Confidence Limit Formula For each response, saves two columns to the data table called Lower 95% Indiv <response> and Upper 95% Indiv <response>. These columns contain 95% prediction limits for individual values. For details, see “Standard Error of Prediction and Confidence Limits” on page 162.
Save X Score Formula Saves a column to the data table called X Score <N> Formula
containing the formula for each X Score. See “Partial Least Squares” on page 160.
Save Y Predicted Values Saves the predicted values for the Y variables to columns in the data table.
Save Y Residuals Saves the residual values for the Y variables to columns in the data table.
Save X Predicted Values Saves the predicted values for the X variables to columns in the data table.
Save X Residuals
Saves the residual values for the X variables to columns in the data table.
Save Percent Variation Explained For X Effects Saves the percent variation explained for each X variable across all extracted factors to a new table.
Save Percent Variation Explained For Y Responses Saves the percent variation explained for each Y variable across all extracted factors to a new table.
Save Scores Saves the X and Y scores for each extracted factor to the data table.
Save Loadings
Saves the X and Y loadings to new data tables.
Save Standardized Scores Saves the X and Y standardized scores used in constructing the Correlation Loading Plot to the data table. For the formulas, see “Standardized Scores and Loadings” on page 163.
Save Standardized Loadings Saves the X and Y standardized loadings used in constructing the Correlation Loading Plot to new data tables. For the formulas, see “Standardized Scores and Loadings” on page 163.
Save T Square Saves the T2 values to the data table. These are the values used in the T Square Plot.
Save Distance Saves the Distance to X Model (DModX) and Distance to Y Model (DModY) values to the data table. These are the values used in the Distance Plots.
Save X Weights Saves the weights for each X variable across all extracted factors to a new data table.
Save Validation Saves a new column to the data table describing how each observation was used in validation. For Holdback validation, the column identifies whether a row was used for training or validation. For KFold validation, the column identifies the number of the subgroup to which the row was assigned.
Save Imputation If Impute Missing Data is selected, opens a new data table that contains the data table columns specified as X and Y, with missing values replaced by their imputed values. Columns for polynomial terms are not shown. If a Validation column is specified, the validation column is also included.
Statistical Details
This section provides details about some of the methods used in the Partial Least Squares platform. For additional details, see Hoskuldsson (1988), Garthwaite (1994), or Cox and Gaudard (2013).
Partial Least Squares
Partial least squares fits linear models based on linear combinations, called factors, of the explanatory variables (Xs). These factors are obtained in a way that attempts to maximize the covariance between the Xs and the response or responses (Ys). In this way, PLS exploits the correlations between the Xs and the Ys to reveal underlying latent structures. The factors address the combined goals of explaining response variation and predictor variation. Partial least squares is particularly useful when you have more X variables than observations or when the X variables are highly correlated.
NIPALS
The NIPALS method works by extracting one factor at a time. Let X = X0 be the centered and scaled matrix of predictors and Y = Y0 the centered and scaled matrix of response values. The PLS method starts with a linear combination t = X0w of the predictors, where t is called a score vector and w is its associated weight vector. The PLS method predicts both X0 and Y0 by regression on t:
X̂0 = tp’, where p’ = (t’t)^{–1} t’X0
Ŷ0 = tc’, where c’ = (t’t)^{–1} t’Y0
The vectors p and c are called the X‐ and Y‐loadings, respectively.
The specific linear combination t = X0w is the one that has maximum covariance t´u with some response linear combination u = Y0q. Another characterization is that the X‐ and Y‐weights, w and q, are proportional to the first left and right singular vectors of the covariance matrix X0´Y0. Or, equivalently, the first eigenvectors of X0´Y0Y0´X0 and Y0´X0X0´Y0 respectively.
This accounts for how the first PLS factor is extracted. The second factor is extracted in the same way by replacing X0 and Y0 with the X‐ and Y‐residuals from the first factor:
X1 = X0 – X̂ 0
Y1 = Y0 – Ŷ 0
These residuals are also called the deflated X and Y blocks. The process of extracting a score vector and deflating the data matrices is repeated for as many extracted factors as desired.
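The NumPy sketch below extracts factors in the way just described, computing each weight vector directly from the SVD of the current cross-product matrix (the characterization given above) rather than by the classical iterative NIPALS loop. The input matrices are assumed to be centered and scaled already, and the function name is illustrative.

import numpy as np

def nipals_style_pls(X0, Y0, n_factors):
    """Extract PLS factors as sketched above. X0 and Y0 are assumed centered and scaled."""
    X, Y = X0.copy(), Y0.copy()
    scores, x_loads, y_loads = [], [], []
    for _ in range(n_factors):
        u_left, _, _ = np.linalg.svd(X.T @ Y, full_matrices=False)
        w = u_left[:, 0]                   # weight vector: first left singular vector of X'Y
        t = X @ w                          # score vector
        p = X.T @ t / (t @ t)              # X loadings: regression of X on t
        c = Y.T @ t / (t @ t)              # Y loadings: regression of Y on t
        X = X - np.outer(t, p)             # deflate the X block
        Y = Y - np.outer(t, c)             # deflate the Y block
        scores.append(t)
        x_loads.append(p)
        y_loads.append(c)
    return np.array(scores).T, np.array(x_loads).T, np.array(y_loads).T

rng = np.random.default_rng(5)
Xc = rng.normal(size=(20, 8))
Xc = Xc - Xc.mean(axis=0)
Yc = Xc[:, :2] @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(20, 2))
Yc = Yc - Yc.mean(axis=0)
T_scores, P_load, C_load = nipals_style_pls(Xc, Yc, n_factors=3)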
SIMPLS
The SIMPLS algorithm was developed to optimize a statistical criterion: it finds score vectors that maximize the covariance between linear combinations of Xs and Ys, subject to the requirement that the X‐scores are orthogonal. Unlike NIPALS, where the matrices X0 and Y0 are deflated, SIMPLS deflates the cross‐product matrix, X0´Y0.
In the case of a single Y variable, these two algorithms are equivalent. However, for multivariate Y, the models differ. SIMPLS was suggested by De Jong (1993).
van der Voet T2
The van der Voet T2 test helps determine whether a model with a specified number of extracted factors differs significantly from a proposed optimum model. The test is a randomization test based on the null hypothesis that the squared residuals for both models have the same distribution. Intuitively, one can think of the null hypothesis as stating that both models have the same predictive ability.
To obtain the van der Voet T2 statistic given in the Cross Validation report, the calculation below is performed on each validation set. In the case of a single validation set, the result is the reported value. In the case of Leave‐One‐Out and KFold validation, the results for each validation set are averaged.
Denote by R i jk the jth predicted residual for response k for the model with i extracted factors. Denote by R opt ,jk is the corresponding quantity for the model based on the proposed optimum number of factors, opt. The test statistic is based on the following differences:
2
2
D i jk = R i jk – R opt jk
Suppose that there are K responses. Consider the following notation:
d i j =  D i j1 D i j2  D i jK 
d i . =
 d i j
j
Si =
 di j di j 
j
The van der Voet statistic for i extracted factors is defined as follows:
–1
C i = d i . S i d i .
The significance level is obtained by comparing C_i with the distribution of values that results from randomly exchanging R_{i,jk}^2 and R_{opt,jk}^2. A Monte Carlo sample of such values is simulated and the significance level is approximated as the proportion of simulated critical values that are greater than C_i.
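A NumPy sketch of the statistic and its randomization significance level for a single validation set follows. The text does not spell out whether squared residuals are exchanged per observation or per observation-and-response pair; the sketch exchanges whole observation rows, which is an assumption.

import numpy as np

def van_der_voet(resid_i, resid_opt, n_sim=1000, seed=0):
    """resid_i, resid_opt: n by K predicted residuals on one validation set for the
    i-factor model and the proposed optimum model. Returns C_i and an approximate p-value."""
    rng = np.random.default_rng(seed)
    D = resid_i ** 2 - resid_opt ** 2                  # D_{i,jk}: rows are observations j
    d_dot = D.sum(axis=0)                              # d_{i.} = sum over j of d_{i,j}
    S = D.T @ D                                        # S_i = sum over j of d_{i,j} d_{i,j}'
    C = d_dot @ np.linalg.solve(S, d_dot)              # C_i = d_{i.} S_i^{-1} d_{i.}'
    exceed = 0
    for _ in range(n_sim):
        swap = rng.integers(0, 2, size=(D.shape[0], 1))  # exchange the two models' residuals
        D_star = np.where(swap == 1, -D, D)              # per observation (an assumption)
        d_star = D_star.sum(axis=0)
        C_star = d_star @ np.linalg.solve(D_star.T @ D_star, d_star)
        exceed += C_star > C
    return C, exceed / n_sim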
T2 Plot
The T2 value for the ith observation is computed as follows:
T_i^2 = (n – 1) Σ_{j=1}^{p} [ t_{ij}^2 / Σ_{k=1}^{n} t_{kj}^2 ]
where tij = X score for the ith row and jth extracted factor, p = number of extracted factors, and n = number of observations used to train the model. If validation is not used, n = total number of observations.
The control limit for the T2 Plot is computed as follows:
((n–1)^2/n)*BetaQuantile(0.95, p/2, (n–p–1)/2)
where p = number of extracted factors, and n = number of observations used to train the model. If validation is not used, n = total number of observations. See Tracy, Young, and Mason, 1992.
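A short Python sketch of both formulas, using SciPy's beta quantile function, is shown below; the score matrix passed in is assumed to contain the X scores of the training observations.

import numpy as np
from scipy.stats import beta

def t_square(scores):
    """scores: n by p matrix of X scores for the observations used to train the model."""
    n, p = scores.shape
    t2 = (n - 1) * (scores ** 2 / (scores ** 2).sum(axis=0)).sum(axis=1)
    limit = ((n - 1) ** 2 / n) * beta.ppf(0.95, p / 2, (n - p - 1) / 2)
    return t2, limit

scores = np.random.default_rng(7).normal(size=(30, 3))   # stand-in for saved X scores
t2_values, control_limit = t_square(scores)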
Confidence Ellipses for X Score Scatterplot Matrix
The Score Scatterplot Matrices option adds 95% confidence ellipses to the X Score scatterplots. The X scores are uncorrelated because both the NIPALS and SIMPLS algorithms produce orthogonal score vectors. The ellipses assume that each pair of X scores follows a bivariate normal distribution with zero correlation.
Consider a scatterplot for score i on the vertical axis and score j on the horizontal axis. The coordinates of the top, bottom, left, and right extremes of the ellipse are as follows:
•
the top and bottom extremes are +/‐sqrt(var(score i)*z)
•
the left and right extremes are +/‐sqrt(var(score j)*z)
where z = ((n‐1)*(n‐1)/n)*BetaQuantile(0.95, 1, (n‐3)/2). For background on the z value, see Tracy, Young, and Mason, 1992.
Standard Error of Prediction and Confidence Limits
Let X denote the matrix of predictors and Y the matrix of response values, which might be centered and scaled based on your selections in the launch window. Assume that the components of Y are independent and normally distributed with a common variance σ^2.
Hoskuldsson (1988) observes that the PLS model for Y in terms of scores is formally similar to a multiple linear regression model. He uses this similarity to derive an approximate formula for the variance of a predicted value. See also Umetrics (1995). However, Denham (1997) points out that any value predicted by PLS is a non-linear function of the Ys. He suggests bootstrap and cross validation techniques for obtaining prediction intervals. The PLS platform uses the normality-based approach described in Umetrics (1995).
Denote the matrix whose columns are the scores by T and consider a new observation on X, x0. The predictive model for Y is obtained by regressing Y on T. Denote the score vector associated with x0 by t0.
Let a denote the number of factors. Define s^2 to be the sum of squares of residuals divided by df = n – a – 1 if the data are centered and df = n – a if the data are not centered. The value of s^2 is an estimate of σ^2.
Standard Error of Prediction Formula
The standard error of the predicted mean at x0 is estimated by the following:
SE(Ŷ_{x0}) = s * sqrt( 1/n + t_0’(T’T)^{–1} t_0 )
Mean Confidence Limit Formula
Let t0.975, df denote the 0.975 quantile of a t distribution with degrees of freedom df = n ‐ a ‐1 if the data are centered and df = n ‐ a if the data are not centered.
The 95% confidence interval for the mean is computed as follows:
Ŷ_{x0} ± t_{0.975, df} SE(Ŷ_{x0})
Indiv Confidence Limit Formula
The standard error of a predicted individual response at x0 is estimated by the following:
SE(Ŷ_{x0}) = s * sqrt( 1 + 1/n + t_0’(T’T)^{–1} t_0 )
Let t0.975, df denote the 0.975 quantile of a t distribution with degrees of freedom df = n ‐ a ‐1 if the data are centered and df = n ‐ a if the data are not centered.
The 95% prediction interval for an individual response is computed as follows:
Ŷ_{x0} ± t_{0.975, df} SE(Ŷ_{x0})
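The following Python sketch evaluates these standard errors and the corresponding interval half-widths for one new observation. How the pieces are supplied (the training score matrix T, the residuals of Y regressed on T, and the new score vector t0) is an assumption about the interface, not the platform's code.

import numpy as np
from scipy.stats import t as t_dist

def pls_prediction_limits(T, resid, t0, centered=True, level=0.95):
    """T: n by a matrix of training X scores; resid: n by k residuals from regressing Y on T;
    t0: score vector of the new observation. Returns the mean and individual standard
    errors and the corresponding interval half-widths, one entry per response."""
    n, a = T.shape
    df = n - a - 1 if centered else n - a
    s2 = (resid ** 2).sum(axis=0) / df                 # per-response estimate of sigma^2
    h = t0 @ np.linalg.solve(T.T @ T, t0)              # t0'(T'T)^{-1} t0
    se_mean = np.sqrt(s2 * (1 / n + h))
    se_indiv = np.sqrt(s2 * (1 + 1 / n + h))
    q = t_dist.ppf(0.5 + level / 2, df)                # 0.975 quantile when level = 0.95
    return se_mean, se_indiv, q * se_mean, q * se_indiv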
Standardized Scores and Loadings
Consider the following notation:
•
ntr is the number of observations in the training set
•
m is the number of effects in X
•
k is the number of responses in Y
•
VarXi is the percent variation in X explained by the ith factor
•
VarYi is the percent variation in Y explained by the ith factor
•
XScorei is the vector of X scores for the ith factor
•
YScorei is the vector of Y scores for the ith factor
•
XLoadi is the vector of X loadings for the ith factor
•
YLoadi is the vector of Y loadings for the ith factor
Standardized Scores
The vector of ith Standardized X Scores is defined as follows:
XScore_i / sqrt( (n_tr – 1) m VarX_i / n_tr )
The vector of ith Standardized Y Scores is defined as follows:
YScore_i / sqrt( (n_tr – 1) k VarY_i / n_tr )
Standardized Loadings
The vector of ith Standardized X Loadings is defined as follows:
XLoad_i sqrt( m VarX_i )
The vector of ith Standardized Y Loadings is defined as follows:
YLoad_i sqrt( k VarY_i )
PLS Discriminant Analysis (PLS-DA)
When a categorical variable is entered as Y in the launch window, it is coded using indicator coding. If there are k levels, each level is represented by an indicator variable with the value 1 for rows in that level and 0 otherwise. The resulting k indicator variables are treated as continuous and the PLS analysis proceeds as it would with continuous Ys.
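A minimal NumPy sketch of this indicator coding is shown below; the level labels are hypothetical.

import numpy as np

levels = np.array(["A", "B", "C", "A", "B"])           # hypothetical categorical response
classes = np.unique(levels)
# One indicator column per level: 1 for rows in that level, 0 otherwise
Y_indicator = (levels[:, None] == classes[None, :]).astype(float)
# Y_indicator is then treated as a set of continuous responses for the PLS fit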
Appendix A
References
Bartlett, M.S. (1937), “Properties of sufficiency and statistical tests,” Proceedings of the Royal Society of London Series A, 160, 268–282.
Bartlett, M.S. (1954), “A Note on the Multiplying Factors for Various Chi Square Approximations,” Journal of the Royal Statistical Society, 16 (Series B), 296‐298.
Boulesteix, A.‐L. and Strimmer, K. (2007), “Partial Least Squares: A Versatile Tool for the Analysis of High‐Dimensional Genomic Data,” Briefings in Bioinformatics, 8(1), 32‐44.
Cox, I. and Gaudard, M. (2013), Discovering Partial Least Squares with JMP, Cary NC: SAS Institute Inc.
Cronbach, L.J. (1951), “Coefficient Alpha and the Internal Structure of Tests,” Psychometrika, 16, 297–334.
De Jong, S. (1993), “SIMPLS: An Alternative Approach to Partial Least Squares Regression,” Chemometrics and Intelligent Laboratory Systems, 18, 251–263.
Denham, M.C. (1997), “Prediction Intervals in Partial Least Squares,” Journal of Chemometrics, 11, 39‐52.
Dwass, M. (1955), “A Note on Simultaneous Confidence Intervals,” Annals of Mathematical Statistics 26: 146–147.
Eriksson, L., Johansson, E., Kettaneh‐Wold, N., Trygg, J., Wikstrom, C., and Wold, S. (2006), Multi‐ and Megavariate Data Analysis Basic Principles and Applications (Part I), Chapter 4, Umetrics.
Farebrother, R.W. (1981), “Mechanical Representations of the L1 and L2 Estimation Problems,” Statistical Data Analysis, 2nd Edition, Amsterdam, North Holland: edited by Y. Dodge.
Fieller, E.C. (1954), “Some Problems in Interval Estimation,” Journal of the Royal Statistical Society, Series B, 16, 175‐185.
Florek, K., Lukaszewicz, J., Perkal, J., and Zubrzycki, S. (1951a), “Sur La Liaison et la Division des Points d’un Ensemble Fini,” Colloquium Mathematicae, 2, 282–285.
Garthwaite, P. (1994), “An Interpretation of Partial Least Squares,” Journal of the American Statistical Association, 89:425, 122‐127.
Golub, G.H., Kahan, W. (1965), “Calculating the singular values and pseudo‐inverse of a matrix,” Journal of the Society for Industrial and Applied Mathematics: Series B, Numerical Analysis 2:2, 205–224.
Goodnight, J.H. (1978), “Tests of Hypotheses in Fixed Effects Linear Models,” SAS Technical Report R–101, Cary NC: SAS Institute Inc, also in Communications in Statistics (1980), A9 167–180.
Goodnight, J.H. and W.R. Harvey (1978), “Least Square Means in the Fixed Effect General Linear Model,” SAS Technical Report R–103, Cary NC: SAS Institute Inc.
Harris, C.W. and Kaiser, H.F. (1964), “Oblique Factor Analytic Solutions by Orthogonal Transformation,” Psychometrika, 32, 363–379.
Hartigan, J.A. (1981), “Consistence of Single Linkage for High–Density Clusters,” Journal of the American Statistical Association, 76, 388–394.
Hocking, R.R. (1985), The Analysis of Linear Models, Monterey: Brooks–Cole.
Hoskuldsson, A. (1988), “PLS Regression Methods,” Journal of Chemometrics, 2:3, 211‐228.
Hoeffding, W. (1948), “A Non-Parametric Test of Independence,” Annals of Mathematical Statistics, 19, 546–557.
Huber, P.J. (1964), “Robust Estimation of a Location Parameter,” Annals of Mathematical Statistics, 35:1, 73‐101.
Huber, Peter J. (1973), “Robust Regression: Asymptotics, Conjecture, and Monte Carlo,” Annals of Statistics, Volume 1, Number 5, 799‐821.
Huber, P.J. and Ronchetti, E.M. (2009), Robust Statistics, Second Edition, Wiley.
Jackson, J. Edward (2003), A User’s Guide to Principal Components, New Jersey: John Wiley and Sons.
Jardine, N. and Sibson, R. (1971), Mathematical Taxonomy, New York: John Wiley and Sons.
Lindberg, W., Persson, J.‐A., and Wold, S. (1983), “Partial Least‐Squares Method for Spectrofluorimetric Analysis of Mixtures of Humic Acid and Ligninsulfonate,” Analytical Chemistry, 55, 643–648.
Mason, R.L. and Young, J.C. (2002), Multivariate Statistical Process Control with Industrial Applications, Philadelphia: ASA‐SIAM.
McLachlan, G.J. and Krishnan, T. (1997), The EM Algorithm and Extensions, New York: John Wiley and Sons.
McQuitty, L.L. (1957), “Elementary Linkage Analysis for Isolating Orthogonal and Oblique Types and Typal Relevancies,” Educational and Psychological Measurement, 17, 207–229.
Milligan, G.W. (1980), “An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms,” Psychometrika, 45, 325–342.
Nelson, Philip R.C., Taylor, Paul A., MacGregor, John F. (1996), “Missing Data Methods in PCA and PLS: Score calculations with incomplete observations,” Chemometrics and Intelligent Laboratory Systems, 35, 45‐65.
Press, W.H, Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. (1998), Numerical Recipes in C: The Art of Scientific Computing, Second Edition, Cambridge, England: Cambridge University Press.
SAS Institute Inc. (2011), SAS/STAT 9.2 User’s Guide, “The VARCLUS Procedure,” Cary, NC: SAS Institute Inc. Retrieved April 15, 2015 from http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#varclus_toc.htm.
SAS Institute Inc. (2011), SAS/STAT 9.3 User’s Guide, “The PLS Procedure,” Cary, NC: SAS Institute Inc. Retrieved April 15, 2015 from http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#pls_toc.htm.
SAS Institute Inc. (2011), SAS/STAT 9.3 User’s Guide, “The CANDISC Procedure,” Cary, NC: SAS Institute Inc. Retrieved April 15, 2015 from http://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#candisc_toc.htm.
Sneath, P.H.A. (1957), “The Application of Computers to Taxonomy,” Journal of General Microbiology, 17, 201–226.
Sokal, R.R. and Michener, C.D. (1958), “A Statistical Method for Evaluating Systematic Relationships,” University of Kansas Science Bulletin, 38, 1409–1438.
Tobias, R.D. (1995), “An Introduction to Partial Least Squares Regression,” Proceedings of the Twentieth Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc.
Tracy, N.D., Young, J.C., Mason, R.R. (1992), “Multivariate Control Charts for Individual Observations,” Journal of Quality Technology, 24, 88–95.
Umetrics (1995), Multivariate Analysis (3‐day course), Winchester, MA.
Wold, H. (1980), “Soft Modelling: Intermediate between Traditional Model Building and Data Analysis,” Mathematical Statistics (Banach Center Publications, Warsaw), 6, 333–346.
Wold, S. (1994), “PLS for Multivariate Linear Modeling”, QSAR: Chemometric Methods in Molecular Design. Methods and Principles in Medicinal Chemistry.
Wold, S., Sjostrom, M., and Eriksson, L. (2001), “PLS‐Regression: A Basic Tool of Chemometrics,” Chemometrics and Intelligent Laboratory Systems, 58:2, 109‐130.
Wright, S.P. and R.G. O’Brien (1988), “Power Analysis in an Enhanced GLM Procedure: What it Might Look Like,” SUGI 1988, Proceedings of the Thirteenth Annual Conference, 1097–1102, Cary NC: SAS Institute Inc.
Appendix B
Statistical Details
Multivariate Methods
This appendix discusses Wide Linear methods and the use of the singular value decomposition. It also gives details on computations used in multivariate tests and exact and approximate F‐statistics.
Contents
Wide Linear Methods and the Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . 171
The Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
The SVD and the Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
The SVD and the Inverse Covariance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Calculating the SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Multivariate Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Approximate F‐Tests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Wide Linear Methods and the Singular Value Decomposition
Wide Linear methods in the Cluster, Principal Components, and Discriminant platforms enable you to analyze data sets with thousands (or even millions) of variables. Most multivariate techniques require the calculation or inversion of a covariance matrix. When your multivariate analysis involves a large number of variables, the covariance matrix can be prohibitively large so that calculating it or inverting it is problematic and computationally expensive.
Suppose that your data consist of n rows and p columns. The rank of the covariance matrix is at most the smaller of n and p. In wide data sets, p is often much larger than n. In these cases, the inverse of the covariance matrix has at most n nonzero eigenvalues. Wide Linear methods use this fact, together with the singular value decomposition, to provide efficient calculations. See “Calculating the SVD” on page 173.
The Singular Value Decomposition
The singular value decomposition (SVD) enables you to express any linear transformation as a rotation, followed by a scaling, followed by another rotation. The SVD states that any n by p matrix X can be written as follows:
X = U Diag(Λ) V’
Let r be the rank of X. Denote the r by r identity matrix by Ir.
The matrices U, Diag(Λ), and V have the following properties:
• U is an n by r semi-orthogonal matrix with U’U = Ir.
• V is a p by r semi-orthogonal matrix with V’V = Ir.
• Diag(Λ) is an r by r diagonal matrix with positive diagonal elements given by the column vector Λ = (λ1, λ2, …, λr)’, where λ1 ≥ λ2 ≥ … ≥ λr > 0.
The λi are the nonzero singular values of X.
The following statements relate the SVD to the spectral decomposition of a square matrix:
• The squares of the λi are the nonzero eigenvalues of X’X.
• The r columns of V are eigenvectors of X’X.
Note: There are various conventions in the literature regarding the dimensions of the matrices U, V, and the matrix containing the eigenvalues. However, the differences have no practical impact on the decomposition up to the rank of X.
For further details, see Press et al. (1998, Section 2.6).
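The following sketch (Python with NumPy, shown only to illustrate the decomposition; it is not the code that JMP uses) verifies these properties numerically for a small wide matrix.

```python
import numpy as np

# Illustrative sketch (NumPy, not JMP code): check the SVD properties
# described above on a small wide matrix (n < p).
rng = np.random.default_rng(1)
n, p = 5, 12
X = rng.normal(size=(n, p))

# full_matrices=False gives the "economy" SVD: U is n x r, Vt is r x p
U, lam, Vt = np.linalg.svd(X, full_matrices=False)
r = np.linalg.matrix_rank(X)

# X is recovered from U Diag(lam) V'
assert np.allclose(X, U @ np.diag(lam) @ Vt)

# Semi-orthogonality: U'U = I and V'V = I
assert np.allclose(U.T @ U, np.eye(len(lam)))
assert np.allclose(Vt @ Vt.T, np.eye(len(lam)))

# The squared singular values are the nonzero eigenvalues of X'X
eig = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
assert np.allclose(eig[:r], lam[:r] ** 2)
```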
The SVD and the Covariance Matrix
This section describes how the eigenvectors and eigenvalues of a covariance matrix can be obtained using the SVD. When the matrix of interest has at least one large dimension, calculating the SVD is much more efficient than calculating its covariance matrix and its eigenvalue decomposition.
Let n be the number of observations and p the number of variables involved in the multivariate analysis of interest. Denote the n by p matrix of data values by X.
The SVD is usually applied to standardized data. To standardize a value, subtract its mean and divide by its standard deviation. Denote the n by p matrix of standardized data values by Xs. Then the covariance matrix of the standardized data is the correlation matrix for X and is given as follows:
Cov = Xs’Xs / (n – 1)
The SVD can be applied to Xs to obtain the eigenvectors and eigenvalues of Xs’Xs. This allows efficient calculation of eigenvectors and eigenvalues when the matrix X is either extremely wide (many columns) or tall (many rows). This technique is the basis for Wide PCA. See “Wide Principal Components Options” on page 96 in the “Principal Components” chapter.
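The sketch below (again Python with NumPy, for illustration only; it is not JMP’s implementation of Wide PCA) obtains the eigenvalues of the correlation matrix from the SVD of the standardized data and checks them against a direct eigenvalue decomposition.

```python
import numpy as np

# Illustrative sketch (NumPy): eigenvalues and eigenvectors of the correlation
# matrix obtained from the SVD of the standardized data, as described above.
rng = np.random.default_rng(2)
n, p = 200, 6
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))     # correlated columns

# Standardize each column: subtract the mean, divide by the standard deviation
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Correlation matrix of X = covariance matrix of the standardized data
corr = Xs.T @ Xs / (n - 1)

# SVD route: eigenvalues are squared singular values / (n - 1),
# eigenvectors are the columns of V
U, lam, Vt = np.linalg.svd(Xs, full_matrices=False)
eigvals_svd = lam ** 2 / (n - 1)

# Direct route, for comparison only (expensive when p is large)
eigvals_direct = np.sort(np.linalg.eigvalsh(corr))[::-1]
assert np.allclose(eigvals_svd, eigvals_direct)
```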
The SVD and the Inverse Covariance Matrix
Some multivariate techniques require the calculation of inverse covariance matrices. This section describes how the SVD can be used to calculate the inverse of a covariance matrix.
Denote the standardized data matrix by Xs and define S = Xs’Xs. The singular value decomposition allows you to write S as follows:
S = (U Diag(Λ) V’)’ (U Diag(Λ) V’) = V Diag(Λ)² V’
If S is of full rank, then V is a p by p orthonormal matrix, and you can write S⁻¹ as follows:
S⁻¹ = (V Diag(Λ)² V’)⁻¹ = V Diag(Λ)⁻² V’
If S is not of full rank, then Diag(Λ)² can be replaced with a generalized inverse, Diag(Λ²)⁺, in which the nonzero diagonal elements of Diag(Λ)² are replaced by their reciprocals. This defines a generalized inverse of S as follows:
S⁺ = V Diag(Λ²)⁺ V’
This generalized inverse can be calculated using only the SVD.
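The following sketch (Python with NumPy, illustrative only) constructs this generalized inverse from the SVD of the standardized data for a case where S is singular, and verifies the defining property S S⁺ S = S.

```python
import numpy as np

# Illustrative sketch (NumPy, not JMP code): generalized inverse of S = Xs'Xs
# through the SVD, for a wide data set where S is singular (p > n).
rng = np.random.default_rng(3)
n, p = 8, 20
Xs = rng.normal(size=(n, p))
Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0, ddof=1)
S = Xs.T @ Xs                       # p x p, rank at most n - 1, so not invertible

U, lam, Vt = np.linalg.svd(Xs, full_matrices=False)
tol = max(Xs.shape) * np.finfo(float).eps * lam.max()
keep = lam > tol                    # retain only the nonzero singular values
V = Vt[keep].T                      # p x r
S_pinv = V @ np.diag(1.0 / lam[keep] ** 2) @ V.T

# Defining property of the generalized inverse: S S+ S = S
assert np.allclose(S @ S_pinv @ S, S)
```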
To see the specific details behind the application of the SVD for wide linear discriminant analysis, see “Wide Linear Discriminant Method” on page 134 in the “Discriminant Analysis” chapter.
Calculating the SVD
In the Multivariate Methods platforms, JMP calculates the SVD of a matrix using the method of Golub and Kahan (1965). This is a two-step procedure: the matrix M is first reduced to a bidiagonal matrix J, and then the singular values of J, which are the same as the singular values of M, are computed. The columns of M are usually standardized to equalize the influence of the variables on the calculation. The Golub and Kahan method is computationally efficient.
Multivariate Tests
In the following, E is the residual cross product matrix. Diagonal elements of E are the residual sums of squares for each variable. In the discriminant analysis literature, this is often called W, where W stands for within.
Test statistics in the multivariate results tables are functions of the eigenvalues λi of E⁻¹H. The following list describes the computation of each test statistic.
Note: After specification of a response design, the initial E and H matrices are premultiplied by M' and postmultiplied by M.
Table B.1 Computations for Multivariate Tests
Wilks’ Lambda: Λ = det(E) / det(H + E) = ∏_{i=1}^{n} 1 / (1 + λi)
Pillai’s Trace: V = Trace[H(H + E)⁻¹] = ∑_{i=1}^{n} λi / (1 + λi)
Hotelling-Lawley Trace: U = Trace(E⁻¹H) = ∑_{i=1}^{n} λi
Roy’s Max Root: Θ = λ1, the maximum eigenvalue of E⁻¹H
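The sketch below (Python with NumPy, for illustration; E and H are arbitrary example matrices, not output from any JMP platform) computes the four statistics from the eigenvalues of E⁻¹H and cross-checks them against the determinant and trace forms in Table B.1.

```python
import numpy as np

# Illustrative sketch only (NumPy, not JMP code). E and H are arbitrary
# positive (semi)definite examples standing in for the residual and
# hypothesis cross product matrices.
rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
E = A @ A.T + 3 * np.eye(3)            # "error" matrix, positive definite
B = rng.normal(size=(3, 2))
H = B @ B.T                            # "hypothesis" matrix, positive semidefinite

# Eigenvalues of E^-1 H (real and nonnegative in this setting)
lam = np.sort(np.linalg.eigvals(np.linalg.solve(E, H)).real)[::-1]

wilks     = np.prod(1.0 / (1.0 + lam))     # det(E) / det(H + E)
pillai    = np.sum(lam / (1.0 + lam))      # Trace(H (H + E)^-1)
hotelling = np.sum(lam)                    # Trace(E^-1 H)
roy       = lam[0]                         # largest eigenvalue of E^-1 H

# Cross-check against the determinant and trace forms
assert np.isclose(wilks, np.linalg.det(E) / np.linalg.det(H + E))
assert np.isclose(pillai, np.trace(H @ np.linalg.inv(H + E)))
assert np.isclose(hotelling, np.trace(np.linalg.solve(E, H)))
```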
The whole-model L matrix is a column of zeros (for the intercept) concatenated with an identity matrix whose number of rows and columns equals the number of parameters in the model. L matrices for effects are subsets of rows from the whole-model L matrix.
Approximate F-Tests
To compute F-values and degrees of freedom, let p be the rank of H + E. Let q be the rank of L(X'X)⁻¹L’, where the L matrix identifies elements of X'X associated with the effect being tested. Let v be the error degrees of freedom and s be the minimum of p and q. Also let m = 0.5(|p – q| – 1) and n = 0.5(v – p – 1).
Table B.2 on page 174 gives the computation of each approximate F-statistic from the corresponding test statistic.
Table B.2 Approximate F‐statistics
Wilks’ Lambda: F = [(1 – Λ^(1/t)) / Λ^(1/t)] · (rt – 2u) / (pq), with numerator DF = pq and denominator DF = rt – 2u.
Pillai’s Trace: F = [V / (s – V)] · (2n + s + 1) / (2m + s + 1), with numerator DF = s(2m + s + 1) and denominator DF = s(2n + s + 1).
Hotelling-Lawley Trace: F = 2(sn + 1)U / [s²(2m + s + 1)], with numerator DF = s(2m + s + 1) and denominator DF = 2(sn + 1).
Roy’s Max Root: F = Θ[v – max(p, q) + q] / max(p, q), with numerator DF = max(p, q) and denominator DF = v – max(p, q) + q.
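The following sketch (Python, illustrative only, not JMP code) turns Table B.2 into a function. The quantities r, t, and u that appear in the Wilks’ Lambda row are not defined in this appendix; the sketch assumes the standard Rao F-approximation definitions for them, so that row in particular should be read as an assumption rather than a restatement of JMP’s calculations.

```python
import numpy as np

# Illustrative sketch only: approximate F statistics and degrees of freedom
# following Table B.2. For Wilks' Lambda, the quantities r, t, and u are not
# defined in this appendix; the standard Rao F-approximation values are
# assumed for them here.
def approx_f_tests(lam, p, q, v):
    """lam: eigenvalues of E^-1 H;  p: rank of H + E;
    q: rank of L (X'X)^-1 L';  v: error degrees of freedom."""
    s = min(p, q)
    m = 0.5 * (abs(p - q) - 1)
    n = 0.5 * (v - p - 1)
    out = {}

    # Pillai's Trace
    V = np.sum(lam / (1 + lam))
    out["Pillai"] = (V / (s - V) * (2 * n + s + 1) / (2 * m + s + 1),
                     s * (2 * m + s + 1), s * (2 * n + s + 1))

    # Hotelling-Lawley Trace
    U = np.sum(lam)
    out["Hotelling-Lawley"] = (2 * (s * n + 1) * U / (s**2 * (2 * m + s + 1)),
                               s * (2 * m + s + 1), 2 * (s * n + 1))

    # Roy's Max Root
    theta = np.max(lam)
    out["Roy"] = (theta * (v - max(p, q) + q) / max(p, q),
                  max(p, q), v - max(p, q) + q)

    # Wilks' Lambda (r, t, u below are assumed Rao values, not from the text)
    lam_w = np.prod(1 / (1 + lam))
    r = v - (p - q + 1) / 2
    u = (p * q - 2) / 4
    t = np.sqrt((p**2 * q**2 - 4) / (p**2 + q**2 - 5)) if p**2 + q**2 - 5 > 0 else 1.0
    out["Wilks"] = ((1 - lam_w**(1 / t)) / lam_w**(1 / t) * (r * t - 2 * u) / (p * q),
                    p * q, r * t - 2 * u)

    return out  # each value is (approximate F, numerator DF, denominator DF)
```

For example, approx_f_tests(np.array([0.8, 0.3]), p=2, q=2, v=20) returns a dictionary whose values are (F, numerator DF, denominator DF) triples for the four tests.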
Index
Multivariate Methods
Numerics
95% bivariate normal density ellipse 40
A
agglomerative clustering 53
algorithms 171
approximate F test 174
Average Linkage 77
B
Baltic.jmp 142
bar chart of correlations 36
Biplot 66
Biplot 3D 67
Biplot Options 67
Biplot Ray Position 67, 123
biplot rays 94
bivariate normal density ellipse 40
By variable 85
C
calculation details 171
Canonical 3D Plot 119
centroid 42
Centroid Method 77
Cluster Criterion 60
Cluster platform 51
compare methods 53
hierarchical 57–78
introduction 53–54
k‐means 63–68
launch 56–57
normal mixtures 68–79
Cluster the Correlations 38
Clustering History 60
Color Clusters 60
Color Clusters and Mark Clusters 63
Color Map 61
Color Map On Correlations 38
Color Map On p-values 38
Color Points 123
Complete Linkage 78
computational details 171
Consider New Levels 120
Constellation Plot 61
constellation plot 55
contrast M matrix 173
correlation matrix 33
Cronbach’s alpha 43–45
statistical details 50
Cytometry.jmp 64
D
Danger.jmp 44
dendrogram 53, 55, 59
Dendrogram Scale command 60
Density Ellipse 40–41
dimensionality 83
Discriminant Analysis, PLS 164
Distance Graph 60
Distance Scale 60
E
E matrix 173
eigenvalue decomposition 83, 88
Eigenvectors 90
Ellipse alpha 41
Ellipse Color 41
Ellipses Transparency 41
EM algorithm 63
Even Spacing 60
Expectation Maximization algorithm 63
expectation step 54
F
factor analysis 83
Factor Analysis platform
By variable 85
Freq variable 84
Weight variable 84
Factor Rotation 94
Fit Line 41
formulas used in JMP calculations 171
Freq variable 84
G
Geometric Spacing 60
group similar rows see Cluster platform
H
H matrix 173
Hoeffding’s D 39, 47
I
Impute Missing Data in PLS 147
Inverse Corr table 35
inverse correlation 35, 47
Iris.jmp 69
item reliability 43–45
J
jackknife 42
Jackknife Distances 42
JMP Starter 25
K
Kendall’s Tau 39
Kendall’s tau-b 46
KMeans 63
K-Means Clustering Platform
SOMs 72
Technical Details 73
L
L matrix 173
Legend 61
linear combination 83
Loading Plot 93
M
M matrix 173
Mahalanobis distance 42, 48
plot 42
Mark Clusters 60
maximization step 54
menu tips 24
missing data imputation, PLS 147
missing value 34
missing values 36
Multivariate 29, 31
multivariate mean 42
multivariate outliers 42
Multivariate platform 169
example 44
principal components 83
N
Nonpar Density 41
Nonparametric Correlations 39
Nonparametric Measures of Association table 39
normal density ellipse 40
O
Other 41
Outlier Analysis 42
Outlier Distance plot 48
P
Pairwise Correlations 34
Pairwise Correlations table 36
Parallel Coord Plots 67
Partial Corr table 35
partial correlation 35
Partial Least Squares platform
validation 146
PCA 83
Pearson correlation 36, 46
PLS 139–162
Statistical Details 159–163
PLS-DA 164
principal components analysis 83
product-moment correlation 36, 46
Q
questionnaire analysis 43–45
R
reduce dimensions 83
reliability analysis 43–45
also see Survival platform
Response Screening platform
Weight variable 84
ROC Curve 121
S
Save Canonical Scores 123
Save Cluster Hierarchy 62
Save Cluster Tree 62
Save Clusters 61, 67
Save Density Formula 68
Save Display Order 62
Save Distance Matrix 62
Save Formula for Closest Cluster 62
Save Mixture Formulas 67
Save Mixture Probabilities 67
Scatterplot Matrix 40, 121
scatterplot matrix 33
Score Plot 93
Scree Plot 93
scree plot 55, 59
Shaded Ellipses 41
Show Biplot Rays 67, 123
Show Canonical Details 123
Show Correlations 41
Show Dendrogram 60
Show Distances to each group 121
Show Group Means 120
Show Histogram 41
Show Means CL Ellipses 123
Show Normal 50% Contours 123
Show Points 41, 122
Show Probabilities to each group 121
Show Within Covariances 120
significance probability 36
Simulate Clusters 68
Single Linkage 78
Solubility.jmp 84
SOM Technical Details 73
SOMs 72
Spearman’s Rho 46
Spearman’s Rho 39
Standardize Data 58
statistical details 171
T
T2 Statistic 42
tooltips 24
tutorial examples
correlation 44
tutorials 23
Two way clustering 61
U
Univariate 34
W-Z
Ward’s 77
Weight variable 84
Y role 57