RapidMiner Operator Reference

RapidMiner 7
Operator Reference Manual
July 19, 2017
RapidMiner GmbH
www.rapidminer.com
© 2016 by RapidMiner GmbH. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of RapidMiner GmbH.
Contents
1 Data Access
Copy Repository Entry
Delete Repository Entry
Move Repository Entry
Rename Repository Entry
Retrieve
Store
1.1 Files
1.1.1 Read
Read ARFF
Read Access
Read BibTeX
Read C4.5
Read CSV
Read dBase
Read DASYLab
Read Excel
Read SAS
Read SPSS
Read Sparse
Read Stata
Read XML
Read XRFF
1.1.2 Write
Write ARFF
Write Access
Write CSV
Write Excel
Write PMML
Write Special Format
Write XRFF
1.2 Database
Read Database
Update Database
Write Database
1.3 NoSQL
1.3.1 Cassandra
Delete Cassandra
Execute CQL
Read Cassandra
Write Cassandra
1.3.2 MongoDB
Delete MongoDB
Execute MongoDB Command
Read MongoDB
Update MongoDB
Write MongoDB
1.3.3 Solr
Add to Solr (Data)
Add to Solr (Documents)
Search Solr (Data)
Search Solr (Documents)
1.4 Applications
Trigger Zapier
1.4.1 Salesforce
Delete Salesforce
Read Salesforce
Update Salesforce
Write Salesforce
1.4.2 Mozenda
Read Mozenda
1.4.3 Qlik
Write QVX
1.4.4 Twitter
Get Twitter Relations
Get Twitter User Details
Get Twitter User Statuses
Search Twitter
1.4.5 Splunk
Search Splunk
1.5 Cloud Storage
1.5.1 Amazon S3
Loop Amazon S3
Read Amazon S3
Write Amazon S3
1.5.2 Azure Blob Storage
Loop Azure Blob Storage
Read Azure Blob Storage
Write Azure Blob Storage
1.5.3 Dropbox
Read Dropbox
Write Dropbox
2 Blending
2.1 Attributes
Reorder Attributes
2.1.1 Names and Roles
Exchange Roles
Rename
Rename by Constructions
Rename by Example Values
Rename by Generic Names
Rename by Replacing
Set Role
2.1.2 Types
Date to Nominal
Date to Numerical
Format Numbers
Guess Types
Nominal to Binominal
Nominal to Date
Nominal to Numerical
Nominal to Text
Numerical to Binominal
Numerical to Polynominal
Numerical to Real
Parse Numbers
Real to Integer
Text to Nominal
2.1.3 Selection
Remove Attribute Range
Remove Correlated Attributes
Remove Useless Attributes
Select Attributes
Select by Random
Select by Weights
Work on Subset
2.1.4 Generation
Generate Absolutes
Generate Aggregation
Generate Attributes
Generate Concatenation
Generate Copy
Generate Empty Attribute
Generate Function Set
Generate ID
Generate Products
Generate TFIDF
Generate Weight (Stratification)
2.2 Examples
2.2.1 Filter
Filter Example Range
Filter Examples
2.2.2 Sampling
Sample
Sample (Bootstrapping)
Sample (Kennard-Stone)
Sample (Stratified)
Split Data
2.2.3 Sort
Shuffle
Sort
2.3 Table
2.3.1 Grouping
Aggregate
2.3.2 Rotation
De-Pivot
Pivot
Transpose
2.3.3 Joins
Append
Intersect
Join
Set Minus
Superset
Union
2.4 Values
Adjust Date
Cut
Map
Merge
Remap Binominals
Replace
Replace (Dictionary)
Set Data
Split
Trim
3 Cleansing
3.1 Normalization
Normalize
Scale by Weights
3.2 Binning
Discretize by Binning
Discretize by Entropy
Discretize by Frequency
Discretize by Size
Discretize by User Specification
3.3 Missing
Declare Missing Value
Fill Data Gaps
Impute Missing Values
Replace Infinite Values
Replace Missing Values
3.4 Duplicates
Remove Duplicates
3.5 Outliers
Detect Outlier (COF)
Detect Outlier (Densities)
Detect Outlier (Distances)
Detect Outlier (LOF)
3.6 Dimensionality Reduction
Fourier Transformation
Generalized Hebbian Algorithm
Independent Component Analysis
Principal Component Analysis
Principal Component Analysis (Kernel)
Self-Organizing Map
Singular Value Decomposition
4 Modeling
4.1 Predictive
Create Formula
Group Models
4.1.1 Lazy
Default Model
K-NN
4.1.2 Bayesian
Naive Bayes
Naive Bayes (Kernel)
4.1.3 Trees
CHAID
Decision Stump
Decision Tree
Decision Tree (Multiway)
Decision Tree (Weight-Based)
Gradient Boosted Trees
ID3
Random Forest
Random Tree
4.1.4 Rules
Rule Induction
Subgroup Discovery
Tree to Rules
4.1.5 Neural Nets
Deep Learning
Neural Net
Perceptron
4.1.6 Functions
Gaussian Process
Generalized Linear Model
Linear Regression
Local Polynomial Regression
Polynomial Regression
Relevance Vector Machine
Vector Linear Regression
4.1.7 Logistic Regression
Logistic Regression (SVM)
Logistic Regression (Evolutionary)
Logistic Regression (SVM)
4.1.8 Support Vector Machines
Fast Large Margin
Support Vector Machine
Support Vector Machine
Support Vector Machine (Evolutionary)
Support Vector Machine (LibSVM)
Support Vector Machine (PSO)
4.1.9 Discriminant Analysis
Linear Discriminant Analysis
Quadratic Discriminant Analysis
Regularized Discriminant Analysis
4.1.10 Ensembles
AdaBoost
Bagging
Bayesian Boosting
Classification by Regression
MetaCost
Polynomial by Binomial Classification
Stacking
Vote
4.2 Segmentation
Agglomerative Clustering
DBSCAN
Expectation Maximization Clustering
Extract Cluster Prototypes
Flatten Clustering
Random Clustering
Support Vector Clustering
Top Down Clustering
K-Means
K-Means (Kernel)
K-Medoids
4.3 Associations
Apply Association Rules
Create Association Rules
FP-Growth
Generalized Sequential Patterns
4.4 Correlations
ANOVA Matrix
Correlation Matrix
Covariance Matrix
Grouped ANOVA
Mutual Information Matrix
4.5 Similarities
Cross Distances
Data to Similarity
Data to Similarity Data
Similarity to Data
4.6 Feature Weights
Data to Weights
Weight by Chi Squared Statistic
Weight by Component Model
Weight by Correlation
Weight by Deviation
Weight by Gini Index
Weight by Information Gain
Weight by Information Gain Ratio
Weight by PCA
Weight by Relief
Weight by Rule
Weight by SVM
Weight by Tree Importance
Weight by Uncertainty
Weight by User Specification
Weight by Value Average
Weights to Data
4.7 Optimization
4.7.1 Parameters
Clone Parameters
Optimize Parameters (Evolutionary)
Optimize Parameters (Grid)
Optimize Parameters (Quadratic)
Set Parameters
4.7.2 Feature Selection
Backward Elimination
Forward Selection
Optimize Selection
Optimize Selection (Brute Force)
Optimize Selection (Evolutionary)
4.7.3 Feature Generation
Optimize by Generation (GGA)
Optimize by Generation (YAGGA)
Optimize by Generation (YAGGA2)
4.7.4 Feature Weighting
Optimize Weights (Evolutionary)
Optimize Weights (Forward)
5 Scoring
Apply Model
5.1 Confidences
Apply Threshold
Create Threshold
Drop Uncertain Predictions
Find Threshold
6 Validation
Bootstrapping Validation
Cross Validation
Split Validation
Wrapper-X-Validation
6.1 Performance
Combine Performances
Extract Performance
Performance
Performance (Min-Max)
Performance to Data
6.1.1 Predictive
Performance (Attribute Count)
Performance (Binominal Classification)
Performance (Classification)
Performance (Costs)
Performance (Ranking)
Performance (Regression)
6.1.2 Segmentation
Cluster Count Performance
Cluster Density Performance
Cluster Distance Performance
Item Distribution Performance
Map Clustering on Labels
6.1.3 Significance Tests
ANOVA
T-Test
6.2 Visual
Compare ROCs
Create Lift Chart
Visualize Model by SOM
7 Utility
Execute Process
Multiply
Schedule Process
Subprocess
7.1 Scripting
Execute Program
Execute Python
Execute R
Execute SQL
Execute Script
7.2 Process Control
Publish to App
Recall
Recall from App
Remember
7.2.1 Loops
Loop
Loop Attribute Subsets
Loop Attributes
Loop Batches
Loop Clusters
Loop Collection
Loop Data Sets
Loop Examples
Loop Files
Loop Labels
Loop Parameters
Loop Values
Loop and Average
Loop and Deliver Best
7.2.2 Branches
Branch
Select Subprocess
7.2.3 Collections
Collect
Flatten Collection
Select
7.2.4 Exceptions
Handle Exception
Throw Exception
7.3 Macros
Extract Macro
Generate Macro
Set Macro
Set Macros
7.4 Files
Add Entry to Archive File
Copy File
Create Archive File
Create Directory
Delete File
Move File
Rename File
Write as Text
7.5 Annotations
Annotate
Annotations to Data
Data to Annotations
Extract Macro from Annotation
7.6 Logging
Extract Log Value
Log
Log to Data
Provide Macro as Log Value
7.7 Data Anonymization
De-Obfuscate
Obfuscate
7.8 Random Data Generation
Add Noise
Generate Data
Generate Direct Mailing Data
Generate Multi-Label Data
Generate Nominal Data
Generate Sales Data
Generate Transaction Data
7.9 Misc
Free Memory
Join Paths
Materialize Data
Register Visualization from Database
1 Data Access
Copy Repository Entry
An operator to copy a repository entry to another repository location.
Description
Copies an entry to a new parent folder. If destination references a folder, the source entry is
copied to that folder. If it references an existing entry and overwriting is not enabled (default
case), an exception is raised. If overwriting is enabled the existing entry will be overwritten. If
it references a location which does not exist, say, “/root/folder/leaf”, but the parent exists (in
this case “/root/folder”), a new entry named by the last path component (in this case “leaf”) is
created.
Input Ports
input (inp)
Output Ports
output (out)
Parameters
source entry Entry that should be copied
destination Copy destination
overwrite Overwrite elements at copy destination?
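The destination rules described above can be sketched in a few lines of Python. This is only an illustration of the documented cases, not RapidMiner's implementation: the function name `resolve_copy_target`, the use of plain sets to stand in for a repository, and the exception types are all assumptions. The behavior when the parent folder is missing is not described in this section, so the sketch simply raises an error there.

```python
def resolve_copy_target(destination, source_name, folders, entries, overwrite=False):
    """Return the path at which the copied entry would be stored.

    folders and entries are sets of absolute paths that stand in for a
    repository (hypothetical model, for illustration only).
    """
    if destination in folders:
        # Destination is a folder: the source entry is copied into that folder.
        return destination + "/" + source_name
    if destination in entries:
        # Destination is an existing entry: only allowed when overwriting.
        if not overwrite:
            raise FileExistsError(destination + " exists and overwrite is disabled")
        return destination
    parent, _, leaf = destination.rpartition("/")
    if parent in folders:
        # Parent exists: a new entry named by the last path component is created.
        return parent + "/" + leaf
    # Assumption: a missing parent folder is an error (not covered by the text).
    raise FileNotFoundError("parent folder " + parent + " does not exist")


folders = {"/root", "/root/folder"}
entries = {"/root/folder/data"}
print(resolve_copy_target("/root/folder", "Golf", folders, entries))
print(resolve_copy_target("/root/folder/leaf", "Golf", folders, entries))
```

The same resolution order applies to the Move Repository Entry operator, whose description uses identical rules.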
Delete Repository Entry
An operator to delete a repository entry within a process.
Description
An operator to delete a repository entry within a process.
Input Ports
input (inp)
Output Ports
output (out)
Parameters
entry to delete Entry that should be deleted
Move Repository Entry
An operator to move a repository entry to another repository location.
Description
Moves an entry to a new parent folder. If destination references a folder, the source entry is
moved to that folder. If it references an existing entry and overwriting is not enabled (default
case), an exception is raised. If overwriting is enabled the existing entry will be overwritten. If
it references a location which does not exist, say, “/root/folder/leaf”, but the parent exists (in
this case “/root/folder”), a new entry named by the last path component (in this case “leaf”) is
created.
Input Ports
input (inp)
Output Ports
output (out)
Parameters
source entry Entry that should be moved
destination Destination for move action
overwrite Overwrite elements at move destination?
Rename Repository Entry
An operator to rename a repository entry within a process.
Description
An operator to rename a repository entry. The user can select the entry that should be renamed,
specify a new name, and choose whether an already existing entry should be overwritten. If overwriting is not
allowed (default case), a user error is thrown if another element with the new
name already exists.
Input Ports
input (inp)
Output Ports
output (out)
Parameters
entry to rename Entry that should be renamed
new name New entry name
overwrite Overwrite already existing entry with same name?
Retrieve
This operator reads an object from the data repository.
Description
This operator can be used to access the repositories. It should replace all file access, since it
provides full meta data processing, which makes RapidMiner considerably easier to use. In contrast to
accessing a raw file, it provides the complete meta data of the data, so all meta data transformations are possible.
An easier way to load an object from the repository is to drag and drop the required object
from the Repositories View. This automatically inserts a Retrieve operator with the correct path
to the desired object.
This operator has no input port. All it requires is a valid value in the repository entry parameter.
Output Ports
output (out) This port returns the object whose path was specified in the repository entry parameter.
Parameters
repository entry (string) A valid path should be specified here in order to load an object. This
parameter references an entry in the repository which will be returned as the output of
this operator. Repository locations are resolved relative to the repository folder containing the current process. Folders in the repository are separated by a forward slash (/), a
“..” references the parent folder. A leading forward slash references the root folder of the
repository containing the current process. A leading double forward slash is interpreted
as an absolute path starting with the name of a repository.
• ‘MyData’ looks up an entry ‘MyData’ in the same folder as the current process.
• ‘../Input/MyData’ looks up an entry ‘MyData’ located in a folder ‘Input’ next to the
folder containing the current process.
• ‘/data/Model’ looks up an entry ‘Model’ in a top-level folder ‘data’ in the repository
holding the current process.
• ‘//Samples/data/Iris’ looks up the Iris data set in the ‘Samples’ repository.
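The resolution rules above can be mimicked with a short Python sketch. It only models the path arithmetic; the function name `resolve_repository_entry`, the repository name ‘Local’ and the folder ‘/projects/demo’ are hypothetical and serve purely as an illustration of how the four forms differ.

```python
import posixpath


def resolve_repository_entry(entry, process_repository, process_folder):
    """Resolve an entry string to (repository name, absolute path in it).

    process_repository and process_folder describe where the current
    process is stored (hypothetical inputs for this sketch).
    """
    if entry.startswith("//"):
        # '//Repo/path': absolute path that starts with a repository name.
        repository, _, rest = entry[2:].partition("/")
        return repository, posixpath.normpath("/" + rest)
    if entry.startswith("/"):
        # '/path': relative to the root of the repository holding the process.
        return process_repository, posixpath.normpath(entry)
    # Anything else is relative to the folder containing the process;
    # normpath also collapses '..' into the parent folder.
    joined = posixpath.join(process_folder, entry)
    return process_repository, posixpath.normpath(joined)


for example in ["MyData", "../Input/MyData", "/data/Model", "//Samples/data/Iris"]:
    print(example, "->", resolve_repository_entry(example, "Local", "/projects/demo"))
```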
Tutorial Processes
Retrieving Golf from Repository
The Example Process loads the Golf data set from the repository. The repository entry parameter is set
to the path ‘//Samples/data/Golf’, so the Golf data set is returned from the Samples repository. As
can be seen in the Results Workspace, the Retrieve operator loads both data and meta data.
Figure 1.1: Tutorial process ‘Retrieving Golf from Repository’.
Store
This operator stores an IO Object in the data repository.
Description
This operator stores an IO Object at a location in the data repository. The location of the object
to be stored is specified through the repository entry parameter. The stored object can be used
by other processes by using the Retrieve operator. Please see the attached Example Processes to
understand the basic working of this operator. The Store operator is used to store an ExampleSet
and a model in the Example Processes.
Input Ports
input (inp) This port expects an IO Object. In the attached Example Processes an ExampleSet
and a model are provided as input.
Output Ports
through (thr) The IO Object provided at the input port is delivered through this output port
without any modifications. This is usually used to reuse the same IO Object in further operators of the process.
Parameters
repository entry (string) This parameter is used to specify the location where the input IO
Object is to be stored.
Tutorial Processes
Figure 1.2: Tutorial process ‘Storing an ExampleSet using the Store operator’.
Storing an ExampleSet using the Store operator
This Process shows how the Store operator can be used to store an ExampleSet. The ‘Golf’ data
set and the ‘Golf-Testset’ data set are loaded using the Retrieve operator. These ExampleSets
are merged using the Append operator. The resultant ExampleSet is named ‘Golf-Complete’ and
stored using the Store operator. The stored ExampleSet is used in the third Example Process.
Storing a model using the Store operator
This Process shows how the Store operator can be used to store a model. The ‘Golf’ data set is
loaded using the Retrieve operator. The Naive Bayes operator is applied on it and the resultant
model is stored in the repository using the Store operator. The model is stored with the name
‘Golf-Naive-Model’. The stored model is used in the third Example Process.
Using the objects stored by the Store operator
This Process shows how a stored IO Object can be used. The ‘Golf-Complete’ data set stored in
the first Example Process and the ‘Golf-Naive-Model’ stored in the second Example Process are
loaded using the Retrieve operator. The Apply Model operator is used to apply the ‘Golf-Naive-Model’ on the ‘Golf-Complete’ data set. The resultant labeled ExampleSet can be viewed in the
Results Workspace.
Figure 1.3: Tutorial process ‘Storing a model using the Store operator’.
1.1 Files
1.1.1 Read
Read ARFF
This operator is used for reading an ARFF file.
Description
This operator can read ARFF (Attribute-Relation File Format) files known from the machine
learning library Weka. An ARFF file is an ASCII text file that describes a list of instances sharing a
set of attributes. ARFF files were developed by the Machine Learning Project at the Department
of Computer Science of The University of Waikato for use with the Weka machine learning software. Please study the attached Example Process for understanding the basics and structure of
the ARFF file format. Please note that when an ARFF file is written, the roles of the attributes are
not stored. Similarly when an ARFF file is read, the roles of all the attributes are set to regular.
Input Ports
file (fil) An ARFF file is expected as a file object which can be created with other operators with
file output ports like the Read File operator.
Figure 1.4: Tutorial process ‘Using the objects stored by the Store operator’.
Output Ports
output (out) This port delivers the ARFF file in tabular form along with the meta data. This
output is similar to the output of the Retrieve operator.
Parameters
data file (filename) The path of the ARFF file is specified here. It can be selected using the
choose a file button.
encoding (selection) This is an expert parameter. A long list of encodings is provided; users
can select any of them.
read not matching values as missings (boolean) This is an expert parameter. If this parameter is set to true, values that do not match the expected value type are considered missing values and are replaced by ‘?’. For example, if ‘abc’ is written in an integer
column, it will be treated as a missing value. A question mark (?) in the ARFF file is also read as a
missing value.
decimal character (char) This character is used as the decimal character.
grouped digits (boolean) This parameter decides whether grouped digits should be parsed or
not. If this parameter is set to true, the grouping character parameter should be specified.
grouping character (char) This parameter is available only when the grouped digits parameter is set to true. This character is used as the grouping character. If it is found between
numbers, the numbers are combined and this character is ignored. For example, if “22-14”
is present in the ARFF file and “-” is set as grouping character, then “2214” will be stored.
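The interplay of the decimal character, grouped digits, grouping character and missing-value parameters can be illustrated with a small Python sketch. The function `parse_number` and its argument names are hypothetical stand-ins for the parameters above, and raising an error when non-matching values are not treated as missing is an assumption, since this section does not say what happens in that case.

```python
def parse_number(text, decimal_character=".", grouped_digits=False,
                 grouping_character=",", not_matching_as_missing=True):
    """Parse one numeric cell; None stands for a missing value ('?')."""
    cleaned = text.strip()
    if cleaned == "?":
        # A question mark is always read as a missing value.
        return None
    if grouped_digits:
        # The grouping character is dropped and the digits are combined.
        cleaned = cleaned.replace(grouping_character, "")
    # Normalize the configured decimal character to Python's '.'.
    cleaned = cleaned.replace(decimal_character, ".")
    try:
        return float(cleaned)
    except ValueError:
        if not_matching_as_missing:
            # Values that do not match the expected type become missing.
            return None
        # Assumption: without that option the mismatch is reported as an error.
        raise


print(parse_number("22-14", grouped_digits=True, grouping_character="-"))
print(parse_number("abc"))
print(parse_number("3,5", decimal_character=","))
```

With “-” as grouping character, “22-14” is combined into the number 2214, matching the example in the parameter description.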
Tutorial Processes
The basics of the ARFF
Figure 1.5: Tutorial process ‘The basics of the ARFF’.
The ‘Iris’ data set is loaded using the Retrieve operator. The Write ARFF operator is applied on
it to write the ‘Iris’ data set into an ARFF file. The example set file parameter is set to ‘D:\Iris’.
Thus an ARFF file is created in the ‘D’ drive of your computer with the name ‘Iris’. Open this file
to see the structure of an ARFF file.
ARFF files have two distinct sections. The first section is the Header information, which is
followed by the Data information. The Header of the ARFF file contains the name of the Relation and a list of the attributes. The name of the Relation is specified after the @RELATION
statement. The Relation is ignored by RapidMiner. Each attribute definition starts with the
@ATTRIBUTE statement followed by the attribute name and its type. The resultant ARFF file
of this Example Process starts with the Header. The name of the relation is ‘RapidMinerData’.
After the name of the Relation, six attributes are defined.
Attribute declarations take the form of an ordered sequence of @ATTRIBUTE statements.
Each attribute in the data set has its own @ATTRIBUTE statement which uniquely defines the
name of that attribute and its data type. The order of declaration of the attributes indicates the
column position in the data section of the file. For example, in the resultant ARFF file of this
Example Process the ‘label’ attribute is declared at the end of all other attribute declarations.
Therefore values of the ‘label’ attribute are in the last column of the Data section.
The possible attribute types in ARFF are:
• numeric
• integer
• real
• {nominalValue1,nominalValue2,...} for nominal attributes
• string for nominal attributes without distinct nominal values (it is however recommended to use the nominal definition above as often as possible)
• date [date-format] (currently not supported by RapidMiner)
You can see in the resultant ARFF file of this Example Process that the attributes ‘a1’, ‘a2’, ‘a3’
and ‘a4’ are of real type. The attributes ‘id’ and ‘label’ are of nominal type. The distinct nominal
values are also specified with these nominal attributes.
The ARFF Data section of the file contains the data declaration line @DATA followed by the
actual example data lines. Each example is represented on a single line, with carriage returns
denoting the end of the example. Attribute values for each example are delimited by commas.
They must appear in the order that they were declared in the Header section (i.e. the data corresponding to the n-th @ATTRIBUTE declaration is always the n-th field of the example line).
Missing values are represented by a single question mark (?).
A percent sign (%) introduces a comment and will be ignored during reading. Attribute names
or example values containing spaces must be quoted with single quotes (’). Please note that in
RapidMiner the sparse ARFF format is currently only supported for numerical attributes. Please
use one of the other options for sparse data files provided by RapidMiner if you also need sparse
data files for nominal attributes.
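To make the structure described above concrete, the following Python sketch parses a tiny ARFF text into its header and data parts. The sample file is invented for illustration (it is not the output of the Example Process), and `parse_arff` is a hypothetical helper that handles only the features named above: full-line % comments, @RELATION, @ATTRIBUTE, @DATA, single-quoted names and ‘?’ for missing values. Sparse ARFF is not covered.

```python
import csv

SAMPLE = """% illustrative file, not RapidMiner output
@RELATION demo
@ATTRIBUTE a1 real
@ATTRIBUTE 'petal size' real
@ATTRIBUTE label {yes,no}
@DATA
5.1,0.2,yes
?,1.4,no
"""


def parse_arff(text):
    """Return (relation, [(attribute name, type)], [rows]) for a dense ARFF text."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank lines and comment lines are skipped
        if in_data:
            # Values are comma separated; single quotes protect spaces.
            values = next(csv.reader([line], quotechar="'", skipinitialspace=True))
            rows.append([None if v == "?" else v for v in values])
            continue
        keyword, _, rest = line.partition(" ")
        keyword, rest = keyword.upper(), rest.strip()
        if keyword == "@RELATION":
            relation = rest
        elif keyword == "@ATTRIBUTE":
            if rest.startswith("'"):
                end = rest.index("'", 1)  # closing quote of a quoted name
                name, kind = rest[1:end], rest[end + 1:].strip()
            else:
                name, _, kind = rest.partition(" ")
                kind = kind.strip()
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows)
```

The n-th value of every data row belongs to the n-th declared attribute, which is why the order of the @ATTRIBUTE statements matters.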
Reading an ARFF file using the Read ARFF operator
Figure 1.6: Tutorial process ‘Reading an ARFF file using the Read ARFF operator’.
The ARFF file that was written in the first Example Process using the Write ARFF operator is
retrieved in this Example Process using the Read ARFF operator. The data file parameter is set to
‘D:\Iris’. Please make sure that you specify the correct path. All other parameters are used with
default values. Run the process. You will see that the results are very similar to the original Iris
data set of RapidMiner repository. Please note that the role of all the attributes is regular in the
results of the Read ARFF operator. Even the roles of ‘id’ and ‘label’ attributes are set to regular.
This is so because the ARFF files do not store information about the roles of the attributes.
Read Access
This operator reads an ExampleSet from a Microsoft Access
database.
Description
The Read Access operator is used for reading an ExampleSet from the specified Microsoft Access
database (.mdb or .accdb extension). You need to have at least basic understanding of databases,
database connections and queries in order to use this operator properly. Go through the parameters and Example Process to understand the flow of this operator.
Output Ports
output (out) This port delivers the result of the query on database in tabular form along with
the meta data. This output is similar to the output of the Retrieve operator.
Parameters
username (string) This parameter is used to specify the username of the database (if any).
password (string) This parameter is used to specify the password of the database (if any).
define query (selection) Query is a statement that is used to select required data from the
database. This parameter specifies whether the database query should be defined directly,
through a file or implicitly by a given table name. The SQL query can be auto generated giving a table name, passed to RapidMiner via a parameter or, in case of long SQL statements,
in a separate file. The desired behavior can be chosen using the define query parameter.
Please note that column names are often case sensitive and might need quoting.
query (string) This parameter is only available when the define query parameter is set to ‘query’.
This parameter is used to define the SQL query to select desired data from the specified
database.
query file (filename) This parameter is only available when the define query parameter is set
to ‘query file’. This parameter is used to select a file that contains the SQL query to select
desired data from the specified database. Long queries are usually stored in files. Storing
queries in files can also enhance reusability.
table name (string) This parameter is only available when the define query parameter is set
to ‘table name’. This parameter is used to select the required table from the specified
database.
database file (filename) This parameter specifies the path of the Access database i.e. the mdb
or accdb file.
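The three settings of the define query parameter can be pictured as three ways of arriving at one SQL string. The Python sketch below is only a model of that choice: `build_query` is a hypothetical helper, and the SELECT statement used for the ‘table name’ case (including the bracket quoting) is an assumption, since this section does not show the statement RapidMiner generates.

```python
from pathlib import Path


def build_query(define_query, query=None, query_file=None, table_name=None):
    """Return the SQL text implied by the chosen define query mode."""
    if define_query == "query":
        # The statement is given directly in the query parameter.
        return query
    if define_query == "query file":
        # Long statements are kept in a separate file and read from there.
        return Path(query_file).read_text(encoding="utf-8")
    if define_query == "table name":
        # Assumption: the auto-generated query selects the whole table.
        return "SELECT * FROM [" + table_name + "]"
    raise ValueError("unknown define query mode: " + repr(define_query))


print(build_query("table name", table_name="golf"))
print(build_query("query", query="SELECT Outlook FROM golf"))
```

The column name in the second call is illustrative only; as noted above, column names are often case sensitive and may need quoting.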
Figure 1.7: Tutorial process ‘Writing and then reading data from an Access database’.
Tutorial Processes
Writing and then reading data from an Access database
The ‘Golf’ data set is loaded using the Retrieve operator. The Write Access operator is used for
writing this ExampleSet into the golf table of the ‘golf_db.mdb’ database. The database file parameter is provided with the path of the database file ‘golf_db.mdb’ and the name of the desired
table is specified in the table name parameter (i.e. it is set to ‘golf’). A breakpoint is inserted
here. No results are visible in RapidMiner at this stage but you can see that at this point of the
execution the database has been created and the golf table has been filled with the examples of
the ‘Golf’ data set.
Now the Read Access operator is used for reading the golf table from the ‘golf_db.mdb’ database.
The database file parameter is provided with the path of the database file ‘golf_db.mdb’. The define query parameter is set to ‘table name’. The table name parameter is set to ‘golf’ which is the
name of the required table. Continue the process; you will see the entire golf table in the Results
Workspace. The define query parameter is set to ‘table name’ if you want to read an entire table
from the database. You can also read a selected portion of the database by using queries. Set
the define query parameter to ‘query’ and specify a query in the query parameter.
Read BibTeX
This operator can read BibTeX files.
Description
This operator can read BibTeX files. It uses Stefan Haustein’s kdb tools.
Input Ports
file (fil) A BibTeX file is expected as a file object which can be created with other operators
with file output ports like the Read File operator.
Output Ports
output (out) This port delivers the BibTeX file in tabular form along with the meta data. This
output is similar to the output of the Retrieve operator.
Parameters
label attribute (string) The (case sensitive) name of the label attribute
id attribute (string) The (case sensitive) name of the id attribute
weight attribute (string) The (case sensitive) name of the weight attribute
datamanagement (selection) Determines how the data is represented internally
data file (filename) The file containing the data
Read C4.5
This operator can read data and meta data given in C4.5 format.
Description
Loads data given in C4.5 format (names and data file). Both files must be in the same directory.
You can specify one of the C4.5 files (either the data or the names file) or only the filestem.
For a dataset named “foo”, you will have two files: foo.data and foo.names. The .names file
describes the dataset, while the .data file contains the examples which make up the dataset.
The files contain series of identifiers and numbers with some surrounding syntax. A | (vertical
bar) means that the remainder of the line should be ignored as a comment. Each identifier consists of a string of characters that does not include a comma, question mark or colon. Embedded
whitespace is also permitted, but multiple whitespace characters are replaced by a single space.
The .names file contains a series of entries that describe the classes, attributes and values
of the dataset. Each entry can be terminated with a period, but the period can be omitted if it
would have been the last thing on a line. The first entry in the file lists the names of the classes,
separated by commas. Each successive line then defines an attribute, in the order in which they
will appear in the .data file, with the following format:
attribute-name : attribute-type
The attribute-name is an identifier as above, followed by a colon, then the attribute type which
must be one of
• continuous: If the attribute has a continuous value.
• discrete [n]: The word ‘discrete’ followed by an integer which indicates how many values the
attribute can take (not recommended, please use the method depicted below for defining
nominal attributes).
• [list of identifiers]: This is a discrete, i.e. nominal, attribute with the values enumerated
(this is the preferred method for discrete attributes). The identifiers should be separated
by commas.
• ignore: This means that the attribute should be ignored - it won’t be used. This is not
supported by RapidMiner, please use one of the attribute selection operators after loading
if you want to ignore attributes and remove them from the loaded example set.
Here is an example .names file:
good, bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: tc, none, tcf.
hours: continuous.
pension: empl_contr, ret_allw, none.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: yes, no.
...
foo.data contains the training examples in the following format: one example per line, attribute values separated by commas, class last, missing values represented by “?”. For example:
2,5.0,4.0,?,none,37,?,?,5,no,11,below_average,yes,full,yes,full,good
3,2.0,2.5,?,?,35,none,?,?,?,10,average,?,?,yes,full,b
3,4.5,4.5,5.0,none,40,?,?,?,no,11,average,?,half,?,?,good
3,3.0,2.0,2.5,tc,40,none,?,5,no,10,below_average,yes,half,yes,full,bad
...
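The two-file layout described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not the operator's parser: `parse_names` and `parse_data` are hypothetical helpers, each entry is assumed to sit on its own line, and the small inline texts are invented for the demonstration.

```python
def strip_comment(line):
    """Drop everything after a vertical bar and surrounding whitespace."""
    return line.split("|", 1)[0].strip()


def parse_names(text):
    """Return (class names, [(attribute name, type)]) from a .names text."""
    entries = []
    for raw in text.splitlines():
        line = strip_comment(raw)
        if line.endswith("."):
            line = line[:-1].rstrip()  # the terminating period is optional
        if line:
            entries.append(line)
    classes = [c.strip() for c in entries[0].split(",")]
    attributes = []
    for entry in entries[1:]:
        name, _, kind = entry.partition(":")
        kind = kind.strip()
        if kind not in ("continuous", "ignore") and not kind.startswith("discrete"):
            # An enumerated list of identifiers defines a nominal attribute.
            kind = [value.strip() for value in kind.split(",")]
        attributes.append((name.strip(), kind))
    return classes, attributes


def parse_data(text, attributes):
    """Return [(attribute dict, class label)]; '?' becomes None."""
    names = [name for name, _ in attributes]
    examples = []
    for raw in text.splitlines():
        line = strip_comment(raw)
        if not line:
            continue
        values = [None if v.strip() == "?" else v.strip() for v in line.split(",")]
        *features, label = values  # the class value comes last
        examples.append((dict(zip(names, features)), label))
    return examples


names_text = "good, bad.\ndur: continuous.\ncola: tc, none, tcf. | nominal\n"
data_text = "2,none,good\n?,tc,bad\n"
classes, attributes = parse_names(names_text)
print(classes, attributes)
print(parse_data(data_text, attributes))
```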
Output Ports
output (out) This port delivers the C4.5 file in tabular form along with the meta data. This
output is similar to the output of the Retrieve operator.
Parameters
c45 filestem (filename) The path to either the C4.5 names file, the data file, or the filestem
(without extensions). Both files must be in the same directory.
datamanagement (selection) Determines how the data is represented internally.
decimal point character (char) Character that is used as decimal point.
encoding (selection) The encoding used for reading or writing files.
Read CSV
This operator is used to read CSV files.
Description
CSV is an abbreviation for Comma-Separated Values. CSV files store data (both numerical
and text) in plain-text form. Each line of a CSV file holds all values of one example, and the values of
different attributes are separated by a constant separator. A file may have many rows. The name CSV suggests that the attribute
values are separated by commas, but other separators can also be used.
For complete understanding of this operator read the parameters section thoroughly. The
easiest and shortest way to import a CSV file is to use the import configuration wizard from the
Parameters panel. The best way, which may require some extra effort, is to first set all the parameters in the Parameters panel and then use the wizard. Please make sure that the CSV file is
read correctly before building a process using it.
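As a rough picture of the file layout described above, the following Python sketch splits a small separated-values text into rows using the standard csv module. It is a stand-in rather than the operator's reader: `read_separated_values` is a hypothetical helper, the sample texts are invented, and the csv module accepts only a single separator character.

```python
import csv
import io


def read_separated_values(text, separator=",", quote_char='"'):
    """Split text into rows of values, one row per non-empty line."""
    reader = csv.reader(io.StringIO(text), delimiter=separator, quotechar=quote_char)
    return [row for row in reader if row]


# A semicolon works as separator just as well as a comma.
print(read_separated_values("name;value\nfirst;85\n", separator=";"))
# Quotes keep a separator inside a single value.
print(read_separated_values('a,b,c,d\n"a,b,c,d"\n'))
```

Checking a few rows this way before building a process on the file is a cheap way to confirm that the separator and quote settings match the data.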
Input Ports
file (fil) A CSV file is expected as a file object which can be created with other operators with
file output ports like the Read File operator.
Output Ports
output (out) This port delivers the CSV file in tabular form along with the meta data. This output is similar to the output of the Retrieve operator.
Parameters
Import Configuration Wizard (menu) This option allows you to configure this operator by
means of a wizard. This user-friendly wizard makes the use of this operator easy.
csv file (string) The path of the CSV file is specified here. It can be selected using the choose
a file button.
column separators (string) Column separators for CSV files can be specified here in a regular
expression format. A good understanding of regular expression can be developed from
studying the Select Attributes operator’s description and Example Processes.
trim lines (boolean) This option indicates if lines should be trimmed (empty spaces are removed at the beginning and the end) before the column split is performed. This option
might be problematic if TABs are used as separators.
use quotes (boolean) This option indicates if quotes should be regarded. Quotes can be used
to store special characters like column separators. For example, if (,) is set as column separator and (") is set as quotes character, (a,b,c,d) will be translated as 4 values for 4 columns.
On the other hand, ("a,b,c,d") will be translated as a single column value a,b,c,d. If this
option is set to false, the quotes character parameter and the escape character parameter
for quotes cannot be defined.
quotes character (char) This option defines the quotes character.
escape character for quotes (char) This is the character that is used to escape quotes. For
example, if (") is used as quotes character and (\) is used as escape character, ("yes") will be
translated as (yes) and (\"yes\") will be translated as ("yes").
skip comments (boolean) The skip comments option is used to ignore comments in the CSV
file. This is only useful if the CSV file has comments. If this option is set to true, a comment
character should be defined using the comment characters parameter.
comment characters (string) Lines beginning with these characters are ignored. If this character is present in the middle of the line, anything that comes in that line after this character is ignored. Remember that the comment character itself is also ignored.
parse numbers (boolean) Specifies whether numbers are parsed or not.
decimal character (char) This character is used as the decimal character.
grouped digits (boolean) This option decides whether grouped digits should be parsed or not.
If this option is set to true, a grouping character parameter should be specified.
grouping character (char) This character is used as the grouping character. If this character
is found between numbers, the numbers are combined and this character is ignored. For
example if “22-14” is present in the CSV file and “-” is set as grouping character, then “2214”
will be stored.
date format (string) The date and time format is specified here. Many predefined options
exist; users can also specify a new format. If text in a CSV file column matches this date
format, that column is automatically converted to date type. Some corrections are automatically made in date type values. For example a value ‘32-March’ will automatically be
converted to ‘1-April’. Columns containing values which can’t be interpreted as numbers
will be interpreted as nominal, as long as they don’t match the date and time pattern of the
date format parameter. If they do, this column of the CSV file will be automatically parsed
as date and the according attribute will be of date type.
first row as names (boolean) If this option is set to true, it is assumed that the first line of
the CSV file has the names of the attributes. Then the attributes are automatically named
and first line of the CSV file is not treated as a data line.
annotations (menu) If first row as names is not set to true, annotations can be added using
the ‘Edit List’ button of this parameter which opens a new menu. This menu allows you
to select any row and assign an annotation to it. Name, Comment and Unit annotations
can be assigned. If row 0 is assigned a Name annotation, it is equivalent to setting the first
row as names parameter to true. If you want to ignore any rows you can annotate them as
Comment. Remember row number in this menu does not count commented lines.
time zone (selection) This is an expert parameter. A long list of time zones is provided; users
can select any of them.
locale (selection) This is an expert parameter. A long list of locales is provided; users can
select any of them.
encoding (selection) This is an expert parameter. A long list of encodings is provided; users
can select any of them.
data set meta data information (menu) This option is an important one. It allows you to
adjust the meta data of the CSV file. Column index, name, type and role can be specified
here. The Read CSV operator tries to determine an appropriate type of the attributes by
reading the first few lines and checking the occurring values. If all values are integers, the
attribute will become an integer. Similarly if all values are real numbers, the attribute will
become of type real. Columns containing values which can’t be interpreted as numbers
will be interpreted as nominal, as long as they don’t match the date and time pattern of
the date format parameter. If they do, this column of the CSV file will be automatically
parsed as date and the according attribute will be of type date. Automatically determined
types can be overridden using this parameter.
read not matching values as missings (boolean) If this value is set to true, values that do
not match with the expected value type are considered as missing values and are replaced
by ‘?’. For example if ‘abc’ is written in an integer column, it will be treated as a missing
value. A question mark (?) in the CSV file is also read as a missing value.
data management (selection) This is an expert parameter. A long list is provided; users can
select any option from this list.
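The type guessing described under data set meta data information, together with the read not matching values as missings option, can be sketched roughly as follows. This is only an illustration in Python and not RapidMiner's implementation: the helper names guess_type and read_value are hypothetical, the date pattern uses Python's strptime syntax rather than RapidMiner's date format syntax, and None stands in for the ‘?’ missing value.

```python
from datetime import datetime

MISSING = None  # stand-in for the '?' missing value marker


def guess_type(values, date_format="%Y.%m.%d"):
    """Guess a column type from a few sample values (hypothetical helper)."""
    present = [v.strip() for v in values if v.strip() not in ("", "?")]
    if not present:
        return "nominal"

    def all_parse(parser):
        # True only if every sampled value is accepted by the parser.
        for value in present:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    if all_parse(lambda v: datetime.strptime(v, date_format)):
        return "date"
    return "nominal"


def read_value(text, column_type, not_matching_as_missing=True):
    """Convert one cell; values that do not match the type become missing."""
    text = text.strip()
    if text == "?":
        return MISSING
    parser = {"integer": int, "real": float}.get(column_type)
    if parser is None:
        return text  # nominal values are kept as text in this sketch
    try:
        return parser(text)
    except ValueError:
        if not_matching_as_missing:
            return MISSING
        raise
```

For instance, guess_type(["1", "2.5"]) yields "real", and read_value("abc", "integer") yields the missing marker.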
Tutorial Processes
Reading a CSV file
Figure 1.8: Tutorial process ‘Reading a CSV file’.
Save the following text in a text file and load it with the given Read CSV Example Process. Run
the process and compare the results in the Results Workspace (data view) with the CSV file.
att1,att2,att3,att4 # row 1
80.6, yes , 1996.JAN.21 ,22-14 # row 2
12.43,"yes",1997.MAR.30,23-22 # row 3
13.5,\"no\",1998.AUG.22,23-14 # row 4
23.3,yes,1876.JAN.32,42-65# row 5
21.6,yes,2001.JUL.12,xyz # row 6
12.56,",_?",2002.SEP.18,15-90# row 7
Here is some explanation of what happens in this process:
• ‘#’ is defined as the comment character, so ‘row no.’ is ignored in all rows.
• As the first row as names parameter is set to true, att1, att2, att3 and att4 are set as the names of the attributes.
• att1 is set as real, att2 as polynominal, att3 as date and att4 as real.
• In attribute att4, the ‘-’ characters are ignored because the grouped digits parameter is set to true and ‘-’ is the grouping character.
• In row 2 the white spaces at the start and at the end of values are ignored because the trim lines parameter is set to true.
• In row 3 quotes are used but they are ignored because the escape character is not used.
• In row 4 the escape character is used, so the quotes are not ignored.
• In row 5 the date value is automatically corrected; ‘jan.32’ is changed to ‘feb.1’.
• In row 6 an invalid real value in the fourth column is replaced by ‘?’ because the read not matching values as missings parameter is set to true.
• In row 7 quotes are used to store special characters including the column separator and a question mark.
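The comment, trimming, quoting and grouping behaviour described in this tutorial can be approximated with Python's standard csv module. This is a sketch under stated assumptions, not the operator's actual code: split_row and parse_grouped_number are hypothetical helpers, the comment handling is naive (it does not protect a ‘#’ inside quotes), and values are stripped individually to mirror the explanation of row 2.

```python
import csv

COMMENT_CHAR = "#"
GROUPING_CHAR = "-"


def split_row(line):
    """Drop the trailing comment, then split on commas honouring quotes and escapes."""
    content = line.split(COMMENT_CHAR, 1)[0]  # naive: ignores '#' inside quotes
    reader = csv.reader([content], delimiter=",", quotechar='"', escapechar="\\")
    return [value.strip() for value in next(reader)]


def parse_grouped_number(text):
    """Remove the grouping character and parse the rest as a real number."""
    # Note: with '-' as grouping character a leading minus sign is lost too.
    return float(text.replace(GROUPING_CHAR, ""))


row3 = split_row('12.43,"yes",1997.MAR.30,23-22 # row 3')
row4 = split_row(r'13.5,\"no\",1998.AUG.22,23-14 # row 4')
print(row3[1], row4[1], parse_grouped_number(row3[3]))
```

Plain quotes are consumed by the parser (row 3), escaped quotes survive as part of the value (row 4), and the grouping character is dropped before the number is parsed.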
Read dBase
This operator can read dBase files.
Description
This operator can read dBase files. It uses Stefan Haustein’s kdb tools.
Input Ports
file (fil) A dBase file is expected as a file object which can be created with other operators with
file output ports like the Read File operator.
Output Ports
output (out) This port delivers the dBase file in tabular form along with the meta data. This
output is similar to the output of the Retrieve operator.
Parameters
label attribute (string) The (case-sensitive) name of the label attribute.
id attribute (string) The (case-sensitive) name of the id attribute.
weight attribute (string) The (case-sensitive) name of the weight attribute.
datamanagement (selection) Determines how the data is represented internally.
data file (filename) The file containing the data.
Read DASYLab
This operator can read DASYLab data files.
Description
This operator allows you to import data from DASYLab files (.DDF) into RapidMiner. Currently only
universal format 1 is supported. External files (.DDB) and histogram data are currently not supported.
The timestamp parameter lets you configure whether and what kind of timestamp should be
included in the example set. If it is set to relative, the timestamp attribute captures the number
of milliseconds since the file start time. If it is set to absolute, the absolute time is used to timestamp the examples.
Input Ports
file (fil) A DASYLab file is expected as a file object which can be created with other operators
with file output ports like the Read File operator.
Output Ports
output (out) This port delivers the DASYLab file in tabular form along with the meta data. This
output is similar to the output of the Retrieve operator.
Parameters
filename (filename) Name of the file to read the data from.
datamanagement (selection) Determines how the data is represented internally.
timestamp (selection) Specifies whether to include an absolute timestamp, a timestamp relative to the beginning of the file (in seconds) or no timestamp at all.
Read Excel
This operator reads an ExampleSet from the specified Excel file.
Description
This operator can be used to load data from Microsoft Excel spreadsheets. This operator is able to
read data from Excel 95, 97, 2000, XP, and 2003. The user has to define which of the spreadsheets
in the workbook should be used as data table. The table must have a format such that each row is
an example and each column represents an attribute. Please note that the first row of the Excel
sheet might be used for attribute names which can be indicated by a parameter. The data table
can be placed anywhere on the sheet and can contain arbitrary formatting instructions, empty
rows and empty columns. Missing data values in Excel should be indicated by empty cells or by
cells containing only “?”.
For complete understanding of this operator read the parameters section. The easiest and
shortest way to import an Excel file is to use the import configuration wizard from the Parameters
panel. The best way, which may require some extra effort, is to first set all the parameters in the
Parameters panel and then use the wizard. Please make sure that the Excel file is read correctly
before building a process using it.
Input Ports
file (fil) An Excel file is expected as a file object which can be created with other operators with
file output ports like the Read File operator.
Output Ports
output (out) This port delivers the Excel file in tabular form along with the meta data. This
output is similar to the output of the Retrieve operator.
Parameters
import configuration wizard This option allows you to configure this operator by means of
a wizard. This user-friendly wizard makes the use of this operator easy.
excel file The path of the Excel file is specified here. It can be selected using the choose a file
button.
sheet number (integer) The number of the sheet which you want to import should be specified here.
imported cell range This is a mandatory parameter. The range of cells to be imported from
the specified sheet is given here. It is specified in ‘xm:yn’ format where ‘x’ is the column
of the first cell of range, ‘m’ is the row of the first cell of range, ‘y’ is the column of the last
cell of range, ‘n’ is the row of the last cell of range. ‘A1:E10’ will select all cells of the first
five columns from row 1 to 10.
first row as names (boolean) If this option is set to true, it is assumed that the first line of
the Excel file has the names of attributes. Then the attributes are automatically named
and the first line of Excel file is not treated as a data line.
annotations If the first row as names parameter is not set to true, annotations can be added
using the ‘Edit List’ button of this parameter which opens a new menu. This menu allows
you to select any row and assign an annotation to it. Name, Comment and Unit annotations
can be assigned. If row 0 is assigned a Name annotation, it is equivalent to setting the first
row as names parameter to true. If you want to ignore any rows you can annotate them as
Comment.
date format The date and time format is specified here. Many predefined options exist; users
can also specify a new format. If text in an Excel file column matches this date format, that
column is automatically converted to date type. Some corrections are automatically made
in the date type values. For example a value ‘32-March’ will automatically be converted
to ‘1-April’. Columns containing values which can’t be interpreted as numbers will be interpreted as nominal, as long as they don’t match the date and time pattern of the date
format parameter. If they do, this column of the Excel file will be automatically parsed as
date and the according attribute will be of date type.
time zone This is an expert parameter. A long list of time zones is provided; users can select
any of them.
locale This is an expert parameter. A long list of locales is provided; users can select any of
them.
data set meta data information This option is an important one. It allows you to adjust the
meta data of the ExampleSet created from the specified Excel file. Column index, name, type
and role can be specified here. The Read Excel operator tries to determine an appropriate
type of the attributes by reading the first few lines and checking the occurring values. If
all values are integers, the attribute will become an integer. Similarly if all values are real
numbers, the attribute will become of type real. Columns containing values which can’t
be interpreted as numbers will be interpreted as nominal, as long as they don’t match the
date and time pattern of the date format parameter. If they do, this column of the Excel
file will be automatically parsed as date and the according attribute will be of type date.
Automatically determined types can be overridden using this parameter.
read not matching values as missings (boolean) If this value is set to true, values that do
not match with the expected value type are considered as missing values and are replaced
by ‘?’. For example if ‘abc’ is written in an integer column, it will be treated as a missing
value. A question mark (?) or an empty cell in the Excel file is also read as a missing value.
data management This is an expert parameter. A long list is provided; users can select any
option from this list.
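The ‘xm:yn’ notation of the imported cell range parameter can be decoded as in the following sketch. It is an illustration only; parse_cell_range and column_index are hypothetical helper names, and the result uses 1-based column and row numbers.

```python
import re

_RANGE = re.compile(r"^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$")


def column_index(letters):
    """Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_cell_range(cell_range):
    """Return (first column, first row, last column, last row), all 1-based."""
    match = _RANGE.match(cell_range.strip())
    if match is None:
        raise ValueError("expected a range like 'A1:E10', got %r" % cell_range)
    first_col, first_row, last_col, last_row = match.groups()
    return (column_index(first_col), int(first_row),
            column_index(last_col), int(last_row))
```

With this helper, ‘A1:E10’ resolves to columns 1 to 5 and rows 1 to 10, matching the example given for the parameter.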
Tutorial Processes
Reading an ExampleSet from an Excel file
Figure 1.9: Tutorial process ‘Reading an ExampleSet from an Excel file’.
For this Example Process you need an Excel file first. The one of this Example Process was created by copying the ‘Golf’ data set present in the Repositories into a new Excel file which was
named ‘golf’. The data set was copied on sheet 1 of the Excel file thus the sheet number parameter is given value 1. Make sure that you provide the correct location of the file in the Excel file
parameter. The first cell of the sheet is A1 and last required cell is E15, thus the imported cell
range parameter is provided value ‘A1:E15’. As the first row of the sheet contains names of attributes, the first row as names parameter is checked. The remaining parameters were used with
default values. Run the process; you will see almost the same results as you would have gotten
from using the Retrieve operator to retrieve the ‘Golf’ data set from the Repository. You will
see a difference in the meta data though, for example here the types and roles of attributes are
different from those in the ‘Golf’ data set. You can change the role and type of attributes using
the data set meta data information parameter. It is always good to make sure that all attributes
are of desired role and type. In this example one important change that you would like to make
is to change the role of the Play attribute. Its role should be changed to label if you want to use
any classification operators on this data set.
Read SAS
This operator is used for reading an SAS file.
Description
This operator can read SAS (Statistical Analysis System) files. Please study the attached Example
Process for understanding the use of this operator. Please note that when an SAS file is read, the
roles of all the attributes are set to regular. Numeric columns use the “real” data type, nominal
columns use the “polynominal” data type in RapidMiner.
Input Ports
file (fil) An SAS file is expected as a file object which can be created with other operators with
file output ports like the Read File operator.
Output Ports
output (out) This port delivers the SAS file in tabular form along with the meta data. This
output is similar to the output of the Retrieve operator.
Parameters
file (filename) The path of the SAS file is specified here. It can be selected using the choose a
file button.
Tutorial Processes
Use of the SAS operator
Figure 1.10: Tutorial process ‘Use of the SAS operator’.
An SAS file is loaded using the Open File operator and then read via the Read SAS operator.
Read SPSS
This operator is used for reading SPSS files.
Description
The Read SPSS operator can read the data files created by SPSS (Statistical Package for the Social
Sciences), an application used for statistical analysis. SPSS files are saved in a proprietary binary
format and contain a dataset as well as a dictionary that describes the dataset. These files save
data by ‘cases’ (rows) and ‘variables’ (columns).
These files have a ‘.SAV’ file extension. SAV files are often used for storing datasets extracted
from databases and Microsoft Excel spreadsheets. SPSS datasets can be manipulated in a variety
of ways, but they are most commonly used to perform statistical analysis tests such as regression
analysis, analysis of variance, and factor analysis.
Input Ports
file (fil) This optional port expects a file object.
Output Ports
output (out) Data from the SPSS file is delivered through this port mostly in form of an ExampleSet.
Parameters
filename (filename) This parameter specifies the path of the SPSS file. It can be selected using
the choose a file button.
datamanagement (selection) This parameter determines how the data is represented internally. This is an expert parameter. There are different options; users can choose any of
them.
attribute naming mode (selection) This parameter determines which SPSS variable properties should be used for naming the attributes.
use value labels (boolean) This parameter specifies if the SPSS value labels should be used
as values.
recode user missings (boolean) This parameter specifies if the SPSS user missings should
be recoded to missing values.
sample ratio (real) This parameter specifies the fraction of the data set which should be read.
If it is set to 1, the complete data set is read. If it is set to -1 then the sample size parameter
is used for determining the size of the data to read.
sample size (integer) This parameter specifies the exact number of samples which should be
read. If it is set to -1, then the sample ratio parameter is used for determining the size of
data to read. If both are set to -1 then the complete data set is read.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomization. Using the same value of local random seed will produce the same
randomization.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
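The interplay of the sample ratio, sample size and local random seed parameters can be sketched as follows. This is not the operator's code: resolve_sample_count and sample_rows are hypothetical helpers, and the sketch assumes that a sample size other than -1 takes precedence over the sample ratio.

```python
import random


def resolve_sample_count(total, sample_ratio=1.0, sample_size=-1):
    """Decide how many rows to read from a data set with `total` rows."""
    if sample_size != -1:
        # Assumption: an explicit size wins over the ratio.
        return min(sample_size, total)
    if sample_ratio != -1:
        return int(round(sample_ratio * total))
    return total  # both parameters are -1: read everything


def sample_rows(rows, sample_ratio=1.0, sample_size=-1, seed=None):
    """Pick rows at random, keeping their original order."""
    count = resolve_sample_count(len(rows), sample_ratio, sample_size)
    rng = random.Random(seed)  # the same seed gives the same selection
    chosen = sorted(rng.sample(range(len(rows)), count))
    return [rows[i] for i in chosen]
```

Calling sample_rows twice with the same seed returns the same rows, which mirrors the statement that the same local random seed produces the same randomization.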
Tutorial Processes
Reading an SPSS file
Figure 1.11: Tutorial process ‘Reading an SPSS file’.
You need to have an SPSS file for this process. In this process, the name of the SPSS file is
airline_passengers.sav and it is placed in the D drive of the computer. The file is read using the
Read SPSS operator. All parameters are used with default values. After execution of the process
you can see the resultant ExampleSet in the Results Workspace.
Read Sparse
This operator is used for reading files written in sparse formats.
Description
This operator reads sparse format files. The lines of a sparse file have the form:
label index:value index:value index:value...
Where index may be an integer (starting with 1) for the regular attributes or one of the prefixes
specified by the prefix map parameter. The following formats are supported:
• xy format: The label is the last token in each line.
• yx format: The label is the first token in each line.
• prefix format: The label is prefixed by ‘l:’
• separate file format: The label is read from a separate file specified by the label file parameter.
• no label: The ExampleSet is unlabeled.
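The xy, yx and unlabeled layouts listed above can be parsed as in this sketch. It is illustrative only: parse_sparse_line and its format strings are hypothetical, attributes that are not listed are assumed to default to 0.0, and neither the prefix format nor the separate label file is handled.

```python
def parse_sparse_line(line, dimension, fmt="yx"):
    """Parse one 'index:value' line into (label, dense list of values)."""
    tokens = line.split()
    if fmt == "yx":          # label is the first token
        label, pairs = tokens[0], tokens[1:]
    elif fmt == "xy":        # label is the last token
        label, pairs = tokens[-1], tokens[:-1]
    elif fmt == "no_label":  # unlabeled example
        label, pairs = None, tokens
    else:
        raise ValueError("unsupported format: %r" % fmt)

    values = [0.0] * dimension  # assumption: unlisted attributes are 0.0
    for pair in pairs:
        index, value = pair.split(":", 1)
        values[int(index) - 1] = float(value)  # indices start with 1
    return label, values


print(parse_sparse_line("yes 1:2.5 4:1", dimension=4, fmt="yx"))
```

The dimension argument plays the role of the dimension parameter: it fixes the number of regular attributes when no attribute description file supplies it.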
Output Ports
output (out) This port delivers the required file in tabular form along with the meta data. This
output is similar to the output of the Retrieve operator.
Parameters
format (selection) This parameter specifies the format of the sparse data file.
attribute description file (filename) The name of the attribute description file is specified
here. An attribute description file (extension: .aml) is required to retrieve meta data of the
ExampleSet. This file is a simple XML document defining the properties of the attributes
(like their name and range) and their source files. The data may be spread over several files.
This file also contains the names of the files to read the data from. Therefore, the actual
data files do not have to be specified as a parameter of this operator.
data file (filename) This parameter specifies the name of the data file. It is necessary if it is
not specified in the attribute description file.
label file (filename) This parameter specifies the name of the file containing the labels. It is
necessary if the format parameter is set to ‘format separate file’.
dimension (integer) This parameter specifies the dimension of the example space. It is necessary if the attribute description file parameter is not set.
sample size (integer) This parameter specifies the maximum number of examples which should
be read. If it is set to -1, then all examples are read.
use quotes (boolean) This parameter indicates if quotes should be regarded. If this option is
set to true, the quotes character parameter can be used for specifying the quotes character.
quotes character (char) This parameter defines the quotes character.
datamanagement (selection) This parameter determines how the data is represented internally. This is an expert parameter. There are different options; users can choose any of
them.
decimal point character (string) This character is used as the decimal character.
prefix map (list) This parameter maps prefixes to names of special attributes.
encoding (selection) This is an expert parameter. A long list of encodings is provided; users
can select any one of them.
Tutorial Processes
Writing and Reading a sparse file
Figure 1.12: Tutorial process ‘Writing and Reading a sparse file’.
This Example Process shows how the Write AML operator can be used for writing a sparse file and
how the Read Sparse operator can be used for reading a sparse file. The ‘Golf’ data set is loaded
using the Retrieve operator. This ExampleSet is provided as input to the Write AML operator.
The example set file parameter is set to ‘D:\golf_data’ thus a file named ‘golf_data’ is created (if
it does not already exist) in the ‘D’ drive of your computer. You can open the written file and
make changes in it (if required). This file has the instances of the ExampleSet. The attribute
description file parameter is set to ‘D:\golf_att’ thus a file named ‘golf_att’ is created (if it does
not already exist) in the ‘D’ drive of your computer. You can open the written file and make
changes in it (if required). This file has the meta data of the ExampleSet. The format parameter
is set to ‘sparse_xy’ to write the file in xy sparse format. The Read Sparse operator is applied next
to read the ExampleSet from the files. The attribute description file and data file parameters are
set to ‘D:\golf_att’ and ‘D:\golf_data’ respectively. The format parameter is set to ‘xy’ because
the file was written in xy format. All other parameters are used with default values. The resultant
ExampleSet can be seen in the Results Workspace.
Read Stata
This operator can read Stata data files.
Description
This operator can read Stata files. Currently only Stata files of version 113 or 114 are supported.
Input Ports
file (fil) This optional port expects a file object.
Output Ports
output (out) Data from the Stata file is delivered through this port mostly in form of an ExampleSet.
Parameters
filename (filename) Name of the file to read the data from.
datamanagement (selection) Determines how the data is represented internally.
attribute naming mode (selection) Determines which variable properties should be used
for attribute naming.
handle value labels (selection) Specifies how to handle attributes with value labels, i.e. whether
to ignore the labels or how to use them.
sample ratio (real) The fraction of the data set which should be read (1 = all; only used if
sample_size = -1).
sample size (integer) The exact number of samples which should be read (-1 = all; if not -1,
sample_ratio will not have any effect).
use local random seed (boolean) Indicates if a local random seed should be used.
local random seed (integer) Specifies the local random seed.
Read XML
This operator is used for reading an XML file.
Description
This operator can read XML files, where examples are represented by elements which match a
given XPath and features are attributes and text-content of each element and its sub-elements.
This operator tries to determine an appropriate type of the attributes by reading the first few
elements and checking the occurring values. If all values are integers, the attribute will become
integer, if real numbers occur, it will be of type real. Columns containing values which can’t be
interpreted as numbers will be nominal, as long as they don’t match the date and time pattern
of the date format parameter. If they do, this attribute will be automatically parsed as date and
the according feature will be of type date.
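The idea of turning each element matched by an XPath into one example can be sketched with Python's standard xml.etree.ElementTree module, which supports a limited XPath subset. This is an illustration only: examples_from_xml and the feature naming scheme (tag for text content, tag@name for XML attributes) are assumptions made for this sketch, and repeated sub-element tags would overwrite each other.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<data>
  <row id="1"><outlook>sunny</outlook><temperature unit="F">85</temperature></row>
  <row id="2"><outlook>rain</outlook><temperature unit="F">70</temperature></row>
</data>"""


def examples_from_xml(xml_text, xpath):
    """Return one dict of features per element matching the XPath."""
    root = ET.fromstring(xml_text)
    examples = []
    for element in root.findall(xpath):
        features = {}
        # iter() visits the matched element itself and all its sub-elements.
        for node in element.iter():
            for name, value in node.attrib.items():
                features[node.tag + "@" + name] = value
            text = (node.text or "").strip()
            if text:
                features[node.tag] = text
        examples.append(features)
    return examples


rows = examples_from_xml(SAMPLE, "./row")
print(rows[0])
```

All values are still strings at this point; deciding between integer, real, date and nominal would be a separate step over the collected values.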
Input Ports
file (fil) An XML file is expected as a file object which can be created with other operators with
file output ports like the Read File operator.
Output Ports
output (out) This port delivers the XML file in tabular form along with the meta data. This
output is similar to the output of the Retrieve operator.
Parameters
parse numbers (boolean) Specifies whether numbers are parsed or not.
decimal character (char) This character is used as the decimal character.
grouped digits (boolean) This option decides whether grouped digits should be parsed or not.
If this option is set to true, a grouping character parameter should be specified.
grouping character (char) This character is used as the grouping character. If this character
is found between numbers, the numbers are combined and this character is ignored. For
example if “22-14” is present in the CSV file and “-” is set as grouping character, then “2214”
will be stored.
date format (string) The date and time format is specified here. Many predefined options
exist; users can also specify a new format. If text in a CSV file column matches this date
format, that column is automatically converted to date type. Some corrections are automatically made in date type values. For example a value ‘32-March’ will automatically be
converted to ‘1-April’. Columns containing values which can’t be interpreted as numbers
will be interpreted as nominal, as long as they don’t match the date and time pattern of the
date format parameter. If they do, this column of the CSV file will be automatically parsed
as date and the according attribute will be of date type.
first row as names (boolean) If this option is set to true, it is assumed that the first line of
the CSV file has the names of the attributes. Then the attributes are automatically named
and first line of the CSV file is not treated as a data line.
annotations (menu) If first row as names is not set to true, annotations can be added using
the ‘Edit List’ button of this parameter which opens a new menu. This menu allows you
to select any row and assign an annotation to it. Name, Comment and Unit annotations
can be assigned. If row 0 is assigned a Name annotation, it is equivalent to setting the first
row as names parameter to true. If you want to ignore any rows you can annotate them as
Comment. Remember row number in this menu does not count commented lines.
time zone (selection) This is an expert parameter. A long list of time zones is provided; users
can select any of them.
locale (selection) This is an expert parameter. A long list of locales is provided; users can
select any of them.
data set meta data information (menu) This option is an important one. It allows you to
adjust the meta data of the CSV file. Column index, name, type and role can be specified
here. The Read CSV operator tries to determine an appropriate type of the attributes by
reading the first few lines and checking the occurring values. If all values are integers, the
attribute will become an integer. Similarly if all values are real numbers, the attribute will
become of type real. Columns containing values which can’t be interpreted as numbers
will be interpreted as nominal, as long as they don’t match the date and time pattern of
the date format parameter. If they do, this column of the CSV file will be automatically
parsed as date and the according attribute will be of type date. Automatically determined
types can be overridden using this parameter.
read not matching values as missings (boolean) If this value is set to true, values that do
not match with the expected value type are considered as missing values and are replaced
by ‘?’. For example if ‘abc’ is written in an integer column, it will be treated as a missing
value. A question mark (?) in the CSV file is also read as a missing value.
datamanagement (selection) This is an expert parameter. A long list is provided; users can
select any option from this list.
Read XRFF
This operator is used for reading XRFF (eXtensible attribute-Relation File Format) files.
Description
This operator can read XRFF files known from Weka. The XRFF (eXtensible attribute-Relation
File Format) is an XML-based extension of the ARFF format in some sense similar to the original
RapidMiner file format for attribute description files (.aml). You can see a sample XRFF file by
studying the attached Example Process.
Since the XML representation takes up considerably more space because the data is wrapped
into XML tags, one can also compress the data via gzip. RapidMiner automatically recognizes a
file being gzip compressed, if the file’s extension is .xrff.gz instead of .xrff.
The XRFF file is divided into two portions i.e. the header and the body. The header has the
meta data description and the body has the instances. Via the class="yes" attribute in the attribute specification in the header, one can define which attribute should be used as a prediction label attribute. Although the RapidMiner terminology for such classes is “label” instead of
“class” we support the terminology class in order to have compatibility with the original XRFF
files.
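The extension-based gzip detection and the class="yes" marker described above can be sketched as follows. This is not RapidMiner's reader: open_xrff and label_attribute are hypothetical helpers, and the sketch assumes that attribute specifications appear as attribute elements carrying a name, as in XRFF files written by Weka.

```python
import gzip
import xml.etree.ElementTree as ET


def open_xrff(path):
    """Open a plain or gzip-compressed XRFF file, judged by its extension."""
    if path.endswith(".xrff.gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def label_attribute(path):
    """Return the name of the attribute marked with class="yes", if any."""
    with open_xrff(path) as handle:
        root = ET.parse(handle).getroot()
    # Assumption: attribute specifications are <attribute name="..."> elements.
    for attribute in root.iter("attribute"):
        if attribute.get("class") == "yes":
            return attribute.get("name")
    return None
```

Both golf.xrff and golf.xrff.gz (hypothetical file names) would be handled by the same call, since only the opening step differs.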
Input Ports
file (fil) This optional port expects a file object.
Output Ports
output (out) The XRFF file is read from the specified path and the resultant ExampleSet is delivered through this port.
Parameters
data file (filename) This parameter specifies the path of the XRFF file. It can be selected using
the choose a file button.
id attribute (string) This parameter specifies the name of the id attribute. Please note that
this field is case-sensitive.
datamanagement (selection) This parameter determines how the data is represented internally. This is an expert parameter. There are different options; users can choose any of
them.
decimal point character (string) This parameter specifies the character that is used as decimal point.
sample ratio (real) This parameter specifies the fraction of the data set which should be read.
If it is set to 1, the complete data set is read. If it is set to -1 then the sample size parameter
is used for determining the size of the data to read.
sample size (integer) This parameter specifies the exact number of samples which should be
read. If it is set to -1 the sample ratio parameter is used for determining the size of data to
read. If both are set to -1 the complete data set is read.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomization. Using the same value of local random seed will produce the same
randomization.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
Tutorial Processes
Writing and Reading an XRFF file
Figure 1.13: Tutorial process ‘Writing and Reading an XRFF file’.
This Example Process demonstrates the use of the Write XRFF and Read XRFF operators. It shows how these operators can be used to write and read an
ExampleSet. The ‘Golf’ data set is loaded using the Retrieve operator. This ExampleSet is provided as input to the Write XRFF operator. The example set file parameter is set to ‘D:\golf_xrff’
thus a file named ‘golf_xrff’ is created (if it does not already exist) in the ‘D’ drive of your computer. You can open the written file and make changes in it (if required). The Read XRFF operator
is applied next. The data file parameter is set to ‘D:\golf_xrff’ to read the file that was just written using the Write XRFF operator. The remaining parameters are used with default values. The
resultant ExampleSet can be seen in the Results Workspace.
1.1.2 Write
Write ARFF
This operator is used for writing an ARFF file.
Description
This operator can write data in the form of ARFF (Attribute-Relation File Format) files known from
the machine learning library Weka. An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project
at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. Please study the attached Example Processes for understanding the
basics and structure of the ARFF file format. Please note that when an ARFF file is written, the
roles of the attributes are not stored. Similarly when an ARFF file is read, the roles of all the
attributes are set to regular.
Input Ports
input (inp) This input port expects an ExampleSet. It is the output of the Retrieve operator in
the attached Example Process.
Output Ports
through (thr) The ExampleSet that was provided at the input port is delivered through this
output port without any modifications. This is usually used to reuse the same ExampleSet
in further operators of the process.
file (fil) This port buffers the file object for passing it to the reader operators.
Parameters
example set file (filename) The path of the ARFF file is specified here. It can be selected
using the choose a file button.
encoding (selection) This is an expert parameter. A long list of encodings is provided; any
of them can be selected.
Tutorial Processes
The basics of ARFF
The ‘Iris’ data set is loaded using the Retrieve operator. The Write ARFF operator is applied on it
to write the ‘Iris’ data set into an ARFF file. The example set file parameter is set to ‘D:\Iris.txt’.
Thus an ARFF file is created in the ‘D’ drive of your computer with the name ‘Iris’. Open this file
to see the structure of an ARFF file.
Figure 1.14: Tutorial process ‘The basics of ARFF’.
ARFF files have two distinct sections. The first section is the Header information, which is
followed by the Data information. The Header of the ARFF file contains the name of the Relation and a list of the attributes. The name of the Relation is specified after the @RELATION
statement. The Relation is ignored by RapidMiner. Each attribute definition starts with the
@ATTRIBUTE statement followed by the attribute name and its type. The resultant ARFF file
of this Example Process starts with the Header. The name of the relation is ‘RapidMinerData’.
After the name of the Relation, six attributes are defined.
Attribute declarations take the form of an ordered sequence of @ATTRIBUTE statements.
Each attribute in the data set has its own @ATTRIBUTE statement which uniquely defines the
name of that attribute and its data type. The order of declaration of the attributes indicates the
column position in the data section of the file. For example, in the resultant ARFF file of this
Example Process the ‘label’ attribute is declared at the end of all other attribute declarations.
Therefore values of the ‘label’ attribute are in the last column of the Data section.
The possible attribute types in ARFF are:
• numeric
• integer
• real
• {nominalValue1,nominalValue2,...} for nominal attributes
• string for nominal attributes without distinct nominal values (it is however recommended to use the nominal definition above as often as possible)
• date [date-format] (currently not supported by RapidMiner)
You can see in the resultant ARFF file of this Example Process that the attributes ‘a1’, ‘a2’, ‘a3’
and ‘a4’ are of real type. The attributes ‘id’ and ‘label’ are of nominal type. The distinct nominal
values are also specified with these nominal attributes.
The ARFF Data section of the file contains the data declaration line @DATA followed by the
actual example data lines. Each example is represented on a single line, with carriage returns
denoting the end of the example. Attribute values for each example are delimited by commas.
They must appear in the order that they were declared in the Header section (i.e. the data corresponding to the n-th @ATTRIBUTE declaration is always the n-th field of the example line).
Missing values are represented by a single question mark (?).
A percent sign (%) introduces a comment and will be ignored during reading. Attribute names
or example values containing spaces must be quoted with single quotes (’). Please note that in
RapidMiner the sparse ARFF format is currently only supported for numerical attributes. Please
use one of the other options for sparse data files provided by RapidMiner if you also need sparse
data files for nominal attributes.
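The structure described above can be reproduced with a short Python sketch that builds the text of a small ARFF file. It is an illustration of the format only and not the code used by the Write ARFF operator. The function names and the sample rows are hypothetical, and values containing single quotes are not escaped.

```python
def quote(token):
    """Quote a name or value with single quotes if it contains spaces."""
    text = str(token)
    return "'" + text + "'" if " " in text else text

def to_arff(relation, attributes, rows):
    """Build ARFF text. attributes is a list of (name, type) pairs, where
    type is a type keyword such as 'real' or a list of nominal values."""
    lines = ["% illustrative sketch, not written by RapidMiner",
             "@RELATION " + quote(relation), ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: list the distinct values in braces.
            kind = "{" + ",".join(quote(v) for v in kind) + "}"
        lines.append("@ATTRIBUTE " + quote(name) + " " + kind)
    lines += ["", "@DATA"]
    for row in rows:
        # One example per line; missing values are written as '?'.
        lines.append(",".join("?" if v is None else quote(v) for v in row))
    return "\n".join(lines) + "\n"

arff_text = to_arff(
    "RapidMinerData",
    [("a1", "real"), ("label", ["Iris-setosa", "Iris-versicolor"])],
    [[5.1, "Iris-setosa"], [None, "Iris-versicolor"]],
)
```

The order of the (name, type) pairs determines the column order of the data lines, as described for the @ATTRIBUTE declarations.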
Reading an ARFF file using the Read ARFF operator
Figure 1.15: Tutorial process ‘Reading an ARFF file using the Read ARFF operator’.
The ARFF file that was written in the first Example Process using the Write ARFF operator is
retrieved in this Example Process using the Read ARFF operator. The data file parameter is set
to ‘D:\Iris.txt’. Please make sure that you specify the correct path. All other parameters are
used with default values. Run the process. You will see that the results are very similar to the
original Iris data set of the RapidMiner repository. Please note that the role of all the attributes is
regular in the results of the Read ARFF operator. Even the roles of the ‘id’ and ‘label’ attributes are
set to regular. This is because ARFF files do not store information about the roles of the
attributes.
Write Access
This operator writes an ExampleSet into the specified Microsoft Access database.
Description
The Write Access operator is used for writing an ExampleSet into the specified Microsoft Access
database (.mdb or .accdb extension) using the UCanAccess JDBC driver. You only need to have a
basic understanding of databases in order to use this operator properly. Please go through the
parameters and the attached Example Process to understand the working of this operator.
Input Ports
input (inp) This input port expects an ExampleSet. It is the output of the Retrieve operator in the
attached Example Process.
Output Ports
through (thr) The ExampleSet that was provided at the input port is delivered through this
output port without any modifications. This is usually used to reuse the same ExampleSet
in further operators of the process.
file (fil) This port buffers the file object for passing it to the reader operators.
Parameters
database file (filename) This parameter specifies the path of the Access database (i.e. the
.mdb or .accdb file).
username (string) This parameter is used for specifying the username of the database (if any).
password (string) This parameter is used for specifying the password of the database (if any).
table name (string) This parameter is used for specifying the name of the required table from
the specified database.
overwrite mode (selection) This parameter indicates if an existing table should be overwritten or the data should be appended.
access version (selection) If a new database is created this parameter specifies its format version. This parameter is not used if the database already exists.
Tutorial Processes
Writing and then reading data from an Access database
Figure 1.16: Tutorial process ‘Writing and then reading data from an Access database’.
The ‘Golf’ data set is loaded using the Retrieve operator. The Write Access operator is used for
writing this ExampleSet into the golf table of the ‘golf_db.mdb’ database. The database file parameter is provided with the path of the database file ‘golf_db.mdb’ and the name of the desired
table is specified in the table name parameter (i.e. it is set to ‘golf’). A breakpoint is inserted
here. No results are visible in RapidMiner at this stage but you can see that at this point of the
execution the database has been created and the golf table has been filled with the examples of
the ‘Golf’ data set.
Now the Read Access operator is used for reading the golf table from the ‘golf_db.mdb’ database.
The database file parameter is provided with the path of the database file ‘golf_db.mdb’. The define query parameter is set to ‘table name’. The table name parameter is set to ‘golf’ which is the
name of the required table. Continue the process; you will see the entire golf table in the Results
Workspace. The define query parameter is set to ‘table name’ if you want to read an entire table
from the database. You can also read a selected portion of the database by using queries. Set
the define query parameter to ‘query’ and specify a query in the query parameter.
Write CSV
This operator is used to write CSV (Comma-Separated Values) files.
Description
A comma-separated values (CSV) file stores tabular data (numbers and text) in plain-text form.
Each example is written on one line, and the values of its attributes are separated by
a constant separator. A file may have many rows. The name suggests that the attribute values are
separated by commas, but other separators can also be used. The separator is specified using the
column separator parameter. Missing data values are indicated by empty cells.
Input Ports
input (inp) This input port expects an ExampleSet. It is the output of the Retrieve operator in the
attached Example Process.
Output Ports
through (thr) The ExampleSet that was provided at the input port is delivered through this
output port without any modifications. This is usually used to reuse the same ExampleSet
in further operators of the process.
file (fil) The created CSV file is provided as a file object that can be used with other operators
with file input ports like ‘Write File’.
Parameters
csv file (filename) The path of the CSV file is specified here. It can be selected using the choose a
file button.
column separator (string) Column separators for the CSV file can be specified here.
write attribute names (boolean) This parameter indicates if the attribute names should be
written as the first row of the CSV file.
quote nominal values (boolean) This parameter indicates if the nominal values should be
quoted with double quotes in the CSV file.
format date attributes (boolean) This parameter indicates if the date attributes should be
written as a formatted string or as milliseconds elapsed since January 1, 1970, 00:00:00 GMT.
append to file (boolean) This parameter indicates if new content should be appended to the
file or if the pre-existing file content should be overwritten.
encoding (selection) This is an expert parameter. Several encodings are available; any of them
can be selected.
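The effect of the column separator, write attribute names and quote nominal values parameters can be illustrated with a minimal Python sketch. It is not the operator's implementation. The function names are hypothetical, the default separator, the quoting of attribute names and the doubling of embedded double quotes are assumptions, and the sample rows are made up for illustration.

```python
def format_cell(value, quote_nominal_values):
    """Format one cell; None stands for a missing value."""
    if value is None:
        return ""  # missing values become empty cells
    if isinstance(value, str) and quote_nominal_values:
        # Assumption: embedded double quotes are escaped by doubling them.
        return '"' + value.replace('"', '""') + '"'
    return str(value)

def to_csv(names, rows, column_separator=";",
           write_attribute_names=True, quote_nominal_values=True):
    """Build CSV text with one example per line."""
    lines = []
    if write_attribute_names:
        # Assumption: attribute names are quoted like nominal values.
        lines.append(column_separator.join(
            format_cell(n, quote_nominal_values) for n in names))
    for row in rows:
        lines.append(column_separator.join(
            format_cell(v, quote_nominal_values) for v in row))
    return "\n".join(lines) + "\n"

csv_text = to_csv(["Outlook", "Temperature"], [["sunny", 85], [None, 70]])
```

In this sketch the second data line starts with the separator, because the missing Outlook value is written as an empty cell.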
Tutorial Processes
Writing the Labor-Negotiations data set into a CSV file
Figure 1.17: Tutorial process ‘Writing the Labor-Negotiations data set into a CSV file’.
The ‘Labor-Negotiations’ data set is loaded using the Retrieve operator. The Write CSV operator is applied on it to write the ‘Labor-Negotiations’ data set in a CSV file. The csv file parameter
is provided with this path: ‘D:\Labor data set’. Thus a CSV file named ‘Labor data set’ is created
in the ‘D’ drive of your computer. All parameters are used with default values. The write attribute names parameter is set to true thus the first line of the resultant CSV file has the names
of the attributes of the ‘Labor-Negotiations’ data set. The quote nominal values parameter is
also set to true, thus all nominal values are quoted with double quotes in the CSV file. Files
written by the Write CSV operator can be loaded in RapidMiner using the Read CSV operator.
Write Excel
This operator writes an ExampleSet to an Excel spreadsheet file.
Description
The Write Excel operator can be used for writing an ExampleSet into a Microsoft Excel spreadsheet. This operator creates Excel files that are readable by Excel 95, 97, 2000, XP, 2003 and
newer versions. Missing data values in the ExampleSet are indicated by empty cells in the Excel
spreadsheet. The first row of the resultant Excel file has the names of attributes of the input
ExampleSet. Files written by the Write Excel operator can be loaded in RapidMiner using the
Read Excel operator.
Input Ports
input (inp) This input port expects an ExampleSet. It is the output of the Retrieve operator in the
attached Example Process.
Output Ports
through (thr) The ExampleSet that was provided at the input port is delivered through this
output port without any modifications. This is usually used to reuse the same ExampleSet
in further operators of the process.
file (fil) The created Excel file is provided as a file object that can be used with other operators
with file input ports like ‘Write File’.
Parameters
excel file (string) The path of the Excel file is specified here. It can be selected using the choose
a file button.
file format (selection) This parameter specifies whether the resulting Excel sheet should have the
xls or xlsx format.
encoding (selection) This is an expert parameter. It is shown only when the file format is xls.
Several encodings are available; any of them can be selected.
sheet name (string) This parameter is shown only when the file format is xlsx. It specifies
the name of the Excel sheet.
date format (string) This is an expert parameter. It is shown only when the file format is xlsx.
It specifies the format in which dates are saved.
number format (string) This is an expert parameter. It is shown only when the file format is
xlsx. It specifies the format in which numbers are saved.
Tutorial Processes
Writing the Labor-Negotiations data set into an Excel file
Figure 1.18: Tutorial process ‘Writing the Labor-Negotiations data set into an Excel file’.
The Labor-Negotiations data set is loaded using the Retrieve operator. The Write Excel operator is applied on it to write the Labor-Negotiations data set in an Excel file. The excel file parameter
is provided with this path: ‘D:\Labor data set.xls’. Thus an Excel file named ‘Labor data set’ is
created in the ‘D’ drive of your computer. Note that the first row of the resultant Excel file has
the names of attributes of the Labor-Negotiations data set. Also note that all missing values in
the Labor-Negotiations data set are represented by empty cells in the Excel file.
Write PMML
This operator saves the given model to an XML file in PMML 4.0 format.
Description
This operator writes the given model to an XML file in PMML 4.0 format. This format is
a standard for data mining models and is understood by many databases. It can be used for
applying data mining models directly in the database, so that a model can be applied on a regular
basis to huge amounts of data.
This operator supports the following models:
• Decision Tree Models
• Rule Models
• Naive Bayes models for nominal attributes
• Linear Regression Models
• Logistic Regression Models
• Centroid based Cluster models like models of k-means and k-medoids
Input Ports
model input (mod) The model input port.
Output Ports
model output (mod) The model output port.
Parameters
file Specifies the file for saving the PMML.
version Determines which PMML version should be used for export.
Write Special Format
This operator writes an ExampleSet or a subset of an ExampleSet in a special user-defined format.
Description
The path of the file is specified through the example set file parameter. The special format parameter is used for specifying the exact format. The character following the $ character introduces a
command. Additional arguments to this command may be supplied by enclosing them in square
brackets. The following commands can be used in the special format parameter:
• $a : This command writes all attributes separated by the default separator.
• $a[separator] : This command writes all attributes separated by a separator (the separator
is specified as an argument in brackets).
• $s[separator][indexSeparator] : This command writes in sparse format. The separator and
indexSeparator are provided as first and second arguments respectively. For all non zero
attributes the following strings are concatenated: the column index, the value of the indexSeparator, the attribute value. The attributes are separated by the specified separator.
• $v[name] : This command writes the values of a single attribute. The attribute name is
specified as an argument. This command can be used for writing both regular and special
attributes.
• $k[index] : This command writes the values of a single attribute. The attribute index is
specified as an argument. The indices start from 0. This command can be used for writing
only regular attributes.
• $l : This command writes the values of the label attribute.
• $p : This command writes the values of the predicted label attribute.
• $d : This command writes all prediction confidences for all classes in the form ‘conf(class)=value’
• $d[class] : This command writes the prediction confidences for the defined class as a simple
number. The required class is provided as an argument.
• $i : This command writes the values of the id attribute.
• $w : This command writes the example weights.
• $b : This command writes the batch number.
• $n : This command writes the newline character i.e. newline is inserted when this character is reached.
• $t : This command writes the tabulator character i.e. tab is inserted when this character
is reached.
• $$ : This command writes the dollar sign.
• $[ : This command writes the ‘[’ character i.e. the opening square bracket.
• $] : This command writes the ‘]’ character i.e. the closing square bracket.
Please make sure that the format string ends with $n or that the add line separator parameter is set
to true if you want examples to be separated by newlines.
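How such a format string is expanded for one example can be sketched with a small Python interpreter covering a subset of the commands listed above ($a, $v, $l, $p, $d, $i, $n, $t, $$, $[ and $]). This is a hedged illustration, not the operator's implementation. The function names and the dictionary layout of an example are hypothetical. The tab as default separator for $a, the space between the entries of $d and the unchanged copying of literal text are assumptions.

```python
LITERALS = {"n": "\n", "t": "\t", "$": "$", "[": "[", "]": "]"}

def expand(cmd, arg, example, default_separator):
    """Return the text produced by one command for one example."""
    if cmd in LITERALS:
        return LITERALS[cmd]
    if cmd == "a":
        sep = default_separator if arg is None else arg
        return sep.join(str(v) for v in example["regular"].values())
    if cmd == "v":  # regular or special attribute, selected by name
        merged = {**example["regular"], **example["special"]}
        return str(merged[arg])
    if cmd in ("l", "p", "i"):
        key = {"l": "label", "p": "prediction", "i": "id"}[cmd]
        return str(example["special"][key])
    if cmd == "d":
        conf = example["confidences"]
        if arg is None:
            return " ".join(f"conf({c})={v}" for c, v in conf.items())
        return str(conf[arg])
    raise ValueError("unsupported command: $" + cmd)

def render(fmt, example, default_separator="\t"):
    """Expand a format string; text outside commands is copied unchanged."""
    out, i = [], 0
    while i < len(fmt):
        if fmt[i] != "$":
            out.append(fmt[i])
            i += 1
            continue
        if i + 1 >= len(fmt):
            raise ValueError("format string ends with a single $")
        cmd, arg = fmt[i + 1], None
        i += 2
        # Only commands that accept an argument consume a following [...].
        if cmd in "avd" and i < len(fmt) and fmt[i] == "[":
            end = fmt.index("]", i)
            arg, i = fmt[i + 1:end], end + 1
        out.append(expand(cmd, arg, example, default_separator))
    return "".join(out)

example = {
    "regular": {"Outlook": "sunny", "Wind": "false"},
    "special": {"label": "no", "prediction": "yes", "id": 1},
    "confidences": {"yes": 0.6, "no": 0.4},
}
line = render("$[$l$]$t$p$t$d[yes]$t$d[no]$n", example)
```

For the made-up example above, the format string yields the label in square brackets, followed by the predicted label and the two confidences, separated by tabs and terminated by a newline.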
Input Ports
input (inp) This input port expects an ExampleSet. It is the output of the Apply Model operator in
the attached Example Process. The output of other operators can also be used as input.
Output Ports
through (thr) The ExampleSet that was provided at the input port is delivered through this
output port without any modifications. This is usually used to reuse the same ExampleSet
in further operators of the process.
Parameters
example set file (filename) The ExampleSet is written into the file specified through this parameter.
special format (string) This parameter specifies the exact format of the file. Many commands
are available for specifying the format. These commands are discussed in the description
of this operator.
fraction digits (integer) This parameter specifies the number of fraction digits in the output
file. This parameter is used for rounding off real numbers. Setting this parameter to -1
will write all possible digits i.e. no rounding off is done.
quote nominal values (boolean) This parameter indicates if nominal values should be quoted
with double quotes.
add line separator (boolean) This parameter indicates if each example should be followed by
a line break or not. If set to true, each example is followed by a line break automatically.
zipped (boolean) This parameter indicates if the data file content should be zipped or not.
overwrite mode (selection) This parameter indicates if an existing file should be overwritten
or data should be appended.
encoding (selection) This is an expert parameter. Several encodings are available; any of them
can be selected.
Tutorial Processes
Writing labeled data set in a user-defined format
Figure 1.19: Tutorial process ‘Writing labeled data set in a user-defined format’.
The k-NN classification model is trained on the ‘Golf’ data set. The trained model is then applied
on the ‘Golf-Testset’ data set using the Apply Model operator. The resulting labeled data set is
written to a file using the Write Special Format operator. Have a look at the parameters of the
Write Special Format operator. You can see that the ExampleSet is written into a file named
‘special’. The special format parameter is set to ‘ $[ $l $] $t $p $t $d[yes] $t $d[no]’. This format
string is composed of a number of commands; it can be interpreted as: ‘[label] predicted_label
confidence (yes) confidence (no)’. This format string states that four attributes shall be written
to the file, i.e. ‘label’, ‘predicted label’, ‘confidence (yes)’ and ‘confidence (no)’. The attributes
are separated by tabs, and the label attribute is enclosed in square brackets. Run the
process and see the written file for verification.
Write XRFF
Writes the values of all examples into an XRFF file.
Description
Writes values of all examples into an XRFF file which can be used by the machine learning library
Weka. The XRFF format is described in the XrffExampleSource operator which is able to read
XRFF files to make them usable with RapidMiner.
Please note that writing attribute weights is not supported. Please use the other RapidMiner
operators for loading and writing attribute weights for this purpose.
Input Ports
input (inp) This input port expects an ExampleSet.
Output Ports
through (thr) The ExampleSet that was provided at the input port is delivered through this
output port without any modifications. This is usually used to reuse the same ExampleSet
in further operators of the process.
file (fil) This port buffers the file object for passing it to the reader operators.
Parameters
example set file (filename) The path of the XRFF file is specified here. It can be selected
using the choose a file button.
encoding (selection) This is an expert parameter. A long list of encodings is provided; any
of them can be selected.
1.2 Database
Read Database
This operator reads an ExampleSet from a SQL database.
Description
The Read Database operator is used for reading an ExampleSet from the specified SQL database.
You need to have at least a basic understanding of databases, database connections and queries
in order to use this operator properly. Go through the parameters and Example Process to understand the flow of this operator.
When this operator is executed, the table delivered by the query will be copied into the memory
of your computer. This will give all subsequent operators fast access to the data. Even learning
schemes like the Support Vector Machine with their high number of random accesses will run
fast.
The Java ResultSetMetaData interface does not provide information about the possible values
of nominal attributes. The internal indices to which the nominal values are mapped depend on
the order in which the values appear in the table. This may cause problems only when processes are split up
into a training process and a testing process. It is not a problem for learning schemes which
are capable of handling nominal attributes. If a learning scheme like the SVM is used with nominal data, RapidMiner pretends that nominal attributes are numerical and uses the indices of the
nominal values as their numerical values. The SVM may perform well if there are only two possible values. If a test set is read in another process, the nominal values may be assigned different
indices, and hence the trained SVM is useless. This is not a problem for the label attributes,
since the classes can be specified using the classes parameter, and hence all learning schemes
intended for use with nominal data are safe to use. You might avoid this problem if you first
combine both ExampleSets using the Append operator and then split the result again using two Filter
Examples operators.
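The index problem described above can be reproduced with a minimal Python sketch. It only illustrates the idea of assigning indices in order of first appearance and is not RapidMiner's internal mapping code. The function name and the sample values are hypothetical.

```python
def build_mapping(values):
    """Assign an index to each nominal value in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return mapping

# Made-up nominal column read in two separate processes.
train = ["sunny", "overcast", "rain", "sunny"]
test = ["rain", "sunny", "overcast"]

train_map = build_mapping(train)  # sunny=0, overcast=1, rain=2
test_map = build_mapping(test)    # rain=0, sunny=1, overcast=2

# Workaround in the spirit of Append followed by a split: build one
# mapping from the combined data and use it for both parts.
shared_map = build_mapping(train + test)
test_encoded = [shared_map[v] for v in test]
```

Here ‘sunny’ receives index 0 in the training data but index 1 in the test data, so a model that treats the indices as numbers would see inconsistent inputs. With the shared mapping both parts use the same indices.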
Differentiation
• Execute SQL The Read Database operator is used for loading data from a database into
RapidMiner. The Execute SQL operator cannot be used for loading data from databases. It
can be used for executing SQL statements like CREATE or ADD etc on the database. See
page 802 for details.
Output Ports
output (out) This port delivers the result of the query on database in tabular form along with
the meta data. This output is similar to the output of the Retrieve operator.
Parameters
define connection (selection) This parameter indicates how the database connection should
be specified. It gives you three options: predefined, url and jndi.
connection (string) This parameter is only available when the define connection parameter is
set to predefined. This parameter is used to connect to the database using a predefined connection. You can have many predefined connections. You can choose one of them using the
drop down box. You can add a new connection or modify previous connections using the
button next to the drop down box. You may also accomplish this by clicking on the Manage
Database Connections... from the Tools menu in the main window. A new window appears.
This window asks for several details e.g. Host, Port, Database system, schema, username and
password. The Test button in this new window will allow you to check whether the connection can be made. Save the connection once the test is successful. After saving a new
connection, it can be chosen from the drop down box of the connection parameter. You
need to have a basic understanding of databases for configuring a connection.
database system (selection) This parameter is only available when the define connection parameter is set to url. This parameter is used to select the database system in use. It can
have one of the following values: MySQL, PostgreSQL, Sybase, HSQLDB, ODBC Bridge (e.g.
Access), Microsoft SQL Server (JTDS), Ingres, Oracle.
database url (string) This parameter is only available when the define connection parameter
is set to url. This parameter is used to define the URL connection string for the database,
e.g. ‘jdbc:mysql://foo.bar:portnr/database’.
username (string) This parameter is only available when the define connection parameter is
set to url. This parameter is used to specify the username of the database.
password (string) This parameter is only available when the define connection parameter is
set to url. This parameter is used to specify the password of the database.
jndi name (string) This parameter is only available when the define connection parameter is
set to jndi. This parameter specifies the JNDI name of a data source.
define query (selection) A query is a statement that is used to select required data from the
database. This parameter specifies whether the database query should be defined directly,
through a file or implicitly by a given table name. The SQL query can be auto generated giving a table name, passed to RapidMiner via a parameter or, in case of long SQL statements,
in a separate file. The desired behavior can be chosen using the define query parameter.
Please note that column names are often case sensitive and might need quoting.
query (string) This parameter is only available when the define query parameter is set to query.
This parameter is used to define the SQL query to select desired data from the specified
database.
query file (filename) This parameter is only available when the define query parameter is set
to query file. This parameter is used to select a file that contains the SQL query to select
desired data from the specified database. Long queries are usually stored in files. Storing
queries in files can also enhance reusability.
table name (string) This parameter is only available when the define query parameter is set to
table name. This parameter is used to select the required table from the specified database.
prepare statement (boolean) If checked, the statement is prepared, and ‘?’ can be filled in
using the parameters parameter.
parameters (enumeration) Parameters to insert into ‘?’ placeholders when statement is prepared.
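The idea behind the prepare statement and parameters parameters can be illustrated with Python's built-in sqlite3 module, which also uses ‘?’ placeholders. This is only a sketch of the concept. The operator itself works through a JDBC connection, and the in-memory database, the table layout and the rows used here are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the 'golf' database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE golf_table ("Outlook" TEXT, "Temperature" INTEGER, "Play" TEXT)')
conn.executemany(
    "INSERT INTO golf_table VALUES (?, ?, ?)",
    [("sunny", 85, "no"), ("overcast", 83, "yes"), ("sunny", 69, "yes")],
)

# The '?' placeholder is filled from the parameter list at execution time.
# Column names are quoted because they are often case sensitive.
query = ('SELECT "Outlook", "Temperature", "Play" FROM golf_table '
         'WHERE "Outlook" = ? ORDER BY "Temperature"')
rows = conn.execute(query, ("sunny",)).fetchall()
```

Keeping the value out of the SQL text means the same statement can be reused with different parameter values.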
Related Documents
• Execute SQL (page 802)
Tutorial Processes
Reading ExampleSet from a mySQL database
Figure 1.20: Tutorial process ‘Reading ExampleSet from a mySQL database’.
The Read Database operator is used to read a mySQL database. The define connection parameter is set to predefined. The connection parameter was configured using the button
next to the drop down box. The name of the connection was set to ‘mySQLconn’. The following
values were set in the connection parameter’s wizard. The Database system was set to ‘mySQL’.
The Host was set to ‘localhost’. The Port was set to ‘3306’. The Database scheme was set to
‘golf’; this is the name of the database. The User was set to ‘root’. No password was provided.
You will need a password if your database is password protected. Set all the values and test the
connection. Make sure that the connection works.
The define query parameter was set to ‘table name’. The table name parameter was set to
‘golf_table’ which is the name of the required table in the ‘golf’ database. Run the process; you
will see the entire ‘golf_table’ in the Results Workspace. The define query parameter is set to
‘table name’ if you want to read an entire table from the database. You can also read a selected
portion of the database by using queries. Set the define query parameter to ‘query’ and specify a
query in the query parameter. One sample query is already defined in this example. This query
reads only those examples from ‘golf_table’ where the ‘Outlook’ attribute has the value ‘sunny’.
Update Database
This operator updates the values of all examples with matching ID values in a database.
Description
The Update Database operator is used for updating an existing table in the specified SQL database.
You need to have at least a basic understanding of databases and database connections in order
to use this operator properly. Go through the parameters and the attached Example Process to
understand the flow of this operator.
The user can specify the database connection, a table name and ID column names. The most
convenient way of defining the necessary parameters is the Manage Database Connections wizard. The most important parameters (database URL and user name) will be automatically determined by this wizard.
The row(s) to update are specified via the db id attribute name parameter. If the id columns of
the table do not match all the id values of any given example, the row will be inserted instead.
The ExampleSet attribute names must be a subset of the table column names, otherwise the
operator will fail.
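The update-or-insert behavior described above can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions and not the operator's implementation. The function name, the table and the rows are hypothetical. Table and column names are inserted into the SQL text directly, so the sketch is only suitable for trusted names, and it expects at least one non-ID attribute.

```python
import sqlite3

def update_or_insert(conn, table, id_columns, example):
    """Update the row whose ID columns all match; insert it otherwise."""
    columns = list(example)
    value_columns = [c for c in columns if c not in id_columns]
    set_clause = ", ".join(f'"{c}" = ?' for c in value_columns)
    where_clause = " AND ".join(f'"{c}" = ?' for c in id_columns)
    params = [example[c] for c in value_columns] + [example[c] for c in id_columns]
    cursor = conn.execute(
        f'UPDATE "{table}" SET {set_clause} WHERE {where_clause}', params)
    if cursor.rowcount == 0:
        # No row matched all ID values, so the example is inserted instead.
        column_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(
            f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
            [example[c] for c in columns])
        return "inserted"
    return "updated"

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE golf_table ("id" INTEGER, "Outlook" TEXT, "Play" TEXT)')
conn.execute("INSERT INTO golf_table VALUES (1, 'sunny', 'no')")

first = update_or_insert(conn, "golf_table", ["id"],
                         {"id": 1, "Outlook": "rain", "Play": "yes"})
second = update_or_insert(conn, "golf_table", ["id"],
                          {"id": 2, "Outlook": "overcast", "Play": "yes"})
table_rows = conn.execute('SELECT * FROM golf_table ORDER BY "id"').fetchall()
```

The first call finds a row with id 1 and updates it. The second call finds no row with id 2 and therefore inserts a new one.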
Input Ports
input (inp) This input port expects an ExampleSet. It is the output of the Retrieve operator in the
attached Example Process.
Output Ports
through (thr) The ExampleSet that was provided at the input port is delivered through this
output port without any modifications. This is usually used to reuse the same ExampleSet
in further operators of the process.
Parameters
define connection (selection) This parameter indicates how the database connection should
be specified. It gives you three options: predefined, url and jndi.
connection (string) This parameter is only available when the define connection parameter is
set to predefined. This parameter is used for connecting to the database using a predefined
connection. You can have many predefined connections. You can choose one of them using
the drop down box. You can add a new connection or modify previous connections using
the button next to the drop down box. You may also accomplish this by clicking on Manage
Database Connections... from the Tools menu in the main window. A new window appears.
This window asks for several details e.g. Host, Port, Database system, schema, username
and password. The Test button in this new window will allow you to check whether the
connection can be made. Save the connection once the test is successful. After saving a
new connection, it can be chosen from the drop down box of the connection parameter. You
need to have basic understanding of databases for configuring a connection.
database system (selection) This parameter is only available when the define connection parameter is set to url. This parameter is used for selecting the database system in use. It
can have one of the following values: MySQL, PostgreSQL, Sybase, HSQLDB, ODBC Bridge
(e.g. Access), Microsoft SQL Server (JTDS), Ingres, Oracle.
database url (string) This parameter is only available when the define connection parameter
is set to url. This parameter is used for defining the URL connection string for the database,
e.g. ‘jdbc:mysql://foo.bar:portnr/database’.
username (string) This parameter is only available when the define connection parameter is
set to url. This parameter is used for specifying the username of the database.
password (string) This parameter is only available when the define connection parameter is
set to url. This parameter is used for specifying the password of the database.
jndi name (string) This parameter is only available when the define connection parameter is
set to jndi. This parameter specifies the JNDI name of the data source.
table name This parameter is used for selecting the required table from the specified database.
Please note that you can also type a table name here; if the table does not exist, it will be
created during writing.
attribute filter type (selection) This parameter selects the ID attributes whose values must all
match between the example set and the database for a row to be updated.
It has the following options:
• all Do not use this option. It does not make sense in this context and will break the process.
• single This option allows the selection of a single id attribute.
• subset This option allows the selection of multiple id attributes through a list. This
option will not work if the meta data is not known.
• regular_expression This option allows you to specify a regular expression for the
id attribute selection. When this option is selected some other parameters (regular
expression, use except expression) become visible in the Parameter panel.
• value_type This option allows selection of all the id attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. The user should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected
some other parameters (value type, use value type exception) become visible in the
Parameter panel.
• block_type This option is similar in working to the value_type option. This option
allows the selection of all the attributes of a particular block type. It should be noted
that block types may be hierarchical. For example value_series_start and value_series_end block types both belong to the value_series block type. When this option is selected some other parameters (block type, use block type exception) become visible
in the Parameter panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameter panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numerical condition.
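The single, subset and regular_expression options can be pictured with the following sketch. It is an illustration only: the function select_attributes and its keyword arguments are hypothetical names, and it assumes that a regular expression must match the whole attribute name.

```python
import re

def select_attributes(names, filter_type, *, attribute=None,
                      attributes=(), regular_expression=None):
    """Return the attribute names selected by a simplified filter type."""
    if filter_type == "single":
        return [n for n in names if n == attribute]
    if filter_type == "subset":
        wanted = set(attributes)
        return [n for n in names if n in wanted]
    if filter_type == "regular_expression":
        pattern = re.compile(regular_expression)
        # Assumption: the expression has to match the complete name.
        return [n for n in names if pattern.fullmatch(n)]
    raise ValueError(f"unsupported filter type: {filter_type}")

names = ["id", "a1", "a2", "label"]
single = select_attributes(names, "single", attribute="id")
subset = select_attributes(names, "subset", attributes=["id", "label"])
by_regex = select_attributes(names, "regular_expression",
                             regular_expression="a[0-9]")
```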
Tutorial Processes
Updating an ExampleSet in a mySQL database
Figure 1.21: Tutorial process ‘Updating an ExampleSet in a mySQL database’.
The ‘Iris’ data set is loaded using the Retrieve operator. The Update Database operator is used
to update an existing database table named “Test” in the “My connection” SQL database. Rows
in the example set and table which match on their “ID” column will be updated. If no match can
be found, the row will be inserted instead.
Write Database
This operator writes an ExampleSet to an SQL database.
Description
The Write Database operator writes an ExampleSet to the specified SQL database.
You need at least a basic understanding of databases and database connections in order
to use this operator properly. Go through the parameters and the attached Example Process to
understand the flow of this operator.
The user can specify the database connection and a table name. Please note that the table
will be created during writing if it does not exist. The most convenient way of defining the necessary parameters is the Manage Database Connections wizard. The most important parameters
(database URL and user name) will be automatically determined by this wizard. At the end, you
only have to define the table name. This operator only supports writing the complete ExampleSet, consisting of all regular and special attributes and all examples. If this is not desired,
apply preprocessing operators such as Select Attributes or Filter Examples before applying the Write Database operator. Data from database tables can be read in RapidMiner
by using the Read Database operator.
Input Ports
input (inp) This input port expects an ExampleSet. It is the output of the Retrieve operator in the
attached Example Process.
Output Ports
through (thr) The ExampleSet that was provided at the input port is delivered through this
output port without any modifications. This is usually used to reuse the same ExampleSet
in further operators of the process.
Parameters
define connection (selection) This parameter indicates how the database connection should
be specified. It gives you three options: predefined, url and jndi.
connection (string) This parameter is only available when the define connection parameter is
set to predefined. This parameter is used for connecting to the database using a predefined
connection. You can have many predefined connections. You can choose one of them using
the drop down box. You can add a new connection or modify previous connections using
the button next to the drop down box. You may also accomplish this by clicking on Manage
Database Connections... from the Tools menu in the main window. A new window appears.
This window asks for several details e.g. Host, Port, Database system, schema, username
and password. The Test button in this new window will allow you to check whether the
connection can be made. Save the connection once the test is successful. After saving a
new connection, it can be chosen from the drop down box of the connection parameter. You
need to have basic understanding of databases for configuring a connection.
database system (selection) This parameter is only available when the define connection parameter is set to url. This parameter is used for selecting the database system in use. It
can have one of the following values: MySQL, PostgreSQL, Sybase, HSQLDB, ODBC Bridge
(e.g. Access), Microsoft SQL Server (JTDS), Ingres, Oracle.
database url (string) This parameter is only available when the define connection parameter
is set to url. This parameter is used for defining the URL connection string for the database,
e.g. ‘jdbc:mysql://foo.bar:portnr/database’.
username (string) This parameter is only available when the define connection parameter is
set to url. This parameter is used for specifying the username of the database.
password (string) This parameter is only available when the define connection parameter is
set to url. This parameter is used for specifying the password of the database.
jndi name (string) This parameter is only available when the define connection parameter is
set to jndi. This parameter specifies the JNDI name of the data source.
table name This parameter is used for selecting the required table from the specified database.
Please note that you can also type a table name here; if the table does not exist, it will be
created during writing.
overwrite mode (selection) This parameter indicates if an existing table should be overwritten or data should be appended to the existing data.
set default varchar length (boolean) This parameter allows you to set varchar columns to
default length.
default varchar length (integer) This parameter is only available when the set default varchar length parameter is set to true. This parameter specifies the default length of varchar
columns.
add generated primary keys (boolean) This parameter indicates whether a new attribute
holding the auto generated primary keys should be added to the table in the database.
db key attribute name (string) This parameter is only available when the add generated primary keys parameter is set to true. This parameter specifies the name of the attribute for
the auto generated primary keys.
batch size (integer) This parameter specifies the number of examples which are written at
once with one single query to the database. Larger values can greatly improve the speed.
However, too large values can drastically decrease the performance. Moreover, some databases
have restrictions on the maximum number of values written at once.
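The effect of the batch size parameter can be sketched as follows. This is a hedged illustration using Python's built-in sqlite3 module as a stand-in database; the helper names, the table layout and the sample rows are assumptions, and the operator's real write logic may differ.

```python
import sqlite3

def chunks(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def write_in_batches(conn, rows, batch_size):
    """Insert rows batch by batch and return the number of batches sent."""
    batches = 0
    for batch in chunks(rows, batch_size):
        conn.executemany(
            "INSERT INTO golf_table (outlook, temperature) VALUES (?, ?)",
            batch,
        )
        batches += 1
    conn.commit()
    return batches

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE golf_table (outlook TEXT, temperature INTEGER)")
rows = [("sunny", 60 + i) for i in range(14)]  # synthetic sample data

batches = write_in_batches(conn, rows, batch_size=5)
written = conn.execute("SELECT COUNT(*) FROM golf_table").fetchone()[0]
```

With 14 rows and a batch size of 5, the rows are sent in three batches (5, 5 and 4 rows).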
Tutorial Processes
Writing an ExampleSet to a mySQL database
The ‘Golf’ data set is loaded using the Retrieve operator. The Write Database operator is used for
writing this data set to a mySQL database. The define connection parameter is set to predefined
and it is configured using the button next to the drop down box. The name of the connection
is set to ‘mySQLconn’. The following values are set in the connection parameter’s wizard: the
Database system is set to ‘mySQL’. The Host is set to ‘localhost’. The Port is set to ‘3306’. The
Database scheme is set to ‘golf’; this is the name of the database. The User is set to ‘root’. No
password is provided. You will need a password if your database is password protected. Set all
the values and test the connection. Make sure that the connection works.
Figure 1.22: Tutorial process ‘Writing an ExampleSet to a mySQL database’.
The table name parameter is set to ‘golf_table’, which is the name of the required table in the
‘golf’ database. Run the process; you will see the entire ‘golf_table’ in the Results Workspace.
You can also check the ‘golf’ database in phpMyAdmin to see the ‘golf_table’. You can read this
table from the database using the Read Database operator. Please study the Example Process of
the Read Database operator for more information.
1.3 NoSQL
1.3.1 Cassandra
Delete Cassandra
This operator deletes data from a Cassandra table. The input example set is expected to have an ID attribute which is used to define
the rows that will be deleted from Cassandra.
Description
The Delete Cassandra operator is used to delete data from a Cassandra table.
The data to be deleted is defined by the ID attribute of the provided example set. If the selected
table contains a compound primary key, additional attributes can be added to the key with the
parameter ‘primary key attributes’.
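To picture how additional key attributes extend the row selection, the sketch below builds a CQL DELETE statement whose WHERE clause covers every key column. The helper build_delete and the table and column names are hypothetical; the statement the operator actually sends may look different.

```python
def build_delete(table, id_attribute, additional_keys=()):
    """Build a CQL DELETE with one '?' placeholder per primary key column."""
    key_columns = [id_attribute, *additional_keys]
    where = " AND ".join(f"{column} = ?" for column in key_columns)
    return f"DELETE FROM {table} WHERE {where}"

simple_key = build_delete("users", "id")
compound_key = build_delete("users", "id", ["region"])
```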
Input Ports
input (inp) The example set that defines which data should be deleted from the Cassandra database.
Output Ports
output (out) The passed through example set.
Parameters
connection (configurable) The connection details for the Cassandra connection have to be
specified. If you have already configured a Cassandra connection, you can select it from
the drop-down list. If you have not configured a Cassandra connection yet, select the Cassandra icon to the right of the drop-down list. Create a new Cassandra connection in the Manage
connections box. The contact points and keyspace name are mandatory.
consistency level (selection) The consistency level for the Cassandra query. The consistency
level defines how many Cassandra nodes have to respond to the query in order to be successful. Possible levels are: ONE, TWO, THREE, QUORUM, ALL, ANY
• ONE A write must be written at least to one node.
• TWO A write must be written at least to two nodes.
• THREE A write must be written at least to three nodes.
• QUORUM A write must be written at least on a quorum of nodes. A quorum is calculated as (rounded down to a whole number): (replication_factor / 2) + 1. For example,
with a replication factor of 3, a quorum is 2 (can tolerate 1 node down). With a replication factor of 6, a quorum is 4 (can tolerate 2 nodes down).
• ALL A write must be written on all nodes in the cluster for that row key.
• ANY A write must be written to at least one node.
table name (string) Specify the table from which data should be deleted.
batch size (integer) Define the maximum number of rows which should be deleted with one
request.
primary key attributes (enumeration) If the selected Cassandra table has a compound primary key this parameter allows you to add more attributes to the primary key.
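The quorum formula given for the QUORUM consistency level, (replication_factor / 2) + 1 rounded down, can be checked with a few lines of Python. The function names are chosen for this illustration only.

```python
def quorum(replication_factor):
    """Number of replicas that form a quorum: floor(rf / 2) + 1."""
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor):
    """Replicas that may be down while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)

examples = {rf: (quorum(rf), tolerated_failures(rf)) for rf in (3, 6)}
```

For replication factors 3 and 6 this reproduces the values stated above: a quorum of 2 (1 node may be down) and a quorum of 4 (2 nodes may be down).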
Execute CQL
This operator is used to execute a CQL statement on a Cassandra
database.
Description
The Execute CQL operator is used to execute CQL statements on a Cassandra cluster. It cannot
return data though and therefore ‘SELECT’ will not yield any results.
Input Ports
file (fil) The CQL file which specifies the CQL statement to be executed. If the ‘define query’
parameter is set to the ‘query file’ option, the input port ‘file’ is used for the CQL file. Note:
If this input port is connected to the file output port of another operator, the file selected in
the ‘query file’ parameter is ignored.
through (thr) An arbitrary Input/Output (IO) object that is passed through the operator.
Output Ports
file (fil) If the input port ‘file’ is connected, the unchanged CQL file is returned.
through (thr) An arbitrary Input/Output (IO) object that is passed through the operator.
Parameters
connection (configurable) The connection details for the Cassandra connection have to be
specified. If you have already configured a Cassandra connection, you can select it from
the drop-down list. If you have not configured a Cassandra connection yet, select the Cassandra icon to the right of the drop-down list. Create a new Cassandra connection in the Manage
connections box. The contact points and keyspace name are mandatory.
consistency level (selection) The consistency level for the Cassandra query. The consistency
level defines how many Cassandra nodes have to respond to the query in order to be successful. Possible levels are: ONE, TWO, THREE, QUORUM, ALL, ANY
• ONE A write must be written at least to one node.
• TWO A write must be written at least to two nodes.
• THREE A write must be written at least to three nodes.
• QUORUM A write must be written at least on a quorum of nodes. A quorum is calculated as (rounded down to a whole number): (replication_factor / 2) + 1. For example,
with a replication factor of 3, a quorum is 2 (can tolerate 1 node down). With a replication factor of 6, a quorum is 4 (can tolerate 2 nodes down).
• ALL A write must be written on all nodes in the cluster for that row key.
• ANY A write must be written to at least one node.
define query (selection) This parameter selects how the query is defined.
• query Define a CQL query via the ‘query’ parameter.
• query file Load CQL query from file. If ‘file’ input port is connected, the query is
loaded from the provided file object.
query (string) The CQL query that defines the data that should be queried can be specified
here. It is shown if ‘define query’ is set to ‘query’. The operator cannot return data though
and therefore ‘SELECT’ will not yield any results.
query file (file) The CQL file which contains the CQL statement that defines the data that
should be queried can be specified here. It is shown if ‘define query’ is set to ‘query file’.
The operator cannot return data though and therefore ‘SELECT’ will not yield any results.
prepare statement (boolean) This parameter specifies whether the query will be a prepared
query or a normal query. If activated, the parameter ‘parameters’ is shown.
parameters (enumeration) If you have activated the ‘prepare statement’ checkbox, this parameter allows you to specify prepared values for the query. Every ‘?’ in the specified CQL
query will be replaced by the prepared values in the order they are listed in the Edit parameter list: parameters. Note: If you select the wrong type for a parameter, an error
message informs you about it.
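Positional binding of prepared values can be illustrated with any database interface that uses ‘?’ placeholders. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (it is not Cassandra and does not execute CQL); the table and values are made up for the example. Each ‘?’ receives the value at the same position in the parameter list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensors (id INTEGER, name TEXT)")

# Two placeholders, filled in list order: first '?' <- 7, second '?' <- "probe".
statement = "INSERT INTO sensors (id, name) VALUES (?, ?)"
parameters = [7, "probe"]
conn.execute(statement, parameters)

stored = conn.execute("SELECT id, name FROM sensors").fetchall()
```

Swapping the two list entries would bind "probe" to the id column, which is the kind of ordering or type mistake the note above warns about.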
Read Cassandra
This operator reads an example set from a Cassandra table.
Description
The example set to be read can be specified via a CQL statement, a CQL file or by specifying a
table name.
Input Ports
file (fil) The CQL file which specifies the CQL statement to be executed. If the ‘define query’
parameter is set to the ‘query file’ option, the input port ‘file’ is used for the CQL file. Note:
If this input port is connected to the file output port of another operator, the file selected in
the ‘query file’ parameter is ignored.
Output Ports
output (out) The example set specified via either the CQL statement or the table.
file (fil) If the input port ‘file’ is connected, the unchanged CQL file is returned.
Parameters
connection (configurable) The connection details for the Cassandra connection have to be
specified. If you have already configured a Cassandra connection, you can select it from
the drop-down list. If you have not configured a Cassandra connection yet, select the Cassandra icon to the right of the drop-down list. Create a new Cassandra connection in the Manage
connections box. The contact points and keyspace name are mandatory.
consistency level (selection) The consistency level for the Cassandra query. The consistency
level defines how many Cassandra nodes have to respond to the query in order to be successful. Possible levels are: ONE, TWO, THREE, QUORUM, ALL, ANY
• ONE A write must be written at least to one node.
• TWO A write must be written at least to two nodes.
• THREE A write must be written at least to three nodes.
• QUORUM A write must be written at least on a quorum of nodes. A quorum is calculated as (rounded down to a whole number): (replication_factor / 2) + 1. For example,
with a replication factor of 3, a quorum is 2 (can tolerate 1 node down). With a replication factor of 6, a quorum is 4 (can tolerate 2 nodes down).
• ALL A write must be written on all nodes in the cluster for that row key.
• ANY A write must be written to at least one node.
define query (selection) This parameter selects how the query is defined.
• query Define a CQL query via the ‘query’ parameter.
• query file Load CQL query from file. If ‘file’ input port is connected, the query is
loaded from the provided file object.
• query table Select a table to be loaded without defining a CQL query.
query (string) This parameter is only displayed when ‘define query’ is set to ‘query’. If you click in the ‘Edit text...’ field, the ‘Edit parameter: query’ editor opens and you
can specify the CQL query. Only SELECT statements are allowed.
query file (file) This parameter is only displayed when ‘define query’ is set to ‘query file’. You can select the file that contains the CQL statement that defines the data. Only
SELECT statements are allowed. Note: If the input port of the Read Cassandra operator is
connected to an Open File operator, this parameter is not displayed.
prepare statement (boolean) This parameter is displayed if you have selected either ‘query’ or ‘query file’ for the ‘define
query’ parameter. It specifies whether the query will be a prepared query or a normal query. If activated, the parameter ‘parameters’ is shown.
parameters (enumeration) If you have activated the ‘prepare statement’ checkbox, this parameter allows you to specify prepared values for the query. Every ‘?’ in the specified CQL
query will be replaced by the prepared values in the order they are listed in the Edit parameter list: parameters. Note: If you select the wrong type for a parameter, an error
message informs you about it.
table (string) If ‘define query’ is set to ‘query table’, this parameter is displayed. It allows you to
select the table that should be read.
datamanagement (selection) This parameter allows you to select the appropriate data type
for the internal data description.
Write Cassandra
This operator writes an example set to a Cassandra database.
Description
The ‘Write Cassandra’ operator writes an example set to a Cassandra database. The input example set is expected to have an ID attribute which is used as primary key for the selected Cassandra
table. If the table has a compound primary key, use the parameter ‘primary key attributes’ to add
more attributes as key attributes.
Input Ports
input (inp) Requires an example set read by an appropriate operator. The example set must
contain an ID attribute. Therefore, a Set Role operator must be added to the process in
order to specify the ID attribute.
Output Ports
output (out) The passed through example set that is written to the Cassandra database.
Parameters
connection (configurable) The connection details for the Cassandra connection have to be
specified. If you have already configured a Cassandra connection, you can select it from
the drop-down list. If you have not configured a Cassandra connection yet, select the Cassandra icon to the right of the drop-down list. Create a new Cassandra connection in the Manage
connections box. The contact points and keyspace name are mandatory.
consistency level (selection) The consistency level for the Cassandra query. The consistency
level defines how many Cassandra nodes have to respond to the query in order to be successful. Possible levels are: ONE, TWO, THREE, QUORUM, ALL, ANY
• ONE A write must be written at least to one node.
• TWO A write must be written at least to two nodes.
• THREE A write must be written at least to three nodes.
• QUORUM A write must be written at least on a quorum of nodes. A quorum is calculated as (rounded down to a whole number): (replication_factor / 2) + 1. For example,
with a replication factor of 3, a quorum is 2 (can tolerate 1 node down). With a replication factor of 6, a quorum is 4 (can tolerate 2 nodes down).
• ALL A write must be written on all nodes in the cluster for that row key.
• ANY A write must be written to at least one node.
table name (string) Name of the table to which the example set should be written. If a table
with the same name already exists, it is updated, provided the example set is compatible, i.e., attribute names and types match. In case the table does not exist yet, a new
table with this name is created and the example set is written to this table. The ID attribute
of the example set is used as primary key. In case index columns should be defined for the
newly created table, use the parameter ‘index columns’.
batch size (integer) This parameter defines the maximum number of rows which should be
written with one request. Default value is 1000.
primary key attributes (enumeration) If the Cassandra table already exists and has a compound primary key, you can add more attributes to the primary key that is used to store the
example set. If the Cassandra table does not exist yet, you can add primary key attributes
in the Edit parameter list: primary key attributes to create a compound primary key. This
primary key consists of the ID attribute and the selected attributes.
index columns (enumeration) This option is only required in case the Cassandra table does
not exist yet. It allows you to define columns as index columns for the newly created table
in the Edit parameter list: index columns.
use ttl (boolean) If the checkbox is activated, an additional parameter ‘ttl’ (Time To Live) is
displayed. The parameter allows you to specify a time interval value in seconds for the
written data. If set, the inserted values are automatically removed from the database after
the specified time interval. Note: This remove action affects only the inserted values, not
the columns themselves. This means that any subsequent update of the column will reset
the ‘ttl’ value. By default, values are never removed.
ttl (integer) If the ‘use ttl’ checkbox is activated, you can specify a value in seconds. By default
this value is 120 seconds. You can enter any integer >= 1.
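In CQL, a time to live is attached to an insert with the USING TTL clause. The sketch below builds such a statement as a string to show what the ‘ttl’ value means; the helper build_insert and the table and column names are hypothetical, and the statement the operator generates internally is not documented here.

```python
def build_insert(table, columns, ttl_seconds=None):
    """Build a CQL INSERT, optionally with a USING TTL clause (in seconds)."""
    if ttl_seconds is not None and ttl_seconds < 1:
        raise ValueError("ttl must be an integer >= 1")
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    cql = f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"
    if ttl_seconds is not None:
        # Values written by this statement expire after ttl_seconds.
        cql += f" USING TTL {ttl_seconds}"
    return cql

without_ttl = build_insert("events", ["id", "payload"])
with_ttl = build_insert("events", ["id", "payload"], ttl_seconds=120)
```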
1.3.2 MongoDB
Delete MongoDB
Deletes a set of MongoDB documents.
Description
This operator can be used to delete documents from the specified MongoDB collection. The
default configuration of the operator assumes that documents are deleted via their ID, however,
more general deletion queries are supported as well.
Input Ports
documents (doc) The documents to be deleted from the specified MongoDB collection.
Output Ports
documents (doc) The documents that have been deleted from the collection. This collection
is a subset of the input collection: skipped documents are not included.
Parameters
mongodb instance (Configurable) The MongoDB instance to be used for storing the documents.
write concern (Selection) The write concern which controls the acknowledgment of write operations by MongoDB. See the MongoDB documentation for details.
collection (String) The MongoDB collection in which the documents are stored.
require id (Boolean) If checked the operator requires documents to include a MongoDB ID,
i.e., to include the “_id” field. Documents missing an ID are considered invalid. Otherwise,
all documents are passed to the database.
skip invalid documents (Boolean) If checked, invalid documents (i.e., not in JSON format)
are skipped and a warning is logged. Otherwise, the process execution is stopped.
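The interplay of the require id and skip invalid documents parameters can be sketched in plain Python. This is an assumption-laden illustration (the function filter_documents is hypothetical and no MongoDB connection is involved): documents that are not valid JSON objects, or that lack an "_id" field when one is required, are either skipped with a warning or cause an error.

```python
import json
import logging

def filter_documents(raw_documents, require_id=True, skip_invalid=True):
    """Return the parsed documents that would be passed on to the database."""
    valid = []
    for raw in raw_documents:
        try:
            document = json.loads(raw)
            if not isinstance(document, dict):
                raise ValueError("not a JSON object")
            if require_id and "_id" not in document:
                raise ValueError("missing _id field")
        except ValueError as error:  # json.JSONDecodeError is a ValueError
            if not skip_invalid:
                raise
            logging.warning("Skipping invalid document: %s", error)
            continue
        valid.append(document)
    return valid

raw = ['{"_id": 1, "name": "a"}', '{"name": "no id"}', "not json"]
kept = filter_documents(raw)
kept_without_id_check = filter_documents(raw, require_id=False)
```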
Execute MongoDB Command
Runs a user specified command on the MongoDB instance.
Description
This operator can be used to execute arbitrary MongoDB commands. Commands are specified
and results returned via JSON/BSON documents. For instance, the command {"create": "myCollection"} creates a new collection of the name “myCollection”.
Input Ports
command (com) The database command to be executed (a JSON/BSON document). Alternatively, you can specify this document via the command parameter. Note that this parameter is only visible if no document is connected to the input port.
Output Ports
result (res) The document containing the results of the MongoDB database command.
Parameters
mongodb instance (Configurable) The MongoDB instance to be used to run the command.
command (String) The database command to be executed (a JSON/BSON document). Alternatively, you can specify this document via the command input port.
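A command document is ordinary JSON, so it can be assembled and serialized with Python's json module before being pasted into the command parameter. The sketch below is illustrative; the extra "capped" and "size" options are standard options of MongoDB's create command, and the collection name is taken from the example above.

```python
import json

# The command name ("create") must be the first field of the document.
# Python dictionaries keep insertion order, and json.dumps preserves it.
command = {"create": "myCollection", "capped": True, "size": 1048576}
command_text = json.dumps(command)
```

The resulting string is {"create": "myCollection", "capped": true, "size": 1048576}.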
Read MongoDB
Reads documents from a MongoDB collection.
Description
This operator retrieves a collection of documents from the specified MongoDB collection. The
query criteria, the query projection and sorting criteria can be specified via JSON/BSON documents.
Input Ports
criteria (cri) The query criteria which can be used to select only specific documents (a JSON/BSON
document). Alternatively, you can specify this document via the criteria parameter. Note
that the parameter is only visible if no document is connected to this input port.
projection (pro) The query projection which can be used to include/exclude specific fields from
the results (a JSON/BSON document). Alternatively, you can specify this document via the
projection parameter. Note that the parameter is only visible if no document is connected
to this input port.
sorting criteria (sor) The sorting criteria which can be used to sort the returned documents in
a specific order (a JSON/BSON document). Alternatively, you can specify this document
via the sort documents and sorting criteria parameters. Note that these parameters are only
visible if no document is connected to this input port.
Output Ports
collection (col) The documents retrieved from the MongoDB collection.
Parameters
mongodb instance (configurable) The MongoDB instance to be used for storing the documents.
collection (string) The MongoDB collection in which the documents are stored.
criteria (String) The query criteria which can be used to select only specific documents (a JSON/BSON
document). Alternatively, you can specify this document via the criteria input port.
projection (String) The query projection which can be used to include/exclude specific fields
from the results (a JSON/BSON document). Alternatively, you can specify this document
via the projection input port.
sort documents (boolean) If checked, a sorting criteria document can be specified to sort the
query results. Alternatively, you can enable this behavior by connecting a sorting criteria
document to the sorting input port.
sorting criteria (String) The sorting criteria which can be used to sort the returned documents in a specific order (a JSON/BSON document). Alternatively, you can specify this
document via the sorting input port.
limit results (boolean) Whether the number of results should be limited.
limit (integer) The number of documents to be queried.
skip (integer) The number of documents to be skipped.
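The criteria, projection and sorting criteria parameters each take a JSON document in MongoDB's query syntax. The sketch below builds three such documents with Python's json module and emulates skip and limit on a plain list; the field names (status, age, name) are invented for the example, and no MongoDB connection is used.

```python
import json

# Select documents with status "active" and age of at least 18.
criteria = {"status": "active", "age": {"$gte": 18}}
# Return only name and age; suppress the _id field.
projection = {"name": 1, "age": 1, "_id": 0}
# Sort by age in descending order (-1); 1 would sort ascending.
sorting = {"age": -1}

criteria_text = json.dumps(criteria)
projection_text = json.dumps(projection)
sorting_text = json.dumps(sorting)

def page(documents, skip, limit):
    """Emulate skip and limit: drop the first skip items, keep limit items."""
    return documents[skip:skip + limit]

third_to_fifth = page(list(range(10)), skip=2, limit=3)
```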
Update MongoDB
Updates one or more documents in a MongoDB collection.
Description
This operator updates one or more documents in the specified MongoDB collection. An update
can thereby refer to the replacement of a document or to the modification of individual fields.
The update consists of two parts: the query criteria to identify the document(s) and an update
object that contains the new data.
The default update behavior of MongoDB is to replace entire documents. Special BSON operators are required to update individual fields. However, this operator tries to update individual
fields by default to prevent data loss.
Input Ports
criteria (cri) The JSON/BSON document to identify the document(s) to update.
update (upd) The JSON/BSON document containing the updated data. If using BSON operators
such as “$set”, ensure that the parameter “update individual fields” is disabled.
Output Ports
criteria (cri) Pass through of the input criteria document (if any).
update (upd) Pass through of the input update document.
Parameters
mongodb instance (configurable) The MongoDB instance to be used for storing the documents.
write concern (selection) The write concern which controls the acknowledgment of write operations by MongoDB. See the MongoDB documentation for details.
collection (string) The MongoDB collection in which the documents are stored.
update individual fields (boolean) If checked, the operator uses the MongoDB operator “$set”
to update the fields of the provided update object without replacing other data. Otherwise,
the operator simply replaces matching documents with the provided update document.
insert unmatched documents (upsert) (boolean) If checked, the operator adds the update
document to the collection when no document matches the query criteria. Otherwise, the
collection remains unchanged.
update multiple documents (boolean) If checked, all documents that match the query criteria are updated. Otherwise, only the first match is updated.
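The difference between updating individual fields and replacing a whole document can be emulated on plain Python dictionaries. This is only a sketch of the semantics described above, not of the operator's code; the function apply_update and the sample fields are hypothetical.

```python
def apply_update(document, update, update_individual_fields=True):
    """Return the document as it would look after the update."""
    if update_individual_fields:
        # Like "$set": overwrite or add the given fields, keep everything else.
        merged = dict(document)
        merged.update(update)
        return merged
    # Replacement: only the update document survives; MongoDB keeps the _id.
    replaced = dict(update)
    if "_id" in document:
        replaced.setdefault("_id", document["_id"])
    return replaced

stored = {"_id": 1, "name": "Ann", "city": "Bonn"}
update = {"city": "Kiel"}

field_update = apply_update(stored, update)
replacement = apply_update(stored, update, update_individual_fields=False)
```

With individual field updates the name field survives; with replacement it is lost, which is the data loss the default setting is meant to prevent.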
Write MongoDB
Writes documents to a MongoDB collection.
Description
This operator stores JSON/BSON documents in the specified MongoDB collection.
Input Ports
documents (doc) The example set(s) containing the entries which should be transformed to
JSON documents.
Output Ports
documents (doc) The documents that have been written to the collection. This collection is
a subset of the input collection: skipped documents are not included.
Parameters
mongodb instance (configurable) The MongoDB instance to be used for storing the documents.
write concern (selection) The write concern which controls the acknowledgment of write operations by MongoDB. See the MongoDB documentation for details.
collection (string) The MongoDB collection in which the documents are stored.
skip invalid documents (Boolean) If checked, invalid documents (i.e., not in JSON format)
are skipped and a warning is logged. Otherwise, the process execution is stopped.
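The effect of “skip invalid documents” can be sketched as follows. The sketch assumes that a valid document is a JSON object at the top level; the function names and the logger name are hypothetical and do not reflect the operator's internals.

```python
import json
import logging
from typing import Any, Dict, Iterable, List

logger = logging.getLogger("write_mongodb_sketch")  # hypothetical logger name


def parse_document(text: str) -> Dict[str, Any]:
    """Parse one document; raise ValueError if it is not a JSON object."""
    parsed = json.loads(text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(parsed, dict):
        raise ValueError("top-level JSON value is not an object")
    return parsed


def select_valid_documents(raw_documents: Iterable[str],
                           skip_invalid: bool = True) -> List[Dict[str, Any]]:
    """Return the documents that would be written to the collection."""
    valid = []
    for position, text in enumerate(raw_documents):
        try:
            valid.append(parse_document(text))
        except ValueError as error:
            if not skip_invalid:
                # Corresponds to stopping the process execution.
                raise ValueError(f"document {position} is invalid") from error
            logger.warning("skipping document %d: %s", position, error)
    return valid


print(select_valid_documents(['{"a": 1}', "not json", "[1, 2]"]))
```

The returned list corresponds to what the output port delivers: a subset of the input without the skipped documents.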
1.3.3 Solr
Add to Solr (Data)
This operator adds an example set to Solr.
Description
To connect to a Solr server, you have to specify a Solr connection. This comprises the URL of a
Solr server and an optional user/password combination for authentication. Typically, the Solr
server URL ends with the string ‘/solr’.
The next step is to select a collection on the server. A collection can be imagined as a table. It
is composed of several columns, which are called Solr fields. A Solr field has a type (e.g. number)
and a key (the name of the column). Each entry in Solr can be imagined as a row and contains
values for the respective fields.
A RapidMiner example set has a very similar structure: it can also be imagined as a table.
Therefore, every example (row) of the RapidMiner example set is added as a row in Solr, and the RapidMiner attributes are used
as Solr collection fields.
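The row-to-entry mapping described above can be pictured as turning each row of a table into one key/value document. The following sketch uses made-up attribute names and a hypothetical helper; it does not show how the operator talks to Solr.

```python
import json
from typing import Any, Dict, List, Sequence


def rows_to_solr_documents(attribute_names: Sequence[str],
                           rows: Sequence[Sequence[Any]]) -> List[Dict[str, Any]]:
    """Map each example (row) to one Solr entry keyed by attribute name."""
    return [dict(zip(attribute_names, row)) for row in rows]


# Made-up example data: three attributes, two examples.
attributes = ["id", "name", "popularity"]
examples = [["1", "alpha", 6], ["2", "beta", 3]]
print(json.dumps(rows_to_solr_documents(attributes, examples), indent=2))
```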
Input Ports
input (inp) This port connects the example set, which has to be added.
Output Ports
through (thr) The added example set is provided at this port.
Parameters
connection (configurable) The connection details for the Solr connection have to be specified. If you have already configured a Solr connection, you can select it from the drop-down
list. If you have not configured a Solr connection yet, select the icon to the right of the
drop-down list. Create a new Solr connection in the Manage connections dialog. The Solr
server URL is required. Additionally, you can specify a username/password combination
for authentication.
collection (string) Provide the name of the Solr collection, which has to be used to access data.
Add to Solr (Documents)
This operator adds collections of documents to Solr.
Description
To connect to a Solr server, you have to specify a Solr connection. This comprises the URL of a
Solr server and an optional user/password combination for authentication. Typically, the Solr
server URL ends with the string ‘/solr’.
The next step is to select a collection on the server. A collection can be imagined as a table. It
is composed of several columns, which are called Solr fields. A Solr field has a type (e.g. number)
and a key (the name of the column). Each entry in Solr can be imagined as a row and contains
values for the respective fields.
A RapidMiner document has a set of metadata records, which consist of a key and a related
value. The metadata keys are mapped to the Solr attributes. RapidMiner documents have an
additional body. Therefore you can select a Solr field, in which the document body will be stored.
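As a rough sketch of this mapping, the metadata of one document and its body can be combined into a single set of Solr fields. The helper name is hypothetical, and rejecting a metadata key that collides with the body field is an assumption made for this example, not documented operator behavior.

```python
from typing import Any, Dict


def document_to_solr_fields(metadata: Dict[str, Any],
                            body: str,
                            document_body_field: str) -> Dict[str, Any]:
    """Combine metadata keys and the document body into one Solr entry."""
    if document_body_field in metadata:
        # Assumption for this sketch: a name clash is treated as an error.
        raise ValueError(f"metadata already contains '{document_body_field}'")
    fields = dict(metadata)
    fields[document_body_field] = body
    return fields


print(document_to_solr_fields({"author": "jane", "year": 2017},
                              "Full text of the document.",
                              document_body_field="text"))
```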
Input Ports
documents (doc) This port connects a collection of documents, which has to be added. This
port is extendable.
Output Ports
documents (doc) The added collection of documents is provided at this port. This port is
extendable.
Parameters
connection (configurable) The connection details for the Solr connection have to be specified. If you have already configured a Solr connection, you can select it from the drop-down
list. If you have not configured a Solr connection yet, select the icon to the right of the
drop-down list. Create a new Solr connection in the Manage connections dialog. The Solr
server URL is required. Additionally, you can specify a username/password combination
for authentication.
collection (string) Provide the name of the Solr collection, which has to be used to access data.
document body field (string) The Solr field, which is used for the RapidMiner document body.
Search Solr (Data)
This operator searches for Solr entries and generates an example set.
Description
To connect to a Solr server, you have to specify a Solr connection. This comprises the URL of a
Solr server and an optional user/password combination for authentication. Typically, the Solr
server URL ends with the string ‘/solr’.
The next step is to select a collection on the server. A collection can be imagined as a table. It
is composed of several columns, which are called Solr fields. A Solr field has a type (e.g. number)
and a key (the name of the column). Each entry in Solr can be imagined as a row and contains
values for the respective fields.
A RapidMiner example set has a very similar structure: it can also be imagined as a table.
Therefore, every Solr entry (row) becomes an example (row) in RapidMiner, and the Solr collection fields are used as
RapidMiner attributes.
To search Solr, you have to specify a query string. You can add filters to refine your query.
For example, to exclude all items whose field “popularity” has the value “6”, use
“!popularity:6”. The range of entries to receive can be set with the parameters offset and limit.
You can specify which field is used to sort the received entries. It is also possible to enable
faceting. Faceted search breaks up search results into multiple categories. Use “categorical facets”
and “date facets” to specify Solr fields for faceting.
If a Solr field supports multiple elements, the related values are provided as a JSON array.
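For orientation, the search settings described above correspond closely to standard Solr request parameters (`q`, `fq`, `start`, `rows`, `sort`, `facet`, `facet.field`). The sketch below builds such a query string; how the operator itself assembles its request is not documented here, so the mapping and the function name are assumptions.

```python
from typing import Optional, Sequence
from urllib.parse import parse_qs, urlencode


def build_solr_params(query: str,
                      filter_query: Optional[str] = None,
                      offset: int = 0,
                      limit: int = 10,
                      sort_field: Optional[str] = None,
                      sort_order: str = "asc",
                      facet_fields: Sequence[str] = ()) -> str:
    """Return a URL-encoded parameter string using standard Solr names."""
    params = [("q", query), ("start", str(offset)), ("rows", str(limit))]
    if filter_query:
        params.append(("fq", filter_query))
    if sort_field:
        params.append(("sort", f"{sort_field} {sort_order}"))
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return urlencode(params)


encoded = build_solr_params("name:John", filter_query="!popularity:6",
                            offset=20, limit=5, facet_fields=["category"])
print(encoded)
print(parse_qs(encoded))
```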
Output Ports
output (out) This port provides the main search result. It consists of an example set.
facets (fac) This port is used to provide results of the faceted search. An example set is provided
and contains the field name, the value which was found, and the number of occurrences.
Parameters
connection (configurable) The connection details for the Solr connection have to be specified. If you have already configured a Solr connection, you can select it from the drop-down
list. If you have not configured a Solr connection yet, select the icon to the right of the
drop-down list. Create a new Solr connection in the Manage connections dialog. The Solr
server URL is required. Additionally, you can specify a username/password combination
for authentication.
collection (string) Provide the name of the Solr collection, which has to be used to access data.
query (string) The term to search for.
filter query (string) A filter that restricts the results without influencing the relevancy score, which is the default
sort order. With this field, you can refine your query. E.g. if the field name has to contain
John, but must not contain Doe, you can use ‘name:John -name:Doe’.
offset (integer) The first document index to fetch.
limit (integer) The maximum number of results.
sort (boolean) Specifies whether search results are sorted.
sort field (string) The Solr field which is used for sorting.
sort order (selection) The sorting order of results.
faceted search (boolean) Specifies whether faceted searching is used.
categorical facets (enumeration) The facets to use for faceted search.
date facets (enumeration) The date facets to use for faceted search. A single date facet consists of the field name, a start date, an end date, and a gap.
include generated fields (boolean) Specifies whether automatically generated fields are included
in the search results. These fields can consist of SolrCloud fields or can be based on dynamic
Solr fields.
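A date facet with a field name, start date, end date, and gap resembles Solr's range faceting parameters. The sketch below expresses one date facet with the per-field `facet.range` parameter names; whether the operator uses exactly these parameters is an assumption, and the field name and dates are made up.

```python
from typing import List, Tuple


def date_facet_params(field: str, start: str, end: str, gap: str) -> List[Tuple[str, str]]:
    """Express one date facet as Solr range-faceting parameters."""
    prefix = f"f.{field}.facet.range"
    return [
        ("facet", "true"),
        ("facet.range", field),
        (f"{prefix}.start", start),
        (f"{prefix}.end", end),
        (f"{prefix}.gap", gap),
    ]


# Made-up example: monthly buckets over the first half of 2017.
for name, value in date_facet_params("created", "2017-01-01T00:00:00Z",
                                     "2017-07-01T00:00:00Z", "+1MONTH"):
    print(name, "=", value)
```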
Search Solr (Documents)
This operator searches for Solr entries and generates a document for each result.
Description
To connect to a Solr server, you have to specify a Solr connection. This comprises the URL of a
Solr server and an optional user/password combination for authentication. Typically, the Solr
server URL ends with the string ‘/solr’.
The next step is to select a collection on the server. A collection can be imagined as a table. It
is composed of several columns, which are called Solr fields. A Solr field has a type (e.g. number)
and a key (the name of the column). Each entry in Solr can be imagined as a row and contains
values for the respective fields.
A RapidMiner document has a set of metadata records, which consist of a key and a related
value. The metadata keys are mapped to the Solr attributes. RapidMiner documents have an
additional body. Therefore you can select a Solr field, whose contents will be stored in the RapidMiner document body.
To search Solr, you have to specify a query string. You can add filters to refine your query.
For example, to exclude all items whose field “popularity” has the value “6”, use
“!popularity:6”. The range of entries to receive can be set with the parameters offset and limit.
You can specify which field is used to sort the received entries. It is also possible to enable
faceting. Faceted search breaks up search results into multiple categories. Use “categorical facets”
and “date facets” to specify Solr fields for faceting.
If a Solr field supports multiple elements, the related values are provided as a JSON array.
Output Ports
output (out) This port provides the main search result. It consists of a collection of documents.
facets (fac) This port is used to provide results of the faceted search. An example set is provided
and contains the field name, the value which was found, and the number of occurrences.
Parameters
connection (configurable) The connection details for the Solr connection have to be specified. If you have already configured a Solr connection, you can select it from the drop-down
list. If you have not configured a Solr connection yet, select the icon to the right of the
drop-down list. Create a new Solr connection in the Manage connections dialog. The Solr
server URL is required. Additionally, you can specify a username/password combination
for authentication.
collection (string) Provide the name of the Solr collection, which has to be used to access data.
query (string) The term to search for.
document body field (string) The Solr field, which is used as the RapidMiner document body.
filter query (string) A filter that restricts the results without influencing the relevancy score, which is the default
sort order. With this field, you can refine your query. E.g. if the field name has to contain
John, but must not contain Doe, you can use ‘name:John -name:Doe’.
offset (integer) The first document index to fetch.
limit (integer) The maximum number of results.
sort (boolean) Specifies whether search results are sorted.
sort field (string) The Solr field which is used for sorting.
sort order (selection) The sorting order of results.
faceted search (boolean) Specifies whether faceted searching is used.
categorical facets (enumeration) The facets to use for faceted search.
date facets (enumeration) The date facets to use for faceted search. A single date facet consists of the field name, a start date, an end date, and a gap.
include generated fields (boolean) Specifies whether automatically generated fields are included
in the search results. These fields can consist of SolrCloud fields or can be based on dynamic
Solr fields.
1.4 Applications
Trigger Zapier
This operator allows you to use the Zapier service connecting to a
huge collection of data sinks.
Description
Zapier works by defining triggers and actions and combining them into Zaps. The operator can
be used as such a trigger. Zapier can then be used to send data to arbitrary actions. Each example
in your example set will trigger one action. Note: The Trigger Zapier operator therefore always
needs another operator that provides an example set.
To use this operator, perform the following steps.
• Create an account on www.zapier.com and create a new Zap. (You can quickly get there by
clicking the button next to the “zapier url” parameter of the “Trigger Zapier” operator.)
• Select “RapidMiner” as the trigger service and “Trigger Zapier” operator as the trigger.
• Select an action service from the list and an action.
• In step 2 of the Zap creation (“Select RapidMiner account”), a URL is provided that you
must first copy to the clipboard and then paste as the zapier url parameter of this operator.
• In step 3 of the Zap creation (“Select an action service account”), log in to the action service
and grant access to the Zapier service.
• The fields you can select in step 5 (“Match up”) correspond to the attributes of the example
set received by this operator. Note: When you use this operator for the first time, the list
may be empty. Therefore, it is recommended to execute a dry run of your process in RapidMiner Studio to let Zapier know which attributes to expect. To that end, check the
test hook parameter and run your process. You may have to refresh the
Zapier page afterwards. Then you should be able to pick appropriate attributes from the
attribute list.
• In step 7, save and activate your trigger on the Zapier web page. For a production run, make
sure that the test hook parameter is switched off.
If you run your process, you should see that one action in Zapier is triggered for each example
in your example set.
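The “one action per example” behavior can be pictured as one HTTPS request per row. The sketch below only constructs the requests and does not send them. The JSON payload shape, the helper name, and the placeholder URL are assumptions for illustration; the format the operator actually sends is not documented here.

```python
import json
from typing import Any, Dict, List, Sequence
from urllib.request import Request


def build_zap_requests(zapier_url: str,
                       examples: Sequence[Dict[str, Any]]) -> List[Request]:
    """Build one POST request per example; nothing is sent over the network."""
    if not zapier_url.startswith("https://"):
        raise ValueError("the zapier url must use HTTPS")
    requests = []
    for example in examples:
        body = json.dumps(example).encode("utf-8")  # assumed payload format
        requests.append(Request(zapier_url, data=body,
                                headers={"Content-Type": "application/json"},
                                method="POST"))
    return requests


# Placeholder URL; the real one is shown by Zapier during Zap creation.
hook_url = "https://hooks.example.invalid/catch/123/"
rows = [{"customer": "a", "score": 0.9}, {"customer": "b", "score": 0.4}]
print(len(build_zap_requests(hook_url, rows)), "requests prepared")
```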
Input Ports
example set in (exa) This will trigger your Zap once for each example in the example set.
Output Ports
example set in (exa) The same example set as received as an input.
Parameters
zapier url (HTTPS url) The URL to which your requests are sent. Zapier shows the URL in step
2 (“Select a RapidMiner account”). Make sure to use HTTPS.
test hook (boolean) If checked, only test messages will be sent to Zapier. These will not trigger your Zap. You can use test mode to populate the choices in step 5 (“Match up”) of the
Zap editor.
1.4.1 Salesforce
Delete Salesforce
This operator deletes records of a Salesforce object from the input
example set.
Description
This operator deletes the entries of a Salesforce object that are listed in the input example set from the specified
Salesforce instance. Each example of the input data deletes one Salesforce record, identified
by the ID attribute.
If the skip invalid rows parameter is selected, each example for which the deletion in Salesforce
failed will be ignored.
Input Ports
input (inp) The example set containing the entries which should be deleted. The entries are
identified by a Salesforce ID.
Output Ports
through (thr) The unmodified input example set.
Parameters
connection (configurable) The connection details for the Salesforce connection have to be
specified. If you have already configured a Salesforce connection, you can select it from
the drop-down list. If you have not configured a Salesforce connection yet, select the icon
to the right of the drop-down list. Create a new Salesforce connection in the Manage connections box. This includes username, password and the security token. The URL is predefined but can be changed to work on a different API version.
skip invalid rows (boolean) If selected, skips and ignores failed deletions of a record. In such
cases, invalid deletion IDs will be skipped. If not selected, the process will fail if a record
cannot be deleted and revert all previous deletions.
Read Salesforce
This operator creates an example set from a Salesforce object. Each
record is represented by an example containing an attribute for
each field.
Description
This operator reads an example set from a Salesforce object in the specified Salesforce instance.
Each record is represented by an example containing an attribute for each field.
You can either use the simplified user interface to create the query, or use the advanced SOQL
editor which allows you to directly enter SOQL queries.
Note that datetime fields are always treated as UTC, using the pattern “yyyy-MM-dd’T’HH:mm:ss.SSSX”.
Date fields use “yyyy-MM-dd” and time fields “HH:mm:ss.SSSX”.
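The patterns above follow Java's date format notation, in which “SSS” stands for milliseconds and “X” for an ISO 8601 zone designator that is written as “Z” for UTC. A minimal Python sketch that produces matching strings is shown below; the function names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def format_salesforce_datetime(value: datetime) -> str:
    """Render a timezone-aware datetime as yyyy-MM-dd'T'HH:mm:ss.SSSX in UTC."""
    if value.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = value.astimezone(timezone.utc)
    milliseconds = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{milliseconds:03d}Z"


def format_salesforce_date(value: date) -> str:
    """Render a date as yyyy-MM-dd."""
    return value.strftime("%Y-%m-%d")


local = datetime(2017, 7, 19, 14, 30, 5, 123000,
                 tzinfo=timezone(timedelta(hours=2)))
print(format_salesforce_datetime(local))  # 2017-07-19T12:30:05.123Z
print(format_salesforce_date(date(2017, 7, 19)))  # 2017-07-19
```

Converting to UTC before formatting avoids shifted timestamps when the source data carries a local time zone.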
Output Ports
output (out) The example set created from the result of the Salesforce query. Each queried
field corresponds to an attribute and each record is represented as an example.
Parameters
connection (configurable) The connection details for the Salesforce connection have to be
specified. If you have already configured a Salesforce connection, you can select it from
the drop-down list. If you have not configured a Salesforce connection yet, select the icon
to the right of the drop-down list. Create a new Salesforce connection in the Manage connections box. This includes username, password and the security token. The URL is predefined but can be changed to work on a different API version.
query (salesforce_query) The SOQL query which will be used to query Salesforce. You can
either select the Simple or the Advanced SOQL mode. The Simple mode supports you and
eases the query creation, while the Advanced SOQL mode allows you to use the full power
of SOQL.
guess value types (boolean) If selected, the operator tries to guess the value type of each
column. It does so by taking the first ten rows of the returned data and trying to parse them
as integer, number, date_time, date, and time (in this order). If all of these fail, the attribute
is treated as nominal. If this option is not selected, the operator treats all attributes as
nominal. Other operators have to be applied afterwards to convert the attributes to the
desired value type.
batch size (integer) The batch size the query should use. If you query more records than the
batch size, they are retrieved in chunks of the specified size. This parameter is only for
performance optimization and does not affect the result.
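The type guessing described for “guess value types” can be sketched as an ordered list of parsers applied to a sample of the column. The sketch assumes that a type is accepted only if every sampled value parses, and it uses Python's own parsers, which differ in detail from the operator's; all names are hypothetical.

```python
from datetime import datetime
from typing import Callable, List, Sequence, Tuple


def _parse_date_time(text: str) -> datetime:
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


def _parse_date(text: str) -> datetime:
    return datetime.strptime(text, "%Y-%m-%d")


def _parse_time(text: str) -> datetime:
    return datetime.strptime(text, "%H:%M:%S.%f%z")


# The order matters: the first type that fits the whole sample wins.
PARSERS: List[Tuple[str, Callable[[str], object]]] = [
    ("integer", int),
    ("number", float),
    ("date_time", _parse_date_time),
    ("date", _parse_date),
    ("time", _parse_time),
]


def _parses(text: str, parser: Callable[[str], object]) -> bool:
    try:
        parser(text)
        return True
    except (ValueError, TypeError):
        return False


def guess_value_type(column: Sequence[str], sample_size: int = 10) -> str:
    """Guess the value type from the first sample_size values of a column."""
    sample = list(column[:sample_size])
    if not sample:
        return "nominal"
    for type_name, parser in PARSERS:
        if all(_parses(value, parser) for value in sample):
            return type_name
    return "nominal"


print(guess_value_type(["1", "2", "3"]))
print(guess_value_type(["1", "2.5"]))
print(guess_value_type(["2017-07-19"]))
print(guess_value_type(["Acme Corp."]))
```

Because only a sample is inspected, a column whose later rows break the pattern would still be assigned the guessed type in this sketch.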
Update Salesforce
This operator updates records of a Salesforce object from the input
example set.
Description
This operator updates entries for a Salesforce object from the input example set in the specified Salesforce instance. Each example of the input data will update one record. The selected
attributes will be used as the respective field values. Each record is identified by its ID, which is
taken from the ID attribute.
To select the fields which should be updated, you can use the attribute selection parameters.
Attributes which are not selected are ignored.
Note: Datetime fields are always treated as UTC (Coordinated Universal Time), using the pattern “yyyy-MM-dd’T’HH:mm:ss.SSSX”. Date fields use “yyyy-MM-dd” and time fields “HH:mm:ss.SSSX”.
Input Ports
input (inp) The example set containing the entries which should be updated. Note: The example set must have an ID column by which the records can be identified in the Salesforce
object.
Output Ports
through (thr) The unmodified input example set.
Parameters
connection (configurable) The connection details for the Salesforce connection have to be
specified. If you have already configured a Salesforce connection, you can select it from
the drop-down list. If you have not configured a Salesforce connection yet, select the icon
to the right of the drop-down list. Create a new Salesforce connection in the Manage connections box. This includes username, password and the security token. The URL is predefined but can be changed to work on a different API version.
object name (selection) The name of the Salesforce object for which you want to update records.
skip invalid rows (boolean) If selected, skips and ignores failed updates of a record. In such
cases, the ID column value is set to missing. If not selected, the process will fail if a record
cannot be updated and no records will be updated in Salesforce at all.
attribute filter type (selection) You can specify which attributes should be updated. By default all attributes are updated. Possible values are: all, single, subset, regular_expression,
value_type, block_type, no_missing_values, numeric_value_filter.
invert selection (boolean) If the checkbox is activated, the attribute selection is inverted: all
attributes that were selected before are excluded, and all previously excluded attributes are included.
If the checkbox is deactivated (default), the attribute selection is applied as is.
includes special attributes (boolean) If the checkbox is activated, the operator is also applied to special attributes. If the checkbox is deactivated, the special attributes are ignored.
Write Salesforce
This operator creates records for a Salesforce object from the input
example set.
Description
This operator creates entries for a Salesforce object from the input example set in the specified Salesforce instance. Each example of the input data will create one Salesforce entry. The
selected attributes will be used as the respective field values.
To select the fields which should be created, you can use the attribute selection parameters.
Attributes which are not selected are ignored.
Note that datetime fields are always treated as UTC (Coordinated Universal Time), using the
pattern “yyyy-MM-dd’T’HH:mm:ss.SSSX”. Date fields use “yyyy-MM-dd” and time fields “HH:mm:ss.SSSX”.
If the skip invalid rows parameter is selected, each row for which the creation in Salesforce failed
will return a missing value in the ID column.
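The two failure modes can be illustrated with an in-memory stand-in for Salesforce. In this sketch a dictionary plays the role of the Salesforce object, `is_valid` stands for whatever makes a creation fail, and the generated IDs are made up; none of these names belong to the operator or the Salesforce API.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence


def write_records(store: Dict[str, Dict[str, Any]],
                  examples: Sequence[Dict[str, Any]],
                  is_valid: Callable[[Dict[str, Any]], bool],
                  skip_invalid_rows: bool = False,
                  id_column: str = "Id") -> List[Dict[str, Any]]:
    """Create one record per example and return the examples with an ID column."""
    if any(id_column in example for example in examples):
        raise ValueError(f"the input must not contain a column named '{id_column}'")
    if not skip_invalid_rows and not all(is_valid(e) for e in examples):
        # All-or-nothing: fail before anything is created.
        raise RuntimeError("at least one record cannot be created")
    result = []
    for example in examples:
        new_id: Optional[str] = None  # None stands for a missing value
        if is_valid(example):
            new_id = f"rec-{len(store) + 1}"  # made-up ID scheme
            store[new_id] = dict(example)
        result.append({**example, id_column: new_id})
    return result


fake_salesforce: Dict[str, Dict[str, Any]] = {}
leads = [{"LastName": "Doe"}, {"LastName": ""}]
print(write_records(fake_salesforce, leads,
                    is_valid=lambda e: bool(e["LastName"]),
                    skip_invalid_rows=True))
```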
Input Ports
input (inp) The example set containing the entries which should be created. Note: The example set
must not contain an ID column.
Output Ports
through (thr) The input example set including an ID column containing the IDs generated by
Salesforce for each entry or a missing value if the entry could not be created and the parameter ‘skip_invalid_rows’ is set to true.
Parameters
connection (configurable) The connection details for the Salesforce connection have to be
specified. If you have already configured a Salesforce connection, you can select it from
the drop-down list. If you have not configured a Salesforce connection yet, select the icon
to the right of the drop-down list. Create a new Salesforce connection in the Manage connections box. This includes username, password and the security token. The URL is predefined but can be changed to work on a different API version.
object name (selection) The name of the Salesforce object for which to create records.
salesforce id column (string) The name of the column which will contain the IDs for each
successfully created entry. The name must not be used for any existing column in the incoming example set.
skip invalid rows (boolean) If selected, skips and ignores failed creations of a record. In such
cases, the ID column value is set to missing. If not selected, the process will fail if a record
cannot be created and no records will be created in Salesforce at all.
attribute filter type (selection) You can specify which attributes should be updated. By default all attributes are updated. Possible values are: all, single, subset, regular_expression,
value_type, block_type, no_missing_values, numeric_value_filter.
invert selection (boolean) If the checkbox is activated, the attribute selection is inverted: all
attributes that were selected before are excluded, and all previously excluded attributes are included.
If the checkbox is deactivated (default), the attribute selection is applied as is.
includes special attributes (boolean) If the checkbox is activated, the operator is also applied to special attributes. If the checkbox is deactivated, the special attributes are ignored.
1.4.2 Mozenda
Read Mozenda
This operator loads the specified view from the Mozenda cloud
storage and returns its data as an example set.
Description
After you have created a Mozenda account, you can get a Mozenda view as an example set, using
this operator.
Output Ports
output (out) The example set created from fetching the specified Mozenda view.
Parameters
connection (configurable) The connection details for the Mozenda connection have to be
specified. If you have already configured a Mozenda connection, you can select it from
the drop-down list. If you have not configured a Mozenda connection yet, select the cloud
icon to the right of the drop-down list. Create a new Mozenda connection in the Manage
connections box. An API key is required, which is available in your Mozenda web interface.
Test the connection and click the Save all changes button.
collection (selection) Select the Mozenda collection from the drop-down list. All available
collections are displayed in the drop-down list. To update the list of collections you need to
update the cache of the specified Mozenda connection. Click on the cloud icon to the right
of your selected connection. Select your Mozenda connection in the Manage connections
box and click on the menu next to the Test button to update the cache.
view (selection) Select the Mozenda view from the drop-down list. All available views associated with the selected Mozenda collection are displayed in the drop-down list. To update
the list of views you need to update the cache of the specified Mozenda connection. Click
on the cloud icon to the right of your selected connection. Select your Mozenda connection
in the Manage connections box and click on the menu next to the Test button to update
the cache.
page number (integer) This parameter defines the page number of the page that is returned
from the selected Mozenda view.
page item count (integer) This parameter defines the number of items that are requested to
be on one page of the Mozenda view.
1.4.3 Qlik
Write QVX
This operator writes data in Qlik’s QVX data exchange format.
Description
This operator can write Example Sets in Qlik’s data exchange format QVX. The operator can
either write the file specified by the “file” parameter or send it to the output port labeled “file”.
If that port is connected, the “file” parameter can no longer be used and a file object is sent to
the port. This file object can subsequently be used in two ways:
• It can be further processed, e.g. written to the repository by using one of the file operators
like Write File.
• It can be sent to the result port of the process, e.g. when using it as output of a RapidMiner Server
web service. This is an easy way to connect RapidMiner Server as a data source to Qlik.
Input Ports
input (inp) This input port expects an ExampleSet. It is the output of the Retrieve operator in
the attached Example Process.
Output Ports
through (thr) The ExampleSet that was provided at the input port is delivered through this
output port without any modifications. This is usually used to reuse the same ExampleSet
in further operators of the process.
file (fil) This port buffers the file object for passing it to the reader operators.
Parameters
file (filename) The file to which this operator will write the input example set. Only available
if the file port is not connected.
table name (string) The name by which this table will be known in Qlik.
1.4.4 Twitter
Get Twitter Relations
This operator gets friends or followers of a specific user.
Description
With the Get Twitter Relations operator, you can specify a Twitter user and receive the user’s
friends or followers.
Select a Twitter connection to specify the Twitter account for the Twitter API access. Specify
the user name or the user ID of interest. Finally, choose whether you want to receive the followers or friends
of the specified user.
Output Ports
output (out) An example set consisting of data from the Twitter API. This consists of the IDs of
friends or the IDs of followers. Additionally, it contains the name or ID that was searched
for.
Parameters
connection (configurable) The connection details for the Twitter connection have to be specified. If you have already configured a Twitter connection, you can select it from the drop-down list. If you have not configured a Twitter connection yet, select the icon to the right
of the drop-down list. Create a new Twitter connection in the Manage Connections box.
query type (selection) Specifies whether a user should be searched by id or screen name.
id The id of the user.
name The screen name of the user.
relation (selection) Specifies whether friends or followers of that user should be retrieved.
Get Twitter User Details
This operator shows properties of a specific user.
Description
With the Get Twitter User Details operator, you can specify a Twitter user and receive a list of
properties of the user.
Select a Twitter connection to specify the Twitter account for the Twitter API access. Specify
the user name or the user ID to get information about the user.
Output Ports
output (out) An example set consisting of data from the Twitter API. This comprises the user’s
ID, name, screen name, description, URL, creation date, verification and protection info,
the number of followers and friends, the number of tweets, the language, the profile image,
and the time zone.
Parameters
connection (configurable) The connection details for the Twitter connection have to be specified. If you have already configured a Twitter connection, you can select it from the drop-down list. If you have not configured a Twitter connection yet, select the icon to the right
of the drop-down list. Create a new Twitter connection in the Manage Connections box.
query type (selection) Specifies whether a user should be searched by id or screen name.
id (long) The id of the user.
user (string) The screen name of the user.
Get Twitter User Statuses
This operator searches for Twitter statuses of a specific user.
Description
With the Get Twitter User Statuses operator, you can specify a Twitter user and receive a list of
statuses of the user. The list of statuses contains additional data with context of the statuses.
Select a Twitter connection to specify the Twitter account for the Twitter API access. Specify
at least the user name or the user ID of interest. There are advanced parameters you can use to
specify additional search restrictions. For example, you can increase the number of pages. This
will increase the number of search results.
Output Ports
output (out) An example set consisting of data from the Twitter API. This comprises the tweet
text, the tweet ID, the number of retweets, the date of creation, the language, the geolocation, the used source of the tweet, and user information.
Parameters
connection (configurable) The connection details for the Twitter connection have to be specified. If you have already configured a Twitter connection, you can select it from the drop-down list. If you have not configured a Twitter connection yet, select the icon to the right
of the drop-down list. Create a new Twitter connection in the Manage Connections box.
query type (selection) Specifies whether a user should be searched by id or screen name.
id (long) The id of the user.
user (string) The screen name of the user.
limit (integer) The limit on the number of tweets to return.
since id (long) Returns results with an ID greater than (that is, more recent than) the specified
ID.
max id (long) Returns results with an ID less than (that is, older than) or equal to the specified
ID.
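The combined effect of “since id” and “max id” is a half-open ID window: strictly greater than since id, and less than or equal to max id. The following sketch applies that window to made-up status records; the function name and the record layout are hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence


def filter_by_id_window(statuses: Sequence[Dict[str, Any]],
                        since_id: Optional[int] = None,
                        max_id: Optional[int] = None) -> List[Dict[str, Any]]:
    """Keep statuses with since_id < id <= max_id; None disables a bound."""
    selected = []
    for status in statuses:
        if since_id is not None and status["id"] <= since_id:
            continue  # not more recent than since_id
        if max_id is not None and status["id"] > max_id:
            continue  # newer than max_id
        selected.append(status)
    return selected


timeline = [{"id": 100}, {"id": 200}, {"id": 300}, {"id": 400}]
print(filter_by_id_window(timeline, since_id=100, max_id=300))
```

Because the upper bound is inclusive, passing the lowest ID of a previous result as max id returns that status again.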
Search Twitter
This operator searches for Twitter statuses.
Description
With the Search Twitter operator, you can specify a query and get Twitter statuses containing
this query. The list of statuses contains additional data with context of the statuses.
Select a Twitter connection to specify the Twitter account for the Twitter API access. Specify
at least a query to search Twitter. In expert mode, there are advanced parameters you can use to specify
additional search restrictions. For example, you can limit the search results to a language.
Output Ports
output (out) An example set consisting of data from the Twitter API. This comprises the tweet
text, the tweet ID, the number of retweets, the date of creation, the language, the geolocation, the source of the tweet, and user information.
Parameters
connection (configurable) The connection details for the Twitter connection have to be specified. If you have already configured a Twitter connection, you can select it from the drop-down list. If you have not configured a Twitter connection yet, select the icon to the right
of the drop-down list. Create a new Twitter connection in the Manage Connections box.
query (string) The term that should be searched.
result type (selection) Specifies the preferred search result type.
limit (integer) The limit on the number of tweets to return.
since id (long) Returns results with an ID greater than (that is, more recent than) the specified
ID.
max id (long) Returns results with an ID less than (that is, older than) or equal to the specified
ID.
language (string) Restricts tweets to the given language, specified by an ISO 639-1 code.
locale (string) Specifies the language of the query you are sending. (The official Twitter API
documentation notes that only ’ja’ is currently effective.)
until (string) Returns tweets generated before the given date. The values year, month, and day
are used as search parameters.
filter by geo location (boolean) Indicates if the results should be filtered by a geo location.
latitude (double) The latitude of the geo location.
longitude (double) The longitude of the geo location.
radius (double) The radius of the geo location.
radius unit (selection) The unit of the geo location radius.
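The latitude, longitude, radius, and radius unit parameters describe a circle around a point. How the Twitter API evaluates this circle is not stated here; the sketch below only illustrates the idea with a great-circle (haversine) distance. The function names, the unit strings "km" and "mi", and the sample coordinates are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)
KM_PER_MILE = 1.609344


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(point, center, radius, unit="km"):
    """True if point (lat, lon) lies within radius of center (lat, lon)."""
    radius_km = radius * KM_PER_MILE if unit == "mi" else radius
    return haversine_km(point[0], point[1], center[0], center[1]) <= radius_km


# Hypothetical coordinates (roughly Berlin and Hamburg).
berlin = (52.52, 13.405)
hamburg = (53.5511, 9.9937)
print(within_radius(hamburg, berlin, 100, unit="km"))
print(within_radius(hamburg, berlin, 200, unit="mi"))
```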
1.4.5 Splunk
Search Splunk
Reads search results from a Splunk® server.
Description
This operator can be used to query a Splunk® server based on a query term and returns the results
as an example set. Search results can be restricted by specifying a time frame.
Output Ports
result (res) The example set consisting of the search results.
Parameters
connection (Configurable) The Splunk® connection to use. Select a connection from the drop-down list or click the button to create a new one.
query (String) The Splunk® query in Search Processing Language (SPL).
pagination (Boolean) If set, only a limited number of results will be returned, starting from
a given offset.
offset (Integer) Offset from which the result set should start.
limit (Integer) Maximum number of results to return.
earliest time (Time) If this parameter is set, it specifies the earliest time in the time range to
search.
latest time (Time) If this parameter is set, it specifies the latest time in the time range to
search.
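The pagination, offset, and limit parameters select a window of the result list. A minimal Python sketch of that windowing follows; it works on a plain list and does not contact a Splunk® server, and the function name and the zero-based offset are assumptions for illustration.

```python
def paginate(results, offset=0, limit=None):
    """Return at most `limit` results, starting at the zero-based `offset`."""
    if offset < 0:
        raise ValueError("offset must not be negative")
    if limit is not None and limit < 0:
        raise ValueError("limit must not be negative")
    end = None if limit is None else offset + limit
    return results[offset:end]


# Hypothetical search results, for illustration only.
events = ["event-%d" % i for i in range(10)]
print(paginate(events, offset=3, limit=4))
```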
1.5 Cloud Storage
1.5.1 Amazon S3
Loop Amazon S3
This operator loops over all files in the specified bucket/folder from
the Amazon S3 cloud storage.
Description
After you have configured your Amazon S3 account, you can process all Amazon S3 files within
the selected folder.
Be aware that the operator cannot read the file as an example set. For this reason, you must
connect the file input in the inner process of this operator to another appropriate operator to
process the file. For example, if you want to load Excel files from your Amazon S3 folder, you
must connect the file input in the inner process with the Read Excel operator.
Input Ports
in (in ) Optional input data which is delivered to the inner process.
Output Ports
out (out) Output data of the inner process.
Parameters
connection (configurable) The connection details for the Amazon S3 connection have to be
specified. If you have already configured an Amazon S3 connection, you can select it from
the drop-down list. If you have not configured an Amazon S3 connection yet, select the
icon to the right of the drop-down list. Create a new Amazon S3 connection in the Manage
Connections box. The access key, secret key, and region are required. Note: Select the
correct region for your connection; otherwise an error occurs.
folder (selection) Provide the name of the Amazon S3 ‘folder’ over which you want to loop.
Note that the concept of folders does not exist in Amazon S3, so the default delimiter (’/’)
is used to represent them. If your file was stored as ‘name1/name2/my_file.xls’ on Amazon
S3, the file ‘my_file.xls’ would be displayed as residing in the folder ‘name1/name2/’.
filter (string) Optional filter via a regular expression which is used to exclude files from looping
over them, e.g. ‘a.*b’ for all files starting with ‘a’ and ending with ‘b’. Ignored if empty.
filtered string (selection) Indicates which part of the file name is matched against the filter
expression.
• file_name Filtered on the name, e.g. ‘myfolder/myfile.txt’
• full_path Filtered on the full path, e.g. ‘mybucket/myfolder/myfile.txt’
• parent_path Filtered on the parent folder, e.g. ‘myfolder/’
file name macro (string) The name of the macro which will contain the name of the current
file for each file the loop iterates over, e.g. ‘myfolder/myfile.txt’.
file path macro (string) The name of the macro which will contain the full path of the current
file for each file the loop iterates over, e.g. ‘mybucket/myfolder/myfile.txt’.
parent path macro (string) The name of the macro which will contain the parent folder of
the current file for each file the loop iterates over, e.g. ‘myfolder/’.
recursive (boolean) If selected, the loop will also iterate over all files in all subfolders of the
selected folder. Otherwise, it will only iterate over the files in the selected folder.
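The filter example ‘a.*b’ (names starting with ‘a’ and ending with ‘b’) implies that the expression has to match the whole string. The Python sketch below shows only which names such an expression matches; it assumes whole-string matching, uses a hypothetical function name and hypothetical file names, and makes no claim about how the operator treats the matched files afterwards.

```python
import re


def matching_names(names, pattern):
    """Return the names that the regular expression matches as a whole.

    An empty pattern is ignored, so all names are returned.
    """
    if not pattern:
        return list(names)
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


# Hypothetical file names, for illustration only.
names = ["ab", "a_report_b", "abc", "xab"]
print(matching_names(names, "a.*b"))  # ['ab', 'a_report_b']
```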
Read Amazon S3
This operator downloads the specified file from the Amazon S3
cloud storage.
Description
After you have configured your Amazon S3 account, you can load the Amazon S3 file with this
operator.
Be aware that the operator cannot read the file as an example set. For this reason, you must connect the Read Amazon S3 operator to another appropriate operator to read the file. For example,
if you want to load an Excel file from your Amazon S3, you must connect the Read Amazon S3
operator with the Read Excel operator to see the result.
Output Ports
file (fil) The downloaded file object is returned here. Must be connected to an appropriate Read
operator, for example Read Excel or Read CSV.
Parameters
connection (configurable) The connection details for the Amazon S3 connection have to be
specified. If you have already configured an Amazon S3 connection, you can select it from
the drop-down list. If you have not configured an Amazon S3 connection yet, select the
icon to the right of the drop-down list. Create a new Amazon S3 connection in the Manage
Connections box. The access key, secret key, and region are required. Note: Select the
correct region for your connection; otherwise an error occurs.
file (selection) Select the Amazon S3 file you want to download. Note that the concept of folders does not exist in Amazon S3, so the default delimiter (’/’) is used to represent them.
If your file was stored as ‘name1/name2/my_file.xls’ on Amazon S3, the file ‘my_file.xls’
would be displayed as residing in the folder ‘name1/name2/’.
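Because folders are only a naming convention built on the delimiter, the displayed folder and file name can be derived from the stored name alone. A minimal Python sketch, using a hypothetical helper name:

```python
def split_key(key, delimiter="/"):
    """Split a stored name into (folder, file name) at the last delimiter."""
    folder, separator, file_name = key.rpartition(delimiter)
    return folder + separator, file_name


print(split_key("name1/name2/my_file.xls"))  # ('name1/name2/', 'my_file.xls')
print(split_key("my_file.xls"))              # ('', 'my_file.xls')
```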
Write Amazon S3
This operator uploads the input file to the Amazon S3 cloud storage.
Description
Before you can upload the input file to the selected Amazon S3 cloud storage, you must load it
with an Open File operator.
Ensure that the correct bucket is selected; otherwise an error occurs. Buckets are containers
for Amazon S3 objects. Each bucket name is unique across all of Amazon S3.
Input Ports
file (fil) The file object which should be uploaded to Amazon S3 cloud storage. The file must be
provided by an Open File operator.
Output Ports
file (fil) The input file object is passed through and returned here.
Parameters
connection (configurable) The connection details for the Amazon S3 connection have to be
specified. If you have already configured an Amazon S3 connection, you can select it from
the drop-down list. If you have not configured an Amazon S3 connection yet, select the
icon to the right of the drop-down list. Create a new Amazon S3 connection in the Manage
Connections box. The access key, secret key, and region are required. Note: Select the
correct region for your connection; otherwise an error occurs.
file (selection) Enter the name of the file as it should be stored on Amazon S3, e.g., /mybucket/my_file.xls.
content type (string) Optional. Enter the MIME type of the file to upload, e.g.,
text/xml.
1.5.2 Azure Blob Storage
Loop Azure Blob Storage
This operator loops over all files in the specified container/folder
from the Microsoft Azure Blob Storage.
Description
After you have configured your Azure Blob Storage account, you can process all Azure Blob Storage files within the selected folder.
Be aware that the operator cannot read the file as an example set. For this reason, you must
connect the file input in the inner process of this operator to another appropriate operator to
process the file. For example, if you want to load Excel files from your Azure Blob Storage folder,
you must connect the file input in the inner process with the Read Excel operator.
Input Ports
in (in ) Optional input data which is delivered to the inner process.
Output Ports
out (out) Output data of the inner process.
Parameters
connection (configurable) The connection details for the Azure Blob Storage connection have
to be specified. If you have already configured an Azure Blob Storage connection, you can
select it from the drop-down list. If you have not configured an Azure Blob Storage connection yet,
select the icon to the right of the drop-down list. Create a new Azure Blob Storage connection in the Manage Connections box. The account name and account key are required.
folder (selection) Provide the name of the Azure Blob Storage ‘folder’ over which you want
to loop. Note that the concept of folders does not exist in Azure Blob Storage, so the default delimiter (’/’) is used to represent them. If your file was stored as ‘name1/name2/my_file.xls’ on Azure Blob Storage, the file ‘my_file.xls’ would be displayed as residing in the
folder ‘name1/name2/’.
filter (string) Optional filter via a regular expression which is used to exclude files from looping
over them, e.g. ‘a.*b’ for all files starting with ‘a’ and ending with ‘b’. Ignored if empty.
filtered string (selection) Indicates which part of the file name is matched against the filter
expression.
• file_name Filtered on the name, e.g. ‘myfolder/myfile.txt’
• full_path Filtered on the full path, e.g. ‘mycontainer/myfolder/myfile.txt’
• parent_path Filtered on the parent folder, e.g. ‘myfolder/’
file name macro (string) The name of the macro which will contain the name of the current
file for each file the loop iterates over, e.g. ‘myfolder/myfile.txt’.
file path macro (string) The name of the macro which will contain the full path of the current
file for each file the loop iterates over, e.g. ‘mycontainer/myfolder/myfile.txt’.
parent path macro (string) The name of the macro which will contain the parent folder of
the current file for each file the loop iterates over, e.g. ‘myfolder/’.
recursive (boolean) If selected, the loop will also iterate over all files in all subfolders of the
selected folder. Otherwise, it will only iterate over the files in the selected folder.
Read Azure Blob Storage
This operator downloads the specified file from the Microsoft
Azure Blob Storage cloud storage.
Description
After you have configured your Azure Blob Storage account, you can load the Azure Blob Storage
file with this operator.
Be aware that the operator cannot read the file as an example set. For this reason, you must
connect the Read Azure Blob Storage operator to another appropriate operator to read the file.
For example, if you want to load an Excel file from your Azure Blob Storage, you must connect
the Read Azure Blob Storage operator with the Read Excel operator to see the result.
Output Ports
file (fil) The downloaded file object is returned here. Must be connected to an appropriate Read
operator, for example Read Excel or Read CSV.
Parameters
connection (configurable) The connection details for the Azure Blob Storage connection have
to be specified. If you have already configured an Azure Blob Storage connection, you can
select it from the drop-down list. If you have not configured an Azure Blob Storage connection yet,
select the icon to the right of the drop-down list. Create a new Azure Blob Storage connection in the Manage Connections box. The account name and account key are required.
file (selection) Select the Azure Blob Storage file you want to download. Note that the concept
of folders does not exist in Azure Blob Storage, so the default delimiter (’/’) is used to represent them. If your file was stored as ‘name1/name2/my_file.xls’ on Azure Blob Storage,
the file ‘my_file.xls’ would be displayed as residing in the folder ‘name1/name2/’.
Write Azure Blob Storage
This operator uploads the input file to the Azure Blob Storage
cloud storage.
Description
Before you can upload the input file to the selected Azure Blob Storage cloud storage, you must
load it with an Open File operator.
Ensure that the correct container is selected; otherwise an error occurs. Containers hold
the Azure Blob Storage objects (blobs). Each container name is unique within its storage account.
Input Ports
file (fil) The file object which should be uploaded to Azure Blob Storage cloud storage. The file
must be provided by an Open File operator.
Output Ports
file (fil) The input file object is passed through and returned here.
Parameters
connection (configurable) The connection details for the Azure Blob Storage connection have
to be specified. If you have already configured an Azure Blob Storage connection, you can
select it from the drop-down list. If you have not configured an Azure Blob Storage connection yet,
select the icon to the right of the drop-down list. Create a new Azure Blob Storage connection in the Manage Connections box. The account name and account key are required.
file (selection) Enter the name of the file as it should be stored on Azure Blob Storage, e.g.,
/mycontainer/my_file.xls.
1.5.3 Dropbox
Read Dropbox
This operator loads the specified file from the Dropbox cloud storage.
Description
After you have created a Dropbox account, you can load the Dropbox file with this operator.
Be aware that the operator cannot read the file as an example set. For this reason, you must connect the Read Dropbox operator to another appropriate operator to parse the file. For example,
if you want to load an Excel file from your Dropbox, you must connect the Read Dropbox operator
with the Read Excel operator to see the result.
Output Ports
file (fil) The downloaded file object is returned here. Must be connected to an appropriate Read
operator, for example Read Excel or Read CSV.
Parameters
connection (configurable) The connection details for the Dropbox connection have to be specified. If you have already configured a Dropbox connection, you can select it from the drop-down list. If you have not configured a Dropbox connection yet, select the Dropbox icon
to the right of the drop-down list. Create a new Dropbox connection in the Manage Connections box. An access token is required. If you don't have a valid access token, you must
authenticate RapidMiner via OAuth and copy the generated token to the access token field.
Test the connection and click the Save all changes button.
path (selection) Select the Dropbox folder from the drop-down list. All available folders are
displayed in the drop-down list.
file name (selection) Select the file you want to download. The available files are displayed
in the drop-down list.
Write Dropbox
This operator uploads the input file to the Dropbox cloud storage.
Description
Before you can upload the input file to the selected Dropbox cloud storage, you must load it with
an Open File operator.
Input Ports
file (fil) The file object that should be uploaded to Dropbox cloud storage. The file must be
provided by an Open File operator.
Output Ports
file (fil) The input file object is passed through and returned here.
Parameters
connection (configurable) The connection details for the Dropbox connection have to be specified. If you have already configured a Dropbox connection, you can select it from the drop-down list. If you have not configured a Dropbox connection yet, select the Dropbox icon
to the right of the drop-down list. Create a new Dropbox connection in the Manage Connections box. An access token is required. If you don't have a valid access token, you must
authenticate RapidMiner via OAuth and copy the generated token to the access token field.
Test the connection and click the Save all changes button.
path (selection) Select the Dropbox folder from the drop-down list. All available folders are
displayed in the drop-down list.
file name (string) The name under which the file is written to Dropbox cloud storage. This entry
is optional. If you don't enter a name, the original input file name is used.
overwrite (boolean) If the checkbox is activated, the input file overwrites existing files
with the same file name. If the checkbox is not activated, existing files with the same name
are not overwritten; instead, a counter is added to the new file name. By default, the option is
deactivated.
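The overwrite behavior can be pictured as a name-resolution step before the upload. The Python sketch below is an illustration only: the function name is hypothetical, and the counter format ‘name (1).ext’ is an assumption, since the exact format used by the operator is not specified here.

```python
import os


def resolve_upload_name(name, existing, overwrite=False):
    """Pick the name under which a file would be stored.

    With overwrite enabled, or if the name is free, the name is used as is.
    Otherwise a counter is added until the name no longer collides.
    """
    if overwrite or name not in existing:
        return name
    stem, extension = os.path.splitext(name)
    counter = 1
    while "%s (%d)%s" % (stem, counter, extension) in existing:
        counter += 1
    return "%s (%d)%s" % (stem, counter, extension)


# Hypothetical folder content, for illustration only.
existing = {"report.xls", "report (1).xls"}
print(resolve_upload_name("report.xls", existing))                  # report (2).xls
print(resolve_upload_name("report.xls", existing, overwrite=True))  # report.xls
```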
2 Blending
2.1 Attributes
Reorder Attributes
This operator reorders the regular Attributes of an ExampleSet. Reordering can be done alphabetically, by user specification
(including Regular Expressions), or with a reference ExampleSet.
Description
This operator changes the order of the regular Attributes of an ExampleSet. Three
sort modes can be selected with the sort mode parameter. If sort mode alphabetically is chosen, attributes are sorted alphabetically according to the selected sort direction. If
sort mode user specified is chosen, the user can specify rules that define how attributes should
be ordered. If sort mode reference data is chosen, the attributes of the input ExampleSet are ordered according
to the attribute order of the reference ExampleSet.
Note that special attributes are not considered by this operator. If they should also be reordered, set them to regular with the Set Role operator first.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator
in the attached Example Process. The output of other operators can also be used as input.
Meta data must be attached to the input data because attributes
are specified in the meta data. The Retrieve operator provides meta data along with the data.
reference data (ref) This input port expects an ExampleSet. If sort mode is set to reference
data and this port is connected, the attributes of the ExampleSet from the first port are ordered according to the order of attributes in this ExampleSet.
Output Ports
example set (exa) The ExampleSet with reordered attributes is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
sort mode (selection) This parameter allows you to select the method you want to use for
reordering attributes. It has the following options:
• user specified This option lets you specify rules that define how the attributes should
be reordered. When this option is selected, another parameter (attribute ordering) becomes visible in the Parameters panel. This is the default option.
• alphabetically This option reorders all regular attributes alphabetically according to the selected sort direction.
• reference data This option reorders all regular attributes according to the
order of the regular attributes of the reference ExampleSet. If special attributes should
also be considered, set them to regular before using this operator.
sort direction (selection) The direction in which matched attribute groups are sorted. If sort mode
is alphabetically, all regular attributes are sorted according to this direction. If sort mode
is user specified, attributes that match a Regular Expression and all unmatched attributes
are sorted according to this parameter. If sort mode is set to reference data, all
attributes that could not be found in the reference ExampleSet are sorted according to this
parameter.
• ascending Sort attribute names ascending. This is the default option.
• descending Sort attribute names descending.
• none Apply no sorting at all.
attribute ordering (string) This parameter allows the user to specify rules that define how
attributes should be ordered. If the parameter use regular expressions is checked, all specified rules are treated as Regular Expressions.
handle unmatched (selection) Defines how unmatched attributes should be handled. Unmatched attributes can occur if one or more attributes do not match the rules that the user
provided with the attribute ordering parameter, or if one or more attributes cannot be
found in the reference ExampleSet. If they are kept (prepend, append), they are sorted
according to the selected sort direction.
• append Append all attributes that are not covered by the provided sorting rules.
• prepend Prepend all attributes that are not covered by the provided sorting rules.
• remove Remove all attributes that are not covered by the provided sorting rules.
use regular expressions (boolean) If this parameter is checked, all rules created with the attribute ordering parameter are treated as Regular Expressions.
Tutorial Processes
Selecting attributes by specifying regular expressions matching their names
In the given Example Process the Labor-Negotiations ExampleSet is loaded using the Retrieve
operator. Then the Reorder Attributes operator is applied on it. Have a look at the Parameters panel
of the Reorder Attributes operator. Here is a stepwise explanation of this process.
The sort mode parameter is set to ‘user specified’. This allows the user to specify exact rules
on how the attributes should be ordered.
The attribute ordering parameter has two rules set. The first rule is ‘contrib-.*’ and the second rule is
‘.*-.*’.
The first rule ‘contrib-.*’ specifies that attributes starting with ‘contrib-’ should be ordered in front.
Since this expression matches two attributes, both are sorted in descending order (see sort
direction). The second rule ‘.*-.*’ matches all attributes that have a ‘-’ in their name, except those that
have already been matched by the first rule.
Only duration, pension and vacation do not match these two rules.
They are also sorted according to the sort direction and appended, as defined by the
handle unmatched parameter.
Figure 2.1: Tutorial process ‘Selecting attributes by specifying regular expressions matching
their names’.
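The rule-based ordering described in this tutorial can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the operator's implementation: the function name is hypothetical, rules are assumed to match the whole attribute name, and the attribute list is a hand-picked subset used only as sample input.

```python
import re


def reorder_attributes(names, rules, direction="ascending",
                       handle_unmatched="append", use_regex=True):
    """Order names rule by rule; each matched group is sorted by direction."""

    def sort_group(group):
        if direction == "none":
            return list(group)
        return sorted(group, reverse=(direction == "descending"))

    remaining = list(names)
    ordered = []
    for rule in rules:
        if use_regex:
            matched = [n for n in remaining if re.fullmatch(rule, n)]
        else:
            matched = [n for n in remaining if n == rule]
        ordered.extend(sort_group(matched))
        # Names matched by an earlier rule are not considered again.
        remaining = [n for n in remaining if n not in matched]

    unmatched = sort_group(remaining)
    if handle_unmatched == "append":
        return ordered + unmatched
    if handle_unmatched == "prepend":
        return unmatched + ordered
    if handle_unmatched == "remove":
        return ordered
    raise ValueError("unknown handle_unmatched option: %r" % handle_unmatched)


# Sample attribute names, for illustration only.
names = ["duration", "wage-inc-1st", "pension", "contrib-dental-plan",
         "vacation", "contrib-health-plan", "col-adj"]
print(reorder_attributes(names, ["contrib-.*", ".*-.*"], direction="descending"))
```

With this sample input, the two ‘contrib-’ attributes come first, followed by the other hyphenated names, and the three unmatched names are appended at the end, each group in descending order.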
2.1.1 Names and Roles
Exchange Roles
This operator exchanges the roles of two attributes.
Description
The Exchange Roles operator exchanges the roles of the two specified attributes i.e. it assigns
the role of the first attribute to the second attribute and vice versa. This can be useful, for example, to exchange the roles of a label with a regular attribute (or vice versa), or a label with
a batch attribute, a label with a cluster etc. For more information about roles please study the
description of the Set Role operator.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input.
Output Ports
example set output (exa) The roles of the specified attributes are exchanged and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
first attribute (string) This parameter specifies the name of the first attribute for the attribute
role exchange.
second attribute (string) This parameter specifies the name of the second attribute for the
attribute role exchange.
Tutorial Processes
Exchanging roles of attributes of the Golf data set
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you
can have a look at the ExampleSet. You can see that the roles of the Play and Outlook attributes
are label and regular respectively. The Exchange Roles operator is applied on the ‘Golf’ data set
to exchange the roles of these attributes. The first attribute and second attribute parameters are
set to ‘Play’ and ‘Outlook’ respectively. The resultant ExampleSet can be seen in the Results
Workspace. You can see that now the role of the Play attribute is regular and the role of the
Outlook attribute is label.
Figure 2.2: Tutorial process ‘Exchanging roles of attributes of the Golf data set’.
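The effect of the exchange can be pictured as swapping two entries in a mapping from attribute name to role. A minimal Python sketch, with a hypothetical function name and a plain dictionary standing in for the ExampleSet's roles:

```python
def exchange_roles(roles, first_attribute, second_attribute):
    """Return a copy of the role mapping with the two attributes' roles swapped."""
    swapped = dict(roles)
    swapped[first_attribute] = roles[second_attribute]
    swapped[second_attribute] = roles[first_attribute]
    return swapped


# Roles as described in the tutorial; other attributes are left out.
roles = {"Play": "label", "Outlook": "regular"}
print(exchange_roles(roles, "Play", "Outlook"))
# {'Play': 'regular', 'Outlook': 'label'}
```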
Rename
This operator can be used to rename one or more attributes of an
ExampleSet.
Description
The Rename operator is used for renaming one or more attributes of the input ExampleSet.
Please keep in mind that attribute names must be unique. The Rename operator has no impact on the type or role of an attribute. For example, if you have an attribute named ‘alpha’ of
integer type and regular role, renaming the attribute to ‘beta’ will just change its name. It will
retain its type integer and role regular. To change the role of an attribute, use the Set Role operator. Many type conversion operators are available for changing the type of an attribute at ‘Data
Transformation/Type Conversion’.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used
as input. Meta data must be attached to the input data because attributes are specified in the meta data. The Retrieve operator provides meta data along with the
data.
Output Ports
example set (exa) The ExampleSet with renamed attributes is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
old name (string) This parameter is used to select the attribute whose name is to be changed.
new name (string) The new name of the attribute is specified through this parameter. The name
can also include special characters.
rename additional attributes (string) To rename more than one attribute, click on the Edit
List button. Here you can select attributes and assign new names to them.
Tutorial Processes
Renaming multiple attributes
The ‘Golf’ data set is used in this Example Process. The ‘Play’ attribute is renamed to ‘Game’ and
the ‘Wind’ attribute is renamed to ‘#*#’. The latter renaming is done just to show
that special characters can also be used to rename attributes. However, attribute names should
always be meaningful and should be relevant to the type of information stored in them.
Figure 2.3: Tutorial process ‘Renaming multiple attributes’.
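The renaming in this tutorial, together with the rule that attribute names must be unique, can be sketched in Python as follows. The function name is hypothetical, and raising an error on a duplicate or unknown name is an assumption made for the sketch; how the operator itself reacts is not described here.

```python
def rename_attributes(names, renames):
    """Apply old-name -> new-name pairs and check that names stay unique."""
    unknown = [old for old in renames if old not in names]
    if unknown:
        raise KeyError("unknown attribute(s): %s" % ", ".join(unknown))
    renamed = [renames.get(name, name) for name in names]
    if len(set(renamed)) != len(renamed):
        raise ValueError("attribute names must be unique")
    return renamed


golf = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]
print(rename_attributes(golf, {"Play": "Game", "Wind": "#*#"}))
# ['Outlook', 'Temperature', 'Humidity', '#*#', 'Game']
```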
Rename by Constructions
This operator renames the regular attributes of an ExampleSet by
their construction descriptions if available.
Description
The Rename by Constructions operator replaces the names of regular attributes of the given
ExampleSet by their corresponding construction descriptions if the attribute was constructed
at all. Please study the attached Example Process for better understanding.
Please keep in mind that attribute names must be unique. The Rename by Constructions operator has no impact on the type or role of an attribute. For example, if you have an attribute
named ‘alpha’ of integer type and regular role, renaming the attribute to ‘beta’ will just change
its name. It will retain its type integer and role regular. To change the role of an attribute, use
the Set Role operator. Many type conversion operators are available for changing the type of an
attribute at ‘Data Transformation/Type Conversion’.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Write
Constructions operator in the attached Example Process. The output of other operators
can also be used as input. It is essential that meta data should be attached with the data
for the input because attributes are specified in their meta data.
Output Ports
example set output (exa) The ExampleSet with renamed attributes is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Tutorial Processes
Renaming attributes by their construction descriptions
This Example Process shows how the Rename by Constructions operator can be used for renaming attributes. The ‘Sonar’ data set is loaded using the Retrieve operator. The Rename by Generic
Names operator is applied on this ExampleSet to rename the attributes with the generic stem
‘att’. This ExampleSet is provided as input to the Write Constructions operator. The attribute
constructions file parameter is set to ‘D:\attributes’, thus a file named ‘attributes’ is created (if
it does not already exist) in the ‘D’ drive of your computer. You can open the written file and
make changes in it (if required). A breakpoint is inserted here so that you can have a look at the
constructions file. You can see that each line in the file holds the construction description of one
attribute. You can see that the attribute names are of the form att1, att2 and so on. The attribute
constructions are of the form attribute_1, attribute_2 and so on. The Rename by Constructions
operator is applied on this ExampleSet. This operator will replace the attribute names by the
attribute constructions, which means that the attributes that are currently named att1, att2
etc. will be renamed to attribute_1, attribute_2 etc. You can verify this by viewing the resultant
ExampleSet in the Results Workspace.
Figure 2.4: Tutorial process ‘Renaming attributes by their construction descriptions’.
Rename by Example Values
This operator renames the attributes of an ExampleSet by assigning the values of a specified example as attribute names and deleting that example from the ExampleSet.
Description
The Rename by Example Values operator uses the values of the specified example of the ExampleSet as new attribute names. The row number parameter specifies which row supplies the
attribute names. Please note that all regular and special attributes are renamed. Moreover, the
example is deleted from the ExampleSet. This operator is useful when an example holds the
names of the attributes.
Please keep in mind that attribute names must be unique. The Rename by Example Values
operator has no impact on the type or role of an attribute. For example, if you have an attribute
named ‘alpha’ of integer type and regular role, renaming it to ‘beta’ just changes its name; it
retains its integer type and regular role. To change the role of an attribute, use the Set Role
operator. Many type conversion operators are available for changing the type of an attribute at
‘Data Transformation/Type Conversion’.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input. It is essential that meta data should be attached with the data for the
input because attributes are specified in their meta data.
Output Ports
example set output (exa) The ExampleSet with renamed attributes is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
row number (integer) This parameter specifies which row values should be used as attribute
names. Please note that counting starts with 1.
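The behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RapidMiner's implementation: the ExampleSet is modelled as a list of rows, and the function name rename_by_example_values is hypothetical.

```python
def rename_by_example_values(rows, row_number):
    """Return (new_names, remaining_rows).

    rows       -- list of examples, each a list of values (assumed data model)
    row_number -- 1-based index of the example that holds the new names
    """
    if not 1 <= row_number <= len(rows):
        raise ValueError("row number is out of range (counting starts with 1)")
    # The values of the chosen example become the attribute names.
    new_names = [str(value) for value in rows[row_number - 1]]
    # Attribute names must be unique.
    if len(set(new_names)) != len(new_names):
        raise ValueError("attribute names must be unique")
    # The chosen example is deleted from the data.
    remaining = rows[:row_number - 1] + rows[row_number:]
    return new_names, remaining


rows = [
    ["new_label", "new_name1", "new_name2"],
    ["positive", 1.0, 2.0],
    ["negative", 3.0, 4.0],
]
names, data = rename_by_example_values(rows, row_number=1)
print(names)
print(len(data))
```

Types and roles are not modelled here because the operator leaves them unchanged.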
Tutorial Processes
Renaming all attributes by example values
This Example Process starts with the Subprocess operator, which delivers an ExampleSet. A
breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that
currently the attribute names are ‘label’, ‘att1’ and ‘att2’. The first example has the values
‘new_label’, ‘new_name1’ and ‘new_name2’. The Rename by Example Values operator is
applied on this ExampleSet to set the values of the first example as attribute names. The row
number parameter is set to 1. After execution of the process you will see that the attributes have
been renamed accordingly. Moreover, the first example has been removed from the ExampleSet.
Figure 2.5: Tutorial process ‘Renaming all attributes by example values’.
Rename by Generic Names
This operator renames the selected attributes of the given ExampleSet to a set of generic names like att1, att2, att3 etc.
Description
The Rename by Generic Names operator renames the selected attributes of the given ExampleSet
to a set of generic names like att1, att2, att3 etc. The generic name stem parameter specifies the
name stem which should be used for building generic names. For example, using ‘att’ as stem
would lead to ‘att1’, ‘att2’, etc. as attribute names.
The Rename by Generic Names operator has no impact on the type or role of an attribute. For
example, if you have an attribute named ‘alpha’ of integer type and regular role, renaming it to
‘beta’ just changes its name; it retains its integer type and regular role. To change the role of an
attribute, use the Set Role operator. Many type conversion operators are
available for changing the type of an attribute at ‘Data Transformation/Type Conversion’.
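As a rough sketch of the naming scheme, the following Python function builds generic names from a stem. It is an illustration, not RapidMiner's code. The (name, role) data model and the function name are hypothetical, and the numbering rule (a counter that advances once per renamed attribute) is an assumption.

```python
def rename_by_generic_names(attributes, stem="att", include_special=False):
    """attributes is a list of (name, role) pairs; returns a renamed copy."""
    renamed = []
    counter = 0
    for name, role in attributes:
        if role == "regular" or include_special:
            counter += 1
            # Only the name changes; the role is carried over untouched.
            renamed.append((f"{stem}{counter}", role))
        else:
            # Special attributes keep their names unless explicitly included.
            renamed.append((name, role))
    return renamed


sonar_like = [("attribute_1", "regular"), ("attribute_2", "regular"), ("class", "label")]
print(rename_by_generic_names(sonar_like, stem="att"))
```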
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. It is essential that meta data should be attached with the data for the
input because attributes are specified in their meta data.
Output Ports
example set output (exa) The ExampleSet with renamed attributes is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners.
It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and also allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) Special attributes are attributes with special roles
which identify the examples. In contrast, regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default, special attributes are exempt from the attribute filter conditions; in this operator they are left unrenamed, as the tutorial process below shows for the label attribute.
If this parameter is set to true, special attributes are also tested against the specified filter conditions, and only those attributes that satisfy the
conditions are selected.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all previously selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
generic name stem (string) This parameter specifies the name stem which should be used
for building generic names. For example, using ‘att’ as stem would lead to ‘att1’, ‘att2’,
etc. as attribute names.
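The rules stated for the numeric condition parameter can be made concrete with a small Python sketch. It is an approximation built only from the description above, not the operator's actual parser. The function names are hypothetical, the column data model is assumed, and support for ‘>=’ is an assumption (the text mentions only ‘>’, ‘<’, ‘=’ and ‘<=’).

```python
import operator
import re

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "=": operator.eq}


def parse_numeric_condition(text):
    """Turn a condition such as '> 6 && < 11' into a predicate on one value."""
    if "&&" in text and "||" in text:
        raise ValueError("&& and || cannot be used together in one condition")
    if "&&" in text:
        parts, combine = text.split("&&"), all
    else:
        parts, combine = text.split("||"), any
    tests = []
    for part in parts:
        # A blank space is required between the operator and the number.
        match = re.fullmatch(r"(<=|>=|<|>|=) (-?\d+(?:\.\d+)?)", part.strip())
        if match is None:
            raise ValueError(f"cannot parse condition part: {part!r}")
        tests.append((_OPS[match.group(1)], float(match.group(2))))
    return lambda value: combine(op(value, bound) for op, bound in tests)


def numeric_value_filter(columns, condition):
    """Keep nominal columns, and numeric columns whose values all satisfy the condition."""
    test = parse_numeric_condition(condition)
    selected = []
    for name, values in columns.items():
        is_numeric = all(isinstance(v, (int, float)) for v in values)
        if not is_numeric or all(test(v) for v in values):
            selected.append(name)
    return selected


columns = {"a": [7, 8], "b": [5, 9], "c": ["x", "y"]}
print(numeric_value_filter(columns, "> 6"))
```

In this sketch ‘<5’ raises an error, mirroring the note that a blank space must follow the comparison sign.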
Tutorial Processes
Renaming attributes of the Sonar data set
Figure 2.6: Tutorial process ‘Renaming attributes of the Sonar data set’.
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can view the ExampleSet. You can see that the ExampleSet has 60 regular attributes with
names like attribute_1, attribute_2 etc. The Rename by Generic Names operator is applied on it.
The attribute filter type parameter is set to ‘all’, so all attributes can be renamed by this operator. The generic name stem parameter is set to ‘att’. Thus the attributes are renamed to the format
att1, att2 and so on. This can be verified by viewing the results in the Results Workspace. You
can see that the label attribute is not renamed. This is because the include special attributes
parameter was not set to true.
Rename by Replacing
This operator can be used to rename a set of attributes by replacing
parts of the attribute names by a specified replacement.
Description
The Rename by Replacing operator replaces parts of the attribute names by the specified replacement. This operator is used mostly for removing unwanted parts of attribute names like
whitespaces, parentheses, or other unwanted characters. The replace what parameter defines
the part of the attribute name that should be replaced. It can be defined as a regular expression,
which is a very powerful tool but needs a detailed explanation for beginners. It is always good to
specify the regular expression through the edit and preview regular expression menu. The replace
by parameter can be defined as an arbitrary string. Empty strings are also allowed. Capturing
groups of the regular expression of the replace what parameter can be accessed with $1, $2, $3
etc. Please study the attached Example Process for a better understanding.
Please keep in mind that attribute names must be unique. The Rename by Replacing operator
has no impact on the type or role of an attribute. For example, if you have an attribute named
‘alpha’ of integer type and regular role, renaming it to ‘beta’ just changes its name; it
retains its integer type and regular role. To change the role of an attribute, use the Set Role
operator. Many type conversion operators are available for changing the type of an attribute at
‘Data Transformation/Type Conversion’.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. It is essential that meta data should be attached with the data for the
input because attributes are specified in their meta data.
Output Ports
example set output (exa) The ExampleSet with renamed attributes is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners.
It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and also allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) Special attributes are attributes with special roles
which identify the examples. In contrast, regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default, special attributes are exempt from the attribute filter conditions.
If this parameter is set to true, special attributes are also tested against the specified filter conditions, and only those attributes that satisfy the
conditions are selected.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all previously selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
replace what (string) The replace what parameter defines that part of the attribute name that
should be replaced. It can be defined as a regular expression. Capturing groups of the regular expression of the replace what parameter can be accessed in the replace by parameter
with $1, $2, $3 etc.
replace by (string) The replace by parameter can be defined as an arbitrary string. Empty strings
are also allowed. Capturing groups of the regular expression of the replace what parameter
can be accessed with $1, $2, $3 etc.
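The interplay of the regular expression, except regular expression and invert selection parameters can be illustrated with a short Python sketch. It is a simplified model, not the operator's implementation. The function name is hypothetical, and the expression is assumed to have to match the whole attribute name. Python's re module is used here, whose syntax differs in some details from the regular expressions used in RapidMiner.

```python
import re


def select_by_regular_expression(names, regular_expression,
                                 except_regular_expression=None,
                                 invert_selection=False):
    """Return the attribute names chosen by the regular_expression filter type."""
    selected = []
    for name in names:
        # Assumption: the expression must match the complete attribute name.
        hit = re.fullmatch(regular_expression, name) is not None
        if hit and except_regular_expression is not None:
            # Names matching the exception are filtered out again.
            hit = re.fullmatch(except_regular_expression, name) is None
        # invert selection acts as a NOT gate on the final decision.
        if hit != invert_selection:
            selected.append(name)
    return selected


names = ["att1", "att2", "att10", "label"]
print(select_by_regular_expression(names, "att.*", except_regular_expression="att1.*"))
```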
Tutorial Processes
Renaming attributes of the Sonar data set
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can view the ExampleSet. You can see that the ExampleSet has 60 regular attributes with
names like attribute_1, attribute_2 etc. The Rename by Replacing operator is applied on it. The
attribute filter type parameter is set to ‘all’, so all attributes can be renamed by this operator.
The replace what parameter is set to the regular expression ‘(att)ribute_’. The parentheses
specify the capturing group, which can be accessed in the replace by parameter with $1.
The replace by parameter is set to ‘$1-’. Wherever ‘attribute_’ is found in the names of the ‘Sonar’
attributes, it is replaced by the first capturing group and a dash, i.e. ‘att-’. Thus the attributes are
renamed to the format att-1, att-2 and so on. This can be verified by viewing the results in the Results
Workspace.
Figure 2.7: Tutorial process ‘Renaming attributes of the Sonar data set’.
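The replacement used in this tutorial can be reproduced outside RapidMiner with a short Python sketch. It is an illustration only: the function name is hypothetical, and because Python's re module writes group references as \g<1> rather than $1, the sketch translates the replace by string first.

```python
import re


def rename_by_replacing(names, replace_what, replace_by):
    """Apply a regex replacement to every attribute name and return the new names."""
    # Protect literal backslashes, then translate $1, $2, ... into \g<1>, \g<2>, ...
    escaped = replace_by.replace("\\", "\\\\")
    template = re.sub(r"\$(\d+)", r"\\g<\1>", escaped)
    pattern = re.compile(replace_what)
    new_names = [pattern.sub(template, name) for name in names]
    # Attribute names must stay unique after the replacement.
    if len(set(new_names)) != len(new_names):
        raise ValueError("replacement produced duplicate attribute names")
    return new_names


sonar_names = ["attribute_1", "attribute_2", "attribute_60"]
print(rename_by_replacing(sonar_names, "(att)ribute_", "$1-"))
# prints ['att-1', 'att-2', 'att-60']
```

An empty replace by string simply deletes the matched part, which corresponds to the typical use of stripping unwanted characters.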
Set Role
This operator is used to change the role of one or more attributes.
Description
The role of an attribute reflects the part played by that attribute in an ExampleSet. Changing
the role of an attribute may change the part played by that attribute in a process. An attribute
can have exactly one role. This operator is used to change the role of one or more attributes of
the input ExampleSet. It is a very simple operator: all you have to do is select an attribute
and select a new role for it. Different learning operators require attributes with different roles.
This operator is frequently used to set the right roles for attributes before applying the desired
operator. The change in role is only for the current process, i.e. the role of the attribute is not
changed permanently in the ExampleSet. The Set Role operator should not be confused with
the Rename operator or the Type Conversion operators. The Rename operator is used to change
the name of an attribute. Many Type Conversion operators are available (at Data Transformation/Type Conversion/) to change the type of attributes e.g. the Nominal to Binominal operator,
the Numerical to Polynominal operator and many more.
Broadly, roles are classified into two types: regular and special. Regular attributes simply
describe the examples and are usually used during learning processes. One ExampleSet can have numerous regular attributes. Special attributes are those which identify the
examples separately; each has a specific task. Special roles are: label, id, prediction, cluster, weight, and batch. An ExampleSet can have numerous special attributes, but
one special role cannot be repeated. If one special role is assigned to more than one attribute in
an ExampleSet, all these attributes will change their role to regular except the last one (before
version 5.3.14 these attributes were dropped). This concept can be easily understood by studying
the attached Example Process. The various roles are explained in the Parameters section.
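The uniqueness rule for special roles can be sketched in Python as follows. This is a simplified model of the behaviour described above for current versions (earlier holders of a special role fall back to regular), not RapidMiner's code. The function name and the dictionary data model are hypothetical.

```python
def set_roles(roles, assignments):
    """roles maps attribute name -> role; assignments is an ordered list of (name, target role)."""
    roles = dict(roles)  # work on a copy so the input stays unchanged
    for name, target in assignments:
        if target != "regular":
            # A special role may be held by one attribute only:
            # any earlier holder of the same role becomes regular.
            for other, current in roles.items():
                if other != name and current == target:
                    roles[other] = "regular"
        # An attribute has exactly one role, so the latest assignment wins.
        roles[name] = target
    return roles


before = {"name": "regular", "shift-differential": "regular", "standby-pay": "regular"}
after = set_roles(before, [("name", "label"),
                           ("shift-differential", "label"),
                           ("standby-pay", "label")])
print(after)
```

Only the last attribute assigned the label role keeps it, which is why distinct names such as ignore01 and ignore02 are needed when several attributes should be ignored.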
Input Ports
example set (exa) This input port expects an ExampleSet. It is output of the Retrieve operator
in our Example Process. Output of other operators may also be used as input. It is essential that meta data should be attached with the data for the input because the role of an
attribute is specified in the meta data of the ExampleSet. The Retrieve operator provides
meta data along with the data.
Output Ports
example set (exa) The ExampleSet with modified role(s) is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
name (string) The name of the attribute whose role should be changed is specified through
this parameter. You can select the attribute either from the drop down list or type it manually.
target role (string) The target role of the selected attribute is the new role assigned to it. Following target roles are possible:
• regular Attributes without a special role, i.e. those which simply describe the examples, are called regular attributes; in most cases the role designation is simply left out.
Regular attributes are used as input variables for learning tasks.
• id This is a special role, it acts as id attribute for the ExampleSet and it is usually
unique in every example of the ExampleSet. The id role is used to clearly identify the
examples of concerned ExampleSet. In this case the attribute adopts the role of an
identifier and is called ID for short. Unique ids can be given to all the examples using
the Generate ID operator.
• label This is a special role; it acts as the target attribute for learning operators e.g. the
Decision Tree operator. Labels characterize the examples in some way, and they must be
predicted for new examples that are not yet characterized in such a manner. The label
is also called ‘goal variable’.
• prediction This is a special role, it acts as predicted attribute of a learning scheme.
For example when a predictive model is learnt through any learning operator and then
it is applied using the Apply Model operator, in the output we have a new attribute
with role prediction which holds the values of label predicted by the given model. The
label and prediction attributes are also used for evaluating the performance of a model.
• cluster This is a special role; it indicates the membership of an example of the ExampleSet in a particular cluster. For example, the k-Means operator adds a
column with the cluster role to its output.
• weight This is a special role, it indicates the weight of the examples with regard to the
label. Weights are used in learning processes to give different importance to examples
with different weights. Attribute weights are used in numerous operators e.g. the
Select By Weights operator. Weights can also be used in evaluating the performance of
models e.g. the Performance operator has a use example weights parameter to consider
the weight of examples during the performance evaluation process.
• batch This is a special role, it indicates the membership to an example batch.
• user defined Any role can be provided by typing it directly into the textbox instead of
selecting a role from the dropdown menu. If ‘ignore’ is written in the textbox, that
attribute will be ignored by the subsequent operators in the process. This is also a special
role, thus it needs to be unique. To ignore multiple attributes, unique roles can be
assigned like ignore01, ignore02, ignore03 and so on.
set additional roles (menu) Click this button to modify roles of more than one attribute. A
click on this button opens a new menu which allows you to select any attribute and assign
any role to it. It also allows assigning multiple roles to the same attribute. But, as an
attribute can have exactly one role, only the last role assigned to that attribute is actually
assigned to it and all previous roles assigned to it are ignored.
Tutorial Processes
Setting roles of attributes
Figure 2.8: Tutorial process ‘Setting roles of attributes’.
In this Example Process, the ‘Labor-Negotiation’ data set is loaded using the Retrieve operator.
The roles of its attributes are changed using the Set Role operator. Here is an explanation of what
happens when this process is executed:
• The label role is assigned to name, shift-differential and standby-pay. As label is a special
role and only one attribute can hold a given special role, only the last attribute (standby-pay)
keeps the label role. As explained in the Description, the earlier attributes (name and
shift-differential) change their role to regular; before version 5.3.14 they were dropped.
• duration is assigned to the weight role.
• wage-inc-1st, longterm-disability-assistance, pension, bereavement-assistance and
wage-inc-2nd are given the regular role. They were regular attributes even before the
reassignment, so assigning the same role makes no change. As there can be numerous
regular attributes, no attribute is affected.
• The roles of wage-inc-3rd and working-hours are not modified. Thus they retain their
original role, i.e. regular.
• col-adj is assigned to the id role.
• education-allowance is assigned to the batch role.
• statutory-holidays and vacations are assigned to the ignore0 and ignore1 roles respectively.
• contrib-to-dental-plan is assigned to the prediction role.
• contrib-to-health-plan is assigned to the cluster role.
Some attributes lose their assigned role as explained earlier, but note that the number of examples remains the same. The roles assigned in this Example Process merely show how the Set Role
operator works; in real scenarios such assignments of roles may not be very useful. This also
highlights another point: the Set Role operator is not context-aware. It assigns the roles set
by the user irrespective of context, so users must know which role to
assign in which scenario. Thanks to the Problems View and quick fixes, it becomes easy to
set the right roles before applying different learning operators. Note that the Problems View
displays two warnings even in this Example Process.
2.1.2 Types
Date to Nominal
This operator parses the date values of the specified date attribute
with respect to the given date format string and transforms the values into nominal values.
Description
The Date to Nominal operator transforms the specified date attribute and writes a new nominal
attribute in a user-specified format. The conversion is done with respect to the date
format string given by the date format parameter. This operator might be useful for
time-based OLAP to change the granularity of the time stamps from day to week or month. The
date attribute is selected by the attribute name parameter. The old date attribute will be removed
and replaced by a new nominal attribute if the keep old attribute parameter is not set to true.
Understanding date and time patterns is very important for using this operator properly.
Date and Time Patterns
This section explains the date and time patterns. Understanding of date and time patterns is
necessary especially for specifying the date format string in the date format parameter. Within
date and time pattern strings, unquoted letters from ‘A’ to ‘Z’ and from ‘a’ to ‘z’ are interpreted
as pattern letters that represent the components of a date or time. Text can be quoted using
single quotes (’) to avoid interpretation as date or time components. All other characters are
not interpreted as date or time components; they are simply matched against the input string
during parsing.
Here is a brief description of the defined pattern letters. The format types like ‘Text’, ‘Number’,
‘Year’, ‘Month’ etc are described in detail after this section.
• G: This pattern letter is the era designator. For example: AD, BC etc. It follows the rules
of ‘Text’ format type.
• y: This pattern letter represents year. yy represents year in two digits e.g. 96 and yyyy
represents year in four digits e.g. 1996. This pattern letter follows the rules of the ‘Year’
format type.
• M: This pattern letter represents the month of the year. It follows the rules of the ‘Month’
format type. For example, a month can be represented as March, Mar or 03.
• w: This pattern letter represents the week number of the year. It follows the rules of the
‘Number’ format type. For example, the first week of January can be represented as 01 and
the last week of December can be represented as 52.
• W: This pattern letter represents the week number of the month. It follows the rules of
the ‘Number’ format type. For example, the first week of January can be represented as 01
and the fourth week of December can be represented as 04.
• D: This pattern letter represents the day number of the year. It follows the rules of the
‘Number’ format type. For example, the first day of January can be represented as 01 and
last day of December can be represented as 365 (or 366 in case of a leap year).
• d: This pattern letter represents the day number of the month. It follows the rules of the
‘Number’ format type. For example, the first day of January can be represented as 01 and
the last day of December can be represented as 31.
• F: This pattern letter represents the day of the week in the month (for example, 2 for the
second Wednesday of a month). It follows the rules of the ‘Number’ format type.
• E: This pattern letter represents the name of the day of the week. It follows the rules of
the ‘Text’ format type. For example, Tuesday or Tue etc.
• a: This pattern letter represents the AM/PM portion of the 12-hour clock. It follows the
rules of the ‘Text’ format type.
• H: This pattern letter represents the hour of the day (from 0 to 23). It follows the rules of
the ‘Number’ format type.
• k: This pattern letter represents the hour of the day (from 1 to 24). It follows the rules of
the ‘Number’ format type.
• K: This pattern letter represents the hour of the day for 12-hour clock (from 0 to 11). It
follows the rules of the ‘Number’ format type.
• h: This pattern letter represents the hour of the day for 12-hour clock (from 1 to 12). It
follows the rules of the ‘Number’ format type.
• m: This pattern letter represents the minutes of the hour (from 0 to 59). It follows the
rules of the ‘Number’ format type.
• s: This pattern letter represents the seconds of the minute (from 0 to 59). It follows the
rules of the ‘Number’ format type.
• S: This pattern letter represents the milliseconds of the second (from 0 to 999). It follows
the rules of the ‘Number’ format type.
• z: This pattern letter represents the time zone. It follows the rules of the ‘General Time
Zone’ format type. Examples include Pacific Standard Time, PST, GMT-08:00 etc.
• Z: This pattern letter represents the time zone. It follows the rules of the ‘RFC 822 Time
Zone’ format type. Examples include -0800 etc.
Please note that all other characters from ‘A’ to ‘Z’ and from ‘a’ to ‘z’ are reserved. Pattern
letters are usually repeated, as their number determines the exact presentation. Here is the
explanation of various format types:
• Text: For formatting, if the number of pattern letters is 4 or more, the full form is used;
otherwise a short or abbreviated form is used (if available). For parsing, both forms are
acceptable independent of the number of pattern letters.
• Number: For formatting, the number of pattern letters is the minimum number of digits.
The numbers that are shorter than this minimum number of digits are zero-padded to this
amount. For example if the minimum number of digits is 3 then the number 5 will be
changed to 005. For parsing, the number of pattern letters is ignored unless it is needed
to separate two adjacent fields.
• Year: If the underlying calendar is the Gregorian calendar, the following rules are applied:
– For formatting, if the number of pattern letters is 2, the year is truncated to 2 digits;
otherwise it is interpreted as a ‘Number’ format type.
– For parsing, if the number of pattern letters is more than 2, the year is interpreted
literally, regardless of the number of digits. So using the pattern ‘MM/dd/yyyy’, the
string ‘01/11/12’ parses to ‘Jan 11, 12 A.D’.
– For parsing with the abbreviated year pattern (’y’ or ‘yy’), this operator must interpret the abbreviated year relative to some century. It does this by adjusting dates to be
within 80 years before and 20 years after the time the operator is created. For example, using a pattern of ‘MM/dd/yy’ and the operator created on Jan 1, 1997, the string
‘01/11/12’ would be interpreted as Jan 11, 2012 while the string ‘05/04/64’ would be
interpreted as May 4, 1964. During parsing, only strings consisting of exactly two digits will be parsed into the default century. Any other numeric string, such as a one
digit string, a three or more digit string, or a two digit string that is not all digits (for
example, ‘-1’), is interpreted literally. So ‘01/02/3’ or ‘01/02/003’ are parsed, using
the same pattern, as ‘Jan 2, 3 AD’. Likewise, ‘01/02/-3’ is parsed as ‘Jan 2, 4 BC’.
Otherwise, if the underlying calendar is not the Gregorian calendar, calendar system specific forms are applied. If the number of pattern letters is 4 or more, a calendar specific
long form is used. Otherwise, a calendar short or abbreviated form is used.
• Month: If the number of pattern letters is 3 or more, the month is interpreted as ‘Text’
format type; otherwise, it is interpreted as a ‘Number’ format type.
• General time zone: Time zones are interpreted as ‘Text’ format type if they have names.
It is possible to define time zones by representing a GMT offset value. RFC 822 time zones
are also acceptable.
• RFC 822 time zone: For formatting, the RFC 822 4-digit time zone format is used. General
time zones are also acceptable.
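The abbreviated-year rule described under the ‘Year’ format type can be illustrated with a short Python sketch. It is a simplified illustration at whole-year granularity, not the operator's actual implementation; the function name resolve_two_digit_year and the reference_year argument are hypothetical.

```python
def resolve_two_digit_year(two_digits: int, reference_year: int) -> int:
    """Place a two-digit year in the 100-year window that starts
    80 years before reference_year (assumption: year granularity only)."""
    if not 0 <= two_digits <= 99:
        raise ValueError("expected a value between 0 and 99")
    window_start = reference_year - 80              # e.g. 1997 -> 1917
    candidate = window_start - window_start % 100 + two_digits
    if candidate < window_start:                    # fell before the window
        candidate += 100
    return candidate


print(resolve_two_digit_year(12, 1997))  # 2012
print(resolve_two_digit_year(64, 1997))  # 1964
```

With a reference year of 1997 the window covers 1917 to 2016, which reproduces the two examples given above (‘12’ becomes 2012, ‘64’ becomes 1964).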
This operator also supports localized date and time pattern strings by defining the locale parameter. In these strings, the pattern letters described above may be replaced with other, locale-dependent pattern letters.
The following examples show how date and time patterns are interpreted in the U.S. locale.
The given date and time are 2001-07-04 12:08:56 local time in the U.S. Pacific Time time zone.
• ‘yyyy.MM.dd G 'at' HH:mm:ss z’: 2001.07.04 AD at 12:08:56 PDT
• ‘EEE, MMM d, ''yy’: Wed, Jul 4, '01
• ‘h:mm a’: 12:08 PM
• ‘hh 'oclock' a, zzzz’: 12 oclock PM, Pacific Daylight Time
• ‘K:mm a, z’: 0:08 PM, PDT
• ‘yyyy.MMMMM.dd GGG hh:mm aaa’: 2001.July.04 AD 12:08 PM
• ‘EEE, d MMM yyyy HH:mm:ss Z’: Wed, 4 Jul 2001 12:08:56 -0700
• ‘yyMMddHHmmssZ’: 010704120856-0700
• ‘yyyy-MM-dd'T'HH:mm:ss.SSSZ’: 2001-07-04T12:08:56.235-0700
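To make the reading of pattern strings more concrete, the following Python sketch translates a small subset of pattern letters into strftime directives. It only illustrates how runs of letters and quoted text are interpreted; the DIRECTIVES table and the to_strftime function are hypothetical helpers, they cover only a few letters, and they are not how the operator itself is implemented.

```python
from datetime import datetime

# Hypothetical mapping from a few pattern-letter runs to strftime directives.
DIRECTIVES = {
    "yyyy": "%Y", "yy": "%y",
    "MMMM": "%B", "MMM": "%b", "MM": "%m",
    "dd": "%d",
    "EEEE": "%A", "EEE": "%a",
    "HH": "%H", "mm": "%M", "ss": "%S",
}


def to_strftime(pattern: str) -> str:
    """Translate a small subset of date format strings to strftime syntax."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "'":
            # Quoted text is copied literally; '' stands for one apostrophe.
            end = pattern.find("'", i + 1)
            if end == -1:
                raise ValueError("unterminated quote in pattern")
            text = pattern[i + 1:end]
            out.append(text.replace("%", "%%") if text else "'")
            i = end + 1
        elif ch.isalpha():
            # Collect the full run of the same pattern letter.
            j = i
            while j < len(pattern) and pattern[j] == ch:
                j += 1
            run = pattern[i:j]
            if run not in DIRECTIVES:
                raise ValueError(f"unsupported pattern letters: {run}")
            out.append(DIRECTIVES[run])
            i = j
        else:
            # Any other character is matched or emitted as-is.
            out.append("%%" if ch == "%" else ch)
            i += 1
    return "".join(out)


moment = datetime(2001, 7, 4, 12, 8, 56)
print(moment.strftime(to_strftime("yyyy.MM.dd 'at' HH:mm:ss")))
```

Era, time zone, and week-related letters are deliberately left out of this sketch because strftime has no direct equivalents for all of them.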
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input. The ExampleSet should have at least one date attribute because if there
is no such attribute, the use of this operator does not make sense.
Output Ports
example set output (exa) The selected date attribute is converted to a nominal attribute according to the specified date format string and the resultant ExampleSet is delivered through
this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute name (string) The name of the date attribute is specified here. The attribute name
can be selected from the drop down box of the attribute name parameter if the meta data
is known.
date format This is the most important parameter of this operator. It specifies the date time
format of the desired nominal attribute. This date format string specifies what portion
of the date attribute should be stored in the nominal attribute. Date format strings are
discussed in detail in the description of this operator.
locale (selection) This is an expert parameter. A long list of locales is provided; users can
select any of them.
keep old attribute (boolean) This parameter indicates if the original date attribute should
be kept or it should be discarded.
Tutorial Processes
Introduction to the Date to Nominal operator
This Example Process starts with a Subprocess operator. The subprocess delivers an ExampleSet
with just a single attribute. The name of the attribute is ‘deadline_date’. The type of the attribute
is date. A breakpoint is inserted here so that you can view the ExampleSet. As you can see, all the
examples of this attribute have both date and time information. The Date to Nominal operator
is applied on this ExampleSet to change the type of the ‘deadline_date’ attribute from date to
nominal type. Have a look at the parameters of the Date to Nominal operator. The attribute
name parameter is set to ‘deadline_date’. The date format parameter is set to ‘EEEE’. Here is an
explanation of this date format string:
‘E’ is the pattern letter used for the representation of the name of the day of the week. As
explained in the description, if the number of pattern letters is 4 or more, the full form is used.
Thus ‘EEEE’ is used for representing the day of the week in full form, e.g. Monday, Tuesday etc.
The date attribute is therefore changed to a nominal attribute which has only the names of days
as possible values. Please note that this date format string specifies the format of the nominal
values of the new nominal attribute, not the format of the input dates.
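The effect of the ‘EEEE’ format can be pictured with a few lines of Python. The dates below are made-up sample values standing in for the ‘deadline_date’ attribute, and the WEEKDAYS list is a hypothetical helper; the sketch only mimics the outcome and is not the operator's code.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up sample values standing in for the 'deadline_date' attribute.
deadline_dates = [
    datetime(2012, 4, 21, 9, 30),
    datetime(2012, 4, 22, 17, 0),
    datetime(2012, 4, 28, 8, 15),
]

# Each date-time collapses to the full name of its weekday.
nominal_values = [WEEKDAYS[d.weekday()] for d in deadline_dates]
print(nominal_values)            # ['Saturday', 'Sunday', 'Saturday']
print(sorted(set(nominal_values)))
```

The time of day is discarded, and dates that fall on the same weekday share one nominal value.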
Figure 2.9: Tutorial process ‘Introduction to the Date to Nominal operator’.
Date to Numerical
This operator changes the type of the selected date attribute to a
numeric type. It also maps all values of this attribute to numeric
values. You can specify exactly which component of date or time
should be extracted. You can also specify relative to which date or
time component information should be extracted.
Description
The Date to Numerical operator provides a lot of flexibility when it comes to selecting a component of date or time. The following components can be selected: millisecond, second, minute,
hour, day, week, month, quarter, half year, and year. The most important thing is that these
components can be selected relative to other components. For example it is possible to extract
the day relative to the week, relative to the month or relative to the year. Suppose the date is
15/Feb/2012. Then the day relative to the month would be 15 because it is the 15th day of the
month. And the day relative to the year would be 46 because this is the 46th day of the year. All
date and time components can be extracted relative to the most common parent components
e.g. month can be calculated relative to the quarter or the year. Similarly second can be calculated relative to the minute, the hour or the day. All date and time components can be extracted
relative to the Epoch, where Epoch is defined as the date ‘01-01-1970 00:00’. If the date attribute has no time information then all calculations on time components will result in 0. The
attached Example Process illustrates these options.
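The 15/Feb/2012 example from the description can be checked with Python's standard datetime module. This is only a cross-check of the arithmetic, not the operator's implementation; the variable names are illustrative.

```python
from datetime import date

d = date(2012, 2, 15)

day_relative_to_month = d.day                  # 15th day of February
day_relative_to_year = d.timetuple().tm_yday   # 31 days of January + 15

print(day_relative_to_month)  # 15
print(day_relative_to_year)   # 46
```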
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Generate
Data operator in the attached Example Process. The output of other operators can also be
used as input. Meta data must be attached to the input data because attributes are specified
in the meta data. The Generate Data operator provides meta data along with the data. The
ExampleSet should have at least one date attribute; without one, this operator has nothing
to convert.
Output Ports
example set (exa) The ExampleSet with the selected date attribute converted to numeric type
is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute name (string) This parameter specifies the attribute of the input ExampleSet that
should be converted from date to numerical form.
time unit (selection) This parameter specifies the unit in which the time is measured. In other
words, this parameter specifies the component of the date that should be extracted. The
following components can be extracted: millisecond, second, minute, hour, day, week,
month, quarter, half year, and year.
millisecond relative to (selection) This parameter is only available when the time unit parameter is set to ‘millisecond’. This parameter specifies the component relative to which
the milliseconds should be extracted. The following options are available: second, epoch.
second relative to (selection) This parameter is only available when the time unit parameter
is set to ‘second’. This parameter specifies the component relative to which the seconds
should be extracted. The following options are available: minute, hour, day, epoch.
minute relative to (selection) This parameter is only available when the time unit parameter
is set to ‘minute’. This parameter specifies the component relative to which the minutes
should be extracted. The following options are available: hour, day, epoch.
hour relative to (selection) This parameter is only available when the time unit parameter is
set to ‘hour’. This parameter specifies the component relative to which the hours should
be extracted. The following options are available: day, epoch.
day relative to (selection) This parameter is only available when the time unit parameter is
set to ‘day’. This parameter specifies the component relative to which the days should be
extracted. The following options are available: week, month, year, epoch.
week relative to (selection) This parameter is only available when the time unit parameter is
set to ‘week’. This parameter specifies the component relative to which the weeks should
be extracted. The following options are available: month, year, epoch.
month relative to (selection) This parameter is only available when the time unit parameter
is set to ‘month’. This parameter specifies the component relative to which the months
should be extracted. The following options are available: quarter, year, epoch.
quarter relative to (selection) This parameter is only available when the time unit parameter
is set to ‘quarter’. This parameter specifies the component relative to which the quarters
should be extracted. The following options are available: year, epoch.
half year relative to (selection) This parameter is only available when the time unit parameter is set to ‘half year’. This parameter specifies the component relative to which the half
years should be extracted. The following options are available: year, epoch.
year relative to (selection) This parameter is only available when the time unit parameter is
set to ‘year’. This parameter specifies the component relative to which the years should
be extracted. The following options are available: epoch, era.
keep old attribute (boolean) This is an expert parameter. This parameter indicates if the
original date attribute of the input ExampleSet should be kept. This parameter is set to
false by default; thus the original date attribute is removed from the input ExampleSet.
Tutorial Processes
Introduction to the Date to Numerical operator
Figure 2.10: Tutorial process ‘Introduction to the Date to Numerical operator’.
The Generate Data by User Specification operator is used in this Example Process to create
a date type attribute. The attribute is named ‘Date’ and it is defined by the expression date_parse("04/21/2012"). Thus an attribute named ‘Date’ is created with just a single example.
The value of the date is 21/April/2012. Please note that no information about time is given. The
Date to Numerical operator is applied on this ExampleSet. The ‘Date’ attribute is selected in the
attribute name parameter.
Different parameter settings give the following results:
• If the time unit parameter is set to ‘year’ and the year relative to parameter is set to ‘era’,
the result is 2012, because this is the 2012th year relative to the era.
• If the time unit parameter is set to ‘year’ and the year relative to parameter is set to ‘epoch’,
the result is 42, because the year of the epoch date is 1970 and the difference between 2012
and 1970 is 42.
• If the time unit parameter is set to ‘half year’ and the half year relative to parameter is set
to ‘year’, the result is 1, because April is in the first half of the year.
• If the time unit parameter is set to ‘quarter’ and the quarter relative to parameter is set to
‘year’, the result is 2, because April is the 4th month of the year and falls in the second
quarter.
• If the time unit parameter is set to ‘month’ and the month relative to parameter is set to
‘year’, the result is 4, because April is the fourth month of the year.
• If the time unit parameter is set to ‘month’ and the month relative to parameter is set to
‘quarter’, the result is 1, because April is the first month of the second quarter.
• If the time unit parameter is set to ‘week’ and the week relative to parameter is set to ‘year’,
the result is 16, because 21st April falls in the 16th week of the year.
• If the time unit parameter is set to ‘week’ and the week relative to parameter is set to
‘month’, the result is 3, because the 21st day of the month falls in the 3rd week of the month.
• If the time unit parameter is set to ‘day’ and the day relative to parameter is set to ‘year’,
the result is 112, because 21st April is the 112th day of the year.
• If the time unit parameter is set to ‘day’ and the day relative to parameter is set to ‘month’,
the result is 21, because 21st April is the 21st day of the month.
• If the time unit parameter is set to ‘day’ and the day relative to parameter is set to ‘week’,
the result is 7, because 21st April 2012 is a Saturday. Sunday is the first day of the week, so
Saturday is the seventh.
• If the time unit parameter is set to ‘hour’ and the hour relative to parameter is set to ‘day’,
the result is 0, because no time information was provided for this date attribute and all time
components are assumed to be 0 by default.
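The results of this tutorial can be reproduced with plain date arithmetic. The Python sketch below is an illustration only; the function names are hypothetical, and the week numbering assumes weeks that start on Sunday with the first (possibly partial) week counted as week 1, an assumption chosen because it matches the numbers quoted in this tutorial.

```python
from datetime import datetime

EPOCH_YEAR = 1970


def sunday_index(moment: datetime) -> int:
    """Weekday index with Sunday = 0 ... Saturday = 6."""
    return moment.isoweekday() % 7


def extract_components(moment: datetime) -> dict:
    """Compute the date components discussed in the tutorial."""
    day_of_year = moment.timetuple().tm_yday
    jan_first = moment.replace(month=1, day=1)
    month_first = moment.replace(day=1)
    return {
        "year relative to era": moment.year,
        "year relative to epoch": moment.year - EPOCH_YEAR,
        "half year relative to year": (moment.month - 1) // 6 + 1,
        "quarter relative to year": (moment.month - 1) // 3 + 1,
        "month relative to year": moment.month,
        "month relative to quarter": (moment.month - 1) % 3 + 1,
        # Assumption: Sunday-first weeks, first partial week is week 1.
        "week relative to year": (day_of_year - 1 + sunday_index(jan_first)) // 7 + 1,
        "week relative to month": (moment.day - 1 + sunday_index(month_first)) // 7 + 1,
        "day relative to year": day_of_year,
        "day relative to month": moment.day,
        "day relative to week": sunday_index(moment) + 1,
        "hour relative to day": moment.hour,
    }


components = extract_components(datetime(2012, 4, 21))
for name, value in components.items():
    print(f"{name}: {value}")
```

For 21 April 2012 the sketch yields 2012, 42, 1, 2, 4, 1, 16, 3, 112, 21, 7 and 0, in the order listed in the dictionary.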
Format Numbers
This operator reformats the selected numerical attributes according to the specified format and changes the attributes to nominal.
Description
This operator parses numerical values and formats them into the specified format. The format is
specified by the format type parameter. It supports different kinds of number formats including
integers (e.g. 123), fixed-point numbers (e.g. 123.4), scientific notation (e.g. 1.23E4), percentages (e.g. 12%), and currency amounts (e.g. $123). Please note that this operator only works on
numerical attributes and the result is always a nominal attribute, even if the resulting
format is a number which could be parsed again.
If the format type parameter is set to ‘pattern’, the pattern parameter is used for defining the
format. If two different formats for positive and negative numbers should be used, those formats
can be defined by separating them with a semicolon ‘;’. The pattern parameter provides a lot of
flexibility for defining the pattern. Important structures that can be used for defining a pattern
are listed below. The structures in curly braces are optional.
• pattern := subpattern{;subpattern}
• subpattern := {prefix}integer{.fraction}{suffix}
• prefix := any character combination including whitespace
• suffix := any character combination including whitespace
• integer := #* 0* 0
• fraction := 0* #*
0* and #* stand for multiple 0 or # respectively. 0 and # perform similar functions, but 0 ensures
that all numbers have the same length, i.e. if a digit is missing it is replaced by 0. For example, 54
is formatted to 0054 with the pattern ‘0000’ and to 54 with the pattern ‘####’.
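The difference between 0 and # can be sketched in Python for the simplest case of an integer-only pattern. The function format_integer is a hypothetical helper that handles only non-negative integers and patterns made of ‘#’ and ‘0’; it is not the operator's implementation.

```python
def format_integer(value: int, pattern: str) -> str:
    """Format a non-negative integer with a pattern made only of '#' and '0'."""
    if value < 0 or not pattern or set(pattern) - {"#", "0"}:
        raise ValueError("sketch supports non-negative values and '#'/'0' only")
    minimum_digits = pattern.count("0")   # each '0' forces one digit
    return str(value).zfill(minimum_digits)


print(format_integer(54, "0000"))  # 0054
print(format_integer(54, "####"))  # 54
print(format_integer(54, "##00"))  # 54
```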
The following placeholders can be used within the pattern parameter:
• . placeholder for decimal separator.
• , placeholder for grouping separator.
• E separates mantissa and exponent for exponential formats.
• - default negative prefix.
• % multiply by 100 and show as percentage.
• ’ used to quote special characters in a prefix or suffix.
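Two of the placeholders above, the percent sign and the grouping separator, can be illustrated with Python's built-in format specifiers. The helper names are hypothetical and the sketch assumes a comma as grouping separator and a period as decimal separator; it does not reproduce the operator's pattern parser.

```python
def format_percent(value: float, decimals: int = 0) -> str:
    """Multiply by 100 and append '%', as described for the '%' placeholder."""
    return f"{value * 100:.{decimals}f}%"


def format_grouped(value: int) -> str:
    """Insert a grouping separator (assumed to be ',') every three digits."""
    return f"{value:,}"


print(format_percent(0.12))       # 12%
print(format_percent(0.1234, 1))  # 12.3%
print(format_grouped(1234567))    # 1,234,567
```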
The locale parameter is ignored when the format type parameter is set to ‘pattern’. In other
cases it plays its role e.g. if the format type parameter is set to ‘currency’ then the locale parameter
specifies the notation for that currency (i.e. dollar, euro etc).
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can also
be used as input. The ExampleSet should have at least one numerical attribute because if
there is no such attribute, the use of this operator does not make sense.
Output Ports
example set output (exa) The selected numerical attributes are reformatted and converted
to nominal and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes to which the number formatting will be
applied; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but may need some explanation for beginners. It is advisable to specify the regular expression through the edit and preview
regular expression menu. This menu lets you try different expressions and preview the
results immediately, which is a good way to become familiar with regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example, the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. However, && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (> 10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’; for
example, ‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will be unselected
and ‘att2’ will be selected.
format type (selection) This parameter specifies the type of formatting to perform on the selected numerical attributes.
pattern (string) This parameter is only available when the format type parameter is set to ‘pattern’. This parameter specifies the pattern for formatting the numbers. Various structures
and replacement patterns for this parameter have been discussed in the description of this
operator.
locale (selection) This is an expert parameter. A long list of locales is provided; users can
select any of them.
use grouping (boolean) This parameter indicates if a grouping character should be used for
larger numbers.
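The rules given for the numeric condition parameter can be summarized in a small Python sketch. It is an illustration under stated assumptions rather than the operator's parser: the names COMPARATORS, parse_condition and attribute_is_selected are hypothetical, only the comparison symbols mentioned in this section are supported, and the sketch covers numeric attributes only (nominal attributes are always kept, as noted above).

```python
import operator
import re

# Assumed set of comparison symbols, based on the examples in this section.
COMPARATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq,
}


def parse_condition(condition: str):
    """Return a predicate for conditions such as '> 6 && < 11'."""
    if "&&" in condition and "||" in condition:
        raise ValueError("&& and || cannot be combined in one condition")
    combine = any if "||" in condition else all
    tests = []
    for part in re.split(r"&&|\|\|", condition):
        symbol, number = part.split()      # fails without the blank space
        tests.append((COMPARATORS[symbol], float(number)))
    return lambda value: combine(op(value, limit) for op, limit in tests)


def attribute_is_selected(values, condition: str) -> bool:
    """A numeric attribute is kept only if every value satisfies the condition."""
    predicate = parse_condition(condition)
    return all(predicate(v) for v in values)


print(attribute_is_selected([7, 8, 10], "> 6 && < 11"))  # True
print(attribute_is_selected([7, 12], "> 6 && < 11"))     # False
```

A condition written without the blank space, such as ‘<5’, makes the sketch raise an error, which mirrors the note that such conditions will not work.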
Tutorial Processes
Changing numeric values to currency format
Figure 2.11: Tutorial process ‘Changing numeric values to currency format’.
This process starts with the Generate Data operator which generates a random ExampleSet
with a numeric attribute named ‘att1’. A breakpoint is inserted here so that you can have a look
at the ExampleSet. The Format Numbers operator is applied on it to change the format of this
attribute to a currency format. The attribute filter type parameter is set to ‘single’ and the attribute parameter is set to ‘att1’ to select the required attribute. The format type parameter is set
to ‘currency’. Run the process and switch to the Results Workspace. You can see that the ‘att1’
attribute has been changed from numeric to nominal type and its values have a ‘$’ sign at the beginning because they have been converted to a currency format. The locale parameter specifies
the required currency. In this process the locale parameter was set to ‘English (United States)’;
therefore the numeric values were converted to the currency of the United States (i.e. dollar).
Guess Types
This operator (re-)guesses the value types of all attributes of the
input ExampleSet and changes them accordingly.
Description
The Guess Types operator can be used to (re-)guess the value types of the attributes of the input
ExampleSet. This might be useful after some preprocessing transformations and purification
of some of the attributes. This operator can be useful especially if nominal attributes can be
handled as numerical attributes after some preprocessing. It is not necessary to (re-)guess the
type of all the attributes with this operator. You can select the attributes whose type is to be (re-)guessed. Please study the attached Example Process for more information. Please note that
this operator has no impact on the values of the ExampleSet.
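The idea of guessing a value type from the values themselves can be sketched as follows. The manual does not describe the operator's actual guessing logic, so this Python example is purely an assumption-based illustration: the function guess_value_type is hypothetical, it distinguishes only ‘integer’, ‘real’ and ‘nominal’, and it treats None as a missing value.

```python
def guess_value_type(values):
    """Guess 'integer', 'real' or 'nominal' from string values (missing = None)."""
    present = [v for v in values if v is not None]
    if not present:
        return "nominal"

    def all_parse(converter):
        # True only if every present value can be converted.
        try:
            for v in present:
                converter(v)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    return "nominal"


print(guess_value_type(["1", "2", "3"]))      # integer
print(guess_value_type(["1.5", "2", None]))   # real
print(guess_value_type(["a", "2"]))           # nominal
```

As in the description above, only the type assigned to the column changes; the stored values are left untouched.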
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input.
Output Ports
example set output (exa) The type of the selected attributes of the input ExampleSet is (re-)guessed and the resultant ExampleSet is delivered through this output port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if the meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of the type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list. Attributes can be shifted
to the right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners.
It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and it also allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
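The rules for numeric conditions described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names parse_condition and attribute_passes are hypothetical, the set of supported comparison symbols is an assumption, and RapidMiner's own parser may behave differently in edge cases.

```python
import operator

# Comparison symbols assumed for this sketch.
_OPS = {">": operator.gt, ">=": operator.ge,
        "<": operator.lt, "<=": operator.le, "=": operator.eq}


def parse_condition(condition):
    """Turn e.g. '> 6 && < 11' into a predicate on a single number."""
    if "&&" in condition and "||" in condition:
        raise ValueError("&& and || cannot be combined in one condition")
    if "||" in condition:
        parts, combine = condition.split("||"), any
    else:
        parts, combine = condition.split("&&"), all

    tests = []
    for part in parts:
        tokens = part.split()  # a blank is required after the operator
        if len(tokens) != 2 or tokens[0] not in _OPS:
            raise ValueError(f"cannot parse {part.strip()!r}")
        tests.append((_OPS[tokens[0]], float(tokens[1])))

    return lambda x: combine(op(x, bound) for op, bound in tests)


def attribute_passes(values, condition):
    """True if every value of a numeric attribute satisfies the condition."""
    test = parse_condition(condition)
    return all(test(v) for v in values)


print(attribute_passes([7, 8, 10], "> 6 && < 11"))  # True
print(attribute_passes([7, 12], "> 6 && < 11"))     # False
```

In this sketch, a condition such as ‘<5’ (without the blank) raises an error, as does a condition that mixes && and ||.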
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions specified in this operator.
If this parameter is set to true, special attributes are also tested against the conditions specified in this operator and only those attributes are selected that satisfy the
conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all previously selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
first character index (integer) This parameter specifies the index of the first character of the
substring which should be kept. Please note that the counting starts with 1.
last character index (integer) This parameter specifies the index of the last character of the
substring which should be kept. Please note that the counting starts with 1.
decimal point character (char) The character specified by this parameter is used as the decimal character.
number grouping character (char) The character specified by this parameter is used as the
grouping character. This character is used for grouping the numbers. If this character
is found between numbers, the numbers are combined and this character is ignored. For
example if “22-14” is present in the ExampleSet and “-” is set as the number grouping character, then the number will be considered to be “2214”.
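The effect of the decimal point character and number grouping character parameters can be sketched as follows. The function parse_number is a hypothetical helper written for this illustration, not part of RapidMiner; it simply drops every occurrence of the grouping character and treats the decimal character as the decimal point.

```python
def parse_number(text, decimal_point=".", grouping=","):
    """Parse text as a number using the given decimal and grouping characters.

    Hypothetical helper: every occurrence of the grouping character is
    dropped, then the decimal character is normalised to '.'.
    """
    cleaned = text.replace(grouping, "").replace(decimal_point, ".")
    return float(cleaned)


print(parse_number("22-14", grouping="-"))                       # 2214.0
print(parse_number("1.234,5", decimal_point=",", grouping="."))  # 1234.5
```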
Tutorial Processes
Guessing the type of an attribute after preprocessing
Figure 2.12: Tutorial process ‘Guessing the type of an attribute after preprocessing’.
The ‘Iris’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you
can have a look at the ExampleSet. Please note the ‘id’ attribute. The ‘id’ attribute is of nominal
type and it has values of the format ‘id_1’, ‘id_2’ and so on. The Cut operator is applied on the
ExampleSet to remove the substring ‘id_’ from the start of the ‘id’ attribute values. A breakpoint
is inserted after the Cut operator. You can see that now the values in the ‘id’ attribute are of the
form ‘1’, ‘2’, ‘3’ and so on but the type of this attribute is still nominal. The Guess Types operator
is applied on this ExampleSet. The attribute filter type parameter is set to ‘single’, the attribute
parameter is set to ‘id’ and the include special attributes parameter is also set to ‘true’. Thus the
Guess Types operator will re-guess the type of the ‘id’ attribute. A breakpoint is inserted after
the Guess Types operator. You can see that the type of the ‘id’ attribute has now been changed
to integer.
Nominal to Binominal
This operator changes the type of selected nominal attributes to
a binominal type. It also maps all values of these attributes to binominal values.
Description
The Nominal to Binominal operator is used for changing the type of nominal attributes to a binominal type. This operator not only changes the type of selected attributes but it also maps
all values of these attributes to binominal values i.e. true and false. For example, if a nominal
attribute with name ‘costs’ and possible nominal values ‘low’, ‘moderate’, and ‘high’ is transformed, the result is a set of three binominal attributes ‘costs = low’, ‘costs = moderate’, and
‘costs = high’. For any given example, exactly one of these attributes is true and the others
are false. For instance, examples that had the value ‘low’ for the ‘costs’ attribute in the original
ExampleSet will have ‘costs = low’ set to ‘true’ and both ‘costs = moderate’ and ‘costs = high’
set to ‘false’ in the new ExampleSet. Numeric attributes
of the input ExampleSet remain unchanged.
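The mapping described above can be sketched in a few lines of Python. This is an illustration only, not RapidMiner's implementation: the function nominal_to_binominal and its use_underscore argument are hypothetical names modelled on the description, rows are represented as plain dictionaries, and Python booleans stand in for the values ‘true’ and ‘false’.

```python
def nominal_to_binominal(rows, attribute, use_underscore=False):
    """Replace one nominal column by one true/false column per value.

    rows is a list of dicts; new rows are returned and the input is
    left untouched. All other columns are copied unchanged.
    """
    values = sorted({row[attribute] for row in rows})
    sep = "_" if use_underscore else " = "
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != attribute}
        for value in values:
            # Exactly one of the new columns is True for each row.
            new_row[f"{attribute}{sep}{value}"] = (row[attribute] == value)
        result.append(new_row)
    return result


rows = [{"costs": "low", "amount": 3},
        {"costs": "moderate", "amount": 5},
        {"costs": "high", "amount": 9}]
for row in nominal_to_binominal(rows, "costs"):
    print(row)
```

Each output row carries the three new columns ‘costs = high’, ‘costs = low’ and ‘costs = moderate’, while the numeric ‘amount’ column is unchanged.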
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input. It is essential that meta data is attached to the input data because
attributes are specified in the meta data. The Retrieve operator provides meta data along with the data. The ExampleSet should have at least one nominal attribute; if there is
no such attribute, use of this operator does not make sense.
Output Ports
example set (exa) The ExampleSet with the selected nominal attributes converted to binominal
type is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes that you want to convert to binominal form. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of the type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are not selected.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list. Attributes can be shifted to
the right list which is the list of selected attributes on which the conversion from nominal
to binominal will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners. It
is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and it also allows you to
try different expressions and preview the results simultaneously, which helps in
understanding how regular expressions work.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if they match the previously mentioned type i.e. value type parameter’s
value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special
attributes are selected irrespective of the conditions in the Nominal to Binominal operator. If this parameter is set to true, special attributes are also tested against conditions
specified in the Nominal to Binominal operator and only those attributes are selected that
satisfy the conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all previously selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
transform binominal (boolean) This parameter indicates if attributes which are already binominal should be dichotomized i.e. they should be split in two columns with values true
and false.
use underscore in name (boolean) This parameter indicates if underscores should be used
in the new attribute names instead of empty spaces and ‘=’. Although the resulting names
are harder to read for humans it might be more appropriate to use these if the data should
be written into a database system.
Tutorial Processes
Nominal to Binominal conversion of attributes of Golf data set
Figure 2.13: Tutorial process ‘Nominal to Binominal conversion of attributes of Golf data set’.
This Example Process mostly focuses on the transform binominal parameter. All remaining
parameters are mostly for selecting the attributes. The Select Attributes operator also has many
similar parameters for selection of attributes. You can study the Example Process of the Select
Attributes operator if you want an understanding of these parameters.
The Retrieve operator is used to load the Golf data set. A breakpoint is inserted at this point
so that you can have a look at the data set before application of the Nominal to Binominal operator. You can see that the ‘Outlook’ attribute has three possible values i.e. ‘sunny’, ‘rain’ and
‘overcast’. The ‘Wind’ attribute has two possible values i.e. ‘true’ and ‘false’. All parameters
of the Nominal to Binominal operator are used with default values. Run the process. First you
will see the Golf data set. Press the run button again and you will see the final results. You can
see that the ‘Outlook’ attribute is replaced by three binominal attributes, one for each possible
value of the original ‘Outlook’ attribute. These attributes are ‘Outlook = sunny’, ‘Outlook =
rain’, and ‘Outlook = overcast’. For any given example, exactly one of these attributes is true
and the others are false. Examples whose ‘Outlook’ attribute had the
value ‘sunny’ in the original ExampleSet will have the ‘Outlook = sunny’ attribute set to
‘true’ in the new ExampleSet; the ‘Outlook = overcast’ and ‘Outlook = rain’ attributes
will be ‘false’. The numeric attributes of the input ExampleSet remain unchanged.
The ‘Wind’ attribute was not replaced by two binominal attributes, one for each possible value
of the ‘Wind’ attribute because this attribute is already binominal. Still if you want to break it
into two separate binominal attributes, this can be done by setting the transform binominal
parameter to true.
Nominal to Date
This operator converts the selected nominal attribute into the selected date time type. The nominal values are transformed into
date and/or time values. This conversion is done with respect to
the specified date format string.
Description
The Nominal to Date operator converts the selected nominal attribute of the input ExampleSet
into the selected date and/or time type. The attribute is selected by the attribute name parameter. The type of the resultant date and/or time attribute is specified by the date type parameter.
The nominal values are transformed into date and/or time values. This conversion is done with
respect to the specified date format string that is specified by the date format parameter. The
old nominal attribute will be removed and replaced by a new date and/or time attribute if the
keep old attribute parameter is not set to true.
Date and Time Patterns
This section explains the date and time patterns. Understanding date and time patterns is
especially necessary for specifying the date format string in the date format parameter. Within
date and time pattern strings, unquoted letters from ‘A’ to ‘Z’ and from ‘a’ to ‘z’ are interpreted
as pattern letters that represent the components of a date or time. Text can be quoted using
single quotes (’) to avoid interpretation as date or time components. All other characters are
not interpreted as date or time components; they are simply matched against the input string
during parsing.
Here is a brief description of the defined pattern letters. The format types like ‘Text’, ‘Number’,
‘Year’, ‘Month’ etc are described in detail after this section.
• G: This pattern letter is the era designator. For example: AD, BC etc. This pattern letter
follows the rules of ‘Text’ format type.
• y: This pattern letter represents year. yy represents year in two digits e.g. 96 and yyyy
represents year in four digits e.g. 1996. This pattern letter follows the rules of the ‘Year’
format type.
• M: This pattern letter represents the month of the year. This pattern letter follows the
rules of the ‘Month’ format type. A month can be represented as, for example, March, Mar
or 03.
• w: This pattern letter represents the week number of the year. This pattern letter follows
the rules of the ‘Number’ format type. For example, the first week of January can be represented as 01 and the last week of December can be represented as 52.
• W: This pattern letter represents the week number of the month. This pattern letter follows the rules of the ‘Number’ format type. For example, the first week of January can be
represented as 01 and the fourth week of December can be represented as 04.
• D: This pattern letter represents the day number of the year. This pattern letter follows the
rules of the ‘Number’ format type. For example, the first day of January can be represented
as 01 and last day of December can be represented as 365 (or 366 in case of a leap year).
• d: This pattern letter represents the day number of the month. This pattern letter follows the rules of the ‘Number’ format type. For example, the first day of January can be
represented as 01 and the last day of December can be represented as 31.
• F: This pattern letter represents the day of the week in the month (for example, 2 for the
second Wednesday of the month). This pattern letter follows
the rules of the ‘Number’ format type.
• E: This pattern letter represents the name of the day of the week. This pattern letter follows
the rules of the ‘Text’ format type. For example, Tuesday or Tue etc.
• a: This pattern letter represents the AM/PM portion of the 12-hour clock. This pattern
letter follows the rules of the ‘Text’ format type.
• H: This pattern letter represents the hour of the day (from 0 to 23). This pattern letter
follows the rules of the ‘Number’ format type.
• k: This pattern letter represents the hour of the day (from 1 to 24). This pattern letter
follows the rules of the ‘Number’ format type.
• K: This pattern letter represents the hour of the day for 12-hour clock (from 0 to 11). This
pattern letter follows the rules of the ‘Number’ format type.
• h: This pattern letter represents the hour of the day for 12-hour clock (from 1 to 12). This
pattern letter follows the rules of the ‘Number’ format type.
• m: This pattern letter represents the minutes of the hour (from 0 to 59). This pattern letter
follows the rules of the ‘Number’ format type.
• s: This pattern letter represents the seconds of the minute (from 0 to 59). This pattern
letter follows the rules of the ‘Number’ format type.
• S: This pattern letter represents the milliseconds of the second (from 0 to 999). This pattern letter follows the rules of the ‘Number’ format type.
• z: This pattern letter represents the time zone. This pattern letter follows the rules of
the ‘General Time Zone’ format type. Examples include Pacific Standard Time, PST, GMT-08:00 etc.
• Z: This pattern letter represents the time zone. This pattern letter follows the rules of the
‘RFC 822 Time Zone’ format type. Examples include -0800 etc.
Please note that all other characters from ‘A’ to ‘Z’ and from ‘a’ to ‘z’ are reserved. Pattern
letters are usually repeated, as their number determines the exact presentation. Here is the
explanation of various format types:
• Text: For formatting, if the number of pattern letters is 4 or more, the full form is used;
otherwise a short or abbreviated form is used (if available). For parsing, both forms are
acceptable independent of the number of pattern letters.
• Number: For formatting, the number of pattern letters is the minimum number of digits.
The numbers that are shorter than this minimum number of digits are zero-padded to this
amount. For example if the minimum number of digits is 3 then the number 5 will be
changed to 005. For parsing, the number of pattern letters is ignored unless it is needed
to separate two adjacent fields.
• Year: If the underlying calendar is the Gregorian calendar, the following rules are applied:
– For formatting, if the number of pattern letters is 2, the year is truncated to 2 digits;
otherwise it is interpreted as a ‘Number’ format type.
– For parsing, if the number of pattern letters is more than 2, the year is interpreted
literally, regardless of the number of digits. So using the pattern ‘MM/dd/yyyy’, the
string ‘01/11/12’ parses to ‘Jan 11, 12 AD’.
– For parsing with the abbreviated year pattern (‘y’ or ‘yy’), this operator must interpret the abbreviated year relative to some century. It does this by adjusting dates to be
within 80 years before and 20 years after the time the operator is created. For example, using a pattern of ‘MM/dd/yy’ and the operator created on Jan 1, 1997, the string
‘01/11/12’ would be interpreted as Jan 11, 2012 while the string ‘05/04/64’ would be
interpreted as May 4, 1964. During parsing, only strings consisting of exactly two digits will be parsed into the default century. Any other numeric string, such as a one
digit string, a three or more digit string, or a two digit string that is not all digits (for
example, ‘-1’), is interpreted literally. So ‘01/02/3’ or ‘01/02/003’ are parsed, using
the same pattern, as ‘Jan 2, 3 AD’. Likewise, ‘01/02/-3’ is parsed as ‘Jan 2, 4 BC’.
Otherwise, if the underlying calendar is not the Gregorian calendar, calendar system specific forms are applied. If the number of pattern letters is 4 or more, a calendar specific
long form is used. Otherwise, a calendar short or abbreviated form is used.
• Month: If the number of pattern letters is 3 or more, the month is interpreted as ‘Text’
format type otherwise, it is interpreted as a ‘Number’ format type.
• General time zone: Time zones are interpreted as ‘Text’ format type if they have names.
It is possible to define time zones by representing a GMT offset value. RFC 822 time zones
are also acceptable.
• RFC 822 time zone: For formatting, the RFC 822 4-digit time zone format is used. General
time zones are also acceptable.
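The century rule for two-digit years described under the ‘Year’ format type can be illustrated with a small Python sketch. It is a simplification that works at year granularity only (the rule above is stated relative to an exact point in time), and the function name expand_two_digit_year is a hypothetical helper, not part of RapidMiner.

```python
def expand_two_digit_year(yy, reference_year):
    """Map a two-digit year into the 100-year window that starts
    80 years before reference_year (year-level approximation)."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    start = reference_year - 80
    # First year at or after the window start whose last two digits equal yy.
    return start + (yy - start) % 100


print(expand_two_digit_year(12, 1997))  # 2012
print(expand_two_digit_year(64, 1997))  # 1964
```

With 1997 as the reference year the window covers 1917 to 2016, which reproduces the two examples given above for ‘01/11/12’ and ‘05/04/64’.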
This operator also supports localized date and time pattern strings by defining the locale parameter. In these strings, the pattern letters described above may be replaced with other, locale-dependent pattern letters.
The following examples show how date and time patterns are interpreted in the U.S. locale.
The given date and time are 2001-07-04 12:08:56 local time in the U.S. Pacific Time time zone.
• ‘yyyy.MM.dd G 'at' HH:mm:ss z’: 2001.07.04 AD at 12:08:56 PDT
• ‘EEE, MMM d, ''yy’: Wed, Jul 4, '01 (two consecutive single quotes produce a literal single quote)
• ‘h:mm a’: 12:08 PM
• ‘hh 'oclock' a, zzzz’: 12 oclock PM, Pacific Daylight Time
• ‘K:mm a, z’: 0:08 PM, PDT
• ‘yyyy.MMMMM.dd GGG hh:mm aaa’: 2001.July.04 AD 12:08 PM
• ‘EEE, d MMM yyyy HH:mm:ss Z’: Wed, 4 Jul 2001 12:08:56 -0700
• ‘yyMMddHHmmssZ’: 010704120856-0700
• ‘yyyy-MM-dd'T'HH:mm:ss.SSSZ’: 2001-07-04T12:08:56.235-0700
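To show how pattern letters map onto the fields of a date string, the following Python sketch translates a small numeric subset of the pattern letters into the directives of Python's datetime.strptime. This is an analogy for illustration, not how RapidMiner parses dates: the helper to_strptime and its mapping table are assumptions, and quoted text, text fields such as ‘EEE’, and time zones are deliberately not handled.

```python
import re
from datetime import datetime

# Only a few numeric pattern letters are covered (assumption for brevity).
_DIRECTIVES = {"yyyy": "%Y", "yy": "%y", "MM": "%m", "dd": "%d",
               "HH": "%H", "mm": "%M", "ss": "%S"}


def to_strptime(pattern):
    """Translate a small subset of the pattern letters to strptime directives."""
    def replace(match):
        run = match.group(0)
        if run not in _DIRECTIVES:
            raise ValueError(f"unsupported pattern letters: {run}")
        return _DIRECTIVES[run]

    # Each run of one repeated letter is one field, e.g. 'yyyy' or 'MM'.
    # All other characters are copied and matched literally.
    return re.sub(r"([A-Za-z])\1*", replace, pattern)


fmt = to_strptime("yyyy-MM-dd HH:mm:ss")
print(fmt)                                            # %Y-%m-%d %H:%M:%S
print(datetime.strptime("2001-07-04 12:08:56", fmt))  # 2001-07-04 12:08:56
```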
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input. It is essential that meta data should be attached with the data for the input because
attributes are specified in their meta data. The ExampleSet should have at least one nominal attribute because if there is no such attribute, the use of this operator does not make
sense.
Output Ports
example set (exa) The selected nominal attribute is converted to date type and the resultant
ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute name (string) The name of the nominal attribute that is to be converted to date
type is specified here.
date type (selection) This parameter specifies the type of the resultant attribute.
• date If the date type parameter is set to ‘date’, the resultant attribute will be of date
type. The time portion (if any) of the nominal attribute will be ignored.
• time If the date type parameter is set to ‘time’, the resultant attribute will be of time
type. The date portion (if any) of the nominal attribute will be ignored.
• date_time If the date type parameter is set to ‘date_time’, the resultant attribute will
be of date_time type.
date format This is the most important parameter of this operator. It specifies the date time
format of the selected nominal attribute. Date format strings are discussed in detail in the
description of this operator.
time zone (selection) This is an expert parameter. A long list of time zones is provided; users
can select any of them.
locale (selection) This is an expert parameter. A long list of locales is provided; users can
select any of them.
keep old attribute (boolean) This parameter indicates if the original nominal attribute should
be kept or if it should be discarded.
Tutorial Processes
Introduction to the Nominal to Date operator
Figure 2.14: Tutorial process ‘Introduction to the Nominal to Date operator’.
This Example Process starts with a subprocess. The subprocess delivers an ExampleSet with just
a single attribute. The name of the attribute is ‘deadline_date’. The type of the attribute is nominal. A breakpoint is inserted here so that you can view the ExampleSet. As you can see, all the
examples of this attribute have both date and time information. The Nominal to Date operator
is applied on this ExampleSet to change the type of the ‘deadline_date’ attribute from nominal
to date type. Have a look at the parameters of the Nominal to Date operator. The attribute name
parameter is set to ‘deadline_date’. The date type parameter is set to ‘date’. Thus the ‘deadline_date’ attribute will be converted from nominal to date type (not date_time); therefore the time
portion of the value will not be available in the resultant attribute. The date format parameter
is set to ‘EEEE, MMMM d, yyyy h:m:s a z’. Here is an explanation of this date format string:
• ‘E’ is the pattern letter used for the name of the day of the week. As explained in
the description, if the number of pattern letters is 4 or more, the full form is used. Thus ‘EEEE’
represents the day of the week in full form e.g. Monday, Tuesday etc.
• ‘M’ is the pattern letter used for the name of the month of the year. As explained in the
description, if the number of pattern letters is 4 or more, the full form is used. Thus ‘MMMM’
represents the month of the year in full form e.g. January, February etc.
• ‘y’ is the pattern letter used for the year portion of the date. ‘yyyy’ represents the year
in four digits like 2011, 2012 etc.
• ‘h’ is the pattern letter used for the hour portion of the time. ‘h’ can represent multiple
digit hours as well e.g. 10, 11 etc. The difference between ‘hh’ and ‘h’ is that ‘hh’ represents
single digit hours with a leading 0 e.g. 01, 02 and so on, whereas ‘h’ represents single digits
without any modification e.g. 1, 2 and so on.
• ‘m’ is the pattern letter used for the minute portion of the time. ‘m’ can represent multiple
digit minutes as well e.g. 51, 52 etc. The difference between ‘mm’ and ‘m’ is that ‘mm’ represents
single digit minutes with a leading 0 e.g. 01, 02 and so on, whereas ‘m’ represents single digits
without any modification e.g. 1, 2 and so on.
• ‘s’ is the pattern letter used for the second portion of the time. ‘s’ can represent multiple
digit seconds as well e.g. 40, 41 etc. The difference between ‘ss’ and ‘s’ is that ‘ss’ represents
single digit seconds with a leading 0 e.g. 01, 02 and so on, whereas ‘s’ represents single digits
without any modification e.g. 1, 2 and so on.
• ‘a’ is the pattern letter used for the ‘AM/PM’ portion of the 12-hour date and time.
• ‘z’ is the pattern letter used for the time zone.
Please note that this date format string represents the date format of the nominal values of the
selected nominal attribute of the input ExampleSet. The date format string helps RapidMiner
to understand which portions of the nominal value represent which component of the date or
time e.g. year, month etc.
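The role of such a format string can be illustrated outside RapidMiner. The following is only a minimal sketch in Python: strptime uses a different notation from the pattern letters described above, so the mapping given in the comments is an approximation, the sample value is hypothetical, and the time zone token ‘z’ is left out for simplicity.

```python
from datetime import datetime

# Hypothetical nominal value shaped like 'EEEE, MMMM d, yyyy h:m:s a'
# (the trailing time zone token 'z' is omitted in this sketch).
raw_value = "Monday, April 4, 2011 3:5:9 PM"

# Approximate strptime counterparts of the pattern letters:
# EEEE -> %A, MMMM -> %B, d -> %d, yyyy -> %Y,
# h -> %I, m -> %M, s -> %S, a -> %p
parsed = datetime.strptime(raw_value, "%A, %B %d, %Y %I:%M:%S %p")

# Keeping only the date mirrors the date type 'date': the time portion is dropped.
date_only = parsed.date()

print(parsed)
print(date_only)
```

The format string tells the parser which part of the text is the weekday, month, day and so on. Discarding the time afterwards corresponds to choosing ‘date’ rather than ‘date_time’ as the date type.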
Nominal to Numerical
This operator changes the type of selected non-numeric attributes
to a numeric type. It also maps all values of these attributes to
numeric values.
Description
The Nominal to Numerical operator changes the type of non-numeric attributes to a numeric type. It not only changes the type of the selected attributes but also maps all values of these attributes to numeric values. Binary attribute values are mapped to 0 and 1. Numeric attributes of the input ExampleSet remain unchanged. This operator provides three modes for conversion from nominal to numeric. The mode is selected by the coding type parameter. These coding types are explained in the parameters and in the Example Process.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input. Meta data must be attached to the input data because the attributes are specified in the meta data. The Retrieve operator provides meta data along with the data. The ExampleSet should have at least one non-numeric attribute; without one, applying this operator does not make sense.
Output Ports
example set (exa) The ExampleSet with the selected non-numeric attributes converted to numeric types is delivered through this port.
original (ori) The ExampleSet that was given as input is passed through this port without any changes. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes on which you want to apply nominal to numeric conversion. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the given numeric condition are selected. Please note that all nominal attributes are also selected irrespective of the given numeric condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes on which the conversion from nominal to
numeric will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions. It also allows you to try different expressions and preview the results simultaneously, which will improve your understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
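The interplay of the regular expression and except regular expression parameters can be sketched as follows. This is only an illustration in Python, not the operator's implementation: the function name is hypothetical, and it is assumed that an expression has to match the whole attribute name.

```python
import re

def select_by_regex(names, expression, except_expression=None):
    """Return the names matching `expression` but not `except_expression`.

    Assumption: an expression must match the whole attribute name.
    """
    selected = []
    for name in names:
        if re.fullmatch(expression, name) is None:
            continue  # does not match the first expression
        if except_expression is not None and re.fullmatch(except_expression, name):
            continue  # matches the first expression but is filtered out again
        selected.append(name)
    return selected

# Attribute names used purely as sample input.
attributes = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]

print(select_by_regex(attributes, r".*t.*"))
print(select_by_regex(attributes, r".*t.*", except_expression=r"Temp.*"))
```

The first call keeps every name that contains a lowercase ‘t’. The second call additionally drops the names that start with ‘Temp’.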
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
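The rules for numeric conditions can be made concrete with a small sketch. The Python code below is a hypothetical re-implementation for illustration only (function names and sample columns are assumptions). It rejects conditions that mix && and ||, requires a blank space after the comparison sign, and keeps nominal columns regardless of the condition.

```python
import operator
import re

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "=": operator.eq}

def parse_condition(condition):
    """Turn a string such as '> 6 && < 11' into a predicate on one number."""
    if "&&" in condition and "||" in condition:
        raise ValueError("&& and || cannot be used together in one condition")
    joiner, combine = ("||", any) if "||" in condition else ("&&", all)
    tests = []
    for part in condition.split(joiner):
        # At least one blank is required between the sign and the number.
        match = re.fullmatch(r"\s*(>=|<=|>|<|=)\s+(-?\d+(?:\.\d+)?)\s*", part)
        if match is None:
            raise ValueError(f"cannot parse {part!r}")
        tests.append((_OPS[match.group(1)], float(match.group(2))))
    return lambda value: combine(op(value, bound) for op, bound in tests)

def select_attributes(columns, condition):
    """Keep nominal columns and numeric columns whose every value passes."""
    passes = parse_condition(condition)
    kept = []
    for name, values in columns.items():
        numeric = all(isinstance(v, (int, float)) for v in values)
        if not numeric or all(passes(v) for v in values):
            kept.append(name)
    return kept

# Hypothetical columns: 'a' and 'b' are numeric, 'c' is nominal.
columns = {"a": [7, 8, 10], "b": [7, 3, 10], "c": ["x", "y", "z"]}
print(select_attributes(columns, "> 6 && < 11"))
```

Column ‘b’ is dropped because one of its values is not greater than 6, while the nominal column ‘c’ is kept without being tested.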
include special attributes (boolean) Special attributes are attributes with special roles which identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
coding type (selection) This parameter indicates the coding which will be used for transforming nominal attributes to numerical attributes. There are three available options i.e. unique
integers, dummy coding, effect coding. You can easily understand these options by studying the attached Example Process.
• unique_integers If this option is selected, the values of the nominal attribute are treated as equally ranked. The nominal attribute is simply turned into a real-valued attribute and the old values result in equidistant real values.
• dummy_coding If this option is selected, a new attribute is created for every value of the nominal attribute, excluding the comparison group. The comparison group can be defined using the comparison groups parameter. In every example, the new attribute which corresponds to the actual nominal value of that example gets value 1 and all other new attributes get value 0. If the value of the nominal attribute of this example corresponds to the comparison group, all new attributes are set to 0. Note that the comparison group is optional with ‘dummy coding’. If no comparison group is defined, a new attribute is created for every value of the nominal attribute; in every example the new attribute which corresponds to the actual nominal value of that example gets value 1 and all other new attributes get value 0. In this case, there will be no example where all new attributes get value 0. This can be easily understood by studying the attached Example Process.
• effect_coding If this option is selected, a new attribute is created for every value of the nominal attribute, excluding the comparison group. The comparison group can be defined using the comparison groups parameter. In every example, the new attribute which corresponds to the actual nominal value of that example gets value 1 and all other new attributes get value 0. If the value of the nominal attribute of this example corresponds to the comparison group, all new attributes are set to -1. Note that the comparison group is mandatory with ‘effect coding’. This can be easily understood by studying the attached Example Process.
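The three coding types described above can be sketched as follows. This is a minimal illustration in Python, not RapidMiner's implementation. The function names are hypothetical, and the integer assigned to each value (order of first appearance) is an assumption; in RapidMiner the numbers follow the internal mapping of the nominal attribute.

```python
def unique_integers(values):
    """Map each distinct nominal value to an integer (order of first appearance)."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

def indicator_coding(values, name, comparison_group=None, effect=False):
    """Dummy coding (effect=False) or effect coding (effect=True)."""
    if effect and comparison_group is None:
        raise ValueError("effect coding requires a comparison group")
    # One new attribute per distinct value, excluding the comparison group.
    levels = [v for v in dict.fromkeys(values) if v != comparison_group]
    # Examples in the comparison group get 0 (dummy) or -1 (effect) everywhere.
    fill = -1 if effect else 0
    columns = {}
    for level in levels:
        columns[f"{name}={level}"] = [
            fill if v == comparison_group else int(v == level) for v in values
        ]
    return columns

# Hypothetical sample values for a nominal attribute.
outlook = ["sunny", "overcast", "rain", "sunny"]

print(unique_integers(outlook))
print(indicator_coding(outlook, "Outlook"))
print(indicator_coding(outlook, "Outlook", comparison_group="sunny"))
print(indicator_coding(outlook, "Outlook", comparison_group="sunny", effect=True))
```

With ‘sunny’ as the comparison group no ‘Outlook=sunny’ column is produced. The rows that held ‘sunny’ contain only 0 under dummy coding and only -1 under effect coding.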
use comparison groups (boolean) This parameter is available only when the coding type parameter is set to dummy coding. If checked, a value has to be specified in the comparison groups parameter for each selected attribute in the ExampleSet. A separate new column for
this value will not appear in the final result set. If not checked, all values of the selected
attributes will result in an indicator attribute in the resulting ExampleSet.
comparison groups This parameter defines the comparison group for each selected non-numeric
attribute. Only one comparison group can be specified for one attribute. When the coding
type parameter is set to ‘effect coding’, it is compulsory to define a comparison group for
all selected attributes.
use underscore in name (boolean) This parameter indicates if underscores should be used
in the names of new attributes instead of empty spaces and ‘=’. Although the resulting
names are harder for humans to read, they may be more appropriate if the
data is to be written into a database system.
Tutorial Processes
Nominal to Numeric conversion through different coding types
This Example Process mostly focuses on the coding type and comparison groups parameters. All
remaining parameters are mostly for selecting the attributes. The Select Attributes operator also
has many similar parameters for the selection of attributes. You can study its Example Process
if you want an understanding of these parameters.
The Retrieve operator is used to load the ‘Golf’ data set. The Nominal to Numerical operator
is applied on it. The ‘Outlook’ and ‘Wind’ attributes are selected for this operator for changing
them to numeric attributes. Initially, the coding type parameter is set to ‘unique integers’. Thus,
the nominal attributes will simply be turned into real valued attributes; the old values will result
in equidistant real values. As you can see in the Results Workspace, all occurrences of value
‘sunny’ for the ‘Outlook’ attribute are replaced by 2. Similarly, ‘overcast’ and ‘rain’ are replaced
by 1 and 0 respectively. In the same way, all occurrences of ‘false’ value in the ‘Wind’ attribute
are replaced by 1 and occurrences of ‘true’ are replaced by 0.
Now, change the coding type parameter to ‘dummy coding’ and run the process again. As dummy coding is selected, a new attribute is created for every value of the nominal attribute. In every example, the new attribute which corresponds to the actual nominal value of that example gets value 1 and all other new attributes get value 0. As you can see in the Results Workspace, ‘Wind=true’ and ‘Wind=false’ attributes are created in place of the ‘Wind’ attribute. In all examples where the ‘Wind’ attribute had value ‘true’, the ‘Wind=true’ attribute gets value 1 and the ‘Wind=false’ attribute gets value 0. Similarly, in all examples where the ‘Wind’ attribute had value ‘false’, the ‘Wind=true’ attribute gets value 0 and the ‘Wind=false’ attribute gets value 1. The same principle applies to the ‘Outlook’ attribute.
Figure 2.15: Tutorial process ‘Nominal to Numeric conversion through different coding types’.
Now, keep the coding type parameter as ‘dummy coding’ and also set the use comparison groups parameter to true. Run the process again. You can see in the comparison groups parameter that ‘sunny’ and ‘true’ are defined as comparison groups for the ‘Outlook’ and ‘Wind’ attributes respectively. As dummy coding is used together with comparison groups, a new attribute is created for every value of the nominal attribute, excluding the comparison group. In every example, the new attribute which corresponds to the actual nominal value of that example gets value 1 and all other new attributes get value 0. If the value of the nominal attribute of this example corresponds to the comparison group, all new attributes are set to 0. This is why the ‘Outlook=rain’ and ‘Outlook=overcast’ attributes are created but an ‘Outlook=sunny’ attribute is not created this time. In examples where the ‘Outlook’ attribute had value ‘sunny’, all new Outlook attributes get value 0. You can see this in the Results Workspace. The same rule is applied to the ‘Wind’ attribute.
Now, change the coding type parameter to ‘effect coding’ and run the process again. You can see in the comparison groups parameter that ‘sunny’ and ‘true’ are defined as comparison groups for the ‘Outlook’ and ‘Wind’ attributes respectively. As effect coding is selected, a new attribute is created for every value of the nominal attribute, excluding the comparison group. In every example, the new attribute which corresponds to the actual nominal value of that example gets value 1 and all other new attributes get value 0. If the value of the nominal attribute of this example corresponds to the comparison group, all new attributes are set to -1. This is why the ‘Outlook=rain’ and ‘Outlook=overcast’ attributes are created but an ‘Outlook=sunny’ attribute is not created this time. In examples where the ‘Outlook’ attribute had value ‘sunny’, all new Outlook attributes get value -1. You can see this in the Results Workspace. The same rule is applied to the ‘Wind’ attribute.
Nominal to Text
This operator changes the type of selected nominal attributes to
text. It also maps all values of these attributes to corresponding
string values.
Description
The Nominal to Text operator converts the selected nominal attributes (by default, all of them) to text attributes. Each nominal
value is simply used as a string value of the new attribute. If the value is missing in the nominal
attribute, the new value will also be missing.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input. Meta data must be attached to the input data because the attributes are specified in the meta data. The ExampleSet should have at least one nominal attribute; without one, applying this operator does not make sense.
Output Ports
example set (exa) The ExampleSet with the selected nominal attributes converted to text is delivered through this port.
original (ori) The ExampleSet that was given as input is passed through this port without any changes. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes on which you want to apply nominal to text conversion. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the operator will fail if a selected attribute is not in the ExampleSet.)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the operator will fail if a selected attribute is not in the ExampleSet.)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the given numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numeric condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes on which the conversion from nominal to
text will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners.
It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and it also allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value. One of the
following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) Special attributes are attributes with special roles
which identify the examples. In contrast, regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attributes operator. If
this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attributes operator and only those attributes that satisfy the
conditions are selected.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
Tutorial Processes
Applying the Nominal to Text operator on the Golf data set
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted after the Retrieve operator so that you can have a look at the ‘Golf’ data set before application of the Nominal to Text operator. You can see that the ‘Golf’ data set has three nominal attributes i.e. ‘Play’,
‘Outlook’ and ‘Wind’. The Nominal to Text operator is applied on this data set. The attribute
filter type parameter is set to ‘single’ and the attribute parameter is set to ‘Outlook’. Thus this
operator converts the type of the ‘Outlook’ attribute to text. You can verify this by seeing the
results in the Meta Data View in the Results Workspace.
Figure 2.16: Tutorial process ‘Applying the Nominal to Text operator on the Golf data set’.
Numerical to Binominal
This operator changes the type of the selected numeric attributes
to a binominal type. It also maps all values of these attributes to
corresponding binominal values.
Description
The Numerical to Binominal operator changes the type of numeric attributes to a binominal
type (also called binary). This operator not only changes the type of selected attributes but it
also maps all values of these attributes to corresponding binominal values. Binominal attributes
can have only two possible values i.e. ‘true’ or ‘false’. If the value of an attribute is between the
specified minimal and maximal value, it becomes ‘false’, otherwise ‘true’. Minimal and maximal
values can be specified by the min and max parameters respectively. If the value is missing, the
new value will be missing. The default boundaries are both set to 0.0, thus only 0.0 is mapped
to ‘false’ and all other values are mapped to ‘true’ by default.
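The mapping rule can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, missing values are represented by None or NaN, and the boundaries are assumed to be inclusive, which is what the default behaviour described above (only 0.0 becomes ‘false’) suggests.

```python
import math

def numerical_to_binominal(values, min_value=0.0, max_value=0.0):
    """Map numbers inside [min_value, max_value] to 'false', all others to 'true'.

    Missing values (None or NaN) stay missing (None).
    """
    result = []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            result.append(None)
        elif min_value <= v <= max_value:
            result.append("false")
        else:
            result.append("true")
    return result

# Default boundaries: only 0.0 is mapped to 'false'.
print(numerical_to_binominal([0.0, 1.5, -2, None]))
# Custom boundaries: everything from -3 to 2 is mapped to 'false'.
print(numerical_to_binominal([0.0, 1.5, -2, 5], min_value=-3, max_value=2))
```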
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input. Meta data must be attached to the input data because the attributes are specified in the meta data. The Retrieve operator provides meta data along with the data. The ExampleSet should have at least one numeric attribute; without one, applying this operator does not make sense.
Output Ports
example set (exa) The ExampleSet with the selected numeric attributes converted to binominal
type is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes that you want to convert to binominal form. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the operator will fail if a selected attribute is not in the ExampleSet.)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the operator will fail if a selected attribute is not in the ExampleSet.)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are not selected.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list. Attributes can be shifted to
the right list, which is the list of selected attributes on which the conversion from numeric
to binominal will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners. It
is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and it also allows you to
try different expressions and preview the results simultaneously, which will improve your
understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
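The rules for the numeric condition can be illustrated with a minimal Python sketch. It is not RapidMiner's implementation: the helper names `parse_condition` and `attribute_passes` are hypothetical, and the sketch assumes that every clause consists of an operator, a blank space and a number.

```python
import operator

# Comparison operators accepted by this sketch. The set is an assumption
# based on the examples in the text, not a documented grammar.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "=": operator.eq}


def parse_condition(condition):
    """Turn a condition such as '> 6 && < 11' into a predicate on one value."""
    if "&&" in condition and "||" in condition:
        raise ValueError("&& and || cannot be mixed in one numeric condition")
    joiner, combine = ("||", any) if "||" in condition else ("&&", all)
    clauses = []
    for clause in condition.split(joiner):
        parts = clause.split()
        if len(parts) != 2 or parts[0] not in _OPS:
            # '<5' (no blank space after the operator) is rejected here.
            raise ValueError(f"malformed clause: {clause!r}")
        clauses.append((_OPS[parts[0]], float(parts[1])))
    return lambda value: combine(op(value, bound) for op, bound in clauses)


def attribute_passes(values, condition):
    """A numeric attribute is kept only if every example satisfies the condition."""
    predicate = parse_condition(condition)
    return all(predicate(v) for v in values)
```

In this sketch, `attribute_passes([7, 8, 10], "> 6 && < 11")` accepts the attribute, while a single value outside the range is enough to reject it.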
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast, regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special
attributes are selected irrespective of the conditions in the Numerical to Binominal operator. If this parameter is set to true, special attributes are also tested against the conditions
specified in the Numerical to Binominal operator and only those attributes are selected that
satisfy the conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
min (real) This parameter is used to set the lower bound of the range. The max parameter is
used to set the upper bound of the range. The attribute values that fall in this range are
mapped to ‘false’. The attribute values that do not fall in this range are mapped to ‘true’.
max (real) This parameter is used to set the upper bound of the range. The min parameter is
used to set the lower bound of the range. The attribute values that fall in this range are
mapped to ‘false’. The attribute values that do not fall in this range are mapped to ‘true’.
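The mapping controlled by min and max can be sketched in a few lines of Python. This is an illustration only: the function name `numerical_to_binominal` is hypothetical, and both the inclusive bounds and the pass-through of missing values are assumptions that the text does not confirm.

```python
import math


def numerical_to_binominal(values, min_value, max_value):
    """Map values inside [min_value, max_value] to 'false', all others to 'true'."""
    result = []
    for v in values:
        if v is None or math.isnan(v):
            # Assumption: a missing value stays missing.
            result.append(None)
        elif min_value <= v <= max_value:
            # Assumption: both bounds are inclusive.
            result.append("false")
        else:
            result.append("true")
    return result


# With min = 0.0 and max = 0.01, as in the tutorial process:
mapped = numerical_to_binominal([0.005, 0.02, 0.0, -1.0], 0.0, 0.01)
```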
Tutorial Processes
Converting numeric attributes of the Sonar data set to binominal attributes
Figure 2.17: Tutorial process ‘Converting numeric attributes of the Sonar data set to binominal attributes’.
This Example Process mostly focuses on the min and max parameters. All remaining parameters are mostly for selecting the attributes. The Select Attributes operator also has many similar
parameters for selection of attributes. You can study the Example Process of the Select Attributes
operator if you want an understanding of these parameters.
The ‘Sonar’ data set is loaded using the Retrieve operator. The Numerical to Binominal operator is applied on it. The min parameter is set to 0.0 and the max parameter is set to 0.01. All
other parameters are used with default values. The attribute filter type parameter is set to ‘all’,
thus all numeric attributes of the ‘Sonar’ data set will be converted to binominal type. As you
can see in the Results Workspace, before application of the Numerical to Binominal operator,
all attributes were of real type. After application of this operator they are now all changed to
binominal type. All attribute values that fall in the range from 0.0 to 0.01 are mapped to ‘false’;
all other values are mapped to ‘true’.
Numerical to Polynominal
This operator changes the type of selected numeric attributes to
a polynominal type. It also maps all values of these attributes to
corresponding polynominal values. This operator simply changes
the type of selected attributes; if you need a more sophisticated
normalization method please use the discretization operators.
Description
The Numerical to Polynominal operator is used for changing the type of numeric attributes to a
polynominal type. This operator not only changes the type of selected attributes but it also maps
all values of these attributes to corresponding polynominal values. It simply changes the type
of selected attributes i.e. every new numerical value is considered to be another possible value
for the polynominal attribute. In other words, each numerical value is simply used as nominal
value of the new attribute. As numerical attributes can have a huge number of different values
even in a small range, converting such a numerical attribute to polynominal form will generate a
huge number of possible values for the new attribute. Such a polynominal attribute may not be a
very useful one and it may increase memory usage significantly. If you need a more sophisticated
normalization method please use the discretization operators. The Discretization operators are
at: “Data Transformation/ Type Conversion/ Discretization”.
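The effect described above can be made concrete with a small Python sketch. It is not the operator's code: the function name `numerical_to_polynominal` is hypothetical, and using Python's `str` to build the nominal labels is an assumption about formatting.

```python
def numerical_to_polynominal(values):
    """Return (nominal_values, mapping); every distinct number becomes a category."""
    mapping = {}   # nominal label -> index of the category
    nominal = []
    for v in values:
        label = str(v)  # Assumption: the label is the number's text form.
        mapping.setdefault(label, len(mapping))
        nominal.append(label)
    return nominal, mapping


nominal, mapping = numerical_to_polynominal([0.02, 0.0371, 0.02, 0.0428])
# Four examples already produce three categories; a real-valued attribute
# with mostly distinct values yields almost one category per example.
```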
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input. Meta data must be attached to the input data because
attributes are specified in their meta data. The Retrieve operator provides meta data along with the data. The ExampleSet should have at least one numeric attribute because if there is
no such attribute, use of this operator does not make sense.
Output Ports
example set (exa) The ExampleSet with selected numeric attributes converted to nominal type
is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes that you want to convert to polynominal form. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are not selected.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numerical condition.
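The hierarchical matching used by the value_type option can be sketched as follows. Only the placement of real and integer under numeric is taken from the text; the nominal entries in the `PARENT` table, and the function names, are assumptions made for illustration.

```python
# Parent of each value type. "real" and "integer" under "numeric" is stated
# in the text; the nominal entries are assumptions for this sketch.
PARENT = {"real": "numeric", "integer": "numeric",
          "binominal": "nominal", "polynominal": "nominal"}


def is_a(value_type, ancestor):
    """True if value_type equals ancestor or lies below it in the hierarchy."""
    while value_type is not None:
        if value_type == ancestor:
            return True
        value_type = PARENT.get(value_type)
    return False


def select_by_value_type(attribute_types, value_type, except_value_type=None):
    """Keep attributes of the given type, minus those of the excepted type."""
    return [name for name, t in attribute_types.items()
            if is_a(t, value_type)
            and not (except_value_type and is_a(t, except_value_type))]


types = {"a": "real", "b": "integer", "c": "polynominal"}
numeric_only = select_by_value_type(types, "numeric")
real_only = select_by_value_type(types, "numeric", except_value_type="integer")
```

Selecting `numeric` picks up both `a` and `b`, and adding the exception removes the integer attribute again.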
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list. Attributes can be shifted to
the right list, which is the list of selected attributes on which the conversion from numeric
to polynominal will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool, but they need a detailed explanation for beginners. It
is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and also allows you to
try different expressions and preview the results simultaneously, which will improve your
understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
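How the regular expression and the except regular expression interact can be sketched with Python's `re` module. The function name `select_by_regex` is hypothetical, and requiring the whole attribute name to match is an assumption of this sketch.

```python
import re


def select_by_regex(names, expression, except_expression=None):
    """Keep names matching expression unless they also match except_expression."""
    selected = []
    for name in names:
        # Assumption: the expression must match the whole attribute name.
        if re.fullmatch(expression, name) is None:
            continue
        if except_expression is not None and re.fullmatch(except_expression, name):
            continue
        selected.append(name)
    return selected


names = ["attribute_1", "attribute_2", "attribute_10", "class"]
kept = select_by_regex(names, r"attribute_\d+", except_expression=r".*0")
```

Here `attribute_10` matches the first expression but is filtered out because it also matches the except expression.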
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast, regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special
attributes are selected irrespective of the conditions in the Numerical to Polynominal operator. If this parameter is set to true, special attributes are also tested against the conditions
specified in the Numerical to Polynominal operator and only those attributes are selected
that satisfy the conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
Tutorial Processes
Converting numeric attributes of the Sonar data set to polynominal attributes
Figure 2.18: Tutorial process ‘Converting numeric attributes of the Sonar data set to polynominal attributes’.
This Example Process mostly focuses on the working of this operator. All parameters of this
operator are mostly for selecting the attributes. The Select Attributes operator also has many
similar parameters for selection of attributes. You can study the Example Process of the Select
Attributes operator if you want an understanding of these parameters.
The ‘Sonar’ data set is loaded using the Retrieve operator. The Numerical to Polynominal
operator is applied on it. All parameters are used with default values. The attribute filter type
parameter is set to ‘all’, thus all numeric attributes of the ‘Sonar’ data set will be converted to
nominal type. As you can see in the Results Workspace, before application of the Numerical to
Polynominal operator, all attributes were of real type. After application of this operator they
are now all changed to nominal type. But if you have a look at the examples, they are exactly
the same i.e. just the type of the values has been changed not the actual values. Every new numerical value is considered to be another possible value for the polynominal attribute. In other
words, each numerical value is simply used as nominal value of the new attribute. As there is a
very large number of different values for almost all attributes in the ‘Sonar’ data set, converting
these attributes to polynominal form generates a huge number of possible values for the new
attributes. These new polynominal attributes may not be very useful and they may increase
memory usage significantly. In such a scenario it is always better to use a more sophisticated
normalization method i.e. the discretization operators.
Numerical to Real
This operator changes the type of the selected numerical attributes
to real type. It also maps all values of these attributes to real values.
Description
The Numerical to Real operator converts selected numerical attributes (especially the integer
attributes) to real valued attributes. Each integer value is simply used as a real value of the new
attribute. If the value is missing, the new value will be missing.
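A minimal Python sketch of this conversion, assuming missing values are represented by `None` (the function name `numerical_to_real` is hypothetical):

```python
def numerical_to_real(values):
    """Convert each value to a real (float); missing values stay missing."""
    return [None if v is None else float(v) for v in values]


converted = numerical_to_real([85, None, 70])
```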
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. The ExampleSet should have at least one non-real numerical attribute
because if there is no such attribute, the use of this operator does not make sense.
Output Ports
example set output (exa) The ExampleSet with selected numerical attributes converted to
real type is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes on which you want to apply numerical to real conversion. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When it is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
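As an illustration of the no_missing_values option in the list above, the following Python sketch keeps only attributes without any missing value. The column-wise dictionary layout, the use of `None` for a missing value and the function name `select_no_missing` are assumptions.

```python
def select_no_missing(columns):
    """Return names of attributes that have a value in every example."""
    return [name for name, values in columns.items()
            if all(v is not None for v in values)]


columns = {
    "Temperature": [85, 80, 83],
    "Humidity": [85, None, 78],   # one missing value excludes the attribute
}
complete = select_no_missing(columns)
```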
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes on which the conversion to
real type will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool, but they need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview
regular expression menu. This menu gives a good idea of regular expressions and
also allows you to try different expressions and preview the results simultaneously, which
will improve your understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
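The NOT-gate behaviour of invert selection can be sketched in Python as a set complement. The function name `apply_selection` is hypothetical and the sketch ignores special attributes.

```python
def apply_selection(all_attributes, selected, invert=False):
    """Return the selected attributes, or their complement when invert is True."""
    chosen = set(selected)
    # (a in chosen) != invert keeps a when exactly one of the two holds.
    return [a for a in all_attributes if (a in chosen) != invert]


normal = apply_selection(["att1", "att2"], ["att1"])
inverted = apply_selection(["att1", "att2"], ["att1"], invert=True)
```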
Tutorial Processes
Integer to real conversion of attributes of the Golf data set
Figure 2.19: Tutorial process ‘Integer to real conversion of attributes of the Golf data set’.
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the type of the Humidity and Temperature attributes is integer. The Numerical to Real operator is applied on the ‘Golf’ data set to
convert the type of these integer attributes to real. All parameters are used with default values.
The resultant ExampleSet can be seen in the Results Workspace. You can see that now the type
of these attributes is real.
Parse Numbers
This operator changes the type of selected nominal attributes to a
numeric type. It also maps all values of these attributes to numeric
values by parsing the numbers if possible.
Description
The Parse Numbers operator is used for changing the type of nominal attributes to a numeric
type. This operator not only changes the type of selected attributes but it also maps all values of
these attributes to numeric values by parsing the numbers if possible. In contrast to the Nominal
to Numerical operator, this operator directly parses numbers from values that were wrongly encoded
as nominal values. The Nominal to Numerical operator is used when the values are actually nominal but you want to change them to numerical values. On the other hand, the Parse Numbers
operator is used when the values should actually be numerical but they are wrongly stored as
nominal values. Please note that this operator will first check the stored nominal mappings for
all attributes. If (old) mappings are still stored which actually are nominal (without the corresponding data being part of the ExampleSet), the attribute will not be converted. Please use the
Guess Types operator in these cases.
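The idea of parsing numbers that are stored as nominal values can be sketched in Python. The arguments mirror the decimal character, grouped digits and grouping character parameters of this operator, but the function name `parse_number`, the default values and the choice to return `None` for unparsable text are assumptions of this sketch, not documented behaviour.

```python
def parse_number(text, decimal_character=".", grouped_digits=False,
                 grouping_character=","):
    """Parse one nominal value into a float, or return None if that fails."""
    if text is None:
        return None
    cleaned = text.strip()
    if grouped_digits:
        # The grouping character is ignored and the digits are combined.
        cleaned = cleaned.replace(grouping_character, "")
    if decimal_character != ".":
        if "." in cleaned:
            return None  # a stray '.' is not a valid decimal mark here
        cleaned = cleaned.replace(decimal_character, ".")
    try:
        return float(cleaned)
    except ValueError:
        # Assumption: values that cannot be parsed become missing.
        return None


parsed = [parse_number(v, grouped_digits=True) for v in ["3.14", "1,234.5", "abc"]]
```

A caller would apply `parse_number` to every value of a selected nominal attribute and store the results in a new numeric attribute.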
Differentiation
• Nominal to Numerical The Nominal to Numerical operator provides various coding types
to convert nominal attributes to numerical attributes. On the other hand the Parse Numbers operator is used when the values should actually be numerical but they are wrongly
stored as nominal values. See page 155 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input. The ExampleSet should have at least one nominal attribute because if
there is no such attribute, the use of this operator does not make sense.
Output Ports
example set output (exa) The ExampleSet with selected nominal attributes converted to numeric types is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes on which the conversion from nominal to
numeric will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool, but they need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview
regular expression menu. This menu gives a good idea of regular expressions and
also allows you to try different expressions and preview the results simultaneously, which
will improve your understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
decimal character (char) This character is used as the decimal character.
grouped digits (boolean) This option decides whether grouped digits should be parsed or not.
If this option is set to true, grouping character parameter should be specified.
grouping character (char) This character is used as the grouping character. If this character
is found between numbers, the numbers are combined and this character is ignored. For
example if “22-14” is present in the nominal attribute and “-” is set as grouping character,
then “2214” will be stored in the corresponding numerical attribute.
unparsable value handling (selection) This selects the method for handling occurrences of
values which are not parsable to numbers. The unparsable value can either be skipped,
treated as an error or replaced with a missing value.
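The parsing parameters above (decimal character, grouped digits, grouping character, unparsable value handling) can be illustrated with a minimal Python sketch. This is not RapidMiner's implementation; the function name `parse_number`, the mode names `"fail"`, `"skip"` and `"replace_with_missing"`, and the choice to represent a skipped value as `None` are assumptions made for illustration only.

```python
import math


def parse_number(text, decimal_character=".", grouped_digits=False,
                 grouping_character=",", unparsable="replace_with_missing"):
    """Parse one nominal value into a number (hypothetical helper)."""
    cleaned = text.strip()
    if grouped_digits:
        # The grouping character is ignored and the digits around it are joined.
        cleaned = cleaned.replace(grouping_character, "")
    if decimal_character != ".":
        # Normalize the decimal character so that float() can read the value.
        cleaned = cleaned.replace(decimal_character, ".")
    try:
        return float(cleaned)
    except ValueError:
        if unparsable == "fail":
            raise  # treat the unparsable value as an error
        if unparsable == "skip":
            return None  # assumption: the caller leaves this value out
        return math.nan  # replace the unparsable value with a missing value


# Example from the grouping character description: "22-14" becomes 2214.
print(parse_number("22-14", grouped_digits=True, grouping_character="-"))
```

With a comma as decimal character, `parse_number("3,5", decimal_character=",")` would yield 3.5 in this sketch.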
Related Documents
• Nominal to Numerical (page 155)
Tutorial Processes
Nominal to Numeric conversion by the Parse Numbers operator
Figure 2.20: Tutorial process ‘Nominal to Numeric conversion by the Parse Numbers operator’.
This Example Process starts with a Subprocess operator. The Subprocess operator provides
an ExampleSet as its output. The ExampleSet has some nominal attributes, but these
attributes actually store numerical values, so their type should be numerical. A breakpoint is inserted
here so that you can have a look at the ExampleSet. To convert these nominal attributes to numerical attributes the Parse Numbers operator is applied. All parameters are used with default values. The resultant ExampleSet can be
seen in the Results Workspace. You can see that the type of all attributes has been changed from
nominal to numerical.
Real to Integer
This operator changes the type of the selected real attributes to
integer type. It also maps all values of these attributes to integer
values.
Description
The Real to Integer operator converts selected real attributes to integer valued attributes. Each
real value is either cut or rounded off and then used as an integer value of the new attribute.
This option is controlled by the round values parameter. If it is set to false, the decimal portion
of the real value is simply truncated otherwise it is rounded off. If the real value is missing, the
new integer value will be missing.
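The difference between truncating and rounding can be sketched in Python as follows. The helper name `real_to_integer` is hypothetical, missing values are represented here as `None` or NaN, and the handling of exact ties (values ending in .5) is an assumption, since the text above does not specify it.

```python
import math


def real_to_integer(value, round_values=False):
    """Convert one real value to an integer (hypothetical helper)."""
    if value is None or math.isnan(value):
        return None  # a missing real value stays missing
    if round_values:
        # Assumption: ties such as 2.5 are rounded up.
        return math.floor(value + 0.5)
    # round values = false: the decimal portion is simply cut off.
    return math.trunc(value)


values = [5.1, 4.9, -2.7, None]
truncated = [real_to_integer(v) for v in values]
rounded = [real_to_integer(v, round_values=True) for v in values]
print(truncated)  # [5, 4, -2, None]
print(rounded)    # [5, 5, -3, None]
```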
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. The ExampleSet should have at least one real attribute because if there
is no such attribute, the use of this operator does not make sense.
Output Ports
example set output (exa) The ExampleSet with selected real attributes converted to integer
type is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When it is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes on which the conversion from real to
integer will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a powerful tool, but beginners may need a detailed explanation of them. It is best to specify the regular expression through the edit and preview
regular expression menu, which lets you try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
round values (boolean) This parameter indicates if the values should be rounded off for conversion from real to integer. If not set to true, then the decimal portion of real values is
simply truncated to convert the real values to integer values.
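The rules of the numeric condition parameter described above can be sketched in Python. This is an illustration under stated assumptions, not the operator's actual parser: the function names are hypothetical, the supported comparison symbols (`<`, `<=`, `>`, `>=`, `=`) are assumed, and only the documented rules are modeled (conditions are combined with either `&&` or `||` but not both, and a blank space must follow the comparison symbol).

```python
import operator
import re

# One clause: a comparison symbol, at least one blank space, then a number.
_CLAUSE = re.compile(r"^(<=|>=|<|>|=)\s+(-?\d+(?:\.\d+)?)$")
_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
        ">=": operator.ge, "=": operator.eq}


def parse_condition(condition):
    """Turn a numeric condition string into a test for a single value."""
    has_and = "&&" in condition
    has_or = "||" in condition
    if has_and and has_or:
        raise ValueError("&& and || cannot be used together")
    separator = "||" if has_or else "&&"
    clauses = []
    for part in condition.split(separator):
        match = _CLAUSE.match(part.strip())
        if match is None:
            raise ValueError("invalid clause: %r" % part)
        symbol, threshold = match.groups()
        clauses.append((_OPS[symbol], float(threshold)))
    combine = any if has_or else all
    return lambda value: combine(op(value, t) for op, t in clauses)


def attribute_selected(values, condition):
    """A numeric attribute is kept only if every example satisfies the condition."""
    test = parse_condition(condition)
    return all(test(v) for v in values)


print(attribute_selected([7, 8, 10], "> 6 && < 11"))  # True
print(attribute_selected([7, 12], "> 6 && < 11"))     # False
```

In this sketch a condition such as ‘<5’ raises an error because the blank space after ‘<’ is missing.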
Tutorial Processes
Real to integer conversion of attributes of the Iris data set
Figure 2.21: Tutorial process ‘Real to integer conversion of attributes of the Iris data set’.
The ‘Iris’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the ExampleSet has four real attributes
i.e. a1, a2, a3 and a4. The Real to Integer operator is applied on the ‘Iris’ data set to convert
the type of these real attributes to integer. All parameters are used with default values. The
resultant ExampleSet can be seen in the Results Workspace. You can see that now the type of
these attributes is integer.
Text to Nominal
This operator changes the type of selected text attributes to nominal. It also maps all values of these attributes to corresponding
nominal values.
Description
The Text to Nominal operator converts the selected text attributes to nominal attributes. Each text value
is simply used as a nominal value of the new attribute. If the value is missing in the text attribute,
the new value will also be missing.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input. Meta data must be attached to the input data
because attributes are specified in the meta data. The ExampleSet should have at
least one text attribute because if there is no such attribute, the use of this operator does
not make sense.
Output Ports
example set output (exa) The selected text attributes are converted to nominal and the resultant ExampleSet is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes on which you want to apply text
to nominal conversion. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes on which the conversion from text to
nominal will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a powerful tool, but beginners may need a detailed explanation of them.
It is best to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and it also allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value. One of the
following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attributes operator. If
this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attributes operator and only those attributes are selected that satisfy the
conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
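How the regular expression, except regular expression and invert selection parameters interact can be sketched in Python. The function `select_by_regex` is hypothetical, and the sketch assumes that an expression must match the whole attribute name; the text above does not state whether partial matches count.

```python
import re


def select_by_regex(names, regular_expression,
                    except_regular_expression=None, invert_selection=False):
    """Return the attribute names selected by the filter (hypothetical helper)."""
    selected = []
    for name in names:
        # Assumption: the expression has to match the complete attribute name.
        matches = re.fullmatch(regular_expression, name) is not None
        if matches and except_regular_expression is not None:
            # The exception filters out attributes that matched the first expression.
            if re.fullmatch(except_regular_expression, name) is not None:
                matches = False
        # invert selection acts as a NOT gate on the result.
        if matches != invert_selection:
            selected.append(name)
    return selected


names = ["att1", "att2", "att3", "label"]
print(select_by_regex(names, "att.*", "att3"))
# ['att1', 'att2']
print(select_by_regex(names, "att.*", "att3", invert_selection=True))
# ['att3', 'label']
```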
Tutorial Processes
Introduction to the Text to Nominal operator
Figure 2.22: Tutorial process ‘Introduction to the Text to Nominal operator’.
This Example Process starts with the Subprocess operator which provides an ExampleSet. A
breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that
the ExampleSet has three text attributes i.e. ‘att1’, ‘att2’ and ‘att3’. The Text to Nominal operator is applied on this data set. The attribute filter type parameter is set to ‘single’ and the
attribute parameter is set to ‘att1’. Thus this operator converts the type of the ‘att1’ attribute
from text to nominal. You can verify this by seeing the results in the Meta Data View in the
Results Workspace.
2.1.3 Selection
Remove Attribute Range
This operator removes a range of attributes from the given ExampleSet.
Description
The Remove Attribute Range operator removes the attributes within the specified range. The
first and last attribute of the range are specified by the first attribute and last attribute parameters. All attributes in this range (including first and last attribute) will be removed from the
ExampleSet. It is important to note that the attribute range starts from 1. This is a little different from the way attributes are counted in the Table Index where counting starts from 0. So,
first and last attributes should be specified carefully.
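The 1-based, inclusive range can be sketched in Python, whose list indices start at 0 like the Table Index. The function name `remove_attribute_range` and the attribute list are illustrative assumptions.

```python
def remove_attribute_range(attributes, first_attribute, last_attribute):
    """Remove attributes first_attribute..last_attribute (1-based, inclusive)."""
    if first_attribute < 1 or last_attribute < first_attribute:
        raise ValueError("the range must start at 1 or later and must not be empty")
    # Position k in the 1-based range corresponds to 0-based index k - 1.
    return attributes[:first_attribute - 1] + attributes[last_attribute:]


attributes = ["Outlook", "Temperature", "Humidity", "Wind"]
print(remove_attribute_range(attributes, 1, 2))
# ['Humidity', 'Wind']
```

Setting the parameters to 1 and 2 removes the attributes whose Table Index is 0 and 1.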
Differentiation
• Select Attributes Provides a lot of options for selecting desired attributes e.g. on the
basis of type, block, numerical value or even regular expressions. See page 195 for details.
• Remove Correlated Attributes Selects attributes on the basis of correlations of the attributes. See page 188 for details.
• Remove Useless Attributes Selects attributes on the basis of usefulness. Different usefulness measures are available e.g. numerical attributes with minimum deviation etc. See
page 191 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input.
Output Ports
example set output (exa) The ExampleSet with selected attributes removed from the original ExampleSet is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
first attribute (integer) The first attribute of the attribute range which should be removed is
specified through this parameter. The counting of attributes starts from 1.
last attribute (integer) The last attribute of the attribute range which should be removed is
specified through this parameter. The counting of attributes starts from 1.
Related Documents
• Select Attributes (page 195)
• Remove Correlated Attributes (page 188)
• Remove Useless Attributes (page 191)
Tutorial Processes
Removing the first two attributes of the Golf data set
Figure 2.23: Tutorial process ‘Removing the first two attributes of the Golf data set’.
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the Table Index of the Outlook attribute
is 0. The Table Index column can be seen if the Show column ‘Table Index’ option is selected
in the Meta Data View tab. The Table Index of the Temperature attribute is 1. The Remove Attribute Range operator is applied on the ‘Golf’ data set to remove the first two attributes. The
first attribute and last attribute parameters are set to 1 and 2 respectively to remove the first
two attributes. They were not set to 0 and 1 respectively because here attribute counting starts from 1 (instead of 0). The resultant ExampleSet
can be seen in the Results Workspace. You can see that the Outlook and Temperature attributes
have been removed from the ExampleSet.
Remove Correlated Attributes
This operator removes correlated attributes from an ExampleSet.
The correlation threshold is specified by the user. Correlation is a
statistical technique that can show whether and how strongly pairs
of attributes are related.
Description
A correlation is a number between -1 and +1 that measures the degree of association between two
attributes (call them X and Y). A positive value for the correlation implies a positive association.
In this case large values of X tend to be associated with large values of Y and small values of X tend
to be associated with small values of Y. A negative value for the correlation implies a negative
or inverse association. In this case large values of X tend to be associated with small values of
Y and vice versa.
Suppose we have two attributes X and Y, with means X’ and Y’ respectively and standard deviations S(X) and S(Y) respectively. The correlation is computed as summation from 1 to n of
the product (X(i)-X’).(Y(i)-Y’) and then dividing this summation by the product (n-1).S(X).S(Y)
where n is the total number of examples and i is the increment variable of summation. There
can be other formulas and definitions but let us stick to this one for simplicity.
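The formula described above can be written out as a short Python sketch. It assumes that S(X) and S(Y) are sample standard deviations (computed with n-1 in the denominator), which is the choice that keeps the result between -1 and +1; the function name `correlation` is illustrative.

```python
import math


def correlation(x, y):
    """Correlation of two equally long lists of numbers, as described above."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Assumption: S(X) and S(Y) are sample standard deviations (n - 1 denominator).
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    # Summation over i of the products (X(i) - X') * (Y(i) - Y').
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return total / ((n - 1) * s_x * s_y)


print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # close to +1: positive association
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # close to -1: inverse association
```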
As discussed earlier a positive value for the correlation implies a positive association. Suppose
that an X value was above average, and that the associated Y value was also above average. Then
the product (X(i)-X’).(Y(i)-Y’) would be the product of two positive numbers which would be
positive. If the X value and the Y value were both below average, then the product above would
be of two negative numbers, which would also be positive. Therefore, a positive correlation is
evidence of a general tendency that large values of X are associated with large values of Y and
small values of X are associated with small values of Y.
As discussed earlier a negative value for the correlation implies a negative or inverse association. Suppose that an X value was above average, and that the associated Y value was instead
below average. Then the product (X(i)-X’).(Y(i)-Y’) would be the product of a positive and a negative number which would make the product negative. If the X value was below average and the
Y value was above average, then the product above would also be negative. Therefore, a negative correlation is evidence of a general tendency that large values of X are associated with small
values of Y and small values of X are associated with large values of Y.
This operator can be used for removing correlated or uncorrelated attributes depending on
the parameter settings, especially the filter relation parameter. The procedure is quadratic in the
number of attributes i.e. for m attributes an m x m matrix of correlations is calculated. Please
note that this operator might fail to filter out some attributes that should be removed. For
example, it might not remove all negatively correlated attributes, because
the correlations in the complete m x m matrix are not recalculated and a pair is not
checked if one of its attributes was already marked for removal. This means
that for three attributes X, Y, and Z, Y might already have been ruled out by its negative
correlation with X and is then no longer able to rule out Z. The correlation function used in
this operator is the Pearson correlation. In order to get more stable results, the original, random,
and reverse orders of attributes are available.
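The order dependence described above can be illustrated with a small Python sketch. It is a simplified model, not the operator's actual code: the function `remove_correlated`, the choice to remove the second attribute of a pair, and the example correlation values are all assumptions. It only mirrors the documented behavior that a pair is skipped once one of its attributes has been marked for removal.

```python
def remove_correlated(names, corr, threshold=0.8, use_absolute=True):
    """Greedy filter over attribute pairs, in the given attribute order."""
    removed = set()
    for i, first in enumerate(names):
        if first in removed:
            continue  # an attribute marked for removal cannot rule out others
        for second in names[i + 1:]:
            if second in removed:
                continue
            value = corr[frozenset((first, second))]
            if use_absolute:
                value = abs(value)
            if value > threshold:  # corresponds to the filter relation 'greater'
                removed.add(second)  # assumption: the later attribute is removed
    return [name for name in names if name not in removed]


# Hypothetical correlations: X-Y and Y-Z are strong, X-Z is weak.
corr = {
    frozenset(("X", "Y")): -0.9,
    frozenset(("Y", "Z")): -0.9,
    frozenset(("X", "Z")): 0.1,
}
print(remove_correlated(["X", "Y", "Z"], corr))  # ['X', 'Z']
print(remove_correlated(["Y", "X", "Z"], corr))  # ['Y']
```

In the first order Y is ruled out by X and can no longer rule out Z, so Z stays. In the second order Y is processed first and rules out both X and Z.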
Correlated attributes are usually removed because they behave similarly and have a
similar impact on prediction calculations, so keeping all of them is redundant. Removing correlated attributes saves space and computation time in complex algorithms.
Moreover, it also makes processes easier to design, analyze and understand.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Filter
Examples operator in the attached Example Process. The output of other operators can
also be used as input.
Output Ports
example set output (exa) The (un-)correlated attributes are removed from the ExampleSet
and this ExampleSet is delivered through this output port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
correlation (real) This parameter specifies the correlation for filtering attributes. A correlation is a number between -1 and +1 that measures the degree of association between two
attributes (call them X and Y). A positive value for the correlation implies a positive association. In this case large values of X tend to be associated with large values of Y and
small values of X tend to be associated with small values of Y. A negative value for the
correlation implies a negative or inverse association. In this case large values of X tend to
be associated with small values of Y and vice versa.
filter relation (selection) Correlations of two attributes are compared at a time. One of the
two attributes is removed if their correlation fulfills the relation specified by this parameter.
attribute order (selection) The algorithm takes this attribute order to calculate correlations
and for filtering the attributes.
use absolute correlation (boolean) This parameter indicates if the absolute value of the correlations should be used for comparison.
Tutorial Processes
Removing correlated attributes from the Sonar data set
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can view the ExampleSet before further operators are applied on it. You can see that the
‘Sonar’ data set has 60 numerical attributes. The Correlation Matrix operator is applied on it
only so that you can view the correlation matrix of the ‘Sonar’ data set; it is not otherwise required here. The Remove Correlated Attributes operator is applied
on the ‘Sonar’ data set. The correlation parameter is set to 0.8. The filter relation parameter is
set to ‘greater’ and the attribute order parameter is set to ‘original’. Run the process and you will
see in the Results Workspace that 19 out of 60 numerical attributes of the ‘Sonar’ data set have
been removed. Now have a look at the correlation matrix generated by the Correlation Matrix
operator. You can see that most of the attributes with correlations above 0.8 have been removed
from the data set. Some such attributes remain because this operator may miss attributes that
should be filtered out. It cannot guarantee that all correlated attributes are removed: the
correlations of the complete m x m correlation matrix are not recalculated, and a pair is not checked if one of its attributes was already marked
for removal. Change the value of the attribute order parameter to ‘random’ and run the process
again. Compare these results with the previous ones. This time a different set of attributes is
removed from the data set. So, the order in which the attributes are processed may change
the output.
Figure 2.24: Tutorial process ‘Removing correlated attributes from the Sonar data set’.
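The order dependence described in this tutorial can be illustrated with a minimal Python sketch. It assumes a simple greedy pairwise filter, which is an illustration only and not the operator's actual implementation; the names `pearson` and `remove_correlated` and the three toy attributes are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def remove_correlated(columns, order, threshold=0.8):
    """Assumed greedy filter: walk the attributes in `order` and mark the
    later attribute of every pair whose correlation exceeds `threshold`.
    Pairs that involve an already marked attribute are skipped."""
    removed = set()
    for i, first in enumerate(order):
        if first in removed:
            continue
        for second in order[i + 1:]:
            if second in removed:
                continue
            if pearson(columns[first], columns[second]) > threshold:
                removed.add(second)
    return [name for name in order if name not in removed]


# Toy data: b correlates strongly with both a and c (about 0.894 each),
# while a and c correlate only moderately (0.6).
columns = {
    "a": [1, 2, 3, 4],
    "b": [3, 3, 7, 7],
    "c": [2, 1, 4, 3],
}

kept_original = remove_correlated(columns, ["a", "b", "c"])
kept_reordered = remove_correlated(columns, ["b", "a", "c"])
print(kept_original)   # ['a', 'c']
print(kept_reordered)  # ['b']
```

With the same data and threshold, processing the attributes in a different order keeps a different set, which is the effect the ‘random’ attribute order makes visible.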
Remove Useless Attributes
This operator removes useless attributes from an ExampleSet. The
thresholds for useless attributes are specified by the user.
Description
The Remove Useless Attributes operator removes four kinds of useless attributes:
1. Such nominal attributes where the most frequent value is contained in more than the specified ratio of all examples. The ratio is specified by the nominal useless above parameter.
This ratio is defined as the number of examples with most frequent attribute value divided
by the total number of examples. This property can be used for removing such nominal attributes where one value dominates all other values.
2. Such nominal attributes where the most frequent value is contained in less than the specified ratio of all examples. The ratio is specified by the nominal useless below parameter.
This ratio is defined as the number of examples with most frequent attribute value divided by the total number of examples. This property can be used for removing nominal
attributes with too many possible values.
3. Such numerical attributes where the Standard Deviation is less than or equal to a given deviation threshold. The numerical min deviation parameter specifies the deviation threshold. The Standard Deviation is a measure of how spread out values are. Standard Deviation
is the square root of the Variance which is defined as the average of the squared differences
from the Mean.
4. Such nominal attributes where the value of all examples is unique. This property can be
used to remove id-like attributes.
Please note that this is not an intelligent operator, i.e. it cannot figure out on its own whether
an attribute is useless or not. It simply removes those attributes that satisfy the criteria for uselessness defined by the user.
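The four criteria above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the operator's code: the function names and default values are hypothetical, and the population form of the Standard Deviation is used because it follows the definition given in the text (the operator itself may use the sample form).

```python
from collections import Counter
from statistics import pstdev


def most_frequent_ratio(values):
    """Examples with the most frequent value divided by all examples."""
    return Counter(values).most_common(1)[0][1] / len(values)


def is_useless_nominal(values, useless_above=1.0, useless_below=0.0,
                       remove_id_like=False):
    ratio = most_frequent_ratio(values)
    if ratio > useless_above:   # criterion 1: one value dominates
        return True
    if ratio < useless_below:   # criterion 2: too many possible values
        return True
    # criterion 4: every example has a different value (id-like)
    return remove_id_like and len(set(values)) == len(values)


def is_useless_numerical(values, min_deviation=0.0):
    # criterion 3: Standard Deviation less than or equal to the threshold
    return pstdev(values) <= min_deviation


dominated = is_useless_nominal(["x"] * 9 + ["y"], useless_above=0.8)
id_like = is_useless_nominal(["a", "b", "c", "d"], remove_id_like=True)
constant = is_useless_numerical([5.0, 5.0, 5.0])
spread = is_useless_numerical([1.0, 9.0], min_deviation=2.0)
print(dominated, id_like, constant, spread)  # True True True False
```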
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Filter
Examples operator in the attached Example Process. The output of other operators can
also be used as input.
Output Ports
example set output (exa) The attributes that satisfy the user-defined criteria for useless attributes are removed from the ExampleSet and this ExampleSet is delivered through this
output port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
numerical min deviation (real) The numerical min deviation parameter specifies the deviation threshold. Such numerical attributes where Standard Deviation is less than or equal
to this deviation threshold are removed from the input ExampleSet. The Standard Deviation is a measure of how spread out values are. Standard Deviation is the square root of
the Variance which is defined as the average of the squared differences from the Mean.
nominal useless above (real) The nominal useless above parameter specifies the ratio of the
number of examples with most frequent value to the total number of examples. Such nominal attributes where the ratio of the number of examples with most frequent value to the
total number of examples is more than this ratio are removed from the input ExampleSet.
This property can be used to remove such nominal attributes where one value dominates
all other values.
nominal remove id like (boolean) If this parameter is set to true, all such nominal attributes
where the value of all examples is unique are removed from the input ExampleSet. This
property can be used to remove id-like attributes.
nominal useless below (real) The nominal useless below parameter specifies the ratio of the
number of examples with most frequent value to the total number of examples. Such nominal attributes where the ratio of the number of examples with most frequent value to the
total number of examples is less than this ratio are removed from the input ExampleSet.
This property can be used to remove nominal attributes with too many possible values.
Tutorial Processes
Removing useless nominal attributes from an ExampleSet
Figure 2.25: Tutorial process ‘Removing useless nominal attributes from an ExampleSet’.
This Example Process explains how the nominal useless above and nominal useless below parameters can be used to remove useless nominal attributes. Please keep in mind that the Remove Useless Attributes operator removes those attributes that satisfy the user-defined criteria
for useless attributes.
The ‘Golf’ data set is loaded using the Retrieve operator. The Filter Examples operator is applied on it to filter the first 10 examples. This is done to just simplify the calculations for understanding this process. A breakpoint is inserted after the Filter Examples operator so that you
can see the ExampleSet before application of the Remove Useless Attributes operator. You can
see that the ExampleSet has 10 examples. There are 2 regular nominal attributes: ‘Outlook’ and
‘Wind’. The most frequent values in the ‘Outlook’ attribute are ‘rain’ and ‘sunny’; each occurs in
4 out of 10 examples, so their ratio is 0.4. The most frequent value in the ‘Wind’ attribute is
‘false’; it occurs in 7 out of 10 examples, so its ratio is 0.7.
The Remove Useless Attributes operator is applied on the ExampleSet. The nominal useless
above parameter is set to 0.6. Thus attributes where the ratio of most frequent value to total
number of examples is above 0.6 are removed from the ExampleSet. As the ratio of most frequent
value in the Wind attribute is greater than 0.6, it is removed from the ExampleSet.
The nominal useless below parameter is set to 0.5. Thus attributes where the ratio of most
frequent value to total number of examples is below 0.5 are removed from the ExampleSet. As
the ratio of most frequent value in the Outlook attribute is below 0.5, it is removed from the
ExampleSet.
This can be verified by seeing the results in the Results Workspace.
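The arithmetic of this tutorial can be checked with a short Python sketch. The value counts are taken from the text above; the third ‘Outlook’ value and the second ‘Wind’ value are placeholders, because the text does not name them.

```python
from collections import Counter

# Counts from the tutorial text; "other" and "true" are placeholder values.
outlook = ["rain"] * 4 + ["sunny"] * 4 + ["other"] * 2
wind = ["false"] * 7 + ["true"] * 3

nominal_useless_above = 0.6
nominal_useless_below = 0.5

ratios = {}
removed = []
for name, values in {"Outlook": outlook, "Wind": wind}.items():
    # ratio = examples with the most frequent value / total examples
    ratios[name] = Counter(values).most_common(1)[0][1] / len(values)
    if ratios[name] > nominal_useless_above or ratios[name] < nominal_useless_below:
        removed.append(name)

print(ratios)   # {'Outlook': 0.4, 'Wind': 0.7}
print(removed)  # ['Outlook', 'Wind']
```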
Removing useless numerical attributes from an ExampleSet
Figure 2.26: Tutorial process ‘Removing useless numerical attributes from an ExampleSet’.
This Example Process explains how the numerical min deviation parameter can be used to
remove useless numerical attributes. The numerical min deviation parameter specifies the deviation threshold. Such numerical attributes where the Standard Deviation is less than or equal
to this deviation threshold are removed from the input ExampleSet. The Standard Deviation
is a measure of how spread out values are. Standard Deviation is the square root of the Variance which is defined as the average of the squared differences from the Mean. Please keep
in mind that the Remove Useless Attributes operator removes those attributes that satisfy the
user-defined criteria for useless attributes.
The ‘Golf’ data set is loaded using the Retrieve operator. The Filter Examples operator is applied on it to filter the first 10 examples. This is done just to simplify the calculations for understanding this process. A breakpoint is inserted after the Filter Examples operator so that you
can see the ExampleSet before application of the Remove Useless Attributes operator. You can see
that it has 10 examples. There are 2 regular numerical attributes: ‘Temperature’ and ‘Humidity’. The Aggregate operator is applied on the ExampleSet to calculate and display the Standard
Deviations of both numerical attributes. This operator is inserted here only so that you can see the
Standard Deviations without calculating them yourself; it is not otherwise required
here. You can see that the Standard Deviations of the ‘Temperature’ and ‘Humidity’ attributes are
7.400 and 10.682 respectively.
The Remove Useless Attributes operator is applied on the original ExampleSet (the ExampleSet with the first 10 examples of the ‘Golf’ data set). The numerical min deviation parameter
is set to 9.0. Thus the numerical attributes where the Standard Deviation is less than or equal to 9.0 are
removed from the ExampleSet. As the Standard Deviation of the Temperature attribute is less
than 9.0, it is removed from the ExampleSet.
This can be verified by seeing the results in the Results Workspace.
Select Attributes
This operator selects which attributes of an ExampleSet should be
kept and which attributes should be removed. This is used in cases
when not all attributes of an ExampleSet are required; it helps you
to select required attributes.
Description
The need often arises to select attributes before applying some operators. This is especially true
for large and complex data sets. The Select Attributes operator lets you select required attributes
conveniently. Different filter types are provided to make attribute selection easy. Only the selected attributes will be delivered from the output port and the rest will be removed from the
ExampleSet.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator
in the attached Example Process. The output of other operators can also be used as input.
Meta data must be attached to the input data because attributes
are specified in their meta data. The Retrieve operator provides meta data along with the data.
Output Ports
example set (exa) The ExampleSet with the selected attributes is the output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet, no attributes are
removed. This is the default option.
• single This option allows the selection of a single attribute. When this option is selected another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows the selection of multiple attributes through a list. All attributes of ExampleSet are present in the list; required attributes can be easily selected. This option will not work if the meta data is not known. When this option is
selected another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for the
attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. The user should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected
some other parameters (value type, use value type exception) become visible in the
Parameters panel.
• block_type This option is similar in working to the value_type option. This option
allows the selection of all the attributes of a particular block type. It should be noted
that block types may be hierarchical. For example value_series_start and value_series_end block types both belong to the value_series block type. When this option is selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes that will make it to the output port; all
other attributes will be removed.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool, but beginners need a detailed explanation of them. It
is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and also allows you to
try different expressions and preview the results simultaneously. This will improve your
understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, numeric, integer, real, text, binominal, polynominal, file_path, date_time, date, time.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. the value type parameter’s value. One of the following types can be selected here: nominal, numeric, integer,
real, text, binominal, polynominal, file_path, date_time, date, time.
block type (selection) The Block type of attributes to be selected can be chosen from a drop
down list. One of the following types can be chosen: single_value, value_series, value_series_start, value_series_end, value_matrix, value_matrix_start, value_matrix_end, value_matrix_row_start.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type. One of the
following block types can be selected here: single_value, value_series, value_series_start,
value_series_end, value_matrix, value_matrix_start, value_matrix_end, value_matrix_row_start.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. However, && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||.
include special attributes (boolean) Special attributes are attributes with special roles which
identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special
attributes are delivered to the output port irrespective of the conditions in the Select Attributes operator. If this parameter is set to true, special attributes are also tested against
the conditions specified in the Select Attributes operator and only those attributes are selected
that satisfy the conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses
the selection. In that case all the selected attributes are removed and the previously removed
attributes are selected. For example, suppose attribute ‘att1’ is selected and attribute ‘att2’ is
removed before this parameter is set to true. After it is set to true, ‘att1’ will
be removed and ‘att2’ will be selected.
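The behaviour of the numeric condition parameter described above can be sketched in Python. This is a hedged illustration, not the operator's parser: the function names are hypothetical, only the comparison operators >, >=, < and <= are assumed to be supported, and an attribute is treated as numeric when all of its values are Python numbers.

```python
import operator
import re

_OPS = {">": operator.gt, ">=": operator.ge,
        "<": operator.lt, "<=": operator.le}


def parse_condition(condition):
    """Return a predicate for strings such as '> 6 && < 11'."""
    if "&&" in condition and "||" in condition:
        raise ValueError("&& and || cannot be combined in one condition")
    combine, separator = (all, "&&") if "&&" in condition else (any, "||")
    tests = []
    for part in condition.split(separator):
        match = re.fullmatch(r"\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*", part)
        if match is None:
            raise ValueError(f"cannot parse {part!r}")
        tests.append((_OPS[match.group(1)], float(match.group(2))))
    return lambda value: combine(op(value, bound) for op, bound in tests)


def select_attributes(columns, condition):
    """Keep nominal attributes, and numeric attributes whose values all pass."""
    keep = parse_condition(condition)
    selected = []
    for name, values in columns.items():
        is_numeric = all(isinstance(v, (int, float)) for v in values)
        if not is_numeric or all(keep(v) for v in values):
            selected.append(name)
    return selected


columns = {"size": [7, 8, 10], "age": [3, 9, 12], "color": ["red", "blue", "red"]}
between = select_attributes(columns, "> 6 && < 11")
outside = select_attributes(columns, "< 5 || > 8")
print(between)  # ['size', 'color']
print(outside)  # ['age', 'color']
```

The nominal attribute `color` is kept in both cases, matching the note that nominal attributes are selected irrespective of the numeric condition.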
Tutorial Processes
Selecting attributes by specifying regular expressions matching their names
In the given Example Process the Labor-Negotiations ExampleSet is loaded using the Retrieve
operator. Then the Select Attributes operator is applied on it. Have a look at the Parameters panel
of the Select Attributes operator. Here is a stepwise explanation of this process.
• At the bottom of the Parameters panel the include special attributes parameter is set to true. This means
that all special attributes are also checked against all the given conditions; they appear in
the output only if they pass all the conditions. The only special attribute in this ExampleSet is the ‘class’ attribute.
Although ‘class’ is a special attribute, it makes it to the output port only
if it passes the conditions, because the include special attributes parameter is set to true.
• The regular expression specified is w.*|.*y.*
• w.* means all attribute names starting with the letter ‘w’. wage-inc-1st, wage-inc-2nd, wage-inc-3rd and working-hours satisfy this condition.
• .*y.* means all attribute names that contain a ‘y’. standby-pay, statutory-holidays and longterm-disability-assistance satisfy this condition.
• | is the logical OR operator. So any attribute whose name
starts with ‘w’ or contains a ‘y’ satisfies this expression and is selected.
• The following attributes of the Labor-Negotiations data set satisfy this expression: wage-inc-1st, wage-inc-2nd,
wage-inc-3rd, working-hours, standby-pay, statutory-holidays, longterm-disability-assistance.
• The use except expression parameter is also set to true, which means that attributes that satisfy the
condition in the except regular expression parameter are removed.
• The regular expression for except regular expression is .*[0-9].*
• This expression matches any attribute whose name contains a digit. Three attributes satisfy this condition: wage-inc-1st, wage-inc-2nd, wage-inc-3rd.
• Thus these three attributes do not make it to the output port even though they satisfy the regular expression of the regular expression parameter.
• Finally we are left with the following four attributes: working-hours, standby-pay, statutory-holidays, longterm-disability-assistance. These
four attributes make it to the output port.
• Notice that the invert selection parameter is not set
to true. If it were set to true, all attributes other than these four would have made it to
the output port.
Figure 2.27: Tutorial process ‘Selecting attributes by specifying regular expressions matching their names’.
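The selection in this tutorial can be reproduced with Python's `re` module. This is a sketch under two assumptions: an expression must match the whole attribute name (hence `re.fullmatch`), and the names ‘duration’ and ‘pension’ are added only as illustrative non-matching attributes.

```python
import re

attribute_names = [
    "wage-inc-1st", "wage-inc-2nd", "wage-inc-3rd", "working-hours",
    "standby-pay", "statutory-holidays", "longterm-disability-assistance",
    "duration", "pension",  # assumed non-matching names, for illustration
]

regular_expression = r"w.*|.*y.*"         # starts with 'w' OR contains 'y'
except_regular_expression = r".*[0-9].*"  # contains a digit

selected = [
    name for name in attribute_names
    if re.fullmatch(regular_expression, name)
    and not re.fullmatch(except_regular_expression, name)
]
print(selected)
# ['working-hours', 'standby-pay', 'statutory-holidays',
#  'longterm-disability-assistance']

# invert selection: everything that was not selected above
inverted = [name for name in attribute_names if name not in selected]
print(inverted)
# ['wage-inc-1st', 'wage-inc-2nd', 'wage-inc-3rd', 'duration', 'pension']
```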
Select by Random
This operator selects a random subset of attributes of the given
ExampleSet.
Description
The Select by Random operator selects attributes randomly from the input ExampleSet. If the
use fixed number of attributes parameter is set to true, then the required number of attributes is
specified through the number of attributes parameter. Otherwise, a random number of attributes
is selected. The randomization can be changed by changing the seed value in the corresponding
parameters. This operator can be useful in combination with the Loop Parameters operator or
can be used as a baseline for significance test comparisons for feature selection techniques.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used
as input. Meta data must be attached to the input data because attributes are specified in their meta data. The Retrieve operator provides meta data
along with the data.
Output Ports
example set (exa) The ExampleSet with the selected attributes is the output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
use fixed number of attributes (boolean) This parameter specifies if a fixed number of attributes should be selected.
number of attributes (integer) This parameter is only available when the use fixed number
of attributes parameter is set to true. This parameter specifies the number of attributes
which should be randomly selected.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomization. Using the same value of local random seed will produce the same
ExampleSet. Changing the value of the local seed changes the randomization, thus the
ExampleSet will have a different set of attributes.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
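How these parameters interact can be sketched with Python's `random` module. This is an illustration only: the function name is hypothetical, Python's generator will not reproduce the subsets the operator picks, and the assumption that at least one attribute is kept when no fixed number is given is not stated in the text.

```python
import random


def select_by_random(attribute_names, use_fixed_number=False,
                     number_of_attributes=1, local_random_seed=None):
    # A seed of None lets Python seed the generator from the system.
    rng = random.Random(local_random_seed)
    if use_fixed_number:
        count = number_of_attributes
    else:
        # Assumption: a random count between 1 and the number of attributes.
        count = rng.randint(1, len(attribute_names))
    chosen = set(rng.sample(attribute_names, count))
    # Keep the original attribute order in the output.
    return [name for name in attribute_names if name in chosen]


names = [f"attribute_{i}" for i in range(1, 61)]
first = select_by_random(names, True, 10, local_random_seed=1992)
second = select_by_random(names, True, 10, local_random_seed=1992)
print(len(first), first == second)  # 10 True
```

Calling the function twice with the same seed returns the same subset; changing the seed changes the randomization.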
Tutorial Processes
Selecting random attributes from Sonar data set
Figure 2.28: Tutorial process ‘Selecting random attributes from Sonar data set’.
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so
that you can have a look at the ExampleSet. You can see the ExampleSet has 60 attributes. The
Select by Random operator is applied on this ExampleSet. The use fixed number of attributes
parameter is set to true and the number of attributes parameter is set to 10. Thus 10 attributes
will be selected randomly from the ‘Sonar’ data set. The resultant ExampleSet can be seen in
the Results Workspace.
Select by Weights
This operator selects only those attributes of an input ExampleSet
whose weights satisfy the specified criterion with respect to the
input weights.
Description
This operator selects only those attributes of an input ExampleSet whose weights satisfy the
specified criterion with respect to the input weights. Input weights are provided through the
weights input port. The criterion for attribute selection by weights is specified by the weight
relation parameter.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used
as input. Meta data must be attached to the input data because attributes are specified in their meta data. The Retrieve operator provides meta data
along with the data.
weights (wei) This port expects the attribute weights. There are numerous operators that provide attribute weights. The Weight by Correlation operator is used in the Example Process.
Output Ports
example set (exa) The ExampleSet with the selected attributes is the output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
weights (wei) The attribute weights that were provided at the weights input port are delivered
through this output port.
Parameters
weight relation Only those attributes are selected whose weights satisfy this relation.
• greater Attributes whose weights are greater than the weight parameter are selected.
• greater_equals Attributes whose weights are greater than or equal to the weight parameter are selected.
• equals Attributes whose weights are equal to the weight parameter are selected.
• less_equals Attributes whose weights are less than or equal to the weight parameter are
selected.
• less Attributes whose weights are less than the weight parameter are selected.
• top_k The k attributes with highest weights are selected. k is specified by the k parameter.
• bottom_k The k attributes with lowest weights are selected. k is specified by the k
parameter.
• all_but_top_k All attributes other than the k attributes with highest weights are selected. k is specified by the k parameter.
• all_but_bottom_k All attributes other than k attributes with lowest weights are selected. k is specified by the k parameter.
• top_p% The top p percent attributes with highest weights are selected. p is specified
by the p parameter.
• bottom_p% The bottom p percent attributes with lowest weights are selected. p is
specified by the p parameter.
weight This parameter is available only when the weight relation parameter is set to ‘greater’,
‘greater equals’, ‘equals’, ‘less equals’ or ‘less’. It specifies the value against which the attribute weights are compared.
k This parameter is available only when the weight relation parameter is set to ‘top k’, ‘bottom
k’, ‘all but top k’ or ‘all but bottom k’. It specifies the number of attributes k used by these relations.
p This parameter is available only when the weight relation parameter is set to ‘top p%’ or ‘bottom p%’. It specifies the percentage of attributes to select.
deselect unknown This is an expert parameter. This parameter indicates if attributes whose
weight is unknown should be removed from the ExampleSet.
use absolute weights This is an expert parameter. This parameter indicates if the absolute
values of the weights should be used for comparison.
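The weight relations listed above can be sketched in Python as follows. This is a minimal illustration, not the operator's implementation: the function name and default values are hypothetical, p is assumed to be a fraction between 0 and 1 with the resulting count rounded up, and the deselect unknown parameter is not modelled.

```python
import math
import operator

_COMPARISONS = {
    "greater": operator.gt,
    "greater_equals": operator.ge,
    "equals": operator.eq,
    "less_equals": operator.le,
    "less": operator.lt,
}


def select_by_weights(weights, relation, weight=1.0, k=10, p=0.5,
                      use_absolute_weights=False):
    """Return the attribute names whose weights satisfy `relation`."""
    values = {name: abs(w) if use_absolute_weights else w
              for name, w in weights.items()}
    if relation in _COMPARISONS:
        compare = _COMPARISONS[relation]
        return [name for name, w in values.items() if compare(w, weight)]

    ranked = sorted(values, key=values.get, reverse=True)  # highest first
    if relation in ("top_p%", "bottom_p%"):
        # Assumption: p is a fraction and the attribute count is rounded up.
        k = math.ceil(p * len(ranked))
        relation = relation.replace("p%", "k")
    k = min(max(k, 0), len(ranked))
    chosen = {
        "top_k": ranked[:k],
        "bottom_k": ranked[len(ranked) - k:],
        "all_but_top_k": ranked[k:],
        "all_but_bottom_k": ranked[:len(ranked) - k],
    }[relation]
    # Report the chosen attributes in their original order.
    return [name for name in weights if name in chosen]


weights = {"a1": 0.9, "a2": -0.7, "a3": 0.1, "a4": 0.4}
lowest_two = select_by_weights(weights, "bottom_k", k=2, use_absolute_weights=True)
above_half = select_by_weights(weights, "greater", weight=0.5, use_absolute_weights=True)
signed_above_half = select_by_weights(weights, "greater", weight=0.5)
print(lowest_two)         # ['a3', 'a4']
print(above_half)         # ['a1', 'a2']
print(signed_above_half)  # ['a1']
```

The last two calls differ only in use absolute weights: the attribute with weight -0.7 passes the ‘greater’ comparison only when absolute values are compared.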
Tutorial Processes
Selecting attributes from Sonar data set
Figure 2.29: Tutorial process ‘Selecting attributes from Sonar data set’.
The ‘Sonar’ data set is loaded using the Retrieve operator. The Weight by Correlation operator
is applied on it to generate attribute weights. A breakpoint is inserted here. You can see the
attributes with their weights here. The Select by Weights operator is applied next. The ‘Sonar’
data set is provided at the example set port and the weights calculated by the Weight by Correlation
operator are provided at the weights input port. The weight relation parameter is set to ‘bottom
k’ and the k parameter is set to 4. Thus the 4 attributes with the lowest weights are selected. As you
can see, ‘attribute_57’, ‘attribute_17’, ‘attribute_30’ and ‘attribute_16’ have the lowest weights,
thus these four attributes are selected. Note that the label attribute ‘class’ is also selected.
This is because attributes with special roles are selected irrespective of the weights condition.
Work on Subset
This operator selects a subset (one or more attributes) of the input ExampleSet and applies the operators in its subprocess on the
selected subset.
Description
The Work on Subset operator can be considered, to some extent, a blend of the Select Attributes and
Subprocess operators. The attributes are selected in the same way as by
the Select Attributes operator, and the subprocess of this operator works in the same way as the
Subprocess operator. A subprocess can be considered a small unit of a process in which any
operator or combination of operators can be applied. That is why a subprocess can also be defined as a chain of operators that is subsequently applied. For more information about subprocesses please study the Subprocess operator. Although the Work on Subset
operator has similarities with the Select Attributes and Subprocess operators, it provides some functionality that cannot be achieved by combining the Select
Attributes and Subprocess operators. Most importantly, this operator can merge the results of its
subprocess with the input ExampleSet such that the original subset is overwritten by the subset
received after processing in the subprocess. This merging is controlled by the
keep subset only parameter. This parameter is set to false by default, so merging is done by
default. If this parameter is set to true, only the result of the subprocess is returned by
this operator and no merging is done. In that case this operator behaves very similarly to the
combination of the Select Attributes and Subprocess operators. This can be understood easily by
studying the attached Example Process.
This operator can also deliver the additional results of the subprocess if desired. This can be
controlled by the deliver inner results parameter. Please note that this is a very powerful operator.
It can be used to create new preprocessing schemes by combining it with other preprocessing
operators. However, there are two major restrictions:
• Since the result of the subprocess will be combined with the rest of the input ExampleSet,
the number of examples is not allowed to be changed inside the subprocess.
• The changes in the role of an attribute will not be delivered outside the subprocess.
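The merging behaviour and the first restriction can be sketched in Python. This is an illustration under stated assumptions, not the operator's code: an ExampleSet is modelled as a dictionary that maps attribute names to equally long value lists, the subprocess is a plain function, and all names are hypothetical. Attribute roles are not modelled.

```python
def work_on_subset(example_set, subset_names, subprocess, keep_subset_only=False):
    """Apply `subprocess` to the selected attributes and merge the result."""
    n_examples = len(next(iter(example_set.values())))
    subset = {name: list(example_set[name]) for name in subset_names}
    result = subprocess(subset)
    # Restriction: the subprocess must not change the number of examples.
    if any(len(values) != n_examples for values in result.values()):
        raise ValueError("subprocess changed the number of examples")
    if keep_subset_only:
        return result
    merged = {}
    for name, values in example_set.items():
        if name not in subset_names:
            merged[name] = values        # attribute outside the subset
        elif name in result:
            merged[name] = result[name]  # overwritten by the processed subset
    for name, values in result.items():
        merged.setdefault(name, values)  # attributes created in the subprocess
    return merged


def double_values(subset):
    """Hypothetical subprocess: doubles every value of the subset."""
    return {name: [2 * v for v in values] for name, values in subset.items()}


data = {"a": [1, 2, 3], "b": [10, 20, 30], "label": ["x", "y", "x"]}
merged = work_on_subset(data, ["a"], double_values)
subset_only = work_on_subset(data, ["a"], double_values, keep_subset_only=True)
print(merged)       # {'a': [2, 4, 6], 'b': [10, 20, 30], 'label': ['x', 'y', 'x']}
print(subset_only)  # {'a': [2, 4, 6]}
```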
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input.
Output Ports
example set (exa) The result of the subprocess will be combined with the rest of the input
ExampleSet and delivered through this port. However if the keep subset only parameter is
set to true then only the result of the subprocess will be delivered.
through (thr) This operator can also deliver the additional results of the subprocess if desired.
This is controlled by the deliver inner results parameter. This port is used for delivering the additional results of the subprocess. The Work on Subset operator can have multiple through ports. When one through port is connected, another through port becomes
available which is ready to deliver another output (if any). The order of outputs remains
the same. The object passed at the first through port inside the subprocess of the Work on
Subset operator is delivered at the first through port of the operator.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose all examples satisfy the mentioned numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but require some explanation for beginners.
It is best to specify the regular expression through the edit and preview regular expression menu, which helps in understanding regular expressions and lets you
try different expressions and preview the results immediately.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) Special attributes are attributes with special roles,
which identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions of the attribute filter. If this parameter is set to true, special attributes
are also tested against the specified conditions and only those
attributes that satisfy the conditions are selected.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all the selected attributes are unselected and the previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
name conflict handling (selection) This parameter decides how to handle a conflict with names
when the Operator merges the Subset back to the ExampleSet. There are three possible
behaviors:
• error The Operator will show an error if there is any conflict.
• keep new If there is a conflict, the Operator will keep the one from the Subset. The
other one will be deleted.
• keep original If there is a conflict, the Operator will keep the one which is not in
the Subset. The other one will be deleted.
role conflict handling (selection) This parameter decides how to handle a conflict with roles
when the Operator merges the Subset back to the ExampleSet. There are three possible
behaviors:
• error The Operator will show an error if there is any conflict.
• keep new If there is a conflict, the Operator will keep the one from the Subset. The
other one will be deleted.
• keep original If there is a conflict, the Operator will keep the one which is not in
the Subset. The other one will be deleted.
keep subset only (boolean) The Work on Subset operator can merge the results of its subprocess with the input ExampleSet such that the original subset is overwritten by the subset
received after processing of the subset in the subprocess. This merging can be controlled
by the keep subset only parameter. This parameter is set to false by default. Thus merging
is done by default. If this parameter is set to true, then only the result of the subprocess
is returned by this operator and no merging is done.
deliver inner results (boolean) This parameter indicates if the additional results (other than
the input ExampleSet) of the subprocess should also be returned. If this parameter is set
to true then the additional results are delivered through the through ports.
remove roles (boolean) This parameter decides whether the roles of the special attributes in the
Subset are removed when entering the Subset.
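The interplay of the regular expression, except regular expression and invert selection parameters described above can be sketched in a few lines of Python. This is only an illustration of the documented semantics, not RapidMiner code: the function name select_attributes is hypothetical, and the sketch assumes that an expression has to match the whole attribute name.

```python
import re


def select_attributes(names, regex, except_regex=None, invert=False):
    """Return the attribute names chosen by a regular-expression filter.

    Assumption: an expression must match the complete attribute name.
    """
    selected = []
    for name in names:
        # Step 1: the name has to match the first regular expression.
        keep = re.fullmatch(regex, name) is not None
        # Step 2: names matching the except expression are filtered out again.
        if keep and except_regex is not None:
            keep = re.fullmatch(except_regex, name) is None
        # Step 3: invert selection swaps selected and unselected attributes.
        if invert:
            keep = not keep
        if keep:
            selected.append(name)
    return selected


names = ["att1", "att2", "att10", "label"]
print(select_attributes(names, r"att\d+"))
print(select_attributes(names, r"att\d+", except_regex=r"att1"))
print(select_attributes(names, r"att\d+", invert=True))
```

The handling of special attributes (include special attributes) is deliberately left out of this sketch.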
Tutorial Processes
Working on a subset of Golf data set
The ‘Golf’ data set is loaded using the Retrieve operator. Then the Work on Subset operator is applied on it. The attribute filter type parameter is set to subset. The attributes parameter is used
for selecting the ‘Temperature’ and ‘Humidity’ attributes. Double-click on the Work on Subset
operator to see its subprocess. All the operations in the subprocess will be performed only on
the selected attributes i.e. the ‘Temperature’ and ‘Humidity’ attributes. The Normalize operator is applied in the subprocess. The attribute filter type parameter of the Normalize operator
is set to ‘all’. Please note that the Normalize operator is not applied to all the attributes
of the input ExampleSet; rather, it is applied to all selected attributes of the input ExampleSet i.e. the ‘Temperature’ and ‘Humidity’ attributes. Run the process. You will see that
the normalized ‘Humidity’ and ‘Temperature’ attributes are combined with the rest of the input
ExampleSet. Now set the keep subset only parameter to true and run the process again. Now
you will see that only the results of the subprocess are delivered by the Work on Subset operator.
This Example Process just explains the basic usage of this operator. This operator can be used
for creating new preprocessing schemes by combining it with other preprocessing operators.
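The behavior shown in this tutorial can be approximated outside RapidMiner with a short Python sketch. Everything below is an assumption made for illustration: the function names, the representation of an ExampleSet as a dictionary of columns, the made-up rows, and the use of min-max scaling as a stand-in for the Normalize operator. Column order and role conflicts are not modeled.

```python
def work_on_subset(example_set, subset_names, subprocess,
                   keep_subset_only=False, name_conflict="error"):
    """example_set maps attribute names to lists of values."""
    # Only the selected attributes are handed to the subprocess.
    subset = {name: list(example_set[name]) for name in subset_names}
    result = subprocess(subset)
    if keep_subset_only:
        return result
    # Otherwise the processed subset is merged back with the remaining attributes.
    rest = {n: v for n, v in example_set.items() if n not in subset_names}
    merged = dict(rest)
    for name, values in result.items():
        if name in rest:
            if name_conflict == "error":
                raise ValueError(f"name conflict on attribute {name!r}")
            if name_conflict == "keep original":
                continue  # keep the attribute that is not in the subset
        merged[name] = values  # "keep new", or no conflict at all
    return merged


def scale_to_unit_range(columns):
    """Stand-in for a normalization step: min-max scaling to [0, 1]."""
    scaled = {}
    for name, values in columns.items():
        low, high = min(values), max(values)
        span = (high - low) or 1.0
        scaled[name] = [(v - low) / span for v in values]
    return scaled


# Made-up rows in the style of the Golf data set.
golf_like = {
    "Outlook": ["sunny", "rain", "overcast"],
    "Temperature": [85, 70, 64],
    "Humidity": [85, 96, 65],
}
merged = work_on_subset(golf_like, ["Temperature", "Humidity"], scale_to_unit_range)
only_subset = work_on_subset(golf_like, ["Temperature", "Humidity"],
                             scale_to_unit_range, keep_subset_only=True)
print(sorted(merged))
print(sorted(only_subset))
```

With keep_subset_only left at false, the untouched ‘Outlook’ column is returned together with the scaled columns; with it set to true, only the scaled columns come back.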
Figure 2.30: Tutorial process ‘Working on a subset of Golf data set’.
2.1.4 Generation
Generate Absolutes
This operator replaces all values of the selected numerical attributes by their corresponding absolute values.
Description
The Generate Absolutes operator replaces all values of the selected numerical attributes by their
absolute values. The absolute value of a real number is the numerical value of that number
without regard to its sign. For example, the absolute value of 7 is 7, and the absolute value of
–7 is also 7. The absolute value of a number may be thought of as its distance from zero.
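As a minimal sketch of what this description amounts to, the following Python function replaces the values of the selected columns by their absolute values and leaves everything else untouched. The function name and the dictionary-of-columns representation are assumptions made for this illustration; missing values are not handled.

```python
def generate_absolutes(example_set, selected):
    """Return a copy in which the selected numeric columns hold absolute values."""
    return {
        name: [abs(v) for v in values] if name in selected else list(values)
        for name, values in example_set.items()
    }


data = {"att1": [-0.5, 0.25, -7], "att2": [3, -4, 0]}
result = generate_absolutes(data, ["att1"])
print(result["att1"])
print(result["att2"])
```

Because a copy is returned, the input stays unchanged, which loosely mirrors the original (ori) port of the operator.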
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) The values of the selected numerical attributes are replaced by their
corresponding absolute values and the resultant ExampleSet is returned through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When it is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but require some explanation for beginners. It is best to specify the regular expression through the edit and preview
regular expression menu, which helps in understanding regular expressions and
lets you try different expressions and preview the results immediately.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all the selected attributes are unselected and the previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
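The rules given for the numeric condition parameter (a blank after the comparison operator, terms joined by either && or || but never both) can be made concrete with a small Python sketch. It is not the parser used by RapidMiner; the function names, the set of supported operators and the error messages are assumptions made for this illustration.

```python
import operator
import re

_OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
              "<=": operator.le, "=": operator.eq}
# One term: an operator, exactly one blank, then a number.
_TERM = re.compile(r"(>=|<=|>|<|=) (-?\d+(?:\.\d+)?)")


def parse_numeric_condition(condition):
    """Turn a condition such as '> 6 && < 11' into a predicate on one value."""
    if "&&" in condition and "||" in condition:
        raise ValueError("&& and || cannot be used together in one condition")
    if "||" in condition:
        parts, combine = condition.split("||"), any
    else:
        parts, combine = condition.split("&&"), all
    tests = []
    for part in parts:
        match = _TERM.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"cannot parse term {part.strip()!r}")
        tests.append((_OPERATORS[match.group(1)], float(match.group(2))))
    return lambda value: combine(test(value, bound) for test, bound in tests)


def attribute_passes(values, condition):
    """A numeric attribute is selected only if every example satisfies the condition."""
    predicate = parse_numeric_condition(condition)
    return all(predicate(v) for v in values)


print(attribute_passes([7, 8, 10], "> 6 && < 11"))
print(attribute_passes([7, 12], "> 6 && < 11"))
```

In this sketch a term such as ‘<5’ (no blank) and a condition mixing && with || both raise a ValueError, mirroring the restrictions stated above.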
Tutorial Processes
Absolute values of the Ripley-Set data set
The ‘Ripley-Set’ data set is loaded using the Retrieve operator. A breakpoint is inserted here
so that you can have a look at the ExampleSet. You can see that the ExampleSet has two real
attributes i.e. att1 and att2. Note that both these attributes have both positive and negative
values. The Generate Absolutes operator is applied on this ExampleSet to replace these values
by their corresponding absolute values. The resultant ExampleSet can be seen in the Results
Workspace.
Figure 2.31: Tutorial process ‘Absolute values of the Ripley-Set data set’.
Generate Aggregation
This operator generates a new attribute by performing the specified aggregation function on every example of the selected attributes.
Description
This operator can be considered to be a blend of the Generate Attributes operator and the Aggregate operator. This operator generates a new attribute which consists of a function of several
other attributes. These ‘other’ attributes can be selected by the attribute filter type parameter
and other associated parameters. The aggregation function is selected through the aggregation
function parameter. Several aggregation functions are available e.g. count, minimum, maximum, average, mode etc. The attribute name parameter specifies the name of the new attribute.
If you think this operator is close to your requirement but not exactly what you need, have a look
at the Aggregate and the Generate Attributes operators because they perform similar tasks.
Differentiation
• Aggregate This operator performs the aggregation functions known from SQL. It provides
a lot of functionalities in the same format as provided by the SQL aggregation functions.
SQL aggregation functions and GROUP BY and HAVING clauses can be imitated using this
operator. See page 254 for details.
• Generate Attributes It is a very powerful operator for generating new attributes from
existing attributes. It even supports regular expressions and conditional statements for
specifying the new attributes. See page 216 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input.
Output Ports
example set output (exa) The ExampleSet with the additional attribute generated by applying the specified aggregation function is the output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute name (string) The name of the resulting attribute is specified through this parameter.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but require some explanation for beginners. It is best to specify the regular expression through the edit and preview
regular expression menu, which helps in understanding regular expressions and
lets you try different expressions and preview the results immediately.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all the selected attributes are unselected and the previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
aggregation function (selection) This parameter specifies the function for aggregating the
values of the selected attribute. Numerous options are available e.g. average, variance,
standard deviation, count, minimum, maximum, sum, mode, median and product.
keep all (boolean) This parameter indicates if all old attributes should be kept. If this parameter is set to false then all the selected attributes (i.e. attributes that are used for aggregation) are removed.
ignore missings (boolean) This parameter indicates if missing values should be ignored and
if the aggregation function should be applied only to existing values. If this parameter
is not set to true, the aggregated value will be a missing value whenever any of the
selected attributes has a missing value in that example.
ignore missing attributes (boolean) Normally an error is shown when the attribute filter
doesn’t match any attributes of the ExampleSet. If this parameter is set to true, that situation will be ignored.
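To make the combined effect of the attribute name, aggregation function, keep all and ignore missings parameters more tangible, here is a rough Python sketch that aggregates across the selected attributes of each example. The function signature, the list-of-dictionaries representation and the use of None for missing values are assumptions made for this illustration; the default values chosen here are arbitrary and do not claim to match the operator's defaults.

```python
import statistics


def generate_aggregation(rows, selected, attribute_name, function,
                         keep_all=True, ignore_missings=True):
    """Add one attribute per example, computed from the selected attributes."""
    result = []
    for row in rows:
        values = [row[name] for name in selected]
        present = [v for v in values if v is not None]
        if not present or (not ignore_missings and len(present) < len(values)):
            aggregated = None  # a missing input yields a missing result
        else:
            aggregated = function(present)
        if keep_all:
            new_row = dict(row)
        else:
            # Drop the attributes that were used for the aggregation.
            new_row = {k: v for k, v in row.items() if k not in selected}
        new_row[attribute_name] = aggregated
        result.append(new_row)
    return result


rows = [
    {"id": 1, "a": 1.0, "b": 3.0, "c": 5.0},
    {"id": 2, "a": 2.0, "b": None, "c": 4.0},
]
lenient = generate_aggregation(rows, ["a", "b", "c"], "Average", statistics.mean)
strict = generate_aggregation(rows, ["a", "b", "c"], "Average", statistics.mean,
                              keep_all=False, ignore_missings=False)
print([r["Average"] for r in lenient])
print(strict)
```

Any function that maps a list of numbers to one value (for example min, max, sum or statistics.median) can be passed in place of statistics.mean.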
Related Documents
• Aggregate (page 254)
• Generate Attributes (page 216)
Tutorial Processes
Generating an attribute having average of real attributes of Sonar data set
Figure 2.32: Tutorial process ‘Generating an attribute having average of real attributes of Sonar data set’.
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so
that you can have a look at the ExampleSet. You can see that the ExampleSet has one nominal
and sixty real attributes. The Generate Aggregation operator is applied on this ExampleSet to
generate a new attribute from the real attributes of the ExampleSet.
The attribute name parameter is set to ‘Average’ thus the new attribute will be named ‘Average’. The attribute filter type parameter is set to ‘value type’ and the value type parameter is
set to ‘real’, thus the new attribute will be created from real attributes of the ExampleSet. The
aggregation function parameter is set to ‘average’, thus the new attribute will be average of the
selected attributes.
The resultant ExampleSet can be seen in the Results Workspace. You can see that there is a
new attribute named ‘Average’ in the ExampleSet that has the average value of the attribute_1,
attribute_2, ..., attribute_60 attributes.
Generate Attributes
This operator constructs new user defined attributes using mathematical expressions.
Description
The Generate Attributes operator constructs new attributes from the attributes of the input ExampleSet and arbitrary constants using mathematical expressions. The attribute names of the
input ExampleSet can be used as variables in the mathematical expressions for new attributes.
When this operator is applied, these expressions are evaluated on each example, with the
variables filled with the example’s attribute values. Thus this operator not only creates
new columns for new attributes, but also fills those columns with the corresponding values.
If a variable is undefined in an expression, the entire expression becomes undefined
and ‘?’ is stored at its location.
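The per-example evaluation described above, including the rule that an undefined variable makes the whole expression undefined, can be sketched in Python as follows. This is an illustration only: the function name, the use of a Python callable in place of an expression string, the representation of ‘?’ as None and the sample values are all assumptions.

```python
def generate_attribute(rows, new_name, variables, expression):
    """Evaluate expression(*values) for every example and store the result."""
    result = []
    for row in rows:
        arguments = [row.get(name) for name in variables]
        new_row = dict(row)
        if any(argument is None for argument in arguments):
            new_row[new_name] = None  # undefined input -> undefined result ('?')
        else:
            new_row[new_name] = expression(*arguments)
        result.append(new_row)
    return result


# Made-up examples; None stands for a missing value.
rows = [
    {"wage-inc-1st": 2.0, "wage-inc-2nd": 4.0, "wage-inc-3rd": 6.0},
    {"wage-inc-1st": 2.0, "wage-inc-2nd": None, "wage-inc-3rd": 6.0},
]
generated = generate_attribute(
    rows,
    "average wage-inc",
    ["wage-inc-1st", "wage-inc-2nd", "wage-inc-3rd"],
    lambda first, second, third: (first + second + third) / 3,
)
print([row["average wage-inc"] for row in generated])
```

The second example has a missing input, so its generated value is also missing rather than an average of the two remaining values.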
Please note that there are some restrictions for the attribute names in order to let this operator
work properly:
• Attribute names containing dashes ‘-’ or other special characters, or having the same name
as a constant (e.g. ‘e’ or ‘pi’) must be placed in square brackets e.g. ‘[weird-name]’ or ‘[pi]’.
• Attribute names containing square brackets or backslashes must be placed in square brackets and the square brackets and backslashes inside the name must be escaped, e.g. ‘[a\\tt\[1\]]’
for an attribute ‘a\tt[1]’.
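The bracket and escaping rules listed above are mechanical enough to express as a small helper. The Python function below is a hypothetical utility written for this illustration (it is not part of RapidMiner); it always wraps the name in square brackets, which the rules above require only for names with special characters or names that clash with constants.

```python
def bracket_attribute_name(name):
    """Wrap an attribute name in square brackets, escaping \\, [ and ]."""
    # Backslashes must be escaped first so that the backslashes added for
    # the brackets are not escaped a second time.
    escaped = name.replace("\\", "\\\\").replace("[", "\\[").replace("]", "\\]")
    return "[" + escaped + "]"


print(bracket_attribute_name("weird-name"))
print(bracket_attribute_name(r"a\tt[1]"))
```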
If you want to apply this operator but the attributes of your ExampleSet do not fulfill the
above-mentioned conditions, you can rename attributes with the Rename operator before applying
the Generate Attributes operator. When replacing several attributes following a certain schema,
the Rename by Replacing operator might prove useful.
A large number of operations and functions are supported, which allows you to write rich expressions. For a list of operations and functions and their descriptions, open the Edit Expression
dialog. Complicated expressions can be created by using multiple operations and functions.
Parentheses can be used to nest operations.
This operator also supports various constants (for example ‘INFINITY’, ‘PI’ and ‘e’). Again, you
can find a complete list in the Edit Expression dialog. You can also use strings in operations, but
string values must be enclosed in double quotes (").
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Rename operator in the attached Example Process. The output of other operators can also be used as
input.
Output Ports
example set (exa) The ExampleSet with new attributes is the output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the results workspace.
Parameters
function descriptions The list of functions for generating new attributes is provided here.
keep all (boolean) If set to true, all the original attributes are kept, otherwise they are removed from the output ExampleSet.
Tutorial Processes
Generating attributes through different function descriptions
Figure 2.33: Tutorial process ‘Generating attributes through different function descriptions’.
The ‘Labor-Negotiations’ data set is loaded using the Retrieve operator.
Now have a look at the Generate Attributes operator’s parameters. The keep all parameter
is checked, thus all attributes of the ‘Labor-Negotiations’ data set are also kept along with attributes generated by the Generate Attributes operator.
Click on the Edit List button of the function descriptions parameter to have a look at descriptions of functions defined for generating new attributes. 18 new attributes are generated. There
might be better ways of generating these attributes, but here they are written to explain the usage
of the different types of functions available in the Generate Attributes operator. Please read the
function description of each attribute and then see the values of the corresponding attribute in
the Results Workspace to understand it completely. Here is a description of attributes created
by this operator:
The ‘average wage-inc’ attribute takes the sum of the wage-inc-1st, wage-inc-2nd and wage-inc-3rd attribute values and divides the sum by 3. This gives the average of the wage increments. There are better ways of doing this; the example is only meant to clarify the use of some basic functions.
The ‘neglected worker bool’ attribute is a boolean attribute, i.e. it has only two possible values, ‘0’ and ‘1’. It was created to show the usage of logical operations like ‘AND’ and ‘OR’ in the Generate Attributes operator. This attribute assumes value ‘1’ if three conditions are satisfied. First, the working-hours attribute has value 35 or more. Second, the education-allowance attribute is not equal to ‘yes’. Third, the vacation attribute has value ‘average’ OR ‘below-average’. If any of these conditions is not satisfied, the new attribute gets value ‘0’.
The ‘logarithmic attribute’ attribute shows the usage of the base-10 logarithm and natural logarithm functions. The ‘trigno attribute’ attribute shows the usage of various trigonometric functions like sine and cosine. The ‘rounded average wage-inc’ attribute uses the avg function to take the average of the wage increments and then uses the round function to round the resulting values.
The ‘vacations’ attribute uses the replaceAll function to replace all occurrences of the value ‘generous’ with ‘above-average’ in the ‘vacation’ attribute. The ‘deadline’ attribute shows the usage of the If-then-Else and Date functions. This attribute assumes the value of the current date plus 25 days if the class attribute has value ‘good’. Otherwise it stores the current date plus 10 days.
The ‘shift complete’ attribute shows the usage of the If-then-Else, random, floor and missing functions. This attribute has the values of the shift-differential attribute, but it does not have missing values. Missing values are replaced with a random number between 0 and 25. The ‘remaining_holidays’ attribute stores the difference of the statutory-holidays attribute value from 15. The ‘remaining_holidays_percentage’ attribute uses the ‘remaining_holidays’ attribute to find the percentage of remaining holidays. This attribute was created to show that attributes created in this Generate Attributes operator can be used to generate new attributes in the same Generate Attributes operator.
The ‘constants’ attribute was created to show the usage of constants like ‘e’ and ‘PI’. The ‘cut’ attribute shows the usage of the cut function. If you want to specify a string, you should place it in double quotes ("") as in the last term of this attribute’s expression. If you want to specify the name of an attribute, you should not place it in quotes. The first term of the expression cuts the first two characters of the ‘class’ attribute values, because the name of the attribute is not placed in quotes. The last term of the expression selects the first two characters of the string ‘class’. As the first two characters of the string ‘class’ are ‘cl’, cl is appended at the end of this attribute’s values. The middle term concatenates a blank space between the results of the first and last terms.
The ‘index’ attribute shows the usage of the index function. If the ‘class’ attribute has value ‘no’, 1 is stored because ‘o’ is at index 1. If the ‘class’ attribute has value ‘yes’, -1 is stored because ‘o’ is not present in this value. The ‘date constants’ attribute shows the usage of the date constants. It shows the date of the ‘deadline’ attribute in full format, but only the time is selected for display.
The ‘macro’ attribute shows how to use macros in functions. The ‘macro eval’ attribute shows how to use macros that contain a number. The macro function %{} always returns a string, so if you want to obtain the number you have to use the eval function or the parse function. The ‘expression eval’ attribute shows the usage of the eval function. If there is a string containing an expression, for example coming from a macro %{expression}, you can evaluate this expression by using the eval function. The ‘macro with attribute’ attribute shows the usage of the #{} function. If there is a macro containing the name of an attribute, you can use this attribute in your expression by using #{attribute_macro}, where attribute_macro is the macro containing the attribute name. Using eval(%{attribute_macro}) would lead to the same result, but the #{} function fails when the macro does not contain an attribute name, while eval(%{attribute_macro}) evaluates whatever is contained in the macro.
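The logic behind two of these attributes can be sketched outside RapidMiner. The following Python snippet is only an illustration of the ‘average wage-inc’ and ‘neglected worker bool’ rules as described above; it is not RapidMiner expression syntax, the function names are hypothetical, and the example row contains made-up values. Missing values are not handled.

```python
def average_wage_inc(row):
    """Mean of the three wage increments (sum divided by 3)."""
    total = row["wage-inc-1st"] + row["wage-inc-2nd"] + row["wage-inc-3rd"]
    return total / 3


def neglected_worker(row):
    """Return 1 if all three conditions from the text hold, otherwise 0."""
    long_hours = row["working-hours"] >= 35
    no_allowance = row["education-allowance"] != "yes"
    short_vacation = row["vacation"] in ("average", "below-average")
    return 1 if (long_hours and no_allowance and short_vacation) else 0


# Hypothetical example row; the values are invented for illustration only.
row = {
    "wage-inc-1st": 2.0,
    "wage-inc-2nd": 3.0,
    "wage-inc-3rd": 4.0,
    "working-hours": 38,
    "education-allowance": "no",
    "vacation": "average",
}

print(average_wage_inc(row))
print(neglected_worker(row))
```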
Generate Concatenation
This operator merges two attributes into a single new attribute by
concatenating their values. The new attribute is of nominal type.
The original attributes remain unchanged.
Description
The Generate Concatenation operator merges two attributes of the input ExampleSet into a single new nominal attribute by concatenating the values of the two attributes. If the resulting attribute actually holds numerical values, it can be converted from nominal to numerical type by
using the Nominal to Numeric operator. The original attributes remain unchanged; a new
attribute is simply added to the ExampleSet. The two attributes to be concatenated are specified by the
first attribute and second attribute parameters.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process.
Output Ports
example set output (exa) The ExampleSet with the new attribute that has concatenated values of the specified attributes is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
first attribute (string) This parameter specifies the first attribute to be concatenated.
second attribute (string) This parameter specifies the second attribute to be concatenated.
separator (string) This parameter specifies the string which is used as separation of values of
the first and second attribute i.e. the string that is concatenated between the two values.
trim values (boolean) This parameter indicates if the values of the first and second attribute
should be trimmed i.e. leading and trailing whitespaces should be removed before the concatenation is performed.
Tutorial Processes
Generating a concatenated attribute in the Labor-Negotiations data set
Figure 2.34: Tutorial process ‘Generating a concatenated attribute in the Labor-Negotiations data set’.
The ‘Labor-Negotiations’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can have a look at the ExampleSet. The ‘vacation’ and ‘statutory-holidays’ attributes will be concatenated to form a new attribute. The Generate Concatenation operator is applied on the Labor-Negotiations data set. The first attribute and second attribute parameters are set to ‘vacation’ and ‘statutory-holidays’ respectively. The separator parameter is set to ‘_’. Thus the values of the ‘vacation’ and ‘statutory-holidays’ attributes will be merged with a ‘_’ between them. You can verify this by seeing the resultant ExampleSet in the Results Workspace. The ‘vacation’ and ‘statutory-holidays’ attributes remain unchanged. A new attribute named ‘vacation_statutory-holidays’ is created. The type of the new attribute is nominal.
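As a rough illustration of what the separator and trim values parameters do to a single pair of values, here is a minimal Python sketch. The function name is hypothetical, the ‘_’ default simply mirrors the tutorial setting, and the way numerical values are rendered as text is an assumption; the operator itself may format numbers differently.

```python
def concatenate(first, second, separator="_", trim_values=False):
    """Join two values into one nominal (string) value."""
    first, second = str(first), str(second)
    if trim_values:
        # Remove leading and trailing whitespace before joining.
        first, second = first.strip(), second.strip()
    return f"{first}{separator}{second}"


print(concatenate("generous", 11))
print(concatenate(" average ", "12 ", trim_values=True))
```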
Generate Copy
This operator generates the copy of an attribute. The original attribute remains unchanged.
Description
The Generate Copy operator adds a copy of the selected attribute to the input ExampleSet. The original attribute remains unchanged; a new attribute is simply added to the ExampleSet. The attribute to be copied is specified by the attribute name parameter. The
name of the new attribute is specified through the new name parameter. The
names of the attributes of an ExampleSet should be unique. Please note that only the view on the
data column is copied, not the data itself.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process.
Output Ports
example set output (exa) The ExampleSet with the new attribute that is a copy of the specified attribute is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute name (string) The attribute whose copy is required is specified by the attribute name
parameter.
new name (string) The name of the new attribute is specified through the new name parameter. Please note that the names of attributes of an ExampleSet should be unique.
Tutorial Processes
Generating a copy of the Temperature attribute of the Golf data set
The ‘Golf’ data set is loaded using the Retrieve operator. The Generate Copy operator is applied
on it. The attribute name parameter is set to ‘Temperature’. The new name parameter is set to
‘New Temperature’. Run the process. You will see that an attribute named ‘New Temperature’
has been added to the ‘Golf’ data set. The new attribute has the same values as the ‘Temperature’
attribute. The ‘Temperature’ attribute remains unchanged.
Figure 2.35: Tutorial process ‘Generating a copy of the Temperature attribute of the Golf data set’.
Generate Empty Attribute
This operator adds a new attribute of specified name and type to
the input ExampleSet.
Description
The Generate Empty Attribute operator creates an empty attribute whose name and type are specified by the name and value type parameters respectively. One of the following
types can be selected: nominal, numeric, integer, real, text, binominal, polynominal, file_path,
date_time, date, time. All values are missing right after creation of the attribute.
Operators like the Set Data operator can be used to fill in the values of this attribute. The name of the attribute can be changed later by the Rename operator, and many type conversion operators are available for changing the type of the attribute. Please note that this
operator creates an empty attribute independent of the input ExampleSet. If you want to generate an attribute from the existing attributes of the input ExampleSet, you can use the Generate
Attributes operator.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input.
Output Ports
example set output (exa) An empty attribute of the specified name and type is added to the
input ExampleSet and the resultant ExampleSet is delivered through this output port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
name (string) This parameter specifies the name of the new attribute. Please note that the
names of attributes should be unique. Please make sure that the input ExampleSet does
not have an attribute with the same name.
value type (selection) The type of the new attribute is specified by this parameter. One of the
following types can be selected: nominal, numeric, integer, real, text, binominal, polynominal, file_path, date_time, date, time.
Tutorial Processes
Adding an empty attribute to the ‘Golf’ data set
Figure 2.36: Tutorial process ‘Adding an empty attribute to the ‘Golf’ data set’.
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can have a look at the input ExampleSet. As you can see, the ‘Golf’ data set has 5 attributes: Play, Outlook, Temperature, Humidity and Wind. The Generate Empty Attribute operator is applied on the ‘Golf’ data set. The name parameter is set to ‘name’ and the value type parameter is set to ‘nominal’. When the process execution is complete, you can see the ExampleSet in the Results Workspace. This ExampleSet has one attribute more than the ‘Golf’ data set. The name and type of the attribute are the same as specified in the parameters of the Generate Empty Attribute operator. All values of this new attribute are missing. These values can be filled in by using operators like the Set Data operator. The created empty attribute is independent of the input ExampleSet. If you want to generate an attribute from the existing attributes of the input ExampleSet, you can use the Generate Attributes operator.
Generate Function Set
This is an attribute generation operator which generates new attributes by applying a set of selected functions on all attributes.
Description
This operator applies a set of selected functions on all attributes of the input ExampleSet for generating new attributes. Numerous functions are available, including summation, difference, multiplication, division, reciprocal, square root, power, sine, cosine, tangent, arc tangent, exponential, logarithm, absolute value, minimum, maximum, ceiling, floor and round. It is important to note that functions with two arguments are applied on all possible pairs. For example, consider an ExampleSet with three numerical attributes A, B and C. If the summation function is applied on this ExampleSet, three new attributes will be generated with values A+B, A+C and B+C. Similarly, non-commutative functions are applied on all possible permutations. This is a useful attribute generation operator, but if it does not meet your requirements, please try the Generate Attributes operator, which is a very powerful attribute generation operator.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input.
Output Ports
example set output (exa) New attributes are created by application of the selected functions
and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
keep all (boolean) This parameter indicates if the original attributes should be kept.
use plus (boolean) This parameter indicates if the summation function should be applied for
generation of new attributes.
use diff (boolean) This parameter indicates if the difference function should be applied for
generation of new attributes.
use mult (boolean) This parameter indicates if the multiplication function should be applied
for generation of new attributes.
use div (boolean) This parameter indicates if the division function should be applied for generation of new attributes.
use reciprocals (boolean) This parameter indicates if the reciprocal function should be applied for generation of new attributes.
use square roots (boolean) This parameter indicates if the square roots function should be
applied for generation of new attributes.
use power functions (boolean) This parameter indicates if the power function should be applied for generation of new attributes.
use sin (boolean) This parameter indicates if the sine function should be applied for generation of new attributes.
use cos (boolean) This parameter indicates if the cosine function should be applied for generation of new attributes.
use tan (boolean) This parameter indicates if the tangent function should be applied for generation of new attributes.
use atan (boolean) This parameter indicates if the arc tangent function should be applied for
generation of new attributes.
use exp (boolean) This parameter indicates if the exponential function should be applied for
generation of new attributes.
use log (boolean) This parameter indicates if the logarithmic function should be applied for
generation of new attributes.
use absolute values (boolean) This parameter indicates if the absolute values function should
be applied for generation of new attributes.
use min (boolean) This parameter indicates if the minimum values function should be applied for generation of new attributes.
use max (boolean) This parameter indicates if the maximum values function should be applied for generation of new attributes.
use ceil (boolean) This parameter indicates if the ceiling function should be applied for generation of new attributes.
use floor (boolean) This parameter indicates if the floor function should be applied for generation of new attributes.
use rounded (boolean) This parameter indicates if the round function should be applied for
generation of new attributes.
Tutorial Processes
Using the power function for attribute generation
The ‘Iris’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that the ExampleSet has 4 real attributes. The Generate Function Set operator is applied on this ExampleSet for generation of new attributes; only the power function is used. It is not a commutative function, e.g. 2 raised to the power 3 is not the same as 3 raised to the power 2. Non-commutative functions are applied for all possible permutations. As there are 4 original attributes, there are 16 (i.e. 4 x 4) possible permutations. Thus 16 new attributes are created as a result of this operator. The resultant ExampleSet can be seen in the Results Workspace. As the keep all parameter was set to true, the original attributes of the ExampleSet are not discarded.
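The pairing rule can be illustrated with a short Python sketch. It is inferred only from the two counts given in this section (three sums for attributes A, B and C, and 4 x 4 = 16 power attributes for four attributes); the function name is hypothetical and the operator's actual enumeration order may differ.

```python
from itertools import combinations, product


def argument_pairs(names, commutative):
    """Enumerate the attribute pairs a two-argument function is applied to."""
    if commutative:
        # Unordered pairs of distinct attributes, e.g. A+B, A+C, B+C.
        return list(combinations(names, 2))
    # Ordered pairs, including an attribute paired with itself (n x n).
    return list(product(names, repeat=2))


print(argument_pairs(["A", "B", "C"], commutative=True))
print(len(argument_pairs(["a1", "a2", "a3", "a4"], commutative=False)))
```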
Figure 2.37: Tutorial process ‘Using the power function for attribute generation’.
Generate ID
This operator adds a new attribute with id role in the input ExampleSet. Each example in the input ExampleSet is tagged with an
incremented id. If an attribute with id role already exists, it is overridden by the new id attribute.
Description
This operator adds a new attribute with id role to the input ExampleSet. It assigns a unique id to
each example, so that each example can be identified uniquely. Each example in
the input ExampleSet is tagged with an incremented id. The number from where the ids start can
be controlled by the offset parameter. Nominal and integer ids can be assigned. If an attribute
with id role already exists in the input ExampleSet, it is overridden by the new id attribute.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) The ExampleSet with an id attribute is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
create nominal ids (boolean) This parameter indicates if nominal ids should be created instead of integer ids. By default this parameter is not checked, thus integer ids are created
by default. Nominal ids are of the format id_1, id_2, id_3 and so on.
offset (integer) This is an expert parameter. It is used if you want to start id from a number
other than 1. This parameter is used to set the offset value. It is 0 by default, thus ids start
from 1 by default.
Tutorial Processes
Overriding the id attribute of the ‘Iris’ data set
Figure 2.38: Tutorial process ‘Overriding the id attribute of the ‘Iris’ data set’.
The ‘Iris’ data set is loaded using the Retrieve operator. The Generate ID operator is applied on it. All parameters are used with default values. The ‘Iris’ data set already has an id attribute. The old id attribute is overridden when the Generate ID operator is applied on it. Run the process and you can see the ExampleSet with the new id attribute. The type of this new attribute is integer. Set the create nominal ids parameter to true and run the process again. You will see that the ids are now in nominal form (i.e. id_1, id_2 and so on). The offset parameter is set to 0, which is why the ids start from 1. Now set the offset parameter to 10 and run the process again. You can see that the ids now start from 11.
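The numbering described above can be sketched in a few lines of Python. The function name is hypothetical, and the combination of a non-zero offset with nominal ids is an assumption, since the text only shows nominal ids for the default offset.

```python
def generate_ids(n_examples, offset=0, create_nominal_ids=False):
    """Return one id per example, starting at offset + 1."""
    numbers = range(offset + 1, offset + n_examples + 1)
    if create_nominal_ids:
        # Nominal ids follow the id_1, id_2, ... format.
        return [f"id_{n}" for n in numbers]
    return list(numbers)


print(generate_ids(3))
print(generate_ids(3, offset=10))
print(generate_ids(2, create_nominal_ids=True))
```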
Generate Products
This operator generates new attributes by taking the products of
the specified attributes.
Description
The Generate Products operator generates new attributes by taking the products of the specified
attributes. The attributes are specified through the first attribute name and second attribute name
parameters. For example, if the first attribute name parameter selects attributes A and B, and the
second attribute name parameter selects attributes C and D, then four attributes A*C, A*D, B*C and B*D will be
generated by this operator. These attributes will hold the products of the corresponding attribute
values.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Generate
Data operator in the attached Example Process.
Output Ports
example set output (exa) New attributes are generated by taking the products of the specified attributes and the resultant ExampleSet is returned through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
first attribute name (string) This parameter specifies the name(s) of the first attribute(s) to
be multiplied. Attribute names can be specified through regular expressions.
second attribute name (string) This parameter specifies the name(s) of the second attribute(s)
to be multiplied. Attribute names can be specified through regular expressions.
Tutorial Processes
Generating products of attributes
The Generate Data operator provides a sample ExampleSet. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the ExampleSet has four real attributes
i.e. att1, att2, att3 and att4. The Generate Products operator is applied on this ExampleSet.
Have a look at the parameters of this operator. The att1 and att2 attributes are selected through
the first attribute name parameter. The att3 and att4 attributes are selected through the second
attribute name parameter. The resultant ExampleSet can be seen in the Results Workspace. You
can see that this ExampleSet has new attributes that have been generated by multiplying the first
attribute name attributes with the second attribute name attributes.
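A minimal Python sketch of this behaviour for a single example follows. The function name, the use of full-string regular expression matching, and the ‘a*b’ naming of the generated attributes are assumptions made for illustration; the values in the sample row are invented.

```python
import re


def generate_products(row, first_pattern, second_pattern):
    """Multiply every attribute matching the first pattern with every
    attribute matching the second pattern."""
    firsts = [name for name in row if re.fullmatch(first_pattern, name)]
    seconds = [name for name in row if re.fullmatch(second_pattern, name)]
    return {f"{a}*{b}": row[a] * row[b] for a in firsts for b in seconds}


# Hypothetical example with four real attributes, as in the tutorial.
row = {"att1": 2.0, "att2": 3.0, "att3": 5.0, "att4": 7.0}
print(generate_products(row, "att1|att2", "att3|att4"))
```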
Figure 2.39: Tutorial process ‘Generating products of attributes’.
Generate TFIDF
This operator performs a TF-IDF filtering of the given ExampleSet.
TF-IDF is a numerical statistic which reflects how important a word
is to a document.
Description
The Generate TFIDF operator generates TF-IDF values from the given ExampleSet. The ExampleSet must contain either the binary occurrences (which will be normalized during calculation
of the term frequency TF) or the already calculated term frequency values (in
this case no normalization is done). This behavior can be selected using the calculate term
frequencies parameter.
The TF-IDF (term frequency–inverse document frequency) is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a
weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency
of the word in the corpus, which helps to control for the fact that some words are generally more
common than others.
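To make the idea concrete, here is a small Python sketch of one common textbook variant of TF-IDF: the term frequency is the count divided by the total count in the document, and the inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. This is an assumption for illustration; the exact formula and normalization used by the operator are not stated here and may differ. The function name and the sample counts are hypothetical.

```python
import math


def tf_idf(counts):
    """counts maps a document name to a {term: occurrence count} dict."""
    n_docs = len(counts)
    # Document frequency: in how many documents does each term occur?
    doc_freq = {}
    for terms in counts.values():
        for term, count in terms.items():
            if count > 0:
                doc_freq[term] = doc_freq.get(term, 0) + 1
    result = {}
    for doc, terms in counts.items():
        total = sum(terms.values())
        result[doc] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in terms.items()
            if count > 0
        }
    return result


counts = {"Doc1": {"data": 2, "mining": 2}, "Doc2": {"data": 1}}
print(tf_idf(counts))
```

In this sample, ‘data’ occurs in every document and therefore gets weight 0, while ‘mining’ occurs only in Doc1 and gets a positive weight.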
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Read CSV
operator in the attached Example Process.
Output Ports
example set output (exa) The TF-IDF is calculated and the resultant ExampleSet is returned
through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
calculate term frequencies (boolean) This parameter indicates if term frequency values should
be generated. This parameter must be set to true if the input data is given as simple occurrence counts.
Tutorial Processes
Introduction to the Generate TFIDF operator
Figure 2.40: Tutorial process ‘Introduction to the Generate TFIDF operator’.
This Example Process starts with a Subprocess operator which generates a sample ExampleSet. A breakpoint is inserted here so that you can have a look at the ExampleSet. This is a very
simple ExampleSet. It has a text attribute which holds different words. There are three integer
attributes named Doc1, Doc2 and Doc3 that hold the count of the corresponding words in these
documents. The Generate TFIDF operator is applied on this ExampleSet to calculate the TF-IDF.
The resultant ExampleSet can be seen in the Results Workspace.
Generate Weight (Stratification)
This operator distributes the specified weight over all the examples, such that weights sum up equally per label.
Description
The Generate Weight (Stratification) operator divides the weight specified through the total weight
parameter among all the examples. While dividing the weight, this operator makes sure that the
sum of example weights is the same for every label value. This often improves the representativeness
of the label values. Please study the attached Example Process for better understanding.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) The examples are assigned weights and the resultant ExampleSet
is returned through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
total weight (real) This parameter specifies the total weight that should be distributed over
all the examples.
Tutorial Processes
Assigning weights such that weights sum up equally per label
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the label of this ExampleSet has two
possible values i.e. ‘yes’ and ‘no’. The Generate Weight (Stratification) operator is applied on
this ExampleSet for weight assignment. The total weight parameter is set to 10. This operator
assigns weight to examples such that:
The sum of all weights is equal to the total weight.
The sum of weights is equal per label.
Thus in this process, the sum of all weights should be 10 and the weight sum of examples with
label ‘no’ should be equal to the weight sum of examples with label ‘yes’. You can verify this by
viewing the resultant ExampleSet in the Results Workspace.
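One simple way to satisfy both conditions is to give each label value an equal share of the total weight and to split that share evenly among the examples carrying the label. The Python sketch below assumes this even split within a label; the text only fixes the two sums, so the operator may distribute weights differently. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def stratified_weights(labels, total_weight):
    """Return one weight per example so that every label value
    receives the same share of total_weight."""
    counts = Counter(labels)
    share_per_label = total_weight / len(counts)
    return [share_per_label / counts[label] for label in labels]


# Hypothetical labels: three 'yes' examples and one 'no' example.
labels = ["yes", "yes", "yes", "no"]
weights = stratified_weights(labels, total_weight=10)
print(weights)
```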
Figure 2.41: Tutorial process ‘Assigning weights such that weights sum up equally per label’.
2.2 Examples
2.2.1 Filter
Filter Example Range
This operator selects which examples (i.e. rows) of an ExampleSet
should be kept and which examples should be removed. Examples
within the specified index range are kept, remaining examples are
removed.
Description
This operator takes an ExampleSet as input and returns a new ExampleSet including only those
examples that are within the specified index range. Lower and upper bound of index range are
specified using first example and last example parameters. This operator may reduce the number of examples in an ExampleSet but it has no effect on the number of attributes. The Select
Attributes operator is used to select required attributes.
If you want to filter examples by options other than index range, you may use the Filter Examples operator. It takes an ExampleSet as input and returns a new ExampleSet including only
those examples that satisfy the specified condition. Several predefined conditions are provided;
users can select any of them. Users can also define their own conditions to filter examples. The
Filter Examples operator is frequently used to filter examples that have (or do not have) missing
values. It is also frequently used to filter examples with correct or wrong predictions (usually
after testing a learnt model).
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) A new ExampleSet including only the examples that are within the
specified index range is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
first example (integer) This parameter is used to set the lower bound of the index range. The
last example parameter is used to set the upper bound of the index range. Examples within
this index range are delivered to the output port. Examples outside this index range are
discarded.
last example (integer) This parameter is used to set the upper bound of the index range. The
first example parameter is used to set the lower bound of the index range. Examples within
this index range are delivered to the output port. Examples outside this index range are
discarded.
invert filter (boolean) If this parameter is set to true, it acts as a NOT gate, i.e. it reverses the
selection. In that case all the selected examples are removed and the previously removed examples are selected. In other words, it inverts the index range. For example, if the first
example parameter is set to 1 and the last example parameter is set to 10, then the output
port will deliver an ExampleSet with all examples other than the first ten examples.
Tutorial Processes
Filtering examples using the invert filter parameter
Figure 2.42: Tutorial process ‘Filtering examples using the invert filter parameter’.
The ‘Golf’ data set is loaded using the Retrieve operator. The Generate ID operator is applied
on it with offset set to 0. Thus all examples are assigned unique ids from 1 to 14. This is done so
that examples can be distinguished easily. A breakpoint is inserted here so that you can have a
look at the data set before application of the Filter Example Range operator. In the Filter Example Range operator the first example parameter is set to 5 and the last example parameter is set
to 10. The invert filter parameter is also set to true. Thus all examples other than examples in
index range 5 to 10 are delivered through the output port. You can clearly identify rows through
their ids. Rows with IDs from 1 to 4 and from 11 to 14 make it to the output port.
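The index-range selection and its inversion can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not RapidMiner code; the function name filter_example_range and its arguments are hypothetical, and the list of ids stands in for the 14 Golf examples.

```python
def filter_example_range(examples, first_example, last_example, invert_filter=False):
    """Keep examples whose 1-based position lies in [first_example, last_example].

    Hypothetical helper: it mimics the parameters described above.
    """
    kept = []
    for position, example in enumerate(examples, start=1):
        in_range = first_example <= position <= last_example
        # With invert_filter set, the selection is reversed.
        if in_range != invert_filter:
            kept.append(example)
    return kept


ids = list(range(1, 15))  # stand-in for the ids 1..14 created by Generate ID
in_range = filter_example_range(ids, 5, 10)
inverted = filter_example_range(ids, 5, 10, invert_filter=True)
print(inverted)  # [1, 2, 3, 4, 11, 12, 13, 14]
```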
Filter Examples
This operator selects which examples (i.e. rows) of an ExampleSet
should be kept and which examples should be removed. Examples
satisfying the given condition are kept, remaining examples are removed.
Description
This operator takes an ExampleSet as input and returns a new ExampleSet including only those
examples that satisfy the specified condition. Several predefined conditions are provided; users
can select any of them. Users can also define their own conditions to filter examples. This operator may reduce the number of examples in an ExampleSet but it has no effect on the number
of attributes. The Select Attributes operator is used to select required attributes.
The Filter Examples operator is frequently used to filter examples that have (or do not have)
missing values. It is also frequently used to filter examples with correct or wrong predictions
(usually after testing a learnt model).
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process.
Output Ports
example set output (exa) The new ExampleSet including only the examples that satisfied the
specified condition is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
unmatched example set (unm) An ExampleSet including only the examples that did not satisfy the specified condition is output of this port.
Parameters
condition class (selection) Various predefined conditions are available for filtering examples.
Users can select any of them. Examples satisfying the selected condition are passed to the
output port; the others are removed. The following conditions are available:
• all if this option is selected, no examples are removed.
• correct_predictions if this option is selected, only those examples make it to the output port that have correct predictions i.e. the value of the label attribute and prediction
attribute are the same.
• wrong_predictions if this option is selected, only those examples make it to the output
port that have wrong predictions i.e. the value of the label attribute and prediction
attribute are not the same.
• no_missing_attributes if this option is selected, only those examples make it to the
output port that have no missing values in their attribute values. Missing values or
null values are usually shown by ‘?’ in RapidMiner.
• missing_attributes if this option is selected, only those examples make it to the output port that have some missing values in their attribute values.
• no_missing_labels if this option is selected, only those examples make it to the output port that do not have any missing values in their label attribute values. Missing
values or null values are usually shown by ‘?’ in RapidMiner.
• missing_label if this option is selected, only those examples make it to the output port
that have missing values in their label attribute.
• attribute_value_filter if this option is selected, another parameter (parameter string) is
enabled in the Parameters panel.
parameter string (string) Instead of using one of the predefined conditions, users can define their own conditions here. It is important to understand how to
specify conditions, because the true power of this operator lies in defining custom conditions according to requirements. For numerical attributes, conditions can be
specified using the “attribute op value” format, where ‘attribute’ is the name of the
attribute, ‘value’ is a value that the attribute can take and ‘op’ is one of the comparison
operators >, <, >=, <=, = and !=. For nominal attributes, conditions can be specified
using the “attribute op exp” format, where ‘attribute’ is the name of the attribute, ‘op’
can be either ‘=’ or ‘!=’ and ‘exp’ is a regular expression. Users should have a
good understanding of regular expressions. A good way to become familiar with them is to use the Select Attributes operator with the attribute filter type parameter set to
regular_expression and then open the edit and preview regular expression menu.
Multiple conditions can be linked by using logical AND (written as &&) or logical OR (written as || ) operators. Instead of writing multiple AND conditions you can use multiple Filter
Examples operators in a row to reduce complexity.
Missing values or null values can be written as ‘?’ for numerical attributes and as ‘\?’ for
nominal attributes. ’\?’ is used instead of ‘?’ in nominal attributes because this is the way
missing values are specified in regular expressions.
For ’unknown_attributes’ the parameter string must be empty. This filter removes all examples containing attributes that have missing or illegal values. For ‘unknown_label’ the
parameter string must also be empty. This filter removes all examples with an unknown
label value.
invert filter (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the selection: all selected examples are removed and all previously removed examples are selected. In other words, it inverts the condition. For example, if the missing_attributes
option is selected in the condition class parameter and the invert filter parameter is set to true,
the output port delivers an ExampleSet with no missing values.
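The interplay of a condition class and the invert filter parameter can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the operator's implementation: the helper names and the three rows are made up, and routing the rejected rows to a second list only loosely mirrors the unmatched example set port.

```python
def filter_examples(examples, condition, invert_filter=False):
    """Split examples into (matched, unmatched) lists.

    Hypothetical helper: 'condition' is any function returning True or False.
    How the real 'unm' port behaves under inversion is an assumption here.
    """
    matched, unmatched = [], []
    for example in examples:
        keep = condition(example) != invert_filter  # invert acts as a NOT gate
        (matched if keep else unmatched).append(example)
    return matched, unmatched


def correct_predictions(example):
    # A prediction is correct when label and prediction have the same value.
    return example["Play"] == example["prediction(Play)"]


# Made-up rows for illustration only.
rows = [
    {"id": 1, "Play": "yes", "prediction(Play)": "yes"},
    {"id": 2, "Play": "no", "prediction(Play)": "yes"},
    {"id": 3, "Play": "no", "prediction(Play)": "no"},
]

correct, _ = filter_examples(rows, correct_predictions)
wrong, _ = filter_examples(rows, correct_predictions, invert_filter=True)
print([r["id"] for r in correct], [r["id"] for r in wrong])  # [1, 3] [2]
```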
Tutorial Processes
Filtering correctly predicted examples
Figure 2.43: Tutorial process ‘Filtering correctly predicted examples’.
The ‘Golf’ data set is loaded using the Retrieve operator and the k-NN operator is applied on it to
generate a classification model. That model is then applied on the ‘Golf-Testset’ data set using
the Apply Model operator. The Apply Model operator applies the model learnt by the k-NN operator on the ‘Golf-Testset’ data set and records the predicted values in a new attribute named
‘prediction(Play)’. Labeled data from the Apply Model operator serves as input to the Filter Examples operator. The correct_predictions option is selected in the condition class parameter,
which ensures that only those examples make it to the output port that have correct predictions. A correct prediction means that the values of the label attribute and the prediction attribute are the same in
that example. However, as the invert filter parameter is set to true, the selection is reversed and wrong predictions are delivered through the output port instead. It can be
seen in the Results Workspace that the label attribute (Play) and the prediction attribute (prediction(Play)) have opposite values in all the resultant examples. A breakpoint is inserted before
the Filter Examples operator so that you can have a look at the examples before the Filter Examples operator is applied. Press the green-colored Run button to continue with the process.
Filtering examples according to their values
Figure 2.44: Tutorial process ‘Filtering examples according to their values’.
The ‘Golf’ data set is loaded using the Retrieve operator and the Filter Examples operator is applied on it with the parameter string “Outlook = .*n.* && Temperature>70”. The Outlook attribute is a nominal attribute,
so a regular expression is used to describe it. The condition “Outlook=.*n.*” selects all examples that have the letter ‘n’ in their Outlook attribute value. 10 examples qualify; all have ‘Outlook = rain’ or ‘Outlook = sunny’. The Temperature attribute is a numerical attribute, so the “attribute op
value” syntax is used to select rows. 9 examples satisfy the condition that the Temperature attribute has a value greater than 70. As these two conditions are joined using logical AND (&&),
the examples finally selected are those that meet both conditions. Only 6 such rows are present
that have an ‘n’ in the Outlook attribute value and a Temperature attribute value greater
than 70. This can be seen clearly in the Results Workspace.
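The effect of such a parameter string can be reproduced outside RapidMiner with a short Python sketch. The four rows below are made up for illustration and are not the Golf data set; treating the regular expression as a match against the whole nominal value is an assumption of this sketch.

```python
import re


def matches_nominal(value, pattern):
    """True if the whole nominal value matches the regular expression."""
    return value is not None and re.fullmatch(pattern, value) is not None


# Made-up rows for illustration only.
rows = [
    {"Outlook": "sunny", "Temperature": 85},
    {"Outlook": "overcast", "Temperature": 83},
    {"Outlook": "rain", "Temperature": 68},
    {"Outlook": "rain", "Temperature": 75},
]

# Equivalent of: Outlook = .*n.* && Temperature > 70
selected = [
    row for row in rows
    if matches_nominal(row["Outlook"], ".*n.*") and row["Temperature"] > 70
]
print([(r["Outlook"], r["Temperature"]) for r in selected])
# [('sunny', 85), ('rain', 75)]
```

The ‘overcast’ row fails the nominal condition because its value contains no ‘n’, and the first ‘rain’ row fails the numerical condition.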
Filtering examples according to their values with or condition
Figure 2.45: Tutorial process ‘Filtering examples according to their values with or condition’.
The ‘Labor-Negotiations’ data set is loaded using the Retrieve operator and the Filter Examples operator is applied on it with the parameter string “duration=? || pension !=\?”. The Duration attribute is a numerical
attribute, so the “attribute op value” syntax is used to select rows. 1 example satisfies the condition that the Duration attribute has a missing value. The Pension attribute is a nominal attribute, so
a regular expression is used to describe it. The condition “pension !=\?” selects all examples
that do not have a missing value in their Pension attribute. 18 examples qualify. Note that ‘?’ is used for missing values of numerical
attributes and ‘\?’ is used for missing values of nominal attributes: for nominal values the question mark must be escaped because, as noted above, a regular expression is
expected in this case. As these two conditions are joined using logical OR (||), the examples finally selected
are those that meet at least one of the conditions. 18 such rows are present that have no missing value in the Pension attribute or have a missing value in the Duration attribute. This
can be seen clearly in the Results Workspace.
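A logical OR keeps every example that satisfies at least one condition, which the following Python sketch illustrates. The rows are made up (they are not taken from the Labor-Negotiations data set), and None is used here as a stand-in for a missing value.

```python
# Made-up rows for illustration only; None stands for a missing value ('?').
rows = [
    {"id": 1, "duration": None, "pension": None},
    {"id": 2, "duration": 2, "pension": "none"},
    {"id": 3, "duration": None, "pension": "empl_contr"},
    {"id": 4, "duration": 3, "pension": None},
]

# Equivalent of: duration = ? || pension != \?
selected = [
    row for row in rows
    if row["duration"] is None or row["pension"] is not None
]
print([row["id"] for row in selected])  # [1, 2, 3]
```

Row 1 qualifies through the first condition only, row 2 through the second only, row 3 through both, and row 4 through neither.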
2.2.2 Sampling
Sample
This operator creates a sample from an ExampleSet by selecting
examples randomly. The size of a sample can be specified on absolute, relative and probability basis.
Description
This operator is similar in principle to the Filter Examples operator: it takes an ExampleSet
as input and delivers a subset of the ExampleSet as output. The difference is that the Filter
Examples operator filters examples on the basis of specified conditions, whereas the Sample operator
focuses on the number of examples and the class distribution in the resultant sample. Moreover,
the samples are generated randomly. The number of examples in the sample can be specified
on an absolute, relative or probability basis depending on the setting of the sample parameter. The
class distribution of the sample can be controlled by the balance data parameter.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) A randomized sample of the input ExampleSet is output of this
port.
original (ori) ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
sample (selection) This parameter determines how the amount of data is specified.
• absolute If the sample parameter is set to ‘absolute’ the sample consists of exactly the
specified number of examples. The required number of examples is specified in the
sample size parameter.
• relative If the sample parameter is set to ‘relative’ the sample is created as a fraction of
the total number of examples in the input ExampleSet. The required ratio of examples
is specified in the sample ratio parameter.
• probability If the sample parameter is set to ‘probability’ the sample is created on a
probability basis: each example is selected with the probability specified in the sample probability parameter.
balance data (boolean) You can set this parameter to true if you need to sample differently
for examples of a certain class. If this parameter is set to true, sample size, sample ratio and
sample probability parameters are replaced by sample size per class, sample ratio per class
and sample probability per class parameters respectively. These parameters allow you to
specify different sample sizes for different values of the label attribute.
sample size (integer) This parameter specifies the exact number of examples which should be
sampled. This parameter is only available when the sample parameter is set to ‘absolute’
and the balance data parameter is not set to true.
sample ratio (real) This parameter specifies the fraction of examples which should be sampled. This parameter is only available when the sample parameter is set to ‘relative’ and
the balance data parameter is not set to true.
sample probability (real) This parameter specifies the sample probability for each example.
This parameter is only available when the sample parameter is set to ‘probability’ and the
balance data parameter is not set to true.
sample size per class This parameter specifies the absolute sample size per class. This parameter is only available when the sample parameter is set to ‘absolute’ and the balance
data parameter is set to true.
sample ratio per class This parameter specifies the fraction of examples per class. This parameter is only available when the sample parameter is set to ‘relative’ and the balance
data parameter is set to true.
sample probability per class This parameter specifies the probability of examples per class.
This parameter is only available when the sample parameter is set to ‘probability’ and the
balance data parameter is set to true.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomizing examples of the sample. Using the same value of local random seed
will produce the same sample. Changing the value of this parameter changes the way the
examples are randomized, thus the sample will have a different set of examples.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
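The three settings of the sample parameter and the role of a fixed seed can be sketched in Python as follows. This is an illustration under assumptions, not the operator's implementation: the function name, the seed value and the rounding used for the relative case are choices made for this sketch.

```python
import random


def sample(examples, mode, amount, seed=1992):
    """Hypothetical sketch of absolute, relative and probability sampling."""
    rng = random.Random(seed)  # a fixed seed reproduces the same sample
    n = len(examples)
    if mode == "absolute":
        indices = sorted(rng.sample(range(n), amount))
    elif mode == "relative":
        # Rounding to the nearest integer is an assumption of this sketch.
        indices = sorted(rng.sample(range(n), round(n * amount)))
    elif mode == "probability":
        # Each example is kept independently, so the sample size varies.
        indices = [i for i in range(n) if rng.random() < amount]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [examples[i] for i in indices]


ids = list(range(1, 101))
print(len(sample(ids, "absolute", 20)))   # 20
print(len(sample(ids, "relative", 0.25))) # 25
print(sample(ids, "absolute", 5) == sample(ids, "absolute", 5))  # True
```

Only the absolute and relative settings fix the sample size in advance; with the probability setting the size depends on the random draws.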
Tutorial Processes
Sampling the Ripley-Set data set
The ‘Ripley-Set’ data set is loaded using the Retrieve operator. The Generate ID operator is applied on it so that the examples can be identified uniquely. A breakpoint is inserted at this stage
so that you can see the ExampleSet before the Sample operator is applied. You can see that there
are 250 examples with two possible classes: 0 and 1. 125 examples have class 0 and 125 examples
have class 1. Now, the Sample operator is applied on the ExampleSet. The sample parameter is
set to ‘relative’. The balance data parameter is set to true. The sample ratio per class parameter
specifies two ratios. Class 0 is assigned ratio 0.2. Thus, of all the examples where label attribute
is 0 only 20 percent will be selected. There were 125 examples with class 0, so 25 (i.e. 20% of
125) examples will be selected. Class 1 is assigned ratio 1. Thus, of all the examples where label
attribute is 1, 100 percent will be selected. There were 125 examples with class 1, so all 125 (i.e.
100% of 125) examples will be selected. Run the process and you can verify the results. Also
note that the examples are taken randomly. The randomization can be changed by changing
the local random seed parameter.
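The arithmetic of this tutorial can be checked with a short Python sketch of per-class relative sampling. The 250 synthetic rows below only reproduce the class counts (125 per class) and are not the Ripley-Set data; the function name and the rounding are assumptions of this sketch.

```python
import random
from collections import Counter


def sample_ratio_per_class(examples, label_of, ratios, seed=1):
    """Hypothetical sketch: sample each class with its own ratio."""
    rng = random.Random(seed)
    by_class = {}
    for example in examples:
        by_class.setdefault(label_of(example), []).append(example)
    sampled = []
    for label, members in by_class.items():
        size = round(len(members) * ratios[label])  # rounding is an assumption
        sampled.extend(rng.sample(members, size))
    return sampled


# Synthetic stand-in: (id, label) pairs with 125 examples per class.
examples = [(i, i % 2) for i in range(1, 251)]
result = sample_ratio_per_class(examples, lambda e: e[1], {0: 0.2, 1: 1.0})
counts = Counter(label for _, label in result)
print(counts[0], counts[1])  # 25 125
```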
Figure 2.46: Tutorial process ‘Sampling the Ripley-Set data set’.
Sample (Bootstrapping)
This operator creates a bootstrapped sample from an ExampleSet.
Bootstrapped sampling uses sampling with replacement, thus the
sample may not have all unique examples. The size of the sample
can be specified on absolute and relative basis.
Description
This operator is different from other sampling operators because it uses sampling with replacement. In sampling with replacement, at every step all examples have equal probability of being
selected. Once an example has been selected for the sample, it remains a candidate for selection
and can be selected again in any later step. Thus a sample with replacement can contain
the same example multiple times. More importantly, sampling with replacement can
be used to generate a sample that is larger than the original ExampleSet. The number of
examples in the sample can be specified on an absolute or relative basis depending on the setting
of the sample parameter.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Generate
ID operator in the attached Example Process.
Output Ports
example set output (exa) A bootstrapped sample of the input ExampleSet is output of this
port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
sample (selection) This parameter determines how the amount of data is specified.
• absolute If the sample parameter is set to ‘absolute’ the sample consists of exactly the specified number of examples. The required number of examples is specified in
the sample size parameter.
• relative If the sample parameter is set to ‘relative’ the sample is created as a fraction of
the total number of examples in the input ExampleSet. The required ratio of examples
is specified in the sample ratio parameter.
sample size (integer) This parameter specifies the exact number of examples which should be
sampled. This parameter is only available when the sample parameter is set to ‘absolute’.
sample ratio (real) This parameter specifies the fraction of examples which should be sampled. This parameter is only available when the sample parameter is set to ‘relative’.
use weights (boolean) If set to true, example weights will be considered during the bootstrapping if such weights are present.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomizing examples of the sample. Using the same value of the local random
seed will produce the same sample. Changing the value of this parameter changes the way
the examples are randomized, thus the sample will have a different set of examples.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
Tutorial Processes
Bootstrapped Sampling of the Golf data set
Figure 2.47: Tutorial process ‘Bootstrapped Sampling of the Golf data set’.
The ‘Golf’ data set is loaded using the Retrieve operator. The Generate ID operator is applied
on it to create an id attribute with ids starting from 1. This is done only so that the examples can be
identified uniquely; the id attribute is not otherwise needed here. A breakpoint is inserted
here so that you can view the ExampleSet before the application of the Sample (Bootstrapping)
operator. As you can see, the ExampleSet has 14 examples. The Sample (Bootstrapping) operator is applied on the ExampleSet. The sample parameter is set to ‘absolute’ and the sample size
parameter is set to 140. Thus a sample 10 times the size of the original ExampleSet is generated.
Instead of repeating each example of the input ExampleSet 10 times, examples are selected randomly. You can verify this by viewing the results of this process in the Results Workspace.
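Sampling with replacement is easy to reproduce in Python. The sketch below is an illustration, not the operator's implementation: the function name and seed are hypothetical, the ids 1 to 14 stand in for the Golf examples, and example weights are ignored.

```python
import random
from collections import Counter


def bootstrap_sample(examples, sample_size, seed=1):
    """Hypothetical sketch: draw sample_size examples with replacement."""
    rng = random.Random(seed)
    # random.Random.choices draws with replacement, so repeats are possible
    # and the sample may be larger than the input.
    return rng.choices(examples, k=sample_size)


ids = list(range(1, 15))  # stand-in for the 14 Golf examples
bootstrapped = bootstrap_sample(ids, 140)
counts = Counter(bootstrapped)
print(len(bootstrapped))  # 140
print(counts.most_common(3))  # the most frequently drawn ids
```

Because the draws are random, the ids generally do not appear exactly 10 times each, which is the point made in the tutorial above.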
Sample (Kennard-Stone)
This operator creates a sample from the given ExampleSet by using the Kennard-Stone algorithm. The size of the sample can be
specified on absolute and relative basis.
Description
The Sample (Kennard-Stone) operator performs Kennard-Stone sampling. This sampling algorithm works as follows:
• Find the two most separated points in the ExampleSet.
• For each candidate point, find the smallest distance to any already selected point.
• Select the point which has the largest of these smallest distances.
The last two steps are repeated until the requested number of examples has been selected.
This algorithm always gives the same result because the two starting points are always the
same. This implementation reduces the number of iterations by holding a list with candidates
of the largest smallest distances. Please note that the number of examples in the sample may
not be exactly the same as specified because of the way this algorithm works.
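The steps listed above can be sketched in plain Python. This is the textbook form of the algorithm with Euclidean distance (an assumption of this sketch), without the candidate-list optimisation mentioned above; the function name and the five sample points are made up for illustration.

```python
import math
from itertools import combinations


def kennard_stone(points, sample_size):
    """Return the indices of the selected points, in order of selection."""
    if not 2 <= sample_size <= len(points):
        raise ValueError("sample_size must be between 2 and len(points)")
    # Step 1: start with the two most separated points.
    first, second = max(
        combinations(range(len(points)), 2),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    # Smallest distance from every point to the already selected points.
    min_dist = [
        min(math.dist(p, points[first]), math.dist(p, points[second]))
        for p in points
    ]
    while len(selected) < sample_size:
        candidates = [i for i in range(len(points)) if i not in selected]
        # Steps 2 and 3: pick the candidate with the largest smallest distance.
        chosen = max(candidates, key=lambda i: min_dist[i])
        selected.append(chosen)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[chosen]))
    return selected


points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(kennard_stone(points, 3))  # [0, 4, 3]
```

In this example the points at 0 and 10 are the most separated pair, and the point at 3 is then the one farthest from both of them.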
The sampling operators are similar in principle to the Filter Examples operator: they take
an ExampleSet as input and deliver a subset of the ExampleSet as output. The difference is
that the Filter Examples operator filters examples on the basis of specified conditions, whereas
the sampling operators focus on the number of examples and class distribution in the resultant
sample. Most sampling operators select examples randomly; as noted above, this operator always gives the same result. The number of examples in the sample
can be specified on an absolute or relative basis depending on the setting of the sample parameter.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) The Kennard-Stone algorithm is applied and the resultant sample
of the input ExampleSet is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
sample (selection) This parameter determines how the amount of data is specified.
• absolute If the sample parameter is set to ‘absolute’ then the sample consists of
exactly the specified number of examples. The required number of examples is specified
in the sample size parameter.
• relative If the sample parameter is set to ‘relative’ then the sample is created as a
fraction of the total number of examples in the input ExampleSet. The required ratio
of examples is specified in the sample ratio parameter.
sample size (integer) This parameter specifies the exact number of examples which should be
sampled. This parameter is only available when the sample parameter is set to ‘absolute’.
sample ratio (real) This parameter specifies the fraction of examples which should be sampled. This parameter is only available when the sample parameter is set to ‘relative’.
Tutorial Processes
Kennard-Stone sampling of the Iris data set
Figure 2.48: Tutorial process ‘Kennard-Stone sampling of the Iris data set’.
The ‘Iris’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can view the ExampleSet. You can see that the ExampleSet has 150 examples. The Sample
(Kennard-Stone) operator is applied on the ExampleSet. The sample parameter is set to ‘absolute’ and the sample size parameter is set to 15. Thus the resultant sample will have only 15
examples. The resultant ExampleSet with 15 examples can be seen in the Results Workspace.
Sample (Stratified)
This operator creates a stratified sample from an ExampleSet.
Stratified sampling builds random subsets and ensures that the
class distribution in the subsets is the same as in the whole ExampleSet. This operator cannot be applied on data sets without
a label or with a numerical label. The size of the sample can be
specified on absolute and relative basis.
Description
Stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole ExampleSet. For example, in the case of a binominal classification, stratified sampling builds random subsets such that each subset contains roughly the same
proportions of the two values of class labels.
When there are different classes in an ExampleSet, it is sometimes advantageous to sample
each class independently. Stratification is the process of dividing examples of the ExampleSet
into homogeneous subgroups (i.e. classes) before sampling. The subgroups should be mutually
exclusive, i.e. every example in the ExampleSet must be assigned to only one subgroup (or class).
The subgroups should also be collectively exhaustive, i.e. no example can be excluded. Then
random sampling is applied within each subgroup. This often improves the representativeness
of the sample by reducing the sampling error.
A real-world example of using stratified sampling is a political survey. If the respondents need to reflect the diversity of the population, the researcher specifically
seeks to include participants from various groups, defined for example by race or religion, in
proportion to their share of the total population. A stratified survey can thus claim
to be more representative of the population than a survey based on simple random sampling or systematic sampling.
In contrast to the simple sampling operator (the Sample operator), this operator performs
stratified sampling of data sets with nominal label attributes, i.e. the class distribution
remains (almost) the same after sampling. Hence, this operator cannot be applied on data sets
without a label or with a numerical label. In these cases simple sampling without stratification
should be performed through the Sample operator.
This operator is similar in principle to the Filter Examples operator: it takes an ExampleSet
as input and delivers a subset of the ExampleSet as output. The difference is that the Filter
Examples operator filters examples on the basis of specified conditions, whereas the Sample (Stratified) operator
focuses on the number of examples and the class distribution in the resultant sample. Moreover,
the samples are generated randomly. The number of examples in the sample can be specified
on an absolute or relative basis depending on the setting of the sample parameter.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Filter Examples operator in the attached Example Process.
Output Ports
example set output (exa) A randomized sample of the input ExampleSet is output of this
port. The class distribution of the sample is (almost) the same as the class distribution of
the complete ExampleSet.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
sample (selection) This parameter determines how the amount of data is specified.
• absolute If the sample parameter is set to ‘absolute’ then the sample consists of
exactly the specified number of examples. The required number of examples is specified
in the sample size parameter.
• relative If the sample parameter is set to ‘relative’ then the sample is created as a
fraction of the total number of examples in the input ExampleSet. The required ratio
of examples is specified in the sample ratio parameter.
sample size (integer) This parameter specifies the exact number of examples which should be
sampled. This parameter is only available when the sample parameter is set to ‘absolute’.
sample ratio (real) This parameter specifies the fraction of examples which should be sampled. This parameter is only available when the sample parameter is set to ‘relative’.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomizing examples of the sample. Using the same value of local random seed
will produce the same sample. Changing the value of this parameter changes the way the
examples are randomized, thus the sample will have a different set of examples.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
Tutorial Processes
Stratified Sampling of the Golf data set
The ‘Golf’ data set is loaded using the Retrieve operator. The Filter Example Range operator
is applied on it to select the first 10 examples. This is done only to simplify the Example Process;
the filtering is not otherwise necessary here. A breakpoint is inserted here so that you can
view the ExampleSet before the application of the Sample (Stratified) operator. As you can see,
the ExampleSet has 10 examples. 6 examples (i.e. 60%) belong to class ‘yes’ and 4 examples
(i.e. 40%) belong to class ‘no’. The Sample (Stratified) operator is applied on the ExampleSet.
The sample parameter is set to ‘absolute’ and the sample size parameter is set to 5. Thus the
resultant sample will have only 5 examples. The sample will have the same class distribution
as the class distribution of the input ExampleSet i.e. 60% examples with class ‘yes’ and 40%
examples with class ‘no’. You can verify this by viewing the results of this process. 3 out of 5
examples (i.e. 60%) have class ‘yes’ and 2 out of 5 examples (i.e. 40%) have class ‘no’.
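The class proportions in this tutorial can be reproduced with a small Python sketch of stratified sampling. The ten (id, label) pairs below only reproduce the class counts (6 ‘yes’, 4 ‘no’) and are a stand-in for the filtered Golf data; the function name and the per-class rounding are assumptions of this sketch, and with other sizes the rounded class counts may not add up to exactly the requested sample size.

```python
import random
from collections import Counter


def stratified_sample(examples, label_of, sample_size, seed=1):
    """Hypothetical sketch: sample each class in proportion to its size."""
    rng = random.Random(seed)
    by_class = {}
    for example in examples:
        by_class.setdefault(label_of(example), []).append(example)
    fraction = sample_size / len(examples)
    sampled = []
    for members in by_class.values():
        # Random sampling is applied within each class (stratum).
        sampled.extend(rng.sample(members, round(len(members) * fraction)))
    return sampled


labels = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]
examples = list(enumerate(labels, start=1))  # stand-in (id, label) pairs
result = stratified_sample(examples, lambda e: e[1], 5)
print(Counter(label for _, label in result))  # 3 'yes' and 2 'no'
```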
Figure 2.49: Tutorial process ‘Stratified Sampling of the Golf data set’.
Split Data
This operator produces the desired number of subsets of the given
ExampleSet. The ExampleSet is partitioned into subsets according
to the specified relative sizes.
Description
The Split Data operator takes an ExampleSet as its input and delivers the subsets of that ExampleSet through its output ports. The number of subsets (or partitions) and the relative size of
each partition are specified through the partitions parameter. The sum of the ratios of all partitions should be 1. The sampling type parameter decides how the examples are distributed over
the resultant partitions. For more information about this operator please study the parameters
section of this description. This operator is different from other sampling and filtering operators
in the sense that it is capable of delivering multiple partitions of the given ExampleSet.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process.
Output Ports
partition (par) This operator can have multiple partition ports. The number of useful partition ports depends on the number of partitions (or subsets) this operator is configured to produce. The partitions parameter is used for specifying the desired number of
partitions.
Parameters
partitions (enumeration) This is the most important parameter of this operator. It specifies
the number of partitions and the relative ratio of each partition. The user only needs
to specify the ratio of each partition. The number of required partitions is not explicitly
specified by the user because it is derived automatically from the number of ratios specified
in this parameter. The ratios should be between 0 and 1. The sum of all ratios should be
1. For better understanding of this parameter please study the attached Example Process.
sampling type (selection) The Split Data operator can use several types of sampling for building the subsets. The following options are available:
• Linear sampling Linear sampling simply divides the ExampleSet into partitions without changing the order of the examples i.e. subsets with consecutive examples are
created.
• Shuffled sampling Shuffled sampling builds random subsets of the ExampleSet. Examples are chosen randomly for making subsets.
• Stratified sampling Stratified sampling builds random subsets and ensures that the
class distribution in the subsets is the same as in the whole ExampleSet. For example
in the case of a binominal classification, Stratified sampling builds random subsets
such that each subset contains roughly the same proportions of the two values of the
class labels.
• Automatic Uses stratified sampling if the label is nominal, shuffled sampling otherwise.
use local random seed (boolean) Indicates if a local random seed should be used for randomizing examples of a subset. Using the same value of local random seed will produce the same
subsets. Changing the value of this parameter changes the way examples are randomized,
thus subsets will have a different set of examples. This parameter is only available if Shuffled or Stratified sampling is selected. It is not available for Linear sampling because linear sampling
requires no randomization; examples are selected in sequence.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
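The interplay of stratified sampling and the local random seed can be sketched outside RapidMiner. The following Python sketch is only an illustration under stated assumptions, not RapidMiner's implementation: the function name stratified_split, the restriction to two partitions, the rounding of the cut point and the sample labels are all made up for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratio, seed):
    """Split example indices into two parts, keeping class proportions (hypothetical helper)."""
    rng = random.Random(seed)          # local random seed: same seed, same subsets
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    first_part, second_part = [], []
    for indices in by_class.values():
        rng.shuffle(indices)           # randomize within each class
        cut = round(len(indices) * ratio)
        first_part.extend(indices[:cut])
        second_part.extend(indices[cut:])
    return sorted(first_part), sorted(second_part)

# Illustrative labels: 8 'yes' and 4 'no' examples.
labels = ["yes"] * 8 + ["no"] * 4
first_part, second_part = stratified_split(labels, 0.75, seed=1992)
```

With these labels both parts keep the 2:1 ratio of 'yes' to 'no', and calling the function again with the same seed returns the same two parts.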
Tutorial Processes
Creating partitions of the Golf data set using the Split Data operator
The ‘Golf’ data set is loaded using the Retrieve operator. The Generate ID operator is applied on
it so the examples can be identified uniquely. A breakpoint is inserted here so the ExampleSet
can be seen before the application of the Split Data operator. It can be seen that the ExampleSet
has 14 examples which can be uniquely identified by the id attribute. The examples have ids
from 1 to 14. The Split Data operator is applied next. The sampling type parameter is set to
‘linear sampling’. The partitions parameter is configured to produce two partitions with ratios
0.8 and 0.2 respectively. The partitions can be seen in the Results Workspace. The number of
examples in each partition is calculated by this formula:
(Total number of examples) / (sum of ratios) * ratio of this partition
If the result is not a whole number, it is rounded. The number of examples in each partition
turns out to be:
(14) / (0.8 + 0.2) * (0.8) = 11.2, which is rounded to 11
(14) / (0.8 + 0.2) * (0.2) = 2.8, which is rounded to 3
Figure 2.50: Tutorial process ‘Creating partitions of the Golf data set using the Split Data
operator’.
It is good practice to adjust the ratios so that they sum to 1, but this operator also works
if the sum of the ratios is lower or greater than 1. For example, if two partitions are created with
ratios 1.0 and 0.4, the resultant partitions are calculated as follows:
(14) / (1.0 + 0.4) * (1.0) = 10
(14) / (1.0 + 0.4) * (0.4) = 4
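The partition-size formula from this tutorial can be checked with a few lines of Python. This is a minimal sketch, not RapidMiner code: the function names are hypothetical, and Python's built-in round() is assumed to stand in for the rounding the operator performs (the manual does not say how exact halves are treated).

```python
def partition_sizes(total, ratios):
    """(total number of examples) / (sum of ratios) * ratio, rounded, for each partition."""
    ratio_sum = sum(ratios)
    return [round(total / ratio_sum * ratio) for ratio in ratios]

def linear_split(examples, ratios):
    """Linear sampling: cut the examples into consecutive blocks without reordering."""
    parts, start = [], 0
    for size in partition_sizes(len(examples), ratios):
        parts.append(examples[start:start + size])
        start += size
    return parts

ids = list(range(1, 15))                  # 14 examples with ids 1..14, as in the tutorial
first, second = linear_split(ids, [0.8, 0.2])
sizes_unnormalized = partition_sizes(14, [1.0, 0.4])
```

Here first holds ids 1 to 11 and second holds ids 12 to 14, and the ratios 1.0 and 0.4 give partitions of 10 and 4 examples, matching the numbers above.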
2.2.3 Sort
Shuffle
This operator creates a new, shuffled ExampleSet from the given
ExampleSet by making a new copy of the given ExampleSet in the
main memory.
Description
The Shuffle operator creates a new, shuffled ExampleSet by making a new copy of the given ExampleSet in the main memory. Please note that the system may run out of memory if the ExampleSet is too large. The local random seed parameter can be used to make the shuffling
reproducible.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) The shuffled ExampleSet is the output of this port.
original (ori) The ExampleSet that was given as input is passed unchanged to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomization. Using the same value of the local random seed will produce the
same randomization.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
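The behaviour described above (a shuffled copy, an untouched original and a reproducible order for a fixed seed) can be sketched in Python. This is an illustration only; the helper name shuffle_copy and the seed value are assumptions, and Python's random number generator is not the one RapidMiner uses.

```python
import random

def shuffle_copy(examples, seed=None):
    """Return a shuffled copy; the input list itself is left unchanged."""
    rng = random.Random(seed)   # a fixed seed yields the same order on every run
    shuffled = list(examples)   # copy in memory, like the operator's new ExampleSet
    rng.shuffle(shuffled)
    return shuffled

original = list(range(1, 11))
run_a = shuffle_copy(original, seed=2001)
run_b = shuffle_copy(original, seed=2001)
```

Both runs produce the same order because they share a seed, while original still holds the examples in their input order, which corresponds to what the original (ori) port delivers.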
Tutorial Processes
Shuffling the Iris data set
The ‘Iris’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the ExampleSet has an id attribute.
The ExampleSet is sorted in ascending order of this attribute. The Shuffle operator is applied
on this ExampleSet to randomize the order of its examples. The resultant shuffled ExampleSet
can be seen in the Results Workspace.
Figure 2.51: Tutorial process ‘Shuffling the Iris data set’.
Sort
This operator sorts the input ExampleSet in ascending or descending order according to a single attribute.
Description
This operator sorts the ExampleSet provided at the input port. The complete data set is sorted
according to a single attribute. This attribute is specified using the attribute name parameter.
Sorting is done in increasing or decreasing direction depending on the setting of the sorting
direction parameter.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) The sorted ExampleSet is the output of this port.
original (ori) The ExampleSet that was given as input is passed unchanged to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute name (string) This parameter is used to specify the attribute which should be used
for sorting the ExampleSet.
sorting direction This parameter indicates the direction of the sorting. The ExampleSet can
be sorted in increasing (ascending) or decreasing (descending) order.
Tutorial Processes
Sorting the Golf data set according to Temperature
Figure 2.52: Tutorial process ‘Sorting the Golf data set according to Temperature’.
The ‘Golf’ data set is loaded using the Retrieve operator. The Sort operator is applied on it.
The attribute name parameter is set to ‘Temperature’. The sorting direction parameter is set to
‘increasing’. Thus the ‘Golf’ data set is sorted in ascending order of the ‘Temperature’ attribute.
The example with the smallest value of the ‘Temperature’ attribute becomes the first example
and the example with the largest value of the ‘Temperature’ attribute becomes the last example
of the ExampleSet.
Sorting on multiple attributes
This Example Process shows how two Sort operators can be used to sort an ExampleSet on two
attributes. The ‘Golf’ data set is loaded using the Retrieve operator. The first Sort operator is applied
on it with the attribute name parameter set to ‘Temperature’ and the sorting direction parameter set
to ‘increasing’. Then a second Sort operator is applied with the attribute name parameter set
to ‘Humidity’ and the sorting direction parameter again set to ‘increasing’. Thus the ‘Golf’ data
set ends up sorted in ascending order of the ‘Humidity’ attribute: the example with the smallest
‘Humidity’ value becomes the first example and the example with the largest ‘Humidity’ value
becomes the last example of the ExampleSet. Examples that share the same ‘Humidity’ value
keep the order produced by the first Sort operator, so among them the examples with a smaller
value of the ‘Temperature’ attribute precede the examples with a higher value of the ‘Temperature’
attribute. This can be seen in the Results Workspace.
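The two-pass approach works because the second sort leaves examples with equal keys in the order they already had. The following Python sketch shows the same idea with Python's sorted(), which is guaranteed to be stable. The rows are illustrative values, not the actual ‘Golf’ data, and the sketch says nothing about how the Sort operator is implemented internally.

```python
# Illustrative rows only; the ids and values are made up for this sketch.
rows = [
    {"id": 1, "Temperature": 85, "Humidity": 85},
    {"id": 2, "Temperature": 80, "Humidity": 90},
    {"id": 3, "Temperature": 70, "Humidity": 80},
    {"id": 4, "Temperature": 68, "Humidity": 80},
    {"id": 5, "Temperature": 75, "Humidity": 80},
    {"id": 6, "Temperature": 72, "Humidity": 90},
]

# First pass: secondary key. Second pass: primary key.
by_temperature = sorted(rows, key=lambda row: row["Temperature"])
by_humidity = sorted(by_temperature, key=lambda row: row["Humidity"])

# A single sort on a (primary, secondary) tuple gives the same order.
one_pass = sorted(rows, key=lambda row: (row["Humidity"], row["Temperature"]))

two_pass_ids = [row["id"] for row in by_humidity]
```

Note the order of the passes: the attribute that should matter least is sorted first, and the attribute that should matter most is sorted last.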
Figure 2.53: Tutorial process ‘Sorting on multiple attributes’.
2.3 Table
2.3.1 Grouping
Aggregate
This operator performs the aggregation functions known from
SQL. It offers much of the functionality of the SQL aggregation functions in a similar form, and the GROUP BY and HAVING clauses can be imitated
using this operator.
Description
The Aggregate operator creates a new ExampleSet from the input ExampleSet showing the results of the selected aggregation functions. Many aggregation functions are supported, including
SUM, COUNT, MIN, MAX, AVERAGE and similar functions known from SQL. The
functionality of the GROUP BY clause of SQL can be imitated by using the group by attributes
parameter. A basic understanding of the GROUP BY clause of SQL is needed to use this parameter because it works the same way. To imitate
the HAVING clause of SQL, apply the Filter Examples operator
after the Aggregate operator. The operator focuses
on obtaining summary information such as averages and counts: it can group the examples of
an ExampleSet into smaller sets and apply aggregation functions to those sets. Please study the
attached Example Process for a better understanding of this operator.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Filter Examples
operator in the attached Example Process. Output of other operators can also be used as
input.
Output Ports
example set (exa) The ExampleSet generated after applying the specified aggregation functions is the output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
use default aggregation (boolean) This parameter allows you to define the default aggregation for selected attributes. A number of parameters become available if this parameter is
set to true. These parameters allow you to select the attributes and corresponding default
aggregation function.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes for which every example satisfies the given numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numeric condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list. Attributes can be shifted
to the right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners.
It is advisable to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attributes operator. If
this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attributes operator and only those attributes are selected that satisfy the
conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
default aggregation function This parameter is only available when the use default aggregation parameter is set to true. It is used for specifying the default aggregation function for
the selected attributes.
aggregation attributes This parameter is one of the most important parameters of the operator. It allows you to select attributes and the aggregation function to apply on them.
Many aggregation functions are available including count, average, minimum, maximum,
variance and many more.
group by attributes This operator can group examples of the input ExampleSet into smaller
groups using this parameter. The aggregation functions are applied on these groups. This
parameter allows the Aggregate operator to replicate the functionality of the known GROUP
BY clause of SQL. From version 6.0.3 on the operator will cause an error if a given attribute
can’t be found in the example set.
count all combinations (boolean) This parameter indicates if all possible combinations of
the values of the group by attributes are counted, even if they don’t occur. All possible
combinations may result in a huge number so handle this parameter carefully.
only distinct (boolean) This parameter indicates if only examples with distinct values for the
aggregation attribute should be used for the calculation of the aggregation function.
ignore missings (boolean) This parameter indicates if missing values should be ignored and
aggregation functions should be applied only on existing values. If this parameter is not
set to true then the aggregated value will be a missing value in the presence of missing
values in the selected attribute.
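The effect of the only distinct and ignore missings parameters on a single aggregation can be sketched as follows. This Python sketch is an illustration under assumptions: the function name aggregate_average is hypothetical, missing values are represented by NaN, and only the average function is shown.

```python
import math

def aggregate_average(values, ignore_missings=True, only_distinct=False):
    """Average of one group's values, mimicking two Aggregate parameters (sketch)."""
    present = [v for v in values if not math.isnan(v)]
    if len(present) < len(values) and not ignore_missings:
        return math.nan                            # a missing value makes the result missing
    if only_distinct:
        present = list(dict.fromkeys(present))     # keep each distinct value once
    return sum(present) / len(present) if present else math.nan

group_values = [70.0, 70.0, 80.0, math.nan]        # illustrative values with one missing entry

plain = aggregate_average(group_values)                              # uses 70, 70, 80
distinct = aggregate_average(group_values, only_distinct=True)       # uses 70, 80
strict = aggregate_average(group_values, ignore_missings=False)      # NaN
```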
Tutorial Processes
Imitating an SQL aggregation query using the Aggregate operator
Figure 2.54: Tutorial process ‘Imitating an SQL aggregation query using the Aggregate operator’.
This Example Process discusses an arbitrary scenario, describes how this scenario could
be handled using SQL aggregation functions, and then imitates the SQL solution in RapidMiner.
The Aggregate operator plays a key role in this process.
Let us assume a scenario where we want to apply certain aggregation functions on the Golf
data set. We don’t want to include examples where the Outlook attribute has the value ‘overcast’. We group the remaining examples of the ‘Golf’ data set by values of the Play and Wind
attributes. We wish to find the average Temperature and average Humidity for these groups.
Once these averages have been calculated, we want to see only those examples where the average Temperature is above 71. Lastly, we want to see the results in ascending order of the average
Temperature.
This problem can be solved by the following SQL query:
SELECT Play, Wind, AVG(Temperature), AVG(Humidity)
FROM Golf
WHERE Outlook NOT LIKE 'overcast'
GROUP BY Play, Wind
HAVING AVG(Temperature) > 71
ORDER BY AVG(Temperature)
The SELECT clause selects the attributes to be displayed. The FROM clause specifies the data
set. The WHERE clause pre-excludes the examples where the Outlook attribute has value ‘overcast’. The GROUP BY clause groups the data set according to the specified attributes. The HAVING clause filters the results after the aggregation functions have been applied. Finally the ORDER BY clause sorts the results in ascending order of the Temperature averages.
Here is how this scenario can be tackled using RapidMiner. First of all the Retrieve operator
is used for loading the ‘Golf’ data set. This is similar to the FROM clause. Then the Select Attributes operator is applied on it to select the required attributes. This works a little differently
from the SQL query. If we select only the Play and Wind attributes as in the query, then the subsequent operators cannot be applied. Thus we select all attributes for now. You will see later that
the attribute set is reduced automatically, so the Select Attributes operator is not really
required here. Then the Filter Examples operator is applied to pre-exclude examples where the
Outlook attribute has the value ‘overcast’. This is similar to the WHERE clause of SQL. Then the
Aggregate operator is applied on the remaining examples. The Aggregate operator performs a
number of tasks here. Firstly, it specifies the aggregation functions using the aggregation attributes parameter. We need averages of the Temperature and Humidity attribute; this is specified using the aggregation attributes parameter. Secondly, we do not want the averages of the
entire data set. We want the averages by groups, grouped by the Play and Wind attribute values.
These groups are specified using the group by attributes parameter of the Aggregate operator.
Thirdly, required attributes are automatically filtered by this operator. Only those attributes
appear in the resultant data set that have been specified in the Aggregate operator. Next, we
are interested only in those examples where the average Temperature is greater than 71. This
condition can be applied using the Filter Examples operator. This step is similar to the HAVING
clause. Lastly we want the results to be sorted. The Sort operator is used to do the required
sorting. This step is very similar to the ORDER BY clause. Breakpoints are inserted after every
operator in the Example Process so that you can understand the part played by each operator.
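The chain of operators described above (filter, aggregate by group, filter again, sort) can be mirrored step by step in plain Python. This is a minimal sketch, not an export of the Example Process: the rows are a small illustrative sample rather than the full ‘Golf’ data set, and the result attribute names avg_Temperature and avg_Humidity are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows only (FROM).
golf = [
    {"Outlook": "sunny", "Temperature": 85, "Humidity": 85, "Wind": "false", "Play": "no"},
    {"Outlook": "sunny", "Temperature": 80, "Humidity": 90, "Wind": "true", "Play": "no"},
    {"Outlook": "overcast", "Temperature": 83, "Humidity": 78, "Wind": "false", "Play": "yes"},
    {"Outlook": "rain", "Temperature": 70, "Humidity": 96, "Wind": "false", "Play": "yes"},
    {"Outlook": "rain", "Temperature": 68, "Humidity": 80, "Wind": "false", "Play": "yes"},
    {"Outlook": "rain", "Temperature": 65, "Humidity": 70, "Wind": "true", "Play": "no"},
    {"Outlook": "sunny", "Temperature": 75, "Humidity": 70, "Wind": "true", "Play": "yes"},
]

# WHERE: Filter Examples drops the 'overcast' examples.
filtered = [row for row in golf if row["Outlook"] != "overcast"]

# GROUP BY + AVG: Aggregate groups by Play and Wind and averages two attributes.
groups = defaultdict(list)
for row in filtered:
    groups[(row["Play"], row["Wind"])].append(row)
aggregated = [
    {
        "Play": play,
        "Wind": wind,
        "avg_Temperature": mean(r["Temperature"] for r in members),
        "avg_Humidity": mean(r["Humidity"] for r in members),
    }
    for (play, wind), members in groups.items()
]

# HAVING: a second Filter Examples on the aggregated values.
having = [row for row in aggregated if row["avg_Temperature"] > 71]

# ORDER BY: Sort in ascending order of the average Temperature.
result = sorted(having, key=lambda row: row["avg_Temperature"])
```

As in the RapidMiner process, only the group by attributes and the aggregated attributes survive the aggregation step, so no separate attribute selection is needed.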
2.3.2 Rotation
De-Pivot
This operator transforms the ExampleSet by converting the examples of the selected attributes (usually attributes that measure the
same characteristic) into examples of a single attribute.
Description
This operator is usually used when your ExampleSet has multiple attributes that measure the
same characteristic (possibly at different time intervals) and you want to merge these observations into a single attribute without loss of information. If the original ExampleSet has n examples and k attributes that measure the same characteristic, the ExampleSet will have
k x n examples after application of this operator. The k attributes are combined into one attribute,
which holds the n values of each of the k original attributes. This can be easily understood by
studying the attached Example Process.
In other words, this operator converts an ExampleSet by dividing examples which consist of
multiple observations (at different times) into multiple examples, where each example covers
one point in time. An index attribute is added in the ExampleSet, which denotes the actual point
in time the example belongs to after the transformation.
The keep missings parameter specifies whether an example should be kept, even if it has missing values for all series at a certain point in time. The create nominal index parameter is only
applicable if only one time series per example exists. If it is set, the
names of the attributes representing the single time points are used as index attribute values instead of a numeric index.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input.
Output Ports
example set output (exa) The selected attributes are converted into examples of a new attribute and the resultant ExampleSet is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute name (list) This parameter maps a number of source attributes onto result attributes.
The attribute name parameter is used for specifying the group of attributes that you want to
combine and the name of the new attribute. The attributes of a group are selected through
a regular expression. There can be numerous groups with each group having multiple attributes.
index attribute (string) This parameter specifies the name of the newly created index attribute. The index attribute is used for differentiating between examples of different attributes of a group after the transformation.
create nominal index (boolean) The create nominal index parameter is only applicable if only
one time series per example exists. If it is set, the names of
the attributes representing the single time points are used as index attribute values instead of a numeric index.
keep missings (boolean) The keep missings parameter specifies whether an example should
be kept, even if it has missing values for all series at a certain point in time.
Tutorial Processes
Merging multiple attributes that measure the same characteristic into a single
attribute
Figure 2.55: Tutorial process ‘Merging multiple attributes that measure the same characteristic
into a single attribute’.
This process starts with the Subprocess operator which delivers an ExampleSet. The subprocess is used only for creating a sample ExampleSet, so it is not important to understand what
is going on inside it. A breakpoint is inserted after the subprocess so that you can
have a look at the ExampleSet. You can see that the ExampleSet has 14 examples and two
attributes, ‘Morning’ and ‘Evening’. These attributes measure the temperature of an area in the
morning and evening respectively. We want to convert these attributes into a single attribute
but we still want to be able to differentiate between morning and evening temperatures.
The De-Pivot operator is applied on this ExampleSet to perform this task. The attribute name
parameter is used for specifying the group of attributes that you want to combine and the name
of the new attribute. The attributes of a group are selected through a regular expression. There
can be numerous groups with each group having multiple attributes. In our case, there is only
one group which has all the attributes of the ExampleSet (i.e. both ‘Morning’ and ‘Evening’
attributes). The new attribute is named ‘Temperatures’ and the regular expression ‘.*’ is used
for selecting all the attributes of the ExampleSet. The index attribute is used for differentiating
between examples of different attributes of a group after transformation. The name of the index
attribute is set to ‘Time’. The create nominal index parameter is also set to true so that the
resultant ExampleSet is more self-explanatory.
Execute the process and have a look at the resultant ExampleSet. You can see that there are 28
examples in this ExampleSet. The original ExampleSet had 14 examples, and 2 attributes were
grouped, therefore the resultant ExampleSet has 28 (i.e. 14 x 2) examples. There are 14 examples from the Morning attribute and 14 examples from the Evening attribute in the ‘Temperatures’
attribute. The ‘Time’ attribute indicates whether an example measures morning or evening temperature.
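The transformation in this tutorial can be sketched in Python for a single attribute group. This is an illustration under assumptions, not the operator's implementation: the function name de_pivot is hypothetical, the temperature values are made up, missing values are represented by None, attributes that do not match the pattern are simply ignored, and the order of the output rows may differ from what RapidMiner produces.

```python
import re

def de_pivot(rows, pattern, new_attribute, index_attribute, keep_missings=False):
    """Turn every matching attribute value into its own example with a nominal index."""
    result = []
    for row in rows:
        for name, value in row.items():
            if not re.fullmatch(pattern, name):
                continue                      # attribute is not part of the group
            if value is None and not keep_missings:
                continue                      # drop time points without a value
            result.append({index_attribute: name, new_attribute: value})
    return result

# Three illustrative examples with two attributes each.
wide = [
    {"Morning": 12.0, "Evening": 17.5},
    {"Morning": 10.5, "Evening": 15.0},
    {"Morning": 11.0, "Evening": None},
]

long_rows = de_pivot(wide, ".*", "Temperatures", "Time")
long_rows_all = de_pivot(wide, ".*", "Temperatures", "Time", keep_missings=True)
```

With keep_missings enabled, n = 3 examples and k = 2 attributes give k x n = 6 examples; without it, the one time point that has no value is dropped.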
Pivot
This operator rotates an ExampleSet by combining multiple examples of the same group into a single example.
Description
The Pivot operator rotates the given ExampleSet by combining multiple examples of the same group
into a single example. The group attribute parameter specifies the grouping attribute (i.e. the attribute which identifies examples belonging to the groups). The resultant ExampleSet has n
examples where n is the number of unique values of the group attribute. The index attribute parameter specifies the attribute whose values are used to identify the examples inside the groups.
The values of this attribute are used to name the group attributes which are created during the
pivoting. Typically the values of such an attribute capture subgroups or dates. The resultant
ExampleSet has m regular attributes in addition to the group attribute where m is the number
of unique values of the index attribute. If the given ExampleSet contains example weights (i.e.
an attribute with weight role), these weights may be aggregated in each group to maintain the
weightings among groups. This description can be easily understood by studying the attached
Example Process.
Differentiation
• Transpose The Transpose operator simply rotates the given ExampleSet (i.e. interchanges
rows and columns) but the Pivot operator provides additional options like grouping and
handling weights. See page 265 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process.
Output Ports
example set output (exa) The ExampleSet produced after pivoting is the output of this port.
original (ori) The ExampleSet that was given as input is passed without any modifications to
the output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
group attribute (string) This parameter specifies the grouping attribute (i.e. the attribute
which identifies examples belonging to the groups). The resultant ExampleSet has n examples where n is the number of unique values of the group attribute.
index attribute (string) This parameter specifies the attribute whose values are used to identify the examples inside the groups. The values of this attribute are used to name the group
attributes which are created during the pivoting. Typically the values of such an attribute
capture subgroups or dates. The resultant ExampleSet has m regular attributes in addition
to the group attribute where m is the number of unique values of the index attribute.
consider weights (boolean) This parameter specifies whether example weights (if any) should
be kept and aggregated or ignored.
weight aggregation (selection) This parameter is only available when the consider weights
parameter is set to true. It specifies how example weights should be aggregated in the
groups. It has the following options: average, variance, standard_deviation, count, minimum, maximum, sum, mode, median, product.
skip constant attributes (boolean) This parameter specifies if the attributes should be skipped
if their value never changes within a group.
data management (selection) This is an expert parameter. Several options are available, and users
can choose any of them.
Related Documents
• Transpose (page 265)
Tutorial Processes
Introduction to the Pivot operator
Figure 2.56: Tutorial process ‘Introduction to the Pivot operator’.
This Example Process starts with the Subprocess operator. There is a sequence of operators
in this Subprocess operator that produces an ExampleSet that is easy to understand. A breakpoint is inserted after the Subprocess operator to show this ExampleSet. The Pivot operator
is applied on this ExampleSet. The group attribute and index attribute parameters are set to
‘group_attribute’ and ‘index_attribute’ respectively. The consider weights parameter is set to
true and the weight aggregation parameter is set to ‘sum’. The group_attribute has 5 possible
values, therefore the pivoted ExampleSet has 5 examples, i.e. one for each possible value of the
group_attribute. The index_attribute has 5 possible values, therefore the pivoted ExampleSet has
5 regular attributes (in addition to the group_attribute). Here is an explanation of values of the
first example of the pivoted ExampleSet. The remaining examples also follow the same idea.
The value of the group_attribute of the first example of the pivoted ExampleSet is ‘group0’,
therefore all values of this example will be derived from all examples of the input ExampleSet
where the group_attribute had the value ‘group0’. The ids of examples with ‘group0’ in the input ExampleSet are 12, 16, 19 and 20. In the coming explanation these examples will be called
group0 examples for simplicity.
The value of the weight_attribute attribute of the pivoted ExampleSet is 11. It is the sum of
weights of group0 examples i.e. 4 + 4 + 0 + 3 = 11. The weights were added because the weight
aggregation parameter is set to ‘sum’. The value of the value_attribute_index0 attribute of the
pivoted ExampleSet is 4. Only two examples (id 12 and 16) of the group0 examples had ‘index0’
in index_attribute. The value of the latter of these examples (id 16) is selected i.e. 4 is selected.
The value of the value_attribute_index1 attribute of the pivoted ExampleSet is 1. Only one example (id 19) of the group0 examples had ‘index1’ in index_attribute. Therefore its value (i.e. 1)
is selected. The value of the value_attribute_index2 attribute of the pivoted ExampleSet is undefined because no example of the group0 examples had ‘index2’ in index_attribute. Therefore
its value is missing in the pivoted ExampleSet. The value of the value_attribute_index3 attribute
of the pivoted ExampleSet is 3. Only one example (id 20) of the group0 examples had ‘index3’ in
index_attribute. Therefore its value (i.e. 3) is selected. The value of the value_attribute_index4
attribute of the pivoted ExampleSet is undefined because no example of the group0 examples
had ‘index4’ in index_attribute. Therefore its value is missing in the pivoted ExampleSet.
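The walkthrough above can be reproduced with a small sketch. This is an illustration in plain Python, not RapidMiner code: the function name pivot, the row layout, the assignment of the weights 4, 4, 0 and 3 to ids 12, 16, 19 and 20 in that order, and the value 7 for id 12 (which the text does not state) are all assumptions made for the example.

```python
# Hypothetical rows for the 'group0' examples: (id, group, index, value, weight).
# The value of id 12 is not given in the text; 7 is an arbitrary placeholder.
rows = [
    (12, "group0", "index0", 7, 4),
    (16, "group0", "index0", 4, 4),
    (19, "group0", "index1", 1, 0),
    (20, "group0", "index3", 3, 3),
]
index_values = ["index0", "index1", "index2", "index3", "index4"]


def pivot(rows, index_values):
    """Return {group: pivoted row}; None marks a missing value."""
    result = {}
    for _id, group, index, value, weight in rows:
        out = result.setdefault(
            group,
            {"weight_attribute": 0,
             **{f"value_attribute_{i}": None for i in index_values}},
        )
        out["weight_attribute"] += weight        # weight aggregation = sum
        out[f"value_attribute_{index}"] = value  # a later example overwrites an earlier one
    return result


pivoted = pivot(rows, index_values)
print(pivoted["group0"])
```

The printed row has weight 11, the values 4, 1 and 3 for index0, index1 and index3, and missing values for index2 and index4, matching the explanation above.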
Transpose
This operator transposes the input ExampleSet i.e. the current rows become columns of the
output ExampleSet and the current columns become rows of the output ExampleSet. This operator
works much like the well-known transpose operation for matrices.
Description
This operator transposes the input ExampleSet i.e. the current rows become the columns of
the output ExampleSet and the current columns become the rows of the output ExampleSet. In
other words every example or row becomes a column with attribute values and each attribute
column becomes an example row. This operator works much like the well-known transpose
operation for matrices. The transpose of the transpose of a matrix is the same as the original
matrix, but this rule does not hold here because the attribute types of the original ExampleSet
and of the transpose of its transpose may differ.
If an id attribute is part of the input ExampleSet, the ids will become the names of the new
attributes. The names of the old attributes will be transformed into the id values of a new id
attribute. All other new attributes will have regular role after the transformation. You can use
the Set Role operator after the transpose operator to assign roles to new attributes.
If all old attributes have the same value type, all new attributes will have the same value type.
If at least one nominal attribute is part of the input ExampleSet, the type of all new attributes
will be nominal. If the old attributes were all numerical but of mixed numerical types, the type
of all new attributes will be real. This operator produces a copy of the data in the main memory.
Therefore, it should not be used on very large data sets.
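The naming and typing rules above can be sketched as follows. This is a plain Python illustration, not RapidMiner code. The function names transposed_type and transpose, the type labels ‘nominal’, ‘integer’ and ‘real’ as strings, and the reading that a single shared type is carried over unchanged are assumptions made for the example; values are not converted in this sketch.

```python
def transposed_type(column_types):
    """Value type of the new attributes, following the rules stated above."""
    kinds = set(column_types)
    if len(kinds) == 1:      # all old attributes share one type
        return kinds.pop()
    if "nominal" in kinds:   # at least one nominal attribute
        return "nominal"
    return "real"            # mixed numerical types


def transpose(names, rows, ids=None):
    """names: old attribute names; rows: list of value lists; ids: optional id per row."""
    if ids:
        new_names = [str(i) for i in ids]                       # old ids name the new attributes
    else:
        new_names = [f"att_{k + 1}" for k in range(len(rows))]  # generic names
    # Each old attribute becomes one example; its name becomes the new id value.
    return [
        {"id": name, **{new: row[col] for new, row in zip(new_names, rows)}}
        for col, name in enumerate(names)
    ]


result = transpose(["Outlook", "Temperature"], [["sunny", 85], ["rain", 70]])
print(result)
print(transposed_type(["nominal", "integer"]))
```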
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) The transpose of the input ExampleSet is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Tutorial Processes
Different scenarios of Transpose
Figure 2.57: Tutorial process ‘Different scenarios of Transpose’.
There are four different cases in this Example Process:
Case 1: The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here
so that you can have a look at the ExampleSet before application of the Transpose operator. You
can see that the ‘Golf’ data set has no id attribute. The attributes have different types, including
nominal. Press the Run button to continue. Now the Transpose operator is
applied on the ‘Golf’ data set. A breakpoint is inserted here so that you can see the ExampleSet
after the application of the Transpose operator. Here you can see that a new attribute with id
role has been created. The values of the new id attribute are the names of the old attributes. New
attributes are named in a general format like ‘att_1’, ‘att_2’ etc. because the input ExampleSet
had no id attribute. The type of all new attributes is nominal because the input ExampleSet had
attributes of different types including at least one nominal attribute.
Case 2: The ‘Iris’ data set is loaded using the Retrieve operator. A breakpoint is inserted here
so that you can have a look at the ExampleSet before application of the Transpose operator. You
can see that the ‘Iris’ data set has an id attribute. The types of attributes are different including
attributes of nominal type. Press the Run button to continue. Now the Transpose operator is
applied on the ‘Iris’ data set. A breakpoint is inserted here so that you can see the ExampleSet
after the application of the Transpose operator. Here you can see that a new attribute with id
role has been created. The values of the new id attribute are the names of the old attributes. The
ids of the old ExampleSet become names of the new attributes. The type of all new attributes
is nominal because there were attributes with different types including at least one nominal
attribute in the input ExampleSet.
Case 3: The ‘Market-Data’ data set is loaded using the Retrieve operator. A breakpoint is inserted
here so that you can have a look at the ExampleSet before application of the Transpose
operator. You can see that the ‘Market-Data’ data set has no special attributes. The type of all
attributes is integer. Press the Run button to continue. Now the Transpose operator is applied
on the ‘Market-Data’ data set. A breakpoint is inserted here so that you can see the ExampleSet
after the application of the Transpose operator. Here you can see that a new attribute with id
role has been created. Values of the new id attribute are the names of the old attributes. New
attributes are named in a general format like ‘att_1’, ‘att_2’ etc. because the input ExampleSet
had no id attribute. The type of all new attributes is real because there were attributes with
mixed numbers type in the input ExampleSet.
Case 4: The ‘Golf-Testset’ data set is loaded using the Retrieve operator. A breakpoint is inserted
here so that you can have a look at the ExampleSet before application of the Transpose
operator. The Transpose operator is applied on the ‘Golf-Testset’ data set. Then the Transpose
operator is applied on the output of the first Transpose operator. Note that the types of the
attributes of the original ExampleSet and of the transpose of its transpose are different.
2.3.3 Joins
Append
This operator builds a merged ExampleSet from two or more compatible ExampleSets by adding all examples into a combined set.
Description
This operator builds a merged ExampleSet from two or more compatible ExampleSets by adding
all examples into a combined set. All input ExampleSets must have the same attribute signature.
This means that all ExampleSets must have the same number of attributes, and the names and
roles of the attributes must be the same in all input ExampleSets. Please note that the merged
ExampleSet is built in memory, so this operator might not be applicable for merging very large
tables from a database. In that case other preprocessing tools should be used that aggregate,
join, and merge tables into one table which is then used by RapidMiner.
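The signature requirement can be illustrated with a minimal sketch. This is plain Python, not RapidMiner code. The function name append, the dictionary layout with ‘roles’ and ‘rows’, and the placeholder rows are assumptions made for the example; only the counts (14 examples in each input) come from the attached Example Process.

```python
def append(*example_sets):
    """Merge example sets whose attribute names and roles are identical."""
    if not example_sets:
        raise ValueError("at least one ExampleSet is required")
    roles = example_sets[0]["roles"]
    for es in example_sets[1:]:
        if es["roles"] != roles:  # same attribute names with the same roles
            raise ValueError("ExampleSets have different attribute signatures")
    rows = [dict(row) for es in example_sets for row in es["rows"]]
    return {"roles": dict(roles), "rows": rows}


# Placeholder data: two attributes only, 14 identical rows per set.
roles = {"Outlook": "regular", "Play": "label"}
golf = {"roles": roles, "rows": [{"Outlook": "sunny", "Play": "no"} for _ in range(14)]}
golf_testset = {"roles": roles, "rows": [{"Outlook": "rain", "Play": "yes"} for _ in range(14)]}

merged = append(golf, golf_testset)
print(len(merged["rows"]))
```

The merged set holds every example of every input, so two inputs of 14 examples give 28.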
Input Ports
example set (exa) The Append operator can have multiple inputs. When one input port is connected,
another input port becomes available which is ready to accept another input (if
any). This input port expects an ExampleSet. It is the output of the Retrieve operator in the
attached Example Process. Output of other operators can also be used as input. Meta data must
be attached to the input data because attributes are
specified in their meta data. The Retrieve operator provides meta data along with the data.
Output Ports
merged set (mer) The merged ExampleSet is delivered through this port.
Parameters
data management (selection) This is an expert parameter. A long list is provided; users can
select any option from this list.
Tutorial Processes
Merging Golf and Golf-Testset data sets
Figure 2.58: Tutorial process ‘Merging Golf and Golf-Testset data sets’.
In this process the ‘Golf’ data set and ‘Golf-Testset’ data set are loaded using the Retrieve
operators. Breakpoints are inserted after the Retrieve operators so that you can have a look at the
input ExampleSets. When you run the process, first you see the ‘Golf’ data set. As you can see,
it has 14 examples. When you continue the process, you will see the ‘Golf-Testset’ data set. It
also has 14 examples. The Append operator is applied to merge these two ExampleSets into a
single ExampleSet. The merged ExampleSet has all examples from all input ExampleSets, thus
it has 28 examples. You can see that both input ExampleSets had the same number of attributes,
with the same names and roles. This is why the Append operator could produce a merged
ExampleSet.
Intersect
This operator returns those examples of the first ExampleSet
(given at the example set input port) whose IDs are contained within
the other ExampleSet (given at the second port). Both ExampleSets
must have an ID attribute, and the ID attributes
of both ExampleSets must be of the same type.
Description
This operator performs a set intersection on two ExampleSets on the basis of the ID attribute
i.e. the resulting ExampleSet contains all the examples of the first ExampleSet (given at the
example set input port) whose IDs appear in the second ExampleSet (given at the second port).
It is important to note that the ExampleSets do not need to have the same number of columns
or the same data types. The operation only depends on the ID attributes of the ExampleSets. It
should be made sure that the ID attributes of both ExampleSets are of the same type i.e. either
both are nominal or both are numerical.
Differentiation
• Set Minus The Set Minus and Intersect operators can be considered complements of each
other. The Set Minus operator performs a set minus on two ExampleSets on the basis of the
ID attribute i.e. the resulting ExampleSet contains all the examples of the first ExampleSet
whose IDs do NOT appear in the second ExampleSet. See page 276 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Generate ID operator in the attached Example Process because this operator only works if the
ExampleSets have the ID attribute.
second (sec) This input port expects an ExampleSet. It is the output of the Generate ID operator in the attached Example Process because this operator only works if the ExampleSets
have the ID attribute.
Output Ports
example set output (exa) The ExampleSet with remaining examples (i.e. examples remaining after the set intersection) of the first ExampleSet is output of this port.
original (ori) The ExampleSet that was given as input (at example set input port) is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
Related Documents
• Set Minus (page 276)
Tutorial Processes
Intersection of two ExampleSets
Figure 2.59: Tutorial process ‘Intersection of two ExampleSets’.
The ‘Golf’ data set is loaded using the Retrieve operator. The Generate ID operator is applied
on it with the offset parameter set to 0. Thus the ids of the ‘Golf’ data set are from 1 to 14. A
breakpoint is inserted here so you can have a look at the ‘Golf’ data set. The ‘Polynomial’ data set
is loaded using the Retrieve operator. The Generate ID operator is applied on it with the offset
parameter set to 10. Thus the ids of the ‘Polynomial’ data set are from 11 to 210. A breakpoint
is inserted here so you can have a look at the ‘Polynomial’ data set.
The Intersect operator is applied next. The ‘Golf’ data set is provided at the example set input
port and the ‘Polynomial’ data set is provided at the second port. The order of ExampleSets is
very important. The Intersect operator compares the ids of the ‘Golf’ data set with the ids of
the ‘Polynomial’ data set and then returns only those examples of the ‘Golf’ data set whose id is
present in the ‘Polynomial’ data set. The ‘Golf’ data set ids are from 1 to 14 and the ‘Polynomial’
data set ids are from 11 to 210. Thus ‘Golf’ data set examples with ids 11 to 14 are returned by
the Intersect operator. It is important to note that the meta data of both ExampleSets is very
different but it does not matter because the Intersect operator only depends on the ID attribute.
If the ExampleSets are switched at the input ports of the Intersect operator the results will be
very different. In this case the Intersect operator returns only those examples of the ‘Polynomial’
data set whose id is present in the ‘Golf’ data set. The ‘Golf’ data set ids are from 1 to 14 and the
‘Polynomial’ data set ids are from 11 to 210. Thus the ‘Polynomial’ data set examples with ids
11 to 14 are returned by the Intersect operator.
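The effect of the port order can be checked with a short sketch. This is plain Python, not RapidMiner code. The function name intersect and the ‘source’ field are assumptions made for the example; only the id ranges are taken from the process above.

```python
# Only the id ranges come from the tutorial; all other attributes are omitted.
golf = [{"id": i, "source": "Golf"} for i in range(1, 15)]                # ids 1..14
polynomial = [{"id": i, "source": "Polynomial"} for i in range(11, 211)]  # ids 11..210


def intersect(first, second):
    """Examples of `first` whose id also appears in `second`."""
    second_ids = {ex["id"] for ex in second}
    return [ex for ex in first if ex["id"] in second_ids]


golf_side = intersect(golf, polynomial)        # 'Golf' at the example set input port
polynomial_side = intersect(polynomial, golf)  # inputs switched

print([ex["id"] for ex in golf_side], golf_side[0]["source"])
print([ex["id"] for ex in polynomial_side], polynomial_side[0]["source"])
```

Both calls return ids 11 to 14, but the examples come from different data sets, which is why the order of the inputs matters.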
Join
This operator joins two ExampleSets using specified key attribute(s) of the two ExampleSets.
Description
The Join operator joins two ExampleSets using one or more attributes of the input ExampleSets
as key attributes. Identical values of the key attributes indicate matching examples. An attribute
with id role is selected as key by default but an arbitrary set of one or more attributes can be
chosen as key. Four types of joins are possible: inner, left, right and outer join. All these types
of joins are explained in the parameters section.
Input Ports
left (lef) The left input port expects an ExampleSet. This ExampleSet will be used as the left
ExampleSet for the join.
right (rig) The right input port expects an ExampleSet. This ExampleSet will be used as the
right ExampleSet for the join.
Output Ports
join (joi) The join of the left and right ExampleSets is delivered through this port.
Parameters
remove double attributes (boolean) This parameter indicates if double attributes should
be removed or renamed. Double attributes are those attributes that are present in both
ExampleSets. If this parameter is checked, only the left ExampleSet's copy of each double
attribute is kept and the right ExampleSet's copy is discarded. The key attributes will always
be taken from the left ExampleSet. Please note that this check for double attributes is applied
only to regular attributes. Special attributes of the right ExampleSet which do not exist in the
left ExampleSet will simply be added. If they already exist they are simply skipped.
join type (selection) This parameter specifies which Join should be performed. You can easily
understand these joins by studying the Example Process. Four types of joins are supported:
• inner The resultant ExampleSet will contain only those examples where the key attributes of both input ExampleSets match i.e. have the same value.
• left This is also called left outer join. The resultant ExampleSet will contain all records
from the left ExampleSet. If no matching records were found in the right ExampleSet,
then its fields will contain the null value i.e. ‘?’ will be stored there. The left join will
always contain the results of the inner join; however it can contain some examples
that have no matching examples in the right ExampleSet.
• right This is also called right outer join. The resultant ExampleSet will contain all
records from the right ExampleSet. If no matching records were found in the left ExampleSet, then its fields will contain the null values i.e. ‘?’ will be stored there. The
right join will always contain the results of the inner join; however it can contain some
examples that have no matching examples in the left ExampleSet.
• outer This is also called full outer join. This type of join combines the results of the
left and the right join. All examples from both ExampleSets will be part of the resultant ExampleSet, whether the matching key attribute value exists in the other ExampleSet or not. If no matching key attribute value was found then the corresponding
resultant fields will have a null value. The outer join will always contain the results
of the inner join; however it can contain some examples that have no matching examples in the other ExampleSet.
use id attribute as key (boolean) This parameter indicates if the attribute with the id role
should be used as the key attribute. This option is checked by default. If unchecked, then
you have to specify the key attributes for both left and right ExampleSets. Identical values
of the key attributes indicate matching examples.
key attributes This parameter specifies attribute(s) which are used as the key attributes. Identical values of the key attributes indicate matching examples. For each key attribute from the
left ExampleSet a corresponding one from the right ExampleSet has to be chosen. Choosing appropriate key attributes is critical for obtaining the desired results. This parameter
is available only when the use id attribute as key parameter is unchecked.
keep both join attributes If checked, both columns of a join pair will be kept. Usually this is
unnecessary since both attributes are identical. It may be useful to keep such a column
if there are missing values on one side.
Tutorial Processes
Different types of join
The last operator of this process is the Join operator. The sequence of operators leading to the
left input port of the Join operator is used to generate the left ExampleSet. Similarly, the
sequence of operators leading to the right input port of the Join operator is used to generate the
right ExampleSet. The two sequences of operators are very similar.
In both cases the Retrieve operator is used to load the ‘Golf’ data set. Then the Generate Attribute operator is applied on it to generate a dummy attribute. All attributes of the ‘Golf’ data
set other than the ‘Play’ attribute and the newly generated attribute are discarded because the
keep all parameter is unchecked. Then the Generate ID operator is applied to generate an attribute with the id role. This attribute will later be used as the key attribute for joining.
One difference is that for the left ExampleSet, the name of the attribute generated by the
Generate Attribute operator is ‘Golf 1 attribute’ and for the right ExampleSet the name of this
attribute is ‘Golf 2 attribute’. The other difference is in the value of the offset parameter
of the Generate ID operator. For the left ExampleSet the offset parameter of the Generate ID
operator is set to 0 and for the right ExampleSet it is set to 7. Thus the left ExampleSet has ids
from 1 to 14 and the right ExampleSet has ids from 8 to 21. Breakpoints are inserted after
the Generate ID operators so that you can have a look at the left and right ExampleSets before
application of the Join operator.
Figure 2.60: Tutorial process ‘Different types of join’.
The use id attribute as key parameter of the Join operator is set to true. Thus attributes with id
role will be used to join the left and right ExampleSets. The remove double attributes parameter
is also checked. Thus regular attributes common to both input ExampleSets would appear just
once in the resultant ExampleSet. Only the ‘Play’ and ‘id’ attributes are common to both
ExampleSets, but since they are not regular attributes the remove double attributes parameter
has no effect on them. As mentioned earlier the key attributes will always be taken from the
left ExampleSet. Please note that this check for double attributes is applied only to regular
attributes. Special attributes of the right ExampleSet which do not exist in the left ExampleSet
will simply be added. If they already exist they are simply skipped.
In this example process the join type is set as inner join. You can change it to other values
and run the process again. Here is an explanation of results that are obtained on each value of
the join type parameter.
If inner join is selected as join type the resultant ExampleSet has examples with ids from 8 to
14. This is because the inner join delivers only those examples where the key attribute of both
input ExampleSets have the same values. In this example process, the left ExampleSet has ids
from 1 to 14 and the right ExampleSet has ids from 8 to 21. Thus examples with ids from 8 to 14
have equal value of the key attribute (i.e. the id attribute).
If left join is selected as join type the resultant ExampleSet has examples with ids from 1 to
14. This is because the left join delivers all examples of the left ExampleSet with corresponding
values of the right ExampleSet. If there is no match present in the right ExampleSet, the null
value is placed at its place. This is why you can see that the ‘Golf 2 attribute’ has null values for
ids 1 to 7.
If right join is selected as join type the resultant ExampleSet has examples with ids from 8 to 21.
This is because the right join delivers all examples of the right ExampleSet with corresponding
values of the left ExampleSet. If there is no match present in the left ExampleSet, a null value
is placed at its place. This is why you can see that the ‘Golf 1 attribute’ has null values for ids 15
to 21.
If outer join is selected as join type the resultant ExampleSet has examples with ids from 1
to 21. This is because the outer join combines the results of the left and right join. All examples
from both ExampleSets will be part of the resultant ExampleSet, whether the matching key
attribute value exists in the other ExampleSet or not. If no matching key attribute value was
found then the corresponding resultant fields will have a null value. In this example process
the left ExampleSet has ids from 1 to 14 and the right ExampleSet has ids from 8 to 21. Thus
examples with ids from 1 to 21 are part of the resultant ExampleSet. The ‘Golf 2 attribute’ has
null values in examples with ids from 1 to 7. Similarly, the ‘Golf 1 attribute’ has null values in
examples with ids from 15 to 21. There are no null values in examples with ids 8 to 14. The ‘Play’
attribute has null values in examples with id from 15 to 21. This is because special attributes
are taken from the left ExampleSet which in this example process has no values of the ‘Play’
attribute corresponding to ids 15 to 21.
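The four results described above can be reproduced with a small sketch. This is plain Python, not RapidMiner code. The function name join, the dictionary layout keyed by id, and the placeholder attribute values are assumptions made for the example; the ‘Play’ attribute is left out to keep the sketch short. Only the id ranges and attribute names come from the process.

```python
# Left ids 1..14 and right ids 8..21, as in the tutorial; values are placeholders.
left = {i: {"Golf 1 attribute": f"L{i}"} for i in range(1, 15)}
right = {i: {"Golf 2 attribute": f"R{i}"} for i in range(8, 22)}


def join(left, right, join_type="inner"):
    """Join two id-keyed tables; None stands for the null value ('?')."""
    if join_type == "inner":
        keys = [k for k in left if k in right]
    elif join_type == "left":
        keys = list(left)
    elif join_type == "right":
        keys = list(right)
    elif join_type == "outer":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unknown join type: {join_type}")
    return [
        {
            "id": k,
            "Golf 1 attribute": left.get(k, {}).get("Golf 1 attribute"),
            "Golf 2 attribute": right.get(k, {}).get("Golf 2 attribute"),
        }
        for k in keys
    ]


for join_type in ("inner", "left", "right", "outer"):
    ids = [row["id"] for row in join(left, right, join_type)]
    print(join_type, ids[0], "to", ids[-1])
```

Running the loop prints the id ranges 8 to 14, 1 to 14, 8 to 21 and 1 to 21 for the inner, left, right and outer join respectively.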
Set Minus
This operator returns those examples of the ExampleSet (given at
the example set input port) whose IDs are not contained within the
other ExampleSet (given at the subtrahend port). Both ExampleSets
must have an ID attribute, and the ID attributes of both ExampleSets must be of the same type.
Description
This operator performs a set minus on two ExampleSets on the basis of the ID attribute i.e. the
resulting ExampleSet contains all the examples of the minuend ExampleSet (given at the example
set input port) whose IDs do not appear in the subtrahend ExampleSet (given at the subtrahend
port). It is important to note that the ExampleSets do not need to have the same number of
columns or the same data types. The operation only depends on the ID attributes of the ExampleSets. It should be made sure that the ID attributes of both ExampleSets are of the same type
i.e. either both are nominal or both are numerical.
Differentiation
• Intersect The Set Minus and Intersect operators can be considered complements of each
other. The Intersect operator performs a set intersection on two ExampleSets on the basis of
the ID attribute i.e. the resulting ExampleSet contains all the examples of the first ExampleSet whose IDs appear in the second ExampleSet. See page 270 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Generate ID operator in the attached Example Process because this operator only works if the
ExampleSets have the ID attribute.
subtrahend (sub) This input port expects an ExampleSet. It is the output of the Generate ID
operator in the attached Example Process because this operator only works if the ExampleSets have the ID attribute.
Output Ports
example set output (exa) The ExampleSet with remaining examples (i.e. examples remaining after the set minus) of the minuend ExampleSet is output of this port.
original (ori) The ExampleSet that was given as input (at example set input port) is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
Related Documents
• Intersect (page 270)
Tutorial Processes
Introduction to the Set Minus operator
Figure 2.61: Tutorial process ‘Introduction to the Set Minus operator’.
The ‘Golf’ data set is loaded using the Retrieve operator. The Generate ID operator is applied
on it with the offset parameter set to 0. Thus the ids of the ‘Golf’ data set are from 1 to 14. A
breakpoint is inserted here so you can have a look at the ‘Golf’ data set. The ‘Polynomial’ data set
is loaded using the Retrieve operator. The Generate ID operator is applied on it with the offset
parameter set to 10. Thus the ids of the ‘Polynomial’ data set are from 11 to 210. A breakpoint
is inserted here so you can have a look at the ‘Polynomial’ data set.
The Set Minus operator is applied next. The ‘Golf’ data set is provided at the example set input
port and the ‘Polynomial’ data set is provided at the subtrahend port. The order of ExampleSets
is very important. The Set Minus operator compares the ids of the ‘Golf’ data set with the ids
of the ‘Polynomial’ data set and then returns only those examples of the ‘Golf’ data set whose
id is not present in the ‘Polynomial’ data set. The ‘Golf’ data set ids are from 1 to 14 and the
‘Polynomial’ data set ids are from 11 to 210. Thus ‘Golf’ data set examples with ids 1 to 10 are
returned by the Set Minus operator. It is important to note that the meta data of both ExampleSets is very different but it does not matter because the Set Minus operator only depends on the
ID attribute.
If the ExampleSets are switched at the input ports of the Set Minus operator the results will be
very different. In this case the Set Minus operator returns only those examples of the ‘Polynomial’ data set whose id is not present in the ‘Golf’ data set. The ‘Golf’ data set ids are from 1 to
14 and the ‘Polynomial’ data set ids are from 11 to 210. Thus the ‘Polynomial’ data set examples
with ids 15 to 210 are returned by the Set Minus operator.
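The two results above follow directly from the id ranges, as this short sketch shows. It is plain Python, not RapidMiner code; the function name set_minus and the bare lists of ids are assumptions made for the example.

```python
golf_ids = list(range(1, 15))          # ids 1..14
polynomial_ids = list(range(11, 211))  # ids 11..210


def set_minus(minuend_ids, subtrahend_ids):
    """Ids of the minuend that do not appear in the subtrahend."""
    subtrahend = set(subtrahend_ids)
    return [i for i in minuend_ids if i not in subtrahend]


golf_minus_polynomial = set_minus(golf_ids, polynomial_ids)
polynomial_minus_golf = set_minus(polynomial_ids, golf_ids)

print(golf_minus_polynomial[0], "to", golf_minus_polynomial[-1])
print(polynomial_minus_golf[0], "to", polynomial_minus_golf[-1])
```

The first call yields ids 1 to 10 and the second yields ids 15 to 210, so switching the inputs changes both which ids remain and which data set they come from.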
Superset
This operator takes two ExampleSets as input and adds new features of the first ExampleSet to the second ExampleSet and vice
versa to generate two supersets. The resultant supersets have the
same set of attributes but the examples may be different.
Description
The Superset operator generates supersets of the given ExampleSets by adding new features
of one ExampleSet to the other ExampleSet. The values of the new features are set to missing
values in the supersets. This operator delivers two supersets as output:
1. The first has all attributes and examples of the first ExampleSet + all attributes of the second ExampleSet (with missing values)
2. The second has all attributes and examples of the second ExampleSet + all attributes of
the first ExampleSet (with missing values)
Thus both supersets have the same set of regular attributes but the examples may be different.
It is important to note that the supersets can have only one special attribute of a kind. By default
this operator adds only new ‘regular’ attributes to the other ExampleSet for generating supersets. For example, if both input ExampleSets have a label attribute then the first superset will
have all attributes of the first ExampleSet (including label) + all regular attributes of the second
ExampleSet. The second superset will behave correspondingly. The include special attributes
parameter can be used for changing this behavior. But it should be used carefully because even
if this parameter is set to true, the resultant supersets can have only one special attribute of a
kind. Please study the attached Example Process for better understanding.
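The default behavior (only new regular attributes are added) can be sketched as follows. This is plain Python, not RapidMiner code. The function name superset, the dictionary layout, the attribute names and the single placeholder row per set are assumptions made for the example, and the include special attributes parameter is not modeled.

```python
def superset(first, second):
    """Return both supersets; attributes added from the other set are filled with None."""
    def extend(base, other):
        own = set(base["regular"]) | set(base["special"].values())
        new = [a for a in other["regular"] if a not in own]  # new regular attributes only
        rows = [{**row, **{a: None for a in new}} for row in base["rows"]]
        return {"regular": base["regular"] + new,
                "special": dict(base["special"]),
                "rows": rows}
    return extend(first, second), extend(second, first)


# Placeholder data sets with one example each; names are assumed for illustration.
golf = {
    "regular": ["Outlook", "Temperature", "Humidity", "Wind"],
    "special": {"label": "Play"},
    "rows": [{"Outlook": "sunny", "Temperature": 85, "Humidity": 85,
              "Wind": "false", "Play": "no"}],
}
iris = {
    "regular": ["a1", "a2", "a3", "a4"],
    "special": {"id": "id", "label": "label"},
    "rows": [{"a1": 5.1, "a2": 3.5, "a3": 1.4, "a4": 0.2,
              "id": "id_1", "label": "Iris-setosa"}],
}

sup1, sup2 = superset(golf, iris)
print(sorted(sup1["regular"]) == sorted(sup2["regular"]))
print(sup1["rows"][0]["a1"], sup2["rows"][0]["Outlook"])
```

Both supersets end up with the same eight regular attributes, each keeps its own special attributes and examples, and the attributes taken from the other set hold missing values.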
Input Ports
example set 1 (exa) This input port expects an ExampleSet. It is the output of the Retrieve
operator in the attached Example Process. The output of other operators can also be used
as input. It is essential that meta data should be attached with the data for the input because attributes are specified in their meta data.
example set 2 (exa) This input port expects an ExampleSet. It is the output of the Retrieve
operator in the attached Example Process. The output of other operators can also be used
as input. It is essential that meta data should be attached with the data for the input because attributes are specified in their meta data.
Output Ports
superset 1 (sup) The first superset of the input ExampleSets is delivered through this port.
superset 2 (sup) The second superset of the input ExampleSets is delivered through this port.
Parameters
include special attributes (boolean) This parameter indicates if the special attributes should
be included for generation of the supersets. This parameter should be used carefully, especially
if both ExampleSets have the same special attributes, because the resultant supersets
can have only one special attribute of a kind.
Tutorial Processes
Generating supersets of the Golf and Iris data sets
Figure 2.62: Tutorial process ‘Generating supersets of the Golf and Iris data sets’.
In this process the ‘Golf’ and ‘Iris’ data sets are loaded using the Retrieve operators. Breakpoints are inserted after the Retrieve operators so that you can have a look at the input ExampleSets. When you run the process, first you see the ‘Golf’ data set. It has four regular attributes and one
special attribute, with 14 examples. When you continue the process, you will see the ‘Iris’
data set. It has four regular and two special attributes, with 150 examples. Note that the
meta data of the two ExampleSets is very different. The Superset operator is applied to generate supersets of these two ExampleSets. The resultant supersets can be seen in the Results
Workspace. One superset has all attributes and examples of the ‘Iris’ data set
plus the 4 regular attributes of the ‘Golf’ data set (with missing values). The other superset has all
attributes and examples of the ‘Golf’ data set plus the 4 regular attributes of the ‘Iris’ data set (with
missing values).
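The behaviour described in this tutorial can be approximated outside RapidMiner. The following is a minimal Python sketch, not RapidMiner code: it models an ExampleSet as a list of dictionaries, ignores attribute roles entirely, and uses made-up single-row tables. The function name superset and the Iris attribute names a1 to a4 are assumptions chosen for illustration.

```python
def superset(rows, other_columns):
    """Return copies of rows extended with every column in other_columns.

    Columns that the rows do not already have are filled with None,
    which stands in for a missing value.
    """
    own = list(rows[0].keys()) if rows else []
    extra = [name for name in other_columns if name not in own]
    return [{**row, **{name: None for name in extra}} for row in rows]


# Illustrative rows only; these are not taken from the repository data.
golf = [{"Outlook": "sunny", "Temperature": 85, "Humidity": 85, "Wind": "false"}]
iris = [{"a1": 5.1, "a2": 3.5, "a3": 1.4, "a4": 0.2}]

golf_superset = superset(golf, iris[0].keys())  # Golf rows plus empty Iris columns
iris_superset = superset(iris, golf[0].keys())  # Iris rows plus empty Golf columns
```

Each result keeps the examples of one input and only gains the attribute names of the other, which is why the added columns contain nothing but missing values.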
Union
This operator builds the union of the input ExampleSets. The input ExampleSets are combined in such a way that attributes and
examples of both input ExampleSets are part of the resultant union
ExampleSet.
Description
The Union operator builds the superset of features of both input ExampleSets such that all regular attributes of both ExampleSets are part of the superset. Attributes that are common to
both ExampleSets are not repeated in the superset; a single attribute is created that holds
the data of both ExampleSets. If the special attributes of both input ExampleSets are compatible
with each other, only one special attribute is created in the superset, and it holds the examples of
both input ExampleSets. If the special attributes of the ExampleSets are not compatible, the special attributes of the first ExampleSet are kept. If both ExampleSets have attributes with the
same name, these attributes should be compatible with each other; otherwise you will get an error message.
This can be understood by studying the attached Example Process.
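As a rough illustration of the rules above, the following Python sketch models an ExampleSet as a list of dictionaries. It is an approximation under stated assumptions, not the operator's implementation: the names union and column_type are hypothetical, "compatible" is reduced to "same Python type", and the special-attribute rule (keeping the roles of the first ExampleSet) is not modelled.

```python
def column_type(rows, name):
    """Return the Python type of the first non-missing value in a column."""
    for row in rows:
        if row.get(name) is not None:
            return type(row[name])
    return None


def union(first, second):
    """Append the rows of both tables over the combined set of columns."""
    shared = set().union(*first) & set().union(*second)
    for name in shared:
        t1, t2 = column_type(first, name), column_type(second, name)
        # Assumption: same-named columns are compatible when their types match.
        if t1 and t2 and t1 is not t2:
            raise TypeError(f"attribute {name!r} is not compatible")
    # Preserve first-seen column order; a shared column appears only once.
    columns = list(dict.fromkeys(n for row in first + second for n in row))
    return [{n: row.get(n) for n in columns} for row in first + second]


# Illustrative rows only.
golf = [{"Outlook": "sunny", "Play": "no"}]
iris = [{"a1": 5.1, "label": "Iris-setosa"}]
combined = union(golf, iris)
```

Rows from one input receive None for columns that exist only in the other input, mirroring the missing values seen in the union ExampleSet.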
Input Ports
example set 1 (exa) This input port expects an ExampleSet. It is the output of the Retrieve
operator in the attached Example Process. The output of other operators can also be used
as input. Meta data must be attached to the input data because the attributes are specified in the meta data.
example set 2 (exa) This input port expects an ExampleSet. It is the output of the Retrieve
operator in the attached Example Process. The output of other operators can also be used
as input. Meta data must be attached to the input data because the attributes are specified in the meta data.
Output Ports
union (uni) The union of the input ExampleSets is delivered through this port.
Tutorial Processes
Union of the Golf and Golf-Testset data sets
In this process the ‘Golf’ data set and ‘Golf-Testset’ data set are loaded using the Retrieve operators. Breakpoints are inserted after the Retrieve operators so that you can have a look at the
input ExampleSets. When you run the process, first you see the ‘Golf’ data set. As you can see, it
has 14 examples. When you continue the process, you will see the ‘Golf-Testset’ data set. It also
has 14 examples. Note that the meta data of both ExampleSets is almost the same. The Union
operator is applied to combine these two ExampleSets into a single ExampleSet. The combined
ExampleSet has all attributes and examples from the input ExampleSets, thus it has 28 examples. You can see that both input ExampleSets had the same number of attributes, same names
and roles of attributes. This is why the Union ExampleSet also has the same number of attributes
with the same names and roles. Here the Union operator behaves like the Append operator, i.e.
it simply combines examples of two ExampleSets with compatible meta data.
Figure 2.63: Tutorial process ‘Union of the Golf and Golf-Testset data sets’.
Union of the Golf and Iris data sets
In this process the ‘Golf’ data set and the ‘Iris’ data set are loaded using the Retrieve operators.
Breakpoints are inserted after the Retrieve operators so that you can have a look at the input
ExampleSets. When you run the process, first you see the ‘Golf’ data set. As you can see, it has
14 examples. When you continue the process, you will see the ‘Iris’ data set. It has 4 regular
and 2 special attributes with 150 examples. Note that the meta data of both ExampleSets is very
different. The Union operator is applied to combine these two ExampleSets into a single ExampleSet. The combined ExampleSet has all attributes and examples from the input ExampleSets,
thus it has 164 (14+150) examples. Note that the ‘Golf’ data set has an attribute with label role:
the ‘Play’ attribute. The ‘Iris’ data set also has an attribute with label role: the ‘label’ attribute.
As these two label attributes are not compatible, only the label attribute of the first ExampleSet
is kept. The examples of ‘Iris’ data set have null values in this attribute of the resultant Union
ExampleSet.
Figure 2.64: Tutorial process ‘Union of the Golf and Iris data sets’.
Union of the Golf(with id attribute) and Iris data sets
In this process the ‘Golf’ data set and ‘Iris’ data set are loaded using the Retrieve operators. The
Generate ID operator is applied on the Golf data set to generate nominal ids starting from id_1.
Breakpoints are inserted before the Union operator so that you can have a look at the input ExampleSets. When you run the process, first you see the ‘Golf’ data set. As you can see, it has 14
examples. It has two special attributes. When you continue the process, you will see the ‘Iris’
data set. It has 4 regular and 2 special attributes with 150 examples. Note that the meta data
of both ExampleSets is very different. The Union operator is applied to combine these two ExampleSets into a single ExampleSet. The combined ExampleSet has all attributes and examples
from the input ExampleSets, thus it has 164 (14+150) examples. Note that the ‘Golf’ data set
has an attribute with label role: the ‘Play’ attribute. The ‘Iris’ data set also has an attribute with
label role: the ‘label’ attribute. As these two label attributes are not compatible, only the label attribute of the first ExampleSet is kept. The examples of the ‘Iris’ data set have null values
in this attribute of the union ExampleSet. Also note that both input ExampleSets have id attributes. The names of these attributes are the same and they both have nominal values, thus
these two attributes are compatible with each other. Thus a single id attribute is created in the
resultant Union ExampleSet. Also note that the values of ids are not unique in the resultant
ExampleSet.
Figure 2.65: Tutorial process ‘Union of the Golf(with id attribute) and Iris data sets’.
2.4 Values
Adjust Date
This operator adjusts the date in the specified attribute by adding
or subtracting the specified amount of time.
Description
The Adjust Date operator adjusts the values of the specified date attribute by adding or subtracting constant values. Year, month, day, hour, minute, second and millisecond adjustments are
allowed. Multiple adjustments can be made to a single attribute. For example, you can add a
month and subtract 2 hours from an attribute. If the keep old attribute parameter is set to true,
the old attribute will be kept along with the adjusted attribute. Otherwise, the adjusted attribute
will replace the previous attribute.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input. The ExampleSet should have at least one date/time attribute because if
there is no such attribute, the use of this operator does not make sense.
Output Ports
example set output (exa) The values of the selected date attribute are adjusted and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute name (string) This parameter specifies the name of the date attribute which should
be adjusted.
adjustments (list) This parameter defines the list of all date adjustments. Multiple adjustments can be made to a single attribute. For example, you can add a month and subtract
2 hours from the selected attribute.
keep old attribute (boolean) This parameter indicates if the original date attribute should
be kept. If this parameter is set to true, the old attribute will be kept along with the adjusted
attribute. Otherwise, the adjusted attribute will replace the previous attribute.
Tutorial Processes
Making multiple adjustments in a date attribute
Figure 2.66: Tutorial process ‘Making multiple adjustments in a date attribute’.
This Example Process starts with the Subprocess operator. The operator chain inside the Subprocess operator generates an ExampleSet for this process. The explanation of this inner chain
of operators is not relevant here. A breakpoint is inserted here so that you can have a look at
the ExampleSet. You can see that this ExampleSet has a date attribute named ‘deadline_date’.
The Adjust Date operator is applied on this ExampleSet to adjust this date attribute. Two adjustments are made to this attribute. 1) 5 days are added 2) 2 months are subtracted. Run the
process and compare the resultant ExampleSet with the original ExampleSet. You can clearly
see that the date values have been adjusted. For example, the date value 20-August has been
changed to 25-June after addition of 5 days and subtraction of two months.
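The arithmetic of this tutorial can be checked with a small Python sketch. It is only an approximation of the operator: the helper names add_months and adjust_date are hypothetical, the year 2017 is an arbitrary choice because the text gives only day and month, and clamping the day to the end of a shorter month is an assumption about how month adjustments should behave.

```python
import calendar
from datetime import datetime, timedelta


def add_months(value, months):
    """Shift value by a number of months, clamping the day to the month end."""
    index = value.year * 12 + (value.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return value.replace(year=year, month=month, day=min(value.day, last_day))


def adjust_date(value, adjustments):
    """Apply a list of (amount, unit) adjustments in the given order."""
    for amount, unit in adjustments:
        if unit == "year":
            value = add_months(value, 12 * amount)
        elif unit == "month":
            value = add_months(value, amount)
        else:
            # day, hour, minute, second and millisecond map onto timedelta.
            value = value + timedelta(**{unit + "s": amount})
    return value


deadline = datetime(2017, 8, 20)  # the year is assumed for illustration
adjusted = adjust_date(deadline, [(5, "day"), (-2, "month")])
print(adjusted.strftime("%d-%B"))  # 25-June
```

Negative amounts express subtraction, so adding a month and subtracting two hours would be written as [(1, "month"), (-2, "hour")].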
Cut
This operator cuts the nominal values of the specified regular attributes. The resultant attributes have values that are substrings
of the original attribute values.
Description
The Cut operator creates new attributes from nominal attributes where the new attributes contain only substrings of the original values. The range of characters to be cut is specified by the
first character index and last character index parameters. The first character index parameter specifies the index of the first character and the last character index parameter specifies the index of
the last character to be included. All characters of the attribute values that are at index equal
to or greater than the first character index and less than or equal to the last character index are
included in the resulting substring. Please note that the counting starts with 1 and that the
first and the last character will be included in the resulting substring. For example, if the value
is “RapidMiner” and the first index is set to 6 and the last index is set to 9 the result will be
“Mine”. If the last index is larger than the length of the word, the resulting substrings will end
with the last character.
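The index rules can be illustrated with a short Python sketch. The function name cut is hypothetical and the code only mirrors the description above: indices are 1-based, both ends are included, and a last index beyond the end of the value is tolerated.

```python
def cut(value, first_index, last_index):
    """Return the substring from first_index to last_index (1-based, inclusive)."""
    if first_index < 1:
        raise ValueError("counting starts with 1")
    # Python slices are 0-based and end-exclusive, hence the shift by one
    # on the left only. Slicing past the end simply stops at the last character.
    return value[first_index - 1:last_index]


print(cut("RapidMiner", 6, 9))    # Mine
print(cut("Iris-setosa", 6, 20))  # setosa
```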
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process.
Output Ports
example set output (exa) The ExampleSet with new attributes that have values that are substrings of the original attributes is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the right list,
which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners.
It is best to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and also allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type, i.e. the block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) Special attributes are attributes with special roles
which identify the examples. In contrast, regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attributes operator.
If this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attributes operator and only those attributes that satisfy the
conditions are selected.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
first character index (integer) This parameter specifies the index of the first character of the
substring which should be kept. Please note that the counting starts with 1.
last character index (integer) This parameter specifies the index of the last character of the
substring which should be kept. Please note that the counting starts with 1.
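The rules given for the numeric condition parameter in the list above can be sketched in Python. This is a hedged illustration, not the parser RapidMiner uses: the names parse_condition and select_numeric are hypothetical, only the comparison symbols mentioned in the text are supported, missing values are not handled, and the column values are made up.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le, "=": operator.eq}


def parse_condition(text):
    """Turn a condition such as '> 6 && < 11' into a test function."""
    if "&&" in text and "||" in text:
        raise ValueError("&& and || cannot be used together")
    combine, separator = (any, "||") if "||" in text else (all, "&&")
    tests = []
    for part in text.split(separator):
        # Requires a blank between symbol and number, so '<5' is rejected.
        symbol, number = part.split()
        if symbol not in OPS:
            raise ValueError(f"unsupported comparison {symbol!r}")
        tests.append((OPS[symbol], float(number)))
    return lambda x: combine(op(x, bound) for op, bound in tests)


def select_numeric(columns, condition):
    """Keep nominal columns and numeric columns whose values all pass."""
    test = parse_condition(condition)
    selected = []
    for name, values in columns.items():
        numeric = all(isinstance(v, (int, float)) for v in values)
        if not numeric or all(test(v) for v in values):
            selected.append(name)
    return selected


# Illustrative values only.
columns = {"Temperature": [85, 80, 83], "Humidity": [85, 90, 78],
           "Outlook": ["sunny", "sunny", "overcast"]}
print(select_numeric(columns, "> 79"))  # ['Temperature', 'Outlook']
```

Humidity is dropped because one of its values fails the test, while the nominal Outlook column is kept regardless of the condition.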
Tutorial Processes
Applying the Cut operator on label of the Iris data set
The ‘Iris’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you
can view the data set before application of the Cut operator. You can see that the label attribute
has three possible values: ‘Iris-setosa’, ‘Iris-versicolor’ and ‘Iris-virginica’. If we want to remove
the ‘Iris-’ substring from the start of all the label values we can use the Cut operator. The Cut
operator is applied on the Iris data set. The first character index parameter is set to 6 because we
want to remove the first 5 characters (‘Iris-’). The last character index parameter can be set to any
value greater than the length of longest possible value. Thus the last character index parameter
can be safely set to 20 because if the last index is larger than the length of the word, the resulting
substrings will end with the last character. Run the process and you can see that the substring
‘Iris-’ has been removed from the start of all possible values of the label attribute.
Figure 2.67: Tutorial process ‘Applying the Cut operator on label of the Iris data set’.
Map
This operator maps specified values of selected attributes to new
values. This operator can be applied on both numerical and nominal attributes.
Description
This operator can be used to replace nominal values (e.g. replace the value ‘green’ by the value
‘green_color’) as well as numerical values (e.g. replace all values ‘3’ by ‘-1’). However, a single application of this
operator can perform mappings for attributes of only one type. A single mapping can be specified
using the parameters replace what and replace by, as in the Replace operator. Multiple mappings can
be specified through the value mappings parameter. Additionally, the operator allows defining
a default mapping. You can select the attributes in which the mappings are made, and
you can specify the old values as regular expressions. Attribute values of the selected attributes that
match such a regular expression are mapped by the specified value mapping. Please go through
the parameters and the Example Process to develop a better understanding of this operator.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used
as input. Meta data must be attached to the input data because the attributes are specified in the meta data. The Retrieve operator provides meta data
along with the data.
Output Ports
example set (exa) The ExampleSet with value mappings is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes on which you want to apply mappings. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if the meta data is not known. When this option is selected another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for the
attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. The user should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected
some other parameters (value type, use value type exception) become visible in the
Parameters panel.
• block_type This option is similar in working to the value_type option. This option
allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the
Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are not selected.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes.
regular expression (string) Attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners. It
is best to specify the regular expression through the edit and preview regular expression
menu. This menu gives a good idea of regular expressions and also allows you to
try different expressions and preview the results simultaneously, which will improve your
understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) Attributes matching this type will be removed from the final
output even if they matched the previously mentioned type, i.e. the value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop down
list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) Attributes matching this block type will be removed from the
final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes is
specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes and
all numeric attributes having a value of greater than 6 in every example. A combination
of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) Special attributes are attributes with special roles which
identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special
attributes are selected irrespective of the conditions in the Select Attributes operator. If
this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attributes operator and only those attributes that satisfy the
conditions are selected.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
value mappings Multiple mappings can be specified through this parameter. If only a single
mapping is required, it can be done using the parameters replace what and replace by, as
in the Replace operator. Old values and new values can be easily specified through this
parameter. Multiple mappings can be defined for the same old value, but only the new value
corresponding to the first mapping is taken as the replacement. Regular expressions can also
be used here if the consider regular expressions parameter is set to true.
replace what (string) This parameter specifies what is to be replaced. This can be specified
using regular expressions. This parameter is useful only if a single mapping is to be done.
For multiple mappings use the value mappings parameter.
replace by (string) Regions matching the regular expression of the replace what parameter are replaced by the value of the replace by parameter. This parameter is useful only if a single mapping is to be done. For multiple mappings use the value mappings parameter.
consider regular expressions (boolean) This parameter enables matching based on regular
expressions; old values (the original values, i.e. the same thing as ‘replace what’) may be specified as regular expressions. If this parameter
is enabled, old values are replaced by the new values if the old values
match the given regular expressions. The value corresponding to the first matching regular
expression in the mappings list is taken as the replacement.
add default mapping (boolean) If set to true, all values that occur in the selected attributes
of the ExampleSet but are not listed in the value mappings list are mapped to the value of
the default value parameter.
default value (string) This parameter is only available if the add default mapping parameter
is checked. If add default mapping is set to true and the default value is properly set, all
values that occur in the selected attributes of the ExampleSet but are not listed in the value
mappings list are replaced by the default value. This may be helpful in cases where only
some values should be mapped explicitly and many unimportant values should be mapped
to a default value (e.g. ‘other’).
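The interplay of the value mappings, consider regular expressions and default value parameters described above can be sketched in Python. This is an illustration under assumptions, not the operator's code: the function name map_value is hypothetical, and requiring the regular expression to match the whole value (re.fullmatch) is an assumption, since the text does not say whether partial matches count.

```python
import re


def map_value(value, mappings, use_regex=False, default=None):
    """Return the replacement of the first mapping that applies to value.

    mappings is an ordered list of (old, new) pairs. If no mapping applies,
    the default is returned when one is given, otherwise the value itself.
    """
    for old, new in mappings:
        if use_regex:
            matched = re.fullmatch(old, value) is not None
        else:
            matched = old == value
        if matched:
            return new  # the first matching entry wins
    return value if default is None else default


mappings = [("s.*", "starts_with_s"), ("sunny", "good")]
print(map_value("sunny", mappings, use_regex=True))    # starts_with_s
print(map_value("sunny", mappings))                    # good
print(map_value("rain", mappings, default="other"))    # other
```

Because the list is scanned in order, a second mapping for the same old value is never reached, which matches the rule that only the first mapping is used.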
Tutorial Processes
Mapping multiple values
Figure 2.68: Tutorial process ‘Mapping multiple values’.
The focus of this Example Process is the use of the value mappings parameter and the default
value parameter. The use of the replace what and replace by parameters can be seen in the Example
Process of the Replace operator. Almost all other parameters of the Map operator are also part
of the Select Attributes operator; their use can be better understood by studying the Select Attributes
operator and its Example Process.
The ‘Golf’ data set is loaded using the Retrieve operator. The Map operator is applied on it.
The ‘Wind’ and ‘Outlook’ attributes are selected for mapping. Thus, the effect of the Map operator will be limited to just these two attributes. Four value mappings are specified in the value
mappings parameter. ‘true’, ‘false’, ‘overcast’ and ‘sunny’ are replaced by ‘yes’, ‘no’, ‘bad’ and
‘good’ respectively. The add default mapping parameter is set to true and ‘other’ is specified in
the default value parameter. The ‘Wind’ attribute has only two possible values, i.e. ‘true’ and ‘false’.
Both of them were mapped in the mappings list. The ‘Outlook’ attribute has three possible values,
i.e. ‘sunny’, ‘overcast’ and ‘rain’. ‘sunny’ and ‘overcast’ were mapped in the mappings list but
‘rain’ was not mapped. As the add default mapping parameter is set to true, ‘rain’ will be mapped
to the default value, i.e. ‘other’.
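The mapping used in this tutorial can be reproduced with a few lines of Python. This is only a sketch: the function name apply_map is hypothetical, the ExampleSet is modelled as a list of dictionaries, and the three rows are illustrative rather than the actual Golf data.

```python
def apply_map(rows, attributes, mappings, default=None):
    """Map values of the selected attributes; leave other attributes untouched."""
    result = []
    for row in rows:
        new_row = dict(row)
        for name in attributes:
            old = row[name]
            # Unlisted values fall back to the default if one is given.
            fallback = old if default is None else default
            new_row[name] = mappings.get(old, fallback)
        result.append(new_row)
    return result


mappings = {"true": "yes", "false": "no", "overcast": "bad", "sunny": "good"}
rows = [
    {"Outlook": "sunny", "Wind": "false", "Play": "no"},
    {"Outlook": "overcast", "Wind": "true", "Play": "yes"},
    {"Outlook": "rain", "Wind": "false", "Play": "yes"},
]
mapped = apply_map(rows, ["Outlook", "Wind"], mappings, default="other")
print(mapped[2])  # {'Outlook': 'other', 'Wind': 'no', 'Play': 'yes'}
```

The ‘Play’ column keeps its values because it is not among the selected attributes; had it been selected, its unlisted values would also have been replaced by ‘other’.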
Merge
This operator merges two nominal values of the specified regular
attribute.
Description
The Merge operator is used for merging two nominal values of the specified attribute of the input ExampleSet. Please note that this operator can merge only the values of regular attributes.
The required regular attribute is specified using the attribute name parameter. The first value parameter specifies the first value to be merged, and the second value parameter specifies
the second value to be merged. The two values are merged in ‘first_second’ format,
where first is the value of the first value parameter and second is the value of the second value parameter. The first value and second value parameters do not both have to hold values from
the range of possible values of the selected attribute. However, at least one of them
should hold a value from the range of possible values of the selected
attribute; otherwise this operator will have no effect on the input ExampleSet.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is output of the Retrieve
operator in the attached Example Process.
Output Ports
example set output (exa) The ExampleSet with the merged attribute values is output of this
port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute name (string) The required nominal attribute whose values are to be merged is selected through this parameter. This operator can be applied only on regular attributes.
first value (string) This parameter is used for specifying the first value to be merged. It is not
compulsory for the first value parameter to have a value from the range of possible values
of the selected attribute.
second value (string) This parameter is used for specifying the second value to be merged. It
is not compulsory for the second value parameter to have a value from the range of possible
values of the selected attribute.
Tutorial Processes
Introduction to the Merge operator
Figure 2.69: Tutorial process ‘Introduction to the Merge operator’.
The Golf data set is loaded using the Retrieve operator. The Merge operator is applied on it.
The attribute name parameter is set to ‘Outlook’. The first value parameter is set to ‘sunny’
and the second value parameter is set to ‘hot’. All the occurrences of value ‘sunny’ are replaced
by ‘sunny_hot’ in the Outlook attribute of the resultant ExampleSet. Now set the value of the
second value parameter to ‘rain’ and run the process again. As ‘rain’ is also a possible value of the
Outlook attribute, all occurrences of ‘sunny’ and ‘rain’ in the Outlook attribute are replaced by
‘sunny_rain’ in the resultant ExampleSet. This Example Process is just to explain basic working
of the Merge operator.
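The two runs described above can be sketched in Python. This is an illustration under stated assumptions, not the operator's implementation: merge_values is a hypothetical helper and the list of Outlook values is made up for the example.

```python
def merge_values(values, first, second):
    """Replace every occurrence of first or second by 'first_second'."""
    merged = f"{first}_{second}"
    return [merged if v in (first, second) else v for v in values]


# Hypothetical excerpt of the Outlook attribute.
outlook = ["sunny", "overcast", "rain", "sunny"]

# 'hot' is not a value of Outlook, so only 'sunny' is affected.
first_run = merge_values(outlook, "sunny", "hot")

# Both 'sunny' and 'rain' occur, so both become 'sunny_rain'.
second_run = merge_values(outlook, "sunny", "rain")
```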
Remap Binominals
This operator modifies the internal value mapping of binominal
attributes according to the specified negative and positive values.
Description
The Remap Binominals operator modifies the internal mapping of binominal attributes according to the specified positive and negative values. The positive and negative values are specified
by the positive value and negative value parameters respectively. If the internal mapping differs
from the specified values then the internal mapping is switched. If the internal mapping contains other values than the specified ones the mapping is not changed and the attribute is simply
skipped. Please note that this operator changes the internal mapping so the changes are not
explicitly visible in the ExampleSet. This operator can be applied only on binominal attributes.
Please note that if there is a nominal attribute in the ExampleSet with only two possible values,
this operator will still not be applicable on it. This operator requires the attribute to be explicitly
defined as binominal in the meta data.
Input Ports
example set input (exa) This input port expects an ExampleSet. Please note that there should
be at least one binominal attribute in the input ExampleSet.
Output Ports
example set output (exa) The resultant ExampleSet is output of this port. Externally this
data set is the same as the input ExampleSet, only the internal mappings may be changed.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list. Attributes can be shifted
to the right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners.
It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and also allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) Special attributes are attributes with special roles
which identify the examples. In contrast, regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attributes operator.
If this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attributes operator and only those attributes that satisfy the
conditions are selected.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. All the selected attributes are unselected and the previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
negative value (string) This parameter specifies the internal mapping for the negative or false
value of the selected binominal attributes.
positive value (string) This parameter specifies the internal mapping for the positive or true
value of the selected binominal attributes.
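The rules of the numeric condition parameter listed above can be sketched in Python. This is a minimal illustration, not RapidMiner's parser: parse_condition and select_attributes are hypothetical names, the treatment of ‘=’ as an equality test is an assumption, and the sample columns are made up.

```python
import operator
import re

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}
# One term: an operator, a mandatory blank, then a number (e.g. '> 6').
TERM = re.compile(r"(>=|<=|>|<|=) (-?\d+(?:\.\d+)?)")


def parse_condition(text):
    """Turn a condition such as '> 6 && < 11' into a predicate on numbers."""
    if "&&" in text and "||" in text:
        raise ValueError("&& and || cannot be mixed in one condition")
    sep = "||" if "||" in text else "&&"
    joiner = any if sep == "||" else all
    terms = []
    for part in text.split(sep):
        match = TERM.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"cannot parse term: {part.strip()!r}")
        terms.append((OPS[match.group(1)], float(match.group(2))))
    return lambda x: joiner(op(x, bound) for op, bound in terms)


def select_attributes(columns, condition):
    """Keep nominal columns, and numeric columns whose values all pass."""
    test = parse_condition(condition)
    selected = []
    for name, values in columns.items():
        numeric = all(isinstance(v, (int, float)) for v in values)
        if not numeric or all(test(v) for v in values):
            selected.append(name)
    return selected


# Hypothetical columns: two numeric attributes and one nominal attribute.
columns = {
    "Temperature": [85, 80, 83],
    "Humidity": [7, 9, 10],
    "Outlook": ["sunny", "rain", "overcast"],
}
selected = select_attributes(columns, "> 6 && < 11")
```

In this sketch ‘Temperature’ is dropped because not every value lies below 11, while the nominal ‘Outlook’ attribute is kept regardless of the condition. A term such as ‘<5’ (no blank) or a condition mixing && and || raises an error.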
Tutorial Processes
Changing mapping of the Wind attribute of the Golf data set
Figure 2.70: Tutorial process ‘Changing mapping of the Wind attribute of the Golf data set’.
The ‘Golf’ data set is loaded using the Retrieve operator. In this Example Process we shall change
the internal mapping of the ‘Wind’ attribute of the ‘Golf’ data set. A breakpoint is inserted after the Retrieve operator so that you can view the ‘Golf’ data set. As you can see, the ‘Wind’
attribute of the ‘Golf’ data set is nominal but it has only two possible values. The Remap Binominals operator cannot be applied on such an attribute; it requires that the attribute be
explicitly declared as binominal in the meta data. To accomplish this, the Nominal to Binominal operator is applied on the ‘Golf’ data set to convert the ‘Wind’ attribute to binominal
type. A breakpoint is inserted here so that you can view the ExampleSet. Now that the ‘Wind’
attribute has been converted to binominal type, the Remap Binominals operator can be applied
on it. The ‘Wind’ attribute is selected in the Remap Binominals operator. The negative value
and positive value parameters are set to ‘true’ and ‘false’ respectively. Run the process and the
internal mapping is changed. This change is an internal one so it will not be visible explicitly
in the Results Workspace. Now change the values of the positive value and negative value parameters to ‘a’ and ‘b’ respectively and run the complete process. Have a look at the log. You
will see the following message: “WARNING: Remap Binominals: specified values do not match
values of attribute Wind, attribute is skipped.” This message shows that, because ‘a’ and ‘b’ are
not values of the ‘Wind’ attribute, the mapping is not changed.
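The switch-or-skip behavior described above can be sketched in Python. This is only an illustration: representing the internal mapping as a two-element list with the negative value at index 0 and the positive value at index 1 is an assumption of this sketch, and remap_binominal is a hypothetical helper.

```python
def remap_binominal(mapping, negative_value, positive_value):
    """Return the new internal mapping as [negative, positive].

    mapping is the current two-element mapping (assumed layout:
    index 0 = negative value, index 1 = positive value).
    """
    if set(mapping) != {negative_value, positive_value}:
        # The specified values are not the attribute's values: skip it.
        print("WARNING: specified values do not match, attribute is skipped.")
        return list(mapping)
    return [negative_value, positive_value]


wind_mapping = ["false", "true"]

# First run of the tutorial: negative value 'true', positive value 'false'.
swapped = remap_binominal(wind_mapping, "true", "false")

# Second run: 'a' and 'b' are not values of Wind, so nothing changes.
skipped = remap_binominal(wind_mapping, "b", "a")
```

The data values themselves never change in either run; only the order of the mapping does, which is why the change is not visible in the Results Workspace.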
Replace
This operator replaces parts of the values of selected nominal attributes matching a specified regular expression by a specified replacement.
Description
This operator allows you to select attributes to make replacements in and to specify a regular
expression. Attribute values of selected attributes that match this regular expression are replaced by the specified replacement. The replacement can be empty and can contain capturing
groups. Please keep in mind that although regular expressions are much more powerful than
simple strings, you might simply enter characters to search for.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator
in the attached Example Process. The output of other operators can also be used as input. It
is essential that meta data be attached to the input data because attributes
are specified in their meta data. The Retrieve operator provides meta data along with the data.
Output Ports
example set (exa) An ExampleSet with replacements is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes in which you want to make replacements. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the operator will fail if a selected attribute is not in the ExampleSet.)
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the operator will fail if a selected attribute is not in the ExampleSet.)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option
allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the
Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are not selected.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes.
regular expression (string) Attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners. It
is always good to specify the regular expression through the edit and preview regular expression menu. It gives a good idea of regular expressions and also allows you to try different
expressions and preview the results simultaneously. This will improve your understanding of
regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) Attributes matching this type will be removed from the final
output even if they matched the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) Attributes matching this block type will be removed from the
final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes is
specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes and
all numeric attributes having a value of greater than 6 in every example. A combination
of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) Special attributes are attributes with special roles which
identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special
attributes are selected irrespective of the conditions in the Select Attributes operator. If
this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attributes operator and only those attributes that satisfy the
conditions are selected.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. All the selected attributes are unselected and the previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
replace what (string) This parameter specifies what is to be replaced. This can be specified
using regular expressions. The edit regular expression menu can assist you in specifying
the right regular expression.
replace by (string) The regions matching regular expression of the replace what parameter are
replaced by the value of the replace by parameter.
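The attribute selection by regular expression, with the except regular expression and invert selection parameters, can be sketched in Python. This is an illustration only: select_by_regex is a hypothetical helper, the sketch assumes that an expression must match the whole attribute name, and Python's regex dialect may differ in details from the one used by RapidMiner.

```python
import re


def select_by_regex(names, regular_expression, except_expression=None,
                    invert=False):
    """Select attribute names by regex, with optional exception and inversion."""
    selected = []
    for name in names:
        hit = re.fullmatch(regular_expression, name) is not None
        if hit and except_expression is not None:
            # Names matching the exception are filtered out again.
            hit = re.fullmatch(except_expression, name) is None
        # invert acts as a NOT gate on the final decision.
        if hit != invert:
            selected.append(name)
    return selected


names = ["Outlook", "Temperature", "Humidity", "Wind", "Play"]

plain = select_by_regex(names, ".*[Tt].*")
with_except = select_by_regex(names, ".*[Tt].*", except_expression="Hum.*")
inverted = select_by_regex(names, ".*[Tt].*", except_expression="Hum.*",
                           invert=True)
```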
Tutorial Processes
Use of replace what and replace by parameters
The focus of this process is to show the use of the replace what and replace by parameters. All
other parameters are for the selection of attributes on which the replacement is to be made.
For understanding these parameters please study the Example Process of the Select Attributes
operator.
Figure 2.71: Tutorial process ‘Use of replace what and replace by parameters’.
The ‘Golf’ data set is loaded using the Retrieve operator. The attribute filter type parameter is
set to ‘all’ and the include special attributes parameter is also checked. Thus, replacements are
made on all attributes including special attributes. The replace what parameter is provided with
the regular expression ‘.*e.*’, which matches any attribute value that has the character ‘e’ in it. The
replace by parameter is given the value ‘E’. Run the process. You will see that ‘E’ is placed in place
of ‘yes’, ‘overcast’, ‘true’ and ‘false’. This is because all of these values contain an ‘e’ and the
expression matches the entire value. Now set the regular expression of the replace what parameter to ‘e’. Run
the process again. This time the entire values are not replaced by ‘E’; instead
only the character ‘e’ is replaced by ‘E’. Thus the new values of ‘yes’, ‘overcast’, ‘true’ and ‘false’ are
‘yEs’, ‘ovErcast’, ‘truE’ and ‘falsE’ respectively. You can see the power of this operator and of regular
expressions, so make sure that the correct regular expression is provided. If you
leave the replace by parameter empty or write ‘?’ in it, the null value is used as replacement.
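The difference between the two expressions can be reproduced with Python's re module. This is a sketch of the described behavior, not the operator itself: replace_in_values is a hypothetical helper, the value list is a made-up excerpt, and the syntax for referring to capturing groups differs between regex engines (Python writes \1).

```python
import re


def replace_in_values(values, replace_what, replace_by):
    """Replace every region matching replace_what in each value."""
    pattern = re.compile(replace_what)
    return [pattern.sub(replace_by, value) for value in values]


# Hypothetical excerpt of nominal values from the 'Golf' data set.
values = ["yes", "overcast", "true", "false", "sunny", "no"]

# '.*e.*' matches the whole value as soon as it contains an 'e'.
whole = replace_in_values(values, ".*e.*", "E")

# 'e' matches only the character itself.
partial = replace_in_values(values, "e", "E")

# Capturing groups can be reused in the replacement.
swapped = replace_in_values(["overcast"], "(over)(cast)", r"\2\1")
```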
Replace (Dictionary)
This operator replaces substrings (in the values) of the selected
nominal attributes of the first ExampleSet by using the dictionary
specified by the second ExampleSet.
Description
This operator takes two ExampleSets as input. It replaces substrings (in the values) of the selected nominal attributes of the first ExampleSet by using the value mappings defined in the
second ExampleSet, which serves as a dictionary. The second ExampleSet must have two nominal attributes for the value mappings: the ‘from’ attribute
(specified through the from attribute parameter) and the ‘to’ attribute (specified through the
to attribute parameter). For every example in the second ExampleSet a dictionary entry is created that maps the ‘from’ attribute value to the ‘to’ attribute value. Finally, this dictionary is
used for replacing substrings in the first ExampleSet. If a value of the ‘from’ attribute of the
second ExampleSet is found (as a whole or as a substring) in the selected nominal attributes of
the first ExampleSet, then the corresponding value of the ‘to’ attribute is used as a replacement
for the substring in the first ExampleSet. Please study the attached Example Process for a better
understanding.
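The dictionary lookup described above can be sketched in Python. This is an illustration under stated assumptions rather than the operator's implementation: replace_with_dictionary is a hypothetical helper, the attribute names ‘from’ and ‘to’ and all sample values are made up, and entries are simply applied in the order of the dictionary examples.

```python
def replace_with_dictionary(values, dictionary_rows, from_attribute, to_attribute):
    """Replace substrings in values using one dictionary entry per example."""
    entries = [(row[from_attribute], row[to_attribute])
               for row in dictionary_rows]
    result = []
    for value in values:
        for old, new in entries:
            # Plain substring replacement, whole value or part of it.
            value = value.replace(old, new)
        result.append(value)
    return result


# Hypothetical dictionary ExampleSet with a 'from' and a 'to' attribute.
dictionary_rows = [
    {"from": "true", "to": "yes"},
    {"from": "false", "to": "no"},
]

# Hypothetical nominal values of the first ExampleSet.
values = ["true", "false", "not_true", "sunny"]
replaced = replace_with_dictionary(values, dictionary_rows, "from", "to")
```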
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. The ExampleSet should have at least one nominal attribute because if
there is no such attribute, the use of this operator does not make sense. The substrings of
this ExampleSet will be replaced by using the second ExampleSet.
dictionary (dic) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also be used
as input. This ExampleSet should have a ‘from attribute’ and ‘to attribute’ as specified in
the description of this operator. These attributes will be used for substring replacements
in the first ExampleSet.
Output Ports
example set output (exa) The substrings of the selected nominal attributes of the first ExampleSet are replaced and the resultant ExampleSet is delivered through this port.
original (ori) The first ExampleSet that was given as input is passed without changing to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has the information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When it is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes on which the replacement
will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview
regular expression menu. This menu gives a good idea of regular expressions and
also allows you to try different expressions and preview the results simultaneously. This
will improve your understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. All the selected attributes are unselected and the previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
from attribute (string) This parameter specifies the name of the attribute of the second ExampleSet that specifies the substrings that should be replaced.
to attribute (string) This parameter specifies the name of the attribute of the second ExampleSet that specifies the replacements of the substrings.
use regular expressions (boolean) This parameter specifies if the replacements should be
treated as regular expressions.
convert to lowercase (boolean) This parameter specifies if the strings should be converted
to lower case.
first match only (boolean) This parameter specifies if only the first match in the dictionary
should be considered. If set to false, subsequent matches will be applied iteratively.
Tutorial Processes
Replacing substrings by using a dictionary
Figure 2.72: Tutorial process ‘Replacing substrings by using a dictionary’.
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at this ExampleSet. This ExampleSet will be used as the first ExampleSet for
the Replace (Dictionary) operator. Therefore substring replacements will be made in this ExampleSet. The second ExampleSet is provided by the Subprocess operator. The operator chain
inside the Subprocess operator generates a dictionary ExampleSet for this process. The explanation of this inner chain of operators is not relevant here. A breakpoint is inserted here so
that you can have a look at the ExampleSet. You can see that this ExampleSet has two nominal attributes ‘att1’ and ‘att2’. The Replace (Dictionary) operator takes these two ExampleSets
as input and makes substring replacements in the first ExampleSet by using the second ExampleSet. Have a look at the parameters of the Replace (Dictionary) operator. The attribute filter
type parameter is set to ‘all’, thus substring replacements will be done in all attributes of the first
ExampleSet. The from attribute and to attribute parameters are set to ‘att1’ and ‘att2’ respectively. Thus if the values of the ‘att1’ attribute (i.e. ‘true’ and ‘false’) are found in any attribute
of the first ExampleSet, they will be replaced by the corresponding ‘att2’ attribute values (i.e.
‘YES’ and ‘NO’ respectively). All other parameters are used with default values. Run the process and compare the resultant ExampleSet with the original ExampleSet. You can clearly see
in the Wind attribute that the substrings ‘true’ and ‘false’ have been replaced by ‘YES’ and ‘NO’
respectively. Please note that this operator is a substring replacement tool, although it was used
for value replacement in this process. If the ‘att1’ attribute had the value ‘tr’ instead of ‘true’,
all occurrences of this substring would have been replaced by ‘YES’. In that case the value ‘true’ in
the Wind attribute would have been changed to ‘YESue’.
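The substring behaviour described above can be illustrated with a minimal Python sketch. It is not the operator's implementation: the function name replace_with_dictionary and the representation of the dictionary ExampleSet as a list of (from, to) pairs are assumptions made for illustration, and only plain substring matching is covered (no regular expressions).

```python
def replace_with_dictionary(value, dictionary, first_match_only=False):
    """Replace substrings of `value` using (from, to) pairs.

    `dictionary` is a list of (from_substring, to_substring) tuples standing
    in for the 'from attribute' and 'to attribute' columns of the second
    ExampleSet (a hypothetical representation, not RapidMiner's API).
    """
    for old, new in dictionary:
        if old in value:
            value = value.replace(old, new)
            if first_match_only:
                break  # stop after the first dictionary entry that matched
    return value


wind_values = ["true", "false"]

# Dictionary from the tutorial process: 'true' -> 'YES', 'false' -> 'NO'.
tutorial_dictionary = [("true", "YES"), ("false", "NO")]
replaced = [replace_with_dictionary(v, tutorial_dictionary) for v in wind_values]
print(replaced)  # ['YES', 'NO']

# A shorter key also matches inside longer values.
partial = replace_with_dictionary("true", [("tr", "YES")])
print(partial)  # YESue
```

Because every dictionary entry is applied as a substring replacement, short keys should be chosen with care.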
Set Data
This operator sets the value of one or more attributes of the specified example.
Description
The Set Data operator sets the value of one or more attributes of the specified example of the
input ExampleSet. The example is specified by the example index parameter. The attribute name
parameter specifies the attribute whose value is to be set. The value parameter specifies the
new value. Values of other attributes of the same example can be set by the additional values
parameter. Please note that the values should be consistent with the type of the attribute e.g.
specifying a string value is not allowed for an integer type attribute.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process.
Output Ports
example set output (exa) The ExampleSet with the new values of the selected example’s attributes
is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
example index (integer) This parameter specifies the index of the example whose value should
be set. Please note that counting starts from 1.
attribute name (string) This parameter specifies the name of the attribute whose value should
be set.
count backwards (boolean) If set to true, the counting order is reversed. The last example
is addressed by index 1, the second last example is addressed by index 2 and so on.
value (string) This parameter specifies the new value of the selected attribute (selected by the
attribute name parameter) of the specified example (specified by the example index parameter).
additional values The values of other attributes of the same example can be set by this parameter.
Tutorial Processes
Introduction to the Set Data operator
Figure 2.73: Tutorial process ‘Introduction to the Set Data operator’.
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can view the data set before application of the Set Data operator. You can see that the values
of the Temperature and Wind attributes are ‘85’ and ‘false’ respectively in the first example. The
Set Data operator is applied on the ‘Golf’ data set. The example index parameter is set to 1, the
attribute name parameter is set to ‘Temperature’ and the value parameter is set to 50. Thus the
value of the Temperature attribute will be set to 50 in the first example. Similarly, the value of
the Wind attribute in the first example is set to ‘fast’ using the additional values parameter. You
can verify this by running the process and seeing the results in the Results Workspace. Please
note that a string value cannot be set for the Temperature attribute because it is of integer type.
An integer value can be specified for the Wind attribute (nominal type) but it will be stored as a
nominal value.
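The indexing rules of the Set Data operator (counting starts from 1, and count backwards addresses the last example with index 1) can be sketched in Python as follows. This is an illustration under stated assumptions, not the operator's implementation: the function set_data, the list-of-dicts representation and the second data row are hypothetical, and type checking of values is not modelled.

```python
def set_data(rows, example_index, attribute_name, value,
             additional_values=None, count_backwards=False):
    """Return a copy of `rows` with values set in one example.

    `rows` is a list of dicts (one dict per example). `example_index` is
    1-based; with `count_backwards` index 1 addresses the last example.
    """
    if not 1 <= example_index <= len(rows):
        raise IndexError("example index out of range")
    position = len(rows) - example_index if count_backwards else example_index - 1
    result = [dict(row) for row in rows]  # the input stays unchanged
    updates = {attribute_name: value, **(additional_values or {})}
    for name, new_value in updates.items():
        if name not in result[position]:
            raise KeyError(f"unknown attribute: {name}")
        result[position][name] = new_value
    return result


# First row as in the tutorial; the second row is illustrative only.
golf = [
    {"Temperature": 85, "Wind": "false"},
    {"Temperature": 80, "Wind": "true"},
]
updated = set_data(golf, 1, "Temperature", 50, additional_values={"Wind": "fast"})
print(updated[0])  # {'Temperature': 50, 'Wind': 'fast'}
print(golf[0])     # {'Temperature': 85, 'Wind': 'false'}
```

Returning a copy mirrors the fact that the original ExampleSet remains available unchanged at the original (ori) port.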
Split
This operator creates new attributes from the selected nominal attributes by splitting the nominal values into parts according to the
specified split criterion.
Description
The Split operator creates new attributes from the selected nominal attributes by splitting the
nominal values into parts according to the split criterion which is specified through the split
pattern parameter in form of a regular expression. This operator provides two different modes for
splitting; the desired mode can be selected by the split mode parameter. The two splitting modes
are explained with an imaginary ExampleSet with a nominal attribute named ‘att’, assuming that
the split pattern parameter is set to ‘,’ (comma). Suppose the ExampleSet has three examples:
1. value1
2. value2, value3
3. value3
Ordered Splits
In case of ordered split the resulting attributes get the name of the original attribute together
with a number indicating the order. In our example scenario there will be two attributes named
‘att_1’ and ‘att_2’ respectively. After splitting the three examples will have the following values
for ‘att_1’ and ‘att_2’ (described in form of tuples):
1. (value1,?)
2. (value2,value3)
3. (value3,?)
This mode is useful if the original values indicated some order like, for example, a preference.
Unordered Splits
In case of unordered split the resulting attributes get the name of the original attribute together
with the value for each of the occurring values. In our example scenario there will be three attributes named ‘att_value1’, ‘att_value2’ and ‘att_value3’ respectively. All these new attributes
are boolean. After splitting the three examples will have the following values for ‘att_value1’,
‘att_value2’ and ‘att_value3’ (described in form of tuples):
1. (true, false, false)
2. (false, true, true)
3. (false, false, true)
This mode is useful if the order is not important but the goal is a basket like data set containing
all occurring values.
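The two split modes described above can be reproduced with a short Python sketch. It is a hedged illustration rather than the operator's implementation: the function names are hypothetical, missing values (‘?’) are represented by None, the parts are stripped of surrounding whitespace so that the result matches the tuples shown above, and the column order of the unordered mode (sorted by value) is an assumption.

```python
import re


def ordered_split(values, pattern, name="att"):
    """Ordered mode: one attribute per position; None marks a missing value."""
    parts = [[p.strip() for p in re.split(pattern, v)] for v in values]
    width = max(len(p) for p in parts)
    columns = [f"{name}_{i + 1}" for i in range(width)]
    rows = [tuple(p + [None] * (width - len(p))) for p in parts]
    return columns, rows


def unordered_split(values, pattern, name="att"):
    """Unordered mode: one boolean attribute per distinct part."""
    parts = [[p.strip() for p in re.split(pattern, v)] for v in values]
    distinct = sorted({p for row in parts for p in row})
    columns = [f"{name}_{d}" for d in distinct]
    rows = [tuple(d in row for d in distinct) for row in parts]
    return columns, rows


examples = ["value1", "value2, value3", "value3"]

ordered_columns, ordered_rows = ordered_split(examples, ",")
print(ordered_columns)  # ['att_1', 'att_2']
print(ordered_rows)     # [('value1', None), ('value2', 'value3'), ('value3', None)]

unordered_columns, unordered_rows = unordered_split(examples, ",")
print(unordered_columns)  # ['att_value1', 'att_value2', 'att_value3']
print(unordered_rows)     # [(True, False, False), (False, True, True), (False, False, True)]
```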
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input. The ExampleSet should have at least one nominal attribute because if
there is no such attribute, the use of this operator does not make sense.
Output Ports
example set output (exa) The selected nominal attributes are split into new attributes and
the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When it is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes on which the split
will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview
regular expression menu. This menu gives a good idea of regular expressions. It
also allows you to try different expressions and preview the results simultaneously, which
helps in understanding regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all selected attributes become unselected and all previously unselected
attributes become selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then afterwards ‘att1’ will
be unselected and ‘att2’ will be selected.
split pattern (string) This parameter specifies the pattern which is used for dividing the nominal values into different parts. It is specified in form of a regular expression. Regular
expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview regular expression
menu. This menu gives a good idea of regular expressions.
split mode (selection) This parameter specifies the split mode for splitting. The two options
of this parameter are explained in the description of this operator.
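The interplay of the regular expression, except regular expression and invert selection parameters listed above can be sketched in Python. This is only an illustration: the function select_attributes is hypothetical, it assumes that the whole attribute name must match the expression, and Python's re syntax may differ in details from the regular expression dialect used by the operator.

```python
import re


def select_attributes(names, regular_expression, except_expression=None,
                      invert_selection=False):
    """Select attribute names by regular expression.

    Assumption: a name is selected only if the whole name matches.
    """
    selected = []
    for name in names:
        keep = re.fullmatch(regular_expression, name) is not None
        if keep and except_expression is not None:
            # Names matching the except expression are filtered out again.
            keep = re.fullmatch(except_expression, name) is None
        if keep != invert_selection:
            selected.append(name)
    return selected


names = ["att1", "att2", "att10", "label"]
plain = select_attributes(names, r"att\d+")
with_exception = select_attributes(names, r"att\d+", except_expression=r"att1\d")
inverted = select_attributes(names, r"att\d+", invert_selection=True)
print(plain)           # ['att1', 'att2', 'att10']
print(with_exception)  # ['att1', 'att2']
print(inverted)        # ['label']
```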
Tutorial Processes
Ordered and unordered splits
Figure 2.74: Tutorial process ‘Ordered and unordered splits’.
This Example Process starts with a Subprocess operator. The operator chain inside the Subprocess operator generates an ExampleSet for this process. The explanation of this inner chain
of operators is not relevant here. A breakpoint is inserted here so that you can have a look at
the ExampleSet before the application of the Split operator. You can see that this ExampleSet is
the same ExampleSet that is described in the description of this operator. The Split operator is
applied on it with default values of all parameters. The split mode parameter is set to ‘ordered
split’ by default. Run the process and compare the results with the explanation of ordered split
in the description section of this document. Now change the split mode parameter to ‘unordered
split’ and run the process again. You can understand the results by studying the description of
unordered split in the description of this operator.
Trim
This operator removes leading and trailing spaces from the values
of the selected nominal attributes.
Description
The Trim operator creates new attributes from the selected nominal attributes by removing leading and trailing spaces from the nominal values. The required attributes can be selected through
parameters. Please note that this operator only removes leading and trailing spaces from attribute values; spaces within a value are not removed. For example, the values ‘ value 1’, ‘value 2 ’
and ‘ value 3 ’ will be trimmed to ‘value 1’, ‘value 2’ and ‘value 3’ respectively.
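The trimming behaviour from the example above corresponds to the following minimal Python sketch. The function trim_values is hypothetical, and it removes only the space character, as stated in the description; whether the operator also removes other whitespace such as tabs is not specified here.

```python
def trim_values(values):
    """Strip leading and trailing spaces; spaces inside a value are kept."""
    return [v.strip(" ") for v in values]


raw = [" value 1", "value 2 ", " value 3 "]
trimmed = trim_values(raw)
print(trimmed)  # ['value 1', 'value 2', 'value 3']
```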
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input. The ExampleSet should have at least one nominal attribute because if
there is no such attribute, the use of this operator does not make sense.
Output Ports
example set output (exa) The values of the selected nominal attributes are trimmed and the
resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel. (Since RapidMiner 6.0.4 the Operator will fail if a selected Attribute is not in the ExampleSet)
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel. (Since RapidMiner 6.0.4
the Operator will fail if a selected Attribute is not in the ExampleSet)
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When it is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes on which the trimming
will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview
regular expression menu. This menu gives a good idea of regular expressions. It
also allows you to try different expressions and preview the results simultaneously, which
helps in understanding regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all selected attributes become unselected and all previously unselected
attributes become selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then afterwards ‘att1’ will
be unselected and ‘att2’ will be selected.
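The rules of the numeric condition parameter described above (a blank after each comparison operator, either && or || but never both, and every example of an attribute having to satisfy the condition) can be sketched in Python. This is an illustration, not the operator's parser: the function names are hypothetical, and the set of accepted operators (‘>’, ‘>=’, ‘<’, ‘<=’, ‘=’) is an assumption based on the examples in the text.

```python
import operator
import re

# Assumed operator set; only some of these appear in the examples above.
_OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq,
}


def parse_condition(condition):
    """Parse a condition such as '> 6 && < 11' into a predicate on a number."""
    if "&&" in condition and "||" in condition:
        raise ValueError("&& and || cannot be mixed in one condition")
    combine = any if "||" in condition else all
    tests = []
    for term in re.split(r"&&|\|\|", condition):
        parts = term.split()
        # The blank after the operator is required, so '<5' is rejected.
        if len(parts) != 2 or parts[0] not in _OPERATORS:
            raise ValueError(f"malformed term: {term.strip()!r}")
        tests.append((_OPERATORS[parts[0]], float(parts[1])))
    return lambda x: combine(op(x, bound) for op, bound in tests)


def attribute_passes(values, condition):
    """A numeric attribute is kept only if every example satisfies the condition."""
    predicate = parse_condition(condition)
    return all(predicate(v) for v in values)


print(attribute_passes([7, 8, 10], "> 6 && < 11"))  # True
print(attribute_passes([7, 8, 12], "> 6 && < 11"))  # False
print(attribute_passes([3, -1], "<= 5 || < 0"))     # True
```

Nominal attributes are not modelled in this sketch; as stated above, they are selected irrespective of the numeric condition.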
Tutorial Processes
Removing leading and trailing spaces from attribute values
Figure 2.75: Tutorial process ‘Removing leading and trailing spaces from attribute values’.
This Example Process starts with the Subprocess operator. The operator chain inside the Subprocess operator generates an ExampleSet for this process. The explanation of this inner chain
of operators is not relevant here. A breakpoint is inserted here so that you can have a look at the
ExampleSet before the application of the Trim operator. You can see that this ExampleSet has
two nominal attributes ‘att1’ and ‘att2’. You can see that some values of these attributes have
leading and trailing spaces. The Trim operator is applied on this ExampleSet to remove these
spaces. All parameters are used with default values. Run the process and compare the resultant ExampleSet with the original ExampleSet. You can clearly see that the leading and trailing
spaces have been removed.
3 Cleansing
3.1 Normalization
Normalize
This operator normalizes the attribute values of the selected attributes.
Description
Normalization is a preprocessing technique used to rescale attribute values to fit in a specific
range. Normalization of the data is very important when dealing with attributes of different
units and scales. For example, some data mining techniques use the Euclidean distance. Therefore, all attributes should have the same scale for a fair comparison between them. In other
words normalization is a technique used to level the playing field when looking at attributes
that widely vary in size as a result of the units selected for representation. This operator performs normalization of the selected attributes. Four normalization methods are provided. These
methods are explained in the parameters.
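The effect of differing scales on the Euclidean distance can be made concrete with a small Python sketch. It uses two widely known rescaling techniques, range (min-max) rescaling and z-score standardization, purely as illustrations; it does not claim to reproduce the operator's methods. The function names and the data are hypothetical, and the z-score uses the population standard deviation as an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two examples given as tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_scale(column, low=0.0, high=1.0):
    """Min-max rescaling of one attribute to the range [low, high]."""
    minimum, maximum = min(column), max(column)
    if maximum == minimum:
        return [low for _ in column]
    return [low + (v - minimum) * (high - low) / (maximum - minimum) for v in column]


def z_scale(column):
    """Standardize one attribute to mean 0 and (population) standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [0.0 if std == 0 else (v - mean) / std for v in column]


# Hypothetical data: heights in cm and incomes in a currency unit.
heights_cm = [160, 161, 190, 175]
incomes = [30000, 31000, 30100, 40000]

raw = list(zip(heights_cm, incomes))
scaled = list(zip(range_scale(heights_cm), range_scale(incomes)))

# On raw data the income dominates, so example 2 looks nearer to example 0.
raw_says_2_is_nearer = euclidean(raw[0], raw[2]) < euclidean(raw[0], raw[1])
# After rescaling both attributes contribute, and example 1 is nearer.
scaled_says_1_is_nearer = euclidean(scaled[0], scaled[1]) < euclidean(scaled[0], scaled[2])
print(raw_says_2_is_nearer, scaled_says_1_is_nearer)  # True True
```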
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator
in the attached Example Process. The output of other operators can also be used as input.
Meta data must be attached to the input data because the attributes
are specified in the meta data. The Retrieve operator provides meta data along with the data.
Output Ports
example set (exa) The ExampleSet with the selected attributes in normalized form is delivered
through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would normally
be performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes that you want to normalize. It
has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if the meta data is not known. When this option is selected another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to numeric type. The user should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option
allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the
Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are not selected.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes.
regular expression (string) Attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners. It
is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions. It also allows
you to try different expressions and preview the results simultaneously, which helps in
understanding regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) Attributes matching this type will be removed from the final
output even if they matched the previously mentioned type i.e. value type parameter’s
value.
block type (selection) The Block type of the attributes to be selected can be chosen from a
drop down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) Attributes matching this block type will be removed from the
final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes is
specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes and
all numeric attributes having a value of greater than 6 in every example. A combination
of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) Special attributes are attributes with special roles which identify the examples. In contrast regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster,
weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attributes operator. If this parameter is set to true, special attributes
are also tested against the conditions specified in the Select Attributes operator and only those
attributes are selected that satisfy the conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
removed before this parameter is checked, then after checking it ‘att1’ will
be removed and ‘att2’ will be selected.
method (selection) Four methods are provided here for normalizing data. These methods are
also explained in the attached Example Process.
• z_transformation This is also called statistical normalization. The purpose of statistical normalization is to rescale the data so that it has mean = 0 and
variance = 1. The formula of statistical normalization is Z = (X - u) / s. You take your
attribute values as a vector X, subtract the mean of the attribute values, u, and
divide this difference by the standard deviation, s. The result is another vector Z that has
zero mean and unit variance. If the original values are normally distributed, Z follows the standard normal distribution, N(0,1). Note that the range of the standard normal distribution is
not [0,1] but about -3 to +3 (strictly speaking minus infinity to plus infinity, but the interval from -3 to +3
already captures about 99.7% of the data).
• range_transformation When this method is selected, two other parameters (min,
max) appear in the Parameters panel. Range transformation normalizes all attribute
values to the specified range [min,max]. min and max are specified using the min and max
parameters respectively.
• proportion_transformation Each attribute value is normalized as a proportion of the
total sum of the respective attribute i.e. each attribute value is divided by the total
sum of that attribute's values.
• interquartile_range Normalization is performed using the interquartile range. The range
is the difference between the largest and the smallest value in the data set. Since the
range only takes into account two values from the entire data set, it may be heavily influenced by outliers in the data. Therefore, another criterion, the interquartile range,
is commonly used. It is the distance between the 25th and 75th percentiles (Q3 - Q1). The interquartile range is essentially the range of the middle 50% of the data.
Because it uses the middle 50%, the interquartile range is much less affected by outliers or
extreme values.
min (real) This parameter is available only when the method parameter is set to ‘range transformation’. It is used to specify the minimum point of the range.
max (real) This parameter is available only when the method parameter is set to ‘range transformation’. It is used to specify the maximum point of the range.
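The range transformation and the interquartile range described above can be illustrated with a minimal Python sketch. This is not the operator's implementation. The function names, the sample values, the handling of a constant attribute and the quartile method (the default of Python's statistics module) are assumptions made for illustration. Since the text does not state how the interquartile range is used to rescale the values, the sketch only computes it.

```python
import statistics

def range_transform(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute is mapped to new_min to avoid division by zero.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

def interquartile_range(values):
    """Return Q3 - Q1, using the default quantile method of the statistics module."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

temperatures = [64, 68, 72, 80, 85]  # hypothetical attribute values
scaled = range_transform(temperatures, 0.0, 1.0)
iqr = interquartile_range(temperatures)

print(scaled)  # first value is 0.0, last value is 1.0
print(iqr)
```

Other quartile conventions give slightly different values for Q1 and Q3 on small data sets, so the interquartile range printed here need not match the one used by the operator.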
Tutorial Processes
Different methods of normalization
Figure 3.1: Tutorial process ‘Different methods of normalization’.
The focus of this process is to show different methods available for normalization. All parameters other than the method parameter are for selection of attributes on which normalization is
to be applied. To understand these parameters please study the Example Process of the Select
Attributes operator.
In this process the Retrieve operator is used to load the ‘golf’ data set from the Repository. The
Filter Examples operator is applied on it to select just four examples of the ‘golf’ data set. This is
done to just simplify the calculations. The breakpoint is inserted after this operator so that you
can have a look at the examples. There are four examples with ‘Humidity’ attribute values 65,
70, 70 and 70. The ‘Humidity’ attribute is selected for normalization in the Normalize operator.
The method parameter is set to ‘proportion transformation’. All values of the ‘Humidity’ attribute are divided by the sum of all values of the ‘Humidity’ attribute. The sum is 275 (65+70+70+70).
Thus the values after normalization are 0.236 (65/275) and 0.255 (70/275).
Now run the process again with the method parameter set to ‘z-transformation’. The mean
of the four ‘Humidity’ attribute values (65, 70, 70, 70) is 68.75. The sample standard deviation of these
values is 2.5. For each attribute value, subtract the mean from the attribute
value and divide the result by the standard deviation. This gives -1.5 for the value 65 and 0.5 for the value 70. You will see that these results are the same as
in the Results Workspace.
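The arithmetic of the two runs above can be checked with a few lines of Python. This is only a verification sketch of the numbers in the tutorial, not the operator's code. The variable names are chosen for illustration.

```python
import statistics

humidity = [65, 70, 70, 70]  # the four 'Humidity' values from the tutorial

# proportion transformation: divide each value by the attribute's total (275)
total = sum(humidity)
proportions = [round(v / total, 3) for v in humidity]

# z-transformation: subtract the mean, divide by the sample standard deviation
mean = statistics.mean(humidity)    # 68.75
stdev = statistics.stdev(humidity)  # sample standard deviation (n - 1 in the denominator)
z_scores = [(v - mean) / stdev for v in humidity]

print(proportions)
print(z_scores)
```

Note that the value 2.5 is obtained only with the sample standard deviation. The population standard deviation of the same four values is smaller.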
Select the ‘Temperature’ attribute and set the method parameter to ‘range transformation’.
Use 0 and 1 for min and max parameters. Run the process. You will see that all values of the
‘Temperature’ attribute are in range [0,1].
Scale by Weights
This operator scales the input ExampleSet according to the given
weights. This operator deselects attributes with weight 0 and calculates new values for numeric attributes according to the given
weights.
Description
The Scale by Weights operator selects attributes with non zero weight. The values of the remaining numeric attributes are recalculated based on the weights delivered at the weights input
port. The new values of numeric attributes are calculated by multiplying the original values by
the weight of that attribute. This operator is not well suited to selecting a subset of attributes
according to weights determined by a former weighting scheme. For this purpose the Select by
Weights operator should be used, which selects only those attributes that fulfill a specified weight
relation.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Weight by Chi
Squared Statistic operator in the attached Example Process. The output of other operators
can also be used as input. It is essential that meta data should be attached with the data
for the input because attributes are specified in their meta data.
weights (wei) This port expects the attribute weights. There are numerous operators that provide the attribute weights. The Weight by Chi Squared Statistic operator is used in the
Example Process.
Output Ports
example set (exa) The attributes with weight 0 are removed from the input ExampleSet. The
values of the remaining numeric attributes are recalculated based on the weights provided
at the weights input port. The resultant ExampleSet is delivered through this port.
Tutorial Processes
Applying the Scale by Weights operator on the Golf data set
The ‘Golf’ data set is loaded using the Retrieve operator. The Weight by Chi Squared Statistic
operator is applied on it to generate attribute weights. A breakpoint is inserted here. You can
see the attributes with their weights here. You can see that the Wind, Humidity, Outlook and
Temperature attributes have weights 0, 0.438, 0.450 and 1 respectively. The Scale by Weights
operator is applied next. The ‘Golf’ data set is provided at the example set input port and weights
calculated by the Weight by Chi Squared Statistic operator are provided at the weights input port.
The Scale by Weights operator removes the attributes with weight 0 i.e. the Wind attribute is
removed. The values of the remaining numeric attributes (i.e. the Temperature and Humidity
attribute) are recalculated based on their weights. The weight of the Temperature attribute is
1 thus its values remain unchanged. The weight of the Humidity attribute is 0.438 thus its new
values are calculated by multiplying the original values by 0.438. This can be verified by viewing
the results in the Results Workspace.
Figure 3.2: Tutorial process ‘Applying the Scale by Weights operator on the Golf data set’.
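The behavior described in this tutorial can be sketched in Python as follows. This is a minimal illustration, not the operator's implementation. The function name, the dictionary-based data layout and the two data rows are assumptions. Only the weights are taken from the tutorial text.

```python
def scale_by_weights(columns, weights, numeric):
    """Drop attributes with weight 0 and multiply numeric attributes by their weight.

    columns: dict mapping attribute name to a list of values
    weights: dict mapping attribute name to its weight
    numeric: set of attribute names that hold numeric values
    """
    scaled = {}
    for name, values in columns.items():
        w = weights[name]
        if w == 0:
            continue  # attributes with weight 0 are removed
        if name in numeric:
            scaled[name] = [v * w for v in values]
        else:
            scaled[name] = list(values)  # nominal attributes are kept unchanged
    return scaled

golf = {  # two hypothetical rows, not the real 'Golf' data
    "Outlook": ["sunny", "rain"],
    "Temperature": [85, 70],
    "Humidity": [80, 96],
    "Wind": ["false", "true"],
}
# weights as reported in the tutorial
weights = {"Wind": 0.0, "Humidity": 0.438, "Outlook": 0.450, "Temperature": 1.0}

result = scale_by_weights(golf, weights, numeric={"Temperature", "Humidity"})
print(result)
```

In the result the Wind attribute is gone, the Temperature values are unchanged because their weight is 1, and each Humidity value is multiplied by 0.438.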
3.2 Binning
Discretize by Binning
This operator discretizes the selected numerical attributes into a
user-specified number of bins. Bins of equal range are automatically generated; the number of values in different bins may
vary.
Description
This operator discretizes the selected numerical attributes to nominal attributes. The number of
bins parameter is used to specify the required number of bins. This discretization is performed by
simple binning. The range of numerical values is partitioned into segments of equal size. Each
segment represents a bin. Numerical values are assigned to the bin representing the segment
covering the numerical value. Each range is named automatically. The naming format for range
can be changed using the range name type parameter. Values falling in the range of a bin are
named according to the name of that range. This operator also allows you to apply binning only
on a range of values. This can be enabled by using the define boundaries parameter. The min
value and max value parameter are used for defining the boundaries of the range. If there are any
values that are less than the min value parameter, a separate range is created for them. Similarly
if there are any values that are greater than the max value parameter, a separate range is created
for them. Then, the discretization by binning is performed only on the values that are within
the specified boundaries.
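Simple binning without boundaries can be sketched in Python as shown below. The sketch computes the bin edges and counts the values per bin, which also shows that equal-range bins need not hold equal numbers of values. It is not the operator's implementation. The function names and sample values are hypothetical, and the rule that a value lying exactly on an edge goes into the lower bin is an assumption.

```python
def equal_width_edges(values, number_of_bins):
    """Return the number_of_bins + 1 edges that split [min, max] into equal segments."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / number_of_bins
    return [lo + i * width for i in range(number_of_bins + 1)]

def count_per_bin(values, edges):
    """Count how many values fall into each segment between consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        index = 0
        # Assumption: a value exactly on an edge belongs to the lower bin.
        while index < len(counts) - 1 and v > edges[index + 1]:
            index += 1
        counts[index] += 1
    return counts

humidity = [65, 70, 70, 72, 75, 78, 95]  # hypothetical values
edges = equal_width_edges(humidity, 3)
counts = count_per_bin(humidity, edges)

print(edges)   # [65.0, 75.0, 85.0, 95.0]
print(counts)  # [5, 1, 1]
```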
Differentiation
• Discretize by Frequency The Discretize By Frequency operator creates bins in such a way
that the number of unique values in all bins is (almost) equal. See page 335 for details.
• Discretize by Size The Discretize By Size operator creates bins in such a way that each
bin has user-specified size (i.e. number of examples). See page 340 for details.
• Discretize by Entropy The discretization is performed by selecting bin boundaries such
that the entropy is minimized in the induced partitions. See page 331 for details.
• Discretize by User Specification This operator discretizes the selected numerical attributes into user-specified classes. See page 344 for details.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator
in the attached Example Process. The output of other operators can also be used as input. It
is essential that meta data is attached to the input data because attributes
are specified in the meta data. The Retrieve operator provides meta data along with the
data. Note that there should be at least one numerical attribute in the input ExampleSet,
otherwise the use of this operator does not make sense.
Output Ports
example set (exa) The selected numerical attributes are converted into nominal attributes by
binning and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool, but beginners may need a detailed introduction to them.
It is advisable to specify the regular expression through the edit and preview regular expression menu. This menu lets you
try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attribute operator.
If this parameter is set to true, Special attributes are also tested against conditions specified in the Select Attribute operator and only those attributes are selected that satisfy the
conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. In that case all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
number of bins (integer) This parameter specifies the number of bins which should be used
for each attribute.
define boundaries (boolean) The Discretize by Binning operator allows you to apply binning only on a range of values. This can be enabled by using the define boundaries parameter. If this is set to true, discretization by binning is performed only on the values that are
within the specified boundaries. The lower and upper limits of the range are specified by
the min value and max value parameters respectively.
min value (real) This parameter is only available when the define boundaries parameter is set
to true. It is used to specify the lower limit value for the binning range.
max value (real) This parameter is only available when the define boundaries parameter is set
to true. It is used to specify the upper limit value for the binning range.
range name type (selection) This parameter is used to change the naming format for range.
‘long’, ‘short’ and ‘interval’ formats are available.
automatic number of digits (boolean) This is an expert parameter. It is only available when
the range name type parameter is set to ‘interval’. It indicates if the number of digits should
be automatically determined for the range names.
number of digits (integer) This is an expert parameter. It is used to specify the minimum
number of digits used for the interval names.
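The rules of the numeric condition parameter described above can be sketched as a small Python evaluator. This is an illustration of the stated rules only, not the parser used by the operator. The function names and the exact set of comparison operators are assumptions. An attribute passes when every one of its values satisfies the condition, && and || may not be mixed, and a blank is required after the comparison operator.

```python
import operator
import re

# Assumption: these are the comparison operators accepted in a condition.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}

def parse_condition(condition):
    """Parse e.g. '> 6 && < 11' into (combine, [(op, bound), ...])."""
    if "&&" in condition and "||" in condition:
        raise ValueError("&& and || cannot be mixed in one numeric condition")
    separator, combine = ("||", any) if "||" in condition else ("&&", all)
    tests = []
    for part in condition.split(separator):
        # \s+ after the operator enforces the required blank space
        match = re.fullmatch(r"\s*(>=|<=|>|<|=)\s+(-?\d+(?:\.\d+)?)\s*", part)
        if match is None:
            raise ValueError(f"cannot parse {part!r}")
        tests.append((OPS[match.group(1)], float(match.group(2))))
    return combine, tests

def attribute_passes(values, condition):
    """True if every value of a numeric attribute satisfies the condition."""
    combine, tests = parse_condition(condition)
    return all(combine(op(v, bound) for op, bound in tests) for v in values)

print(attribute_passes([7, 8, 10], "> 6 && < 11"))  # True
print(attribute_passes([7, 12], "> 6 && < 11"))     # False
```

Nominal attributes are not tested at all in this scheme, which corresponds to the statement that they are kept irrespective of the numeric condition.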
Related Documents
• Discretize by Frequency (page 335)
• Discretize by Size (page 340)
• Discretize by Entropy (page 331)
• Discretize by User Specification (page 344)
Tutorial Processes
Discretizing numerical attributes of the ‘Golf’ data set by Binning
The focus of this Example Process is the binning procedure. For understanding the parameters
related to attribute selection please study the Example Process of the Select Attributes operator.
The ‘Golf’ data set is loaded using the Retrieve operator. The Discretize by Binning operator is
applied on it. The ‘Temperature’ and ‘Humidity’ attributes are selected for discretization. The
number of bins parameter is set to 2. The define boundaries parameter is set to true. The min
value and max value parameters are set to 70 and 80 respectively. Thus binning will be performed
only in the range from 70 to 80. As the number of bins parameter is set to 2, the range will be
divided into two equal segments. Approximately speaking, the first segment of the range will
be from 70 to 75 and the second segment of the range will be from 75 to 80. These are not exact
values, but they are good enough for the explanation of this process. There will be a separate
range for all those values that lie below the binning range. This range
is automatically named ‘range1’. The first and second segment of the binning range are named
‘range2’ and ‘range3’ respectively. There will be a separate range for all those values that are
greater than the max value parameter i.e. greater than 80. This range is automatically named
‘range4’. Run the process and compare the original data set with the discretized one. You can
see that the values less than or equal to 70 in the original data set are named ‘range1’ in the
discretized data set. The values greater than 70 and less than or equal to 75 in the original data
set are named ‘range2’ in the discretized data set. The values greater than 75 and less than or
equal to 80 in the original data set are named ‘range3’ in the discretized data set. The values
greater than 80 in the original data set are named ‘range4’ in the discretized data set.
Figure 3.3: Tutorial process ‘Discretizing numerical attributes of the ‘Golf’ data set by Binning’.
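The assignment of values to ‘range1’ through ‘range4’ in this tutorial can be reproduced with a short Python sketch. It is not the operator's implementation. The function name and the sample values are hypothetical, and the right-closed intervals are an assumption taken from the results described above.

```python
def bin_with_boundaries(value, min_value, max_value, number_of_bins):
    """Name the range of one value when define boundaries is enabled.

    Assumption: intervals are closed on the right, as in the tutorial results.
    """
    if value <= min_value:
        return "range1"                      # everything up to min value
    if value > max_value:
        return f"range{number_of_bins + 2}"  # everything above max value
    width = (max_value - min_value) / number_of_bins
    index = 1
    # move to the next segment while the value lies above the segment's upper edge
    while index < number_of_bins and value > min_value + index * width:
        index += 1
    return f"range{index + 1}"

temperature = [64, 70, 72, 75, 80, 83]  # hypothetical values
names = [bin_with_boundaries(t, 70, 80, 2) for t in temperature]
print(names)
# ['range1', 'range1', 'range2', 'range2', 'range3', 'range4']
```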
Discretize by Entropy
This operator converts the selected numerical attributes into nominal attributes. The boundaries of the bins are chosen so that the
entropy is minimized in the induced partitions.
Description
This operator discretizes the selected numerical attributes to nominal attributes. The discretization is performed by selecting a bin boundary that minimizes the entropy in the induced partitions. Each bin range is named automatically. The naming format of the range can be changed
using the range name type parameter. The values falling in the range of a bin are named according to the name of that range.
Once a boundary has been chosen, the method is applied recursively to both new partitions until the
stopping criterion is reached. For more information please study:
• Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning (Fayyad, Irani)
• Supervised and Unsupervised Discretization of Continuous Features (Dougherty, Kohavi, Sahami)
This operator can automatically remove all attributes with only one range i.e. those attributes
which are not actually discretized since the entropy criterion is not fulfilled. This behavior can
be controlled by the remove useless parameter.
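The core step of this method, choosing the boundary that minimizes the entropy of the two induced partitions, can be sketched in Python as follows. This is a simplified illustration and not the operator's implementation. The function names and sample data are hypothetical, and both the recursion and the stopping criterion are omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (boundary, weighted_entropy) of the best single split, or None."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # entropy of both partitions, weighted by partition size
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        boundary = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbors
        if best is None or weighted < best[1]:
            best = (boundary, weighted)
    return best

values = [1, 2, 3, 10, 11, 12]          # hypothetical numerical attribute
labels = ["a", "a", "a", "b", "b", "b"]  # hypothetical label attribute
boundary, weighted_entropy = best_boundary(values, labels)
print(boundary)  # 6.5
```

An attribute for which no split is accepted ends up with a single range. That is the case the remove useless parameter addresses.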
Differentiation
• Discretize by Binning The Discretize By Binning operator creates bins in such a way that
the range of all bins is (almost) equal. See page 326 for details.
• Discretize by Frequency The Discretize By Frequency operator creates bins in such a way
that the number of unique values in all bins is (almost) equal. See page 335 for details.
• Discretize by Size The Discretize By Size operator creates bins in such a way that each
bin has user-specified size (i.e. number of examples). See page 340 for details.
• Discretize by User Specification This operator discretizes the selected numerical attributes into user-specified classes. See page 344 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. Please note that there should be at least one numerical attribute in the
input ExampleSet, otherwise the use of this operator does not make sense.
Output Ports
example set output (exa) The selected numerical attributes are converted into nominal attributes by discretization and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes on which the discretization
will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool, but beginners may need a detailed introduction to them. It is advisable to specify the regular expression through the edit and preview
regular expression menu. This menu lets you
try different expressions and preview the results simultaneously, which
helps in understanding how regular expressions work.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
remove useless (boolean) This parameter indicates if the useless attributes, i.e. attributes
containing only a single range, should be removed. If this parameter is set to true then all
those attributes that are not actually discretized since the entropy criterion is not fulfilled
are removed.
range name type (selection) This parameter is used for changing the naming format for range.
‘long’, ‘short’ and ‘interval’ formats are available.
automatic number of digits (boolean) This is an expert parameter. It is only available when
the range name type parameter is set to ‘interval’. It indicates if the number of digits should
be automatically determined for the range names.
number of digits (integer) This is an expert parameter. It is used to specify the minimum
number of digits used for the interval names.
Related Documents
• Discretize by Binning (page 326)
• Discretize by Frequency (page 335)
• Discretize by Size (page 340)
• Discretize by User Specification (page 344)
Tutorial Processes
Discretizing the ‘Sonar’ data set by entropy
The focus of this Example Process is the discretization procedure. For understanding the parameters related to attribute selection please study the Example Process of the Select Attributes
operator.
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that this data set has 60 regular attributes
(all of real type). The Discretize by Entropy operator is applied on it. The attribute filter type
parameter is set to ‘all’, thus all the numerical attributes will be discretized. The remove useless
parameter is set to true, thus attributes with only one range are removed from the ExampleSet.
Run the process and switch to the Results Workspace. You can see that the ‘Sonar’ data set has
been reduced to just 22 regular attributes. These numerical attributes have been discretized to
nominal attributes.
Figure 3.4: Tutorial process ‘Discretizing the ‘Sonar’ data set by entropy’.
Discretize by Frequency
This operator converts the selected numerical attributes into nominal attributes by discretizing each numerical attribute into a user-specified number of bins. Bins of equal frequency are generated automatically; the ranges of the bins may vary.
Description
This operator discretizes the selected numerical attributes to nominal attributes. The number
of bins parameter is used to specify the required number of bins. The number of bins can also be
specified by using the use sqrt of examples parameter. If the use sqrt of examples parameter is set
to true, then the number of bins is calculated as the square root of the number of examples with
non-missing values (calculated for every single attribute). This discretization is performed by
equal frequency binning, i.e. the thresholds of all bins are selected in a way that all bins contain
the same number of numerical values. Numerical values are assigned to the bin representing the
range segment covering the numerical value. Each range is named automatically. The naming
format for the range can be changed using the range name type parameter. Values falling in the
range of a bin are named according to the name of that range.
Other discretization operators are also available in RapidMiner. The Discretize By Frequency
operator creates bins in such a way that the number of unique values in all bins is (almost)
equal. In contrast, the Discretize By Binning operator creates bins in such a way that the range
of all bins is (almost) equal.
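The contrast between equal-range and equal-frequency binning, and the effect of the use sqrt of examples parameter, can be illustrated with a minimal Python sketch. This is not RapidMiner's implementation: the function names are hypothetical, the rounding of the square root is an assumption, and the rank-based assignment ignores tied values for simplicity.

```python
import math

def number_of_bins(values, requested_bins, use_sqrt_of_examples=False):
    """Bin count for one attribute; None marks a missing value."""
    non_missing = [v for v in values if v is not None]
    if use_sqrt_of_examples:
        # Assumption: round to the nearest integer, with at least one bin.
        return max(1, round(math.sqrt(len(non_missing))))
    return requested_bins

def equal_width_bin(value, lo, hi, bins):
    """Index of the equal-range bin that covers value (as in Discretize by Binning)."""
    if hi == lo:
        return 0
    index = int((value - lo) / (hi - lo) * bins)
    return min(index, bins - 1)  # the maximum falls into the last bin

def equal_frequency_bins(values, bins):
    """Bin index per value so that every bin holds (almost) equally many values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank * bins // len(values)
    return result

sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width = [equal_width_bin(v, min(sample), max(sample), 3) for v in sample]
freq = equal_frequency_bins(sample, 3)
print([width.count(b) for b in range(3)])  # [8, 0, 1]
print([freq.count(b) for b in range(3)])   # [3, 3, 3]
```

With one outlier in the sample, equal-range bins are badly unbalanced, while equal-frequency bins hold three values each but cover very different ranges.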
Differentiation
• Discretize by Binning The Discretize By Binning operator creates bins in such a way that
the range of all bins is (almost) equal. See page 326 for details.
• Discretize by Size The Discretize By Size operator creates bins in such a way that each
bin has user-specified size (i.e. number of examples). See page 340 for details.
• Discretize by Entropy The discretization is performed by selecting bin boundaries such
that the entropy is minimized in the induced partitions. See page 331 for details.
• Discretize by User Specification This operator discretizes the selected numerical attributes into user-specified classes. See page 344 for details.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.
Please note that there should be at least one numerical attribute in the input ExampleSet,
otherwise use of this operator does not make sense.
Output Ports
example set (exa) The selected numerical attributes are converted into nominal attributes by
discretization (frequency) and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners.
It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and it also allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attributes operator.
If this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attributes operator and only those attributes are selected that satisfy the
conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
use sqrt of examples (boolean) If set to true, the number of bins is determined by the square
root of the number of non-missing values instead of using the number of bins parameter.
number of bins (integer) This parameter is available only when the use sqrt of examples parameter is not set to true. This parameter specifies the number of bins which should be
used for each attribute.
range name type (selection) This parameter is used for changing the naming format for range.
‘long’, ‘short’ and ‘interval’ formats are available.
automatic number of digits (boolean) This is an expert parameter. It is only available when
the range name type parameter is set to ‘interval’. It indicates if the number of digits should
be automatically determined for the range names.
number of digits (integer) This is an expert parameter. It is used to specify the minimum
number of digits used for the interval names.
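The rules stated for the numeric condition parameter (a blank space after the comparison operator, no mixing of && and ||, nominal attributes always kept) can be sketched in Python as follows. This is an illustrative approximation, not RapidMiner's parser: the function names are hypothetical and the set of accepted operators is an assumption.

```python
import operator

# Assumption: these comparison operators are accepted.
_OPERATORS = {">": operator.gt, ">=": operator.ge,
              "<": operator.lt, "<=": operator.le, "=": operator.eq}

def parse_condition(condition):
    """Turn a condition such as '> 6 && < 11' into a predicate on one number."""
    if "&&" in condition and "||" in condition:
        raise ValueError("&& and || cannot be used together in one condition")
    if "||" in condition:
        combine, parts = any, condition.split("||")
    else:
        combine, parts = all, condition.split("&&")
    tests = []
    for part in parts:
        tokens = part.split()
        # '<5' yields a single token and is rejected; '< 5' yields two tokens.
        if len(tokens) != 2 or tokens[0] not in _OPERATORS:
            raise ValueError("cannot parse condition part: " + repr(part))
        tests.append((_OPERATORS[tokens[0]], float(tokens[1])))
    return lambda x: combine(op(x, bound) for op, bound in tests)

def select_by_numeric_condition(columns, condition):
    """Keep nominal attributes and numeric attributes whose values all pass."""
    test = parse_condition(condition)
    selected = []
    for name, values in columns.items():
        numeric = all(isinstance(v, (int, float)) for v in values)
        if not numeric or all(test(v) for v in values):
            selected.append(name)
    return selected

columns = {"a": [7, 8, 9], "b": [5, 8], "outlook": ["sunny", "rain"]}
print(select_by_numeric_condition(columns, "> 6"))  # ['a', 'outlook']
```

Attribute ‘b’ is dropped because one of its values is not greater than 6, while the nominal attribute ‘outlook’ is kept regardless of the condition.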
Related Documents
• Discretize by Binning (page 326)
• Discretize by Size (page 340)
• Discretize by Entropy (page 331)
• Discretize by User Specification (page 344)
Tutorial Processes
Discretizing the Temperature attribute of the ‘Golf’ data set by Frequency
Figure 3.5: Tutorial process ‘Discretizing the Temperature attribute of the ‘Golf’ data set by Frequency’.
The focus of this Example Process is the discretization (by frequency) procedure. For understanding the parameters related to attribute selection please study the Example Process of the
Select Attributes operator.
The ‘Golf’ data set is loaded using the Retrieve operator. The Discretize by Frequency operator
is applied on it. The ‘Temperature’ attribute is selected for discretization. The number of bins
parameter is set to 3. Run the process and switch to the Results Workspace. You can see that the
‘Temperature’ attribute has been changed from numerical to nominal form. The values of the
‘Temperature’ attribute have been divided into three ranges. Each range has an equal number
of unique values. You can see that ‘range1’ and ‘range3’ have 4 examples while ‘range2’ has
6 examples. But in ‘range2’ the ‘Temperature’ values 72 and 75 occur twice. Thus essentially 4
unique numerical values are present in ‘range2’.
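The counts described above can be reproduced with a small Python sketch. Two things are assumptions here: the Temperature values are those of the commonly distributed ‘Golf’ sample data, and the cut rule (cut after every n/3 sorted values, then extend a bin so that identical values are never split) is only one plausible rule that happens to match the reported result, not a description of RapidMiner's actual code.

```python
def equal_frequency_ranges(values, bins):
    """Split sorted values into `bins` groups without separating equal values."""
    ordered = sorted(values)
    n = len(ordered)
    ranges, start = [], 0
    for k in range(1, bins + 1):
        end = n if k == bins else max(start, k * n // bins)
        # Identical values cannot be split, so move the cut past any tie.
        while 0 < end < n and ordered[end] == ordered[end - 1]:
            end += 1
        ranges.append(ordered[start:end])
        start = end
    return ranges

# Assumed Temperature column of the 'Golf' data set (14 examples).
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]

ranges = equal_frequency_ranges(temperature, 3)
print([len(r) for r in ranges])       # [4, 6, 4]
print([len(set(r)) for r in ranges])  # [4, 4, 4]
```

Under these assumptions ‘range2’ holds six examples but only four distinct values, because 72 and 75 each occur twice.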
Discretize by Size
This operator converts the selected numerical attributes into nominal attributes by discretizing the numerical attribute into bins of
user-specified size. Thus each bin contains a user-defined number
of examples.
Description
This operator discretizes the selected numerical attributes to nominal attributes. The size of
bins parameter is used for specifying the required size of bins. This discretization is performed
by binning examples into bins containing the same, user-specified number of examples. Each
bin range is named automatically. The naming format of the range can be changed by using the
range name type parameter. The values falling in the range of a bin are named according to the
name of that range.
It should be noted that if the number of examples is not evenly divisible by the requested number of examples per bin, the actual result may slightly differ from the requested bin size. Similarly, if a range of examples cannot be split, because the numerical values are identical within
this set, only all or none can be assigned to a bin. This may lead to further deviations from the
requested bin size.
This operator is closely related to the Discretize By Frequency operator. There you have to
specify the number of bins you need (say x) and the operator automatically creates them with an
almost equal number of values (i.e. n/x values where n is the total number of values). In the
Discretize by Size operator you have to specify the number of values you need in each bin (say
y) and the operator automatically creates n/y bins with y values.
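The relationship between bin size and bin count, and the two deviations mentioned above (a remainder when n is not divisible by the bin size, and runs of identical values that cannot be split), can be sketched in Python. The function name and the tie-handling rule (an unsplittable run is always added to the current bin) are assumptions for illustration only.

```python
def bins_by_size(values, size_of_bins, descending=False):
    """Group sorted values into bins of roughly size_of_bins examples each."""
    ordered = sorted(values, reverse=descending)
    n = len(ordered)
    bins, start = [], 0
    while start < n:
        end = min(start + size_of_bins, n)
        # Assumption: a run of identical values is kept together in one bin.
        while end < n and ordered[end] == ordered[end - 1]:
            end += 1
        bins.append(ordered[start:end])
        start = end
    return bins

# n = 12, y = 4: exactly n / y = 3 bins with 4 values each.
print([len(b) for b in bins_by_size(range(1, 13), 4)])  # [4, 4, 4]

# n = 10, y = 4: the last bin is smaller than requested.
print([len(b) for b in bins_by_size(range(1, 11), 4)])  # [4, 4, 2]

# Identical values cannot be split, so the first bin grows beyond 3.
print([len(b) for b in bins_by_size([1, 2, 3, 3, 3, 3, 4, 5], 3)])  # [6, 2]
```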
Differentiation
• Discretize by Binning The Discretize By Binning operator creates bins so their range is
(almost) equal. See page 326 for details.
• Discretize by Frequency The Discretize By Frequency operator creates bins so the number of unique values in all bins is (almost) equal. See page 335 for details.
• Discretize by Entropy The discretization is performed by selecting bin boundaries so the
entropy is minimized in the induced partitions. See page 331 for details.
• Discretize by User Specification This operator discretizes the selected numerical attributes into user-specified classes. See page 344 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. Please note that there should be at least one numerical attribute in the
input ExampleSet, otherwise the use of this operator does not make sense.
Output Ports
example set output (exa) The selected numerical attributes are converted into nominal attributes by discretization and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. It allows selection of all the attributes of a particular block type. When this option is selected some
other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all attributes of the ExampleSet which
don’t contain a missing value in any example. Attributes that have even a single missing value are removed.
• numeric value filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the mentioned numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numerical condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes on which the discretization
will take place; all other attributes will remain unchanged.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview
regular expression menu. This menu gives a good idea of regular expressions and
also allows you to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all the selected attributes are unselected and previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
size of bins (integer) This parameter specifies the required size of bins i.e. number of examples contained in a bin.
sorting direction (selection) This parameter indicates if the values should be sorted in increasing or decreasing order.
range name type (selection) This parameter is used for changing the naming format for range.
‘long’, ‘short’ and ‘interval’ formats are available.
automatic number of digits (boolean) This is an expert parameter. It is only available when
the range name type parameter is set to ‘interval’. It indicates if the number of digits should
be automatically determined for the range names.
number of digits (integer) This is an expert parameter. It is used to specify the minimum
number of digits used for the interval names.
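How the regular expression, except regular expression and invert selection parameters described above interact can be sketched in Python. The function and attribute names are hypothetical, and two points are assumptions: that an expression has to match the whole attribute name, and that Python's `re` syntax is close enough to the syntax RapidMiner accepts for simple patterns.

```python
import re

def select_attributes(names, regular_expression,
                      except_regular_expression=None, invert_selection=False):
    """Return the attribute names that pass the regular-expression filter."""
    selected = []
    for name in names:
        keep = re.fullmatch(regular_expression, name) is not None
        # The except expression removes names even if they matched the first one.
        if keep and except_regular_expression is not None:
            keep = re.fullmatch(except_regular_expression, name) is None
        if invert_selection:
            keep = not keep  # acts as a NOT gate on the selection
        if keep:
            selected.append(name)
    return selected

names = ["attribute_1", "attribute_2", "attribute_10", "class"]
print(select_attributes(names, "attribute_.*", ".*_10"))
# ['attribute_1', 'attribute_2']
print(select_attributes(names, "attribute_.*", ".*_10", invert_selection=True))
# ['attribute_10', 'class']
```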
Related Documents
• Discretize by Binning (page 326)
• Discretize by Frequency (page 335)
• Discretize by Entropy (page 331)
• Discretize by User Specification (page 344)
Tutorial Processes
Discretizing the Temperature attribute of the ‘Golf’ data set
The focus of this Example Process is the discretization procedure. For understanding the parameters related to attribute selection please study the Example Process of the Select Attributes
operator.
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the ‘Temperature’ attribute is a numerical attribute. The Discretize by Size operator is applied on it. The ‘Temperature’ attribute is
selected for discretization. The size of bins parameter is set to 5. Run the process and switch
to the Results Workspace. You can see that the ‘Temperature’ attribute has been changed from
numerical to nominal form. The values of the ‘Temperature’ attribute have been divided into
three ranges. Each range has an equal number of unique values. You can see that ‘range1’ and
‘range3’ have 4 examples while ‘range2’ has 6 examples. Not all bins contain exactly the same number of
examples because 14 examples cannot be divided evenly into bins of 5. But in ‘range2’ the ‘Temperature’ values 72 and 75 occur twice. Thus essentially 4 unique numerical values are present
in ‘range2’.
Figure 3.6: Tutorial process ‘Discretizing the Temperature attribute of the ‘Golf’ data set’.
Discretize by User Specification
This operator discretizes the selected numerical attributes into
user-specified classes. The selected numerical attributes will be
changed to nominal attributes.
Description
This operator discretizes the selected numerical attributes to nominal attributes. The numerical
values are mapped to the classes according to the thresholds specified by the user in the classes
parameter. The user can define the classes by specifying the upper limit of each class. The lower
limit of every class is automatically defined as the upper limit of the previous class. The lower
limit of the first class is assumed to be negative infinity. ‘Infinity’ can be used to specify positive
infinity as upper limit in the classes parameter. This is usually done in the last class. If a class
is named as ‘?’, the numerical values falling in this class will be replaced by unknown values in
the resulting attributes.
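The mapping described above can be sketched in Python as a lookup over sorted upper limits. The function name and class names are hypothetical. Two behaviours are assumptions, since the text does not state them: a value equal to an upper limit is placed in that class, and a value above the last limit is treated as missing.

```python
import bisect
import math

def discretize_by_user_specification(values, classes):
    """Map numbers to class names; classes is a list of (name, upper_limit)."""
    ordered = sorted(classes, key=lambda c: c[1])
    limits = [upper for _, upper in ordered]
    result = []
    for value in values:
        if value is None:          # missing values stay missing
            result.append(None)
            continue
        # Index of the first class whose upper limit is >= value.
        index = bisect.bisect_left(limits, value)
        name = ordered[index][0] if index < len(limits) else None
        # A class named '?' turns its values into unknown values.
        result.append(None if name == "?" else name)
    return result

classes = [("low", 70.0), ("medium", 80.0), ("high", math.inf)]
print(discretize_by_user_specification([64, 70, 75, 85], classes))
# ['low', 'low', 'medium', 'high']

print(discretize_by_user_specification([-1, 5], [("?", 0.0), ("ok", math.inf)]))
# [None, 'ok']
```

The first class has no explicit lower limit, which corresponds to negative infinity, and `math.inf` plays the role of ‘Infinity’ for the last class.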
Differentiation
• Discretize by Binning The Discretize By Binning operator creates bins in such a way that
the range of all bins is (almost) equal. See page 326 for details.
• Discretize by Frequency The Discretize By Frequency operator creates bins in such a way
that the number of unique values in all bins is (almost) equal. See page 335 for details.
• Discretize by Size The Discretize By Size operator creates bins in such a way that each
bin has user-specified size (i.e. number of examples). See page 340 for details.
• Discretize by Entropy The discretization is performed by selecting bin boundaries such
that the entropy is minimized in the induced partitions. See page 331 for details.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator
in the attached Example Process. The output of other operators can also be used as input.
Meta data must be attached to the input data because attributes
are specified in their meta data. The Retrieve operator provides meta data along with the data.
Note that there should be at least one numerical attribute in the input ExampleSet, otherwise use of this operator does not make sense.
Output Ports
example set (exa) The selected numerical attributes are converted into nominal attributes according to the user specified classes and the resultant ExampleSet is delivered through this
port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that
block types may be hierarchical. For example value_series_start and value_series_end
block types both belong to the value_series block type. When this option is selected
some other parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the given numeric condition are selected. Please note that all
nominal attributes are also selected irrespective of the given numeric condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list. Attributes can be shifted
to the right list, which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners.
It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and it also allows you
to try different expressions and preview the results simultaneously.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first regular
expression (regular expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will not be selected even if
they match the previously mentioned type i.e. value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will not be selected
even if they match the previously mentioned block type i.e. the block type parameter’s value.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
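The numeric condition syntax described above can be illustrated with a small sketch. This is not RapidMiner's implementation; the function names parse_condition and keeps_attribute are hypothetical, and the set of accepted comparison operators is an assumption based on the examples in the text.

```python
import operator
import re

# Comparison operators accepted by this sketch (an assumed set; the text
# shows '>', '<' and '<=' and mentions '=').
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


def parse_condition(condition):
    """Return a predicate for a condition such as '> 6 && < 11'."""
    has_and = "&&" in condition
    has_or = "||" in condition
    if has_and and has_or:
        raise ValueError("&& and || cannot be used together in one condition")
    tests = []
    for part in re.split(r"&&|\|\|", condition):
        tokens = part.split()  # a blank is required between operator and number
        if len(tokens) != 2 or tokens[0] not in _OPS:
            raise ValueError(f"cannot parse {part.strip()!r}")
        tests.append((_OPS[tokens[0]], float(tokens[1])))
    combine = any if has_or else all
    return lambda value: combine(op(value, limit) for op, limit in tests)


def keeps_attribute(values, condition):
    """A numeric attribute is kept only if every example satisfies the condition."""
    test = parse_condition(condition)
    return all(test(value) for value in values)
```

For instance, keeps_attribute([7, 8, 10], "> 6 && < 11") holds, while a single value of 12 in the list would cause the attribute to be dropped. Nominal attributes are not handled here because, as stated above, they are always kept.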
include special attributes (boolean) Special attributes are attributes with special roles
which identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attributes operator. If this parameter is set to true, special attributes
are also tested against the conditions specified in the Select Attributes operator and only those
attributes are selected that satisfy the conditions.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. All the selected attributes are unselected and the previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
classes This is the most important parameter of this operator. It is used to specify the classes
into which the numerical values will be mapped. The names and upper limits of the classes
are specified here. The numerical values are mapped to the classes according to the defined
thresholds. The user can define the classes by specifying the upper limit of each class. The
lower limit of every class is automatically defined as the upper limit of the previous class.
The lower limit of the first class is assumed to be negative infinity. ‘Infinity’ can be used
to specify positive infinity as upper limit in the classes parameter. This is usually done in
the last class. If a class is named as ‘?’, the numerical values falling in this class will be
replaced by unknown values in the resulting attributes.
Related Documents
• Discretize by Binning (page 326)
• Discretize by Frequency (page 335)
• Discretize by Size (page 340)
• Discretize by Entropy (page 331)
Tutorial Processes
Discretizing numerical attributes of the Golf data set
The focus of this Example Process is the classes parameter. Almost all parameters other than
the classes parameter are for selection of attributes on which discretization is to be performed.
For understanding these parameters please study the Example Process of the Select Attributes
operator.
[Process diagram: Retrieve → Discretize]
Figure 3.7: Tutorial process ‘Discretizing numerical attributes of the Golf data set’.
The ‘Golf’ data set is loaded using the Retrieve operator. The Discretize by User Specification
operator is applied on it. The ‘Temperature’ and ‘Humidity’ attributes are selected for discretization. As you can see in the classes parameter, four classes have been specified. The values from
negative infinity to 70 will be mapped to the ‘low’ class. The values above 70 up to 80 will be mapped
to the ‘average’ class. The values above 80 up to 90 will be mapped to the ‘high’ class. The values above 90
will be considered as unknown (missing) values. This can be verified by running the process and
viewing the results in the Results Workspace. Note that the values of the ‘Humidity’ attribute were 96
and 95 in Row No. 4 and 8 respectively. In the discretized attributes these values are replaced
by unknown values because of the last class defined in the classes parameter.
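The mapping performed by the classes parameter in this tutorial can be sketched as follows. This is an illustration only, not the operator's code; the name discretize is hypothetical, and treating each upper limit as inclusive is an assumption read from the wording ‘above 70 to 80’.

```python
import bisect
import math

# Class names and upper limits as described for the tutorial process.
# The class named '?' stands for unknown (missing) values.
CLASSES = [("low", 70.0), ("average", 80.0), ("high", 90.0), ("?", math.inf)]


def discretize(value, classes=CLASSES):
    """Map a numeric value to the first class whose upper limit is >= value."""
    if value is None or math.isnan(value):
        return None  # missing values stay missing
    limits = [upper for _, upper in classes]  # must be in ascending order
    index = bisect.bisect_left(limits, value)
    if index == len(classes):
        return None  # value lies above the last upper limit
    name = classes[index][0]
    return None if name == "?" else name
```

With these limits, a ‘Humidity’ value of 85 falls into ‘high’, whereas 95 and 96 fall into the ‘?’ class and therefore become unknown.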
3.3 Missing
Declare Missing Value
This operator declares the specified values of the selected attributes as missing values.
Description
The Declare Missing Value operator replaces the specified values of the selected attributes by
Double.NaN, thus these values will become missing values. These values will be treated as missing values by the subsequent operators. The desired values can be selected through nominal,
numeric or regular expression mode. This behavior can be controlled by the mode parameter.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input.
Output Ports
example set output (exa) The specified values of the selected attributes are replaced by missing values and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When it is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option works like the value_type option. It
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the given numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numeric condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of the attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list, which is the list of selected attributes in which the specified values will be
declared as missing; all other attributes will remain unchanged.
regular expression (string) The attributes whose name matches this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview
regular expression menu. This menu gives a good idea of regular expressions. It
also allows you to try different expressions and preview the results simultaneously, which
will improve your understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. All the selected attributes are unselected and the previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
mode (selection) This parameter specifies the type of the value that should be set to missing value. The type can be nominal or numeric or it can be specified through a regular
expression.
numeric value (real) This parameter specifies the numerical value that should be declared as
missing value.
nominal value (string) This parameter specifies the nominal value that should be declared as
missing value.
expression value (string) This parameter specifies the value that should be declared as missing value through an expression.
Tutorial Processes
Declaring a nominal value as missing value
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you
can have a look at the ExampleSet. You can see that the ‘Outlook’ attribute has three possible
values i.e. ‘sunny’, ‘rain’ and ‘overcast’. The Declare Missing Value operator is applied on this
ExampleSet to change the ‘overcast’ value of the ‘Outlook’ attribute to a missing value. The
attribute filter type parameter is set to ‘single’ and the attribute parameter is set to ‘Outlook’.
The mode parameter is set to ‘nominal’ and the nominal value parameter is set to ‘overcast’.
Run the process and compare the resultant ExampleSet with the original ExampleSet. You can
clearly see that the value ‘overcast’ has been replaced by missing values.
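The effect of this tutorial process on a single attribute can be sketched as below. The function declare_missing is hypothetical and only covers the nominal and numeric modes; the expression mode is left out because its matching rules are not spelled out here. NaN is used for missing values, mirroring the Double.NaN mentioned in the description.

```python
import math


def declare_missing(values, mode, target):
    """Return a copy of one attribute's values with matching entries set to NaN."""
    if mode == "nominal":
        target_value = str(target)
    elif mode == "numeric":
        target_value = float(target)
    else:
        raise ValueError(f"mode not covered by this sketch: {mode}")
    return [math.nan if value == target_value else value for value in values]


# The 'Outlook' attribute of a few Golf examples, with 'overcast' declared missing.
outlook = ["sunny", "overcast", "rain", "overcast"]
result = declare_missing(outlook, mode="nominal", target="overcast")
```

The input list is left untouched, which corresponds to the original ExampleSet still being available at the ori port.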
[Process diagram: Golf → Declare Missing Value]
Figure 3.8: Tutorial process ‘Declaring a nominal value as missing value’.
Fill Data Gaps
This operator fills the gaps (based on the ID attribute) in the given
ExampleSet by adding new examples in the gaps. The new examples
will have null values.
Description
The Fill Data Gaps operator fills the gaps (based on gaps in the ID attribute) in the given ExampleSet by adding new examples in the gaps. The new examples will have null values for all
attributes (except the id attribute), which can be replenished by operators like the Replace Missing Values operator. Ideally the ID attribute should be of integer type. This operator
performs the following steps:
• The data is sorted according to the ID attribute.
• All occurring distances between consecutive ID values are calculated.
• The greatest common divisor (GCD) of all distances is calculated.
• All rows which would have an ID value which is a multiple of the GCD but are missing are
added to the data set.
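The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the operator's implementation: the function fill_data_gaps is hypothetical, examples are represented as dictionaries, ids are unique integers, and the start, end and step size parameters are not modelled.

```python
from functools import reduce
from math import gcd


def fill_data_gaps(rows, id_key="id"):
    """Return the rows sorted by id, with empty rows added for missing ids."""
    # Step 1: sort the data according to the ID attribute.
    ordered = sorted(rows, key=lambda row: row[id_key])
    if len(ordered) < 2:
        return ordered
    ids = [row[id_key] for row in ordered]
    # Step 2: distances between consecutive ID values.
    distances = [b - a for a, b in zip(ids, ids[1:])]
    # Step 3: greatest common divisor of all distances.
    step = reduce(gcd, distances)
    if step == 0:  # all ids identical, nothing to fill
        return ordered
    # Step 4: add every missing id that lies on the GCD grid.
    by_id = {row[id_key]: row for row in ordered}
    other_attributes = [key for key in ordered[0] if key != id_key]
    filled = []
    for current in range(ids[0], ids[-1] + 1, step):
        if current in by_id:
            filled.append(by_id[current])
        else:
            filled.append({id_key: current, **dict.fromkeys(other_attributes)})
    return filled
```

Added rows carry None for every attribute except the id, which plays the role of the null values mentioned in the description.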
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input. It is essential that meta data should be attached with the data for the
input because attributes are specified in their meta data.
Output Ports
example set output (exa) The gaps in the ExampleSet are filled with new examples and the
resulting ExampleSet is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
use gcd for step size (boolean) This parameter indicates if the greatest common divisor (GCD)
should be calculated and used as the underlying distance between all data points.
step size (integer) This parameter is only available when the use gcd for step size parameter is
set to false. This parameter specifies the step size to be used for filling the gaps.
start (integer) This parameter can be used for filling the gaps at the beginning (if they occur)
before the first data point. For example, if the ID attribute of the given ExampleSet starts
with 3 and the start parameter is set to 1, then this operator will fill the gaps at the beginning by adding rows with ids 1 and 2.
end (integer) This parameter can be used for filling the gaps at the end (if they occur) after the
last data point. For example, if the ID attribute of the given ExampleSet ends with 100 and
the end parameter is set to 105, then this operator will fill the gaps at the end by adding
rows with ids 101 to 105.
Tutorial Processes
Introduction to the Fill Data Gaps operator
[Process diagram: Subprocess → Fill Data Gaps]
Figure 3.9: Tutorial process ‘Introduction to the Fill Data Gaps operator’.
This Example Process starts with the Subprocess operator which delivers an ExampleSet. A
breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that the
ExampleSet has 10 examples. Have a look at the id attribute of the ExampleSet. You will see
that certain ids are missing: ids 3, 6, 8 and 10. The Fill Data Gaps operator is applied on this
ExampleSet to fill these data gaps with examples that have the appropriate ids. You can see the
resultant ExampleSet in the Results Workspace. You can see that this ExampleSet has 14 examples. New examples with ids 3, 6, 8 and 10 have been added. But these examples have missing
values for all attributes (except the id attribute) which can be replenished by using operators
like the Replace Missing Values operator.
Impute Missing Values
This operator estimates values for the missing values of the selected attributes by applying a model learned for missing values.
Description
This is a nested operator i.e. it has a subprocess. This subprocess should always accept an ExampleSet and return a model. The Impute Missing Values operator estimates values for missing
values by learning models for each attribute (except the label) and applying those models to the
ExampleSet. The learner for estimating missing values should be placed in the subprocess of
this operator. Please note that depending on the ability of the inner learner to handle missing
values this operator might not be able to impute all missing values in some cases. This behavior
leads to a warning. It might hence be useful to combine this operator with a subsequent Replace
Missing Values operator.
Input Ports
example set in (exa) This input port expects an ExampleSet. It is the output of the Retrieve
operator in the attached Example Process. The output of other operators can also be used
as input. It is essential that meta data is attached to the input data because attributes are specified in their meta data. The Retrieve operator provides meta data
along with the data.
Output Ports
example set out (exa) The missing values in the ExampleSet are replaced by the values estimated by the given model and the resultant ExampleSet is output of this port.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes in which you want to replace
missing values. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if the meta data is not known. When this option is selected another
parameter (attributes) becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to the numeric type. Users should have a basic understanding of the type hierarchy when
selecting attributes through this option. When this option is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option works like the value_type option. It allows
selection of all the attributes of a particular block type. It should be noted that block
types may be hierarchical. For example the value_series_start and value_series_end block
types both belong to the value_series block type. When this option is selected some other
parameters (block type, use block type exception) become visible in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are not selected.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the given numeric condition are selected. Please note that all nominal attributes are also selected irrespective of the given numeric condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes.
regular expression (string) Attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners. It
is always good to specify the regular expression through the edit and preview regular expression
menu. This menu gives a good idea of regular expressions. It also allows you to try different
expressions and preview the results simultaneously, which will improve your understanding of
regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (the expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) Attributes matching this type will be removed from the final
output even if they matched the previously mentioned type i.e. the value type parameter’s value.
block type (selection) The block type of attributes to be selected can be chosen from a drop down
list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) Attributes matching this block type will be removed from the
final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes is
specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes and
all numeric attributes having a value of greater than 6 in every example. A combination
of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate: it reverses the
selection. All the selected attributes are unselected and the previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ will
be unselected and ‘att2’ will be selected.
include special attributes (boolean) Special attributes are attributes with special roles which
identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special
attributes are delivered to the output port irrespective of the conditions in the Select Attributes operator. If this parameter is set to true, special attributes are also tested against
the conditions specified in the Select Attributes operator and only those attributes are selected
that satisfy the conditions.
iterate (boolean) Set this parameter to true if you want to impute the missing values immediately (after having learned the corresponding concept) and iterate afterwards.
learn on complete cases (boolean) If this parameter is set to true, concepts are learned for
estimating missing values only on the basis of complete cases. This option should be used
when the inner learning approach cannot handle missing values.
order (selection) This parameter specifies the order of attributes in which missing values should
be estimated.
sort (selection) This parameter specifies the sort direction to be used in order strategy.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomization. Using the same value of the local random seed will produce the
same randomization.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
Tutorial Processes
Using the K-NN scheme for estimating missing values
[Process diagram: Labor-Negotiations → Impute Missing Values]
Figure 3.10: Tutorial process ‘Using the K-NN scheme for estimating missing values’.
The ‘Labor-Negotiations’ data set is loaded using the Retrieve operator. A breakpoint is inserted
here so that you can view the ExampleSet. You can see that there are numerous missing values
in this ExampleSet. The Impute Missing Values operator is applied on this ExampleSet for estimating missing values. Have a look at the subprocess of this operator. The K-NN operator is
applied there for estimating the missing values. The attribute filter type parameter is set to ‘all’,
thus missing values of all attributes will be estimated using the K-NN scheme. All parameters are
used with default values. The resultant ExampleSet can be seen in the Results Workspace. You
can see that there are no missing values in this ExampleSet because they have been estimated
using the K-NN scheme.
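The general idea of learning one model per attribute and applying it to the examples with missing values can be sketched as follows. This is a simplified illustration, not the operator's algorithm: all names are hypothetical, attributes are assumed to be numeric, a 1-nearest-neighbour function stands in for the inner learner, models are learned on complete cases only, and the iterate, order and sort parameters are not modelled.

```python
import math


def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def one_nn_learner(train_x, train_y):
    """Stand-in for the inner learner: returns a 1-NN prediction function."""
    def predict(x):
        nearest = min(range(len(train_x)),
                      key=lambda i: math.dist(train_x[i], x))
        return train_y[nearest]
    return predict


def impute_missing(rows, attributes, learner=one_nn_learner):
    """Learn one model per attribute on complete cases and fill what it can."""
    complete = [row for row in rows
                if not any(is_missing(row[a]) for a in attributes)]
    result = [dict(row) for row in rows]  # leave the input untouched
    if not complete:
        return result
    for target in attributes:
        inputs = [a for a in attributes if a != target]
        if not inputs:
            continue
        model = learner([[row[a] for a in inputs] for row in complete],
                        [row[target] for row in complete])
        for original, filled in zip(rows, result):
            usable = not any(is_missing(original[a]) for a in inputs)
            if is_missing(original[target]) and usable:
                filled[target] = model([original[a] for a in inputs])
    return result
```

Examples whose other attributes are missing as well are left as they are in this sketch, which is one reason a subsequent Replace Missing Values step can still be useful.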
Replace Infinite Values
This operator replaces infinite values of the selected attributes by
the specified replacements.
Description
The Replace Infinite Values operator replaces positive or negative infinite values by the specified replacements. The following replacements are available: none, zero, max_byte, max_int,
max_double and missing. The ‘max_byte’, ‘max_int’, ‘max_double’ replacements replace positive
infinity by the upper bound and negative infinity by the lower bound of the range of the byte,
int and double Java types respectively. If the ‘missing’ replacement is used then the infinite values
are replaced by NaN (not a number), which is internally used to represent missing values. These
missing values can be replenished by the Replace Missing Values operator. Different replacements can be specified for different attributes by using the columns parameter. If an attribute’s
name is not in the list of the columns parameter, the replacement specified by the default parameter is used.
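The replacement rules described above can be sketched as a small lookup table. This is an illustration only: the function replace_infinite is hypothetical, the default of ‘missing’ is chosen arbitrarily for the sketch, and taking the negated maximum as the lower bound of the double range is an assumption.

```python
import math
import sys

# (replacement for positive infinity, replacement for negative infinity)
REPLACEMENTS = {
    "zero": (0.0, 0.0),
    "max_byte": (127.0, -128.0),                # range of the Java byte type
    "max_int": (2147483647.0, -2147483648.0),   # range of the Java int type
    "max_double": (sys.float_info.max, -sys.float_info.max),  # assumed bounds
    "missing": (math.nan, math.nan),
}


def replace_infinite(row, default="missing", columns=None):
    """Replace infinite values in one example (a dict of attribute values)."""
    columns = columns or {}
    result = {}
    for name, value in row.items():
        choice = columns.get(name, default)  # per-attribute override, else default
        if isinstance(value, float) and math.isinf(value) and choice != "none":
            positive, negative = REPLACEMENTS[choice]
            value = positive if value > 0 else negative
        result[name] = value
    return result
```

Here the columns argument plays the role of the columns parameter: attributes not listed in it fall back to the default replacement.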
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Subprocess operator in the attached Example Process. The output of other operators can also
be used as input. It is essential that meta data should be attached with the data for the
input because attributes are specified in their meta data.
Output Ports
example set output (exa) The infinite values are replaced by the specified replacement and
the resultant ExampleSet is output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes in which you want to replace
infinite values. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to numeric type. User should have basic understanding of type hierarchy when
selecting attributes through this option. When this option is selected some other parameters (value type, use value type exception) become visible in Parameters panel.
• block_type This option is similar in working to value_type option. This option allows
selection of all the attributes of a particular block type. It should be noted that block
types may be hierarchical. For example value_series_start and value_series_end block
types both belong to value_series block type. When this option is selected some other
parameters (block type, use block type exception) become visible in Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are not selected.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in Parameters panel. All numeric attributes for which every example satisfies the specified numeric condition are selected. Please note that all nominal attributes are also selected irrespective of the given numeric condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes.
regular expression (string) Attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners. It
is always good to specify the regular expression through the edit and preview regular expression
menu. This menu gives a good idea of regular expressions. It also allows you to try different
expressions and preview the results simultaneously. This will improve your understanding of
regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in regular expression parameter).
value type (selection) Type of attributes to be selected can be chosen from drop down list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in Parameters panel.
except value type (selection) Attributes matching this type will be removed from the final
output even if they matched the previously mentioned type i.e. the value type parameter’s value.
block type (selection) Block type of attributes to be selected can be chosen from drop down
list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in Parameters panel.
except block type (selection) Attributes matching this block type will be removed from the
final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes is
specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes and
all numeric attributes having a value of greater than 6 in every example. A combination
of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all previously selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ is unselected
and ‘att2’ is selected.
include special attributes (boolean) Special attributes are attributes with special roles which
identify the examples. In contrast regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special
attributes are delivered to the output port irrespective of the conditions in the Select Attributes operator. If this parameter is set to true, special attributes are also tested against
the conditions specified in the Select Attributes operator and only those attributes are selected
that satisfy the conditions.
default (selection) This parameter specifies the replacement to apply to all attributes that
are not explicitly specified by the columns parameter. The following options are available:
none, zero, max_byte, max_int, max_double, missing, value.
columns (list) Different attributes can be provided with different types of replacements through
this parameter. The default replacement selected by the default parameter is applied on
attributes that are not explicitly mentioned in the columns parameter.
replenish what (selection) This parameter specifies if positive or negative infinity values should
be replaced.
replenishment value (real) This parameter is only available when the default parameter is
set to ‘value’. This value will be inserted instead of infinity.
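The rules given above for the numeric condition parameter can be illustrated with a short Python sketch. It is not RapidMiner's implementation; the function names parse_condition and select_attributes are hypothetical, and the dictionary of columns stands in for an ExampleSet. The sketch only mirrors the documented behaviour: a blank space is required after the comparison operator, && and || cannot be mixed, and nominal attributes are always kept.

```python
import re

# One comparison such as "> 6" or "<= 5": operator, blank space, number.
_TERM = re.compile(r"^(<=|>=|<|>|=)\s+(-?\d+(?:\.\d+)?)$")

_OPS = {
    ">": lambda v, t: v > t,
    ">=": lambda v, t: v >= t,
    "<": lambda v, t: v < t,
    "<=": lambda v, t: v <= t,
    "=": lambda v, t: v == t,
}


def parse_condition(condition):
    """Turn a condition string into a predicate over a single number."""
    has_and = "&&" in condition
    has_or = "||" in condition
    if has_and and has_or:
        raise ValueError("&& and || cannot be used together in one condition")
    separator = "&&" if has_and else "||"
    terms = []
    for part in condition.split(separator):
        match = _TERM.match(part.strip())
        if match is None:
            # Covers '<5' (no blank space) and parenthesised conditions.
            raise ValueError(f"cannot parse term: {part.strip()!r}")
        terms.append((_OPS[match.group(1)], float(match.group(2))))
    combine = any if has_or else all
    return lambda value: combine(test(value, t) for test, t in terms)


def select_attributes(columns, condition):
    """Keep nominal columns and numeric columns whose every value passes."""
    keep = parse_condition(condition)
    selected = []
    for name, values in columns.items():
        is_numeric = all(isinstance(v, (int, float)) for v in values)
        if not is_numeric or all(keep(v) for v in values):
            selected.append(name)
    return selected
```

For example, select_attributes({"a": [7, 8, 10], "b": [3, 9], "c": ["x", "y"]}, "> 6 && < 11") keeps the numeric column a and the nominal column c, and drops b.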
Tutorial Processes
Replacing infinite values by missing values
Figure 3.11: Tutorial process ‘Replacing infinite values by missing values’.
This Example Process starts with the Subprocess operator which delivers an ExampleSet. A
breakpoint is inserted here so that you can have a look at the ExampleSet. Have a look at the
Ratio attribute of the ExampleSet. You will see that it has a positive infinity value in the first
example. The Replace Infinite Values operator is applied on this ExampleSet to replace infinite
values by missing values. The default parameter is set to ‘missing’ and all other parameters are
used with default values. You can see the resultant ExampleSet in the Results Workspace. You
can see that the infinite values of the Ratio attribute have been replaced by missing values. These
missing values can be replenished by using operators like the Replace Missing Values operator.
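The effect of this tutorial process can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions: the function replace_infinite is hypothetical, a missing value is represented by NaN, and the option names "all", "positive" and "negative" for the replenish what behaviour are invented for the sketch, since the available options are not listed above.

```python
import math


def replace_infinite(values, replacement=math.nan, what="all"):
    """Return a copy of values with infinite entries replaced.

    what is "all", "positive" or "negative" (hypothetical option names).
    The input list is left unchanged, like the 'original' output port.
    """
    if what not in ("all", "positive", "negative"):
        raise ValueError(f"unknown option: {what!r}")
    result = []
    for value in values:
        is_positive = value == math.inf
        is_negative = value == -math.inf
        if (is_positive and what in ("all", "positive")) or (
            is_negative and what in ("all", "negative")
        ):
            result.append(replacement)
        else:
            result.append(value)
    return result


# A 'Ratio' column with a positive infinity in the first example.
ratio = [math.inf, 0.5, 2.0]
cleaned = replace_infinite(ratio)  # first entry becomes NaN (missing)
```

Passing replacement=0.0 corresponds to the ‘zero’ option, and any other number corresponds to the ‘value’ option together with the replenishment value parameter.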
Replace Missing Values
This operator replaces missing values in examples of selected attributes by a specified replacement.
Description
This operator replaces missing values in examples of selected attributes by a specified replacement. Missing values can be replaced by the minimum, maximum or average value of that attribute. They can also be replaced by zero or by any
specified replenishment value.
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used
as input. Meta data must be attached to the input data because the attributes are specified in the meta data. The Retrieve operator provides meta data
along with the data.
Output Ports
example set (exa) The ExampleSet with missing values replaced by specified replacement is
output of this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
create view (boolean) It is possible to create a View instead of changing the underlying data.
Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested
and the result is returned without changing the data.
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting attributes in which you want to replace
missing values. It has the following options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of ExampleSet are present in the list; required attributes can be easily selected. This
option will not work if meta data is not known. When this option is selected another
parameter becomes visible in Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to numeric type. User should have basic understanding of type hierarchy when
selecting attributes through this option. When this option is selected some other parameters (value type, use value type exception) become visible in Parameters panel.
• block_type This option is similar in working to value_type option. This option allows
selection of all the attributes of a particular block type. It should be noted that block
types may be hierarchical. For example value_series_start and value_series_end block
types both belong to value_series block type. When this option is selected some other
parameters (block type, use block type exception) become visible in Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are not selected.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in Parameters panel. All numeric attributes for which every example satisfies the specified numeric condition are selected. Please note that all nominal attributes are also selected irrespective of the given numeric condition.
attribute (string) The required attribute can be selected from this option. The attribute name
can be selected from the drop down box of the parameter attribute if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes.
regular expression (string) Attributes whose names match this expression will be selected.
Regular expressions are a very powerful tool but need a detailed explanation for beginners. It
is always good to specify the regular expression through the edit and preview regular expression
menu. This menu gives a good idea of regular expressions. It also allows you to try different
expressions and preview the results simultaneously. This will improve your understanding of
regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can
be specified. When this option is selected another parameter (except regular expression)
becomes visible in Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in regular expression parameter).
value type (selection) Type of attributes to be selected can be chosen from drop down list.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in Parameters panel.
except value type (selection) Attributes matching this type will be removed from the final
output even if they matched the previously mentioned type i.e. the value type parameter’s value.
block type (selection) Block type of attributes to be selected can be chosen from drop down
list.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in Parameters panel.
except block type (selection) Attributes matching this block type will be removed from the
final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes is
specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes and
all numeric attributes having a value of greater than 6 in every example. A combination
of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all previously selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ is unselected
and ‘att2’ is selected.
include special attributes (boolean) Special attributes are attributes with special roles which
identify the examples. In contrast regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special
attributes are delivered to the output port irrespective of the conditions in the Select Attributes operator. If this parameter is set to true, special attributes are also tested against
the conditions specified in the Select Attributes operator and only those attributes are selected
that satisfy the conditions.
default (selection) Function to apply to all columns that are not explicitly specified by the
columns parameter.
• none If this option is selected, no function is applied by default i.e. missing values
are not replaced by default.
• minimum If this option is selected, by default missing values are replaced by the
minimum value of that attribute.
• maximum If this option is selected, by default missing values are replaced by the
maximum value of that attribute.
• average If this option is selected, by default missing values are replaced by the average value of that attribute.
• zero If this option is selected, by default missing values are replaced by zero.
• value If this option is selected, by default missing values are replaced by the value
specified in the replenishment value parameter.
columns (list) Different attributes can be provided with different types of replacement through
this parameter. The default function selected by the default parameter is applied on attributes that are not explicitly mentioned in the columns parameter.
replenishment value (string) This parameter specifies the value that replaces missing values
when the ‘value’ option is selected.
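The interplay of the regular expression and except regular expression parameters described above can be sketched in Python. The function select_by_regex is hypothetical and the attribute names are only sample strings. The sketch assumes that an expression has to match the whole attribute name; the text above does not state whether partial matches count.

```python
import re


def select_by_regex(names, expression, except_expression=None):
    """Select names matching expression, minus those matching the exception.

    Assumption: both expressions must match the entire attribute name.
    """
    include = re.compile(expression)
    exclude = re.compile(except_expression) if except_expression else None
    selected = []
    for name in names:
        if include.fullmatch(name) is None:
            continue
        if exclude is not None and exclude.fullmatch(name) is not None:
            continue  # filtered out even though it matched the first expression
        selected.append(name)
    return selected


names = ["wage-inc-1st", "wage-inc-2nd", "wage-inc-3rd", "working-hours"]
all_wages = select_by_regex(names, r"wage-inc-.*")
first_two = select_by_regex(names, r"wage-inc-.*", except_expression=r".*3rd")
```

Here all_wages contains the three ‘wage-inc’ names, while first_two drops ‘wage-inc-3rd’ because it matches the exception.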
Tutorial Processes
Replacing missing values of the Labor Negotiations data set
Figure 3.12: Tutorial process ‘Replacing missing values of the Labor Negotiations data set’.
The focus of this process is to show the use of the default and columns parameters. All other
parameters are for selection of attributes on which replacement is to be applied. For understanding these parameters please study the Example Process of the Select Attributes operator.
The ‘Labor Negotiations’ data set is loaded using the Retrieve operator. A breakpoint is inserted at this point so that you can view the data before the application of the Replace Missing
Values operator. The Replace Missing Values operator is applied on it. The attribute filter type
parameter is set to ‘no missing values’ and the invert selection parameter is also checked, thus all
attributes with missing values are selected. In the columns parameter the ‘wage-inc-1st’, ‘wage-inc-2nd’, ‘wage-inc-3rd’ and ‘working hours’ attributes are set to ‘minimum’, ‘maximum’, ‘zero’
and ‘value’ respectively. The minimum value of the ‘wage-inc-1st’ attribute is 2.000, thus missing values are replaced with 2.000. The maximum value of the ‘wage-inc-2nd’ attribute is 7.000,
thus missing values are replaced with 7.000. Missing values of ‘wage-inc-3rd’ are replaced by 0.
The replenishment value parameter is set to 35, thus missing values of the ‘working hours’ attribute are set to 35. The default parameter is set to ‘average’, thus missing values of all other
attributes are replaced by the average value of that attribute.
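The logic of this tutorial process can be approximated in plain Python. This is a sketch, not the operator's code: replace_missing is a hypothetical function, None stands for a missing value, and the numbers in TOY_COLUMNS are invented for illustration and are not taken from the ‘Labor Negotiations’ data set. Attributes without missing values are skipped, which mirrors the combination of the ‘no missing values’ filter with invert selection.

```python
def replace_missing(columns, default="average", per_column=None,
                    replenishment_value=None):
    """Return new columns with None entries filled per column function.

    per_column plays the role of the columns parameter; attributes not
    listed there fall back to the default function.
    """
    per_column = per_column or {}
    result = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # Skip attributes with no missing values; also skip attributes with
        # no known value at all, where min/max/average are undefined.
        if not present or len(present) == len(values):
            result[name] = list(values)
            continue
        function = per_column.get(name, default)
        if function == "none":
            fill = None
        elif function == "minimum":
            fill = min(present)
        elif function == "maximum":
            fill = max(present)
        elif function == "average":
            fill = sum(present) / len(present)
        elif function == "zero":
            fill = 0
        elif function == "value":
            fill = replenishment_value
        else:
            raise ValueError(f"unknown function: {function!r}")
        result[name] = [fill if v is None else v for v in values]
    return result


# Invented toy values, not the real data set.
TOY_COLUMNS = {
    "wage-inc-1st": [2.0, None, 4.5],
    "wage-inc-2nd": [7.0, None, 3.0],
    "wage-inc-3rd": [None, 2.0, 2.5],
    "working hours": [40, None, 38],
    "duration": [1, None, 3],
}

filled = replace_missing(
    TOY_COLUMNS,
    default="average",
    per_column={
        "wage-inc-1st": "minimum",
        "wage-inc-2nd": "maximum",
        "wage-inc-3rd": "zero",
        "working hours": "value",
    },
    replenishment_value=35,
)
```

In filled, the ‘duration’ attribute is not listed in per_column, so its missing entry receives the average of its known values.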
3.4 Duplicates
Remove Duplicates
This operator removes duplicate examples from an ExampleSet by comparing all examples with each other on the basis of the specified attributes. Two examples are considered duplicate if the selected attributes have the same values in them.
Description
The Remove Duplicates operator removes duplicate examples from an ExampleSet by comparing
all examples with each other on the basis of the specified attributes. This operator removes
duplicate examples such that only one of all the duplicate examples is kept. Two examples are
considered duplicate if the selected attributes have the same values in them. Attributes can be
selected from the attribute filter type parameter and other associated parameters. Suppose two
attributes ‘att1’ and ‘att2’ are selected and ‘att1’ and ‘att2’ have three and two possible values
respectively. There are thus 6 (i.e. 3 x 2) unique combinations of these two attributes in total, so
the resultant ExampleSet can have 6 examples at most. This operator works on all attribute
types.
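The behaviour described above can be sketched in Python as follows. The function remove_duplicates is hypothetical, and examples are represented as dictionaries. The text only states that one of the duplicate examples is kept; keeping the first one encountered is an assumption of this sketch.

```python
from itertools import product


def remove_duplicates(examples, attributes):
    """Split examples into (kept, duplicates) by the selected attributes.

    Assumption: the first example of each value combination is kept.
    """
    seen = set()
    kept, duplicates = [], []
    for example in examples:
        key = tuple(example[a] for a in attributes)
        if key in seen:
            duplicates.append(example)
        else:
            seen.add(key)
            kept.append(example)
    return kept, duplicates


# 'att1' has three possible values and 'att2' has two; every combination
# appears twice, so 12 examples collapse to 3 x 2 = 6.
combinations = list(product(["x", "y", "z"], [0, 1])) * 2
examples = [
    {"row": i, "att1": a, "att2": b} for i, (a, b) in enumerate(combinations)
]
kept, duplicates = remove_duplicates(examples, ["att1", "att2"])
```

The two returned lists correspond to the example set output and duplicates ports; the input list itself is not modified, which matches the role of the original port.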
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input.
Output Ports
example set output (exa) The duplicate examples are removed from the given ExampleSet
and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
duplicates (dup) The duplicated examples from the given ExampleSet are delivered through
this port.
Parameters
attribute filter type (selection) This parameter allows you to select the attribute selection
filter; the method you want to use for selecting the required attributes. It has the following
options:
• all This option simply selects all the attributes of the ExampleSet. This is the default
option.
• single This option allows selection of a single attribute. When this option is selected
another parameter (attribute) becomes visible in the Parameters panel.
• subset This option allows selection of multiple attributes through a list. All attributes
of the ExampleSet are present in the list; required attributes can be easily selected.
This option will not work if the meta data is not known. When this option is selected
another parameter becomes visible in the Parameters panel.
• regular_expression This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
• value_type This option allows selection of all the attributes of a particular type. It
should be noted that types are hierarchical. For example real and integer types both
belong to numeric type. Users should have a basic understanding of type hierarchy
when selecting attributes through this option. When this option is selected some
other parameters (value type, use value type exception) become visible in the Parameters panel.
• block_type This option is similar in working to the value type option. This option
allows selection of all the attributes of a particular block type. When this option is
selected some other parameters (block type, use block type exception) become visible
in the Parameters panel.
• no_missing_values This option simply selects all the attributes of the ExampleSet
which don’t contain a missing value in any example. Attributes that have even a single
missing value are removed.
• numeric_value_filter When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples
all satisfy the specified numeric condition are selected. Please note that all nominal
attributes are also selected irrespective of the given numeric condition.
attribute (string) The desired attribute can be selected from this option. The attribute name
can be selected from the drop down box of attribute parameter if the meta data is known.
attributes (string) The required attributes can be selected from this option. This opens a new
window with two lists. All attributes are present in the left list and can be shifted to the
right list which is the list of selected attributes.
regular expression (string) The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview
regular expression menu. This menu gives a good idea of regular expressions. This menu
also allows you to try different expressions and preview the results simultaneously. This
will improve your understanding of regular expressions.
use except expression (boolean) If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible
in the Parameters panel.
except regular expression (string) This option allows you to specify a regular expression.
Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in the regular expression parameter).
value type (selection) The type of attributes to be selected can be chosen from a drop down
list. One of the following types can be chosen: nominal, text, binominal, polynominal,
file_path.
use value type exception (boolean) If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible
in the Parameters panel.
except value type (selection) The attributes matching this type will be removed from the final output even if they matched the previously mentioned type i.e. value type parameter’s
value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
block type (selection) The block type of attributes to be selected can be chosen from a drop
down list. The only possible value here is ‘single_value’.
use block type exception (boolean) If enabled, an exception to the selected block type can
be specified. When this option is selected another parameter (except block type) becomes
visible in the Parameters panel.
except block type (selection) The attributes matching this block type will be removed from
the final output even if they matched the previously mentioned block type.
numeric condition (string) The numeric condition for testing examples of numeric attributes
is specified here. For example the numeric condition ‘> 6’ will keep all nominal attributes
and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: ‘> 6 && < 11’ or ‘<= 5 || < 0’. But && and || cannot be used
together in one numeric condition. Conditions like ‘(> 0 && < 2) || (>10 && < 12)’ are
not allowed because they use both && and ||. Use a blank space after ‘>’, ‘=’ and ‘<’ e.g.
‘<5’ will not work, so use ‘< 5’ instead.
include special attributes (boolean) The special attributes are attributes with special roles
which identify the examples. In contrast regular attributes simply describe the examples.
Special attributes are: id, label, prediction, cluster, weight and batch.
invert selection (boolean) If this parameter is set to true, it acts as a NOT gate and reverses the
selection: all previously selected attributes are unselected and all previously unselected
attributes are selected. For example, if attribute ‘att1’ is selected and attribute ‘att2’ is
unselected before this parameter is checked, then after checking it ‘att1’ is unselected
and ‘att2’ is selected.
treat missing values as duplicates (boolean) This parameter specifies if missing values should
be treated as duplicates or not. If set to true, missing values are considered as duplicate
values.
Tutorial Processes
Removing duplicate values from the Golf data set on the basis of the Outlook and
Wind attributes
Figure 3.13: Tutorial process ‘Removing duplicate values from the Golf data set on the basis of the Outlook and Wind attributes’.
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the Outlook attribute has three possible
values i.e. ‘sunny’, ‘rain’ and ‘overcast’. The Wind attribute has two possible values i.e. ‘true’
and ‘false’. The Remove Duplicates operator is applied on this ExampleSet to remove duplicate
examples on the basis of the Outlook and Wind attributes. The attribute filter type parameter is
set to ‘value type’ and the value type parameter is set to ‘nominal’, thus two examples that have
the same values in their Outlook and Wind attributes are considered duplicates. Note that the
Play attribute is not selected although its value type is nominal because it is a special attribute
(it has the label role). To select attributes with special roles the include special attributes
parameter should be set to true. The Outlook and Wind attributes have 3 and 2 possible values respectively. Thus the resultant ExampleSet will have 6 examples at most i.e. one example
for each possible combination of attribute values. You can see the resultant ExampleSet in the
Results Workspace. You can see that it has 6 examples and all examples have a different combination of the Outlook and Wind attribute values.
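The attribute selection used in this tutorial, a value type filter that respects the type hierarchy and skips special attributes, can be sketched in Python. Everything here is illustrative: the function names are hypothetical, the PARENT table is a partial hierarchy assumed from the parameter descriptions above, and the concrete value types assigned to the Golf attributes are assumptions made for the example.

```python
# Partial, assumed type hierarchy: child type -> parent type.
PARENT = {
    "integer": "numeric",
    "real": "numeric",
    "binominal": "nominal",
    "polynominal": "nominal",
}


def is_a(value_type, target):
    """True if value_type equals target or descends from it."""
    while value_type is not None:
        if value_type == target:
            return True
        value_type = PARENT.get(value_type)
    return False


def select_by_value_type(attributes, target, include_special=False):
    """Select attribute names of the given type.

    attributes maps a name to (value type, role); role is None for
    regular attributes. Special attributes are skipped unless
    include_special is set.
    """
    selected = []
    for name, (value_type, role) in attributes.items():
        if role is not None and not include_special:
            continue
        if is_a(value_type, target):
            selected.append(name)
    return selected


# Assumed value types and roles for the Golf attributes.
GOLF = {
    "Outlook": ("polynominal", None),
    "Temperature": ("integer", None),
    "Humidity": ("integer", None),
    "Wind": ("binominal", None),
    "Play": ("binominal", "label"),
}

regular_nominal = select_by_value_type(GOLF, "nominal")
with_special = select_by_value_type(GOLF, "nominal", include_special=True)
```

With these assumptions regular_nominal contains only Outlook and Wind, and Play is added once include_special is set to true.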
3.5 Outliers
Detect Outlier (COF)
This operator identifies outliers in the given ExampleSet based on the Class Outlier Factors (COF).
Description
The main concept of the ECODB (Enhanced Class Outlier - Distance Based) algorithm is to rank
each instance in the ExampleSet given the parameters N (top N class outliers), and K (the number
of nearest neighbors). The rank of each instance is found using the formula:
COF = PCL(T,K) - norm(Deviation(T)) + norm(KDist(T))
• PCL(T,K) is the Probability of the Class Label of the instance T with respect to the class
labels of its K nearest neighbors.
• norm(Deviation(T)) and norm(KDist(T)) are the normalized values of Deviation(T) and KDist(T)
respectively and their values fall in the range [0, 1].
• Deviation(T) is how much the instance T deviates from instances of the same class. It is
computed by summing the distances between the instance T and every instance belonging
to the same class.
• KDist(T) is the sum of the distances between the instance T and its K nearest neighbors.
This operator adds a new boolean attribute named ‘outlier’ to the given ExampleSet. If the
value of this attribute is true, the example is an outlier; otherwise it is not. Another special attribute
‘COF Factor’ is also added to the ExampleSet. This attribute measures the degree to which an
example is a Class Outlier.
An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet. An outlying example is one that appears to deviate markedly from other examples of
the ExampleSet. Outliers are often (not always) indicative of measurement error. In this case
such examples should be discarded.
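The COF formula given above can be made concrete with a small Python sketch. It is not the operator's implementation, and several details are assumptions because the description leaves them open: distances are Euclidean, the K nearest neighbors of an instance exclude the instance itself, PCL is the fraction of those neighbors that share the instance's label, and normalization is min-max scaling over all instances. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def class_outlier_factors(points, labels, k):
    """COF = PCL(T,K) - norm(Deviation(T)) + norm(KDist(T)) per instance."""
    n = len(points)
    pcl, deviation, kdist = [], [], []
    for i in range(n):
        # Distances from instance i to every other instance, nearest first.
        others = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        nearest = others[:k]
        same_label = sum(1 for _, j in nearest if labels[j] == labels[i])
        pcl.append(same_label / k)
        kdist.append(sum(d for d, _ in nearest))
        deviation.append(sum(d for d, j in others if labels[j] == labels[i]))
    norm_dev, norm_kdist = min_max(deviation), min_max(kdist)
    return [p - d + kd for p, d, kd in zip(pcl, norm_dev, norm_kdist)]


# Two clusters; the last point carries label 'B' but sits inside cluster 'A'.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (0.5, 0.5)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
scores = class_outlier_factors(points, labels, k=3)
```

In this toy data the deliberately mislabeled last point receives the smallest score, because none of its nearest neighbors share its label and it lies far from the other instances of its class. How the operator turns the scores into the top N class outliers is not covered by this sketch.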
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can
also be used as input.
Output Ports
example set output (exa) A new boolean attribute ‘outlier’ and a real attribute ‘COF Factor’
are added to the given ExampleSet and the ExampleSet is delivered through this output port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
number of neighbors (integer) This parameter specifies the k value for the k nearest neighbors to be analyzed. The minimum and maximum values for this parameter are 1 and
1 million respectively.
number of class outliers (integer) This parameter specifies the number of top-n Class Outliers to be looked for. The resultant ExampleSet will have n number of examples that are
considered outliers. The minimum and maximum values for this parameter are 2 and 1
million respectively.
measure types (selection) This parameter is used for selecting the type of measure to be used
for measuring the distance between points. The following options are available: mixed measures, nominal measures, numerical measures and Bregman divergences.
mixed measure (selection) This parameter is available when the measure type parameter is
set to ‘mixed measures’. The only available option is the ‘Mixed Euclidean Distance’.
nominal measure (selection) This parameter is available when the measure type parameter
is set to ‘nominal measures’. This option cannot be applied if the input ExampleSet has
numerical attributes. In this case the ‘numerical measure’ option should be selected.
numerical measure (selection) This parameter is available when the measure type parameter is set to ‘numerical measures’. This option cannot be applied if the input ExampleSet
has nominal attributes. If the input ExampleSet has nominal attributes the ‘nominal measure’ option should be selected.
divergence (selection) This parameter is available when the measure type parameter is set to
‘Bregman divergences’.
kernel type (selection) This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’. The type of the kernel function is selected through
this parameter. The following kernel types are supported:
• dot The dot kernel is defined by k(x,y)=x*y, i.e. it is the inner product of x and y.
• radial The radial kernel is defined by exp(-g ||x-y||^2) where g is the gamma that is
specified by the kernel gamma parameter. The adjustable parameter gamma plays a
major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
• polynomial The polynomial kernel is defined by k(x,y)=(x*y+1)^d where d is the degree of the polynomial, specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
• neural The neural kernel is defined by a two layered neural net tanh(a x*y+b) where
a is alpha and b is the intercept constant. These parameters can be adjusted using the
kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the
data dimension. Note that not all choices of a and b lead to a valid kernel function.
• sigmoid This is the sigmoid kernel. Please note that the sigmoid kernel is not valid
under some parameters.
• anova This is the anova kernel. It has adjustable parameters gamma and degree.
• epachnenikov The Epanechnikov kernel is the function (3/4)(1-u^2) for u between -1
and 1 and zero for u outside that range. It has two adjustable parameters kernel sigma1
and kernel degree.
• gaussian_combination This is the gaussian combination kernel. It has adjustable
parameters kernel sigma1, kernel sigma2 and kernel sigma3.
• multiquadric The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2.
It has adjustable parameters kernel sigma1 and kernel shift.
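As an illustration only, the kernels that are given in closed form in the list above can be written as plain Python functions. The function names and default values below are hypothetical and are not RapidMiner identifiers. The sigmoid, anova and gaussian_combination kernels are omitted because no formula is stated for them here. The kernel_distance helper uses the standard kernel-induced distance sqrt(k(x,x) - 2k(x,y) + k(y,y)); that this matches the operator's ‘Kernel Euclidean Distance’ is an assumption.

```python
import math

def dot(x, y):
    """Dot kernel: inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def squared_distance(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def radial(x, y, gamma=1.0):
    """Radial kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * squared_distance(x, y))

def polynomial(x, y, degree=2):
    """Polynomial kernel: (x*y + 1)^degree."""
    return (dot(x, y) + 1) ** degree

def neural(x, y, a=1.0, b=0.0):
    """Neural kernel: tanh(a * x*y + b); not a valid kernel for every a and b."""
    return math.tanh(a * dot(x, y) + b)

def epanechnikov(u):
    """Epanechnikov function: (3/4)(1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - u * u) if -1 <= u <= 1 else 0.0

def multiquadric(x, y, c=1.0):
    """Multiquadric kernel: sqrt(||x - y||^2 + c^2)."""
    return math.sqrt(squared_distance(x, y) + c * c)

def kernel_distance(kernel, x, y):
    """Distance induced by a kernel in its feature space (assumed form)."""
    value = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(value, 0.0))  # guard against tiny negative rounding errors
```

With the dot kernel, kernel_distance reduces to the ordinary Euclidean distance, which is a convenient sanity check when experimenting with other kernels.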
kernel gamma (real) This is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to radial or anova.
kernel sigma1 (real) This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to epachnenikov, gaussian combination or multiquadric.
kernel sigma2 (real) This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to gaussian combination.
kernel sigma3 (real) This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to gaussian combination.
kernel shift (real) This is the SVM kernel parameter shift. This parameter is only available
when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the kernel
type parameter is set to multiquadric.
kernel degree (real) This is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to polynomial, anova or epachnenikov.
kernel a (real) This is the SVM kernel parameter a. This parameter is only available when
the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the kernel type
parameter is set to neural.
kernel b (real) This is the SVM kernel parameter b. This parameter is only available when
the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the kernel type
parameter is set to neural.
Tutorial Processes
Detecting outliers from an ExampleSet
The Generate Data operator is used for generating an ExampleSet. The target function parameter is set to ‘gaussian mixture clusters’. The number examples and number of attributes parameters are set to 200 and 2 respectively. A breakpoint is inserted here so that you can view the
ExampleSet in the Results Workspace. A good plot of the ExampleSet can be seen by switching
to the ‘Plot View’ tab. Set Plotter to ‘Scatter’, x-Axis to ‘att1’ and y-Axis to ‘att2’ to view the
scatter plot of the ExampleSet.
The Detect Outlier (COF) operator is applied on the ExampleSet. The number of neighbors
and number of class outliers parameters are both set to 7. The resultant ExampleSet can be viewed
in the Results Workspace. For better understanding, switch to the ‘Plot View’ tab. Set Plotter to
‘Scatter’, x-Axis to ‘att1’, y-Axis to ‘att2’ and Color Column to ‘outlier’ to view the scatter plot
of the ExampleSet (the outliers are marked red).
Figure 3.14: Tutorial process ‘Detecting outliers from an ExampleSet’.
373
3. Cleansing
Detect Outlier (Densities)
This operator identifies outliers in the given ExampleSet based on
the data density. All objects that have at least p proportion of all
objects farther away than distance D are considered outliers.
Description
The Detect Outlier (Densities) operator is an outlier detection algorithm that calculates the
DB(p,D)-outliers for the given ExampleSet. A DB(p,D)-outlier is an object which is at least D distance away from at least p proportion of all objects. The two real-valued parameters p and D can
be specified through the proportion and distance parameters respectively. The DB(p,D)-outliers
are distance-based outliers according to Knorr and Ng. This operator implements a global homogeneous outlier search.
This operator adds a new boolean attribute named ‘outlier’ to the given ExampleSet. If the
value of this attribute is true, the example is an outlier; otherwise it is not. Different distance functions are supported by this operator. The desired distance function can be selected by the distance function parameter.
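The DB(p,D) definition can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the operator's implementation: the function name db_outliers is hypothetical, Euclidean distance is assumed, and the proportion is taken relative to all objects including the object itself.

```python
import math

def db_outliers(points, p, d):
    """Return one boolean per point: True if it is a DB(p, d)-outlier.

    A point is flagged when at least a proportion p of all points
    lies at distance d or more from it (Euclidean distance assumed).
    """
    n = len(points)
    flags = []
    for i, x in enumerate(points):
        # Count the other points that are at least d away from x.
        far = sum(1 for j, y in enumerate(points)
                  if j != i and math.dist(x, y) >= d)
        flags.append(far >= p * n)
    return flags

# Four points in a tight square and one point far away from them.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(points, p=0.75, d=4.0))
# prints [False, False, False, False, True]
```

Because every pair of points is compared, the sketch needs a number of distance computations that grows quadratically with the number of examples.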
An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet. An outlying example is one that appears to deviate markedly from other examples of
the ExampleSet. Outliers are often (not always) indicative of measurement error. In this case
such examples should be discarded.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can
also be used as input.
Output Ports
example set output (exa) A new boolean attribute ‘outlier’ is added to the given ExampleSet
and the ExampleSet is delivered through this output port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
distance (real) This parameter specifies the distance D parameter for calculation of the DB(p,D)-outliers.
proportion (real) This parameter specifies the proportion p parameter for calculation of the
DB(p,D)-outliers.
distance function (selection) This parameter specifies the distance function that will be used
for calculating the distance between two examples.
Tutorial Processes
Detecting outliers from an ExampleSet
Figure 3.15: Tutorial process ‘Detecting outliers from an ExampleSet’.
The Generate Data operator is used for generating an ExampleSet. The target function parameter is set to ‘gaussian mixture clusters’. The number examples and number of attributes
parameters are set to 200 and 2 respectively. A breakpoint is inserted here so that you can view
the ExampleSet in the Results Workspace. A good plot of the ExampleSet can be seen by switching to the ‘Plot View’ tab. Set Plotter to ‘Scatter’, x-Axis to ‘att1’ and y-Axis to ‘att2’ to view the
scatter plot of the ExampleSet.
The Detect Outlier (Densities) operator is applied on the ExampleSet. The distance and proportion parameters are set to 4.0 and 0.8 respectively. The resultant ExampleSet can be viewed
in the Results Workspace. For better understanding, switch to the ‘Plot View’ tab. Set Plotter to
‘Scatter’, x-Axis to ‘att1’, y-Axis to ‘att2’ and Color Column to ‘outlier’ to view the scatter plot
of the ExampleSet (the outliers are marked red). The number of outliers may differ depending
on the randomization. If the random seed parameter of the process is set to 1997, you will see 5
outliers.
Detect Outlier (Distances)
This operator identifies n outliers in the given ExampleSet based
on the distance to their k nearest neighbors. The variables n and k
can be specified through parameters.
Description
This operator performs outlier search according to the outlier detection approach recommended
by Ramaswamy, Rastogi and Shim in “Efficient Algorithms for Mining Outliers from Large Data
Sets”. In their paper, a formulation for distance-based outliers is proposed that is based on the
distance of a point from its k-th nearest neighbor. Each point is ranked on the basis of its distance
to its k-th nearest neighbor and the top n points in this ranking are declared to be outliers. The
values of k and n can be specified by the number of neighbors and number of outliers parameters
respectively. This search is based on simple and intuitive distance-based definitions for outliers
by Knorr and Ng, which in simple words is: ‘A point p in a data set is an outlier with respect to
parameters k and d if no more than k points in the data set are at a distance of d or less from p’.
This operator adds a new boolean attribute named ‘outlier’ to the given ExampleSet. If the
value of this attribute is true, the example is an outlier; otherwise it is not. n examples will have
the value true in the ‘outlier’ attribute (where n is the value specified in the number of outliers
parameter). Different distance functions are supported by this operator. The desired distance
function can be selected by the distance function parameter.
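The ranking described above (distance to the k-th nearest neighbor, top n declared outliers) can be sketched as follows. This is a minimal brute-force version assuming Euclidean distance; the function name knn_distance_outliers is hypothetical, and the efficient partition-based algorithm from the cited paper is not reproduced here.

```python
import math

def knn_distance_outliers(points, k, n):
    """Flag the n points whose distance to their k-th nearest neighbor is largest.

    Requires 1 <= k <= len(points) - 1. Returns one boolean per point.
    """
    scores = []
    for i, x in enumerate(points):
        # Sorted distances from x to every other point.
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        scores.append(dists[k - 1])  # distance to the k-th nearest neighbor
    # Rank points by score, largest first, and keep the top n indices.
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n])
    return [i in top for i in range(len(points))]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (-8, 5)]
print(knn_distance_outliers(points, k=2, n=2))
# prints [False, False, False, False, True, True]
```

Note that exactly n flags are always set to true, regardless of how unusual the flagged points actually are; this mirrors the behavior of the number of outliers parameter.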
An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet. An outlying example is one that appears to deviate markedly from other examples of
the ExampleSet. Outliers are often (not always) indicative of measurement error. In this case
such examples should be discarded.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can
also be used as input.
Output Ports
example set output (exa) A new boolean attribute ‘outlier’ is added to the given ExampleSet
and the ExampleSet is delivered through this output port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
number of neighbors (integer) This parameter specifies the value of k; the distance of each example to its k-th nearest neighbor is analyzed. The minimum and maximum values for this parameter are 1 and
1 million respectively.
number of outliers (integer) This parameter specifies n, the number of top-ranked outliers to look
for. The resultant ExampleSet will have n examples that are considered outliers.
The minimum and maximum values for this parameter are 2 and 1 million respectively.
distance function (selection) This parameter specifies the distance function that will be used
for calculating the distance between two examples.
Tutorial Processes
Detecting outliers from an ExampleSet
Figure 3.16: Tutorial process ‘Detecting outliers from an ExampleSet’.
The Generate Data operator is used for generating an ExampleSet. The target function parameter is set to ‘gaussian mixture clusters’. The number examples and number of attributes
parameters are set to 200 and 2 respectively. A breakpoint is inserted here so that you can view
the ExampleSet in the Results Workspace. A good plot of the ExampleSet can be seen by switching to the ‘Plot View’ tab. Set Plotter to ‘Scatter’, x-Axis to ‘att1’ and y-Axis to ‘att2’ to view the
scatter plot of the ExampleSet.
The Detect Outlier (Distances) operator is applied on this ExampleSet. The number of neighbors and number of outliers parameters are set to 4 and 12 respectively. Thus 12 examples of
the resultant ExampleSet will have true value in the ‘outlier’ attribute. This can be verified by
viewing the ExampleSet in the Results Workspace. For better understanding switch to the ‘Plot
View’ tab. Set Plotter to ‘Scatter’, x-Axis to ‘att1’, y-Axis to ‘att2’ and Color Column to ‘outlier’
to view the scatter plot of the ExampleSet (the outliers are marked red).
Detect Outlier (LOF)
This operator identifies outliers in the given ExampleSet based on
local outlier factors (LOF). The LOF is based on a concept of a local
density, where locality is given by the k nearest neighbors, whose
distance is used to estimate the density. By comparing the local
density of an object to the local densities of its neighbors, one can
identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to
be outliers.
Description
This operator performs a LOF outlier search. LOF outliers or outliers with a local outlier factor
per object are density based outliers according to Breunig, Kriegel, et al. As indicated by the
name, the local outlier factor is based on a concept of a local density, where locality is given
by k nearest neighbors, whose distance is used to estimate the density. By comparing the local
density of an object to the local densities of its neighbors, one can identify regions of similar
density, and points that have a substantially lower density than their neighbors. These are considered to be outliers. The local density is estimated by the typical distance at which a point
can be ‘reached’ from its neighbors. The definition of ‘reachability distance’ used in LOF is an
additional measure to produce more stable results within clusters.
The approach to find the outliers is based on measuring the density of objects and their relation
to each other (referred to as local reachability density). Based on the average ratio of the local
reachability density of an object and its k-nearest neighbors (i.e. the objects in its k-distance
neighborhood), a local outlier factor (LOF) is computed. The approach takes a parameter MinPts
(actually specifying the ‘k’) and it uses the maximum LOFs for objects in a MinPts range (lower
bound and upper bound to MinPts).
This operator supports cosine, inverted cosine, angle and squared distance in addition to the
usual Euclidean distance; the choice is made through the distance function parameter. In the first
step, the objects are grouped into containers. For each object, using a radius screening of all
other objects, all the available distances between that object and another object (or group of
objects) on the same radius given by the distance are associated with a container. That container
then has the distance information as well as the list of objects within that distance (usually only
a few) and the information about how many objects are in the container.
In the second step, three things are done:
1. The containers for each object are counted in ascending order according to the cardinality
of the object list within the container (= that distance) to find the k-distances for each
object and the objects in that k-distance (all objects in all the subsequent containers with
a smaller distance).
2. Using this information, the local reachability densities are computed by using the maximum of the actual distance and the k-distance for each object pair (object and objects in
k-distance) and averaging it by the cardinality of the k-neighborhood and then taking the
reciprocal value.
3. The LOF is computed for each MinPts value in the range (actually for all up to upper bound)
by averaging the ratio between the MinPts-local reachability-density of all objects in the
k-neighborhood and the object itself. The maximum LOF in the MinPts range is passed as
final LOF to each object.
Afterwards LOFs are added as values for a special real-valued outlier attribute in the ExampleSet which the operator will return.
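The computation of k-distances, local reachability densities and LOF values can be sketched compactly in Python. This is a simplified illustration and not the operator's code: it replaces the container bookkeeping described above with a full distance matrix, assumes Euclidean distance and no duplicate points, and evaluates MinPts from the lower to the upper bound. The names lof_scores and max_lof are hypothetical.

```python
import math

def lof_scores(points, min_pts):
    """Local outlier factor of every point for a single MinPts value."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[min_pts - 1][0]                 # k-distance of point i
        k_dist.append(kd)
        neigh.append([j for d, j in others if d <= kd])  # neighborhood, ties kept
    lrd = []
    for i in range(n):
        # Reachability distance of i from neighbor j: max(k-distance(j), d(i, j)).
        reach = [max(k_dist[j], dist[i][j]) for j in neigh[i]]
        lrd.append(len(reach) / sum(reach))         # reciprocal of the average
    # LOF: average ratio of the neighbors' densities to the point's own density.
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]

def max_lof(points, lower, upper):
    """Maximum LOF of every point over MinPts in [lower, upper]."""
    per_k = [lof_scores(points, k) for k in range(lower, upper + 1)]
    return [max(column) for column in zip(*per_k)]
```

Values close to 1 indicate a point whose density is similar to that of its neighbors; values well above 1 indicate a point in a much sparser region than its neighbors.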
An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet. An outlying example is one that appears to deviate markedly from other examples of
the ExampleSet. Outliers are often (not always) indicative of measurement error. In this case
such examples should be discarded.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can
also be used as input.
Output Ports
example set output (exa) A new attribute ‘outlier’ is added to the given ExampleSet which
is then delivered through this output port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Parameters
minimal points lower bound (integer) This parameter specifies the lower bound for MinPts
for the Outlier test.
minimal points upper bound (integer) This parameter specifies the upper bound for MinPts
for the Outlier test.
distance function (selection) This parameter specifies the distance function that will be used
for calculating the distance between two objects.
Tutorial Processes
Detecting outliers from an ExampleSet
Figure 3.17: Tutorial process ‘Detecting outliers from an ExampleSet’.
The Generate Data operator is used for generating an ExampleSet. The target function parameter is set to ‘gaussian mixture clusters’. The number examples and number of attributes
parameters are set to 200 and 2 respectively. A breakpoint is inserted here so that you can view
the ExampleSet in the Results Workspace. A good plot of the ExampleSet can be seen by switching to the ‘Plot View’ tab. Set Plotter to ‘Scatter’, x-Axis to ‘att1’ and y-Axis to ‘att2’ to view the
scatter plot of the ExampleSet.
The Detect Outlier (LOF) operator is applied on this ExampleSet with default values for all
parameters. The minimal points lower bound and minimal points upper bound parameters are
set to 10 and 20 respectively. The resultant ExampleSet can be seen in the Results Workspace.
For better understanding switch to the ‘Plot View’ tab. Set Plotter to ‘Scatter’, x-Axis to ‘att1’,
y-Axis to ‘att2’ and Color Column to ‘outlier’ to view the scatter plot of the ExampleSet.
3.6 Dimensionality Reduction
Fourier Transformation
This operator uses the label as a function of each attribute and calculates the Fourier transformations as new attributes.
Description
The Fourier Transformation operator creates a new ExampleSet consisting of the result of a
Fourier transformation for each attribute of the input ExampleSet. This operator uses the label as a function of each attribute and calculates the Fourier transformations as new attributes.
The Fourier transformation is a mathematical transform with many applications in physics and
engineering. Very commonly, it expresses a mathematical function of time as a function of frequency, known as its frequency spectrum. The Fourier inversion theorem details this relationship. For instance, the transform of a musical chord made up of pure notes (without overtones)
expressed as amplitude as a function of time, is a mathematical representation of the amplitudes
and phases of the individual notes that make it up. The function of time is often called the time
domain representation, and the frequency spectrum the frequency domain representation. The
inverse Fourier transform expresses a frequency domain function in the time domain.
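The chord example can be made concrete with a small discrete Fourier transform written in plain Python. This sketch only illustrates the time-domain to frequency-domain relationship; it does not reproduce how the operator treats the label as a function of each attribute. The function names dft and amplitude_spectrum and the chosen frequencies are illustrative assumptions.

```python
import cmath
import math

def dft(signal):
    """Discrete Fourier transform of a real or complex sequence (O(n^2))."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def amplitude_spectrum(signal):
    """Amplitude of each frequency bin below the Nyquist frequency.

    The factor 2/n recovers the amplitude of a pure sinusoid; it is
    not the right scaling for the constant (k = 0) bin.
    """
    n = len(signal)
    return [2 * abs(c) / n for c in dft(signal)[: n // 2]]

# A "chord" of two pure notes: 3 and 7 cycles per window, amplitudes 1.0 and 0.5.
n = 64
chord = [1.0 * math.sin(2 * math.pi * 3 * t / n)
         + 0.5 * math.sin(2 * math.pi * 7 * t / n)
         for t in range(n)]
spectrum = amplitude_spectrum(chord)
print(round(spectrum[3], 6), round(spectrum[7], 6))
# prints 1.0 0.5
```

All other bins of the spectrum are numerically zero, which is the frequency-domain statement that the chord consists of exactly these two notes.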
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Generate
Data operator in the attached Example Process.
Output Ports
example set output (exa) The Fourier Transformation is performed and the resultant ExampleSet is returned through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
Tutorial Processes
Introduction to the Fourier Transformation operator
The Generate Data operator provides a sample ExampleSet. A breakpoint is inserted here so
that you can have a look at the ExampleSet. You can see that the ExampleSet has three real
attributes i.e. att1, att2 and att3. The Fourier Transformation operator is applied on this ExampleSet. The Fourier Transformation operator creates a new ExampleSet consisting of the result
of a Fourier transformation for each attribute of the given ExampleSet. This operator uses the
label as a function of each attribute and calculates the Fourier transformations as new attributes.
The resultant ExampleSet can be seen in the Results Workspace.
Figure 3.18: Tutorial process ‘Introduction to the Fourier Transformation operator’.
Generalized Hebbian Algorithm
This operator is an implementation of the Generalized Hebbian Algorithm (GHA) which is an iterative method for computing principal components. The user can specify manually the required number of principal components.
Description
The Generalized Hebbian Algorithm (GHA) is a linear feedforward neural network model for unsupervised learning with applications primarily in principal components analysis. From a computational point of view, it can be advantageous to solve the eigenvalue problem by iterative
methods which do not need to compute the covariance matrix directly. This is useful when the
ExampleSet contains many attributes (hundreds or even thousands).
Principal Component Analysis (PCA) is an attribute reduction procedure. It is useful when
you have obtained data on a number of attributes (possibly a large number of attributes), and
believe that there is some redundancy in those attributes. In this case, redundancy means that
some of the attributes are correlated with one another, possibly because they are measuring the
same construct. Because of this redundancy, you believe that it should be possible to reduce the
observed attributes into a smaller number of principal components (artificial attributes) that will
account for most of the variance in the observed attributes. Principal Component Analysis is a
mathematical procedure that uses an orthogonal transformation to convert a set of observations
of possibly correlated attributes into a set of values of uncorrelated attributes called principal
components. The number of principal components is less than or equal to the number of original
attributes. This transformation is defined in such a way that the first principal component’s
variance is as high as possible (accounts for as much of the variability in the data as possible),
and each succeeding component in turn has the highest variance possible under the constraint
that it should be orthogonal to (uncorrelated with) the preceding components.
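To make the iterative idea concrete, the following sketch applies the usual GHA update (Sanger's rule) to mean-centered data without ever forming the covariance matrix. It is a minimal illustration under stated assumptions, not the operator's implementation: the function name gha, its defaults, and the small random weight initialization are hypothetical, although the arguments loosely mirror the number of components, number of iterations and learning rate parameters.

```python
import random

def gha(data, n_components, n_iterations=200, learning_rate=0.01, seed=1):
    """Estimate principal directions with Sanger's rule.

    data is a list of equal-length numeric rows. Returns n_components
    weight vectors; row i approximates the i-th principal direction.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    means = [sum(row[j] for row in data) / len(data) for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in data]
    # Small random starting weights (an assumption of this sketch).
    w = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
         for _ in range(n_components)]
    for _ in range(n_iterations):
        for x in centered:
            # Project the example onto the current weight vectors.
            y = [sum(wi[j] * x[j] for j in range(dim)) for wi in w]
            new_w = []
            for i in range(n_components):
                # Part of x already explained by components 0..i.
                recon = [sum(y[k] * w[k][j] for k in range(i + 1))
                         for j in range(dim)]
                # Sanger's rule: w_i += lr * y_i * (x - reconstruction).
                new_w.append([w[i][j] + learning_rate * y[i] * (x[j] - recon[j])
                              for j in range(dim)])
            w = new_w
    return w
```

Each update touches only one example and the current weights, so memory use grows with the number of attributes times the number of components rather than with the square of the number of attributes. Components with very small variance converge slowly, so the number of iterations and the learning rate may need tuning.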
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input. It is essential that meta data be attached to the input data, because
attributes are specified in the meta data. The Retrieve operator provides meta data along
with the data. Please note that this operator cannot handle nominal attributes; it works
on numerical attributes.
Output Ports
example set (exa) The Generalized Hebbian Algorithm is performed on the input ExampleSet
and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the GHA model.
Parameters
number of components (integer) This parameter specifies the number of components to keep. If set to -1, the number of principal components in the resultant ExampleSet is equal to the number of attributes in the original ExampleSet.
number of iterations (integer) This parameter specifies the number of iterations to apply
the update rule.
learning rate (real) This parameter specifies the learning rate of the GHA.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomization.
local random seed (integer) This parameter specifies the local random seed. It is available
only if the use local random seed parameter is set to true.
Tutorial Processes
Dimensionality reduction of the Polynomial data set using the GHA operator
The ‘Polynomial’ data set is loaded using the Retrieve operator. A breakpoint is inserted here
so that you can have a look at the ExampleSet. You can see that the ExampleSet has 5 regular
attributes. The Generalized Hebbian Algorithm operator is applied on the ‘Polynomial’ data set.
The number of components parameter is set to 3. Thus the resultant ExampleSet will be composed of 3 principal components. All other parameters are used with default values. Run the
process and you will see that the ExampleSet that had 5 attributes has been reduced to an ExampleSet with 3 principal components.
Figure 3.19: Tutorial process ‘Dimensionality reduction of the Polynomial data set using the GHA operator’.
Independent Component Analysis
This operator performs the Independent Component Analysis
(ICA) of the given ExampleSet using the FastICA-algorithm of
Hyvärinen and Oja.
Description
Independent component analysis (ICA) is a very general-purpose statistical technique in which
observed random data are linearly transformed into components that are maximally independent from each other, and simultaneously have “interesting” distributions. Such a representation seems to capture the essential structure of the data in many applications, including feature
extraction. ICA is used for revealing hidden factors that underlie sets of random variables or
measurements. ICA is superficially related to principal component analysis (PCA) and factor
analysis. ICA is a much more powerful technique, however, capable of finding the underlying
factors or sources when these classic methods fail completely. This operator implements the
FastICA-algorithm of A. Hyvärinen and E. Oja. The FastICA-algorithm has most of the advantages of neural algorithms: It is parallel, distributed, computationally simple, and requires little
memory space.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. It is essential that meta data be attached to the input data,
because attributes are specified in the meta data. The Retrieve operator provides
meta data along with the data. Please note that this operator cannot handle nominal attributes; it works on numerical attributes.
Output Ports
example set output (exa) The Independent Component Analysis is performed on the input
ExampleSet and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
dimensionality reduction (selection) This parameter indicates which type of dimensionality reduction (reduction in number of attributes) should be applied.
• none if this option is selected, dimensionality reduction is not performed.
• fixed_number if this option is selected, only a fixed number of components are kept.
The number of components to keep is specified by the number of components parameter.
number of components (integer) This parameter is only available when the dimensionality
reduction parameter is set to ‘fixed number’. It specifies the number of components to keep.
algorithm type (selection) This parameter specifies the type of algorithm to be used.
• parallel If parallel option is selected, the components are extracted simultaneously.
• deflation If deflation option is selected, the components are extracted one at a time.
function (selection) This parameter specifies the functional form of the G function to be used
in the approximation to neg-entropy.
alpha (real) This parameter specifies the alpha constant in range [1, 2] which is used in approximation to neg-entropy.
row norm (boolean) This parameter indicates whether rows of the data matrix should be standardized beforehand.
max iteration (integer) This parameter specifies the maximum number of iterations to perform.
tolerance (real) This parameter specifies a positive scalar giving the tolerance at which the
un-mixing matrix is considered to have converged.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomization. Using the same value of local random seed will produce the same
randomization.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
Tutorial Processes
Dimensionality reduction of the Sonar data set using the Independent Component
Analysis operator
Figure 3.20: Tutorial process ‘Dimensionality reduction of the Sonar data set using the Independent Component Analysis operator’.
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the ExampleSet has 60 attributes. The
Independent Component Analysis operator is applied on the ‘Sonar’ data set. The dimensionality reduction parameter is set to ‘fixed number’ and the number of components parameter is set
to 10. Thus the resultant ExampleSet will be composed of 10 components (artificial attributes).
You can see the resultant ExampleSet in the Results Workspace and verify that it has only 10
attributes. Please note that these attributes are not original attributes of the ‘Sonar’ data set.
These attributes were created using the ICA procedure.
Principal Component Analysis
This operator performs a Principal Component Analysis (PCA) using the covariance matrix. The user can specify the amount of variance to cover in the original data while retaining the best number
of principal components. The user can also specify manually the
number of principal components.
Description
Principal component analysis (PCA) is an attribute reduction procedure. It is useful when you
have obtained data on a number of attributes (possibly a large number of attributes), and believe
that there is some redundancy in those attributes. In this case, redundancy means that some of
the attributes are correlated with one another, possibly because they are measuring the same
construct. Because of this redundancy, you believe that it should be possible to reduce the observed attributes into a smaller number of principal components (artificial attributes) that will
account for most of the variance in the observed attributes.
Principal Component Analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated attributes into a set of values of
uncorrelated attributes called principal components. The number of principal components is
less than or equal to the number of original attributes. This transformation is defined in such a
way that the first principal component’s variance is as high as possible (accounts for as much of
the variability in the data as possible), and each succeeding component in turn has the highest
variance possible under the constraint that it should be orthogonal to (uncorrelated with) the
preceding components.
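The transformation described above can be illustrated for the simplest case of two numerical attributes. The following is a minimal sketch, not the operator's implementation: the function names and the sample data are hypothetical, and the closed-form rotation angle only applies to a 2x2 covariance matrix.

```python
import math

def covariance_matrix_2d(xs, ys):
    """Return (var_x, cov_xy, var_y) using the sample (n - 1) normalisation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return var_x, cov_xy, var_y

def pca_2d(xs, ys):
    """Project two attributes onto their two principal components."""
    var_x, cov_xy, var_y = covariance_matrix_2d(xs, ys)
    # Rotation angle that diagonalises a symmetric 2x2 matrix; the first
    # rotated axis is the direction of highest variance.
    theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
    c, s = math.cos(theta), math.sin(theta)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    pc1 = [c * (x - mean_x) + s * (y - mean_y) for x, y in zip(xs, ys)]
    pc2 = [-s * (x - mean_x) + c * (y - mean_y) for x, y in zip(xs, ys)]
    return pc1, pc2

# Hypothetical sample data: two strongly correlated attributes.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
pc1, pc2 = pca_2d(xs, ys)
var_pc1, cov_pc1_pc2, var_pc2 = covariance_matrix_2d(pc1, pc2)
# cov_pc1_pc2 is zero up to rounding, and var_pc1 >= var_pc2.
```

Because the transformation is a rotation, the total variance of the two components equals the total variance of the two original attributes.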
Please note that PCA is sensitive to the relative scaling of the original attributes. Whenever different attributes have different units (like temperature and mass), PCA is a somewhat arbitrary method of analysis. Different results would be obtained if one used Fahrenheit rather than Celsius, for example.
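The effect of units can be checked with a small experiment. This sketch uses hypothetical data and helper names; it computes the share of total variance captured by the first principal component of two attributes, once with temperature in Celsius and once in Fahrenheit. Standardising both attributes first (an assumption of this example, not something this operator does for you) removes the dependence on units.

```python
import math

def first_component_share(xs, ys):
    """Fraction of total variance captured by the first principal component."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    d = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Larger eigenvalue of the 2x2 covariance matrix [[a, b], [b, d]].
    largest = 0.5 * (a + d) + 0.5 * math.sqrt((a - d) ** 2 + 4.0 * b ** 2)
    return largest / (a + d)

def standardise(values):
    """Rescale values to zero mean and unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]

# Hypothetical measurements.
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
mass_kg = [2.0, 1.0, 4.0, 3.0, 5.0]
fahrenheit = [c * 9.0 / 5.0 + 32.0 for c in celsius]

share_celsius = first_component_share(celsius, mass_kg)
share_fahrenheit = first_component_share(fahrenheit, mass_kg)

share_std_celsius = first_component_share(standardise(celsius), standardise(mass_kg))
share_std_fahrenheit = first_component_share(standardise(fahrenheit), standardise(mass_kg))
```

The two raw shares differ, while the two standardised shares agree up to rounding.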
Input Ports
example set (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input. Meta data must be attached to the input data because the attributes are specified in the meta data. The Retrieve operator provides meta data along with the data. Please note that this operator cannot handle nominal attributes; it works on numerical attributes.
Output Ports
example set (exa) The Principal Component Analysis is performed on the input ExampleSet
and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
dimensionality reduction (selection) This parameter indicates which type of dimensionality reduction (reduction in number of attributes) should be applied.
• none if this option is selected, no component is removed from the ExampleSet.
• keep_variance if this option is selected, all the components with a cumulative variance greater than the given threshold are removed from the ExampleSet. The threshold is specified by the variance threshold parameter.
• fixed_number if this option is selected, only a fixed number of components are kept.
The number of components to keep is specified by the number of components parameter.
variance threshold (real) This parameter is available only when the dimensionality reduction
parameter is set to ‘keep variance’. All the components with a cumulative variance greater
than the variance threshold are removed from the ExampleSet.
number of components (integer) This parameter is only available when the dimensionality
reduction parameter is set to ‘fixed number’. The number of components to keep is specified by the number of components parameter.
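The ‘keep variance’ option can be pictured as a cut through the cumulative proportion of variance. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the exact handling of the component that crosses the threshold is assumed here (the smallest number of leading components whose cumulative proportion reaches the threshold is kept), which may differ from the operator's behavior at the boundary.

```python
def components_to_keep(variances, variance_threshold):
    """Count the leading components needed to reach the cumulative variance threshold."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance / total
        if cumulative >= variance_threshold:
            return count
    return len(ordered)

# Hypothetical component variances; cumulative proportions are 0.5, 0.75, 0.875, 1.0.
variances = [8.0, 4.0, 2.0, 2.0]
kept_at_80_percent = components_to_keep(variances, 0.8)
kept_at_50_percent = components_to_keep(variances, 0.5)
```

With these variances, a threshold of 0.8 keeps three components and a threshold of 0.5 keeps one.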
Tutorial Processes
Dimensionality reduction of the Polynomial data set using the Principal Component
Analysis operator
Figure 3.21: Tutorial process ‘Dimensionality reduction of the Polynomial data set using the Principal Component Analysis operator’.
The ‘Polynomial’ data set is loaded using the Retrieve operator. The Covariance Matrix operator is applied on it. A breakpoint is inserted here so that you can have a look at the ExampleSet and its covariance matrix. The Covariance Matrix operator is applied only for this purpose; it is otherwise not required here. The Principal Component Analysis operator is applied on the ‘Polynomial’ data set. The dimensionality reduction parameter is set to ‘fixed number’ and the number of components parameter is set to 4. Thus the resultant ExampleSet will be composed of 4 principal components. As mentioned in the description, the principal components are uncorrelated with each other, so the covariance between any two of them should be zero. The Covariance Matrix operator is applied on the output of the Principal Component Analysis operator. You can see the covariance matrix of the resultant ExampleSet in the Results Workspace. As you can see, the covariance between the components is zero.
Principal Component Analysis (Kernel)
This operator performs Kernel Principal Component Analysis (PCA), which is a non-linear extension of PCA.
Description
Kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are done in a reproducing kernel Hilbert space with a non-linear mapping. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map. The result will be the set of data points in a non-linearly transformed space. Please note that in contrast to the usual linear PCA the kernel variant also works for large numbers of attributes but will become slow for large numbers of examples.
RapidMiner provides the Principal Component Analysis operator for applying linear PCA. Principal Component Analysis is a mathematical procedure that uses an orthogonal transformation
to convert a set of observations of possibly correlated attributes into a set of values of uncorrelated attributes called principal components. This transformation is defined in such a way that
the first principal component’s variance is as high as possible (accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance
possible under the constraint that it should be orthogonal to (uncorrelated with) the preceding
components.
Differentiation
• Principal Component Analysis Kernel principal component analysis (kernel PCA) is an
extension of principal component analysis (PCA) using techniques of kernel methods. In
contrast to the usual linear PCA the kernel variant also works for large numbers of attributes but will become slow for large number of examples. See page 387 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input. Meta data must be attached to the input data because the attributes are specified in the meta data. The Retrieve operator provides meta data along with the data. Please note that this operator cannot handle nominal attributes; it works on numerical attributes.
Output Ports
example set output (exa) The kernel-based Principal Component Analysis is performed on
the input ExampleSet and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has the information regarding the parameters of this operator in the current process.
Parameters
kernel type (selection) The type of the kernel function is selected through this parameter. The following kernel types are supported: dot, radial, polynomial, neural, anova, epachnenikov, gaussian combination, multiquadric
• dot The dot kernel is defined by k(x,y)=x*y, i.e. it is the inner product of x and y.
• radial The radial kernel is defined by exp(-g ||x-y||^2) where g is gamma, which is specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
• polynomial The polynomial kernel is defined by k(x,y)=(x*y+1)^d where d is the degree of the polynomial, which is specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
• neural The neural kernel is defined by a two-layered neural net tanh(a x*y+b) where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
• anova The anova kernel is defined by the summation of exp(-g (x-y)) raised to the power d, where g is gamma and d is degree. gamma and degree are adjusted by the kernel gamma and kernel degree parameters respectively.
• epachnenikov The epachnenikov kernel is the function (3/4)(1-u^2) for u between -1 and 1 and zero for u outside that range. It has two adjustable parameters, kernel sigma1 and kernel degree.
• gaussian_combination This is the gaussian combination kernel. It has adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
• multiquadric The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2. It has adjustable parameters kernel sigma1 and kernel shift.
kernel gamma (real) This is the kernel parameter gamma. This is only available when the
kernel type parameter is set to radial or anova.
kernel sigma1 (real) This is the kernel parameter sigma1. This is only available when the
kernel type parameter is set to epachnenikov, gaussian combination or multiquadric.
kernel sigma2 (real) This is the kernel parameter sigma2. This is only available when the
kernel type parameter is set to gaussian combination.
kernel sigma3 (real) This is the kernel parameter sigma3. This is only available when the
kernel type parameter is set to gaussian combination.
kernel shift (real) This is the kernel parameter shift. This is only available when the kernel
type parameter is set to multiquadric.
kernel degree (real) This is the kernel parameter degree. This is only available when the kernel type parameter is set to polynomial, anova or epachnenikov.
kernel a (real) This is the kernel parameter a. This is only available when the kernel type parameter is set to neural.
kernel b (real) This is the kernel parameter b. This is only available when the kernel type parameter is set to neural.
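Several of the kernel formulas listed for the kernel type parameter translate directly into code. The sketch below is only an illustration of those formulas, not the operator's implementation: all function and argument names are hypothetical, the anova and gaussian combination kernels are left out because the text does not give a complete formula for them, and for the epachnenikov kernel only the profile in u is shown because the text does not say how u is derived from x, y, kernel sigma1 and kernel degree.

```python
import math

def dot(x, y):
    """Dot kernel: the inner product of x and y."""
    return sum(a * b for a, b in zip(x, y))

def squared_distance(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def radial(x, y, gamma):
    """Radial kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * squared_distance(x, y))

def polynomial(x, y, degree):
    """Polynomial kernel: (x*y + 1)^degree."""
    return (dot(x, y) + 1.0) ** degree

def neural(x, y, a, b):
    """Neural kernel: tanh(a * x*y + b)."""
    return math.tanh(a * dot(x, y) + b)

def multiquadric(x, y, c):
    """Multiquadric kernel: sqrt(||x - y||^2 + c^2)."""
    return math.sqrt(squared_distance(x, y) + c ** 2)

def epanechnikov_profile(u):
    """(3/4)(1 - u^2) for u in [-1, 1], zero outside that range."""
    return 0.75 * (1.0 - u * u) if -1.0 <= u <= 1.0 else 0.0

# Hypothetical example vectors.
x = [1.0, 2.0]
y = [3.0, 4.0]
k_dot = dot(x, y)
k_radial = radial(x, y, gamma=0.5)
k_poly = polynomial(x, y, degree=2)
k_multiquadric = multiquadric(x, y, c=1.0)
```

Kernel PCA evaluates such a function for every pair of examples, which is consistent with the note in the description that the operator becomes slow for large numbers of examples.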
Related Documents
• Principal Component Analysis (page 387)
Tutorial Processes
Introduction to the Principal Component Analysis (Kernel) operator
Figure 3.22: Tutorial process ‘Introduction to the Principal Component Analysis (Kernel) operator’.
The ‘Polynomial’ data set is loaded using the Retrieve operator. A breakpoint is inserted here
so that you can have a look at the ExampleSet. You can see that the ExampleSet has 5 regular
attributes. The Principal Component Analysis (Kernel) operator is applied on this ExampleSet
with default values of all parameters. The kernel type parameter is set to ‘radial’ and the kernel
gamma parameter is set to 1.0. The resultant ExampleSet can be seen in the Results Workspace.
You can see that this ExampleSet has a different set of attributes.
Self-Organizing Map
This operator performs a dimensionality reduction of the given ExampleSet based on a self-organizing map (SOM). The user can specify the required number of dimensions.
Description
A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically
two-dimensional), discretized representation of the input space of the training samples, called
a map. Self-organizing maps are different from other artificial neural networks in the sense
that they use a neighborhood function to preserve the topological properties of the input space.
This makes SOMs useful for visualizing low-dimensional views of high-dimensional data, akin
to multidimensional scaling. The model was first described as an artificial neural network by
Teuvo Kohonen, and is sometimes called a Kohonen map.
Like most artificial neural networks, SOMs operate in two modes: training and mapping. Training builds the map using input examples. Mapping automatically classifies a new input vector.
A self-organizing map consists of components called nodes or neurons. Associated with each
node is a weight vector of the same dimension as the input data vectors and a position in the
map space. The usual arrangement of nodes is a regular spacing in a hexagonal or rectangular
grid. The self-organizing map describes a mapping from a higher dimensional input space to a
lower dimensional map space. The procedure for placing a vector from data space onto the map
is to first find the node with the closest weight vector to the vector taken from data space. Once
the closest node is located it is assigned the values from the vector taken from the data space.
While it is typical to consider this type of network structure as related to feed-forward networks where the nodes are visualized as being attached, this type of architecture is fundamentally different in arrangement and motivation.
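The mapping step described above, finding the node whose weight vector is closest to a data vector, can be sketched as follows. This is a minimal illustration under stated assumptions: the map is represented as a hypothetical dictionary from grid positions to weight vectors, closeness is measured with the squared Euclidean distance, and the weights are made up rather than trained.

```python
def best_matching_unit(weights, vector):
    """Return the grid position of the node whose weight vector is closest to `vector`."""
    def squared_distance(weight):
        return sum((w - v) ** 2 for w, v in zip(weight, vector))
    return min(weights, key=lambda position: squared_distance(weights[position]))

# Hypothetical 2x2 map over two-dimensional input data.
weights = {
    (0, 0): [0.0, 0.0],
    (0, 1): [1.0, 0.0],
    (1, 0): [0.0, 1.0],
    (1, 1): [1.0, 1.0],
}
position = best_matching_unit(weights, [0.9, 0.2])
```

The returned grid position is the low-dimensional representation of the input vector in map space.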
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. It is essential that meta data should be attached with the data for the
input because attributes are specified in their meta data. The Retrieve operator provides
meta data along with the data. Please note that this operator cannot handle nominal attributes; it works on numerical attributes.
Output Ports
example set output (exa) The dimensionality reduction of the given ExampleSet is performed
based on a self-organizing map and the resultant ExampleSet is delivered through this
port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
return preprocessing model (boolean) This parameter indicates if the preprocessing model
should be returned.
number of dimensions (integer) This parameter specifies the number of dimensions to keep
i.e. the number of attributes of the resultant ExampleSet.
net size (integer) This parameter specifies the size of the SOM net, by setting the length of
every edge of the net.
training rounds (integer) This parameter specifies the number of training rounds.
learning rate start (real) This parameter specifies the strength of an adaption in the first round. The strength decreases every round until it reaches the learning rate end in the last round.
learning rate end (real) This parameter specifies the strength of an adaption in the last round. The strength decreases to this value in the last round, beginning with learning rate start in the first round.
adaption radius start (real) This parameter specifies the radius of the sphere around a stimulus in the first round. This radius decreases every round, starting from adaption radius start in the first round and reaching adaption radius end in the last round.
adaption radius end (real) This parameter specifies the radius of the sphere around a stimulus in the last round. This radius decreases every round, starting from adaption radius start in the first round and reaching adaption radius end in the last round.
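The start and end parameters describe a schedule that moves from one value to the other over the training rounds. The text does not state the shape of the decrease, so the sketch below assumes a linear interpolation purely for illustration; the function name and the example values are hypothetical.

```python
def linear_schedule(start, end, training_rounds):
    """Per-round values moving linearly from `start` to `end` (assumed shape)."""
    if training_rounds == 1:
        return [start]
    step = (end - start) / (training_rounds - 1)
    return [start + step * round_index for round_index in range(training_rounds)]

# Hypothetical settings: five training rounds.
learning_rates = linear_schedule(start=0.8, end=0.0, training_rounds=5)
adaption_radii = linear_schedule(start=10.0, end=2.0, training_rounds=5)
```

Whatever the actual shape, the first round uses the start value and the last round uses the end value, which is the property both parameter descriptions rely on.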
Tutorial Processes
Dimensionality reduction of the Sonar data set using the Self-Organizing Map
operator
Figure 3.23: Tutorial process ‘Dimensionality reduction of the Sonar data set using the Self-Organizing Map operator’.
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so
that you can have a look at the ExampleSet. You can see that the ExampleSet has 60 attributes.
The Self-Organizing Map operator is applied on the ‘Sonar’ data set. The number of dimensions
parameter is set to 10. Thus the resultant ExampleSet will be composed of 10 dimensions (artificial attributes). You can see the resultant ExampleSet in the Results Workspace and verify
that it has only 10 attributes. Please note that these attributes are not original attributes of the
‘Sonar’ data set. These attributes were created using the SOM procedure.
Singular Value Decomposition
This operator performs a dimensionality reduction of the given ExampleSet based on Singular Value Decomposition (SVD). The user can specify the required number of dimensions or specify the cumulative variance threshold. In the latter case all components having cumulative variance above this threshold are discarded.
Description
Singular Value Decomposition (SVD) can be used to better understand an ExampleSet by showing the number of important dimensions. It can also be used to simplify the ExampleSet by reducing the number of attributes of the ExampleSet. This reduction removes unnecessary attributes that are linearly dependent from the point of view of linear algebra. It is useful when you
have obtained data on a number of attributes (possibly a large number of attributes), and believe
that there is some redundancy in those attributes. In this case, redundancy means that some of
the attributes are correlated with one another, possibly because they are measuring the same
construct. Because of this redundancy, you believe that it should be possible to reduce the observed attributes into a smaller number of components (artificial attributes) that will account
for most of the variance in the observed attributes. For example, imagine an ExampleSet which
contains an attribute that stores the water’s temperature on several samples and another that
stores its state (solid, liquid or gas). It is easy to see that the second attribute is dependent on the
first attribute and, therefore, SVD could easily show us that it is not important for the analysis.
RapidMiner provides various dimensionality reduction operators e.g. the Principal Component Analysis operator. The Principal Component Analysis technique is a specific case of SVD.
It is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated attributes into a set of values of uncorrelated attributes called
principal components. The number of principal components is less than or equal to the number
of original attributes. This transformation is defined in such a way that the first principal component’s variance is as high as possible (accounts for as much of the variability in the data as
possible), and each succeeding component in turn has the highest variance possible under the
constraint that it should be orthogonal to (uncorrelated with) the preceding components.
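The relationship between the two techniques can be written out briefly. The following is a sketch under the assumption that the data matrix X (n examples as rows, attributes as columns) has been mean-centered; the symbols U, Σ, V, C, σ_k and λ_k are introduced here for illustration and do not appear elsewhere in this manual.

```latex
X = U \Sigma V^{\top}
\quad\Longrightarrow\quad
C = \frac{1}{n-1} X^{\top} X
  = \frac{1}{n-1} V \Sigma^{\top} U^{\top} U \Sigma V^{\top}
  = V \left( \frac{\Sigma^{\top} \Sigma}{n-1} \right) V^{\top},
\qquad
\lambda_k = \frac{\sigma_k^{2}}{n-1}
```

Under that assumption, the columns of V are the principal directions of the covariance matrix C, and the variance λ_k of the k-th principal component follows from the k-th singular value σ_k.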
Differentiation
• Principal Component Analysis PCA is a dimensionality reduction procedure. PCA is a
specific case of SVD. See page 387 for details.
Input Ports
example set input (exa) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also
be used as input. It is essential that meta data should be attached with the data for the
input because attributes are specified in their meta data. The Retrieve operator provides
meta data along with the data. Please note that this operator cannot handle nominal attributes; it works on numerical attributes.
Output Ports
example set output (exa) The Singular Value Decomposition is performed on the input ExampleSet and the resultant ExampleSet is delivered through this port.
original (ori) The ExampleSet that was given as input is passed without changing to the output
through this port. This is usually used to reuse the same ExampleSet in further operators
or to view the ExampleSet in the Results Workspace.
preprocessing model (pre) This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.
Parameters
dimensionality reduction (selection) This parameter indicates which type of dimensionality reduction (reduction in number of attributes) should be applied.
• none if this option is selected, dimensionality reduction is not performed.
• keep_percentage if this option is selected, all the components with a cumulative
variance greater than the given threshold are removed from the ExampleSet. The
threshold is specified by the percentage threshold parameter.
• fixed_number if this option is selected, only a fixed number of components are kept.
The number of components to keep is specified by the dimensions parameter.
percentage threshold (real) This parameter is only available when the dimensionality reduction parameter is set to ‘keep percentage’. All the components with a cumulative variance
greater than the percentage threshold are removed from the ExampleSet.
dimensions (integer) This parameter is only available when the dimensionality reduction parameter is set to ‘fixed number’. The number of components to keep is specified by the
dimensions parameter.
Related Documents
• Principal Component Analysis (page 387)
Tutorial Processes
Dimensionality reduction of the Sonar data set using the Singular Value
Decomposition operator
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the ExampleSet has 60 attributes. The
Singular Value Decomposition operator is applied on the ‘Sonar’ data set. The dimensionality
reduction parameter is set to ‘fixed number’ and the dimensions parameter is set to 10. Thus the
resultant ExampleSet will be composed of 10 dimensions (artificial attributes). You can see the
resultant ExampleSet in the Results Workspace and verify that it has only 10 attributes. Please
note that these attributes are not original attributes of the ‘Sonar’ data set. These attributes
were created using the SVD procedure.
Figure 3.24: Tutorial process ‘Dimensionality reduction of the Sonar data set using the Singular Value Decomposition operator’.
4 Modeling
4.1 Predictive
Create Formula
This operator generates a formula from the given model. It can generate a formula only for models that are capable of producing one.
Description
The Create Formula operator extracts a prediction calculation formula from the given model and stores the formula in a formula result object which can then be written into a file, e.g. with the Write operator. Please note that not all RapidMiner models provide a formula; this operator is applicable only to those models that are capable of producing a formula.
Input Ports
model (mod) This input port expects a model. The model should be capable of providing a
formula.
Output Ports
formula (for) The formula of the given model is passed to the output through this port.
model (mod) The model that was given as input is passed without changing to the output through
this port. This is usually used to reuse the same model in further operators or to view the
model in the Results Workspace.
Tutorial Processes
Formula of the Logistic Regression model
The ‘Ripley-Set’ data set is loaded using the Retrieve operator. The Logistic Regression operator is applied on this ExampleSet with default values of all parameters. The regression model generated by the Logistic Regression operator is provided as input to the Create Formula operator, which returns a formula object. You can view this formula object in the Results Workspace. It is important to note that most RapidMiner models do not provide a formula, so this operator cannot be applied to them.
Figure 4.1: Tutorial process ‘Formula of the Logistic Regression model’.
Group Models
This operator groups the given models into a single combined model. When this combined model is applied, it is equivalent to applying the original models in their respective order.
Description
The Group Models operator groups all input models together into a single combined model. This
combined model can be applied on ExampleSets (using the Apply Model operator) like other
models. When this combined model is applied, it is equivalent to applying the original models
in their respective order. This combined model can also be written into a file using the Write
Model operator. This operator is useful in cases where preprocessing and prediction models
should be applied together on new and unseen data. A grouped model can be ungrouped with
the Ungroup Models operator. Please study the attached Example Process for more information
about the Group Models operator.
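The idea of a combined model that applies its members in order can be sketched with plain functions. This is an analogy rather than RapidMiner code: each model is represented as a hypothetical Python callable that takes a list of rows and returns a transformed list, and the two example models are invented for illustration.

```python
def group_models(models):
    """Combine models into one callable that applies them in the given order."""
    def grouped(rows):
        result = rows
        for model in models:
            result = model(result)
        return result
    return grouped

def halve(rows):
    """Hypothetical preprocessing model: scale every value by 0.5."""
    return [[value * 0.5 for value in row] for row in rows]

def threshold_classifier(rows):
    """Hypothetical prediction model: label a row by comparing its sum with 1.0."""
    return ["high" if sum(row) > 1.0 else "low" for row in rows]

# The preprocessing model comes first, so new data is transformed before prediction.
combined = group_models([halve, threshold_classifier])
predictions = combined([[1.0, 0.5], [3.0, 1.0]])
```

Applying the combined model gives the same result as applying the preprocessing model and then the prediction model by hand, which is why the order of the inputs matters.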
Input Ports
model in (mod) This input port expects a model. This operator can have multiple inputs, but it is mandatory to provide at least two models to this operator as input. When one input is connected, another model in port becomes available which is ready to accept another model (if any). The order of models remains the same, i.e. the model supplied at the first model in port of this operator will be the first model to be applied when the resultant combined model is applied.
Output Ports
model out (mod) The given models are grouped into a single combined model and the resultant grouped model is returned from this port.
Tutorial Processes
Grouping models and applying the resultant grouped model
Figure 4.2: Tutorial process ‘Grouping models and applying the resultant grouped model’.
The ‘Iris’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. You can see that the ExampleSet has four regular attributes. The Split Data operator is applied on it to split the ExampleSet into training and testing
data sets. The training data set (composed of 70% of examples) is passed to the SVD operator.
The dimensionality reduction and dimensions parameters of the SVD operator are set to ‘fixed
number’ and 2 respectively. Thus the given data set will be reduced to a data set with two dimensions (artificial attributes that represent the original attributes). The SVD model (model
that reduces the dimensionality of the given ExampleSet) is provided as the first model to the
Group Models operator. The Naive Bayes operator is applied on the resultant ExampleSet (i.e.
the training data set with reduced dimensions). The classification model generated by the Naive
Bayes operator is provided as the second model to the Group Models operator. Thus the Group Models operator combines two models: the SVD dimensionality reduction model and the Naive Bayes classification model. This combined model is applied on the testing data set (composed of 30% of
the ‘Iris’ data set) using the Apply Model operator. When the combined model is applied, the
SVD model is applied first on the testing data set. Then the Naive Bayes classification model is
applied on the resultant ExampleSet (i.e. the testing data set with reduced dimensions). The
combined model and the labeled ExampleSet can be seen in the Results Workspace after the
execution of the process.
4.1.1 Lazy
Default Model
This operator generates a model that provides the specified default value as prediction.
Description
The Default Model operator generates a model that predicts the specified default value for the
label in all examples. The method to use for generating a default value can be selected through
the method parameter. For a numeric label, the default value can be median or average of the
label values or a constant default value can be specified through the constant parameter. For
nominal values the mode of the labels can be used. Values of an attribute can be used as predictions; the attribute can be selected through the attribute parameter. This operator should
not be used for ‘actual’ prediction tasks, but it can be used for comparing the results of ‘actual’
learning schemes with guessing.
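The default value computation described above can be sketched in a few lines. This is an illustration, not the operator's implementation: the function name is hypothetical, the method strings simply mirror the names used in the text, the ‘attribute’ method is omitted, and ties for the mode are broken by first occurrence, which is an assumption of this example.

```python
import statistics
from collections import Counter

def default_prediction(labels, method, constant=None):
    """Return the single value that a default model would predict for every example."""
    if method == "mode":
        # Most frequent label value; suitable for nominal labels.
        return Counter(labels).most_common(1)[0][0]
    if method == "median":
        return statistics.median(labels)
    if method == "average":
        return statistics.mean(labels)
    if method == "constant":
        return constant
    raise ValueError(f"unsupported method: {method}")

# Hypothetical label columns.
nominal_default = default_prediction(["Mine", "Rock", "Mine"], "mode")
median_default = default_prediction([1.0, 2.0, 10.0], "median")
average_default = default_prediction([1.0, 2.0, 3.0], "average")
constant_default = default_prediction([1.0, 2.0, 3.0], "constant", constant=7.5)
```

Because the prediction ignores all regular attributes, the accuracy of such a model is a natural lower bound against which a real learning scheme can be compared.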
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input.
Output Ports
model (mod) The default model is delivered from this output port. This model can now be
applied on unseen data sets for the prediction of the label attribute. This model should
not be used for ‘actual’ prediction tasks, but it can be used for comparing the results of
‘actual’ learning schemes with guessing.
example set (exa) The ExampleSet that was given as input is passed without changing to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
method (selection) This parameter specifies the method for computing the default values.
For a numeric label, the default value can be median or average of the label values or a
constant default value can be specified through the constant parameter. For nominal values the mode of the labels can be used. Values of an attribute can be used as predictions;
the attribute can be selected through the attribute parameter.
constant (real) This parameter is only available when the method parameter is set to ‘constant’. This parameter specifies a constant default value for a numeric label.
attribute (string) This parameter is only available when the method parameter is set to ‘attribute’. This parameter specifies the attribute to get the predicted values from. If applied
on a nominal label, it should be made sure that the selected attribute has the same set of
possible values as the label.
Tutorial Processes
Using the Default Model operator with ‘mode’ method
Figure 4.3: Tutorial process ‘Using the Default Model operator with ‘mode’ method’.
The ‘Sonar’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so
that you can have a look at the ExampleSet. You can see that there are two possible label values
i.e. ‘Rock’ and ‘Mine’. The most frequently occurring label value is ‘Mine’. The Split Validation
operator is applied on this ExampleSet for training and testing a classification model. The Default Model operator is applied in the training subprocess of the Split Validation operator. The
method parameter of the Default Model operator is set to ‘mode’, thus the most frequently occurring label value (i.e. ‘Mine’) will be used as prediction in all examples. The Apply Model operator
is used in the testing subprocess for applying the model generated by the Default Model operator. A breakpoint is inserted here so that you can have a look at the labeled ExampleSet. You
can see that all examples have been predicted as ‘Mine’. This labeled ExampleSet is used by the
Performance operator for measuring the performance of the model. The default model and its
performance vector are connected to the output and they can be seen in the Results Workspace.
K-NN
This operator generates a k-Nearest Neighbor model from the input ExampleSet. This model can be a classification or regression
model depending on the input ExampleSet.
Description
The k-Nearest Neighbor algorithm is based on learning by analogy, that is, by comparing a given
test example with training examples that are similar to it. The training examples are described
by n attributes. Each example represents a point in an n-dimensional space. In this way, all of
the training examples are stored in an n-dimensional pattern space. When given an unknown
example, a k-nearest neighbor algorithm searches the pattern space for the k training examples
that are closest to the unknown example. These k training examples are the k “nearest neighbors” of the unknown example. “Closeness” is defined in terms of a distance metric, such as the
Euclidean distance.
The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms:
an example is classified by a majority vote of its neighbors, with the example being assigned to
the class most common amongst its k nearest neighbors (k is a positive integer, typically small).
If k = 1, then the example is simply assigned to the class of its nearest neighbor. The same method
can be used for regression, by simply assigning the label value for the example to be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of the
neighbors, so that the nearer neighbors contribute more to the average than the more distant
ones.
The neighbors are taken from a set of examples for which the correct classification (or, in the
case of regression, the value of the label) is known. This can be thought of as the training set
for the algorithm, though no explicit training step is required.
The basic k-Nearest Neighbor algorithm is composed of two steps:
1. Find the k training examples that are closest to the unseen example.
2. Take the most commonly occurring classification for these k examples (or, in the case of
regression, take the average of these k label values).
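The two steps above can be sketched in a few lines of Python. This is an illustrative sketch that assumes purely numerical attributes and plain Euclidean distance; the function names are hypothetical and do not correspond to RapidMiner's implementation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equally long attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(train, query, k):
    """Step 1: return the k (attributes, label) pairs closest to the query."""
    return sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]


def knn_classify(train, query, k=1):
    """Step 2 (classification): majority vote among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=1):
    """Step 2 (regression): average label value of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(label for _, label in neighbors) / len(neighbors)


# Made-up training data: one numerical attribute per example
weather = [((85.0,), "no"), ((80.0,), "no"), ((83.0,), "yes"), ((70.0,), "yes")]
print(knn_classify(weather, (84.0,), k=1))

curve = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_regress(curve, (2.1,), k=3))
```

Note that there is no training step: the "model" is simply the stored training examples, and all the work happens when a query is evaluated.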
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Select Attributes
operator in the attached Example Processes. The output of other operators can also be used
as input.
Output Ports
model (mod) The k-Nearest Neighbor model is delivered from this output port. This model can
now be applied on unseen data sets for prediction of the label attribute.
example set (exa) The ExampleSet that was given as input is passed without changing to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
k (integer) This parameter specifies the number of nearest neighbors to use. Finding the k
training examples that are closest to the unseen example is the first step of the k-NN
algorithm. If k = 1, the example is simply assigned to the class of its nearest neighbor.
k is typically a small positive integer; an odd value is commonly chosen, which helps avoid
tied votes in two-class problems.
weighted vote (boolean) If this parameter is set, the distance values between the examples
are also taken into account. It can be useful to weight the contributions of the neighbors,
so that the nearer neighbors contribute more than the more distant ones.
measure types (selection) This parameter is used for selecting the type of measure to be
used for finding the nearest neighbors. The following options are available: mixed measures, nominal measures, numerical measures and Bregman divergences.
mixed measure (selection) This parameter is available when the measure types parameter is
set to ‘mixed measures’. The only available option is the ‘Mixed Euclidean Distance’.
nominal measure (selection) This parameter is available when the measure types parameter
is set to ‘nominal measures’. This option cannot be applied if the input ExampleSet has
numerical attributes; in that case the ‘numerical measure’ option should be selected.
numerical measure (selection) This parameter is available when the measure types parameter is set to ‘numerical measures’. This option cannot be applied if the input ExampleSet
has nominal attributes; in that case the ‘nominal measure’ option should be selected.
divergence (selection) This parameter is available when the measure types parameter is set to
‘bregman divergences’.
kernel type (selection) This parameter is available only when the numerical measure parameter is set to ‘Kernel Euclidean Distance’. The type of the kernel function is selected through
this parameter. The following kernel types are supported:
• dot The dot kernel is defined by k(x,y) = x*y, i.e. it is the inner product of x and y.
• radial The radial kernel is defined by exp(-g ||x-y||^2) where g is the gamma; it is specified by the kernel gamma parameter. The adjustable parameter gamma plays a major
role in the performance of the kernel, and should be carefully tuned to the problem
at hand.
• polynomial The polynomial kernel is defined by k(x,y) = (x*y+1)^d where d is the degree of the polynomial; it is specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
• neural The neural kernel is defined by a two-layered neural net tanh(a x*y+b) where
a is alpha and b is the intercept constant. These parameters can be adjusted using the
kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the
data dimension. Note that not all choices of a and b lead to a valid kernel function.
• sigmoid This is the sigmoid kernel. Please note that it is not valid under some parameters.
• anova The anova kernel is defined by the summation of exp(-g (x-y)) raised to the
power d, where g is gamma and d is degree. The two are adjusted by the kernel gamma and
kernel degree parameters respectively.
• epachnenikov The Epanechnikov kernel is the function (3/4)(1-u^2) for u between -1
and 1 and zero for u outside that range. It has the two adjustable parameters kernel
sigma1 and kernel degree.
• gaussian_combination This is the gaussian combination kernel. It has the adjustable
parameters kernel sigma1, kernel sigma2 and kernel sigma3.
• multiquadric The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2.
It has the adjustable parameters kernel sigma1 and kernel shift.
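For orientation, the dot, radial, polynomial and neural kernels from the list above can be written down directly from their formulas. The sketch below is illustrative Python, not RapidMiner code. The kernel_distance function uses the standard kernel-induced distance sqrt(k(x,x) - 2 k(x,y) + k(y,y)); that this is exactly how the ‘Kernel Euclidean Distance’ measure is computed is an assumption.

```python
import math


def dot(x, y):
    """Dot kernel: inner product of x and y."""
    return sum(a * b for a, b in zip(x, y))


def radial(x, y, gamma=1.0):
    """Radial kernel: exp(-gamma * ||x - y||^2)."""
    squared_norm = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_norm)


def polynomial(x, y, degree=2):
    """Polynomial kernel: (x*y + 1)^degree."""
    return (dot(x, y) + 1) ** degree


def neural(x, y, a=1.0, b=0.0):
    """Neural kernel: tanh(a * x*y + b)."""
    return math.tanh(a * dot(x, y) + b)


def kernel_distance(kernel, x, y):
    """Distance induced by a kernel (assumed form, see the note above)."""
    squared = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(squared, 0.0))  # guard against rounding below zero


x, y = [0.0, 0.0], [3.0, 4.0]
print(kernel_distance(dot, x, y))     # reduces to the ordinary Euclidean distance
print(kernel_distance(radial, x, y))  # bounded, because radial(x, x) is always 1
```

With the dot kernel the induced distance coincides with the Euclidean distance, which is a quick way to sanity-check such a function.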
kernel gamma (real) This is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to radial or anova.
kernel sigma1 (real) This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to epachnenikov, gaussian combination or multiquadric.
kernel sigma2 (real) This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to gaussian combination.
kernel sigma3 (real) This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to gaussian combination.
kernel shift (real) This is the SVM kernel parameter shift. This parameter is only available
when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the kernel
type parameter is set to multiquadric.
kernel degree (real) This is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the
kernel type parameter is set to polynomial, anova or epachnenikov.
kernel a (real) This is the SVM kernel parameter a. This parameter is only available when
the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the kernel type
parameter is set to neural.
kernel b (real) This is the SVM kernel parameter b. This parameter is only available when
the numerical measure parameter is set to ‘Kernel Euclidean Distance’ and the kernel type
parameter is set to neural.
Tutorial Processes
Classification of the ‘Golf-Testset’ data set using the K-NN operator
The ‘Golf’ data set is loaded using the Retrieve operator. The Select Attributes operator is applied on it to select just the ‘Play’ and ‘Temperature’ attributes to simplify this example process.
Then the K-NN operator is applied on it. All the parameters of the K-NN operator are used with
default values. The resultant classification model is applied on the ‘Golf-Testset’ data set using
the Apply Model operator. Note that the same two attributes of the ‘Golf-Testset’ data set were
selected before the application of the classification model on it.
Run the process. You can see the ‘Golf’ data set and the labeled ‘Golf-Testset’ data set in the
Results Workspace. As the k parameter was set to 1, each example of the ‘Golf-Testset’ data set
is simply assigned the class of its nearest neighbor in the ‘Golf’ data set. To understand how
examples were classified, simply select an example in the ‘Golf-Testset’ data set and note the
value of the ‘Temperature’ attribute of that example; we call it t1 here. Now, have a look at the
‘Golf’ data set and find the example where the ‘Temperature’ value is closest to t1. The example
in the ‘Golf-Testset’ data set is assigned the same class as the class of this example in the ‘Golf’
data set. For example, let us consider the first example of the ‘Golf-Testset’ data set. The value
of the ‘Temperature’ attribute is 85 in this example. Now find the example in the ‘Golf’ data set
where the ‘Temperature’ value is closest to 85. As you can see, the first example of the ‘Golf’ data
set has a ‘Temperature’ value equal to 85. This example is labeled ‘no’, thus the corresponding
example in the ‘Golf-Testset’ data set is also predicted as ‘no’.
Figure 4.4: Tutorial process ‘Classification of the ‘Golf-Testset’ data set using the K-NN operator’. (Process diagram; operators shown: Golf, Select Attributes, k-NN, Apply Model, Golf-Testset, Select Attributes.)
Regression of the Polynomial data set using the K-NN operator
The ‘Polynomial’ data set is loaded using the Retrieve operator. The Filter Example Range operator is applied on it to select just the first 100 examples. Then the Select Attributes operator is
applied on it to select just the ‘label’ and ‘a1’ attributes to simplify this example process. Then
the K-NN operator is applied on it. The k parameter is set to 3, the measure types parameter is
set to ‘Numerical Measures’ and the numerical measure parameter is set to ‘Euclidean Distance’.
The resultant regression model is applied on the last 100 examples of the ‘Polynomial’ data set
using the Apply Model operator. Note that the same two attributes of the ‘Polynomial’ data set
were selected before the application of the regression model on it.
Run the process. You can see the ‘Polynomial’ data set (first 100 examples) and the labeled
‘Polynomial’ data set (last 100 examples) in the Results Workspace. For convenience we call
these data sets ‘Polynomial’_first and ‘Polynomial’_last respectively. As the k parameter was
set to 3, each example of the ‘Polynomial’_last data set is simply assigned the average label
value of its 3 nearest neighbors in the ‘Polynomial’_first data set. To understand
how regression was performed, simply select an example in the ‘Polynomial’_last data set and
note the value of the ‘a1’ attribute of that example; we call it a1_val here. Now, have a look at
the ‘Polynomial’_first data set and find the 3 examples where the ‘a1’ attribute value is closest to
a1_val. The ‘label’ attribute of the example of the ‘Polynomial’_last data set is assigned the
average of these three label values of the ‘Polynomial’_first data set. For example, let us consider
the last example of the ‘Polynomial’_last data set. The value of the ‘a1’ attribute is 4.788 in this
example. Now find the 3 examples in the ‘Polynomial’_first data set where the value of the ‘a1’ attribute is closest to 4.788. These 3 examples are at Row No. 65, 71 and 86. The values of the
‘label’ attribute of these examples are 41.798, 124.250 and 371.814 respectively. The average of
these three ‘label’ values is 179.288. Thus the value of the ‘label’ attribute in the last example
of the ‘Polynomial’_last data set is predicted to be 179.288.
Figure 4.5: Tutorial process ‘Regression of the Polynomial data set using the K-NN operator’. (Process diagram; operators shown: Polynomial_first, Filter Example Range, Select Attributes, k-NN, Apply Model, Polynomial_last, Filter Example Range, Select Attributes.)
4.1.2 Bayesian
Naive Bayes
This operator generates a Naive Bayes classification model.
Description
A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem
(from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive
term for the underlying probability model would be ‘independent feature model’. In simple
terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature
of a class (i.e. attribute) is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter.
Even if these features depend on each other or upon the existence of the other features, a Naive
Bayes classifier considers all of these properties to independently contribute to the probability
that this fruit is an apple.
The advantage of the Naive Bayes classifier is that it only requires a small amount of training
data to estimate the means and variances of the variables necessary for classification. Because
independent variables are assumed, only the variances of the variables for each label need to be
determined and not the entire covariance matrix.
Input Ports
training set (tra) The input port expects an ExampleSet. It is the output of the Select Attributes operator in our example process. The output of other operators can also be used
as input.
Output Ports
model (mod) The Naive Bayes classification model is delivered from this output port. This classification model can now be applied on unseen data sets for prediction of the label attribute.
example set (exa) The ExampleSet that was given as input is passed without changing to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
laplace correction (boolean) This is an expert parameter. This parameter indicates if Laplace
correction should be used to prevent high influence of zero probabilities. There is a simple
trick to avoid zero probabilities. We can assume that our training set is so large that adding
one to each count that we need would only make a negligible difference in the estimated
probabilities, yet would avoid the case of zero probability values. This technique is known
as Laplace correction.
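The effect of the correction can be illustrated with textbook add-one smoothing. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, and RapidMiner's exact correction formula may differ, so the numbers will not necessarily match the distribution table shown by the operator.

```python
from collections import Counter


def conditional_probabilities(values, possible_values, laplace=True):
    """Estimate P(value | class) from the attribute values seen for one class.

    values          -- attribute values of the training examples of one class
    possible_values -- all values the attribute can take
    laplace         -- if True, add one to every count (add-one smoothing)
    """
    counts = Counter(values)
    add = 1 if laplace else 0
    total = len(values) + add * len(possible_values)
    if total == 0:
        raise ValueError("no examples and no correction: nothing to estimate")
    return {v: (counts[v] + add) / total for v in possible_values}


# Made-up example: 'overcast' never occurs for this class
seen = ["sunny", "sunny", "rain"]
domain = ["sunny", "rain", "overcast"]
print(conditional_probabilities(seen, domain, laplace=False))  # overcast -> 0.0
print(conditional_probabilities(seen, domain, laplace=True))   # overcast -> 1/6
```

Without the correction a single unseen attribute value forces the whole product of probabilities for that class to zero, regardless of the other attributes.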
Tutorial Processes
Working of Naive Bayes
Figure 4.6: Tutorial process ‘Working of Naive Bayes’. (Process diagram; operators shown: Golf, Select Attributes, Naive Bayes, Apply Model, Golf-Testset, Select Attributes.)
The Retrieve operator is used to load the ‘Golf’ data set. The Select Attributes operator is applied on it to select just the Outlook and Wind attributes. This is done to simplify the understanding
of this Example Process. The Naive Bayes operator is applied on it and the resulting model is
applied on the ‘Golf-Testset’ data set. The same two attributes of the ‘Golf-Testset’ data set were
selected before application of the Naive Bayes model. A breakpoint is inserted after the Naive
Bayes operator. Run the process and see the distribution table in the Results Workspace. We
will use this distribution table to explain how Naive Bayes works. Hit the Run button again to
continue with the process. Let us see how the first and last examples of the ‘Golf-Testset’ data
set were predicted by Naive Bayes. Note that 9 out of 14 examples of the training set had label =
yes, thus the prior probability of the label = yes is 9/14. Similarly the prior probability
of the label = no is 5/14.
Note that in the testing set, the attributes of the first example are Outlook = sunny and Wind
= false. Naive Bayes does calculation for all possible label values and selects the label value that
has maximum calculated probability.
Calculation for label = yes
Find the product of the following:
• prior probability of label = yes (i.e. 9/14)
• value from distribution table when Outlook = sunny and label = yes (i.e. 0.223)
• value from distribution table when Wind = false and label = yes (i.e. 0.659)
Thus the answer = 9/14*0.223*0.659 = 0.094
Calculation for label = no
Find the product of the following:
• prior probability of label = no (i.e. 5/14)
• value from distribution table when Outlook = sunny and label = no (i.e. 0.581)
• value from distribution table when Wind = false and label = no (i.e. 0.397)
Thus the answer = 5/14*0.581*0.397 = 0.082
As the value for label = yes is the maximum of all possible label values, label is predicted to
be yes.
Similarly let us have a look at the last example of the ‘Golf-Testset’ data set. Note that in the
testing set, in the last example Outlook = rain and Wind = true. Naive Bayes does the calculation for all
possible label values and selects the label value that has the maximum calculated probability.
Calculation for label = yes
Find the product of the following:
• prior probability of label = yes (i.e. 9/14)
• value from distribution table when Outlook = rain and label = yes (i.e. 0.331)
• value from distribution table when Wind = true and label = yes (i.e. 0.333)
Thus the answer = 9/14*0.331*0.333 = 0.071
Calculation for label = no
Find the product of the following:
• prior probability of label = no (i.e. 5/14)
• value from distribution table when Outlook = rain and label = no (i.e. 0.392)
• value from distribution table when Wind = true and label = no (i.e. 0.589)
Thus the answer = 5/14*0.392*0.589 = 0.082
As the value for label = no is the maximum of all possible label values, label is predicted to be
no.
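The two hand calculations above can be reproduced with a short Python sketch. The class probabilities (9/14 and 5/14) and the distribution-table values are copied from the text; the function and variable names are hypothetical.

```python
def naive_bayes_score(class_probability, table_values):
    """Multiply the class probability by one table value per attribute."""
    score = class_probability
    for value in table_values:
        score *= value
    return score


# First example of the test set: Outlook = sunny, Wind = false
scores_first = {
    "yes": naive_bayes_score(9 / 14, [0.223, 0.659]),
    "no": naive_bayes_score(5 / 14, [0.581, 0.397]),
}

# Last example of the test set: Outlook = rain, Wind = true
scores_last = {
    "yes": naive_bayes_score(9 / 14, [0.331, 0.333]),
    "no": naive_bayes_score(5 / 14, [0.392, 0.589]),
}

for name, scores in (("first", scores_first), ("last", scores_last)):
    prediction = max(scores, key=scores.get)  # label with the largest score
    rounded = {label: round(s, 3) for label, s in scores.items()}
    print(name, rounded, "->", prediction)
```

The scores are not normalized probabilities; only their relative order matters for choosing the predicted label.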
Now run the process again, but this time uncheck the laplace correction parameter. Now you
can see that as laplace correction is not used for avoiding zero probability, there are numerous
zeroes in the distribution table.
Naive Bayes (Kernel)
This operator generates a Kernel Naive Bayes classification model
using estimated kernel densities.
Description
A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem
(from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive
term for the underlying probability model would be the ‘independent feature model’. In simple
terms, a Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a
class (i.e. attribute) is unrelated to the presence (or absence) of any other feature. For example,
a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if
these features depend on each other or upon the existence of the other features, a Naive Bayes
classifier considers all of these properties to independently contribute to the probability that
this fruit is an apple. The Naive Bayes classifier performs reasonably well even if the underlying
assumption is not true.
The advantage of the Naive Bayes classifier is that it only requires a small amount of training
data to estimate the means and variances of the variables necessary for classification. Because
independent variables are assumed, only the variances of the variables for each label need to be
determined and not the entire covariance matrix. In contrast to the Naive Bayes operator, the
Naive Bayes (Kernel) operator can be applied on numerical attributes.
A kernel is a weighting function used in non-parametric estimation techniques. Kernels are
used in kernel density estimation to estimate random variables’ density functions, or in kernel
regression to estimate the conditional expectation of a random variable.
Kernel density estimators belong to a class of estimators called non-parametric density estimators. Parametric estimators have a fixed functional
form (structure), and the parameters of this function are the only information that needs to be stored.
Non-parametric estimators, in contrast, have no fixed structure and depend upon all the data points to reach
an estimate.
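A minimal kernel density estimator makes the contrast concrete: the estimate keeps every training value and sums one kernel per data point. The Python sketch below assumes a Gaussian kernel and a fixed bandwidth; the kernel actually used by this operator is not stated here, and the function name is hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from one Gaussian kernel per sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Every stored sample contributes; nearby samples contribute most.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up numerical attribute values for one class
temperatures = [64.0, 65.0, 68.0, 69.0, 70.0]
density = gaussian_kde(temperatures, bandwidth=2.0)
print(density(67.0) > density(90.0))  # True: 67 lies among the samples
```

A small bandwidth produces a spiky estimate that follows individual points, while a large one smooths away structure, which is why the bandwidth has to be chosen with care.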
Input Ports
training set (tra) The input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input.
Output Ports
model (mod) The Kernel Naive Bayes classification model is delivered from this output port.
This classification model can now be applied on unseen data sets for prediction of the label
attribute.
example set (exa) The ExampleSet that was given as input is passed without changing to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
laplace correction (boolean) This parameter indicates if Laplace correction should be used
to prevent high influence of zero probabilities. There is a simple trick to avoid zero probabilities. We can assume that our training set is so large that adding one to each count that
we need would only make a negligible difference in the estimated probabilities, yet would
avoid the case of zero probability values. This technique is known as Laplace correction.
estimation mode (selection) This parameter specifies the kernel density estimation mode.
Two options are available.
• full If this option is selected, you can either have the bandwidth selected through a
heuristic or specify a fixed bandwidth.
• greedy If this option is selected, you have to specify the minimum bandwidth and the
number of kernels.
bandwidth selection (selection) This parameter is only available when the estimation mode
parameter is set to ‘full’. This parameter specifies the method to set the kernel bandwidth.
The bandwidth can be selected through a heuristic or a fixed bandwidth can be specified. Please
note that the bandwidth of the kernel is a free parameter which exhibits a strong influence
on the resulting estimate. It is important to choose an appropriate bandwidth: a value that is
too small yields a noisy estimate, and a value that is too large smooths away the structure of the data.
bandwidth (real) This parameter is only available when the estimation mode parameter is set
to ‘full’ and the bandwidth selection parameter is set to ‘fix’. This parameter specifies the
kernel bandwidth.
minimum bandwidth (real) This parameter is only available when the estimation mode parameter is set to ‘greedy’. This parameter specifies the minimum kernel bandwidth.
number of kernels (integer) This parameter is only available when the estimation mode parameter is set to ‘greedy’. This parameter specifies the number of kernels.
use application grid (boolean) This parameter indicates if the kernel density function grid
should be used in the model application. It speeds up model application at the expense of
the density function precision.
application grid size (integer) This parameter is only available when the use application grid
parameter is set to true. This parameter specifies the size of the application grid.
Tutorial Processes
Introduction to the Naive Bayes (Kernel) operator
The ‘Golf’ data set is loaded using the Retrieve operator. The Naive Bayes (Kernel) operator is
applied on it. All parameters of the Naive Bayes (Kernel) operator are used with default values. The model generated by the Naive Bayes (Kernel) operator is applied on the ‘Golf-Testset’
data set using the Apply Model operator. The results of the process can be seen in the Results
Workspace. Please note that parameters should be carefully chosen for this operator to obtain
better performance. In particular, the bandwidth should be selected carefully.
Figure 4.7: Tutorial process ‘Introduction to the Naive Bayes (Kernel) operator’. (Process diagram; operators shown: Golf, Naive Bayes (Kernel), Apply Model, Golf-Testset.)
4.1.3 Trees
CHAID
This operator generates a pruned decision tree based on the chi-squared attribute relevance test. This operator can be applied only
on ExampleSets with nominal data.
Description
The CHAID decision tree operator works exactly like the Decision Tree operator with one exception: it uses a chi-squared based criterion instead of the information gain or gain ratio criteria.
Moreover this operator cannot be applied on ExampleSets with numerical attributes. It is recommended that you study the documentation of the Decision Tree operator for basic understanding
of decision trees.
CHAID stands for CHi-squared Automatic Interaction Detection. The chi-square statistic is a
nonparametric statistical technique used to determine if a distribution of observed frequencies
differs from the theoretical expected frequencies. Chi-square statistics use nominal data, thus
instead of using means and variances, this test uses frequencies. CHAID’s advantages are that
its output is highly visual and easy to interpret. Because it uses multiway splits by default, it
needs rather large sample sizes to work effectively, since with small sample sizes the respondent
groups can quickly become too small for reliable analysis.
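To make the chi-square statistic concrete, the following Python sketch compares the observed frequencies of (attribute value, label) pairs with the frequencies expected if attribute and label were independent. It is an illustrative sketch with a hypothetical function name, not the criterion code of the CHAID operator.

```python
from collections import Counter


def chi_squared(attribute_values, labels):
    """Chi-squared statistic of a nominal attribute against the label."""
    n = len(labels)
    observed = Counter(zip(attribute_values, labels))
    attribute_totals = Counter(attribute_values)
    label_totals = Counter(labels)
    statistic = 0.0
    for value, value_count in attribute_totals.items():
        for label, label_count in label_totals.items():
            # Frequency expected under independence of attribute and label
            expected = value_count * label_count / n
            statistic += (observed[(value, label)] - expected) ** 2 / expected
    return statistic


labels = ["x", "x", "y", "y"]
print(chi_squared(["a", "a", "b", "b"], labels))  # attribute determines the label
print(chi_squared(["a", "b", "a", "b"], labels))  # attribute is unrelated to the label
```

An attribute with a larger statistic deviates more from independence and is therefore a more promising candidate for a split.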
This representation of the data has the advantage compared with other approaches of being
meaningful and easy to interpret. The goal is to create a classification model that predicts the
value of the label based on several input attributes of the ExampleSet. Each interior node of the
tree corresponds to one of the input attributes. The number of edges of an interior node is equal
to the number of possible values of the corresponding input attribute. Each leaf node represents
a value of the label given the values of the input attributes represented by the path from the root
to the leaf. This description can be easily understood by studying the Example Process of the
Decision Tree operator.
Pruning is a technique in which leaf nodes that do not add to the discriminative power of the
decision tree are removed. This is done to convert an over-specific or over-fitted tree to a more
general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type
of pruning performed parallel to the tree creation process. Post-pruning, on the other hand, is
done after the tree creation process is complete.
Differentiation
• Decision Tree The CHAID operator works exactly like the Decision Tree operator with one exception: it
uses a chi-squared based criterion instead of the information gain or gain ratio criteria.
Moreover this operator cannot be applied on ExampleSets with numerical attributes. See
page ?? for details.
• Decision Tree (Weight-Based) If the Weight by Chi Squared Statistic operator is applied
for attribute weighting in the subprocess of the Decision Tree (Weight-Based) operator, it
works exactly like the CHAID operator. See page 425 for details.
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Generate Nominal Data operator in the attached Example Process. The output of other operators can also
be used as input. This operator cannot handle numerical data, therefore the ExampleSet
should not have numerical attributes.
Output Ports
model (mod) The CHAID Decision Tree is delivered from this output port. This classification
model can now be applied on unseen data sets for the prediction of the label attribute.
example set (exa) The ExampleSet that was given as input is passed without changing to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
minimal size for split (integer) The size of a node is the number of examples in its subset.
The size of the root node is equal to the total number of examples in the ExampleSet. Only
those nodes are split whose size is greater than or equal to the minimal size for split parameter.
minimal leaf size (integer) The size of a leaf node is the number of examples in its subset.
The tree is generated in such a way that every leaf node subset has at least the minimal leaf
size number of instances.
minimal gain (real) The gain of a node is calculated before splitting it. The node is split if its
gain is greater than the minimal gain. Higher values of minimal gain result in fewer splits
and thus a smaller tree. A value that is too high will completely prevent splitting, and a tree with
a single node is generated.
maximal depth (integer) The depth of a tree varies depending upon the size and nature of the
ExampleSet. This parameter is used to restrict the size of the Decision Tree. The tree generation process is not continued when the tree depth is equal to the maximal depth. If its
value is set to ‘-1’, the maximal depth parameter puts no bound on the depth of the tree and a
tree of maximum depth is generated. If its value is set to ‘1’, a tree with a single node is
generated.
confidence (real) This parameter specifies the confidence level used for the pessimistic error
calculation of pruning.
number of prepruning alternatives (integer) As prepruning runs parallel to the tree generation process, it may prevent splitting at certain nodes when splitting at that node does
not add to the discriminative power of the entire tree. In such a case alternative nodes
are tried for splitting. This parameter adjusts the number of alternative nodes tried for
splitting when it is prevented by prepruning at a certain node.
no prepruning (boolean) By default the Decision Tree is generated with prepruning. Setting
this parameter to true disables the prepruning and delivers a tree without any prepruning.
no pruning (boolean) By default the Decision Tree is generated with pruning. Setting this
parameter to true disables the pruning and delivers an unpruned Tree.
Related Documents
• Decision Tree (page ??)
• Decision Tree (Weight-Based) (page 425)
Tutorial Processes
Introduction to the CHAID operator
Figure 4.8: Tutorial process ‘Introduction to the CHAID operator’. (Process diagram; operators shown: Generate Nominal Data, CHAID.)
The Generate Nominal Data operator is used for generating an ExampleSet with 100 examples. There are three nominal attributes in the ExampleSet and every attribute has three possible values. A breakpoint is inserted here so that you can have a look at the ExampleSet. The
CHAID operator is applied on this ExampleSet with default values of all parameters. The resultant model is connected to the result port of the process and it can be seen in the Results
Workspace.
Decision Stump
This operator learns a Decision Tree with a single split.
It can be applied to both nominal and numerical data
sets.
Description
The Decision Stump operator generates a decision tree with a single split. The
resulting tree can be used for classifying unseen examples. This operator can be very efficient
when boosted with operators like the AdaBoost operator. The examples of the given ExampleSet
have several attributes and every example belongs to a class (like yes or no). The leaf nodes of a
decision tree contain the class name, whereas a non-leaf node is a decision node. A decision
node is an attribute test in which each branch (leading to another decision tree) corresponds to a possible value of the
attribute. For more information about decision trees, please study the Decision Tree operator.
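As a rough illustration of what a single-split tree looks like, the following sketch learns a stump on nominal attributes. It is not RapidMiner's implementation: it assumes the accuracy criterion (each branch predicts its majority label and the attribute with the most correct training predictions wins), and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def majority(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def learn_stump(rows, labels):
    """Pick the nominal attribute whose single split classifies most rows correctly."""
    best = None
    for attr in rows[0]:
        # Group the labels by the value this attribute takes.
        groups = defaultdict(list)
        for row, label in zip(rows, labels):
            groups[row[attr]].append(label)
        # Each branch (one per attribute value) predicts its majority label.
        leaves = {value: majority(group) for value, group in groups.items()}
        correct = sum(leaves[row[attr]] == label for row, label in zip(rows, labels))
        if best is None or correct > best[0]:
            best = (correct, attr, leaves)
    _, attr, leaves = best
    return attr, leaves, majority(labels)

def predict(stump, row):
    attr, leaves, default = stump
    # Attribute values not seen in training fall back to the overall majority label.
    return leaves.get(row[attr], default)

# Hypothetical toy data, not taken from the manual.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["no", "no", "yes", "yes"]
stump = learn_stump(rows, labels)
print(stump[0], predict(stump, {"outlook": "overcast", "windy": "yes"}))
```

The returned stump has exactly one decision node (the chosen attribute) and one leaf per attribute value, which is the structure described above.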
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input.
Output Ports
model (mod) The Decision Tree with just a single split is delivered from this output port. This
classification model can now be applied on unseen data sets for the prediction of the label
attribute.
example set (exa) The ExampleSet that was given as input is passed unchanged to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
criterion (selection) This parameter specifies the criterion by which attributes are selected
for splitting. It can have one of the following values:
• information_gain The entropy of all the attributes is calculated. The attribute with
minimum entropy is selected for the split. This method has a bias towards selecting attributes with a large number of values.
• gain_ratio This is a variant of information gain. It adjusts the information gain of each
attribute to allow for the breadth and uniformity of the attribute values.
• gini_index This is a measure of the impurity of an ExampleSet. Splitting on a chosen
attribute gives a reduction in the average gini index of the resulting subsets.
• accuracy The attribute that maximizes the accuracy of the whole tree is selected for the
split.
4. Modeling
minimal leaf size (integer) The size of a leaf node is the number of examples in its subset.
The tree is generated in such a way that every leaf node subset contains at least
minimal leaf size examples.
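The entropy-based and gini-based values of the criterion parameter can be made concrete with a small calculation. The sketch below uses the standard textbook definitions, which are assumed rather than confirmed to match RapidMiner's internals. The label counts are those of the classic 14-example play-golf data and are likewise an assumption, as are all function names.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted(measure, subsets):
    """Size-weighted average of an impurity measure over the subsets of a split."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * measure(s) for s in subsets)

def information_gain(parent, subsets):
    return entropy(parent) - weighted(entropy, subsets)

def gain_ratio(parent, subsets):
    # Split information penalises attributes with many, evenly used values.
    split_info = entropy([sum(s) for s in subsets])
    return information_gain(parent, subsets) / split_info if split_info > 0 else 0.0

def gini_reduction(parent, subsets):
    return gini(parent) - weighted(gini, subsets)

# Label counts (yes, no); assumed from the classic 14-example play-golf data.
parent = [9, 5]
outlook = [[2, 3], [4, 0], [3, 2]]   # sunny, overcast, rain
print(round(information_gain(parent, outlook), 3))
print(round(gain_ratio(parent, outlook), 3))
print(round(gini_reduction(parent, outlook), 3))
```

Computing the same three numbers for every candidate attribute and picking the largest one reproduces the attribute selection step for the respective criterion.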
Tutorial Processes
Introduction to the Decision Stump operator
Figure 4.9: Tutorial process ‘Introduction to the Decision Stump operator’.
To understand the basic terminology of trees, please study the Example Process of the Decision Tree operator.
The ‘Golf’ data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. The Decision Stump operator is applied on this ExampleSet. The criterion parameter is set to ‘information gain’ and the minimal leaf size parameter
is set to 1. The resultant decision tree model is connected to the result port of the process and it
can be seen in the Results Workspace. You can see that this decision tree has just a single split.
Decision Tree
Generates a Decision Tree for classification of both nominal and
numerical data.
Description
A decision tree is a tree-like graph or model. It is more like an inverted tree because it has its
root at the top and it grows downwards. Compared with other approaches, this representation of the data has the advantage of being meaningful and easy to interpret. The goal is to create a
classification model that predicts the value of a target attribute (often called class or label) based
on several input attributes of the ExampleSet. In RapidMiner an attribute with the label role is predicted by the Decision Tree operator. Each interior node of the tree corresponds to one of the input
attributes. The number of edges of a nominal interior node is equal to the number of possible
values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled
with disjoint ranges. Each leaf node represents a value of the label attribute given the values of
the input attributes represented by the path from the root to the leaf. This description can be
easily understood by studying the attached Example Process.
Decision Trees are generated by recursive partitioning. Recursive partitioning means repeatedly splitting on the values of attributes. In every recursion the algorithm performs the following
steps:
• An attribute A is selected to split on. Making a good choice of attribute to split on at each
stage is crucial to generating a useful tree. The attribute is chosen according to a
selection criterion, which is set by the criterion parameter.
• Examples in the ExampleSet are sorted into subsets, one for each value of the attribute
A in case of a nominal attribute. In case of numerical attributes, subsets are formed for
disjoint ranges of attribute values.
• A tree is returned with one edge or branch for each subset. Each branch has a descendant
subtree or a label value produced by applying the same algorithm recursively.
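The three steps above can be sketched in a few lines for nominal attributes. This is a simplified illustration, not the operator's code: it assumes information gain as the selection criterion, stops only on pure subsets or when no attributes remain, and uses hypothetical names and toy data.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def partition(rows, attr):
    """Sort the rows into subsets, one per value of the nominal attribute."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attr]].append(row)
    return subsets

def gain(rows, attr, target):
    before = entropy([r[target] for r in rows])
    after = sum(len(s) / len(rows) * entropy([r[target] for r in s])
                for s in partition(rows, attr).values())
    return before - after

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Halt on a pure subset or when no attribute is left: return a leaf label.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))      # step 1: select A
    rest = [a for a in attrs if a != best]
    return {best: {value: build_tree(subset, rest, target)      # step 3: recurse
                   for value, subset in partition(rows, best).items()}}  # step 2

# Hypothetical toy data, not taken from the manual.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
```

Interior nodes are represented as nested dictionaries keyed by attribute name and leaves as plain label values; numerical attributes would additionally need a step that chooses the disjoint ranges.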
In general, the recursion stops when all the examples or instances have the same label value,
i.e. the subset is pure. Recursion may also stop if most of the examples have the same label value.
This is a generalization of the first approach with some error threshold. However, there are other
halting conditions such as:
• There are fewer than a certain number of instances or examples in the current subtree. This
can be adjusted by using the minimal size for split parameter.
• No attribute reaches a certain threshold. This can be adjusted by using the minimal gain
parameter.
• The maximal depth is reached. This can be adjusted by using the maximal depth parameter.
Pruning is a technique in which leaf nodes that do not add to the discriminative power of the
decision tree are removed. This is done to convert an over-specific or over-fitted tree to a more
general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type
of pruning performed in parallel with the tree creation process. Post-pruning, on the other hand, is
done after the tree creation process is complete.
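The confidence parameter of this operator is described as the confidence level of a pessimistic error calculation. One common textbook form of such a calculation, used by C4.5-style pruning, is the upper confidence bound on a leaf's error rate sketched below. Whether RapidMiner uses exactly this formula is an assumption, and the default of 0.25 is only C4.5's conventional value.

```python
from math import sqrt
from statistics import NormalDist

def pessimistic_error(errors, n, confidence=0.25):
    """Upper confidence bound on the true error rate of a leaf.

    errors: misclassified training examples in the leaf
    n: total training examples in the leaf
    confidence: one-sided confidence level (assumed default, see lead-in)
    """
    f = errors / n                                   # observed error rate
    z = NormalDist().inv_cdf(1.0 - confidence)       # matching normal quantile
    numerator = (f + z * z / (2 * n)
                 + z * sqrt(f / n - f * f / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)

# A leaf with 2 errors out of 6 examples looks worse than its observed 33 %.
print(round(pessimistic_error(2, 6), 2))
# The same observed rate on 60 examples is penalised far less.
print(round(pessimistic_error(20, 60), 2))
```

In this scheme a subtree is replaced by a leaf when the leaf's pessimistic error is no higher than the combined pessimistic error of the subtree's leaves, so a smaller confidence value leads to heavier pruning.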
Differentiation
• CHAID The CHAID operator works exactly like the Decision Tree operator with one exception: it uses a chi-squared based criterion instead of the information gain or gain ratio criteria. Moreover this operator cannot be applied on ExampleSets with numerical attributes.
See page 414 for details.
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input.
Output Ports
model (mod) The Decision Tree is delivered from this output port. This classification model
can now be applied on unseen data sets for the prediction of the label attribute.
example set (exa) The ExampleSet that was given as input is passed unchanged to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
criterion (selection) Selects the criterion by which attributes are chosen for splitting. It
can have one of the following values:
• information_gain The entropy of all the attributes is calculated. The attribute with
minimum entropy is selected for the split. This method has a bias towards selecting attributes with a large number of values.
• gain_ratio This is a variant of information gain. It adjusts the information gain of each
attribute to allow for the breadth and uniformity of the attribute values.
• gini_index This is a measure of the impurity of an ExampleSet. Splitting on a chosen
attribute gives a reduction in the average gini index of the resulting subsets.
• accuracy The attribute that maximizes the accuracy of the whole tree is selected for the
split.
maximal depth (integer) The depth of a tree varies depending upon the size and nature of the
ExampleSet. This parameter is used to restrict the size of the Decision Tree. The tree generation process is not continued when the tree depth is equal to the maximal depth. If its
value is set to ‘-1’, the maximal depth parameter puts no bound on the depth of the tree, and a
tree of maximum depth is generated. If its value is set to ‘1’, a tree with a single node is
generated.
apply pruning (boolean) By default the Decision Tree is generated with pruning. Setting this
parameter to false disables the pruning and delivers an unpruned Tree.
confidence (real) This parameter specifies the confidence level used for the pessimistic error
calculation of pruning.
apply prepruning (boolean) By default the Decision Tree is generated with prepruning. Setting this parameter to false disables the prepruning and delivers a tree without any prepruning.
minimal gain (real) The gain of a node is calculated before splitting it. The node is split if its
gain is greater than the minimal gain. A higher value of minimal gain results in fewer splits
and thus a smaller tree. Too high a value will completely prevent splitting, and a tree with
a single node is generated.
minimal leaf size (integer) The size of a leaf node is the number of examples in its subset.
The tree is generated in such a way that every leaf node subset contains at least
minimal leaf size examples.
minimal size for split (integer) The size of a node is the number of examples in its subset.
The size of the root node is equal to the total number of examples in the ExampleSet. Only
those nodes are split whose size is greater than or equal to the minimal size for split parameter.
number of prepruning alternatives (integer) As prepruning runs in parallel with the tree generation process, it may prevent splitting at certain nodes when splitting at that node does
not add to the discriminative power of the entire tree. In such a case alternative nodes
are tried for splitting. This parameter adjusts the number of alternative nodes tried for
splitting when a split is prevented by prepruning at a certain node.
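The size and gain thresholds described above combine into a simple test that a candidate split has to pass. The sketch below is one reading of the parameter descriptions, not the operator's code. The function name is hypothetical, and the default values are illustrative only (0.1 and 4 are the settings used in the tutorial process, 2 is assumed).

```python
def may_split(node_size, gain, child_sizes,
              minimal_gain=0.1, minimal_leaf_size=2, minimal_size_for_split=4):
    """Return True if a candidate split passes all three thresholds."""
    # Only nodes with at least minimal_size_for_split examples are split.
    if node_size < minimal_size_for_split:
        return False
    # The gain of the split must be greater than minimal_gain.
    if gain <= minimal_gain:
        return False
    # Every resulting subset must keep at least minimal_leaf_size examples.
    return all(size >= minimal_leaf_size for size in child_sizes)

print(may_split(14, 0.25, [5, 4, 5]))                              # passes
print(may_split(14, 0.25, [5, 4, 5], minimal_size_for_split=20))   # node too small
print(may_split(14, 0.05, [5, 4, 5]))                              # gain too low
print(may_split(14, 0.25, [13, 1]))                                # leaf too small
```

Raising any of the three thresholds makes the test stricter and therefore yields a smaller tree.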
Related Documents
• CHAID (page 414)
Tutorial Processes
Getting started with Decision Trees
Figure 4.10: Tutorial process ‘Getting started with Decision Trees’.
The ‘Golf’ data set is retrieved using the Retrieve operator. Then the Decision Tree operator
is applied on it. Click on the Run button to go to the Results Workspace. First take a look at the
resultant tree. The basic terminology of trees is explained here using this resultant tree.
The first node of the tree is called the root of the tree; ‘Outlook’ is the root in this case. As you
can see from this tree, trees grow downwards. This example clearly shows why interpretation of
data is easy through trees. A glance at this tree shows that whenever
the ‘Outlook’ attribute value is ‘overcast’, the ‘Play’ attribute will have the value ‘yes’. Similarly,
whenever the ‘Outlook’ attribute value is ‘rain’ and the ‘Wind’ attribute has value ‘false’, then
the ‘Play’ attribute will have the value ‘yes’. The Decision Tree operator predicts the value of
an attribute with the label role; in this example the ‘Play’ attribute is predicted. The nodes that
do not have child nodes are called the leaf nodes. All non-leaf nodes correspond to one of
the input attributes. In this example the ‘Outlook’, ‘Wind’ and ‘Humidity’ attributes are non-leaf nodes. The number of edges of a nominal interior node is equal to the number of possible
values of the corresponding input attribute. The ‘Outlook’ attribute has three possible values:
‘overcast’, ‘rain’ and ‘sunny’; thus it has three outgoing edges. Similarly the ‘Wind’ attribute
has two outgoing edges. As ‘Humidity’ is a numerical attribute, its outgoing edges are labeled
with disjoint ranges. Each leaf node represents a value of the label given the values of the input
attributes represented by the path from the root to the leaf. That is why all leaf nodes assume
possible values of the label attribute, i.e. ‘yes’ or ‘no’.
In this Example Process ‘Gain ratio’ is used as the selection criterion. However, using any
other criterion on this ExampleSet produces the same tree, because this is a very simple
data set. On large and complex data sets different selection criteria may produce different
trees.
When the tree is split on the ‘Outlook’ attribute, it is divided into three subtrees, one for
each value of the ‘Outlook’ attribute. The ‘overcast’ subtree is pure, i.e. all its label values are the
same (‘yes’), thus it is not split again. The ‘rain’ subtree and the ‘sunny’ subtree are split again
in the next iteration. In this Example Process the minimal size for split parameter is set to 4. Set
it to 10 and run the process again. You will see that you get a tree with a single node. This is
because this time the nodes with size less than 10 cannot be split. As the size of all subtrees of
the ‘Outlook’ attribute is less than 10, no splitting takes place, and we get a tree with just
a single node.
The minimal gain parameter is set to 0.1. In general, if you want to reduce the size of the tree,
you can increase the minimal gain. Similarly, if you want to increase the size of your tree, you
can reduce the value of the minimal gain parameter. The maximal depth parameter is set to
20. The actual depth of the resultant tree is 3. You can set an upper bound on the depth of the tree
using the maximal depth parameter. Prepruning is enabled in this Example Process. To disable
it, uncheck the apply prepruning parameter. Now click the Run button. The resultant tree is much
more complex than the previous one. The previous tree was more useful and comprehensible
than this one.
Decision Tree (Multiway)
This operator generates a multiway decision tree.
Description
The Decision Tree (Multiway) operator is a nested operator i.e. it has a subprocess. The subprocess must have a Tree learner i.e. an operator that expects an ExampleSet and generates a
Tree model. You need to have basic understanding of subprocesses in order to apply this operator. Please study the documentation of the Subprocess operator for basic understanding of
subprocesses.
If we have only categorical attributes, we can use any C4.5-like algorithm in order to obtain
a multi-way decision tree, although we will usually obtain a binary tree if our dataset includes
continuous attributes. Using binary splits on numerical attributes implies that the attributes involved should be able to appear several times in the paths from the root of the tree to its leaves.
Although these repetitions can be simplified when converting the decision tree into a set of rules,
they make the constructed tree more leafy, unnecessarily deeper, and harder to understand for
human experts. Non-binary splits on continuous attributes make the trees easier to understand and also seem to lead to more accurate trees in some domains.
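The difference between the two kinds of split on a numerical attribute can be shown with a toy example. The sketch assumes that the cut points are already known (how the operator chooses them is not covered here); the thresholds, values and function names are hypothetical.

```python
from bisect import bisect_right

def multiway_branch(value, cut_points):
    """Return the index of the range that contains value.

    With sorted cut points c0 < c1 < ..., index 0 means value < c0,
    index i means c(i-1) <= value < c(i), and the last index means
    value >= the largest cut point.
    """
    return bisect_right(cut_points, value)

def nested_binary_branch(value):
    """The same three ranges built from binary splits: the attribute is tested twice."""
    if value < 70:
        return 0
    return 1 if value < 80 else 2

cut_points = [70, 80]                  # hypothetical thresholds
humidity = [65, 70, 78, 80, 96]        # hypothetical attribute values
print([multiway_branch(h, cut_points) for h in humidity])
print([nested_binary_branch(h) for h in humidity])
```

Both functions assign every value to the same range, but the multiway version does so in a single node with three outgoing edges, whereas the binary version needs two nodes on the same attribute along one path.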
Compared with other approaches, the representation of the data as a tree has the advantage of
being meaningful and easy to interpret. The goal is to create a classification model that predicts
the value of the label based on several input attributes of the ExampleSet. Each interior node of
the tree corresponds to one of the input attributes. The number of edges of an interior node is equal
to the number of possible values of the corresponding input attribute. Each leaf node represents
a value of the label given the values of the input attributes represented by the path from the root
to the leaf. This description can be easily understood by studying the Example Process of the
Decision Tree operator.
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input.
Output Ports
model (mod) The Decision Tree is delivered from this output port. This classification model
can now be applied on unseen data sets for the prediction of the label attribute.
example set (exa) The ExampleSet that was given as input is passed unchanged to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Tutorial Processes
Introduction to the Decision Tree (Multiway) operator
Figure 4.11: Tutorial process ‘Introduction to the Decision Tree (Multiway) operator’.
The Golf data set is loaded using the Retrieve operator. A breakpoint is inserted here so that
you can have a look at the ExampleSet. The Decision Tree (Multiway) operator is applied on
this ExampleSet. The Decision Tree operator is applied in the subprocess of the Decision Tree
(Multiway) operator. The resultant Tree is connected to the result port of the process and it can
be seen in the Results Workspace.
Decision Tree (Weight-Based)
This operator generates a pruned decision tree based on an arbitrary attribute relevance test. The attribute weighting scheme
should be provided as an inner operator. This operator can be applied
only on ExampleSets with nominal data.
Description
The Decision Tree (Weight-Based) operator is a nested operator i.e. it has a subprocess. The
subprocess must have an attribute weighting scheme i.e. an operator that expects an ExampleSet
and generates attribute weights. You need to have basic understanding of subprocesses in order
to apply this operator. Please study the documentation of the Subprocess operator for basic
understanding of subprocesses.
The Decision Tree (Weight-Based) operator works exactly like the Decision Tree operator with
one exception: it uses an arbitrary attribute relevance test criterion instead of the information
gain or gain ratio criteria. Moreover this operator cannot be applied on ExampleSets with numerical attributes. It is recommended that you study the documentation of the Decision Tree
operator for basic understanding of decision trees.
If the Weight by Chi Squared Statistic operator is supplied for attribute weighting, this operator acts as the CHAID operator. CHAID stands for CHi-squared Automatic Interaction Detection.
The chi-square statistic is a nonparametric statistical technique used to determine if a distribution of observed frequencies differs from the theoretical expected frequencies. Chi-square
statistics use nominal data, thus instead of using means and variances, this test uses frequencies. CHAID’s advantages are that its output is highly visual and easy to interpret. Because it
uses multiway splits by default, it needs rather large sample sizes to work effectively, since with
small sample sizes the respondent groups can quickly become too small for reliable analysis.
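To make the comparison of observed and expected frequencies concrete, the sketch below computes the Pearson chi-squared statistic for the contingency table of one nominal attribute against the label. It uses the standard textbook formula and is not taken from the Weight by Chi Squared Statistic operator; the counts are assumed from the classic 14-example play-golf data.

```python
def chi_squared(table):
    """Pearson chi-squared statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Frequency expected if attribute and label were independent.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# Rows: attribute values (sunny, overcast, rain); columns: label counts (yes, no).
outlook = [[2, 3], [4, 0], [3, 2]]
print(round(chi_squared(outlook), 3))

# An attribute that is independent of the label scores zero.
print(chi_squared([[10, 10], [20, 20]]))
```

A larger statistic indicates a stronger dependence between the attribute and the label, which is why it can serve as an attribute weight. With only 14 examples most expected cell counts are small, which illustrates the remark about sample size.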
Compared with other approaches, the representation of the data as a tree has the advantage of
being meaningful and easy to interpret. The goal is to create a classification model that predicts
the value of the label based on several input attributes of the ExampleSet. Each interior node
of the tree corresponds to one of the input attributes. The number of edges of an interior node
is equal to the number of possible values of the corresponding input attribute. Each leaf node
represents a value of the label given the values of the input attributes represented by the path
from the root to the leaf. This description can be easily understood by studying the Example
Process of the Decision Tree operator.
Pruning is a technique in which leaf nodes that do not add to the discriminative power of the
decision tree are removed. This is done to convert an over-specific or over-fitted tree to a more
general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type
of pruning performed in parallel with the tree creation process. Post-pruning, on the other hand, is
done after the tree creation process is complete.
Differentiation
• CHAID If the Weight by Chi Squared Statistic operator is applied for attribute weighting
in the subprocess of the Decision Tree (Weight-Based) operator, it works exactly like the
CHAID operator. See page 414 for details.
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Generate Nominal Data operator in the attached Example Process. The output of other operators can also
be used as input. This operator cannot handle numerical data, therefore the ExampleSet
should not have numerical attributes.
Output Ports
model (mod) The Decision Tree is delivered from this output port. This classification model
can now be applied on unseen data sets for the prediction of the label attribute.
Parameters
minimal size for split (integer) The size of a node in a Tree is the number of examples in its
subset. The size of the root node is equal to the total number of examples in the ExampleSet. Only those nodes are split whose size is greater than or equal to the minimal size for
split parameter.
minimal leaf size (integer) The size of a leaf node in a tree is the number of examples in its
subset. The tree is generated in such a way that every leaf node subset contains at least
minimal leaf size examples.
maximal depth (integer) The depth of a tree varies depending upon the size and nature of the
ExampleSet. This parameter is used to restrict the size of the Decision Tree. The tree generation process is not continued when the tree depth is equal to the maximal depth. If its
value is set to ‘-1’, the maximal depth parameter puts no bound on the depth of the tree, and a
tree of maximum depth is generated. If its value is set to ‘1’, a tree with a single node is
generated.
confidence (real) This parameter specifies the confidence level used for the pessimistic error
calculation of pruning.
no pruning (boolean) By default the Decision Tree is generated with pruning. Setting this
parameter to true disables the pruning and delivers an unpruned Tree.
number of prepruning alternatives (integer) As prepruning runs in parallel with the tree generation process, it may prevent splitting at certain nodes when splitting at that node does
not add to the discriminative power of the entire tree. In such a case alternative nodes
are tried for splitting. This parameter adjusts the number of alternative nodes tried for
splitting when the split is prevented by prepruning at a certain node.
Related Documents
• CHAID (page 414)
Tutorial Processes
Introduction to the Decision Tree (Weight-Based) operator
Figure 4.12: Tutorial process ‘Introduction to the Decision Tree (Weight-Based) operator’.
The Generate Nominal Data operator is used for generating an ExampleSet with 100 examples.
There are three nominal attributes in the ExampleSet and every attribute has three possible values. A breakpoint is inserted here so that you can have a look at the ExampleSet. The Decision
Tree (Weight-Based) operator is applied on this ExampleSet with default values of all parameters. The resultant model is connected to the result port of the process and it can be seen in the
Results Workspace.
Gradient Boosted Trees
Executes the GBT algorithm using H2O 3.8.2.6.
Description
Please note that the result of this algorithm may depend on the number of threads used. Different
settings may lead to slightly different outputs.
A gradient boosted model is an ensemble of either regression or classification tree models.
Both are forward-learning ensemble methods that obtain predictive results through gradually
improved estimations. Boosting is a flexible nonlinear regression procedure that helps improve the accuracy of trees. By sequentially applying weak classification algorithms to the incrementally changed data, a series of decision trees is created that produces an ensemble of weak
prediction models. While boosting trees increases their accuracy, it also decreases speed and
human interpretability. The gradient boosting method generalizes tree boosting to minimize
these issues.
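The idea of sequentially fitting weak trees to incrementally changed data can be sketched for the simplest case: regression with squared error on a single numerical attribute, where each weak learner is a one-split tree fitted to the current residuals. This is a didactic sketch and not H2O's algorithm; the function names are hypothetical, and the defaults merely mirror the number of trees and learning rate defaults of this operator.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regression stump under squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for t in sorted(set(xs))[:-1]:          # skip the maximum so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_gbt(xs, ys, number_of_trees=20, learning_rate=0.1):
    base = sum(ys) / len(ys)                # initial constant prediction
    trees = []
    preds = [base] * len(ys)
    for _ in range(number_of_trees):
        # For squared error the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # Each tree only contributes a fraction (the learning rate) of its prediction.
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

# Hypothetical toy data with one step in the target.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = fit_gbt(xs, ys)
print([round(model(x), 2) for x in xs])
```

Every additional tree moves the predictions a little closer to the targets, so a smaller learning rate needs more trees to reach the same training error.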
The operator starts a 1-node local H2O cluster and runs the algorithm on it. Although it
uses one node, the execution is parallel. You can set the level of parallelism by changing the
Settings/Preferences/General/Number of threads setting. By default it uses the recommended
number of threads for the system. Only one instance of the cluster is started and it remains
running until you close RapidMiner Studio.
Input Ports
training set (tra) The input port expects a labeled ExampleSet.
Output Ports
model (mod) The Gradient Boosted classification or regression model is delivered from this
output port. This classification or regression model can be applied on unseen data sets for
prediction of the label attribute.
example set (exa) The ExampleSet that was given as input is passed unchanged to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
weights (wei) This port delivers the weights of the attributes with respect to the label attribute.
Parameters
number of trees (integer) A non-negative integer that defines the number of trees. The default is 20.
reproducible (boolean) Makes model building reproducible. If set, the maximum_number_of_threads parameter controls the parallelism level of model building. If this is not set, the
parallelism level is defined by the number of threads in the General Preferences.
maximum number of threads (integer) Controls parallelism level of model building.
use local random seed (boolean) Available only if reproducible is set to true. Indicates if a
local random seed should be used for randomization.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
maximal depth (integer) The user-defined tree depth. The default is 5.
min rows (real) The minimum number of rows to assign to the terminal nodes. The default
is 10.0. If a weight column is specified, the number of rows is also weighted. E.g. if a
terminal node contains two rows with the weights 0.3 and 0.4, it is counted as 0.7 in the
minimum number of rows.
min split improvement (real) Minimum relative improvement in squared error reduction for
a split to happen.
number of bins (integer) For numerical columns (real/integer), build a histogram of at least
the specified number of bins, then split at the best point. The default is 20.
learning rate (real) The learning rate. Smaller learning rates lead to better models, but
at the price of increased computational time during both training and scoring:
a lower learning rate requires more iterations. The default is 0.1 and the range is 0.0 to 1.0.
sample rate (real) Row sample rate per tree (from 0.0 to 1.0).
distribution (selection) The distribution function for the training data. For some functions
(e.g. tweedie) further tuning can be achieved via the expert parameters.
• AUTO Automatic selection. Uses multinomial for nominal and gaussian for numeric
labels.
• bernoulli Bernoulli distribution. Can be used for binominal or 2-class polynominal
labels.
• gaussian, poisson, gamma, tweedie, quantile Distribution functions for regression.
early stopping (boolean) If true, the parameters for early stopping need to be specified.
stopping rounds (integer) Early stopping based on convergence of stopping_metric. Stop if
the simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events. This parameter is visible only if early_stopping is set.
stopping metric (selection) Metric to use for early stopping. Set stopping_tolerance to tune
it. This parameter is visible only if early_stopping is set.
• AUTO Automatic selection. Uses logloss for classification, deviance for regression.
• deviance, logloss, MSE, AUC, lift_top_group, r2, misclassification The metric to
use to decide if the algorithm should be stopped.
stopping tolerance (real) Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). This parameter is visible only if early_stopping
is set.
max runtime seconds (integer) Maximum allowed runtime in seconds for model training.
Use 0 to disable.
expert parameters (enumeration) These parameters are for fine tuning the algorithm. Usually the default values provide a decent model, but in some cases it may be useful to change
them. Please use true/false values for boolean parameters and the exact attribute name
for columns. Arrays can be provided by splitting the values with the comma (,) character.
More information on the parameters can be found in the H2O documentation.
• score_each_iteration Whether to score during each iteration of model training. Type:
boolean, Default: false
• score_tree_interval Score the model after every so many trees. Disabled if set to 0.
Type: integer, Default: 0
• fold_assignment Cross-validation fold assignment scheme, if fold_column is not
specified. Options: AUTO, Random, Modulo, Stratified. Type: enumeration, Default:
AUTO
• fold_column Column name with cross-validation fold index assignment per observation. Type: column, Default: no fold column
• offset_column Offset column name. Type: column, Default: no offset column
• balance_classes Balance training data class counts via over/under-sampling (for imbalanced data). Type: boolean, Default: false
• max_after_balance_size Maximum relative size of the training data after balancing
class counts (can be less than 1.0). Requires balance_classes. Type: real, Default: 5.0
• max_confusion_matrix_size Maximum size (# classes) for confusion matrices to be
printed in the Logs. Type: integer, Default: 20
• nbins_top_level For numerical columns (real/int), build a histogram of (at most) this
many bins at the root level, then decrease by factor of two per level. Type: integer,
Default: 1024
• nbins_cats For categorical columns (factors), build a histogram of this many bins,
then split at the best point. Higher values can lead to more overfitting. Type: integer,
Default: 1024
• r2_stopping Stop making trees when the R^2 metric equals or exceeds this. Type:
double, Default: 0.999999
• quantile_alpha Desired quantile for quantile regression (from 0.0 to 1.0). Type: double, Default: 0.5
• tweedie_power Tweedie Power (between 1 and 2). Type: double, Default: 1.5
• col_sample_rate Column sample rate (from 0.0 to 1.0). Type: double, Default: 1.0
• col_sample_rate_per_tree Column sample rate per tree (from 0.0 to 1.0). Type: double, Default: 1.0
• keep_cross_validation_predictions Keep cross-validation model predictions. Type:
boolean, Default: false
• keep_cross_validation_fold_assignment Keep cross-validation fold assignment. Type:
boolean, Default: false
• class_sampling_factors Desired over/under-sampling ratios per class (in lexicographic
order). If not specified, sampling factors will be automatically computed to obtain
class balance during training. Requires balance_classes=true. Type: float array, Default: empty
• learn_rate_annealing Scale down the learning rate by this factor after each tree.
Type: double, Default: 1.0
• sample_rate_per_class Row sample rate per tree per class (from 0.0 to 1.0). Type:
double array, Default: empty
• col_sample_rate_change_per_level Relative change of the column sampling rate for
every level (from 0.0 to 2.0). Type: double, Default: 1.0
• max_abs_leafnode_pred Maximum absolute value of a leaf node prediction. Type:
double, Default: Infinity
• nfolds Number of folds for cross-validation. Use 0 to turn off cross-validation. Type:
integer, Default: 0
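The effect of learn_rate_annealing and nbins_top_level can be illustrated with a little arithmetic. The following is a minimal sketch based only on the parameter descriptions above, not on the H2O implementation; the function names and the lower bound min_bins are assumptions made for illustration.

```python
def annealed_learning_rates(learn_rate, learn_rate_annealing, n_trees):
    """Learning rate used for each tree when the rate is scaled down after every tree."""
    return [learn_rate * learn_rate_annealing ** t for t in range(n_trees)]


def bins_per_level(nbins_top_level, n_levels, min_bins=1):
    """Histogram bin budget per tree level, halved at each level below the root.

    min_bins is a hypothetical lower bound, not taken from the parameter list.
    """
    return [max(min_bins, nbins_top_level // 2 ** level) for level in range(n_levels)]


# Illustrative values: learning rate 0.3, annealing factor 0.9, 4 trees.
rates = annealed_learning_rates(0.3, 0.9, 4)
levels = bins_per_level(1024, 5)
print([round(r, 4) for r in rates])  # [0.3, 0.27, 0.243, 0.2187]
print(levels)                        # [1024, 512, 256, 128, 64]
```

With the default learn_rate_annealing of 1.0 every tree uses the same learning rate.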
Tutorial Processes
Classification using GBT
Process diagram (operators): Retrieve Deals, Retrieve Deals-T..., Gradient Booste..., Apply Model, Performance
Figure 4.13: Tutorial process ‘Classification using GBT’.
The H2O GBT operator is used to predict the future_customer attribute of the Deals sample
dataset. Since the label is nominal, classification is performed. The GBT parameters are
slightly changed: the number of trees is decreased to 10 to lower the execution time and to prevent overfitting, and the learning rate is increased to 0.3 for similar reasons. The resulting model is
connected to an Apply Model operator that applies the GBT model to the Deals_Testset sample
data. The labeled ExampleSet is connected to a Performance (Binominal Classification) operator, which calculates the Accuracy metric. The Performance Vector and the
Gradient Boosted Model are shown at the process output. The trees of the Gradient Boosted Model can be inspected in
the Results view.
Classification with Split Validation using GBT
Process diagram (operators): Retrieve Iris, Validation
Figure 4.14: Tutorial process ‘Classification with Split Validation using GBT’.
The H2O GBT operator is used to predict the label attribute of the Iris sample dataset. Since
the label is polynominal, classification is performed. The learner operator is placed inside a Split
Validation operator so that the performance of the classification can be checked. The number of trees is
set to 10; all other parameters are kept at their default values. The Performance (Classification)
operator delivers the accuracy and the classification error. The model contains 30 trees, because
H2O creates 10 trees for every unique label value.
Regression using GBT
Process diagram (operators): Retrieve Polyno..., Split Data, Gradient Booste..., Apply Model
Figure 4.15: Tutorial process ‘Regression using GBT’.
The H2O GBT operator is used to predict the label attribute of the Polynomial sample dataset.
Since the label is real, regression is performed. The sample data is retrieved, then split into
two parts with the Split Data operator. The first output is used as the training data set, the second as
the scoring data set. The GBT operator’s distribution parameter is changed to “gamma”. After
the model is applied to the scoring ExampleSet, the output contains the GradientBoostedModel and the
labeled data. If you select the Charts/Series chart style for the labeled data and choose label and
prediction label in the Plot Series field, you can check the accuracy of the prediction visually.
ID3
This operator learns an unpruned Decision Tree from nominal data
for classification. This decision tree learner works similarly to Quinlan’s ID3.
Description
ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan for generating a decision
tree. ID3 is the precursor to the C4.5 algorithm. Very simply, ID3 builds a decision tree
from a fixed set of examples. The resulting tree is used to classify future samples. The examples
of the given ExampleSet have several attributes and every example belongs to a class (like yes
or no). The leaf nodes of the decision tree contain the class name whereas a non-leaf node is
a decision node. A decision node is an attribute test with each branch (to another decision
tree) being a possible value of the attribute. ID3 uses a feature selection heuristic to help it decide
which attribute goes into a decision node. The required heuristic can be selected by the criterion
parameter.
The ID3 algorithm can be summarized as follows:
1. Take all unused attributes and calculate their selection criterion (e.g. information gain)
2. Choose the attribute for which the selection criterion has the best value (e.g. minimum
entropy or maximum information gain)
3. Make a node containing that attribute
ID3 searches through the attributes of the training instances and extracts the attribute that
best separates the given examples. If the attribute perfectly classifies the training set then ID3
stops; otherwise it recursively operates on the n (where n = number of possible values of an
attribute) partitioned subsets to get their best attribute. The algorithm uses a greedy search,
meaning it picks the best attribute and never looks back to reconsider earlier choices.
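The steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes information gain as the selection criterion and a made-up six-row data set; it is not the code of the ID3 operator, and all function and attribute names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attributes):
    """Greedy, recursive tree construction over nominal attributes."""
    # Stop when the subset is pure or no unused attribute is left.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1 and 2: score all unused attributes and pick the best one.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    # Step 3: make a node and recurse on one partition per attribute value.
    node = {"attribute": best, "branches": {}}
    for value in sorted(set(row[best] for row in rows)):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = id3(
            [rows[i] for i in keep], [labels[i] for i in keep], remaining
        )
    return node


# Made-up nominal data, for illustration only.
rows = [
    {"outlook": "sunny", "wind": "weak"},
    {"outlook": "sunny", "wind": "strong"},
    {"outlook": "overcast", "wind": "weak"},
    {"outlook": "overcast", "wind": "strong"},
    {"outlook": "rain", "wind": "weak"},
    {"outlook": "rain", "wind": "strong"},
]
labels = ["no", "no", "yes", "yes", "yes", "no"]
tree = id3(rows, labels, ["outlook", "wind"])
print(tree)
```

In this example outlook is chosen at the root because it has the higher information gain; only the rain branch is impure and is split again on wind.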
Some major benefits of ID3 are:
• Understandable prediction rules are created from the training data.
• A short tree is built in a relatively short time.
• It only needs to test enough attributes until all data is classified.
• Finding leaf nodes enables test data to be pruned, reducing the number of tests.
ID3 also has some disadvantages, e.g.:
• Data may be over-fitted or over-classified if a small sample is tested.
• Only one attribute at a time is tested for making a decision.
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Generate Nominal Data operator in the attached Example Process. The ID3 operator cannot handle numerical attributes. The output of other operators can also be used as input.
Output Ports
model (mod) The Decision Tree is delivered from this output port. This classification model
can now be applied on unseen data sets for the prediction of the label attribute.
example set (exa) The ExampleSet that was given as input is passed without changing to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
criterion (selection) This parameter specifies the criterion on which attributes will be selected
for splitting. It can have one of the following values:
• information_gain The entropy of all the attributes is calculated, and the attribute with
the minimum entropy is selected for the split. This method has a bias towards selecting attributes with a large number of values.
• gain_ratio A variant of information gain. It adjusts the information gain for each
attribute to allow for the breadth and uniformity of the attribute values.
• gini_index A measure of the impurity of an ExampleSet. Splitting on a chosen
attribute gives a reduction in the average gini index of the resulting subsets.
• accuracy The attribute that maximizes the accuracy of the whole tree is selected
for the split.
minimal size for split (integer) The size of a node is the number of examples in its subset.
The size of the root node is equal to the total number of examples in the ExampleSet. Only
those nodes are split whose size is greater than or equal to the minimal size for split parameter.
minimal leaf size (integer) The size of a leaf node is the number of examples in its subset.
The tree is generated in such a way that every leaf node subset has at least the minimal leaf
size number of instances.
minimal gain (real) The gain of a node is calculated before splitting it. The node is split if its
gain is greater than the minimal gain. A higher value of minimal gain results in fewer splits
and thus a smaller tree. Too high a value will completely prevent splitting, and a tree with
a single node is generated.
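To make the criterion values and the three size and gain thresholds more concrete, the sketch below scores one candidate split and then checks it against the thresholds. It uses the textbook definitions of information gain, gain ratio and gini index; whether the operator computes them in exactly this way is an assumption, and the function names and threshold values are chosen for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_scores(parent, children):
    """Return (information gain, gain ratio, gini reduction) for one candidate split."""
    total = len(parent)
    weights = [len(child) / total for child in children]
    info_gain = entropy(parent) - sum(w * entropy(c) for w, c in zip(weights, children))
    # Split information penalizes splits into many small subsets.
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    gini_reduction = gini(parent) - sum(w * gini(c) for w, c in zip(weights, children))
    return info_gain, gain_ratio, gini_reduction


def may_split(parent, children, gain, minimal_size_for_split, minimal_leaf_size, minimal_gain):
    """Apply the three threshold parameters as they are described above."""
    return (
        len(parent) >= minimal_size_for_split
        and all(len(child) >= minimal_leaf_size for child in children)
        and gain > minimal_gain
    )


# A node with 8 examples and a candidate split into two subsets of 4.
parent = ["yes"] * 4 + ["no"] * 4
children = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]]
info_gain, gain_ratio, gini_reduction = split_scores(parent, children)
print(round(info_gain, 4), round(gain_ratio, 4), round(gini_reduction, 4))

# Illustrative threshold values, not necessarily the operator defaults.
print(may_split(parent, children, info_gain, 4, 2, 0.1))  # True
print(may_split(parent, children, info_gain, 4, 2, 0.2))  # False
```

Raising minimal gain from 0.1 to 0.2 rejects this split, which is how a higher value leads to a smaller tree.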
Tutorial Processes
Getting started with ID3
To understand the basic terminology of trees, please study the Example Process of the Decision
Tree operator.
The Generate Nominal Data operator is used for generating an ExampleSet with nominal attributes. It should be kept in mind that the ID3 operator cannot handle numerical attributes. A
breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that the
ExampleSet has three attributes and each attribute has three possible values. The ID3 operator
is applied on this ExampleSet with default values of all parameters. The resultant Decision Tree
model is delivered to the result port of the process and it can be seen in the Results Workspace.
Process diagram (operators): Generate Nomina..., ID3
Figure 4.16: Tutorial process ‘Getting started with ID3’.
Random Forest
This operator generates a specified number of random trees,
i.e. it generates a random forest. The resulting model is a voting
model of all the trees.
Description
The Random Forest operator generates a set of random trees. The random trees are generated in
exactly the same way as the Random Tree operator generates a tree. The resulting forest model
contains a specified number of random tree models. The number of trees parameter specifies the
required number of trees. The resulting model is a voting model of all the random trees. For
more information about random trees please study the Random Tree operator.
Compared with other approaches, representing the data in the form of a tree has the advantage of being meaningful and easy to interpret. The goal is to create a classification model
that predicts the value of a target attribute (often called class or label) based on several input
attributes of the ExampleSet. Each interior node of the tree corresponds to one of the input
attributes. The number of edges of a nominal interior node is equal to the number of possible
values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled
with disjoint ranges. Each leaf node represents a value of the label attribute given the values of
the input attributes represented by the path from the root to the leaf. For better understanding
of the structure of a tree please study the Example Process of the Decision Tree operator.
Pruning is a technique in which leaf nodes that do not add to the discriminative power of the
tree are removed. This is done to convert an over-specific or over-fitted tree to a more general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type
of pruning performed parallel to the tree creation process. Post-pruning, on the other hand, is
done after the tree creation process is complete.
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input.
Output Ports
model (mod) The Random Forest model is delivered from this output port. This model can be
applied on unseen data sets for the prediction of the label attribute. This model is a voting
model of all the random trees
example set (exa) The ExampleSet that was given as input is passed without changing to the
output through this port. This is usually used to reuse the same ExampleSet in further
operators or to view the ExampleSet in the Results Workspace.
Parameters
number of trees (integer) This parameter specifies the number of random trees to generate.
criterion (selection) Selects the criterion on which attributes will be selected for splitting. It
can have one of the following values:
• information_gain The entropy of all the attributes is calculated, and the attribute with
the minimum entropy is selected for the split. This method has a bias towards selecting attributes with a large number of values.
• gain_ratio A variant of information gain. It adjusts the information gain for each
attribute to allow for the breadth and uniformity of the attribute values.
• gini_index A measure of the impurity of an ExampleSet. Splitting on a chosen
attribute gives a reduction in the average gini index of the resulting subsets.
• accuracy The attribute that maximizes the accuracy of the whole tree is selected
for the split.
maximal depth (integer) The depth of a tree varies depending upon the size and nature of the
ExampleSet. This parameter is used to restrict the size of the Decision Tree. The tree generation process is not continued when the tree depth is equal to the maximal depth. If its
value is set to ‘-1’, the maximal depth parameter puts no bound on the depth of the tree and a
tree of maximum depth is generated. If its value is set to ‘1’, a tree with a single node is
generated.
apply prepruning (boolean) By default the Decision Tree is generated with prepruning. Setting this parameter to false disables the prepruning and delivers a tree without any prepruning.
minimal gain (real) The gain of a node is calculated before splitting it. The node is split if its
gain is greater than the minimal gain. A higher value of minimal gain results in fewer splits
and thus a smaller tree. Too high a value will completely prevent splitting, and a tree with
a single node is generated.
minimal leaf size (integer) The size of a leaf node is the number of examples in its subset.
The tree is generated in such a way that every leaf node subset has at least the minimal leaf
size number of instances.
minimal size for split (integer) The size of a node is the number of examples in its subset.
The size of the root node is equal to the total number of examples in the ExampleSet. Only
those nodes are split whose size is greater than or equal to the minimal size for split parameter.
number of prepruning alternatives (integer) As prepruning runs parallel to the tree generation process, it may prevent splitting at certain nodes when splitting at that node does
not add to the discriminative power of the entire tree. In such a case alternative nodes
are tried for splitting. This parameter adjusts the number of alternative nodes tried for
splitting when split is prevented by prepruning at a certain node.
apply pruning (boolean) By default the Decision Tree is generated with pruning. Setting this
parameter to false disables the pruning and delivers an unpruned Tree.
confidence (real) This parameter specifies the confidence level used for the pessimistic error
calculation of pruning.
guess subset ratio (boolean) If this parameter is set to true then log(m) + 1 attributes are
used, otherwise a ratio should be specified by the subset ratio parameter.
voting strategy (selection) Specifies the prediction strategy in case of dissenting tree model
predictions:
• confidence_vote Selects the class that has the highest accumulated confidence.
• majority_vote Selects the class that was predicted by the majority of tree models.
subset ratio (real) This parameter specifies the ratio of randomly chosen attributes to test.
use local random seed (boolean) This parameter indicates if a local random seed should be
used for randomization. Using the same value of local random seed will produce the same
randomization.
local random seed (integer) This parameter specifies the local random seed. This parameter
is only available if the use local random seed parameter is set to true.
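The two values of the voting strategy parameter can give different results when the trees disagree. The sketch below assumes that every tree reports one confidence per class and that its predicted class is the one with the highest confidence; the function names and confidence values are made up for illustration and do not come from the operator.

```python
from collections import Counter, defaultdict


def majority_vote(tree_predictions):
    """Pick the class predicted by most trees.

    tree_predictions is a list with one {class: confidence} dict per tree.
    """
    votes = Counter(max(conf, key=conf.get) for conf in tree_predictions)
    return votes.most_common(1)[0][0]


def confidence_vote(tree_predictions):
    """Pick the class with the highest confidence summed over all trees."""
    totals = defaultdict(float)
    for conf in tree_predictions:
        for label, value in conf.items():
            totals[label] += value
    return max(totals, key=totals.get)


# Three dissenting trees: two lean weakly towards "yes", one strongly towards "no".
predictions = [
    {"yes": 0.55, "no": 0.45},
    {"yes": 0.60, "no": 0.40},
    {"yes": 0.05, "no": 0.95},
]
print(majority_vote(predictions))    # yes
print(confidence_vote(predictions))  # no
```

Here majority_vote follows the two weakly confident trees, while confidence_vote lets the single highly confident tree outweigh them.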
Tutorial Processes
Generating a set of random trees using the Random Forest operator
The ‘Golf’ data set is loaded using the Retrieve operator. The Split Validation operator is applied
on it for training and testing a classification model. The Random Forest operator is applied in
the training subprocess of the Split Validation operator. The number of trees parameter is set to
10, thus this operator generates a set of 10 random trees. The resultant model is a voting model
of all the random trees. The Apply Model operator is used in the testing subprocess to apply this
model. The resultant labeled ExampleSet is used by the Performance operator for measuring the
performance of the model. The random forest model and its performance vector are connected
to the output and can be seen in the Results Workspace.
Process diagram (operators): Golf, Validation
Figure 4.17: Tutorial process ‘Generating a set of random trees using the Random Forest
operator’.
Random Tree
This operator learns a decision tree. This operator uses only a random subset of attributes for each split.
Description
The Random Tree operator works exactly like the Decision Tree operator with one exception:
for each split only a random subset of attributes is available. It is recommended that you study
the documentation of the Decision Tree operator for basic understanding of decision trees.
This operator learns decision trees from both nominal and numerical data. Decision trees are
powerful classification methods which can be easily understood. The Random Tree operator
works similarly to Quinlan’s C4.5 or CART, but it selects a random subset of attributes before the
learner is applied. The size of the subset is specified by the subset ratio parameter.
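Drawing the random attribute subset can be sketched as follows. This is only an illustration of the idea: the rounding, the base of the logarithm in the log(m) + 1 rule (mentioned for the guess subset ratio parameter of the Random Forest operator) and all function names are assumptions, and m is taken to be the number of available attributes.

```python
import math
import random


def subset_size(n_attributes, subset_ratio=None, log_base=math.e):
    """Number of attributes to draw for a split.

    If no ratio is given, the log(m) + 1 rule is used. The base of the
    logarithm is not stated in the manual, so it is left as a parameter.
    """
    if subset_ratio is None:
        size = int(math.log(n_attributes, log_base) + 1)
    else:
        size = int(round(subset_ratio * n_attributes))
    # Always draw at least one attribute and never more than are available.
    return max(1, min(n_attributes, size))


def random_attribute_subset(attributes, size, rng):
    """Draw size distinct attributes without replacement."""
    return rng.sample(attributes, size)


# Illustrative attribute names and seed.
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]
rng = random.Random(1992)
size = subset_size(len(attributes), subset_ratio=0.5)
subset = random_attribute_subset(attributes, size, rng)
print(size, subset)
```

Using a seeded random.Random instance mirrors the idea behind the use local random seed parameter: the same seed reproduces the same subsets.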
Compared with other approaches, representing the data as a tree has the advantage of being
meaningful and easy to interpret. The goal is to create a classification model that predicts the
value of the label based on several input attributes of the ExampleSet. Each interior node of the tree
corresponds to one of the input attributes. The number of edges of an interior node is equal to
the number of possible values of the corresponding input attribute. Each leaf node represents a
value of the label given the values of the input attributes represented by the path from the root
to the leaf. This description can be easily understood by studying the Example Process of the
Decision Tree operator.
Pruning is a technique in which leaf nodes that do not add to the discriminative power of the
decision tree are removed. This is done to convert an over-specific or over-fitted tree to a more
general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type
of pruning performed parallel to the tree creation process. Post-pruning, on the other hand, is
done after the tree creation process is complete.
Differentiation
• The Random Tree operator works exactly like the Decision Tree operator with one exception: for each split only a random subset of attributes is available. See the documentation of the Decision Tree operator for details.
Input Ports
training set (tra) This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as
input.
Output Ports
model (mod) The Random Tree is delivered from this output port. This classification model
can now be applied on unseen data sets for the prediction of the label attribute.
example set (exa) The ExampleSet that was given