CATEGORIZATION, ANALYSIS, AND VISUALIZATION OF COMPUTER-MEDIATED COMMUNICATION AND ELECTRONIC MARKETS
by
Ahmed Abbasi
__________________________
A Dissertation Submitted to the Committee on
BUSINESS ADMINISTRATION
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
WITH A MAJOR IN MANAGEMENT
In the Graduate College
THE UNIVERSITY OF ARIZONA
2008
THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE
As members of the Dissertation Committee, we certify that we have read the dissertation
prepared by Ahmed Abbasi entitled Categorization, Analysis, and Visualization of
Computer-Mediated Communication and Electronic Markets and recommend that it be
accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.
_________________________________________________ Date: 04/17/2008
Hsinchun Chen
_________________________________________________ Date: 04/17/2008
Jay F. Nunamaker, Jr.
_________________________________________________ Date: 04/17/2008
Zhu Zhang
Final approval and acceptance of this dissertation is contingent upon the candidate’s
submission of the final copies of the dissertation to the Graduate College.
I hereby certify that I have read this dissertation prepared under my direction and
recommend that it be accepted as fulfilling the dissertation requirement.
_________________________________________________ Date: 04/17/2008
Dissertation Director: Hsinchun Chen
STATEMENT BY AUTHOR
This dissertation has been submitted in partial fulfillment of requirements for an
advanced degree at the University of Arizona and is deposited in the University Library
to be made available to borrowers under rules of the Library.
Brief quotations from this dissertation are allowable without special permission, provided
that accurate acknowledgment of source is made. Requests for permission for extended
quotation from or reproduction of this manuscript in whole or in part may be granted by
the head of the major department or the Dean of the Graduate College when in his or her
judgment the proposed use of the material is in the interests of scholarship. In all other
instances, however, permission must be obtained from the author.
SIGNED: Ahmed Abbasi
ACKNOWLEDGMENTS
I would like to thank my advisor, Dr. Hsinchun Chen, for his encouragement and
invaluable feedback every step of the way. There is no doubt that Dr. Chen’s guidance
played an integral role in my scholarly development. My four years as a doctoral student
have provided me with the enduring skills and fortitude which I believe will enable me to
succeed in future endeavors. I am grateful to my committee members, Dr. Jay F.
Nunamaker, Jr. and Dr. Zhu Zhang, for their boundless wisdom and the kindness to
bestow some of it upon me. I thank the department chair, Dr. J. Leon Zhao, and the rest of the
MIS department faculty for their support.
My dissertation has been partially supported by the National Science Foundation grant:
“Multilingual Online Stylometric Authorship Identification: An Exploratory Study,”
(NSF #0646942, September 2006 – February 2008). I would like to thank my good
friends and colleagues in the Artificial Intelligence Lab: Siddharth Kaza, Tianjun Fu, Xin
Li, Daning Hu, Sven Thoms, Arab Salem, Hsin-min Lu, Chun-Ju Tseng, and David
Zimbra for their support. I am especially indebted to Cathy Larson for her kindness,
humor, and optimism.
Most of all, I am grateful for the constant love and support of my family, without whom
this would not have been possible. To my wife Saba, who stood by me through thick and
thin. My parents, who always led by example and taught me the meaning of hard work.
And my brother and sister, who have always been wonderful role models.
DEDICATION
This dissertation is dedicated to my parents, for their unconditional love and support.
Their commendable work ethic and unwavering principles have been an inspiration.
TABLE OF CONTENTS
LIST OF ILLUSTRATIONS .............................................................................................11
LIST OF TABLES ............................................................................................................13
ABSTRACT ......................................................................................................................15
CHAPTER 1: INTRODUCTION ..................................................................................... 17
1.1 Motivation............................................................................................................. 17
1.2 Overview............................................................................................................... 19
1.3 Analysis of Textual Information Types ................................................................. 21
1.4 Analysis of Ideational Information Types ............................................................. 23
1.5 Analysis of Textual, Ideational, and Inter-personal Information .......................... 24
CHAPTER 2: A STYLOMETRIC APPROACH TO IDENTITY-LEVEL
IDENTIFICATION AND SIMILARITY DETECTION IN CYBERSPACE .................. 25
2.1 Introduction........................................................................................................... 25
2.2 Related Work......................................................................................................... 27
2.2.1 Stylometry.................................................................................................... 27
2.2.2 Online Stylometric Analysis ........................................................................ 31
2.2.3 Feature Set Types for Online Stylometry..................................................... 36
2.3 Research Gaps and Questions............................................................................... 38
2.3.1 Similarity Detection ..................................................................................... 38
2.3.2 Richer Feature Sets ...................................................................................... 38
2.3.3 Individual Author Level Features ................................................................ 39
2.3.4 Scalability across Domains .......................................................................... 39
2.3.5 Research Questions...................................................................................... 39
2.4 Research Design: An Overview ............................................................................ 40
2.4.1 Techniques ................................................................................................... 40
2.4.2 Feature Sets and Types................................................................................. 42
2.5 System Design ...................................................................................................... 43
2.5.1 Feature Extraction........................................................................................ 44
2.5.2 Classifier Construction................................................................................. 47
2.6 Evaluation ............................................................................................................. 54
2.6.1 Test Bed ................................................................................................... 54
2.6.2 Experiment 1: Identification Task................................................................ 55
2.6.3 Experiment 2: Similarity Detection Task..................................................... 62
2.7 Conclusions........................................................................................................... 67
CHAPTER 3: STYLOMETRIC IDENTIFICATION IN ELECTRONIC MARKETS:
SCALABILITY AND ROBUSTNESS ............................................................................ 69
3.1 Introduction........................................................................................................... 69
3.2 Related Work......................................................................................................... 71
3.2.1 Reputation Systems/Online Feedback Mechanisms .................................... 71
3.2.2 Stylometric Analysis .................................................................................... 75
3.3 Research Gaps, Questions, and Design................................................................. 82
3.3.1 Research Gaps.............................................................................................. 82
3.3.2 Research Questions...................................................................................... 83
3.3.3 Research Design........................................................................................... 84
3.4 System Design ...................................................................................................... 85
3.4.1 Feature Extraction........................................................................................ 85
3.4.2 Classifier Construction: Writeprints ............................................................ 87
3.5 Evaluation ............................................................................................................. 92
3.5.1 Test Bed ................................................................................................... 92
3.5.2 Experimental Setup...................................................................................... 93
3.5.3 Experiment 1: Scalability............................................................................. 95
3.5.4 Experiment 2: Robustness............................................................................ 99
3.6 Conclusions......................................................................................................... 106
CHAPTER 4: WEBSITE SIGNATURES: AN EXPERIMENT ON FAKE ESCROW
AND SPOOF WEBSITES .............................................................................................. 108
4.1 Introduction......................................................................................................... 108
4.2 Related Work........................................................................................................110
4.2.1 Fake Website Types.....................................................................................112
4.2.2 Fake Website Features ................................................................................115
4.2.3 Fake Website Categorization Techniques ...................................................118
4.3 Research Design.................................................................................................. 121
4.3.1 Research Gaps and Questions.................................................................... 121
4.3.2 Research Framework ................................................................................. 122
4.4 Evaluation ........................................................................................................... 126
4.4.1 Experimental Setup.................................................................................... 127
4.4.2 Experiment 1: Fake Escrow Websites........................................................ 129
4.4.3 Experiment 2: Spoof Sites ......................................................................... 131
4.4.4 Hypotheses Testing .................................................................................... 133
4.5 Conclusions......................................................................................................... 142
CHAPTER 5: A COMPARISON OF TOOLS FOR DETECTING FAKE WEBSITES 144
5.1 Introduction......................................................................................................... 144
5.2 Fake Website Detection Tools............................................................................. 146
5.2.1 Lookup Systems......................................................................................... 146
5.2.2 Classifier Systems...................................................................................... 147
5.2.3 Hybrid Systems and Dynamic Classifiers.................................................. 149
5.2.4 Summary of Existing Tools........................................................................ 149
5.3 Proposed Approach ............................................................................................. 151
5.4 Experiments and Results..................................................................................... 153
5.4.1 Overall Results........................................................................................... 155
5.4.2 Impact of Time of Day and Interval........................................................... 155
5.4.3 Hybrid Systems: Combining Classifier and Lookup Methods .................. 157
5.5 Conclusions......................................................................................................... 160
CHAPTER 6: FEATURE SELECTION FOR OPINION CLASSIFICATION IN
ONLINE FORUMS AND REVIEWS ............................................................................ 161
6.1 Introduction......................................................................................................... 161
6.2 Related Work....................................................................................................... 162
6.2.1 Tasks ........................................................................................................ 163
6.2.2 Features ................................................................................................... 165
6.2.3 Classification Techniques .......................................................................... 168
6.2.4 Sentiment Analysis Domains ..................................................................... 170
6.3 Research Gaps and Questions............................................................................. 171
6.3.1 Web Forums in Multiple Languages .......................................................... 171
6.3.2 Stylistic Features ........................................................................................ 171
6.3.3 Feature Reduction for Sentiment Classification ........................................ 172
6.3.4 Research Questions.................................................................................... 173
6.4 Research Design.................................................................................................. 173
6.5 System Design .................................................................................................... 175
6.5.1 Feature Extraction...................................................................................... 175
6.5.2 Determining Size of Initial Feature Set ..................................................... 177
6.5.3 Feature Selection: Entropy Weighted Genetic Algorithm (EWGA) .......... 179
6.5.4 Classification.............................................................................................. 184
6.6 Evaluation ........................................................................................................... 185
6.6.1 Experiment 1: Movie Review Test Bed ..................................................... 185
6.6.2 Experiment 2: Online Discussion Forum................................................... 191
6.6.3 Results Discussion ..................................................................................... 195
6.7 Conclusions and Future Directions..................................................................... 197
CHAPTER 7: MINING ONLINE REVIEW SENTIMENTS USING FEATURE
RELATION NETWORKS.............................................................................................. 199
7.1 Introduction......................................................................................................... 199
7.2 Related Work....................................................................................................... 201
7.2.1 Classification Methods for Sentiment Analysis......................................... 202
7.2.2 N-Gram Features for Sentiment Analysis .................................................. 202
7.2.3 Feature Selection for Sentiment Analysis .................................................. 207
7.3 Research Gaps and Questions............................................................................. 210
7.3.1 Research Gaps.............................................................................................211
7.3.2 Research Questions.....................................................................................211
7.4 Research Design.................................................................................................. 212
7.4.1 Extended N-Gram Feature Set ................................................................... 213
7.4.2 Feature Relation Network .......................................................................... 214
7.5 Experiments ........................................................................................................ 223
7.5.1 Experiment 1a: Comparison of Feature Sets using Cross Validation ........ 224
7.5.2 Experiment 1b: Comparison of Features on 10,000 Review Test Beds..... 227
7.5.3 Experiment 2a: Comparison of Feature Selection Methods ...................... 228
7.5.4 Experiment 2b: Comparison of Selection Methods ................................... 230
7.5.5 Results Discussion ..................................................................................... 231
7.6 Conclusions......................................................................................................... 232
CHAPTER 8: AFFECT ANALYSIS OF WEB FORUMS AND BLOGS USING
CORRELATION ENSEMBLES .................................................................................... 234
8.1 Introduction......................................................................................................... 234
8.2 Related Work....................................................................................................... 235
8.2.1 Features for Affect Analysis....................................................................... 238
8.2.2 Techniques for Assigning Affect Intensities .............................................. 242
8.3 Research Design.................................................................................................. 243
8.3.1 Gaps and Questions.................................................................................... 243
8.3.2 Research Framework ................................................................................. 244
8.3.3 Research Hypotheses ................................................................................. 252
8.4 Evaluation ........................................................................................................... 253
8.4.1 Test Bed ................................................................................................... 253
8.4.2 Experimental Design.................................................................................. 254
8.4.3 Experiment 1: Comparison of Feature Sets ............................................... 255
8.4.4 Experiment 2: Comparison of Techniques................................................. 258
8.4.5 Experiment 3: Ablation Testing ................................................................. 261
8.4.6 Hypotheses Results .................................................................................... 262
8.5 Case Study .......................................................................................................... 264
8.6 Conclusions......................................................................................................... 267
CHAPTER 9: CYBERGATE: A SYSTEM AND DESIGN FRAMEWORK FOR TEXT
ANALYSIS OF COMPUTER-MEDIATED COMMUNICATION ............................... 269
9.1 Introduction......................................................................................................... 269
9.2 Background ......................................................................................................... 272
9.2.1 CMC Text................................................................................................... 273
9.2.2 CMC Text Analysis Features ..................................................................... 273
9.2.3 CMC Text Analysis Systems...................................................................... 275
9.3 A Design Framework for CMC Text Analysis.................................................... 278
9.3.1 Proposed CMC Text Analysis Framework................................................. 279
9.4 Kernel Theory ..................................................................................................... 281
9.5 Meta-Requirements............................................................................................. 283
9.6 Meta-Design........................................................................................................ 285
9.6.1 Features for CMC Text Analysis................................................................ 286
9.6.2 Feature Selection Techniques for CMC Text Analysis .............................. 288
9.6.3 Visualization Techniques for CMC Text Analysis ..................................... 290
9.7 Testable Hypotheses............................................................................................ 292
9.8 System Design: The CyberGate System ............................................................. 293
9.8.1 Information Types and Features................................................................. 294
9.8.2 Feature Selection........................................................................................ 295
9.8.3 Visualization............................................................................................... 295
9.8.4 Writeprints and Ink Blots ........................................................................... 299
9.9 A CMC Text Analysis Example Using CyberGate: The Enron Case ................. 303
9.10 Experimental Evaluation: Text Categorization using CyberGate ..................... 307
9.10.1 Research Hypotheses ............................................................................... 309
9.10.2 Information Types Representing the Ideational Meta-function ............... 310
9.10.3 Information Types Representing the Textual Meta-function ................... 313
9.10.4 Information Types Representing the Interpersonal Meta-function .......... 316
9.10.5 Results Discussion ................................................................................... 317
9.11 Conclusions....................................................................................................... 318
CHAPTER 10: CONCLUSION ..................................................................................... 320
10.1 Contributions..................................................................................................... 320
10.2 Relevance to MIS.............................................................................................. 323
10.3 Future Directions .............................................................................................. 325
REFERENCES ............................................................................................................... 326
LIST OF ILLUSTRATIONS
Figure 1.1: Dissertation Framework ................................................................................. 21
Figure 2.1: Identity Level Tasks ....................................................................................... 34
Figure 2.2: Stylometric Analysis System Design ............................................................. 44
Figure 2.3: Writeprint Creation Illustration ...................................................................... 48
Figure 2.4: Writeprint Comparisons ................................................................................. 51
Figure 2.5: Illustration of Pattern Disruption.................................................................... 52
Figure 2.6: Enron Data Set Feature Set Sizes and SVM Technique Performances .......... 60
Figure 2.7: Performance for Identification Techniques across Data Sets ......................... 61
Figure 3.1: Stylometric Similarity Detection System Design........................................... 85
Figure 3.2: Writeprint Comparisons ................................................................................. 91
Figure 3.3: Experiment 1a Results (scalability across traders using 2 identities per trader)
................................................................................................................................... 96
Figure 3.4: Experiment 1b Results (scalability across identities using 50 traders) .......... 98
Figure 3.5: Experiment 2a Results (robustness against word substitution) .................... 101
Figure 3.6: Experiment 2b Results (robustness against message forging) ..................... 103
Figure 3.7: Impact of Word Substitution and Forging on Various Techniques............... 105
Figure 3.8: Impact of Word Substitution and Forging on Writeprint N-Gram Features. 106
Figure 4.1: Examples of Different Categories of Fake Websites.....................................114
Figure 4.2: Average, Max, and Composite Kernels for Fake Website Detection ........... 125
Figure 4.3: Page and Site Level Performance for Various Features and Kernels ........... 130
Figure 4.4: Page and Site Level Performance for Various Features and Kernels ........... 132
Figure 4.5: Fake Escrow Website Detected Using Content Features.............................. 136
Figure 4.6: Escrow Website Replicas Detected Using Linkage Features ....................... 137
Figure 4.7: Similarities for Two Phony Pages Compared against Fake Website
www.bssew.com...................................................................................................... 140
Figure 4.8: Cumulative Page Level Errors for Features on Escrow Website Test Bed... 142
Figure 5.1: Fake Website Examples................................................................................ 145
Figure 5.2: Proposed AZProtect System Overview ........................................................ 151
Figure 5.3: Linear Composite Kernel used by AZProtect’s SVM Classification Model 153
Figure 5.4: Impact of Interval between Evaluation and Report Time and Time of Day on
Accuracy for Generated Fraud and Spoof Site Test Beds....................................... 157
Figure 5.5: Impact of Hybrid Systems on Fake Website Detection Accuracy................ 158
Figure 5.6 Generated Fraud Site Patterns Over Time ..................................................... 159
Figure 5.7: Spoof Site Patterns over Time ...................................................................... 160
Figure 6.1: Sentiment Classification System Design...................................................... 177
Figure 6.2: EWGA Illustration........................................................................................ 180
Figure 6.3: Key Stylistic Features for Movie Review Data Set...................................... 191
Figure 6.4: U.S. Forum Results using EWGA and GA .................................................. 195
Figure 6.5: Key Stylistic Features for U.S. Forum ......................................................... 196
Figure 7.1: Sentiment Analysis Research Design ........................................................... 212
Figure 7.2: Subsumption Relations between Word N-Grams......................................... 216
Figure 7.3: Parallel Relations between Various Bigrams ................................................. 217
Figure 7.4: The Feature Relation Network ..................................................................... 218
Figure 7.5: The FRN Algorithm...................................................................................... 221
Figure 7.6: Example Application of FRN to Six Sentence Test Bed .............................. 222
Figure 7.7: Results for Feature Sets on 5-Fold Cross Validation Experiment (Setting A)
................................................................................................................................. 226
Figure 7.8: Feature Selection Results on 5-Fold Cross Validation Experiment (Setting A)
................................................................................................................................. 229
Figure 7.9: Weights for Top 200,000 N-Grams on Digital Camera Test Bed ................. 232
Figure 8.1: Affect Analysis Research Framework .......................................................... 246
Figure 8.2: SVR Correlation Ensemble for Assigning Affect Intensities ....................... 251
Figure 8.3: Macro-level Mean % Error and Correlation Coefficients for Feature Sets.. 257
Figure 8.4: Micro-level Mean % Error and Correlation Coefficients for Feature Sets .. 258
Figure 8.5: Macro-level Mean % Error and Correlation Coefficients for Techniques ... 260
Figure 8.7: Macro-level Mean % Error and Correlation Coefficients for Ablation Testing
................................................................................................................................. 262
Figure 8.8: Posting Frequency for Two Web Forums ..................................................... 265
Figure 8.9: Temporal View of Intensities in Two Web Forums ...................................... 267
Figure 9.1: CyberGate System Design............................................................................ 294
Figure 9.2: CyberGate Feature Selection Examples ....................................................... 295
Figure 9.3: Multi-dimensional Text Views in CyberGate............................................... 297
Figure 9.4: Text Overlay Views in CyberGate................................................................ 298
Figure 9.5: Interaction Views in CyberGate for Representing Interpersonal Information
................................................................................................................................. 299
Figure 9.6: Writeprints Process Illustration on Two Dimensions ................................... 301
Figure 9.7: Ink Blots Process Illustration ....................................................................... 302
Figure 9.8: Writeprints for Two Enron Employees......................................................... 304
Figure 9.9: Author A Ink Blots and Parallel Coordinates ............................................... 305
Figure 9.10: Author B Ink Blots and Parallel Coordinates ............................................. 306
Figure 9.11: Author B Bag-of-Words Clusters and Social Networks ............................. 307
LIST OF TABLES
Table 2.1 A Taxonomy for Online Stylometric Analysis .................................................. 32
Table 2.2 Previous Studies in Online Stylometric Analysis.............................................. 33
Table 2.3: Baseline and Extended Feature Sets ................................................................ 45
Table 2.4: Details for Data Sets in Test Bed ..................................................................... 55
Table 2.5: Techniques/Feature Sets for Identification Experiment................................... 56
Table 2.6: Experimental Results (% accuracy) for Identification Task ............................ 58
Table 2.7: P-values for Pair Wise t-tests on Accuracy ...................................................... 59
Table 2.8: Performance Comparison of Ensemble SVM and SVM on Enron Data Set ... 60
Table 2.10: Experimental Results (F-measure) for Similarity Detection Task ................. 64
Table 2.11: P-values for Pair Wise t-tests on F-Measure .................................................. 65
Table 3.1: Previous Unsupervised Stylometric Analysis Techniques ............................... 77
Table 3.2: Extended Feature Set ....................................................................................... 86
Table 3.3: eBay Test Bed Statistics ................................................................................... 93
Table 3.4: Number of Traders and Identities used in Experiment 1 ................................. 95
Table 3.5: P-Values for Pair Wise t-tests on F-measure (n=30)........................................ 97
Table 3.6: P-Values for Pair Wise t-tests on F-measure (n=30)........................................ 99
Table 3.7: Impact of Different Levels of Word Substitution on an Example Comment. 100
Table 3.8: P-Values for Pair Wise t-tests on F-measure (n=30)...................................... 101
Table 3.9: Illustration of Impact of 20% Message Forging on Feedback Comments..... 102
Table 3.10: P-Values for Pair Wise t-tests on F-measure (n=30).................................... 104
Table 4.1: Related Fake Website Detection Studies.........................................................112
Table 4.2: Summary of Fake Website Categories ............................................................115
Table 4.3: Fake Website Feature Set Description ........................................................... 123
Table 4.4: Description of Fake Website Test Beds.......................................................... 128
Table 4.5: Average Number of Features used by the Linear Classifiers ......................... 129
Table 4.6: Average Page and Site Level Classification Accuracy................................... 130
Table 4.7: Average Page and Site Level Classification Accuracy................................... 132
Table 4.8: P-Values for Pair Wise t-Tests on Accuracy (n=50) for Escrow Websites..... 134
Table 4.9: P-Values for Pair Wise t-Tests on Accuracy (n=50) for Spoof Websites ....... 135
Table 4.10: P-Values for Pair Wise t-Tests on Accuracy (n=50) for Escrow Websites... 138
Table 4.11: P-Values for Spoof Site Test Bed ................................................................. 138
Table 4.12: P-Values for Pair Wise t-Tests on Accuracy (n=50) for Escrow Websites... 140
Table 4.13: P-Values for Spoof Site Test Bed ................................................................. 141
Table 5.1: Summary of Fake Website Detection Tools................................................... 150
Table 5.2: Overall Results for Tool Accuracy Comparison ............................................ 155
Table 6.1: A Taxonomy of Sentiment Polarity Classification ......................................... 163
Table 6.2: Selected Previous Studies in Sentiment Polarity Classification .................... 164
Table 6.3: Text Classification Studies using GA, IG, and SVM Weights ....................... 174
Table 6.4: Sentiment Analysis Feature Set ..................................................................... 179
Table 6.5: Experiment 1a Results ................................................................................... 186
Table 6.7: Experiment 1b Results ................................................................................... 189
Table 6.8: P-Values for Pair Wise t-tests on Accuracy (n=50, df=49) ............................ 190
Table 6.9: Experiment 2a Results ................................................................................... 193
Table 6.10: P-Values for Pair Wise t-tests on Accuracy (n=50)...................................... 193
Table 6.11: Experiment 2b Results ................................................................................. 194
Table 6.12: P-Values for Pair Wise t-tests on Accuracy (n=50, df=49) .......................... 194
Table 7.1: Summary of N-Gram Features used for Sentiment Analysis......................... 207
Table 7.2: Select Univariate and Multivariate Methods used for Text Classification .... 210
Table 7.3: N-Gram Feature Set ....................................................................................... 214
Table 7.4: List of Relations between N-Gram Feature Groups ...................................... 218
Table 7.5: Descriptions of Online Review Test Beds ..................................................... 223
Table 7.6: Results for Feature Sets on 10,000 Review Experiment................................ 227
Table 7.7: Results for Selection Methods on 10,000 Review Experiment ..................... 231
Table 7.8: N-Grams from the FRN Feature Set .............................................................. 232
Table 8.1: Related Prior Affect Analysis Studies ............................................................ 237
Table 8.2: Manual Lexicon Examples for the Violence Affect ....................................... 249
Table 8.3: Test Bed Description ...................................................................................... 254
Table 8.4: Overall Results for Various Feature Sets ....................................................... 257
Table 8.5: Results for Experiment 2 (comparison of techniques)................................... 259
Table 8.6: Results for Experiment 3 (ablation testing) ................................................... 261
Table 8.7: Sample Learned N-Grams and Lexicon Items for Hate Affect...................... 263
Table 8.8: Summary Statistics for Two Web Forums Collected ..................................... 265
Table 8.9: Affect Intensities per Posting across Two Web Forums................................. 266
Table 9.1: Previous CMC Systems ................................................................................. 276
Table 9.2: Components of an ISDT Design Product....................................................... 280
Table 9.3: Components of the Proposed Design Framework for CMC Text Analysis ... 281
Table 9.4: Various Information Types for the Three Meta-Functions............................. 285
Table 9.5: Various Linguistic Features used for Text Analysis ....................................... 288
Table 9.7: Hypotheses Testing Results for Text Categorization Experiments ................ 310
Table 9.8: Topic Categorization Results (accuracy) ........................................................311
Table 9.9: Opinion Classification Results....................................................................... 313
Table 9.10: Style Classification Results.......................................................................... 314
Table 9.11: Genre Classification Results ........................................................................ 315
Table 9.12: Interaction Classification Results ................................................................ 317
ABSTRACT
Computer mediated communication (CMC) and electronic markets have seen
tremendous growth due to the fast propagation of the internet. In spite of the numerous
benefits of electronic communication, it is not without its pitfalls. Two characteristics of
computer mediated communication have proven to be particularly problematic: online
anonymity and the enormity of data present in cyber communities.
This dissertation follows the design science research paradigm in MIS by addressing
issues pertaining to the design and development of an important IT artifact capable of
alleviating the two aforementioned CMC concerns. We present eight essays related to the
creation of CMC systems that can provide improved text analysis capabilities by
incorporating a richer set of textual information types. Using Systemic Functional
Linguistic Theory (SFLT) as a kernel theory, emphasis is placed on developing
techniques for analyzing textual and ideational information. A rich set of features is used
to represent textual (e.g., style, genres, social cues, etc.) and ideational (topics, sentiments,
affects, etc.) information. The research revolves around a core set of algorithms utilized
for feature selection, categorization, analysis, and visualization of CMC text. The
dissertation is arranged in three parts. The first two parts attempt to develop a set of
features and techniques that can effectively represent textual and ideational information.
In Chapters 2-5, we leverage information types related to the textual meta-function of
SFLT for enhanced identity and institutional trust. Experiments are conducted on various
CMC modes prevalent in organizational settings, including email, instant messaging,
forums, program code, and websites. Chapters 6-8 consider two important information
types associated with the ideational meta-function of SFLT: opinions and emotions. We
assess the ability to gauge consumer sentiments and affects using machine learning
techniques on various CMC modes, including product review and social discussion
forums.
The third part relates to the design, development, and evaluation of a visualization
system that can analyze the presence of the aforementioned information types in text-based CMC archives (Chapter 9). We propose a design framework for CMC text analysis
systems that is grounded in SFLT. The CyberGate system is developed as an instantiation
of the design framework.
CHAPTER 1: INTRODUCTION
1.1 Motivation
The advent and progression of information technology have brought about a
fundamental change to various application areas with the creation of massive amounts of
digitized data and information. Two important application areas spawned by enhanced
information technology are electronic commerce and computer mediated communication.
E-Commerce facilitates the exchange of goods and services via the Internet, and also
provides businesses with an invaluable outlet for gathering marketing and business
intelligence. Computer mediated communication (CMC) has seen tremendous growth due
to the fast propagation of the internet. Computer mediated communication allows
businesses to acquire consumer feedback via a plethora of modes, including email,
websites, forums, blogs, and chat rooms. These modes of CMC continue to have a
profound impact on organizations due to their quick and ubiquitous nature. Electronic
communication methods have redefined the fabric of organizational culture and
interaction. With the persistent evolution of communication processes and constant
advancements in technology, such metamorphoses are likely to continue.
In spite of the numerous benefits of electronic communication, it is not without its
pitfalls. Two characteristics of computer mediated communication and electronic markets
have proven to be particularly problematic: online anonymity and the enormity of data
present in cyber communities. These vices undermine the numerous benefits associated
with CMC and online communities.
The anonymous nature of the internet has resulted in several trust-related issues
including online deception, fraudulent websites, and the prevalence of agitators (i.e.,
those attempting to disrupt online discourse) and lurkers (i.e., those attempting to free-ride off others) (Donath et al., 1999). Two important forms of online trust are identity and
institutional trust (Pavlou and Gefen, 2004). Identity trust is trust in the individuals we
interact with online. Institutional trust is belief in trusted third-party websites (e.g.,
online payment, escrow, delivery, and financial organization sites). Online anonymity and
the lack of physical contact make it difficult to ensure identity and institutional trust.
Collectively, these concerns can cast serious doubt on the quality of information
exchanged in such online communities. Cyber communities also contain large volumes of
information including various communication modes, topics, threads, messages, and
authors. CMC environments feature very large-scale conversations involving thousands of
people and messages. The enormous information quantities make such places noisy and
difficult to navigate.
Categorization, analysis, and visualization techniques capable of improving the quality
of information retrieved from online settings can enhance knowledge discovery (Sack,
2000). Many believe that technologies encouraging social translucence in cyberspace by
diminishing the asymmetric nature of online information exchange between unsuspecting
internet users and fraudsters can help alleviate online anonymity abuse (Erickson and
Kellogg, 2000; Smith, 2002). This dissertation has been motivated by the need to address
these two hindering characteristics of computer mediated communication and electronic
markets:
• Advancing research related to the development of methods capable of enhancing accountability and security in online markets and computer mediated communication.
• Exploring richer textual feature representations coupled with advanced
analysis and visualization techniques, capable of enabling improved
information retrieval and knowledge discovery from computer mediated
communication archives.
1.2 Overview
There is a need for techniques to represent, evaluate, summarize, and present CMC
and electronic market content. Such methods, supporting navigation and knowledge
discovery, can enhance informational transparency, which benefits community
participants and observers (Fiore and Smith, 2004). Perhaps the most important
characteristic of CMC is the language complexity it introduces as compared to other
forms of text (Wilson and Peterson, 2002). Effective analysis of CMC text entails the
utilization of a language theory that can provide representational guidelines. Grounded in
Functional Linguistics, Systemic Functional Linguistic Theory (SFLT) provides an
appropriate mechanism for representing CMC text information (Halliday, 2004). SFLT
states that language has three meta-functions: ideational, interpersonal, and textual
(Halliday, 2004). The three meta-functions are intended to provide a comprehensive
functional representation of language meaning by encompassing the physical, mental, and
social elements of language (Fairclough, 2003).
The ideational meta-function states that language consists of ideas. It relates to
aspects of the “mental world” which include attitudes, desires, and values (Fairclough,
2003; Halliday, 2004). The textual meta-function indicates that language has
organization, structure, flow, cohesion, and continuity (Halliday, 2004). It can be represented
via information types such as style, genres, and vernaculars (Argamon et al., 2007). The
interpersonal meta-function refers to the fact that language is a medium of exchange
between people (Sack, 2000). It is generally represented using CMC interaction
information. Analysis of CMC text requires the inclusion of all three language meta-functions described by SFLT: ideational, textual, and interpersonal (Sack, 2000).
Therefore, effective depiction of CMC text entails consideration of information types
capable of representing these three meta-functions.
This dissertation proposes a design framework for the creation of CMC systems that
can provide improved text analysis capabilities by incorporating a richer set of textual
information types. Using Systemic Functional Linguistic Theory as a guiding principle,
emphasis was placed on developing techniques for analyzing information types
associated with the textual, ideational, and interpersonal meta-functions (Figure 1.1). A
rich set of features was used to represent textual (e.g., style, genres, social cues, etc.) and
ideational (topics, sentiments, affects, etc.) information. The research revolved around a
core set of algorithms utilized for feature selection, categorization, analysis, and
visualization of CMC text. The dissertation is arranged in three parts. The first two parts
attempt to develop a set of features and techniques that can effectively represent textual
and ideational information. The third part relates to the development and evaluation of a
visualization system that can analyze the aforementioned information types in CMC text.
Figure 1.1: Dissertation Framework
1.3 Analysis of Textual Information Types
The first part of the dissertation explored different features and techniques for
categorization and analysis of textual information types. Emphasis was placed on
stylometric analysis (categorizing text based on style) and fake website detection
(learning text-based fraud cues). Various features and classification techniques developed
for analyzing style, structure, and social cues are incorporated. The effectiveness of the
different features and techniques for stylometric identification of web forum authors
based on writing style was evaluated, resulting in the creation of an ideal feature set and
benchmark techniques (Chapter 2). The extended feature set comprised a vast array of
lexical, syntactic, structural, content-specific, and idiosyncratic text attributes. The best
techniques included machine learning algorithms such as support vector machine and
principal component analysis. In this chapter, a new technique for stylometric
identification and authentication of online texts was also proposed. The new method
incorporates Karhunen-Loeve transforms with a sliding window algorithm and pattern
disruption. The new technique (Writeprints) was compared extensively against the
existing state-of-the-art techniques for stylometric classification. Writeprints was
generally shown to have the best performance across data sets consisting of email, web
forum, instant messaging, and programming code text with varying numbers of potential
authors. In Chapter 3, we evaluated the scalability (in terms of number of authors) and
robustness (against intentional alteration) of various stylometric similarity detection
methods. Experiments were conducted on online buyer feedback comments derived from
eBay.
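To make the preceding description more concrete, the following sketch illustrates the core intuition behind a Writeprint-style comparison: stylistic feature vectors are extracted over a sliding window of text, projected with a Karhunen-Loeve transform (principal component analysis), and two identities are compared by the distance between their projected patterns. This is a minimal illustration only; the hypothetical style_features extractor, the window sizes, and the centroid-distance comparison are simplifying assumptions and omit elements of the actual technique such as pattern disruption.

```python
# Simplified sketch of a Writeprint-style comparison (illustrative only).
# style_features() is a hypothetical, minimal extractor; real feature sets are far richer.
import numpy as np
from sklearn.decomposition import PCA

def style_features(window: str) -> list[float]:
    # A few simple lexical markers for one text window.
    words = window.split()
    return [
        len(window),                                          # characters per window
        float(np.mean([len(w) for w in words])) if words else 0.0,  # average word length
        len(set(words)) / (len(words) + 1e-9),                # type-token ratio
        window.count(',') / (len(window) + 1e-9),             # comma frequency
    ]

def writeprint(text: str, window: int = 200, step: int = 50) -> np.ndarray:
    # Slide a fixed-length window over the text and extract one vector per window.
    vectors = [style_features(text[i:i + window])
               for i in range(0, max(len(text) - window, 1), step)]
    return np.array(vectors)

def compare(text_a: str, text_b: str, n_components: int = 2) -> float:
    # Project both sets of window vectors into a shared low-dimensional
    # (Karhunen-Loeve / PCA) space and return the distance between centroids;
    # smaller values suggest more similar writing styles.
    a, b = writeprint(text_a), writeprint(text_b)
    pca = PCA(n_components=n_components).fit(np.vstack([a, b]))
    return float(np.linalg.norm(pca.transform(a).mean(axis=0)
                                - pca.transform(b).mean(axis=0)))
```

In such a sketch, comparing the pooled postings of two screen names would yield a single style-distance score, which a threshold could then convert into a similarity-detection decision.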
In Chapter 4 the extended feature set and techniques were then used to identify fake
websites based on digital signatures created using the aforementioned set of text
attributes and customized kernels (average and max similarity). Various feature sets were
compared, including body text, HTML, URLs, link features, and images. The best
performance was attained using the full set of features (i.e., fraud cues) in combination
with the custom kernel. The approach is capable of representing the unique characteristics
necessary for identification of fake spoof and escrow websites. In a follow-up essay, our
approach was integrated into the AZProtect system for detecting fake websites. This
system was compared against existing fake website detection tools (Chapter 5).
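As a simplified illustration of the kernel idea described above, the sketch below computes site-level "average" and "max" similarity kernels from page-level feature vectors and combines them linearly. The cosine page similarity and the equal weighting are assumptions made for brevity; they stand in for, rather than reproduce, the customized kernels evaluated in Chapters 4 and 5.

```python
# Illustrative sketch of site-level "average" and "max" similarity kernels
# between two websites, each represented as a set of page-level feature vectors.
import numpy as np

def page_similarity(p: np.ndarray, q: np.ndarray) -> float:
    # Cosine similarity between two page feature vectors (an assumption here).
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

def site_kernels(site_a: np.ndarray, site_b: np.ndarray):
    # site_a, site_b: arrays of shape (num_pages, num_features).
    sims = np.array([[page_similarity(p, q) for q in site_b] for p in site_a])
    avg_kernel = sims.mean()   # average pairwise page similarity
    max_kernel = sims.max()    # similarity of the closest pair of pages
    return avg_kernel, max_kernel

def composite_kernel(site_a: np.ndarray, site_b: np.ndarray, w: float = 0.5) -> float:
    # Simple linear combination of the two site-level kernels.
    avg_k, max_k = site_kernels(site_a, site_b)
    return w * avg_k + (1 - w) * max_k
```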
1.4 Analysis of Ideational Information Types
The second part of the dissertation developed features and techniques for
categorization and analysis of ideational information. Given the significant progress
already made on topic categorization, and the prevalence of directional/opinionated and
emotional content in CMC, emphasis was placed on sentiment and affect analysis. In
order to identify the ideal set of features for classifying sentiments in online texts, an
Entropy-Weighted Genetic Algorithm (EWGA), which incorporates the information gain
heuristic, was developed in Chapter 6. Using the Support Vector Machine classifier and a
set of features comprised of lexical, syntactic, and structural sentiment markers, the
EWGA was evaluated in comparison with existing text feature selection techniques.
Generally, EWGA outperformed other methods on feature selection for sentiment
classification in online reviews and forums. In Chapter 7, the sentiment analysis research
was extended. We developed a multivariate rule-based feature selection method for
opinion classification of online reviews. The method outperformed several comparison
feature selection methods, including recursive feature elimination and decision tree
models.
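The sketch below illustrates the entropy-weighting idea behind EWGA in simplified form: an information gain score is computed for each feature and used to bias the initial population of a genetic feature-selection search. The binary feature representation and the seeding scheme shown here are assumptions for illustration and do not reproduce the full EWGA operators described in Chapter 6.

```python
# Minimal sketch of entropy-weighted seeding for genetic feature selection.
import numpy as np

def entropy(labels: np.ndarray) -> float:
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature: np.ndarray, labels: np.ndarray) -> float:
    # feature: binary presence/absence vector over documents.
    gain = entropy(labels)
    for value in (0, 1):
        mask = feature == value
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain

def seed_population(X: np.ndarray, y: np.ndarray, pop_size: int = 20, seed: int = 0) -> np.ndarray:
    # Probability of including each feature is proportional to its information gain,
    # so high-gain features are more likely to appear in initial chromosomes.
    rng = np.random.default_rng(seed)
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    probs = gains / (gains.max() + 1e-12)
    return rng.random((pop_size, X.shape[1])) < probs  # boolean chromosomes
```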
In order to analyze emotional content in text-based CMC, affect analysis of web
forums and blogs was performed in Chapter 8. The approach used Support Vector
Regression ensembles to gauge the emotive intensity for various affect classes found in
web discourse. Once affective intensities were determined, information visualization
techniques were developed for assessing member mood trends across the various forums
and blogs over time.
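The following sketch shows, in simplified form, how an ensemble of Support Vector Regression models might be used to score affect intensity: several SVR models are trained on different feature subsets and their predictions are averaged. The random feature partitioning and plain averaging are assumptions for illustration; the approach evaluated in Chapter 8 combines ensemble members using correlation-based weighting.

```python
# Simplified sketch of an SVR ensemble for affect intensity scoring.
import numpy as np
from sklearn.svm import SVR

def train_svr_ensemble(X: np.ndarray, y: np.ndarray, n_members: int = 3, seed: int = 0):
    # Train each member on a random half of the feature columns (an assumption).
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        cols = rng.choice(X.shape[1], size=max(1, X.shape[1] // 2), replace=False)
        members.append((cols, SVR(kernel="rbf").fit(X[:, cols], y)))
    return members

def predict_intensity(members, X: np.ndarray) -> np.ndarray:
    # Average the members' predicted affect intensities for each message.
    preds = [model.predict(X[:, cols]) for cols, model in members]
    return np.mean(preds, axis=0)
```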
1.5 Analysis of Textual, Ideational, and Inter-personal Information
The third part of the dissertation proposed a set of algorithms and visualization
techniques to facilitate analysis of textual and ideational information found in CMC text,
based on the features, lexicons, and techniques developed in the first two parts. The
CyberGate visualization system was developed, which includes several basic, multi-dimensional, and text overlay presentation formats for viewing topical, stylistic, genre-related, sentimental, and affective information extracted from CMC texts. In addition to
several standard text visualization techniques such as parallel coordinates and multi-dimensional scaling (MDS) projections, CyberGate includes the aforementioned
Writeprints technique which is useful for accentuating feature occurrence variation. It
also utilizes Ink Blots, a text overlay technique that uses decision tree models to
present the occurrence frequencies of the most distinctive attributes for a particular
class (e.g., authors, threads, forums, etc.). The system also uses discussion trees and
interaction networks to represent the interpersonal dimension of Systemic Functional
Linguistic Theory. The CyberGate system and its CMC text visualization framework
were evaluated for their ability to represent various forms of textual, ideational, and
interpersonal information using simulated experiments (Chapter 9). The results favorably
demonstrated the effectiveness of the system for CMC content analysis.
CHAPTER 2: A STYLOMETRIC APPROACH TO IDENTITY-LEVEL
IDENTIFICATION AND SIMILARITY DETECTION IN CYBERSPACE
2.1 Introduction
One of the problems often associated with online anonymity is that it hinders social
accountability, as substantiated by the high levels of cybercrime. Although identity cues
are scarce in cyberspace, individuals often leave behind textual identity traces. In this
essay we proposed the use of stylometric analysis techniques to help identify individuals
based on writing style.
The Internet’s numerous benefits have been coupled with the realization of several
vices attributable to the ubiquitous nature of computer mediated communication and
abuses of online anonymity. The Internet is often used for the illegal sale and distribution
of software (Moores and Dhillon, 2000; Zheng et al., 2006). It also serves as an attractive
medium for hackers indulging in online attacks (Oman and Cook, 1989; Krsul and
Spafford, 1997) and cyber wars (Garson, 2006). Furthermore, Internet-based
communication is swarming with fraudulent schemes, including email scams. One well-known fraudulent scheme is the 4-1-9 scam (Airoldi and Malin, 2004), in which deceptive
individuals convince users to provide bank account information or cash fake cashier
checks through email and forum messages. The scam has been around for over a decade
and has generated over 5 billion dollars in fraudulent revenues (Sullivan, 2005).
Electronic marketplaces are another area susceptible to deception in the form of
reputation rank inflation (Morzy, 2005). In this scheme, online sellers create fake sales
transactions to themselves in order to improve reputation rank (Josang, 2006). While
artificial accreditation can simply be a business ploy, it is also often done in order to
defraud unsuspecting future buyers.
Tools providing greater informational transparency in cyberspace are necessary to
counter anonymity abuses and garner increased accountability (Erickson and Kellogg,
2000; Sack, 2000). The aforementioned forms of Internet misuse all involve text-based
modes of computer mediated communication. Hence, the culprits often leave behind
potential textual traces of their identity (Li et al., 2006). Peng et al. (2003) refer to an
author’s unique stylistic tendencies as an “author profile.” Ding et al. (2003) described
such identifiers as text fingerprints that can discriminate authorship.
Stylometry is the statistical analysis of writing style (Zheng et al., 2006). In light of
these textual traces, researchers have begun to use online stylometric analysis techniques
as a forensic identification tool, with recent application to email (De Vel et al., 2001),
forums (Zheng et al., 2006), and program code (Gray et al., 1997). Despite significant
progress, online stylometry has several current limitations. The biggest shortcoming has
been the lack of scalability in terms of number of authors and across application domains
(e.g., email, forums, chat). This is partially attributable to the use of feature sets that are
insufficient in terms of the breadth of stylistic tendencies captured. Furthermore, previous
work has also mostly focused on the identification task (where potential authorship
entities are known in advance). There has been limited emphasis on similarity detection,
where no entities are known a priori (which is more practical for cyberspace).
In this essay we addressed some of the current limitations of online stylometric
analysis. We incorporated a larger, more holistic feature set than those used in previous
studies. We also developed the Writeprint technique, which is intended to improve
stylometric analysis scalability across authors and domains for the identification and
similarity detection tasks. Experiments were conducted in order to evaluate the
effectiveness of the proposed feature set and technique in comparison with benchmark
techniques and a baseline feature set.
The remainder of this chapter is organized as follows. Section 2.2 presents a general
review of stylometric analysis and a taxonomy of online stylometric analysis studies.
Section 2.3 describes research gaps and questions, while Section 2.4 presents an overview
of our proposed research design. Section 2.5 describes the system design, which includes
the stylometric features and techniques utilized in our analysis. Section 2.6 presents two
experiments used to evaluate the effectiveness of the proposed approach along with a
discussion of the results. The chapter concludes with a summary of our research
contributions, closing remarks, and future directions.
2.2 Related Work
In this section we present a summary of stylometry followed by a taxonomy and
review of online stylometric analysis research.
2.2.1 Stylometry
Stylometric analysis techniques have been used for analyzing and attributing
authorship of literary texts for numerous years (e.g., Mosteller and Wallace, 1964). Three
important characteristics of stylometry are the analysis tasks, writing style features used,
and the techniques incorporated to analyze these features (Zheng et al., 2006). These
characteristics are discussed below.
2.2.1.1 Tasks
Two major stylometric analysis tasks are identification and similarity detection (Gray
et al., 1997; De Vel et al., 2001). The objective in the identification task is to compare
anonymous texts against those belonging to identified entities, where each anonymous
text is known to be written by one of those entities. The Federalist papers (Mosteller and
Wallace, 1964) are a good example of a stylometric identification problem. Twelve
anonymous/disputed essays were compared against writings belonging to Madison and
Hamilton. Since all possible author classes are known a priori, identification problems can
use supervised or unsupervised classification techniques.
The objective in the similarity detection task is to compare anonymous texts against
other anonymous texts and assess the degree of similarity. Examples include online
forums, where there are numerous anonymous identities (i.e., screen names, handles,
email addresses). Similarity detection tasks can only use unsupervised techniques since
no class definitions are available beforehand.
2.2.1.2 Features
Stylistic features are the attributes or writing style markers that are the most effective
discriminators of authorship. The vast array of stylistic features includes lexical,
syntactic, structural, content-specific, and idiosyncratic style markers.
Lexical features are word or character-based statistical measures of lexical variation.
These include style markers such as sentence/line length (Yule, 1938; Argamon et al.,
2003), vocabulary richness (e.g., Yule, 1944) and word length distributions (De Vel et al.,
2001; Zheng et al., 2006).
Syntactic features include function words (Mosteller and Wallace, 1964), punctuation
(Baayen et al., 2002), and part-of-speech tag n-grams (Baayen et al., 1996; Argamon et al.,
1998). Function words have been shown to be highly effective discriminators of
authorship since the usage variations of such words are a strong reflection of stylistic
choices (Koppel et al., 2006).
Structural features, which are especially useful for online text, include attributes
relating to text organization and layout (De Vel et al., 2001; Zheng et al., 2006). Other
structural attributes include technical features such as the use of various file extensions,
fonts, sizes, and colors (Abbasi and Chen, 2005). When analyzing computer programs,
different structural features, for example, the use of braces and comments, are utilized
(Oman and Cook, 1989).
Content-specific features are important keywords and phrases on certain topics
(Martindale and McKenzie, 1995) such as word n-grams (Diederich et al., 2003). For
example, content specific features on a discussion of computers may include “laptop” and
“notebook.”
Idiosyncratic features include misspellings, grammatical mistakes, and other usage
anomalies. Such features are extracted using spelling and grammar checking tools and
dictionaries (Chaski, 2001; Koppel and Schler, 2003). Idiosyncrasies may also reflect
deliberate author choices or cultural differences, e.g., use of the word “centre” versus
“center” (Koppel and Schler, 2003).
Over 1,000 different features have been used in previous authorship analysis research
with no consensus on a best set of style markers (Rudman, 1997). However, this could be
attributable to certain feature categories being more effective at capturing style variations
in different contexts. This necessitates the use of larger feature sets comprised of several
categories of features (e.g., punctuation, word length distributions, etc.) spanning various
feature groups (i.e., lexical, syntactic, etc.). For instance, the use of feature sets
containing lexical, syntactic, structural, and content-specific features has been shown to be more
effective for online identification than feature sets containing only a subset of these
feature groups (Abbasi and Chen, 2005; Zheng et al., 2006).
2.2.1.3 Techniques
Stylometric analysis techniques can be broadly categorized into supervised and
unsupervised methods. Supervised techniques are those that require author class labels
for categorization while unsupervised techniques make categorizations with no prior
knowledge of author classes.
Supervised techniques used for authorship analysis include support vector machine
(SVM) (Diederich, 2000; De Vel, 2001; Li et al., 2006), neural networks (Merriam, 1995;
Tweedie et al., 1996; Zheng et al., 2006), decision trees (Apte, 1998; Abbasi and Chen,
2005), and linear discriminant analysis (Baayen, 2002; Chaski, 2005). SVM is a highly
robust technique that has provided powerful categorization capabilities for online
authorship analysis. In head to head comparisons, SVM significantly outperformed other
supervised learning methods such as neural networks and decision trees (Abbasi and
Chen, 2005; Zheng et al., 2006).
Unsupervised stylometric categorization techniques include principal component
analysis (PCA) and cluster analysis (Holmes, 1992). PCA’s ability to capture essential
variance across large amounts of features in a reduced dimensionality makes it attractive
for text analysis problems, which typically involve large feature sets. PCA has been used
in numerous previous authorship studies (e.g., Burrows, 1987; Baayen et al., 1996) and
has also been shown to be effective for online stylometric analysis (Abbasi and Chen,
2006).
2.2.2 Online Stylometric Analysis
Online stylometric analysis is concerned with categorization of authorship style in
online texts. Here, we define “online texts” as any textual documents that may be found
in an online setting. This includes computer-mediated communication (CMC), non-literary electronic documents (e.g., student essays, news articles, etc.), and program code.
Previous online studies have several important characteristics pertaining to the tasks,
domains, features, and number of author classes utilized. These are summarized in the
taxonomy presented in Table 2.1.
Based on the proposed taxonomy, Table 2.2 shows previous studies dealing with
online stylometric classification. For some previous studies, the number of features and
categories used are marked with a dash (“-”) or a not available (“n/a”). The dashes are for
studies where authorship was evaluated manually, without the use of any defined set of
features. For studies marked “n/a,” the authors were unable to determine the number of
features and categories used in the study. We discuss the taxonomy and related studies in
detail below.
Table 2.1: A Taxonomy for Online Stylometric Analysis

Tasks
  Label | Category             | Description
  T1    | Identification       | Comparing text from anonymous identities against known classes.
  T2    | Similarity Detection | Text from anonymous identities is compared against each other in order to assess degree of similarity with no prior class definitions.

Domains
  Label | Category         | Examples
  D1    | Asynchronous CMC | Asynchronous conversation including email, web forums, and blogs.
  D2    | Synchronous CMC  | Persistent text, including chat rooms and instant messaging.
  D3    | Documents        | Electronic documents including non-literary texts and news articles.
  D4    | Program Code     | Text containing code snippets and examples.

Features
  Label | Category           | Description
  Cat.  | No. of Categories  | Maximum number of stylistic feature categories used in experiments.
  #     | Number of Features | Maximum number of style marking attributes incorporated.
  Type  | Feature Set Type   | Whether a single author group level feature set or multiple individual author level subsets were used.

Classes
  Label | Category       | Description
  #     | No. of Classes | Maximum number of classes used in experiments.
2.2.2.1 Tasks
As described in the previous section, two important stylometric analysis tasks are
identification and similarity detection. For online texts, these two tasks can be performed
at the message/document or identity level (Pan, 2006). Message level analysis attempts to
categorize individual texts (e.g., emails) whereas identity level analysis is concerned with
classifying identities belonging to a particular entity. For example, let’s assume that the
entity John Smith has various email accounts (identities) in cyberspace (e.g.,
[email protected], [email protected], etc.). The message level identification task may
attempt to determine if an anonymous email was written by [email protected] while the
identity level identification task would attempt to determine whether [email protected] and
[email protected] are identities belonging to the same entity.
Table 2.2: Previous Studies in Online Stylometric Analysis
(Columns: tasks T1/T2; domains D1-D4; number of feature categories; number of features; feature set type; number of classes. Studies reviewed: Oman & Cook, 1989; Hayne & Rice, 1997; Krsul & Spafford, 1997; De Vel et al., 2001; Stamatatos et al., 2001; Chaski, 2001; Baayen et al., 2002; Corney et al., 2002; Argamon et al., 2003; Diederich et al., 2003; Hayne et al., 2003; Juola & Baayen, 2003; Koppel & Schler, 2003; Ding & Samadzadeh, 2004; Whitelaw & Argamon, 2004; Abbasi & Chen, 2005; Chaski, 2005; Abbasi & Chen, 2006; Li et al., 2006; Pan et al., 2006; Zheng et al., 2006.)
The majority of previous studies focused on message level analysis (e.g., De Vel et
al., 2001; Abbasi and Chen, 2005; Zheng et al., 2006) which is useful for forensic
applications with a small number of potential authors (e.g., Chaski, 2001). However,
message level analysis is not highly scalable to larger numbers of authors in cyberspace
due to difficulties in consistently identifying texts shorter than 250 words (Forsyth and
Holmes, 1996). Consequently, Zheng et al. (2006) noted a 14% drop in accuracy when
increasing the number of author classes from 5 to 20 in their classification of forum
postings. Argamon et al. (2003) also observed as much as a 23% drop in message
classification accuracy when increasing the number of authors from 5 to 20.
Identity level analysis attempts to categorize identities based on all texts written by
that identity. It is somewhat less challenging than message level categorization due to the
presence of larger text samples, making identity level analysis more suitable for cyber
content (Pan, 2006). Figure 2.1 presents illustrations of the identity level identification
and similarity detection tasks. For ID Identification, each anonymous identity is
compared against all known entities. The identity is assigned to the entity with the
highest similarity score (classification task). For ID Similarity Detection, each
anonymous identity is compared to all other identities. Identities with a similarity score
above a certain threshold are grouped together and considered to belong to the same
entity (clustering task).
Figure 2.1: Identity Level Tasks
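To make the two identity-level tasks concrete, the following Python sketch shows the decision rules in their simplest form. The identity names, scores, and threshold are hypothetical, and the similarity scores themselves would come from a stylometric comparison technique such as those described in the following sections.

def identify(entity_scores):
    # ID identification: assign the anonymous identity to the known entity
    # with the highest similarity score (a classification decision).
    return max(entity_scores, key=entity_scores.get)

def detect_similar(pair_scores, threshold):
    # ID similarity detection: identity pairs whose similarity score exceeds
    # the threshold are grouped as belonging to the same entity (clustering).
    return [pair for pair, score in pair_scores.items() if score > threshold]

# Hypothetical similarity scores.
print(identify({"entity_1": 0.42, "entity_2": 0.87, "entity_3": 0.55}))
print(detect_similar({("id_a", "id_b"): 0.91, ("id_a", "id_c"): 0.33}, threshold=0.80))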
2.2.2.2 Domains
Online text includes various modes of computer mediated communication (CMC),
such as asynchronous and synchronous mediums (Herring, 2002). Relevant asynchronous
modes for stylometric analysis are email, web forums, blogs, feedback/comments etc.
Many previous authorship studies focused on email (e.g., De Vel et al., 2001; Argamon et
al., 2003), forums (e.g., Abbasi and Chen, 2005; Li et al., 2006) and feedback comments
(Hayne and Rice, 1997; Hayne et al., 2003). Synchronous forms of textual
communication include instant messaging and chat rooms. We are unaware of any
stylometric analysis relating to persistent conversation, despite its prevalence as a
communication medium. The continuous nature of synchronous mediums makes them
especially interesting since authors have less time to craft their responses (Hayne et al.,
2003). It is difficult to surmise without investigation what impact this may have on the
ability to categorize authorship of persistent conversation.
Online documents encompass non-literary texts, essays, and news articles. Electronic
documents tend to be lengthier, more structured, and better written as compared to CMC
text. Many previous studies focused on electronic documents and attained high levels of accuracy
(e.g., Stamatatos et al., 2001; Chaski, 2001; Whitelaw and Argamon, 2004).
The domain of program code is important for identifying hackers and attackers
(Garson, 2006) as well as detecting software plagiarism. Program style analysis studies
have developed programming-specific features, often tailored towards specific
programming languages (e.g., Oman and Cook, 1989; Krsul and Spafford, 1997).
2.2.2.3 Features and Classes
Feature sets used in previous online studies typically consist of a handful of
categories and less than 500 features. Here, we define a “category” as a set of similar
features (e.g., word length distribution, punctuation, part-of-speech tag bi-grams, etc.).
Studies that did utilize larger feature sets typically incorporated only a couple of syntactic
or content specific feature categories such as bag-of-words and part-of-speech bi-grams
(Koppel and Schler, 2003; Diederich et al., 2003). Consequently, online stylometric
analysis has typically been applied to less than 20 authors, with only a few studies
exceeding 25 authors (e.g., Krsul and Spafford, 1997; Ding and Samadzadeh, 2004).
2.2.3 Feature Set Types for Online Stylometry
Two types of feature sets have been used in previous research, author group level and
individual author level. Most previous research used author group level sets where one
set of features is applied across all authors. In contrast, individual author level sets
consist of a feature set for each author (i.e., 10 authors = 10 feature sets). For instance,
Peng et al. (2003) created a feature set of the 5,000 most frequently used character n-grams for each author based on that author's usage. Similarly, Chaski (2001) developed
author level feature sets for misspelled words where each author’s feature set consisted of
words they commonly misspelled. Individual author level feature sets can be effective
when using feature categories with large potential feature spaces, such as n-grams or
misspellings (Peng et al., 2003). However, the use of individual author level sets requires
techniques that can handle multiple feature sets. Standard machine learning techniques
typically build a classifier using only a single feature set.
2.2.3.1 Individual Level Techniques
Two multiple feature-set techniques that have been utilized for pattern recognition
and stylistic analysis are ensemble classifiers and the Karhunen-Loeve transform.
Ensemble classifiers are a supervised technique that can be incorporated for the
stylometric identification task. They use multiple classifiers with each built using
different techniques, training instances, or feature subsets (Dietterich, 2000). Ensembles
are effective for analyzing large data streams (Wang et al., 2003). Particularly, the feature
subset classifier approach has been shown to be effective for analysis of style and
patterns. Stamatatos and Widmer (2002) used an SVM ensemble for music performer
recognition. They used multiple SVMs each trained using different feature subsets.
Similarly, Cherkauer (1996) used a neural network ensemble for imagery analysis. The
ensemble consisted of 32 neural networks trained on 8 different feature subsets. The
intuition behind using an ensemble is that it allows each classifier to act as an “expert” on
its particular subset of features (Cherkauer, 1996; Stamatatos and Widmer, 2002), thereby
improving performance over simply using a single classifier. For stylometric analysis,
building a classifier trained using a particular author’s features could allow it to become
an “expert” on identifying that author against others.
Karhunen-Loeve (K-L) transforms are a supervised form of principal component
analysis (PCA) that allows inclusion of class information in the transformation process
(Webb, 2002). K-L transforms have been used in several pattern recognition studies (e.g.,
Kirby and Sirovich, 1990; Uenohara and Kanade, 1997). Like PCA, K-L transforms are a
dimensionality reduction technique where the transformation is done by deriving the
basis matrix (set of eigenvectors) and then projecting the feature usage matrix into a
lower dimension space. PCA captures the variance across a set of authors (inter-class
variance) using a single feature set and basis matrix. In contrast, K-L transforms can be
applied to each individual author (intra-class variance) by only considering that author’s
feature set and basis matrix. Thus, K-L transforms can be used as an individual level
similarity detection technique where identity A’s variance pattern (extracted using A’s
feature set and basis matrix) can be compared against identity B’s variance pattern
(extracted using B’s feature set and basis matrix). However, when comparing identity A
to identity B, we must evaluate A using B’s features and basis matrix and B using A’s
features and basis matrix. Two comparisons are necessary due to the use of different
feature sets for each individual identity.
2.3 Research Gaps and Questions
Based on our review of previous literature we have identified several important
research gaps.
2.3.1 Similarity Detection
Most studies have focused on the identification task, with less emphasis on similarity
detection. Similarity detection is important for cyberspace since class definitions are
often not known a priori. There is a need for techniques that can perform identification
and similarity detection.
2.3.2 Richer Feature Sets
Previous feature sets lack either the necessary breadth (number of categories) or
depth (number of features). It is difficult to apply such feature sets to larger numbers of
authors with a high level of accuracy. Consequently, previous research has typically used
fewer than 20 author classes in experiments. However, application of stylometric
methodologies to cyber content necessitates the ability to discriminate authorship across
larger sets of authors.
2.3.3 Individual Author Level Features
Few online studies have incorporated multiple individual author level feature subsets
despite their effective application to other areas of style and pattern recognition. The use
of such feature sets along with techniques that can support individual author level
attributes could improve authorship categorization performance and scalability.
2.3.4 Scalability across Domains
Little work has been done to assess the effectiveness of features and techniques across
domains. Prior work mostly focused on a single domain (e.g., email or documents).
Furthermore, we are unaware of any studies applied to synchronous communication (e.g.,
instant messaging). Analysis across domains is important in order to evaluate the
robustness of stylometric techniques for various modes of CMC.
2.3.5 Research Questions
Based on the gaps described, we propose the following research questions:
1) Which authorship analysis techniques can be successfully used for the online
identification and similarity detection tasks?
2) What impact will the use of a more holistic feature set have on online
classification performance?
3) Will the use of multiple individual author level feature subsets improve
online attribution accuracy as compared to using a single author group level
feature set?
4) How scalable are these features and techniques with respect to the various
domains and in terms of number of authors?
2.4 Research Design: An Overview
In order to address these questions, we propose the creation of a stylometric analysis
technique that can perform ID level identification and similarity detection. Furthermore, a
more holistic feature set consisting of a larger number of features across several
categories is utilized in order to improve our representational richness of authorial style.
We plan to utilize two variations of this extended feature set; at the author group and
individual author levels. Our approach will be evaluated across multiple domains in
comparison with benchmark techniques and feature sets. The proposed technique, feature
sets, and feature types as well as comparison benchmarks are discussed below.
2.4.1 Techniques
We propose the development of the Writeprint technique which is an unsupervised
method that can be used for identification and similarity detection. Writeprints is a
Karhunen-Loeve transforms based technique that uses a sliding window and pattern
disruption to capture feature usage variance at a finer level of granularity. A sliding
window was incorporated since it has been shown to be effective in previous authorship
studies (Kjell et al., 1994; Abbasi and Chen, 2006). The technique uses individual author
level feature sets where a Writeprint is constructed for each author using the author’s key
features. The use of individual author level feature sets is intended to provide greater
scalability as compared to traditional machine learning techniques that only utilize a
single author group level set (e.g., SVM, PCA). For all features an author uses, Writeprint
patterns project usage variance into a lower dimension space, where each pattern point
reflects a single window instance. All key attributes in an author’s feature set that the
author never uses are treated as pattern disruptors where the occurrence of these features
in an anonymous identity's text decreases the similarity between the anonymous identity
and the author.
For the identification task, we plan to compare the Writeprint method against SVM
and an ensemble SVM classifier. SVM is a benchmark technique used in several previous
online stylometric identification studies (e.g., De Vel et al., 2001; Zheng et al., 2006; Li et
al., 2006). A single classifier is built using an author group level feature set. In contrast,
ensemble classifiers provide flexibility for using multiple individual author level feature
sets (Cherkauer, 1996; Dietterich, 2000). SVM ensembles with multiple feature subsets
have been shown to be effective for stylistic classification (Stamatatos and Widmer,
2002).
For the similarity detection task, we plan to compare the Writeprint method against
PCA and Karhunen-Loeve transforms. PCA has been used in numerous previous
stylometric analysis studies (e.g., Baayen et al., 2002; Abbasi and Chen, 2006). In PCA,
the underlying usage variance across a single author group level feature set is extracted
by deriving a basis of eigenvectors that are used to transform the feature space to a lower
dimensional representation/pattern (Binongo and Smith, 1999). The distance between two
identities’ patterns can be used to determine the degree of stylistic similarity. K-L
transforms are a PCA variant often used in pattern recognition studies (Watanabe, 1985;
Webb, 2002) and provide a mechanism for using multiple individual author level feature
sets. As previously alluded to, the use of different feature sets and basis matrices for each
author in K-L transforms entails two comparisons for each set of identities (A using B’s
features and basis, and vice versa). Specific details about the Writeprint and comparison
identification and similarity detection techniques are provided in the system design
discussion in section 2.5.
2.4.2 Feature Sets and Types
The use of an extended set of features could improve the scalability of stylometric
analysis by allowing greater discriminatory potential across larger sets of authors. We
propose the development of a holistic feature set containing more feature categories
(breadth) and number of features (depth) intended to improve performance and
scalability. Our extended feature (EF) set contains several static and dynamic feature
categories across various groups (i.e., lexical, syntactic, structural, content specific, and
misspellings). Static features include well-defined context free categories such as
function words, word length distributions, vocabulary richness measures etc. In contrast,
dynamic feature categories are context dependent attributes, such as n-grams (e.g., word,
character, POS tag, and digit level) and misspelled words. These categories have infinite
potential feature spaces, varying based on the underlying text corpus. As a result,
dynamic feature categories usually include some form of feature selection in order to
extract the most important style markers for the particular authors and text (Koppel and
Schler, 2003). We utilized the information gain heuristic due to its effectiveness in
previous text categorization (Efron et al., 2004) and authorship analysis research (Koppel
and Schler, 2003). In order to evaluate the effectiveness of our extended feature set (EF),
we plan to compare its performance against a baseline feature set (BF) commonly used in
previous online stylometric analysis research (De Vel et al., 2001; Corney et al., 2002;
Abbasi and Chen, 2005; Zheng et al., 2006; Li et al., 2006). The baseline set (BF)
consists of static lexical, syntactic, structural, and content specific features used for
categorization of up to 20 authors. Further details about the two feature sets (EF and BF),
extraction, and feature selection procedures are discussed in the system design section.
2.4.2.1 Feature Set Types
Based on the success of multiple feature subset approaches, we propose to compare
the effectiveness of the author group level feature set approach used in most previous
studies against the use of multiple, individual identity-level feature sets. Thus, our
extended feature set (EF) will be used as a single group level set (EF-Group) or multiple
individual level subsets will be selected (EF-Individual).
2.5 System Design
We propose the following system design (shown in Figure 2.2). Our design has two
major steps: feature extraction and classifier construction. These steps are used to carry
out identification and similarity detection of online texts.
Figure 2.2: Stylometric Analysis System Design
2.5.1 Feature Extraction
The extraction phase begins with a data preprocessing step where all message
signatures are initially filtered out in order to remove obvious identifiers (De Vel et al.,
2001). This step is particularly important for email data where authors often include
signatures such as name, job title, address, position, contact information etc. The next
step involves extraction of the static and dynamic features resulting in the creation of our
feature sets. We included two feature sets: a baseline feature set (BF) consisting of static
author group level features and an extended feature set (EF) consisting of static and
dynamic features. For static features, extraction simply involves generating the feature
usage statistics (feature vectors) across texts; however, dynamic feature categories such as
n-grams require indexing and feature selection. The feature extraction procedures for the
two feature sets (BF and EF) are described below while Table 2.3 provides a description
of the two feature sets. For dynamic feature categories, the number of features varies
depending on the indexing and feature selection for a specific data set as well as whether
the author group (EF-Group) or individual author (EF-Individual) level is being used for
feature selection. For some such categories, the upper limit of features is already known
(e.g., number of character bigrams is less than 676).
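To illustrate the static portion of the extraction step, the following Python sketch computes a few of the measurements listed in Table 2.3 (word-level statistics, the word length distribution, a handful of function words, and punctuation) for a single message. The function name and the abbreviated ten-word function word list are illustrative assumptions rather than the system's actual extraction code, which uses the full 150- or 300-word function word lists.

import re
from collections import Counter

# Illustrative (hypothetical) subset of the function word list; the real
# feature sets use 150 (BF) or 300 (EF) function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "for", "with", "on", "at", "by"]

def static_feature_vector(text):
    # Extract a few static lexical and syntactic features for one message.
    words = re.findall(r"[A-Za-z']+", text.lower())
    total_words, total_chars = len(words), len(text)
    features = {
        "total_words": total_words,
        "avg_word_length": sum(len(w) for w in words) / total_words if total_words else 0.0,
    }
    # Word length distribution (relative frequency of 1-20 letter words).
    length_counts = Counter(min(len(w), 20) for w in words)
    for n in range(1, 21):
        features[f"wordlen_{n}"] = length_counts.get(n, 0) / total_words if total_words else 0.0
    # Function word frequencies (syntactic).
    word_counts = Counter(words)
    for fw in FUNCTION_WORDS:
        features[f"funcword_{fw}"] = word_counts.get(fw, 0) / total_words if total_words else 0.0
    # Punctuation occurrences (syntactic).
    for p in "!;:,.?":
        features[f"punct_{p}"] = text.count(p) / total_chars if total_chars else 0.0
    return features

print(static_feature_vector("Hello folks, thanks for the quick reply. I will send the files to you tonight!"))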
Table 2.3: Baseline and Extended Feature Sets

Group         | Category            | Baseline (BF) | Extended (EF) | Description
Lexical       | Word-Level          | 5             | 5             | total words, % char. per word
Lexical       | Character-Level     | 5             | 5             | total char., % char. per message
Lexical       | Letters             | 26            | 26            | count of letters (e.g., a, b, c)
Lexical       | Character Bigrams   | -             | < 676         | char. bigrams (e.g., aa, ab, ac)
Lexical       | Character Trigrams  | -             | < 17,576      | char. trigrams (e.g., aaa, aab, aac)
Lexical       | Digits              | -             | 10            | digits (e.g., 1, 2, 3)
Lexical       | Digit Bigrams       | -             | < 100         | 2-digit number frequencies (e.g., 10, 11, 12)
Lexical       | Digit Trigrams      | -             | < 1,000       | 3-digit number frequencies (e.g., 100, 101)
Lexical       | Word Length Dist.   | 20            | 20            | frequency distribution of 1-20 letter words
Lexical       | Vocabulary Richness | 8             | 8             | richness (e.g., hapax legomena, Yule's K)
Lexical       | Special Characters  | 21            | 21            | occurrences of special char. (e.g., @#$%^&*+)
Syntactic     | Function Words      | 150           | 300           | frequency of function words (e.g., of, for, to)
Syntactic     | Punctuation         | 8             | 8             | occurrence of punctuation marks (e.g., !;:,.?)
Syntactic     | POS Tags            | -             | < 2,300       | frequency of part-of-speech tags (e.g., NP, VB, JJ)
Syntactic     | POS Tag Bigrams     | -             | varies        | POS tag bigrams (e.g., NNP VB)
Syntactic     | POS Tag Trigrams    | -             | varies        | POS tag trigrams (e.g., NNP VB JJ)
Structural    | Message-Level       | 6             | 6             | e.g., has greeting, has url, requoted content
Structural    | Paragraph-Level     | 8             | 8             | e.g., no. of paragraphs, sentences per paragraph
Structural    | Technical Structure | 50            | 50            | e.g., file extensions, fonts, use of images
Content       | Words               | 20            | varies        | bag-of-words (e.g., "senior", "editor")
Content       | Word Bigrams        | -             | varies        | word bigrams (e.g., "senior editor")
Content       | Word Trigrams       | -             | varies        | word trigrams (e.g., "editor in chief")
Idiosyncratic | Misspelled Words    | -             | < 5,513       | common misspellings (e.g., "beleive", "thougth")
2.5.1.1 Baseline Feature Set (BF)
This feature set contains 327 lexical, syntactic, structural, and content-specific
features. Variants of this feature set have been used in numerous previous studies (e.g.,
De Vel et al., 2001; Corney et al., 2002; Abbasi and Chen, 2005; Li et al., 2006; Zheng et
al., 2006). Since this feature set is devoid of any dynamic feature categories (e.g., n-grams, misspellings), it has a fairly straightforward extraction procedure.
2.5.1.2 Extended Feature Set (EF)
The extended feature set is a mixture of static and dynamic features. The dynamic
features include several n-gram feature categories and a list of 5,513 common word
misspellings taken from various websites including Wikipedia (www.wikipedia.org). N-gram categories utilized include character, word, POS tag, and digit level n-grams. The
POS tagging was conducted using the Arizona Noun Phrase extractor (McDonald et al.,
2005), which uses the Penn Treebank tag set and also performs noun phrase chunking and
named entity recognition and tagging. These n-gram based categories require indexing
with the number of initially indexed features varying depending on the data set. The
indexed features are then sent forward to the feature selection phase. Use of such an
indexing and feature selection/filtering procedure for n-grams is quite necessary and
common in stylometric analysis research (e.g., Peng et al., 2003; Koppel and Schler,
2003).
Feature selection is applied to all the n-gram and misspelled word categories using
the information gain (IG) heuristic. IG has been used in many text categorization studies
as an efficient method for selecting text features (e.g., Forman, 2003; Efron et al., 2004;
Koppel and Schler, 2003). Specifically, it is computationally efficient compared to
search-based techniques (Dash and Liu, 1997; Guyon and Elisseef, 2003) and good for
multi-class text problems (Yang and Pedersen, 1997). IG is applied at the author group
and individual author levels. The information gain for feature j across a set of classes c is
derived as IG(c,j) = H(c) – H(c|j) where H(c) is the overall entropy across author classes
and H(c|j) is the conditional entropy for feature j. For the author group level feature set
(EF-Group), IG is applied across all author classes (size of c = # authors). For individual
identity level feature sets (EF-Individual), IG is applied using a 2-class (one-against-all)
set up (size of c = 2, c1 = identity, c2 = rest). The EF-Group feature set is intended to
utilize the set of features that can best discriminate authorship across all authors while
each EF-Individual feature set attempts to find the set of features most effective at
differentiating a specific author against all others.
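The following Python sketch illustrates how IG(c, j) = H(c) - H(c|j) can be computed for a binary feature occurrence indicator, and how the same routine serves both the EF-Group setup (labels spanning all author classes) and the EF-Individual setup (labels collapsed to a one-against-all coding). The block labels and feature vector in the example are hypothetical.

import numpy as np

def entropy(labels):
    # H(c): entropy of the class label distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_present, labels):
    # IG(c, j) = H(c) - H(c | j) for a binary feature occurrence indicator.
    labels = np.asarray(labels)
    feature_present = np.asarray(feature_present, dtype=bool)
    conditional = 0.0
    for value in (True, False):
        mask = feature_present == value
        if mask.any():
            conditional += mask.mean() * entropy(labels[mask])
    return entropy(labels) - conditional

# Hypothetical text blocks from four authors; the feature occurs only in author 0's blocks.
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
feature_j = np.array([1, 1, 0, 0, 0, 0, 0, 0])

# EF-Group selection: IG computed across all author classes.
print(information_gain(feature_j, labels))

# EF-Individual selection for author 0: one-against-all (identity vs. rest) labels.
print(information_gain(feature_j, (labels == 0).astype(int)))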
2.5.2 Classifier Construction
2.5.2.1 Writeprint Technique
The Writeprint technique has two major parts: creation and pattern disruption. The
creation part is concerned with the steps relating to the construction of patterns reflective
of an identities’ writing style variation. In this step Karhunen-Loeve transforms are
applied with a sliding window in order to capture stylistic variation with a finer level of
granularity. The pattern disruption part describes how zero usage features can be used as
red flags to decrease the level of stylistic similarity between identities. The two major
steps, which are repeated for each identity, are shown below:
Writeprint Steps
1) For all identity features with occurrence frequency >0.
a) Extract feature vectors for each sliding window instance.
b) Derive basis matrix (set of eigenvectors) from feature usage
covariance matrix using Karhunen-Loeve transforms.
c) Compute window instance coordinates (principal components) by
multiplying window feature vectors with basis. Window instance
points in n dimensional space represent author Writeprint pattern.
2) For all author features with occurrence frequency =0.
a) Compute feature disruption value as product of information gain,
synonymy usage, and disruption constant K.
b) Append features’ disruption values to basis matrix.
c) Apply disruptor based on pattern orientations.
3) Repeat steps 1-2 for each identity.
Figure 2.3 presents an illustration of the Writeprint process while these steps are
described in greater detail below.
Figure 2.3: Writeprint Creation Illustration
Step 1: Writeprint Creation
A lower dimensional usage variation pattern is created based on the occurrence
frequency of the identity’s features (individual level feature set). For all features with
usage frequency greater than zero, a sliding window of length L with a jump interval of J
characters is run over the identity’s messages. The feature occurrence vector for each
window is projected to an n-dimensional space by applying the Karhunen-Loeve
transform. The Kaiser-Guttman stopping rule (Jackson, 1993) was used to select the
number of eigenvectors in the basis. The formulation for step 1 is presented below:
Let Ω = {1, 2, ..., f} denote the set of f features with frequency greater than 0 and
Φ = {1, 2, ..., w} represent the set of w text windows. Let X denote the author's feature
matrix, where x_ij is the value of feature j ∈ Ω for window i ∈ Φ:

X = [ x_11  x_12  ...  x_1f
      x_21  x_22  ...  x_2f
      ...
      x_w1  x_w2  ...  x_wf ]
Extract the set of eigenvalues {λ_1, λ_2, ..., λ_n} for the covariance matrix Σ of the feature
matrix X by finding the points where the characteristic polynomial of Σ equals 0:

p(λ) = det(Σ − λI) = 0

For each eigenvalue λ_m > 1, extract its eigenvector a_m = (a_m1, a_m2, ..., a_mf) by solving
the following system, resulting in a set of n eigenvectors {a_1, a_2, ..., a_n}:

(Σ − λ_m I) a_m = 0

Compute an n-dimensional representation for each window i by extracting principal
component scores ε_ik for each dimension k ≤ n:

ε_ik = a_k^T x_i
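A minimal Python sketch of step 1, under simplifying assumptions: a character-based sliding window, simple character-frequency feature vectors for each window, and the Kaiser-Guttman rule (retain eigenvalues greater than one, falling back to the largest component when none qualify). The window length, jump interval, and feature list are illustrative choices rather than the parameter values used in our experiments.

import numpy as np

def sliding_windows(text, length=500, jump=100):
    # Yield overlapping character windows of size `length` every `jump` characters.
    for start in range(0, max(len(text) - length + 1, 1), jump):
        yield text[start:start + length]

def window_feature_matrix(text, features, length=500, jump=100):
    # Rows are windows; columns are per-window relative frequencies of `features`.
    rows = [[win.count(f) / len(win) for f in features]
            for win in sliding_windows(text, length, jump)]
    return np.array(rows)

def writeprint_pattern(X):
    # Karhunen-Loeve transform: project window vectors onto the eigenvectors of
    # the feature covariance matrix, keeping eigenvalues > 1 (Kaiser-Guttman).
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalue order
    keep = eigvals > 1.0
    if not keep.any():                           # fallback: largest component only
        keep = eigvals == eigvals.max()
    basis = eigvecs[:, keep]                     # columns are retained eigenvectors
    return X @ basis, basis                      # pattern points and basis matrix

# Toy example with a single identity's text and a tiny illustrative feature set.
text = "hello folks, thanks again for the quick reply. " * 40
features = ["e", "t", "a", "o", ",", "."]        # hypothetical key features
X = window_feature_matrix(text, features)
pattern, basis = writeprint_pattern(X)
print(pattern.shape, basis.shape)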
Step 2: Pattern Disruption
Since Writeprints uses individual author level feature sets, an author’s key set of
features may contain attributes that are significant because the author never uses them.
However, features with no usage will currently be irrelevant to the process since they
have no variance. Nevertheless these features are still important when comparing an
author to other anonymous identities. The author’s lack of usage of these features
represents an important stylistic tendency. Anonymous identity texts containing these
features should be considered less similar (since they contain attributes never used by this
author).
As previously mentioned, when comparing two identities’ usage variation patterns,
two comparisons must be made since both identities used different feature sets and basis
matrices in order to construct their lower dimensional patterns. The dual comparisons are
illustrated in Figure 2.4. We would need to construct a pattern for identity B using B’s
text with A’s feature set and basis matrix (Pattern B) to be compared against identity A’s
Writeprint (and vice versa). The overall similarity between Identity A and B is the sum of
the average n-dimensional Euclidean distance between Writeprint A and Pattern B and
Writeprint B and Pattern A. When making such a comparison we would like A’s zero
frequency features to act as “pattern disruptors,” where the presence of these features in
identity B's text decreases the similarity for the particular A – B comparison. It is less
likely that identity A wrote text containing features that identity A never uses.
Figure 2.4: Writeprint Comparisons
Such disruption can be achieved by appending a non-zero coefficient d in identity A’s
basis matrix for such features. Let Ψ = {f + 1, f + 2, ..., f + g} denote the set of g features
with zero frequency. For each feature p ∈ Ψ, append the value d_kp to each eigenvector a_k,
where k ≤ n. Let's assume that one of identity A's key attributes is the word
“folks,” which is important because identity A never uses it. Figure 2.5 shows how
pattern disruption can reduce the similarity between identity A and B by shifting away
identity B’s pattern points for text windows containing the word “folks.” In this example,
the value d is substituted as the coefficient for feature number 3 (“folks”) in identity A’s
primary two eigenvectors (a_13, a_23). The direction of a window point's shift is intended to
reduce the similarity between the Writeprint and comparison pattern. This is done by
making d_kp positive or negative for a particular dimension k based on the orientation of
the Writeprint (WP) and comparison pattern (PT) points along that dimension, as follows:
d_kp = -d_p,  if (1/w) Σ_{i=1..w} WP_ik > (1/w) Σ_{i=1..w} PT_ik
d_kp = +d_p,  if (1/w) Σ_{i=1..w} WP_ik < (1/w) Σ_{i=1..w} PT_ik
For instance, if identity A’s Writeprint is spatially located to the left of identity B’s
pattern for dimension k, the disruptor d kp will be positive in order to ensure that the
comparison pattern moves away from the Writeprint as opposed to towards it.
Figure 2.5: Illustration of Pattern Disruption
The magnitude of d signifies the extent of the disruption for a particular feature.
Larger values of d will cause pattern points representing text windows containing the
disruptor feature to be shifted further away. However, not all features are equally
important discriminators. For example, lack of usage of the word “Colorado” is less
significant than lack of usage of the word “folks,” because “Colorado” is a noun
conveying topical information. Lack of usage of “Colorado” simply means this author
doesn’t talk about Colorado, and is not indicative of stylistic choice. It is more reflective
of context than style. In contrast, lack of use of “folks” (a function word used to address
people) is a stylistic tendency. It is possible and likely that the author uses some other
word (synonym of “folks”) to address people or doesn’t address them at all. Koppel et al.
(2006) developed a machine translation based technique for measuring the degree of
feature “stability.” Stability refers to how often a feature changes across authors and
documents for a constant topic. They found nouns to be more stable than function words
and argued that function words are better stylistic discriminators than nouns since use of
function words involves making choices between a set of synonyms. Based on this
intuition we devised a formula for the disruptor coefficient d for feature p. Our formula
considers the feature's information gain (IG) and synonymy usage:
d_p = IG(c, p) · K · (syn_total + 1)(syn_used + 1)
where IG(c, p) is the information gain for feature p across the set of classes c, and
syn_total and syn_used are the total number of synonyms and the number used by the author,
respectively, for the disruptor feature. The feature synonym information is derived from
WordNet (Fellbaum, 1998). Synonym information is only used for word-based features
(e.g., word n-grams, function words). For other feature category disruptors, syn_total and
syn_used will equal 0. K is a disruptor constant used to control the magnitude and
aggressiveness of the pattern disruption mechanism. We used integer values between one
and ten for K and generally attained the best results using a value of two. As previously
mentioned, each disruptor is applied in such a manner as to shift the comparison print
further away from the Writeprint.
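The disruption computation can be sketched as follows in Python; the information gain value and synonym counts in the example are hypothetical, and the sign rule simply pushes the comparison pattern away from the Writeprint along each retained dimension.

import numpy as np

def disruption_magnitude(info_gain, syn_total, syn_used, K=2):
    # d_p = IG(c, p) * K * (syn_total + 1) * (syn_used + 1); synonym counts
    # are 0 for non-word features, and K = 2 generally worked best.
    return info_gain * K * (syn_total + 1) * (syn_used + 1)

def disruption_sign(writeprint_k, pattern_k):
    # Push the comparison pattern away from the Writeprint along dimension k:
    # negative if the Writeprint's mean coordinate lies above the pattern's.
    return -1.0 if np.mean(writeprint_k) > np.mean(pattern_k) else 1.0

# Hypothetical example: the unused function word "folks" (IG = 0.4, six WordNet
# synonyms, two of which the author uses) versus the topical noun "Colorado".
print(disruption_magnitude(0.4, syn_total=6, syn_used=2))   # larger disruption
print(disruption_magnitude(0.4, syn_total=0, syn_used=0))   # smaller disruption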
2.5.2.2 Comparison Identification and Similarity Detection Techniques
For all comparison techniques, feature vectors are derived for non-overlapping 1,500
character blocks of text from each identity’s text. This particular length was used since it
corresponds to approximately 250 words, the minimum text length considered effective
for authorship analysis (Forsyth and Holmes, 1996).
In addition to the Writeprint method, SVM and ensemble SVM are utilized as
comparison identification techniques. SVM is run using a linear kernel with the sequential
minimal optimization (SMO) algorithm (Platt, 1999), the same settings used in
numerous previous studies (e.g., Zheng et al., 2006; Li et al., 2006). For ensemble SVM
we build multiple classifiers (one using each identity’s features). Anonymous identities
are assigned by aggregating results across classifiers.
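As a sketch of this ensemble setup, the following Python fragment (written with scikit-learn, which is an assumption on our part rather than the toolkit used in the reported experiments) trains one linear SVM per identity on that identity's feature subset in a one-against-all fashion and assigns an anonymous identity to the classifier with the highest aggregated decision score. The synthetic data, subset indices, and sum-based aggregation rule are illustrative.

import numpy as np
from sklearn.svm import SVC

def train_ensemble(block_vectors, block_labels, feature_subsets):
    # One linear SVM per identity, each trained only on that identity's
    # feature subset with one-against-all (identity vs. rest) labels.
    classifiers = {}
    labels = np.asarray(block_labels)
    for identity, subset in feature_subsets.items():
        X = block_vectors[:, subset]
        y = (labels == identity).astype(int)
        classifiers[identity] = SVC(kernel="linear").fit(X, y)
    return classifiers

def classify_identity(classifiers, feature_subsets, anonymous_blocks):
    # Aggregate each classifier's decision values over the anonymous identity's
    # blocks and assign the identity with the highest total score.
    scores = {identity: clf.decision_function(anonymous_blocks[:, feature_subsets[identity]]).sum()
              for identity, clf in classifiers.items()}
    return max(scores, key=scores.get)

# Synthetic example: identity A's blocks use the first three features more heavily.
rng = np.random.default_rng(0)
blocks = rng.random((40, 6))
blocks[:20, :3] += 1.0
labels = ["A"] * 20 + ["B"] * 20
subsets = {"A": [0, 1, 2], "B": [3, 4, 5]}
clfs = train_ensemble(blocks, labels, subsets)
anonymous = rng.random((5, 6)) + np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(classify_identity(clfs, subsets, anonymous))   # expected to print "A"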
For similarity detection, PCA and K-L transforms are used. For PCA we extract the
basis matrix for a single author group level feature set, where the feature matrix contains
vectors across identity classes. Thus, PCA captures the inter-author feature usage
variation for a common set of features. In contrast, for the K-L Transforms the basis
matrix is extracted for each individual identity using the identity’s feature set and
occurrence vectors. Each author basis matrix thus captures the intra-author feature usage
variation.
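The dual comparison required by individual author level feature sets can be sketched as follows in Python. The mean pairwise Euclidean distance is one plausible reading of the "average n-dimensional distance" between two patterns, and the argument names (e.g., Xb_on_a, meaning identity B's window vectors re-extracted with identity A's feature set) and the random toy data are illustrative assumptions.

import numpy as np

def avg_distance(pattern_a, pattern_b):
    # Mean Euclidean distance between all point pairs of two patterns
    # (one plausible reading of "average n-dimensional distance").
    diffs = pattern_a[:, None, :] - pattern_b[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean()

def dual_comparison_distance(Xa_on_a, Xb_on_a, basis_a, Xa_on_b, Xb_on_b, basis_b):
    # Two comparisons: Writeprint A vs. Pattern B (both projected with A's basis)
    # plus Writeprint B vs. Pattern A (both projected with B's basis).
    # Lower totals indicate greater stylistic similarity.
    d1 = avg_distance(Xa_on_a @ basis_a, Xb_on_a @ basis_a)
    d2 = avg_distance(Xb_on_b @ basis_b, Xa_on_b @ basis_b)
    return d1 + d2

# Toy example with random window vectors and per-identity bases.
rng = np.random.default_rng(1)
Xa, Xb = rng.random((8, 6)), rng.random((10, 5))
basis_a, basis_b = rng.random((6, 2)), rng.random((5, 3))
Xb_on_a, Xa_on_b = rng.random((10, 6)), rng.random((8, 5))
print(dual_comparison_distance(Xa, Xb_on_a, basis_a, Xa_on_b, Xb, basis_b))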
2.6 Evaluation
In order to evaluate the effectiveness of the Writeprint technique and extended feature
set (EF) two experiments were conducted. The experiments compared the extended
features (EF) and Writeprint technique against our comparison techniques and baseline
feature set (BF). The experiments were conducted for the identification and similarity
detection tasks across test beds from various domains. The test beds and experiments are
described below.
2.6.1 Test Bed
The test bed consists of four data sets spanning asynchronous, synchronous, and
program code domains. This first data set is composed of email messages from the
publicly available Enron email corpus. The second test set consists of buyer/seller
feedback comments extracted from EBay (www.ebay.com). The third data set contains
programming code snippets taken from the Sun Java Technology Forum
(forum.java.sun.com), while the fourth data set consists of instant messaging chat logs taken
from CyberWatch (www.cyberwatch.com). Table 2.4 provides some details about the test
bed. For each data set, we randomly extracted 100 authors. The data sets also differ in
terms of the average amount of text per author, time span, and amount of noise. The
email and forum data sets have greater noise due to the presence of requoted and
forwarded content (which is not always easy to filter out). The CyberWatch chat logs
contain the least amount of text since each author’s text is only a single conversation.
Table 2.4: Details for Data Sets in Test Bed

Data Set        | Domain            | # Authors | Words (per Author) | Time Duration | Noise
Enron Email     | Asynchronous (D1) | 100       | 27,774             | 10/98 - 09/02 | Yes
EBay Comments   | Asynchronous (D1) | 100       | 23,423             | 02/03 - 04/06 | No
Java Forum      | Program Code (D4) | 100       | 43,562             | 04/03 - 05/06 | Yes
CyberWatch Chat | Synchronous (D2)  | 100       | 1,422              | 05/04 - 08/06 | No
2.6.2 Experiment 1: Identification Task
2.6.2.1 Experimental Setup
For the identification task, each author’s text was split into two identities: one known
entity and one anonymous identity. All techniques were run using 10-fold cross validation
by splitting author texts into 10 parts (5 for known entity, 5 for anonymous identity). For
example, in fold 1: parts 1-5 are used for the known entity while parts 6-10 for the
anonymous identity, fold 2: parts 2-6 are used for the known entity while parts 1 and 7-10
for the anonymous identity. The overall accuracy was the average classification accuracy
across all 10 folds where the classification accuracy was computed as follows:
Classification Accuracy = Number of Correctly Classified Identities / Total Number of Identities
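The fold rotation described above can be sketched as follows; the part labels are placeholders and the helper is one straightforward reading of the setup rather than the original experiment code.

def identity_folds(parts, n_folds=10, known_size=5):
    # For each fold, rotate which contiguous parts form the known entity and
    # assign the remaining parts to the anonymous identity.
    n = len(parts)
    folds = []
    for fold in range(n_folds):
        known_idx = [(fold + i) % n for i in range(known_size)]
        known = [parts[i] for i in known_idx]
        anonymous = [parts[i] for i in range(n) if i not in known_idx]
        folds.append((known, anonymous))
    return folds

parts = [f"part-{i}" for i in range(1, 11)]
for k, (known, anon) in enumerate(identity_folds(parts)[:2], start=1):
    print(f"fold {k}: known={known} anonymous={anon}")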
Four combinations of feature sets, feature types, and techniques were used (shown in
Table 2.5 below). As shown in the 5th row in Table 2.5, a baseline was included which
featured the use of SVM with the group level baseline feature set (BF). This particular
baseline consisted of the same combination of features and technique used in numerous
previous studies (e.g., De Vel et al., 2001; Abbasi and Chen, 2005; Li et al., 2006; Zheng
et al., 2006). The baseline was intended to be compared against the use of SVM with the
group level extended feature set (SVM/EF, as shown in row 4) in order to assess the
effectiveness of a more holistic feature set for online identification (4th row vs. 5th row in
Table 2.5). We also wanted to evaluate the effectiveness of individual author level feature
sets by comparing an Ensemble SVM using EF-Individual against the SVM/EF method
which uses a single group level feature set (3rd row vs. 4th row in Table 2.5). Finally, the
Writeprint technique was included with the extended feature set in order to evaluate the
effectiveness of this technique in comparison with Ensemble SVM and SVM/EF (2nd row
vs. 3rd and 4th row in Table 2.5).
Table 2.5: Techniques/Feature Sets for Identification Experiment

Label      | Technique    | Feature Set Type | Feature Set
Writeprint | Writeprint   | Individual       | EF
Ensemble   | Ensemble SVM | Individual       | EF
SVM/EF     | SVM          | Group            | EF
Baseline   | SVM          | Group            | BF
2.6.2.2 Hypotheses
H1a (Feature Sets):
The use of a more holistic feature set with a larger number of features and categories
(EF) will outperform the baseline feature set (BF). Thus, SVM/EF will outperform the
Baseline.
H1b (Feature Set Types):
The use of individual author level feature subsets (EF-Individual) will outperform the
use of a single author group-level feature set (EF-Group). Thus, Ensemble SVM will
outperform SVM/EF.
H1c (Techniques):
The Writeprint technique will outperform SVM (SVM/EF and Ensemble SVM).
2.6.2.3 Experimental Results
Table 2.6 shows the experimental results for all four combinations of features and
techniques across the four data sets. The Writeprint technique had the best performance
on the email, comments, and chat data sets. Furthermore, individual author level feature
set techniques (Writeprint and Ensemble) had higher accuracy on these data sets than
author group-level feature set methods (SVM/EF and Baseline). However, Writeprints
performed poorly on the programming forum data set. This is attributable to the inability
of the variation patterns and disruptors to effectively capture programming style. The
extended feature set (EF) had better performance than the benchmark feature set (BF) as
reflected by the fact that all techniques using EF had higher accuracy than the Baseline
on all data sets.
Table 2.6: Experimental Results (% accuracy) for Identification Task

Test Bed        | Techniques/Features | 25 Authors | 50 Authors | 100 Authors
Enron Email     | Writeprint          | 92.0       | 90.4       | 83.1
Enron Email     | Ensemble            | 88.0       | 88.2       | 76.7
Enron Email     | SVM/EF              | 87.2       | 86.6       | 69.7
Enron Email     | Baseline            | 64.8       | 54.4       | 39.7
EBay Comments   | Writeprint          | 96.0       | 95.2       | 91.3
EBay Comments   | Ensemble            | 96.0       | 94.0       | 90.9
EBay Comments   | SVM/EF              | 95.6       | 93.8       | 90.4
EBay Comments   | Baseline            | 90.6       | 86.4       | 83.9
Java Forum      | Writeprint          | 88.8       | 66.4       | 52.7
Java Forum      | Ensemble            | 92.4       | 85.2       | 53.5
Java Forum      | SVM/EF              | 94.0       | 86.6       | 41.1
Java Forum      | Baseline            | 84.8       | 60.2       | 23.4
CyberWatch Chat | Writeprint          | 50.4       | 42.6       | 31.7
CyberWatch Chat | Ensemble            | 46.0       | 36.6       | 22.6
CyberWatch Chat | SVM/EF              | 40.0       | 33.3       | 19.8
CyberWatch Chat | Baseline            | 37.6       | 30.8       | 17.5
2.6.2.4 Hypotheses Results
Table 2.7 shows the p-values for the pair-wise t-tests conducted on the classification
accuracies in order to measure the statistical significance of the results. Bolded values
indicate statistically significant outcomes in line with our hypotheses. Values with a plus
sign indicate significant outcomes contradictory to our hypotheses.
H1a: Feature Sets
The extended feature set (EF) outperformed the baseline feature set (BF) across all
data sets (p <0.01) based on the better performance of SVM/EF as compared to Baseline.
H1b: Feature Set Types
Individual author level feature subsets (EF-Individual) significantly outperformed the
group level feature set (EF-Group) on the Enron and CyberWatch data sets (p<0.01). This
is based on the better performance of the Ensemble technique as compared to SVM/EF.
EF-Individual also outperformed the EF-Group feature set on the EBay data set, but not
significantly.
H1c: Techniques
The Writeprint technique significantly outperformed SVM (Ensemble and SVM/EF)
on the Enron and CyberWatch data sets (p<0.01). Writeprints also outperformed SVM on
EBay data set, but not significantly.
Table 2.7: P-values for Pair-Wise t-tests on Accuracy

Test Bed        | Techniques/Features     | 25 Authors | 50 Authors | 100 Authors
Enron Email     | Writeprint vs. Ensemble | <0.001**   | 0.002**    | <0.001**
Enron Email     | Writeprint vs. SVM/EF   | <0.001**   | <0.001**   | <0.001**
Enron Email     | Writeprint vs. Baseline | <0.001**   | <0.001**   | <0.001**
Enron Email     | Ensemble vs. SVM/EF     | 0.330      | 0.049*     | <0.001**
Enron Email     | Ensemble vs. Baseline   | <0.001**   | <0.001**   | <0.001**
Enron Email     | SVM/EF vs. Baseline     | <0.001**   | <0.001**   | <0.001**
EBay Comments   | Writeprint vs. Ensemble | 0.500      | 0.100      | 0.134
EBay Comments   | Writeprint vs. SVM/EF   | 0.673      | 0.167      | 0.101
EBay Comments   | Writeprint vs. Baseline | <0.001**   | <0.001**   | <0.001**
EBay Comments   | Ensemble vs. SVM/EF     | 0.673      | 0.772      | 0.339
EBay Comments   | Ensemble vs. Baseline   | <0.001**   | <0.001**   | <0.001**
EBay Comments   | SVM/EF vs. Baseline     | <0.001**   | <0.001**   | <0.001**
Java Forum      | Writeprint vs. Ensemble | 0.002+     | <0.001+    | 0.309
Java Forum      | Writeprint vs. SVM/EF   | <0.001+    | <0.001+    | <0.001**
Java Forum      | Writeprint vs. Baseline | 0.005**    | <0.001**   | <0.001**
Java Forum      | Ensemble vs. SVM/EF     | 0.097      | 0.166      | <0.001**
Java Forum      | Ensemble vs. Baseline   | <0.001**   | <0.001**   | <0.001**
Java Forum      | SVM/EF vs. Baseline     | <0.001**   | <0.001**   | <0.001**
CyberWatch Chat | Writeprint vs. Ensemble | <0.001**   | 0.052      | 0.004**
CyberWatch Chat | Writeprint vs. SVM/EF   | <0.001**   | <0.001**   | <0.001**
CyberWatch Chat | Writeprint vs. Baseline | <0.001**   | <0.001**   | <0.001**
CyberWatch Chat | Ensemble vs. SVM/EF     | <0.001**   | 0.155      | 0.008**
CyberWatch Chat | Ensemble vs. Baseline   | <0.001**   | 0.064      | <0.001**
CyberWatch Chat | SVM/EF vs. Baseline     | <0.001**   | <0.001**   | <0.001**

* P-values significant at alpha = 0.05; ** P-values significant at alpha = 0.01
+ P-values contradict hypotheses
2.6.2.5 Results Discussion
Feature Sets
The Enron email data set feature set sizes and SVM techniques’ performance are
shown in Figure 2.6 below. The number of features for EF-Individual is the average size of
each author’s feature set. The increased number of authors caused the EF-Group feature
set to grow at an increasing rate. This resulted in a decreased number of relevant features
per author in EF-Group, as evidenced by the widening gap between EF-Individual and
EF-Group as the number of authors grew to 50 and 100.
Figure 2.6: Enron Data Set Feature Set Sizes and SVM Technique Performances
Consequently, the ensemble SVM technique significantly outperformed SVM/EF for
experiments involving a larger number of authors (50 and 100). This is shown in Table
2.8 which presents the results for the Enron data set. We can see that when using only 25
authors, the EF-Individual feature set marginally outperformed EF-Group as illustrated
by the slightly better performance of the ensemble over SVM/EF. However, when the
number of authors increased to 100, the widening gap in terms of number of features in
each feature set caused the ensemble technique to significantly outperform SVM/EF.
Table 2.8: Performance Comparison of Ensemble SVM and SVM on Enron Data Set

# Authors | Ensemble | SVM/EF | Difference
25        | 88.0%    | 87.2%  | 0.8%
50        | 88.2%    | 86.6%  | 1.6%*
100       | 76.7%    | 69.7%  | 7.0%**

* P-values significant at alpha = 0.05
** P-values significant at alpha = 0.01
Techniques
The Writeprint technique significantly outperformed SVM (Ensemble, SVM/EF, and
Baseline) on the email and chat data sets. For most data sets, the Writeprint technique
also had a smaller drop off in accuracy as the number of authors increased. This is shown
in Figure 2.7 which presents the performance accuracies for each technique across data
sets and authors.
Figure 2.7: Performance for Identification Techniques across Data Sets
The Writeprint technique appears to be more scalable as the number of authors
increases, based on the fact that the slope of its accuracy line typically remains consistent.
In contrast, other techniques’ accuracies decrease more sharply as the number of authors
goes from 25 to 50 or 100. We believe this is attributable to the disruptors effectively
differentiating authorship across larger numbers of identities in the Writeprint technique.
For the programming data set, however, the disruptors were less effective due to the
differences in program code as opposed to other forms of text. These differences are
expounded upon in the similarity detection experimental results discussion.
2.6.3 Experiment 2: Similarity Detection Task
2.6.3.1 Experimental Setup
For the similarity detection task, each author’s text was split into two anonymous
identities. All techniques were run using 10-fold cross validation in the same manner as
the previous experiment. A trial and error method was used to find a single optimal
similarity threshold for matching. The same threshold was used for all techniques. All
identity-identity pair scores above the pre-defined threshold were considered a match.
Trial and error methods for finding optimal thresholds are common for stylometric
similarity detection tasks (e.g., Peng et al., 2002). The average F-measure across all 10
folds was used to evaluate performance, where the F-measure for each fold was
computed as follows:
F-Measure = 2 (Precision)(Recall) / (Precision + Recall)
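A minimal Python sketch of the threshold-based matching and the resulting precision, recall, and F-measure; the pair scores, identity names, and threshold in the example are hypothetical.

def similarity_detection_metrics(pair_scores, true_matches, threshold):
    # Every identity-identity pair whose similarity score exceeds the threshold
    # is treated as a predicted match; compute precision, recall, and F-measure.
    predicted = {pair for pair, score in pair_scores.items() if score > threshold}
    actual = set(true_matches)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f_measure

# Hypothetical scores for four identity pairs; (a1, a2) and (b1, b2) share an author.
scores = {("a1", "a2"): 0.91, ("b1", "b2"): 0.72, ("a1", "b2"): 0.40, ("a2", "b1"): 0.35}
print(similarity_detection_metrics(scores, [("a1", "a2"), ("b1", "b2")], threshold=0.70))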
Similar to the Identification experiment (Table 2.5), four combinations of feature sets,
feature types, and techniques were used (shown in Table 2.9 below). A baseline was
included which featured the use of PCA with the baseline feature set (BF). The baseline
was intended to be compared against the use of PCA with the group level extended
feature set (PCA/EF) in order to assess the effectiveness of a more holistic feature set for
online similarity detection (4th row vs. 5th row in Table 2.9). We also wanted to evaluate
the effectiveness of individual author level feature sets by comparing Karhunen-Loeve
transforms (which use EF-Individual) against the PCA/EF method which uses a single
group level feature set (3rd row vs. 4th row in Table 2.9). Finally, the Writeprint technique
was included with the extended feature set (EF-Individual) in order to evaluate the
effectiveness of this technique in comparison with the standard Karhunen-Loeve (K-L)
transforms and PCA/EF (2nd row vs. 3rd and 4th row in Table 2.9). Since the Writeprint
technique also utilizes the K-L transform, comparing Writeprints against K-L provided a
good method for evaluating the effectiveness of the sliding window and pattern
disruption algorithms.
Table 2.9: Techniques/Feature Sets for Similarity Detection Experiment

Label      | Technique      | Feature Set Type | Feature Set
Writeprint | Writeprint     | Individual       | EF
K-L        | K-L Transforms | Individual       | EF
PCA/EF     | PCA            | Group            | EF
Baseline   | PCA            | Group            | BF
2.6.3.2 Hypotheses
H1a (Feature Sets):
The use of a more holistic feature set with a larger number of features and categories
(EF) will outperform the baseline feature set (BF). Thus, PCA/EF will outperform the
Baseline.
H1b (Feature Set Types):
The use of individual author level feature subsets (EF-Individual) will outperform the
use of a single author group-level feature set (EF-Group). Thus, K-L transforms will
outperform PCA/EF.
H1c (Techniques):
The Writeprint technique will outperform K-L Transforms and PCA.
2.6.3.3 Experimental Results
Table 2.10 shows the experimental results for all four combinations of features and
techniques across the four data sets. The Writeprint technique had the best performance
on all data sets, with F-measures over 85% for the Enron and EBay data sets when using
100 authors (200 identities). Furthermore, individual author level feature set techniques
(Writeprint and K-L transforms) had higher accuracy on all data sets than author group-level feature set methods (PCA/EF and Baseline). This gap in performance appeared to
widen as the number of authors increased (e.g., looking at K-L versus PCA/EF),
suggesting that the individual author level feature set (EF-Individual) is more scalable
than the author group level feature set (EF-Group). The extended feature set (EF) had
better overall performance than the benchmark feature set (BF) as reflected by the fact
that all techniques using EF had higher accuracy than the Baseline across all data sets.
Table 2.10: Experimental Results (F-measure) for Similarity Detection Task
Test Bed           Techniques/Features    25        50        100   (# Authors)
Enron Email        Writeprint             93.62     94.29     85.56
                   K-L                    75.29     68.23     65.44
                   PCA/EF                 70.32     56.33     50.82
                   Baseline               64.32     48.49     34.33
EBay Comments      Writeprint            100.00     97.96     94.59
                   K-L                    92.25     84.10     80.93
                   PCA/EF                 81.19     77.32     72.25
                   Baseline               75.65     70.02     60.19
Java Forum         Writeprint             90.13     85.02     76.87
                   K-L                    77.76     67.63     60.27
                   PCA/EF                 76.21     66.65     56.10
                   Baseline               72.90     60.59     42.45
CyberWatch Chat    Writeprint             68.43     62.88     49.91
                   K-L                    50.72     42.39     30.77
                   PCA/EF                 40.0      33.3      19.8
                   Baseline               39.43     28.62     20.10
2.6.3.4 Hypotheses Results
Table 2.11 shows the p-values for the pair-wise t-tests conducted on the classification
accuracies in order to measure the statistical significance of the results. Bolded values
indicate statistically significant outcomes in line with our hypotheses.
Table 2.11: P-values for Pair Wise t-tests on F-Measure
Test Bed           Techniques/Features        25          50          100   (# Authors)
Enron Email        Writeprint vs. K-L         <0.001**    <0.001**    0.001**
                   Writeprint vs. PCA/EF      <0.001**    <0.001**    <0.001**
                   Writeprint vs. Baseline    <0.001**    <0.001**    <0.001**
                   K-L vs. PCA/EF             <0.001**    <0.001**    <0.001**
                   K-L vs. Baseline           <0.001**    <0.001**    <0.001**
                   PCA/EF vs. Baseline        <0.001**    <0.001**    <0.001**
EBay Comments      Writeprint vs. K-L         <0.001**    <0.001**    0.001**
                   Writeprint vs. PCA/EF      <0.001**    <0.001**    <0.001**
                   Writeprint vs. Baseline    <0.001**    <0.001**    <0.001**
                   K-L vs. PCA/EF             <0.001**    <0.001**    <0.001**
                   K-L vs. Baseline           <0.001**    <0.001**    <0.001**
                   PCA/EF vs. Baseline        <0.001**    <0.001**    <0.001**
Java Forum         Writeprint vs. K-L         <0.001**    <0.001**    0.001**
                   Writeprint vs. PCA/EF      <0.001**    <0.001**    <0.001**
                   Writeprint vs. Baseline    <0.001**    <0.001**    <0.001**
                   K-L vs. PCA/EF             0.094       0.087       <0.001**
                   K-L vs. Baseline           <0.001**    <0.001**    <0.001**
                   PCA/EF vs. Baseline        <0.001**    <0.001**    <0.001**
CyberWatch Chat    Writeprint vs. K-L         <0.001**    <0.001**    0.001**
                   Writeprint vs. PCA/EF      <0.001**    <0.001**    <0.001**
                   Writeprint vs. Baseline    <0.001**    <0.001**    <0.001**
                   K-L vs. PCA/EF             <0.001**    <0.001**    <0.001**
                   K-L vs. Baseline           <0.001**    <0.001**    <0.001**
                   PCA/EF vs. Baseline        <0.001**    <0.001**    <0.001**
* P-values significant at alpha = 0.05
** P-values significant at alpha = 0.01
H1a: Feature Sets
The extended feature set (EF) outperformed the baseline feature set (BF) across all
data sets (p <0.01) based on the better performance of PCA/EF as compared to Baseline.
H1b: Feature Set Types
Individual author level feature subsets (EF-Individual) significantly outperformed the
group level feature set (EF-Group) on most data sets (p<0.01). This is based on the better
performance of the K-L transforms technique as compared to PCA/EF.
H1c: Techniques
The Writeprint technique significantly outperformed K-L Transforms and PCA/EF on
all data sets (p<0.01).
2.6.3.5 Results Discussion
Overall performance for all techniques was best on the asynchronous CMC data sets:
Enron email and EBay comments. Once again, performance was somewhat lower on
the Java Forum and considerably lower on the CyberWatch chat data set. For the Java
Forum, we suspect that the feature sets EF and BF are not as effective at capturing
programming style. For instance, many of the Writeprint pattern disruptors for the
programming data set were variable names and program methods. While such disruptors
were assigned low values (based on synonymy), they still had a noticeable negative
impact on performance. As we previously alluded to, program style analysis requires the
use of features specifically geared towards code (Krsul and Spafford, 1997). In many
cases, these features are not only tailored towards code, but rather for code in a specific
programming language (Berry and Meekings, 1985). Future analysis of programming
style should continue to incorporate more program specific features such as those used by
Oman and Cook (1989) and Krsul and Spafford (1997).
In the case of the CyberWatch Chat data set, the amount of text for each author was
insufficient to effectively discriminate authorship. More important than the amount of
words per author was the fact that we only had a single conversation for each author. It is
unlikely that a single conversation would reveal a sufficient portion of an author’s
spectrum of stylistic variation for effective categorization. Further work is needed on
stylometric analysis of chat room data, including investigating chat room specific features
and techniques on larger data sets.
2.7 Conclusions
In this essay we applied stylometric analysis to online texts. Our research
contributions are manifold. We developed the K-L transform-based Writeprint technique
which can be used for identity level identification and similarity detection. A novel
pattern disruption mechanism was introduced to help detect authorship dissimilarity. We
also incorporated a significantly more comprehensive feature set for online stylometric
analysis and demonstrated the effectiveness of individual author level feature subsets.
Our proposed feature set and technique were applied across multiple domains, including
asynchronous CMC, synchronous CMC, and program code. The results compared
favorably against existing benchmark methods and other individual author level
techniques. Specifically, the Writeprint technique significantly outperformed other
identification methods across domains such as email messages and chat room postings.
For similarity detection, Writeprints significantly outperformed comparison techniques
across all data sets. The extended feature set utilized demonstrated the effectiveness of
using richer stylistic representations for improved performance and scalability.
Furthermore, the use of individual author level feature sets seems promising for
application to cyberspace, where the number of authors can quickly become very large;
making a single feature set less effective.
In the future we will work on further improving the scalability of the proposed
approach to larger numbers of authors in a computationally efficient manner. We also
plan to evaluate temporally dynamic individual author-level feature sets that can
gradually change over time as an author’s writing style evolves. Another important
direction is to assess the impact of intentional stylistic alteration on stylometric
categorization performance.
CHAPTER 3: STYLOMETRIC IDENTIFICATION IN ELECTRONIC MARKETS:
SCALABILITY AND ROBUSTNESS
3.1 Introduction
In the previous chapter we performed stylometric identification and similarity
detection on online texts. Two crucial concerns regarding the success of stylometry as an
online authentication mechanism are its scalability in terms of number of identities and
robustness against intentional stylistic alteration. In this chapter we assess the scalability
and robustness of stylometric similarity detection methods on eBay user feedback
comments.
Electronic markets have seen unprecedented growth in recent years. Online auction
marketplaces such as eBay are one type of electronic market that has become especially
popular. However, the lack of physical contact and prior interaction makes such places
more susceptible to opportunistic member behavior (Pavlou & Gefen, 2004). While
reputation systems attempt to alleviate some of the troubles with electronic markets, these
systems themselves suffer from two problems: easy identity changes and reputation
manipulation. Easy identity changes stem from the fact that online traders can create new
identities, thereby refreshing their reputation (Dellarocas, 2003). Reputation manipulation
allows online market traders to inflate their own reputations using multiple identities or to
sabotage competitors’ reputation scores. Consequently, fraud and deception are highly
prevalent in electronic markets; particularly in online auctions, which account for 50% of
internet fraud (Chua & Wareham, 2004).
The aforementioned problems stem from online anonymity. However individuals
leave behind textual traces of their identity in the feedback comments posted to other
traders. Stylometric similarity detection techniques applied to reputation system feedback
comments can help minimize problems stemming from anonymity abuses in reputation
systems. These techniques attempt to assess the degree of similarity between individuals
based on writing style. Since text traces are often the only identity cues left behind in
cyberspace, researchers have begun to use online stylometric analysis techniques as a
forensic tool. They have recently been applied to email, web forums, and program code
(Gray et al., 1997; De Vel et al., 2001; Zheng et al., 2006) as well as group support
system comments (Hayne & Rice, 1997; Hayne et al., 2003).
Despite significant progress, online stylometry has several current limitations. Most
previous work focused on the identification task (where potential authorship identities are
known in advance). There has been limited evaluation of similarity detection techniques
where no identities are known a priori and identities are instead clustered based on their similarity scores.
Similarity detection is more practical for cyberspace applications, such as reputation
systems. Furthermore there has been a lack of evaluation of the scalability of stylometric
analysis in terms of number of authors and identities per author for reputation systems.
Additionally, there has been a lack of assessment of robustness against intentional
stylistic alteration and message copycatting or forging. In this essay we propose a system
that can provide stylometric analysis scalability and robustness for identifying traders in
online reputation systems based on their feedback comments posted for others. The
proposed system is highly accurate at differentiating across hundreds of identities based
on stylistic tendencies inherent in feedback comments, and is also fairly robust against
intentional stylistic alteration. The system uses an extended feature set consisting of
several static and dynamic feature categories and also includes the Writeprint technique
which assesses the degree of stylistic similarity and dissimilarity between authors.
Writeprints uses Karhunen-Loeve transforms to assess the degree of similarity between
traders and a pattern disruption mechanism to determine stylistic dissimilarity. The
system can be used for similarity detection in reputation systems to alleviate the identity
change and rank manipulation problems.
3.2 Related Work
3.2.1 Reputation Systems/Online Feedback Mechanisms
Reputation Systems are online feedback mechanisms where users rate other members
and provide textual comments describing the quality of service (i.e., transaction
experience). Such systems are intended to provide “soft security” for electronic markets
and online auctions (Rasmussen and Janson, 1996). In contrast to “hard security” systems
(e.g., access control/authentication), these systems are designed to offer social control
mechanisms. They are meant to allow social translucence for improved accountability
(Erickson and Kellogg, 2000). Online markets rely on such information provided via
reputation systems in order to promote trust (Bolton et al., 2004). While recommender
systems are designed to support collaborative filtering, reputation systems are intended to
support “collaborative sanctioning” (Mui et al., 2001). As Josang et al. (2006, p. 10)
pointed out “…the purpose is to sanction poor service providers, with the aim of giving
an incentive for them to provide quality services.”
The perceived effectiveness of online feedback mechanisms plays a critical role in the
amount of member trust in the community (Pavlou and Gefen, 2004). Reputation scores
are often synonymously referred to as “trust scores.” An important class of trust is
“identity trust” which describes the belief that an identity is who they claim to be
(Grandison and Sloman, 2000). Trustworthiness is an important factor affecting online
market outcomes (Brynjolfsson and Smith, 2000). Identity trust is especially crucial to
the success of reputation systems. However, the anonymous nature of the Internet makes
“identity trust” difficult to ensure in online settings. This has resulted in two critical
problems pertaining to reputation systems (Dellarocas, 2003; Josang et al., 2006): identity
changes and reputation manipulation.
3.2.1.1 Identity Changes
Easy identity changes allow con artists and fraudulent buyers and sellers to thrive in
electronic markets by constantly reappearing under different aliases. As Josang et al.
(2006) noted, identity changes allow parties to “cut with the past and start from fresh.”
Community members can build up a reputation, use it to deceive unsuspecting members,
and start over under a new identity (Friedman and Resnick, 2001; Dellarocas, 2003).
Friedman and Resnick (2001) refer to this identity change characteristic as “cheap
pseudonyms.” Cheap pseudonyms stemming from easy identity changes allow online
auction traders to circumvent the collaborative sanctioning mechanisms critical to the
success of reputation systems.
3.2.1.2 Reputation Manipulation
Reputation scores in electronic markets are important because they influence product
prices and traders’ perceived credibility. There has been a plethora of work done to
evaluate the correlation between reputation scores and product prices. Often enhanced
seller reputation scores result in premium sales prices (Lee et al., 2000; Ba and Pavlou,
2002). Resnick et al. (2006) observed an 8.1% increase in buyers' willingness to pay
when transacting with an established, reputable identity as compared with a new
identity. Thorough reviews of literature evaluating the impact of reputation scores on
selling price can be found in Dellarocas (2003) and Resnick et al. (2006). Enhanced
reputation also increases the willingness of other members to engage in transactions,
which may be partially responsible for the enhanced selling prices. This is particularly
important for fraudulent members attempting to “bait” unsuspecting members based on
the fraudulent traders’ false credibility.
Reputation manipulation can take two forms, rank inflation and discrimination. A
common form of rank inflation involves using additional (fake) identities to inflate one's
reputation (Dellarocas, 2003). This is also referred to as ballot box stuffing (Josang et al.,
2006). Colluding with other members to create deceitful groups can further amplify
the impact of such score inflation or stuffing (Resnick et al., 2000). Discrimination entails
blackmailing or threatening to post negative feedback about fellow traders (Resnick et al.,
2000). Posting dishonest comments to tarnish a competitor’s reputation is a common ploy
in online markets (Dellarocas, 2003).
3.2.1.3 Reputation Systems and Stylometry
Rank manipulation and easy identity changes have facilitated numerous forms of
fraud in electronic markets (Chua and Wareham, 2004), including failure to ship, failure
to pay, fencing, shell auctions, etc. Consequently many researchers have stated the need
for techniques to mitigate the impact of identity change and rank manipulation
(Dellarocas, 2003; Josang et al., 2006), both of which stem from online anonymity. Some
have proposed using social network analysis for anti-aliasing; however, these techniques
have had limited success on real world data, with accuracies around 2% for matching
email aliases (Holzer et al., 2005). Reputation rank systems entail users/traders posting
text comments. The traders often leave behind potential textual traces of their identity (Li
et al., 2006). Keselj et al. (2002) refer to an author’s unique writing style tendencies as an
“author profile.” Ding et al., (2003) described such identifiers as “text fingerprints” that
can discriminate authorship. Juola and Baayen (2005) called them “stylistic fingerprints.”
Stylometric/authorship identification techniques that can discriminate authorship in
cyberspace could help alleviate the anonymity related problems pervasive in electronic
markets. Comparing trader feedback comments could help detect identity changes. Such
methods may also help detect reputation score manipulation attributable to fake identities.
Furthermore comparing known fraudulent identities’ comments against active members
could help prevent further scamming. Many web sites have begun to post archives and
databases containing names, aliases, and text from fraudulent buyers and sellers (Chua &
Wareham, 2004). For example, some documented fraudulent online auction individuals
listed on www.traderlist.com have as many as 30-40 known fake identities. Clustering
such “cheap pseudonyms” based on writing style tendencies could dramatically reduce
the effectiveness of recurring deceptive behavior attributable to reappearing under
different aliases.
3.2.2 Stylometric Analysis
Stylometry (also referred to as authorship analysis) is defined as the “statistical
analysis of writing style.” Four important characteristics of stylometric analysis (Zheng et
al., 2006) are the tasks, stylistic features, classification techniques, and parameters (i.e.,
factors influencing authorship analysis performance, such as number of classes, amount
of text, noise). Each of these four characteristics is described below.
3.2.2.1 Stylometric Analysis Tasks
Two major stylometric analysis tasks are identification and similarity detection (Gray
et al., 1997; De Vel et al., 2001). Identification entails comparing anonymous texts
against those belonging to identified entities, where the anonymous text is known to be
written by one of those entities. However, this “known class” assumption is not practical
(Juola & Baayen, 2005), especially for online settings. In cyberspace, author classes are
rarely known in advance, and hence require the use of unsupervised clustering based
approaches. Such a similarity detection task requires the comparison of anonymous texts
against other anonymous texts in order to assess the degree of similarity. For instance, in
online forums, where there are numerous anonymous identities (i.e., screen names,
handles, email addresses) one can only use unsupervised stylometric analysis techniques
since no class definitions are available. Similarly, in an online auction setting, one
hundred trader identities could represent anywhere between one and one hundred actual
traders.
3.2.2.2 Stylometric Analysis Features
Stylistic features are the attributes or writing style markers that are the most effective
discriminators of authorship. The vast array of stylistic features includes lexical,
syntactic, structural, content-specific, and idiosyncratic style markers.
Lexical features are word or character-based statistical measures of lexical variation.
These include style markers such as sentence/line length (Argamon et al., 2003),
vocabulary richness (De Vel et al., 2001) and word length distributions (De Vel et al.,
2001; Zheng et al., 2006). Syntactic features include function words (Abbasi and Chen,
2005; Li et al., 2006), punctuation, and part-of-speech tag n-grams (Baayen et al. 1996,
Koppel et al., 2003). Structural features, which are especially useful for online text,
include attributes relating to text organization and layout (De Vel et al., 2001; Zheng et
al., 2006). Content-specific features are important keywords and phrases pertaining to
certain topics. For example, content specific features on a discussion of computers may
include “laptop” and “notebook.” Idiosyncratic features include misspellings,
grammatical mistakes, and other usage anomalies. Such features are extracted using
spelling and grammar checking tools (Chaski, 2001; Koppel and Schler, 2003).
Over 1,000 different features have been used in previous authorship analysis research
with no consensus on a best set of style markers (Rudman, 1997). However, this could be
attributable to certain feature categories being more effective at capturing style variations
in different contexts. This necessitates the use of larger feature sets comprised of several
categories of features spanning various feature groups (i.e., lexical, syntactic, etc.). For
instance, the use of feature sets containing lexical, syntactic, structural, and content-specific
features has been shown to be more effective for online identification than feature sets
containing only a subset of these feature groups (Abbasi and Chen, 2005; Zheng et al.,
2006).
3.2.2.3 Stylometric Analysis Techniques
Several techniques have been used for stylometric identification. These can broadly
be classified as supervised and unsupervised methods. However, only unsupervised
techniques are suitable for online settings, such as reputation system feedback comments,
since class definitions are unknown a priori. We discuss previous unsupervised methods
useful for online similarity detection. These techniques include principal component
analysis (PCA), N-Gram Models, Markov Models, and Cross Entropy. Previous
stylometric analysis studies using these techniques are summarized in Table 3.1 below.
Table 3.1: Previous Unsupervised Stylometric Analysis Techniques
Technique         Study                      Features                                               Test Bed
PCA               Kjell et al., 1994         Letter bigrams (10 total)                              Federalist papers (2 authors)
PCA               Baayen et al., 1996        Function words (50 total)                              Literary texts (2 authors)
PCA               Abbasi & Chen, 2006        Punctuation, word length distributions, topical        Pirated software web forum (10 authors)
                                             words, special characters, letters, etc. (104 total)
N-Gram Models     Keselj et al., 2003        Character n-grams (5,000 per author)                   Literary texts (8 authors)
N-Gram Models     Peng et al., 2003          Character n-grams (5,000 per author)                   Literary texts (8 authors)
Markov Models     Khmelev, 2000              Character bigrams (729 total)                          Literary texts (82 authors)
Markov Models     Khmelev & Tweedie, 2001    Character bigrams (729 total)                          Project Gutenberg (45 authors), Literary texts
                                                                                                    (2 authors), Federalist papers (2 authors)
Cross Entropy     Juola, 1997                Match lengths                                          Federalist papers (2 authors)
Cross Entropy     Juola & Baayen, 2005       Match lengths                                          Student essays (8 authors)
K-L Similarity    Novak et al., 2004         Word unigrams (quantity not available)                 Board postings from www.courttv.com (100 authors)
Principal Component Analysis (PCA). PCA is a popular stylometric identification
technique that has been used in numerous previous studies (Burrows, 1987; Kjell et al.,
1994; Baayen et al., 1996; Abbasi and Chen, 2006). PCA’s ability to capture essential
variance across large amounts of features in a reduced dimensionality makes it attractive
for text analysis problems, which typically involve large feature sets. The essence of PCA
can be described as follows: given a feature matrix with each column representing a
feature and instance vector rows for the various authors’ texts, project the matrix into a
lower dimensional space by plotting principal component scores (which are the product
of the component weights and instance feature vectors). The similarity between authors
can be compared based on visual proximity of patterns (Kjell et al., 1994) or computation
of average distance (Abbasi and Chen, 2006). Given a set of n text instance vectors and p
eigenvectors, the average distance can be used to compute authorship dissimilarity as
follows:
\text{Dissimilarity}(a, b) = \frac{\sum_{i=1}^{n} \sum_{k=1}^{p} \left| a_{ki} - b_{ki} \right|}{np}

where a_{ki} and b_{ki} are the coefficients of the kth component of usage instance i for authors a and b.
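As an illustration, the following hedged sketch computes this average-distance dissimilarity using scikit-learn's PCA in place of the original implementation; the per-author feature matrices X_a and X_b are assumed to be extracted elsewhere.

```python
# A hedged sketch of the PCA average-distance dissimilarity defined above.
# X_a and X_b are (n x f) feature matrices holding n text-instance vectors per author.
import numpy as np
from sklearn.decomposition import PCA

def pca_dissimilarity(X_a, X_b, n_components=5):
    pca = PCA(n_components=n_components)
    pca.fit(np.vstack([X_a, X_b]))        # shared component weights for both authors
    A = pca.transform(X_a)                # (n x p) component scores for author a
    B = pca.transform(X_b)                # (n x p) component scores for author b
    n, p = A.shape
    return np.abs(A - B).sum() / (n * p)  # average coordinate difference
```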
N-Gram Models. Proposed by Keselj et al. (2003) and Peng et al. (2003), this
technique requires the construction of a profile for each author, where a profile is the set
of the n most frequently used character n-grams. Keselj et al. (2003) used between 20 and 5,000 as the value for n, with the best accuracy attained using 5,000 n-grams. They
attained the best results using 4-8 character n-grams. Using this approach, they computed
the dissimilarity between two authors as:
\text{Dissimilarity}(profile_1, profile_2) = \sum_{x \in profile_1 \cup profile_2} \left( \frac{2\left( f_1(x) - f_2(x) \right)}{f_1(x) + f_2(x)} \right)^2

where f_1(x) and f_2(x) are the frequencies of an n-gram x contained in profile_1 or profile_2.
Keselj et al. (2003) and Peng et al. (2003) were able to attain good performance using this
approach on test beds consisting of up to 8 authors.
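The following minimal sketch illustrates the profile construction and dissimilarity computation; the n-gram length and profile size are illustrative defaults rather than the exact settings of those studies.

```python
# A minimal sketch of the character n-gram profile dissimilarity above. Profiles hold
# the most frequent n-grams with relative frequencies; n-grams absent from a profile
# are treated as frequency 0, so every term's denominator remains positive.
from collections import Counter

def ngram_profile(text, n=4, size=5000):
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values()) or 1
    return {g: c / total for g, c in counts.most_common(size)}

def profile_dissimilarity(profile1, profile2):
    score = 0.0
    for x in set(profile1) | set(profile2):
        f1, f2 = profile1.get(x, 0.0), profile2.get(x, 0.0)
        score += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return score
```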
Markov Models. Proposed by Khmelev (2000) and later extended by Khmelev and
Tweedie (2001), this technique requires the creation of a Markov model for each author,
using bi-grams of letters and the space character. Khmelev (2000) removed all other
characters and ignored words beginning with a capitalized letter, resulting in a fixed (27
x 27 = 729) feature space for each author. Using this approach, the similarity between
two authors can be computed as follows:
\text{Similarity}(a, b) = \sum_{i} \sum_{j} \ln \frac{f_{ij}(a)\, f_i(b)}{f_i(a)\, f_{ij}(b)}

where f_{ij}(a) and f_{ij}(b) are the number of transitions from letter i to letter j in author a's and author b's texts, respectively.
The technique has performed well on larger test beds of 45 and 82 authors (Khmelev,
2000; Khmelev & Tweedie, 2001). However these data sets consisted of literary texts
which tend to be longer and more stylistically consistent due to contextual independence.
Cross Entropy. Proposed by Juola (1997; 2003) and later applied in Juola and Baayen
(2005), this technique is based on the concept of match length where:
The match length L_n(x) of a sequence x_1, x_2, ..., x_k is the length of the longest prefix of the sequence x_{n+1}, x_{n+2}, ..., x_k that matches a contiguous substring of x_1, x_2, ..., x_n.

The substring x_1, x_2, ..., x_n is referred to as the database. For cross entropy, we simply compute the average match length for author B's text b_1, b_2, ..., b_k compared against author A's database a_1, a_2, ..., a_n, and for author A's text a_1, a_2, ..., a_j compared against author B's database b_1, b_2, ..., b_n, as follows:

\text{Similarity}(a, b) = \frac{\sum_{i=1}^{j} L_n(a_i, b)}{j} + \frac{\sum_{i=1}^{k} L_n(b_i, a)}{k}

where L_n(a_i, b) is the match length for author A's substring a_i, a_{i+1}, ..., a_j compared against author B's database.
Texts written by the same author should result in higher match lengths. Juola (1997)
used n=2,000 characters for each author’s database size. The cross entropy method has
performed well in prior studies, outperforming PCA on a test bed consisting of 8
students’ essays (Juola & Baayen, 2005).
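The following hedged sketch implements the match-length computation and the two-directional average; the 2,000-character database size follows Juola (1997), while the remaining details (character-level matching, handling of short texts) are simplifying assumptions.

```python
# A hedged sketch of the match-length (cross entropy) similarity described above.
# match_length() returns the longest prefix of `seq` that occurs as a contiguous
# substring of `database`; the similarity averages match lengths in both directions.
def match_length(database, seq):
    length = 0
    while length < len(seq) and seq[:length + 1] in database:
        length += 1
    return length

def cross_entropy_similarity(text_a, text_b, db_size=2000):
    db_a, rest_a = text_a[:db_size], text_a[db_size:]
    db_b, rest_b = text_b[:db_size], text_b[db_size:]
    avg_a_on_b = sum(match_length(db_b, rest_a[i:]) for i in range(len(rest_a))) / max(len(rest_a), 1)
    avg_b_on_a = sum(match_length(db_a, rest_b[i:]) for i in range(len(rest_b))) / max(len(rest_b), 1)
    return avg_a_on_b + avg_b_on_a   # higher values indicate greater stylistic similarity
```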
K-L Similarity. Novak et al. (2004) used the Kullback-Leibler divergence as follows:
\text{Similarity}(a, b) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}

where p and q are the feature distributions for the two authors a and b.
Novak et al. (2004) performed smoothing to ensure non-zero elements in the distributions and
applied the approach to message board postings on www.courttv.com. They compared
various features, and attained the best performance using word unigrams. Their study is
one of the few prior similarity detection studies applied to computer mediated
communication. Kullback-Leibler similarity using word unigrams performed well,
however they acknowledged that their approach was susceptible to topical variation
(Novak et al., 2004), possibly stemming from the use of a feature set comprised only of
word unigram features. While topical variation is less of a concern for online feedback
comments, the sensitivity of such an approach may make it susceptible to intentional
obfuscation.
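For illustration, the following sketch computes the word-unigram Kullback-Leibler measure with simple add-one smoothing; the exact smoothing scheme of Novak et al. (2004) is not detailed here, so that choice is an assumption.

```python
# A hedged sketch of the Kullback-Leibler word-unigram measure shown above, using
# Laplace (add-one) smoothing as a stand-in for the original smoothing procedure.
import math
from collections import Counter

def kl_similarity(text_a, text_b):
    words_a, words_b = text_a.lower().split(), text_b.lower().split()
    vocab = set(words_a) | set(words_b)
    ca, cb = Counter(words_a), Counter(words_b)
    na, nb = len(words_a) + len(vocab), len(words_b) + len(vocab)
    score = 0.0
    for w in vocab:
        p = (ca[w] + 1) / na   # smoothed unigram distribution for author a
        q = (cb[w] + 1) / nb   # smoothed unigram distribution for author b
        score += p * math.log(p / q)
    return score               # lower divergence implies more similar authors
```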
Most techniques, such as N-Gram and Markov models were designed to be used with
character n-grams. Word based features are too sparse to be used accurately with these
techniques (Peng et al., 2003). Similarly, Novak et al. (2004) attained better performance
using the Kullback-Leibler similarity on word unigrams as compared to other features,
such as punctuation, function words, misspellings, and a combined feature set. It is
unclear if such methods can be effectively applied to online settings, where techniques
capable of handling larger feature sets are typically required (Abbasi & Chen, 2005;
Zheng et al., 2006). Therefore assessing the efficacy of these approaches (i.e., the
combination of features and techniques employed by these prior studies) for online
analysis is especially important in order to gauge their applicability for stylometric
similarity detection of reputation system feedback comments.
3.2.2.4 Stylometric Analysis Parameters
Two important stylometric analysis parameters for online authentication are
scalability and robustness (Zheng et al., 2006). Scalability refers to the impact of the
number of author classes on classification performance. Typically, there has been a
noticeable drop in performance for prior online message level identification research as
the number of authors increased. Zheng et al. (2006) noted a 14% drop in accuracy when
increasing the number of authors from 5 to 20. Argamon et al. (2003) observed as much
as a 23% drop in accuracy over a similar number of authors. Given the large number of
traders in online markets, it is important to assess the impact of the number of traders and
identities per trader on stylometric performance.
It is also important to assess robustness of stylometric approaches against intentional
stylistic alteration and copycatting/message forging. Fraudulent traders may attempt to
avoid detection by altering their style or copying other traders’ style (referred to as
copycatting or forging). Previous research on intentional authorship obfuscation suggests
that such alteration can impact stylometric classification performance (Rao and Rohatgi,
2000). For instance, word substitution, a popular and convenient form of alteration, has
been shown to impact identification accuracy (Kacmarik and Gamon, 2006). Rao and
Rohatgi noted that word substitution via the use of thesaurus tools (altering words with
synonyms) could represent a promising stylistic obfuscation mechanism since it would
decrease the presence of stylistic elements attributable to an author’s vocabulary.
Forging/copycatting entails intentionally mimicking other community members’ styles or
usernames (Abbasi and Chen, 2006). This behavior is fairly common in certain CMC
modes, such as USENET forums. Mimicking other members’ styles by either directly
copying their text or attempting to copy their stylistic tendencies is an important and
plausible form of deception that must be considered when evaluating stylometric methods
in online settings.
3.3 Research Gaps, Questions, and Design
Based on the review of related research on reputation systems and stylometric
analysis, we present several research gaps and questions.
3.3.1 Research Gaps
Stylometric Similarity Detection of Feedback Comments. We are not aware of any
prior application of stylometric similarity detection techniques to online feedback
comments. Most previous stylometric work either focused on the online identification
task (known classes) or was applied to literary texts. Successful application to reputation
system feedback comments could reduce fraud and deception in reputation systems, and
consequently, online markets.
Techniques that can Handle Richer Feature Sets. There is a need for techniques that
can handle richer feature sets for online settings. Existing techniques were designed to
use a single category of features (e.g., character n-grams, match length). However
application to online settings necessitates the use of techniques that can incorporate rich
feature sets (Abbasi & Chen, 2005; Zheng et al., 2006).
Analysis of Scalability. There has been limited work done to analyze the scalability of
stylometric techniques for application to reputation system feedback comments. Given
the large number of identities in online markets, there is a need to apply stylometry
effectively in a scalable manner.
Analysis of Robustness against Intentional Alteration and Forging. We are unaware
of any previous research to assess the robustness of stylometric features and techniques
against intentional stylistic alteration or forging. Unlike biometrics, writing style may be
susceptible to intentional manipulation via stylistic alteration or message forging. It is
important to assess the robustness of stylometric similarity detection techniques against
such obfuscation of authorship.
3.3.2 Research Questions
Based on the gaps described, we propose the following research questions:
1) Which stylometric technique is most effective for similarity detection of
online market feedback comments?
2) How scalable are these techniques in terms of number of traders and
identities per trader?
3) How robust are these techniques against intentional stylistic obfuscation?
4) Can techniques using richer feature sets provide improved scalability and
robustness?
3.3.3 Research Design
We propose the development of a stylometric similarity detection system capable of
differentiating between online traders based on stylistic tendencies inherent in feedback
comments left for other buyers and sellers. Our system uses an extended feature set
comprised of lexical, syntactic, structural, content specific and idiosyncratic style
markers. The system also includes a novel Writeprint technique, which compares the
style patterns between two identities. Writeprint uses Karhunen-Loeve transforms to
assess the similarity for features used by the two identities as well as a pattern disruption
mechanism that assesses the degree of dissimilarity for features used by one identity but
not the other.
We intend to compare our system, which includes the Writeprint technique and an
extended feature set, against existing similarity detection approaches described in section
3.2.2.3, including PCA, N-Gram Models, Markov Models, Cross Entropy, and Kullback-Leibler similarity. The evaluation will assess the scalability and robustness of our system
and comparison approaches for application to online market feedback comments.
3.4 System Design
The proposed system has two major components; feature extraction and classifier
construction (as shown in Figure 3.1). The feature extraction phase derives various static
and dynamic features (e.g., n-grams) from the trader feedback comments. A subset of the
dynamic features are chosen using feature selection in order to create an extended feature
set which is passed forward to the classifier construction phase. This stage involves the
creation of Writeprints for each trader identity, which can then be compared against each
other to assess the degree of stylistic similarity. Details about the two system components
are provided below.
Figure 3.1: Stylometric Similarity Detection System Design
3.4.1 Feature Extraction
The extraction phase involves derivation of static and dynamic features resulting in
the creation of our extended feature set. For static features, extraction simply involves
generating the feature usage statistics (feature vectors) across texts, however dynamic
feature categories such as n-grams require indexing and feature selection. The feature
extraction procedure for the extended feature set is described below while Table 3.2
provides a description of the style markers included. For dynamic feature categories, the
number of attributes varies depending on indexing and feature selection. For some such
categories, the upper limit of features is already known (e.g., number of character
bigrams is less than 676).
Table 3.2: Extended Feature Set
Group          Category              Quantity    Description/Examples
Lexical        Word-Level            5           total words, % char. per word
               Character-Level       5           total char., % char. per message
               Character N-Grams     < 18,278    count of letter n-grams (e.g., a, at, ath)
               Digit N-Grams         < 1,110     count of digit n-grams (e.g., 1, 12, 123)
               Word Length Dist.     20          frequency distribution of 1-20 letter words
               Vocab Richness        8           richness (e.g., hapax legomena, Yule's K)
               Special Chars.        21          occurrences of special char. (e.g., @#$%^&*+=)
Syntactic      Function Words        300         frequency of function words (e.g., of, for, to)
               Punctuation           8           occurrence of punctuation marks (e.g., !;:,.?)
               POS Tag N-Grams       varies      part-of-speech tag n-grams (e.g., NNP, NNP JJ)
Structural     Message-Level         6           e.g., has greeting, has url, requoted content
               Paragraph-Level       8           e.g., paragraphs, sentences per paragraph
               Technical Structure   50          e.g., file extensions, fonts, use of images
Content        Word N-Grams          varies      bag-of-word n-grams (e.g., "seller", "bad sale")
Idiosyncratic  Misspelled Words      < 5,513     common misspellings (e.g., "beleive", "thougth")
Dynamic features incorporated in the extended feature set include several n-gram
feature groups and a list of 5,513 common word misspellings taken from various websites
including Wikipedia (www.wikipedia.org). N-gram categories utilized include character,
word, POS tag, and digit level n-grams. These categories require indexing with the
number of initially indexed features varying depending on the data set. The indexed
features are then sent forward to the feature selection phase. Use of such an indexing and
feature selection/filtering procedure for n-grams is quite necessary and common in
stylometric analysis research (e.g., Peng et al., 2003; Koppel and Schler, 2003).
Feature selection is applied to all the n-gram and misspelled word categories using
the information gain (IG) heuristic. IG has been used in many text categorization studies
as an efficient method for selecting text features (e.g., Koppel and Schler, 2003).
Specifically, it is computationally efficient compared to search-based techniques and
good for multi-class text problems (Yang and Pederson, 1997). The information gain for
feature j across a set of classes c is derived as IG(c,j) = H(c) – H(c|j) where H(c) is the
overall entropy across author classes and H(c|j) is the conditional entropy for feature j.
For each identity, IG is applied using a 2-class (one-against-all) set up (size of c = 2, c1 =
identity, c2 = rest). Thus, each trader identity’s feature set is intended to be comprised of
the set of dynamic features that can best discriminate that specific identity against all
others.
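The following minimal sketch illustrates the one-against-all information gain computation for a single candidate feature; binary presence/absence indicators and NumPy array inputs are simplifying assumptions.

```python
# A minimal sketch of IG(c, j) = H(c) - H(c|j) with a 2-class (one-against-all) setup.
# labels: NumPy int array with 1 for the target identity and 0 for all other identities.
# feature_present: NumPy boolean array marking texts in which feature j occurs.
import math
import numpy as np

def entropy(labels):
    probs = np.bincount(labels) / len(labels)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(feature_present, labels):
    h_c = entropy(labels)                                   # H(c)
    h_c_given_j = 0.0
    for value in (True, False):
        subset = labels[feature_present == value]
        if len(subset):
            h_c_given_j += (len(subset) / len(labels)) * entropy(subset)
    return h_c - h_c_given_j                                # IG(c, j)
```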
3.4.2 Classifier Construction: Writeprints
We propose a novel Writeprint technique that has two major components: creation
and comparison. The creation steps are concerned with the construction of Writeprint
patterns reflective of an identity's writing style variation, based on the occurrence of
common identity features as well as lack of occurrence of style markers prevalent in other
identities’ text. The comparison steps describe how created Writeprints for various trader
identities are compared against one another to assess the degree of stylistic similarity. The
two components are described below.
3.4.2.1 Writeprint Creation
The Writeprint creation component can be further decomposed into two steps. In the
first step, Karhunen-Loeve transforms are applied with a sliding window in order to
capture stylistic variation at a finer level of granularity. Unlike PCA, the Karhunen-Loeve transform can include class information, in this case for the different
identities/aliases (Webb, 2002). Writeprints are created for each identity using their key
features. The second step, pattern disruption, uses zero usage features as red flags
intended to decrease the level of stylistic similarity between identities when one identity
contains important features not occurring in the other. The two major steps, which are
repeated for each identity, are shown below:
Writeprint Creation Steps
1) For all identity features with occurrence frequency > 0:
a) Extract feature vectors for each sliding window instance.
b) Derive the basis matrix (set of eigenvectors) from the feature usage
covariance matrix using Karhunen-Loeve transforms.
c) Compute window instance coordinates (principal components) by
multiplying window feature vectors with the basis. Window instance
points in n-dimensional space represent the author's Writeprint pattern.
2) For all author features with occurrence frequency = 0:
a) Compute each feature's disruption value as the product of information gain,
synonymy usage, and disruption constant K.
b) Append the features' disruption values to the basis matrix.
3) Repeat steps 1-2 for each identity.
Step 1: Sliding Window and Karhunen-Loeve Transforms
A lower dimensional usage variation pattern is created based on the occurrence
frequency of the identity’s features (individual level feature set). For all features with
usage frequency greater than zero, a sliding window of length L with a jump interval of J
characters is run over the identity’s messages. The feature occurrence vector for each
window is projected to an n-dimensional space by applying the Karhunen-Loeve
transform. The Kaiser-Guttman stopping rule (Jackson, 1993) was used to select the
number of eigenvectors in the basis. The formulation for step 1 is presented below:
a) Let Ω = {1, 2, ..., f} denote the set of f features with frequency greater than 0
and Φ = {1, 2, ..., w} represent the set of w text windows. Let X denote the author's
feature matrix, where x_{ij} is the value of feature j ∈ Ω for window i ∈ Φ:

X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1f} \\ x_{21} & x_{22} & \cdots & x_{2f} \\ \vdots & \vdots & \ddots & \vdots \\ x_{w1} & x_{w2} & \cdots & x_{wf} \end{bmatrix}

b) Extract the set of eigenvalues {λ_1, λ_2, ..., λ_n} of the covariance matrix Σ of the
feature matrix X by finding the points where the characteristic polynomial of Σ equals 0:

p(λ) = \det(Σ − λI) = 0

For each eigenvalue λ_m > 1, extract its eigenvector a_m = (a_{m1}, a_{m2}, ..., a_{mf}) by solving
the following system, resulting in a set of n eigenvectors {a_1, a_2, ..., a_n}:

(Σ − λ_m I) a_m = 0

c) Compute an n-dimensional representation for each window i by extracting
principal component scores ε_{ik} for each dimension k ≤ n:

ε_{ik} = a_k^T x_i
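The following hedged sketch illustrates Step 1: window-level feature vectors are projected onto the eigenvectors retained by the Kaiser-Guttman rule. The extract_features function is a hypothetical stand-in for the extended feature set, and the window and jump sizes are illustrative rather than the dissertation's exact settings.

```python
# A hedged sketch of Step 1: slide a fixed-length window over the identity's text,
# build window-level feature vectors, keep covariance eigenvectors whose eigenvalues
# exceed 1 (Kaiser-Guttman), and project the windows to obtain Writeprint points.
import numpy as np

def writeprint_points(text, extract_features, window=1500, jump=250):
    starts = range(0, max(len(text) - window, 1), jump)
    X = np.array([extract_features(text[s:s + window]) for s in starts], dtype=float)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    basis = eigvecs[:, eigvals > 1.0]      # Kaiser-Guttman stopping rule
    return X @ basis, basis                # (w x n) window coordinates and the basis matrix
```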
Step 2: Pattern Disruption
Since Writeprints uses individual author level feature sets, an author’s key set of
features may contain attributes that are significant because the author never uses them.
However, features with no usage by the identity of interest will currently be irrelevant to
the process since they have no variance. Nevertheless these features are still important
when comparing a trader identity to other anonymous trader identities. The trader’s lack
of usage of these features represents an important stylistic tendency. Anonymous identity
texts containing these features should be considered less similar (since they contain
attributes never used by this author). When comparing two trader identities A and B, we
would like A’s zero frequency features to act as pattern disruptors, where the presence of
these features in identity B’s feedback comments decreases the similarity for the
particular A – B comparison (and vice versa for the B – A comparison).
The magnitude of a disruptor signifies the extent of the disruption for a particular
feature. Larger values for the disruptor will cause pattern points representing text
windows containing the disruptor feature to be shifted further away. However, not all
features are equally important discriminators. Koppel et al. (2006) developed a machine
translation based technique for measuring the degree of feature “stability.” Stability refers
to how often a feature changes across authors and documents for a constant topic. They
found noun phrases to be more stable than function words and argued that function words
are better stylistic discriminators than noun phrases since use of function words involves
making choices between a set of synonyms. Based on this intuition, we used the disruptor
feature’s information gain and synonymy information to assign them a weight (disruptor
coefficient), which was appended to the identity’s basis matrix (set of eigenvectors).
a) Let Ψ = {f + 1, f + 2, ..., f + g} denote the set of g features with zero frequency. For
each feature p ∈ Ψ, compute the disruptor coefficient d_p:

d_p = IG(c, p) \cdot K \cdot (syn_{total} + 1)(syn_{used} + 1)

where IG(c, p) is the information gain for feature p across the set of classes c,
syn_{total} and syn_{used} are the total number of synonyms and the number used by the author,
respectively, for the disruptor feature, and K is a disruptor constant.

b) For each feature p ∈ Ψ, append the value d_{kp} to each eigenvector a_k, where k ≤ n.
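As a small illustration, the sketch below computes the disruptor coefficient as the product described above and appends one row per zero-usage feature to the basis matrix; the value of K and the synonym counts are assumed to be supplied by other components (e.g., the information gain routine and a thesaurus lookup).

```python
# A minimal sketch of the disruptor coefficient and its attachment to the basis matrix.
# The constant K and the synonym counts are illustrative inputs, not fixed values.
import numpy as np

def disruptor_coefficient(info_gain, syn_total, syn_used, K=0.1):
    return info_gain * K * (syn_total + 1) * (syn_used + 1)

def append_disruptors(basis, disruptor_values):
    """Append one row per zero-usage feature to the (f x n) basis matrix, repeating
    each disruptor value across all n retained dimensions."""
    rows = np.tile(np.asarray(disruptor_values, dtype=float)[:, None], (1, basis.shape[1]))
    return np.vstack([basis, rows])
```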
3.4.2.2 Writeprint Comparisons
When comparing two identities’ usage variation patterns, two comparisons must be
made since both identities used different feature sets and basis matrices in order to
construct their lower dimensional patterns. The dual comparisons are illustrated in Figure
3.2. We would need to construct a pattern for identity B using B’s text with A’s feature set
and basis matrix (Pattern B) to be compared against identity A’s Writeprint (and vice
versa). The overall similarity between Identity A and B is the sum of the average distance
between Writeprint A and Pattern B and Writeprint B and Pattern A.
Figure 3.2: Writeprint Comparisons
As previously mentioned, the pattern disruptors are intended to assess the degree of
stylistic dissimilarity based on important features only found in one of the two identities’
feedback comments. Disruptors shift pattern points further away from the Writeprint
they’re being compared against, thereby increasing the average distance between patterns
(and reducing the similarity score). The direction of a pattern window point’s shift is
intended to reduce the similarity between the Writeprint and comparison pattern. This is
done by making d kp positive or negative for a particular dimension k based on the
orientation of the Writeprint (WP) and comparison pattern (PT) points along that
dimension, as follows:
d_{kp} = \begin{cases} -d_{kp}, & \text{if } \dfrac{1}{w}\sum_{i=1}^{w} WP_{ik} > \dfrac{1}{w}\sum_{i=1}^{w} PT_{ik} \\[1ex] +d_{kp}, & \text{if } \dfrac{1}{w}\sum_{i=1}^{w} WP_{ik} < \dfrac{1}{w}\sum_{i=1}^{w} PT_{ik} \end{cases}
For instance, if identity A’s Writeprint is spatially located to the left of identity B’s
pattern for dimension k, the disruptor d kp will be positive in order to ensure that the
disruption moves the comparison pattern away from the Writeprint (towards the right in
this case) as opposed to towards it.
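The following hedged sketch illustrates one directed comparison (Writeprint A against identity B's pattern); the overall A-B score sums the two directed average distances. The mean pairwise Euclidean distance is a plausible stand-in for the exact distance computation, and the inputs describing disruptor occurrences are simplified assumptions.

```python
# A hedged sketch of one directed Writeprint comparison. Windows in which identity B
# uses one of A's zero-usage (disruptor) features are shifted away from A's Writeprint
# along each dimension before the average distance is computed.
import numpy as np

def directed_distance(wp_a, X_b, basis_a, disruptor_values, disruptor_present):
    """wp_a: A's Writeprint points (w_a x n); X_b: B's windows over A's nonzero
    features (w_b x f); disruptor_values: {p: d_p} for A's zero-usage features;
    disruptor_present: {p: boolean array (w_b,), True where B's window uses p}."""
    pattern_b = X_b @ basis_a
    for p, d in disruptor_values.items():
        mask = disruptor_present[p]
        for k in range(pattern_b.shape[1]):
            # Shift affected windows away from A's Writeprint along dimension k.
            sign = -1.0 if wp_a[:, k].mean() > pattern_b[:, k].mean() else 1.0
            pattern_b[mask, k] += sign * d
    diffs = wp_a[:, None, :] - pattern_b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).mean()

# Overall score: directed_distance(A vs. B) + directed_distance(B vs. A);
# a larger summed distance indicates lower stylistic similarity.
```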
3.5 Evaluation
In order to evaluate the effectiveness of the proposed system, which includes the
Writeprint technique and extended feature set, experiments were conducted that
compared the system against previous unsupervised stylometric identification techniques
described, including PCA, N-Gram and Markov Models, Cross Entropy, and Kullback-Leibler similarity. The test bed, experimental design, and parameter settings for the Writeprint and
comparison techniques are described below.
3.5.1 Test Bed
The test bed consisted of buyer/seller feedback comments extracted from eBay’s
online reputation system. We randomly extracted 200 eBay members selling electronic
goods. For each trader, 3,000 feedback comments posted by that author were included.
Table 3.3 provides summary statistics of the test bed while example feedback comments
are listed below:
• “Another quick & easy transaction, thanks for your biz!”
• “Excellent e-bayer!! fast payment, great to deal with, many thanks!!!”
• “PLEASURE doing business with you and thanks for making this business a PLEASURE!”
Table 3.3: eBay Test Bed Statistics
# Authors (i.e., traders)   Words (per author)   Comments (per author)   Ave. Comment Length (words)   Time Duration
200                         22,564               3,000                   7.94                          02/2003 – 06/2006
3.5.2 Experimental Setup
All comparison techniques were run using the best parameter settings determined by
tuning these parameters on the actual test bed data. This was done in order to allow the
best possible comparison against the proposed Writeprint technique. Most of the
parameter values were consistent with prior research. PCA was run using the extended
feature set. We extracted feature vectors for 1,500 character text blocks, consistent with
prior research (Abbasi and Chen, 2006). The Kaiser-Guttman stopping rule was used (i.e.,
extract all eigenvectors with an eigenvalue greater than 1). For the N-gram Models, we
used character level n-grams, with profile sizes of 5,000 n-grams per identity. For each
identity we used 4-8 character n-grams since this configuration garnered the best results,
also consistent with Peng et al. (2003) and Keselj et al. (2003). Markov Models were
built using letters and space bigrams. We removed all other characters and ignored words
beginning with capital letters, as done by Khmelev (2000) and Khmelev and Tweedie
(2001). For Cross Entropy we used a database size of 5,000 characters for each identity
as this size provided the best performance. For the Kullback-Leibler similarity, word
unigrams were used and smoothing was performed as outlined by Novak et al. (2004).
For the experiments, we created multiple identities for each of the 200 eBay traders by
splitting the traders’ feedback comment text into multiple parts, as done in prior research
(e.g., Novak et al., 2004). The objective of the experiments was to see how well the
proposed Writeprint method and comparison techniques could match up the different
trader identities based on their comment texts. Each trader’s text was split into 12 parts. If
two identities were to be created for a single trader, 6 parts were randomly assigned to
each identity. For example, parts 1, 5, 7, 8, 9, 11 (identity 1), parts 2, 3, 4, 6, 10, 12
(identity 2). In order to test the statistical significance of the techniques’ performance,
bootstrapping was performed 30 times for each technique, where each iteration the 12
trader text parts were randomly split into the desired number of identities. A trial and
error method was used to find the optimal similarity threshold for matching for each
technique. The same threshold was used throughout the experiments for the Writeprint
method. A dynamic threshold yielding optimal results for the particular experimental
settings was used for each comparison technique. This was done in order to compensate
for differences in performance attributable to thresholds instead of techniques. All
identity-identity scores above a technique's threshold were considered a match. The F-measure was used to evaluate performance:
F\text{-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
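A minimal sketch of the identity-splitting protocol is shown below; it assumes the number of identities divides the 12 parts evenly, as in the two-identity example described above, and uses simple interleaving to form the 12 parts.

```python
# A minimal sketch of identity splitting: divide a trader's comments into 12 parts and
# randomly group the parts into the desired number of anonymous identities, repeating
# the random assignment on every bootstrap iteration.
import random

def split_into_identities(comments, n_identities=2, n_parts=12, seed=None):
    rng = random.Random(seed)
    parts = [comments[i::n_parts] for i in range(n_parts)]   # 12 interleaved parts
    order = rng.sample(range(n_parts), n_parts)              # random assignment of parts
    per_identity = n_parts // n_identities
    return [" ".join(c for j in order[k * per_identity:(k + 1) * per_identity] for c in parts[j])
            for k in range(n_identities)]
```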
Using these experimental settings, two sets of experiments were conducted. The first
assessed the scalability of the proposed stylometric similarity detection system and
comparison approaches in terms of number of traders and number of identities’
comments. The second attempted to evaluate the effectiveness of these stylometric
methods against intentional stylistic alteration and forging/copycatting. Details about the
two experiments are presented in the ensuing sections.
3.5.3 Experiment 1: Scalability
We conducted two experiments to analyze the scalability across traders (experiment
1a) and identities (experiment 1b). In experiment 1a scalability across traders was
evaluated. Each trader’s text was split into two anonymous identities. We used 25, 50,
100, and 200 traders (i.e., 50, 100, 200, and 400 identities). In experiment 1b scalability
across identities was the focal point. We used 50 traders, with each trader’s text split into
n anonymous identities. We used 2, 3, 4, and 5 identities per trader (i.e., 100, 150, 200,
and 250 identities total). The details of the number of traders and identities used for
experiment 1 are presented in Table 3.4.
Table 3.4: Number of Traders and Identities used in Experiment 1
Experiment
# Traders
# Identities
1a
(Traders)
25
50
100
200
50
50
50
50
50
100
200
400
100
150
200
250
1b
(Identities)
Words
(per identity)
11,282
11,282
11,282
11,282
11,282
7,521
5,641
4,513
Comments
(per identity)
1,500
1,500
1,500
1,500
1,500
1,000
750
600
3.5.3.1 Results for Experiment 1a: Scalability across Traders
Figure 3.3 shows the F-measure percentages for 25, 50, 100, and 200 traders (with 2
identities per trader), intended to assess the scalability across traders. Overall all the
techniques except PCA performed well. As expected, doubling the number of authors and
identities decreased performance; however, the decrease was gradual. Writeprint had the
best performance for all four identity levels. The technique only had approximately a 3%
decrease when going from 100 to 200 identities and from 200 to 400 identities. In
contrast, the performance of N-Gram Models, K-L Similarity, and Cross Entropy fell 6%-7% for each such increase.
[Figure: line plot of % F-Measure against # Identities (50, 100, 200, 400) for each technique, based on the following values.]

Techniques        50        100       200       400   (# Identities)
Writeprint        100.00    97.88     94.59     92.16
PCA               81.19     77.49     72.25     65.80
N-Gram Models     97.96     97.60     89.39     78.42
Markov Models     98.04     90.91     87.86     76.18
Cross Entropy     100.00    96.15     90.34     83.53
K-L Similarity    98.06     93.52     88.90     77.18
Figure 3.3: Experiment 1a Results (scalability across traders using 2 identities per trader)
Table 3.5 shows the p-values for the pair wise t-tests on F-measure. For all t-tests, a
Bonferroni correction was performed to avoid spurious positives stemming from the large
number of comparisons. Only p-values less than 0.0001 were considered significant.
Since this threshold is considerably lower than α/n, we are confident that it ensures the
statistical validity of the t-tests. Since our primary concern is the effectiveness of
Writeprint coupled with the extended feature set, only p-values for this technique are
depicted in Table 3.5. However, other significant results of interest are also reported in
the text description below.
Writeprint significantly outperformed all comparison techniques. The N-gram and
Markov models, Cross Entropy, and K-L similarity techniques significantly outperformed
PCA for all four settings (p-values <0.0001). Furthermore, Cross Entropy significantly
outperformed N-Gram and Markov Models and K-L similarity when using 400 identities
(p-values <0.0001).
Table 3.5: P-Values for Pair Wise t-tests on F-measure (n=30)
Techniques                      25/50       50/100      100/200     200/400   (# Traders / # Identities)
Writeprint vs. PCA              <0.0001*    <0.0001*    <0.0001*    <0.0001*
Writeprint vs. N-Gram Models    <0.0001*    0.1090      <0.0001*    <0.0001*
Writeprint vs. Markov Models    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Writeprint vs. Cross Entropy    0.8521      <0.0001*    <0.0001*    <0.0001*
Writeprint vs. K-L Similarity   <0.0001*    <0.0001*    <0.0001*    <0.0001*
* P-values significant at corrected threshold alpha/n = 0.0001
3.5.3.2 Results for Experiment 1b: Scalability across Identities
Figure 3.4 shows the F-measure percentages for 2, 3, 4, and 5 identities per trader
(with 50 traders), intended to assess the scalability across identities. Writeprint again had
the best performance for all four trader/identity levels. N-Gram and Markov models
performed worse on this experiment as compared to the trader scalability experiment
(1a), with 10%-15% lower performance on an equal number of total identities (see values
for 200 identities in 1a and 4 identities per trader in 1b). The results suggest that the
number of identities per author has a greater impact on performance than the number of
authors for these techniques. Perhaps this is due to the amount of text per identity, which
was constant in experiment 1a and decreased in experiment 1b as the number of identities
per trader increased. Writeprint, Cross Entropy, and the K-L similarity method appear
more robust against smaller amounts of text. This finding is consistent with Novak et al.
(2004) who also found the K-L similarity approach to work almost equally well when
dealing with 2-4 aliases.
[Figure: line plot of % F-Measure against # Identities per Trader (2, 3, 4, 5) for each technique, based on the following values.]

Techniques        2         3         4         5   (# Identities per Trader)
Writeprint        97.88     96.92     95.59     94.72
PCA               77.49     71.09     70.43     67.21
N-Gram Models     97.60     84.44     78.66     74.22
Markov Models     90.91     73.23     72.43     69.32
Cross Entropy     96.15     93.27     90.16     88.10
K-L Similarity    98.06     91.21     87.90     85.31
Figure 3.4: Experiment 1b Results (scalability across identities using 50 traders)
Table 3.6 shows the p-values for the pairwise t-tests on F-measure. Writeprint
significantly outperformed all comparison techniques. This is likely attributable to the
pattern disruptors effectively differentiating between a larger number of identities per
author. Cross entropy significantly outperformed N-gram and Markov models, K-L
similarity, and PCA for all four settings. This is consistent with prior research, where the
technique has been shown to be effective when applied to smaller texts (Juola & Baayen,
2005).
Table 3.6: P-Values for Pair Wise t-tests on F-measure (n=30)
Techniques                        # Traders / #Identities
                                  50/100      50/150      50/200      50/250
Writeprint vs. PCA                <0.0001*    <0.0001*    <0.0001*    <0.0001*
Writeprint vs. N-Gram Models      0.1090      <0.0001*    <0.0001*    <0.0001*
Writeprint vs. Markov Models      <0.0001*    <0.0001*    <0.0001*    <0.0001*
Writeprint vs. Cross Entropy      <0.0001*    <0.0001*    <0.0001*    <0.0001*
Writeprint vs. K-L Similarity     <0.0001*    <0.0001*    <0.0001*    <0.0001*
* P-values significant at corrected threshold alpha/n = 0.0001
3.5.3.3 Results Discussion for Experiment 1
In both experiments, Writeprint had the best performance for all trader/identity levels.
The performance gap widened as the number of traders and identities increased,
suggesting that the extended feature set and pattern disruption mechanism incorporated
by Writeprint allowed improved scalability. The enhanced representational richness of
Writeprint allowed it to outperform the word (K-L similarity) and n-gram based
techniques (N-Gram and Markov Models) while the pattern disruption component
allowed improved performance over PCA.
3.5.4 Experiment 2: Robustness
We conducted experiments to analyze the robustness of the proposed system and
comparison approaches against intentional stylistic alteration and copycatting/forging.
For each experiment, we used 50 traders, with every trader’s text split into two identities.
For each trader, one identity was kept unchanged while the other was altered using word
substitution or forging. In experiment 2a, intentional stylistic alteration was simulated
using word substitution while experiment 2b evaluated the impact of message forging.
3.5.4.1 Results for Experiment 2a: Robustness against Word Substitution
Word substitution is a popular obfuscation strategy since word based features are
transparent and more easily modifiable (Kacmarik and Gamon, 2006). Altering words
with semantically equivalent ones using thesauruses is considered a promising technique
for stylistic obfuscation (Rao and Rohatgi, 2000). Based on this rationale, we simulated
word synonym substitution using a thesaurus. For each altered identity, WordNet
(Fellbaum, 2003) was used to randomly alter n% of the words with a synonym randomly
taken from the synset. We used 20%, 40%, and 60% as values for n. Table 3.7 shows the
average number of alterations per comment for each setting of n and the impact of such
alteration on an actual comment.
Table 3.7: Impact of Different Levels of Word Substitution on an Example Comment
% Words Altered    # Alterations per Comment    Example Comment
0%                 0.000                        “Excellent e-bayer!! fast payment, great to deal with, many thanks!!!”
20%                1.448                        “Superb e-bayer!! swift payment, great to deal with, many thanks!!!”
40%                2.883                        “Astounding e-bayer!! expedited payment, lovely to deal with, many thanks!!!”
60%                4.349                        “Awesome e-bayer!! quick payment, wonderful to interact with, lots of thanks!!!”
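The substitution procedure can be sketched as follows; this is an illustrative approximation using NLTK's WordNet interface (the alteration rate is applied per word here), not the exact implementation used in the experiments:

```python
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download('wordnet')

def substitute_synonyms(comment, rate=0.2, seed=None):
    """Randomly replace roughly `rate` of the words with a WordNet synonym."""
    rng = random.Random(seed)
    altered = []
    for word in comment.split():
        if rng.random() < rate:
            lemmas = {lemma.name().replace('_', ' ')
                      for synset in wordnet.synsets(word)
                      for lemma in synset.lemmas()
                      if lemma.name().lower() != word.lower()}
            if lemmas:
                word = rng.choice(sorted(lemmas))
        altered.append(word)
    return ' '.join(altered)

print(substitute_synonyms(
    "Excellent e-bayer!! fast payment, great to deal with, many thanks!!!", rate=0.4, seed=7))
```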
Figure 3.5 shows the F-measure percentages for 20%, 40%, and 60% word
substitution using 50 traders and two identities per trader. Writeprint had the best
performance against alteration with Cross Entropy also performing very well. These
techniques seem more robust against synonymy-based word alteration. N-Gram and Markov Models and K-L similarity all performed poorly. These techniques' accuracy dropped 50%-75% with 20% synonym alteration. These methods utilize character n-grams and word unigrams, respectively, which may be more susceptible to alteration. In
comparison, PCA’s performance was more stable. While N-Gram and Markov Models
outperformed PCA by a wide margin when no substitution was performed, PCA
considerably outperformed these techniques once alteration was introduced.
Techniques              % Words Substituted
                        0         20        40        60
Writeprint              97.88     97.05     96.00     85.71
PCA                     77.49     72.44     61.03     50.29
N-Gram Models           97.60     45.78     27.45     20.00
Markov Models           90.91     15.38     9.43      6.00
Cross Entropy           96.15     92.22     88.18     83.33
K-L Similarity          93.52     62.45     51.43     42.09

[Line chart: % F-Measure (y-axis) versus % Alteration (x-axis, 0-60) for the six techniques listed above.]
Figure 3.5: Experiment 2a Results (robustness against word substitution)
Table 3.8 shows the p-values for the pairwise t-tests on F-measure for the experiment evaluating robustness against word substitution. Writeprint significantly outperformed all comparison techniques, with the lone exception of Cross Entropy at the 60% alteration level (p = 0.0043). PCA also outperformed N-gram and Markov models and K-L similarity with p-values less than 0.001, as previously mentioned. However, Cross Entropy significantly outperformed PCA for all three alteration levels (all p-values < 0.0001). Overall, the t-tests indicate that Writeprint had the best performance.
Table 3.8: P-Values for Pair Wise t-tests on F-measure (n=30)
Techniques                        % Alteration
                                  0           20          40          60
Writeprint vs. PCA                <0.0001*    <0.0001*    <0.0001*    <0.0001*
Writeprint vs. N-Gram Models      0.1090      <0.0001*    <0.0001*    <0.0001*
Writeprint vs. Markov Models      <0.0001*    <0.0001*    <0.0001*    <0.0001*
Writeprint vs. Cross Entropy      <0.0001*    <0.0001*    <0.0001*    0.0043
Writeprint vs. K-L Similarity     <0.0001*    <0.0001*    <0.0001*    <0.0001*
* P-values significant at corrected threshold alpha/n = 0.0001
3.5.4.2 Results for Experiment 2b: Robustness against Forging
Message forging (also referred to as copycatting) occurs when an individual attempts
to mimic another user by imitating their writing style (Abbasi and Chen, 2006). In order
to assess the impact of forging on stylometric similarity detection of online market
feedback comments, we simulated identities engaging in different levels of message
forging. In similar fashion to the previous experiment, 50 traders and 2 identities per trader were used, with one of the two trader identities subjected to different levels of forging. For each altered identity, we randomly substituted n% of the identity's messages with randomly selected messages taken from other author identities. We used 10%, 20%, and 30% as values for n. Table 3.9 illustrates the impact of 20% forgery on a set of five comments from an author. In this case, one comment out of five (20%) is replaced with a random comment taken from another identity.
Table 3.9: Illustration of Impact of 20% Message Forging on Feedback Comments
0% Messages Forged                                          20% Messages Forged
Another quick & easy transaction, thanks for your biz!      Another quick & easy transaction, thanks for your biz!
Excellent e-bayer!! fast payment, many thanks!!!            Excellent e-bayer!! fast payment, many thanks!!!
A pleasure to do business with, don’t be a stranger!!!      A wonderful buyer. Prompt payment, quick response.
Great to deal with, fast payment.                           Great to deal with, fast payment.
A superb e-bayer!!! A real pleasure to do business with.    A superb e-bayer!!! A real pleasure to do business with.
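A minimal sketch of this forging simulation is shown below (the identity and donor comments are placeholders; the actual experiments drew donor messages from the full pool of other author identities):

```python
import random

def forge_messages(identity_comments, donor_comments, rate=0.2, seed=None):
    """Replace roughly `rate` of an identity's comments with comments from other identities."""
    rng = random.Random(seed)
    forged = list(identity_comments)
    n_forge = max(1, round(rate * len(forged)))
    for idx in rng.sample(range(len(forged)), n_forge):
        forged[idx] = rng.choice(donor_comments)
    return forged

original = [
    "Another quick & easy transaction, thanks for your biz!",
    "Excellent e-bayer!! fast payment, many thanks!!!",
    "A pleasure to do business with, don't be a stranger!!!",
    "Great to deal with, fast payment.",
    "A superb e-bayer!!! A real pleasure to do business with.",
]
donors = ["A wonderful buyer. Prompt payment, quick response."]
print(forge_messages(original, donors, rate=0.2, seed=1))
```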
Figure 3.6 shows the F-measure percentages for 10%, 20%, and 30% message forging
using 50 traders and two identities per trader. Cross Entropy performed the best against
forging. Writeprint’s performance fell at an increasing rate, especially at 20% and 30%
forging. N-Gram and Markov model performance plummeted once again, when exposed
to message forging. PCA was the only technique that performed marginally better on the
forging experiment (2b) as compared to the word substitution experiment (2a).
Techniques              % Messages Forged
                        0         10        20        30
Writeprint              97.88     88.17     74.23     56.10
PCA                     77.49     70.42     64.13     57.91
N-Gram Models           97.60     34.43     21.96     10.53
Markov Models           90.91     11.02     5.54      3.48
Cross Entropy           96.15     92.18     87.03     76.94
K-L Similarity          93.52     65.65     58.53     52.63

[Line chart: % F-Measure (y-axis) versus % Forged (x-axis, 0-30) for the six techniques listed above.]
Figure 3.6: Experiment 2b Results (robustness against message forging)
Table 3.10 shows the p-values for the pairwise t-tests on F-measure for the experiment evaluating robustness against message forging. Writeprint significantly outperformed N-Gram and Markov models, K-L similarity, and PCA for most settings. However, Cross Entropy significantly outperformed all other techniques, including Writeprint and PCA. The improved performance of Cross Entropy was particularly noticeable at the 20% and 30% forging levels. The following section provides an analysis of why Writeprint performed more poorly on the message forging experiment (2b) than on the word substitution experiment (2a), while PCA performed marginally better on message forging (as compared to word substitution, 2a).
Table 3.10: P-Values for Pair Wise t-tests on F-measure (n=30)
Techniques                        % Forged
                                  0           10          20          30
Writeprint vs. PCA                <0.0001*    <0.0001*    <0.0001*    0.0016
Writeprint vs. N-Gram Models      0.1090      <0.0001*    <0.0001*    <0.0001*
Writeprint vs. Markov Models      <0.0001*    <0.0001*    <0.0001*    <0.0001*
Writeprint vs. Cross Entropy      <0.0001*    <0.0001*    <0.0001*    <0.0001*
Writeprint vs. K-L Similarity     <0.0001*    <0.0001*    <0.0001*    <0.0001*
* P-values significant at corrected threshold alpha/n = 0.0001
3.5.4.3 Results Discussion for Experiment 2
We analyzed the impact of word substitution based alteration and forging on the
features selected for the altered identities. Since the feature sets are dynamically
generated at the group level (PCA) or individual identity level (Writeprint, N-Gram
Model, Cross Entropy, K-L Similarity) for most of our approaches, word substitution or
forging results in a different feature set as compared to no alteration. Thus, the amount of
change in the features used by an altered identity as compared to that same identity,
devoid of alteration, can shed light on the impact of alteration. We analyzed this by taking
the percentage change in altered/forged feature sets from the feature sets used when no
alteration/forging was performed. We considered the Writeprint, PCA, N-Gram Model,
Cross Entropy, and K-L similarity methods. For Cross Entropy, the features were the
match lengths. Markov models were not analyzed since they use a fixed feature set.
Figure 3.7 shows the impact of word substitution and message forging on the feature
sets for the various techniques. Word substitution and forging had a profound impact on
character n-gram and word features, resulting in the poor performance of the N-gram
models and K-L similarity methods. The Cross Entropy match lengths also changed considerably; however, the magnitude of the changes was not substantial. In other words, although the Cross Entropy features changed a lot, the manner in which the features are applied is fairly insensitive to word substitution and message forging. For example, a change in the match lengths from {6, 3, 4} to {4, 5, 6} results in a 33% change in features but only an average match length change of 0.67.
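The example can be verified with a few lines of Python; the set-difference reading of "change in features" is an assumption about how the percentages were computed:

```python
before, after = [6, 3, 4], [4, 5, 6]

# Fraction of the original match lengths that no longer appear.
pct_change = len(set(before) - set(after)) / len(before)                    # 1/3 -> 33%
# Change in the average match length, which is what Cross Entropy actually uses.
avg_len_change = abs(sum(after) / len(after) - sum(before) / len(before))  # |5.00 - 4.33| = 0.67

print(f"{pct_change:.0%} feature change, {avg_len_change:.2f} average match length change")
```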
PCA had fewer feature changes for forging as compared to synonym alteration. This was attributable to the fact that PCA used a single feature set. Message forging does not change the overall text across identities, resulting in minimal change in the feature set used by PCA. Consequently, PCA performed better on the forging experiments. In
contrast, Writeprint features changed marginally for the word substitution experiment but
considerably more for the forging experiments. This resulted in lower accuracy when
encountering message forging. For example, the 56.10% accuracy for 30% forging can be
attributed to the fact that 40% of the forged identities’ features changed. The following
paragraph describes why the Writeprint features for the altered identities were generally
more effective.
[Figure: two line charts plotting % Change in Features against % Altered (left panel) and % Forged (right panel) for the Writeprint, PCA, N-Gram Models, Cross Entropy, and K-L Similarity feature sets.]
Figure 3.7: Impact of Word Substitution and Forging on Various Techniques
Figure 3.8 shows the percentage change in the Writeprint features for character, word,
and part-of-speech tag n-grams across the word alteration and forging experiments. For
the word alteration experiments, the POS tag n-grams had minimal change. This led to a
reduced impact of word synonym substitution on performance. However, the forging
caused considerably greater change in the identities’ POS tag n-gram features, resulting
in decreased Writeprint performance for Experiment 2b.
[Figure: two line charts plotting % Change in Features against % Alteration (left panel) and % Forged (right panel) for the character, word, and POS tag n-gram features of Writeprint.]
Figure 3.8: Impact of Word Substitution and Forging on Writeprint N-Gram Features
3.6 Conclusions
In this essay we developed a system that can be used for similarity detection of trader feedback comments in online markets. Our research makes several contributions. We
developed the Writeprint technique which uses Karhunen-Loeve transforms and a novel
pattern disruption mechanism to help detect stylistic similarity between traders based on
feedback comments. We also incorporated a more comprehensive feature set, allowing
improved representation of reputation system feedback comments. Experiments in
comparison with existing stylometric techniques demonstrated the scalability and
robustness of the proposed features and technique for differentiating trader identities in
online markets. The system proposed in this paper was fairly scalable in terms of number
of traders and identities per trader. The approach was also fairly robust against word
substitution based alteration.
The viability of stylometric techniques that can differentiate between hundreds of
online traders, coupled with the emergence of large online fraudulent trader databases,
has several important research implications. Stylometric analysis techniques can serve as
identity authentication systems in online markets, allowing users to compare a potential
trading partner against existing fraudulent identities. Such authentication could be
especially useful considering that most fraudulent traders engage in such “opportunistic
behavior” repeatedly (Chua and Wareham, 2004), resulting in many documented
identities. In the future we intend to develop such an authentication system that allows
individuals to compare traders against hundreds of fraudulent identities collected from
various online resources that have emerged in recent years (Chua and Wareham, 2004).
Additionally, we intend to further enhance the scalability and robustness of the Writeprint
based system using a larger number of online traders. We also plan to investigate the
effectiveness of contextual stylometric models segmented temporally or based on genres,
emotions, message recipients, or topics.
CHAPTER 4: WEBSITE SIGNATURES: AN EXPERIMENT ON FAKE ESCROW
AND SPOOF WEBSITES
4.1 Introduction
In Chapters 2 and 3, we used information types related to the textual meta-function
for identity level stylometric identification and similarity detection in order to improve
identity trust. In Chapters 4 and 5, we turn our attention to using textual information for
fake website detection in order to enhance institutional trust.
Computer mediated communication (CMC) encompasses several modes, including
websites, forums, chat rooms, etc. (Herring, 2002). These CMC media have witnessed
unprecedented growth, with widespread adoption. Consequently several trust related
issues have arisen, including Internet fraud and online deception (Friedman et al., 2000).
The increased popularity of electronic markets and blogs has attracted opportunists
seeking to capitalize on the asymmetric nature of online information exchange (Hu et al.,
2004). Many forms of fake and deceptive websites have appeared, including web spam,
spoof sites, and escrow fraud sites (Chua and Wareham, 2004; Chou et al., 2004). Web
spam sites attempt to deceive search engines to boost their ranking. Spoof sites are
replicas of real commercial sites intended to deceive the authentic sites’ customers into
providing their information. Escrow fraud sites are fake online escrow services (OES)
created by auction sellers who trick unsuspecting buyers into placing their money with
these sites. The quantities of such fake websites are rising at alarming rates. For instance,
a large percentage of blogs are fake spam pages called splogs (Kolari et al., 2006).
Similarly hundreds of new spoof sites are detected daily, with many more likely going
unnoticed (Chou et al., 2004). These spoofs are used to attack millions of unsuspecting
Internet users. Escrow fraud sites have also seen significant usage with over one hundred
fake OES added daily to online databases such as the Artists Against 4-1-9 (Airoldi and
Malin, 2004). In light of such rampant online deception, many researchers have expressed
a need for mechanisms that provide greater informational transparency in cyberspace
(Smith, 2002). Such socially translucent tools are necessary to counter anonymity abuses
and garner increased accountability (Erickson and Kellogg, 2000).
There is a need for automated techniques capable of identifying fake websites.
Despite significant progress, several limitations remain. Many methods have been
developed for detecting web spam; however, there has been limited work on spoof sites and we are unaware of any work on fake escrow website identification. This deficiency
exists in spite of their prevalence as major forms of internet fraud (Chua & Wareham,
2004; Chou et al., 2004). Furthermore, most fake website categorization studies have
used focused feature sets comprised of a subset of attributes taken from selective
categories (e.g., some link and content features). The effectiveness of such a diverse
group of features across several categories suggests that using an extended set
encompassing different types of attributes could be highly useful for categorization of
fake escrow and spoof sites. These relevant feature categories include attributes derived
from various information sources, including website design (i.e., HTML), website
content (i.e., body text), URL tokens, images, and website structure and linkage (Urvoy et
al., 2006). Additionally, while Support Vector Machines (SVM) has been used
considerably for web spam categorization, most studies employed the standard linear
kernel. Alternate kernel functions, including ones customized to better represent the
unique problem characteristics, have not been utilized despite their effectiveness in
related domains (Kolari et al., 2006).
In this essay we propose an approach designed to automatically identify fake
websites, using a rich feature set coupled with a composite kernel customized for
representing important characteristics of fake websites. The extended feature set includes
stylistic features extracted from body text and HTML; image pixel features for capturing
duplicate pictures, banners, and icons across sites; URL and anchor text tokens; and
website structure and linkage based features. The composite kernel incorporates patterns
based on content similarity and duplication, which are pervasive in fake websites. We
evaluated the approach on two test beds comprised of fraudulent escrow and spoof
websites, respectively. The combination of a rich feature set and the composite kernel
facilitated enhanced categorization of fake websites over methods using a subset of the
features or standard linear kernels.
4.2 Related Work
A major factor influencing a user’s level of trust in a particular website is the
perceived website quality (McKnight et al., 2002). Consequently, fake websites often look very professional and are difficult to identify as phony (MacInnes et al., 2005).
In response to increasing Internet user awareness, fraudsters are also becoming more
sophisticated (Levy and Arce, 2002). As a result, there is a need for fake website
detection techniques (Chou et al., 2004; Zdziarski et al., 2006). Such methods are
important to decrease Internet fraud stemming from user interaction with deceptive
websites. Three important characteristics of fake website detection include the site types,
features, and techniques adopted. Site types include web spam, spoof sites, and escrow
fraud sites. Features can be categorized into five broad categories based on the
information utilized: body text (BT), HTML style (HS), URL and anchor text (UA),
images (IM), and structure and linkage (SL) attributes. Table 4.1 presents a summary of
the relevant prior studies, including the site types investigated and the features and
techniques employed.
Based on the table, we can make several observations regarding the site types,
features, and techniques used in related previous research. (1) There has been limited
work on spoof sites and we are unaware of any research on fake escrow websites. (2) Many different features have been shown to be effective for fake website categorization, but few studies have attempted to combine features spanning several categories into a single extended feature set. (3) Machine learning classifiers such as
SVM, C4.5 decision trees, Naïve Bayes, and Neural Networks have been used
considerably. Particularly, SVM has been utilized effectively in many studies, but mostly
with a linear kernel. Kernel functions tailored towards fake website categorization could
be useful, yet have not been explored (Kolari et al., 2006). These three important
characteristics of fake website detection (i.e., site types, features, and techniques) are
expounded upon in the remainder of the section.
Table 4.1: Related Fake Website Detection Studies
Study: Chou et al., 2004
  Site Type: Spoof Sites
  Features: (HS) “post” tag; (UA) URL text; (IM) image hashes; (SL) link information
  Techniques: Test based scoring mechanism (called TSS)
  Test Bed and Results: 719 real and spoof sites; 67.8% accuracy*

Study: Drost & Scheffer, 2005
  Site Type: Web Spam
  Features: (BT) tokens in various page sections; (HS) redirections; (UA) characters in URL, domain name, etc.; (SL) number of in/out links and their page content sums, averages, ratios
  Techniques: SVM (linear, polynomial, and RBF kernels)
  Test Bed and Results: Web pages (quantity used unclear); over 95% accuracy

Study: Metaxas & DeStefano, 2005
  Site Type: Web Spam
  Features: (SL) site link graphs
  Techniques: Manual observation
  Test Bed and Results: Graphs for 10 seed websites

Study: Mishne et al., 2005
  Site Type: Web Spam
  Features: (BT) word unigrams
  Techniques: K-L Divergence based similarity
  Test Bed and Results: Comments from 50 weblogs; 83% accuracy

Study: Kolari et al., 2006
  Site Type: Web Spam
  Features: (BT) word n-grams; (UA) URL and anchor token n-grams
  Techniques: SVM (linear kernel)
  Test Bed and Results: 1,400 weblog pages; 88.1% f-measure

Study: Ntoulas et al., 2006
  Site Type: Web Spam
  Features: (BT) lexical measures, word n-grams; (UA) amount of text in anchors
  Techniques: C4.5 Decision Tree, Neural Network, SVM (kernel not specified)
  Test Bed and Results: Over 17,000 web pages; 95.4% accuracy

Study: Salvetti & Nicolov, 2006
  Site Type: Web Spam
  Features: (UA) URL tokens
  Techniques: Naïve Bayes
  Test Bed and Results: URLs from 20,000 weblogs; 78% accuracy

Study: Shen et al., 2006
  Site Type: Web Spam
  Features: (SL) temporal link features such as in-link growth/death rate
  Techniques: SVM (linear kernel)
  Test Bed and Results: 113,756 web pages

Study: Urvoy et al., 2006
  Site Type: Web Spam
  Features: (HS) HTML tag n-grams
  Techniques: Jaccard based similarity algorithm (called HSS)
  Test Bed and Results: 5 million web pages

Study: Wu & Davidson, 2006
  Site Type: Web Spam
  Features: (BT) terms in various page sections; (HS) HTML tags and keywords; (SL) in/out links, relative/absolute links, etc.
  Techniques: SVM (linear kernel)
  Test Bed and Results: 1,285 web pages; 93% precision, 85% recall

Study: Zdziarski et al., 2006
  Site Type: Spoof Sites
  Features: (BT) body text strings
  Techniques: Probabilistic Digital Fingerprinting
  Test Bed and Results: No formal evaluation

*Evaluated by Zhang et al., 2007
4.2.1 Fake Website Types
Three pervasive categories of fake websites include web spam, spoof sites, and
escrow fraud websites. Web spam is the “injection of artificially created web pages into
the web in order to influence the results from search engines, to drive traffic to certain
pages for fun or profit.” (Ntoulas et al., 2006; p. 83). Such search engine optimization
(SEO) is performed using various web spam variants, including link and content spam
(Gyongi and Garcia-Molina, 2005). While web spam sites are not fake per se, they are
deceptive by nature. Spoof sites are phony websites that are replicas of actual commercial
sites. With spoof sites, the intention is to trick users into providing their information (a
common method for online identity theft). Email-based “phishing” attacks often employ
such spoof sites (Chou et al., 2004). Escrow fraud websites are used to facilitate a variant
of the popular “failure-to-ship” fraud (Chua and Wareham, 2004). The seller creates a
fake online escrow service (OES) and disappears after collecting the buyer’s money
(Chou et al., 2004). Such forms of internet fraud are becoming increasingly prevalent
(Hoar, 2005). For instance, online databases such as the Artists-Against-419 (Airoldi and
Malin, 2004) contain thousands of entries for fraudulent escrow websites with hundreds
added daily. Figure 4.1 shows examples of all three categories of fake websites.
Figure 4.1: Examples of Different Categories of Fake Websites
Fake escrow sites share similarities with web spam and spoof sites. For instance there
are numerous commonalities between the features used for web spam categorization and
those necessary for fake escrow website identification (Gyongi and Garcia-Molina,
2005). Analogous to fake escrow websites, web spam typically utilizes automatic content
generation techniques to mass produce fake web pages (Urvoy et al., 2006). Automated
generation methods are employed due to the quick turnover of such content (Sullivan,
2002). Hundreds of new fake sites pop up daily to replace ones already used and/or
identified. The use of machine generated pages results in many content similarities which
may be discernable using statistical analysis of website and page level content (Fetterly et
al., 2004). However, an important difference between fake escrow sites and web spam is
that the latter is intended to deceive search engines (Gyongi and Garcia-Molina, 2005),
while fake OES sites are designed to deceive online traders. Like spoof sites, fake OES
must aesthetically appeal to buyers; thereby making visual elements such as HTML
design and images far more crucial (Chou et al., 2004). Table 4.2 below summarizes the
three categories of fake websites, including their objectives and information categories
that may provide indicators of such a website’s lack of authenticity (i.e. cues suggesting
that the site is fake). The table also highlights the importance of identifying each fake
website category, along with the amount of prior research devoted to detecting that
particular type of website.
Table 4.2: Summary of Fake Website Categories
Site Type: Web Spam
  Objective: Search engine optimization
  Site Cues: Content and Link
  Problem Significance: Some studies report that web spam constitutes over 18% of all web pages in search engines.
  Prior Research: Considerable prior research

Site Type: Spoof Sites
  Objective: Identity theft fraud
  Site Cues: Content and Image
  Problem Significance: Over 30 sites found daily; these sites are used against millions of Internet users daily.
  Prior Research: Limited work

Site Type: Fake Escrow
  Objective: Failure-to-ship fraud
  Site Cues: Content, Link, and Image
  Problem Significance: Hundreds of sites discovered daily; major portion of auction fraud, which accounts for 42% of Internet fraud.
  Prior Research: Unaware of any prior research
4.2.2 Fake Website Features
Various Internet fraud watch organizations and prior web spam research have
identified sets of features or “fraud cues” pervasive in fake websites (Chou et al., 2004;
Fetterly et al., 2004; Kolari et al., 2006; Urvoy et al., 2006). Phony websites often
duplicate content from previous fake websites, thereby looking “templatic” (Fetterly et
al., 2004). Therefore, fake website identification features can be found in the body text
style (BT), HTML style (HS), URL and anchor text (UA), images (IM), and website and
page level structure and linkage (SL).
Body text (BT) features include misspellings and grammatical mistakes, which are
more likely to occur in illegitimate websites. Such features have also been very
informative in other stylistic categorization problems (Koppel and Schler, 2003). Other
useful body text style features include lexical measures such as the words per page,
words per title, average word length, and word n-gram frequencies (Ntoulas et al., 2006).
Many studies have also used word n-grams found in the body text. Mishne et al. (2005)
developed word level unigram language models while Arasu et al. (2001) and others have
used the frequencies of bag-of-words, tokens, and terms (Drost and Scheffer, 2005; Wu &
Davidson, 2006).
HTML style (HS) attributes include web page source elements that may represent
potential cues regarding the authenticity of a website. For example, the “post” and
“redirect” tags are red flags that a website may be malicious (Chou et al., 2004; Drost and
Scheffer, 2005). Additionally, HTML tag n-grams are useful for identifying web page
design style similarities (Urvoy et al., 2006; Wu and Davidson, 2006).
Certain text appearing in site URLs and anchor text (UA) can represent powerful
fraud cues. For instance, lengthier URLs and ones with dashes or digits are common in
fake websites (Fetterly et al., 2004). URLs using “http” instead of “https” and ones
ending with “.org”, “.biz”, “.us”, or “.info” are also more likely to be fake (Drost and
Scheffer, 2005). Ntoulas et al. (2006) randomly sampled 105 million pages from the web
and observed that 70% of “.biz” and 35% of selected “.us” pages were spam.
Consequently, URL and anchor text tokens have been used extensively for web spam
categorization (Kolari et al., 2006; Salvetti and Nicolov, 2006).
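A hedged sketch of how a handful of these URL-based cues could be extracted is shown below; the cue list and the example URL are illustrative only:

```python
from urllib.parse import urlparse

SUSPECT_TLDS = ('.org', '.biz', '.us', '.info')  # endings flagged in the studies cited above

def url_cues(url):
    """Simple URL/anchor-text style cues of the kind described in prior work."""
    parsed = urlparse(url)
    return {
        'length': len(url),
        'num_dashes': url.count('-'),
        'num_digits': sum(ch.isdigit() for ch in url),
        'uses_https': parsed.scheme == 'https',
        'suspect_tld': parsed.netloc.endswith(SUSPECT_TLDS),
    }

print(url_cues('http://secure-escrow1.example.biz/payment'))
```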
Difficulties in indexing make image (IM) features hard to accurately collect and
analyze. As a result, many previous web mining and categorization studies have ignored
multimedia content altogether (Menczer et al., 2004). The use of image features may not
reveal in-depth patterns and tendencies; however, even simplistic image representations
can facilitate the identification of duplicate images (Chou et al., 2004). This could be
useful given the pervasive nature of replicated photos, banners, and icons in spoof and
escrow fraud sites.
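As an illustration of how even a simplistic image representation can flag duplicates, the sketch below computes a coarse color histogram with Pillow; the binning scheme is illustrative and not the 10,000-bin scheme used later in this essay:

```python
from PIL import Image

def color_histogram(path, bins_per_channel=8):
    """Coarse, normalized RGB histogram; duplicated banners/icons yield near-identical vectors."""
    img = Image.open(path).convert('RGB').resize((64, 64))
    step = 256 // bins_per_channel
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in img.getdata():
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1
    total = sum(hist)
    return [count / total for count in hist]

def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 means identical color profiles)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))
```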
Structure and linkage (SL) features can be very useful for detecting spoof and spam
websites (Chou et al., 2004; Metaxas and DeStefano, 2005). Structural features such as
the web page level (i.e., the number of slashes “/” in the page URL) have been
incorporated for categorizing web pages (Ester et al., 2001). Various studies have used in
and out links as well as the content derived from such links (Drost and Scheffer, 2005).
Wu and Davidson (2006) used the frequency of total in/out links as well as the number of
relative (e.g., “..\..\default.htm”) and absolute (e.g. “http://www.abc.com/default.htm”)
address links. Shen et al. (2006) used temporal measures of a website’s in and out link
growth rates as features for detecting web spam. Numerous studies have used more
involved features derived from structure and linkage information. For instance, Gyongi
and Garcia-Molina (2005) employed a trust rank score, based on the page rank algorithm
proposed for topical search engines. Such techniques, which ignore website content, are
considered especially effective for the detection of spam stemming from link farms.
Most previous studies on fake website categorization have only adopted one or two of
the aforementioned feature groups (e.g., (Kolari et al., 2006; Mishne et al., 2005; Ntoulas
et al., 2006)). Identification of fake websites entails consideration of various textual style,
link, and image elements described. For example, structure/link information, coupled
with text content, can dramatically improve web page categorization (Menczer, 2004). It
is therefore unclear whether a single feature category would be sufficient for identifying
fake escrow and spoof sites. However the use of rich heterogeneous feature sets
comprised of text, link, and image features introduces representational complexities for
the classification techniques employed.
4.2.3 Fake Website Categorization Techniques
Various machine learning techniques have been used in previous fake website
categorization research. Support Vector Machines (SVM) has been particularly effective
in numerous prior web spam categorization studies. Drost and Scheffer (2005) attained
over 95% accuracy using linear and RBF kernel SVMs to differentiate ham pages from
spam. Kolari et al. (2006) used a linear SVM for classifying blogs and splogs. Shen et al.
(2006) trained a linear SVM using temporal in and out link growth rate features for web
spam categorization. Lin et al. (2007) achieved over 95% accuracy for weblog and splog
categorization using body text and URL/anchor text attributes coupled with temporal
features (e.g., posting time stamps). SVM has also been used considerably for related
applications, including website (Joachims et al., 2001), style (Stamatatos et al., 2002;
Abbasi and Chen, 2005; 2008), and image categorization (Muller et al., 2001).
As previously mentioned, using rich heterogeneous feature sets can introduce
representational complexities that can make it difficult to utilize a single standard
classifier (Dietterich, 2000). Here, by standard classifier, we refer to any classifier that
takes a raw feature matrix as input (with instance rows and feature columns). Using
individual feature categories or text and link features in unison, as done in prior web
spam research, is no problem for such a classifier. However, if we wish to use all
features, the one-to-many relationship between pages and images is problematic. One
solution is to use feature ensemble classifiers, where each feature category is trained on a
separate classifier (Cherkauer, 1996; Dietterich, 2000). Ensemble classifiers are multiple
classifiers, built using different techniques, training instances, or feature subsets
(Dietterich, 2000). The feature subset classifier approach has been effective for analysis
of style and patterns when using large and/or heterogeneous attributes. Stamatatos and
Widmer (2002) used an SVM ensemble for music performer recognition. They
incorporated multiple SVM classifiers each trained using different feature subsets.
Cherkauer (1996) used a neural network ensemble for imagery analysis, comprised of 32
neural nets trained on 8 feature subsets. Such a setup performed better than using a single
classifier, since each member classifier was better able to represent its particular feature subset
(i.e., edge features, pixel colors, etc.). Furnkranz (2002) used an ensemble of in-link
hypertext features for categorizing web pages, with each classifier trained on text from a
single in-link of the web page of interest. The advantage of using feature-based classifier
ensembles is that they allow each classifier to become an “expert” on a subset of features
(Stamatatos et al., 2002), while also potentially enabling an improved feature
representation over the use of a single classifier. However, feature set segmentation
results in a loss of potentially informative feature interactions.
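The feature-subset ensemble idea can be sketched as follows with scikit-learn; the feature-group column ranges and the majority-vote combination are illustrative assumptions, not the configurations used in the studies cited above:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_feature_ensemble(X, y, group_slices):
    """Train one linear SVM per feature group (e.g., body text, HTML, URL columns)."""
    return [LinearSVC().fit(X[:, cols], y) for cols in group_slices]

def predict_majority(models, X, group_slices):
    """Majority vote across the per-group classifiers."""
    votes = np.array([m.predict(X[:, cols]) for m, cols in zip(models, group_slices)])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Hypothetical column ranges for three feature groups.
groups = [slice(0, 100), slice(100, 150), slice(150, 200)]
X = np.random.rand(40, 200)
y = np.random.randint(0, 2, size=40)
models = train_feature_ensemble(X, y, groups)
print(predict_majority(models, X, groups)[:10])
```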
4.2.3.1 Kernel Methods
Another alternative that can address these deficiencies is the use of kernel-based
methods. Kernel methods can facilitate the representation and classification of
information with complex patterns while also considering unique problem-specific
characteristics (Muller et al., 2001). Kernels can represent structural information and also
consider feature interactions. A kernel-based method contains a kernel function (kernel)
and a kernel machine. A kernel function k defines a similarity measure between data
instances (x1 , x 2 ) in the input space χ :
(x1 , x 2 ) → k (x1 , x 2 )
k (x1 , x 2 ) = < Φ ( x1 ), Φ ( x 2 ) >
k:χ ×χ →ℜ
satisfying :
Where Φ is the transformation from the input space to a feature space H. Kernel
methods have been effective for text categorization (Sun et al., 2004) and web page
classification (Yu et al., 2004). For example, SVM, coupled with non-linear kernels, has
been useful for classifying linked documents (Joachims et al., 2001). It is important to
note that performance is contingent upon the kernel function’s ability to represent the
unique problem characteristics (Tan & Wang, 2004).
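As a small, generic illustration of this definition (the quadratic map below is a textbook example, not a kernel used in this essay), a kernel value computed directly in the input space equals an inner product in the transformed feature space:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the homogeneous quadratic kernel."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def k(x, z):
    """Same kernel evaluated directly in the input space: k(x, z) = <x, z>^2."""
    return float(np.dot(x, z)) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
assert np.isclose(k(x, z), np.dot(phi(x), phi(z)))  # <Phi(x), Phi(z)> equals k(x, z)
print(k(x, z))  # 16.0
```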
Kolari et al. (2006) noted that kernel functions could be useful for detecting fake
websites, yet have not been explored. Most prior research on fake website detection has
instead adopted non-kernel methods or utilized the standard linear kernel (e.g., (Kolari et
al., 2006; Shen et al., 2006, Wu and Davidson, 2006)). Other kernels, such as RBF have
also been effective for web spam categorization (Drost and Scheffer, 2005; Lin et al.,
2007). In addition to representing rich heterogeneous features, kernel functions can
incorporate unique characteristics of fake websites, such as content patterns and
duplication (Drost and Scheffer, 2005). Exploration of such kernels is an important
endeavor that has not been undertaken.
4.3 Research Design
In this section we highlight fake website detection research gaps based on our review
of the related work. Research questions are then posed based on the relevant gaps
identified. Finally, a research framework is presented in order to address these research
questions, along with some research hypotheses. The framework encompasses a rich
feature set and several kernels designed to represent the unique characteristics of fake
websites.
4.3.1 Research Gaps and Questions
We are not aware of any prior research on fraudulent escrow website categorization.
There has also been limited work on detecting spoof sites. Therefore it is unclear what
features will be effective for automatic identification of such fake websites. Furthermore,
customized kernel functions have not been explored, with most studies using linear
SVMs or non-kernel methods. Additionally, it is also unclear what impact the information
level (i.e., website versus web page) will have on categorization performance, as most
prior research has focused only on web page level classification. Based on these gaps, we
present the following research questions:
• Which feature categories are best at differentiating fake escrow and spoof websites from real ones?
  o How can the use of an extended feature set enhance performance over individual feature categories?
• Can customized fake website kernels outperform the standard linear SVM classifier (including ensembles)?
• What impact does the classification level have on categorization performance?
• How will the performance vary across site types?
4.3.2 Research Framework
4.3.2.1 Fake Website Feature Set
The feature set was comprised of the five feature categories summarized in the
literature review (shown in Table 4.3). Body text features used included style markers
described in prior work (Abbasi and Chen, 2005; Zheng et al., 2006). These encompass various word
and character level lexical measures (Mishne et al., 2005; Ntoulas et al., 2006);
vocabulary richness metrics, and word length distributions, along with word, letter, digit,
and part-of-speech (POS) tag n-grams. Additional body text style markers utilized are
structural measures (including message, paragraph, and technical structure), a list of
5,513 common misspellings, a list of function words, and punctuation marks and special
characters. HTML tag n-grams were used for representing page design style (Urvoy et al.,
2006). The URL features included character and token n-grams (Ntoulas et al., 2006;
Ester et al., 2001). The image features employed were frequencies for pixel colors
(Baldwin, 2005), arranged into 10,000 color range bins. Link and structure features
included page and site level relative and absolute in/out links for each web page (Wu and
Davidson, 2006) along with the page level frequency distribution for all in/out link pages.
Site level in-links were derived from the Google search engine, as done in prior research
(Diligenti et al., 2000). All n-gram features require feature selection, commonly using the
information gain heuristic to govern selection (Koppel and Schler, 2003). Therefore, the
quantity for these features is unknown a priori.
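To make the selection step concrete, the sketch below scores binary n-gram presence features by information gain and keeps the top-ranked ones; the matrix sizes and the cutoff are hypothetical:

```python
import numpy as np

def entropy(labels):
    counts = np.bincount(labels)
    probs = counts[counts > 0] / len(labels)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(feature_present, labels):
    """IG of a binary feature (n-gram present/absent) with respect to the class label."""
    gain = entropy(labels)
    for value in (0, 1):
        mask = feature_present == value
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain

# Hypothetical binary page-by-n-gram matrix and real/fake labels.
X = np.random.randint(0, 2, size=(200, 500))
y = np.random.randint(0, 2, size=200)
scores = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
selected = np.argsort(scores)[::-1][:100]  # keep the 100 highest-IG n-grams
```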
Table 4.3: Fake Website Feature Set Description
Feature Group            Category               Quantity    Description/Examples
Body Text                Word Level Lexical     5           total words, % char. per word
                         Char. Level Lexical    5           total char., % char. per message
                         Letter N-Grams         < 18,278    count of letters (e.g., a, at, ath)
                         Digit N-Grams          < 1,110     count of digits (e.g., 1, 12, 123)
                         Word Length Dist.      20          frequency distribution of 1-20 letter words
                         Vocabulary Richness    8           e.g., hapax legomena, Yule’s K, Honore’s H
                         Special Characters     21          occurrences of special char. (e.g., @#$%^)
                         Function Words         300         frequency of function words (e.g., of, for, to)
                         Punctuation            8           occurrence of punctuation marks (e.g., !;:,.?)
                         POS Tag N-Grams        varies      part-of-speech tags (e.g., NNP, NNP JJ)
                         Message Structure      6           e.g., has greeting, has url, requoted content
                         Paragraph Structure    8           e.g., number of sentences per paragraph
                         Technical Structure    50          e.g., file extensions, font types, colors, sizes
                         Bag-of-word N-Grams    varies      e.g., “trusted”, “third party”, “trusted third”
                         Misspelled Words       < 5,513     e.g., “beleive”, “thougth”
HTML Style               HTML Tag N-Grams       varies      e.g., <HTML>, <HTML> <BODY>
URL and Anchor Text      Character N-Grams      varies      e.g., a, at, ath, /, _, :
                         Token N-Grams          varies      e.g., “spedition”, “escrow”, “trust”, “online”
Image                    Pixel Colors           10,000      frequency bins for pixel color ranges
                         Image Structure        40          image extensions, heights, widths, file sizes
Structure and Linkage    Site and Page Link     10          site and page level in/out links, relative/absolute
                         Site Structure         31          page level, link levels distribution
4.3.2.2 Fake Website Kernel Representations
We incorporated the standard linear kernel, which has been effectively utilized with
SVM in numerous text categorization studies (Joachims et al., 2001; Abbasi and Chen,
2005; 2008). More specifically, the linear SVM classifier has been used effectively in
several web spam detection studies (Kolari et al., 2006; Drost and Scheffer, 2005; Wu
and Davidson, 2006). In addition, we propose an average and max similarity kernel,
along with a composite kernel specifically tailored towards handling the properties of
fake websites (shown in Figure 4.2). The kernels compute the similarity for each web
page against all websites (based on the 5 feature categories outlined above), resulting in a
page-site similarity vector for each web page. The inner product between every two web
pages’ vectors is computed to produce a kernel matrix.
• Average Similarity Kernel
  o For a given web page, it computes the average similarity between that page and all pages in the comparison website.
  o Intended to represent common stylistic patterns across fake websites.
• Max Similarity Kernel
  o For a given web page, it computes the max similarity between that page and all pages in the comparison website.
  o Intended to represent content duplication patterns across fake websites.
• Composite Kernel
  o This kernel is a combination of the two kernels, considering both average and max similarity.
The composite kernel is a combination of the other two kernels, and therefore
considers both average and max similarity. Utilization of average and maximum
similarity enables the consideration of common patterns (via the average similarity score)
as well as content duplication (via the max similarity score) that may occur across fake
websites due to the templatic nature of these websites’ pages. Additionally, each kernel
function also considers the structural attributes of the pages being compared (for
example, the number of in/out links and page levels), allowing for a more accurate
representation of website similarity. In summary, while both the linear and composite
kernels consider content and linkage features, the composite kernel provides two
potential advantages: (1) for a given web page, it computes the average and maximum
similarity between that page and all pages in the comparison website; (2) it incorporates
structural attributes into each page-page comparison. These differences are intended to
enable a more holistic representation of the stylistic tendencies inherent across fake
websites.
Represent each page $a$ with the vectors:

$$x_a = \{Sim_{ave}(a, b_1), \ldots, Sim_{ave}(a, b_p)\}; \quad y_a = \{Sim_{max}(a, b_1), \ldots, Sim_{max}(a, b_p)\}$$

where:

$$Sim_{ave}(a, b) = \frac{1}{m} \sum_{k=1}^{m} \left[ 1 - \left( \frac{1}{n} \sum_{i=1}^{n} \frac{|a_i - k_i|}{a_i + k_i} \right) \cdot \frac{|lv_a - lv_b|}{lv_a + lv_b} \cdot \frac{|in_a - in_b|}{in_a + in_b} \cdot \frac{|out_a - out_b|}{out_a + out_b} \right]$$

$$Sim_{max}(a, b) = \max_{k \,\in\, \text{pages in site } b} \left[ 1 - \left( \frac{1}{n} \sum_{i=1}^{n} \frac{|a_i - k_i|}{a_i + k_i} \right) \cdot \frac{|lv_a - lv_b|}{lv_a + lv_b} \cdot \frac{|in_a - in_b|}{in_a + in_b} \cdot \frac{|out_a - out_b|}{out_a + out_b} \right]$$

for $b$ ranging over the $p$ websites in the training set and $k$ over the $m$ pages in site $b$; $a_1, \ldots, a_n$ and $k_1, \ldots, k_n$ are page $a$ and $k$'s feature vectors; $lv_a$, $in_a$, and $out_a$ are the page level and number of in/out links for page $a$.

The similarity between two pages is defined as the inner product between their two vectors $x_1, x_2$ and $y_1, y_2$:

$$\text{Average Kernel: } K(x_1, x_2) = \frac{\langle x_1, x_2 \rangle}{\sqrt{\langle x_1, x_1 \rangle \langle x_2, x_2 \rangle}}; \quad \text{Max Kernel: } K(y_1, y_2) = \frac{\langle y_1, y_2 \rangle}{\sqrt{\langle y_1, y_1 \rangle \langle y_2, y_2 \rangle}}$$

$$\text{Composite Kernel: } K(x_1 + y_1, x_2 + y_2) = \frac{\langle x_1, x_2 \rangle}{\sqrt{\langle x_1, x_1 \rangle \langle x_2, x_2 \rangle}} + \frac{\langle y_1, y_2 \rangle}{\sqrt{\langle y_1, y_1 \rangle \langle y_2, y_2 \rangle}}$$
Figure 4.2: Average, Max, and Composite Kernels for Fake Website Detection
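A simplified Python sketch of these computations is given below. It is an illustration under several assumptions (the structural terms are taken from the compared page, small constants guard against division by zero, and normalization follows the cosine form shown above), not the system's actual implementation:

```python
import numpy as np

def page_site_sim(page, site_pages, mode='ave'):
    """Similarity between one page and a comparison website (simplified Figure 4.2 form)."""
    a, (lv_a, in_a, out_a) = page
    sims = []
    for k, (lv_k, in_k, out_k) in site_pages:
        content = np.mean(np.abs(a - k) / (a + k + 1e-9))            # feature-vector difference
        struct = (abs(lv_a - lv_k) / (lv_a + lv_k + 1e-9)) \
               * (abs(in_a - in_k) / (in_a + in_k + 1e-9)) \
               * (abs(out_a - out_k) / (out_a + out_k + 1e-9))       # page level and in/out links
        sims.append(1.0 - content * struct)
    return max(sims) if mode == 'max' else float(np.mean(sims))

def kernel_matrix(vectors):
    """Normalized inner products between the pages' page-site similarity vectors."""
    V = np.asarray(vectors, dtype=float)
    G = V @ V.T
    norms = np.sqrt(np.diag(G))
    return G / np.outer(norms, norms)

# Tiny hypothetical example: two pages compared against two training websites.
site_b1 = [(np.array([1.0, 2.0, 3.0]), (2, 5, 4)), (np.array([1.0, 2.5, 3.0]), (3, 6, 4))]
site_b2 = [(np.array([4.0, 0.5, 1.0]), (1, 1, 9))]
train_sites = [site_b1, site_b2]
pages = [(np.array([1.0, 2.0, 3.1]), (2, 5, 5)), (np.array([0.5, 4.0, 1.0]), (1, 2, 9))]

x_vecs = [[page_site_sim(p, s, 'ave') for s in train_sites] for p in pages]
y_vecs = [[page_site_sim(p, s, 'max') for s in train_sites] for p in pages]
K_composite = kernel_matrix(x_vecs) + kernel_matrix(y_vecs)   # composite = average + max kernel
print(K_composite)
```

In practice the resulting matrix could be supplied to an SVM through a precomputed-kernel interface (e.g., SVC(kernel='precomputed') in scikit-learn).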
4.3.2.3 Research Hypotheses
H1: Features
The use of all features will outperform the use of any individual feature category for page and site level identification of fake escrow and spoof websites.
• H1a: All features > Body text / HTML / URL / Image / Structure and Linkage for page level categorization.
• H1b: All features > Body text / HTML / URL / Image / Structure and Linkage for site level categorization.
H2: Kernels
The composite kernel will outperform the linear, linear average similarity, and linear max similarity kernels for page and site level identification of fake escrow and spoof websites.
• H2a: Composite Kernel > Linear / Average Similarity / Max Similarity Kernel for page level categorization.
• H2b: Composite Kernel > Linear / Average Similarity / Max Similarity Kernel for site level categorization.
H3: Information Level
The website level performance will be better than the web page level performance for identification of fake escrow and spoof websites across features and techniques.
• H3a: kernels site level > kernels page level
• H3b: features site level > features page level
4.4 Evaluation
We conducted experiments to evaluate the proposed extended feature set and kernel
representations on two test beds comprised of fake escrow and spoof websites,
respectively. This section encompasses a description of the experimental setup (including
the test beds and experimental design), experimental results, and outcomes of the
hypotheses testing.
4.4.1 Experimental Setup
4.4.1.1 Test Beds
We collected 350 fake OES and 60 real escrow websites over a three month period
between 12/2006 - 2/2007. The fake OES website URLs were taken from two online
databases that post the HTTP addresses for verified fraudulent escrow sites: Escrow Fraud Prevention (http://escrow-fraud.com) and The Artists Against 4-1-9 (http://wiki.aa419.org). These sites allow defrauded traders to post URLs for fake escrow
sites. The site owners require all complaints to be accompanied with evidence that fraud
occurred. Such verification is important to ensure that the sites added to the databases are
indeed fraudulent.
For the second test bed, we gathered 300 spoof sites and 80 real websites (imitated by
these spoofs) over a one month period between 2/2007 - 3/2007. The real websites
included ones for various financial institutions and online payment services (e.g.,
www.bankofamerica.com, www.paypal.com). Only legitimate websites that also had spoofs in our test bed were used. The spoof site URLs were taken from an online
database that posts verified spoof sites, called Phish Tank (http://www.phishtank.com).
The site allows individuals to report URLs for potential spoofs. Once a URL has received
sufficient votes from various reports, it is considered a confirmed spoof website.
Since fake websites are often shut down or abandoned after they have been used,
these sites typically have a short life span (often less than a few days). In order to
effectively collect these sites, we developed a spidering program that monitored the
online databases and collected newly posted URLs daily. This was done in order to
retrieve the content from these fake OES sites before they disappeared. Table 4.4 below
shows the summary statistics for our two test beds.
Table 4.4: Description of Fake Website Test Beds
Test Bed           Category       Number of Sites   Number of Pages   Number of Images   Pages Per Site   Images Per Site
Escrow Websites    Real OES       60                19,812            6,653              330.20           110.88
                   Fake OES       350               69,684            29,764             199.10           85.04
                   Total          410               89,496            36,417             218.28           88.82
Spoof Websites     Real Sites     80                32,418            8,644              405.23           108.05
                   Spoof Sites    300               77,592            30,561             258.64           101.87
                   Total          380               110,010           39,205             289.50           103.17
4.4.1.2 Experimental Design
The experimental design included 6 feature sets (body text, HTML, URL, image,
link, and all) and four kernels (linear, average, max, and composite). This resulted in 24
experimental conditions each, for page and site level classification. We ran 50 bootstrap
instances for each condition, in which 50 real websites and 50 fake ones were randomly
selected in each bootstrap instance. All the web pages from the selected 100 sites were
used as the instances for that run. For each of the 50 bootstrap runs, we used the SVM
light package (Joachims et al., 2001) with 10-fold cross validation (i.e., 90% of the pages
used for training, 10% for testing in each fold). Page and site level classification accuracy
were used as the evaluation metrics:
Page level =
Site level =
Number of Correctly Classified Web Pages
Total Number of Web Pages
Number of Sites with Greater than 50%Web Pages Correctly Classified
Total Number of Websites
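The two metrics can be computed directly from page-level predictions, as in the short sketch below (the grouping of pages into sites is passed in explicitly; the example values are hypothetical):

```python
from collections import defaultdict

def page_and_site_accuracy(page_labels, page_preds, page_sites):
    """Page-level accuracy, plus site-level accuracy where a site counts as correct
    when more than 50% of its pages are classified correctly."""
    correct = [label == pred for label, pred in zip(page_labels, page_preds)]
    page_acc = sum(correct) / len(correct)

    per_site = defaultdict(list)
    for site, ok in zip(page_sites, correct):
        per_site[site].append(ok)
    site_acc = sum(1 for oks in per_site.values() if sum(oks) / len(oks) > 0.5) / len(per_site)
    return page_acc, site_acc

print(page_and_site_accuracy([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], ['A', 'A', 'B', 'B', 'B']))
# -> (0.6, 0.5): 3 of 5 pages correct; site B has 2/3 pages correct, site A only 1/2.
```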
Consistent with previous research, information gain (IG) was used to select all n-gram quantities (Koppel and Schler, 2003). IG was performed on the 90% training data
used in each fold of the cross validation for all 50 bootstrap runs. The average number of
features used for each category for the two test beds is shown below in Table 4.5. The
image and link feature sets were static since they did not include attributes such as n-grams. The feature sets were used directly by the linear kernel, while the average, max, and composite kernels used these features as input into the kernel function. The “All” setting combined the features from all five categories.
Table 4.5: Average Number of Features used by the Linear Classifiers
Feature Category       Escrow Site Features    Spoof Site Features
Body Text              12,274                  9,759
HTML                   6,865                   6,541
URL                    8,836                   5,713
Image                  10,040                  10,040
Link                   41                      41
Total (All features)   38,056                  32,054
4.4.2 Experiment 1: Fake Escrow Websites
Table 4.6 and Figure 4.3 show the experimental results for the various feature and
kernel combinations on the fake escrow website test bed. The body text, HTML, URL,
and link features all performed well, with accuracies in the mid 90% range for page and
site level classification. Although image features only had up to 83% accuracy, this is also
quite promising. Since our image feature set was fairly simplistic, the performance with
the composite kernel suggests that image duplication is pervasive in fake escrow
websites. The use of all features outperformed individual feature categories for all
kernels. This supports the notion that “fraud cues” in fake escrow websites occur across
feature types.
Table 4.6: Average Page and Site Level Classification Accuracy across 50 Bootstrap Runs
Page Level Classification
SVM Kernels    Body Text    HTML     URL      Image    Link     All
Linear         96.92        93.99    97.08    72.26    90.82    97.80
Average        88.76        91.77    86.54    71.80    88.31    92.89
Max            86.93        89.52    84.30    74.56    86.47    91.36
Composite      95.98        95.98    95.93    78.18    92.09    98.97

Site Level Classification
SVM Kernels    Body Text    HTML     URL      Image    Link     All
Linear         97.68        94.36    97.80    76.04    93.92    97.96
Average        91.64        93.22    89.88    73.54    91.04    94.80
Max            89.98        91.60    87.52    77.96    90.26    93.74
Composite      97.86        97.74    97.72    83.14    95.69    98.44

[Figure: grouped bar charts of % Accuracy by feature set (Body Text, HTML, URL, Image, Link, All) for the Linear, Average, Max, and Composite kernels; panel a) page level, panel b) site level.]
Figure 4.3: Page and Site Level Performance for Various Features and Kernels
The composite kernel outperformed the average and max similarity kernels on all
feature sets for page and site level categorization. This implies that average and
maximum similarity scores provide complementary information for fake website
categorization. The linear kernel outperformed the composite kernel on the body text and
HTML features for page level categorization. However, the composite kernel
outperformed the linear SVM on all other feature sets (i.e., URLs, images, links, and all
features). The composite kernel performed even better on site level categorization,
outperforming the linear SVM method on all features except body text.
As expected, the site level performance was better than the page level accuracy.
However, the difference was only 3-4% or less on average, with the biggest jump coming
on the image features. This is somewhat surprising, considering that our metric assigned
all sites to the class for which the majority of its pages were assigned. Hence, 70%-80%
page level accuracy could result in 100% site level performance provided the errors are
distributed evenly across sites. Obviously this was not the case, as further illustrated in
the hypotheses results discussion in section 4.4.4.
4.4.3 Experiment 2: Spoof Sites
Table 4.7 and Figure 4.4 show the experimental results for the spoof site test bed. The
body text, HTML, and URL style features performed best, with accuracies in the upper
90% range for page and site level classification. The use of all features outperformed
individual feature categories for all kernels, typically outperforming body text, URL, and
HTML feature categories by 2%-5%. In general, all feature categories performed better
on the spoof sites as compared to the escrow website categorization test bed utilized in
experiment 1 (accuracies 2%-3% higher). Image feature performance was particularly
augmented, with accuracies typically 10% higher than for experiment 1. These results are
consistent with prior research, where image features have also been shown to be highly
informative for spoof website detection (Chou et al., 2004).
Table 4.7: Average Page and Site Level Classification Accuracy across 50 Bootstrap Runs
Page Level Classification
SVM Kernels    Body Text    HTML     URL      Image    Link     All
Linear         98.06        97.25    97.74    83.91    93.83    98.95
Average        94.87        92.39    92.56    81.22    89.02    96.69
Max            88.74        91.27    90.13    85.68    84.30    92.58
Composite      98.63        98.16    98.17    88.74    95.01    100.00

Site Level Classification
SVM Kernels    Body Text    HTML     URL      Image    Link     All
Linear         99.80        99.66    99.74    92.90    97.98    99.94
Average        97.14        96.82    96.86    92.04    95.56    98.90
Max            94.66        96.48    95.80    93.74    92.08    97.06
Composite      99.88        99.76    99.74    95.02    98.86    100.00

[Figure: grouped bar charts of % Accuracy by feature set (Body Text, HTML, URL, Image, Link, All) for the Linear, Average, Max, and Composite kernels; panel a) page level, panel b) site level.]
Figure 4.4: Page and Site Level Performance for Various Features and Kernels
The composite kernel outperformed all three comparison kernels for page and site
level categorization. The enhanced performance was consistent across feature sets. The
composite kernel outperformed the linear kernel by a wide margin when using image and
link features, and attained 100% accuracy when using all features. Consistent with
experiment 1, the composite kernel also outperformed the average and max similarity
kernels by 4%-10% on all features. This indicates that average and max similarity
provide complementary information for spoof site detection (as they do for fake escrow
website categorization). The composite kernel performance was even better on spoof sites
as compared to fake escrow websites, suggesting that fake escrow website detection may
be somewhat more challenging than spoof site identification.
Once again, the difference between page and site level performance was not as large
as anticipated (although still greater than for experiment 1). This indicates that the
erroneous web pages tend to be concentrated within a few websites, which results in site
level performance that is not considerably higher than page level classification accuracy.
4.4.4 Hypotheses Testing
We conducted pairwise t-tests on the 50 bootstrap runs for both test beds. Given the
large number of comparison conditions, a Bonferroni correction was performed to avoid
spurious positive results. All p-values less than 0.0005 were considered significant at
alpha = 0.01. The t-tests were performed on features, kernels, and page versus site level
classification performance.
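The following sketch illustrates how such a Bonferroni-corrected paired comparison could be run with SciPy; the accuracy arrays here are randomly generated placeholders, not the study's results:

```python
import numpy as np
from scipy import stats

# Accuracy across 50 bootstrap runs for two conditions (illustrative data only).
acc_all_features = np.random.default_rng(0).uniform(0.97, 1.00, size=50)
acc_body_text = np.random.default_rng(1).uniform(0.95, 0.99, size=50)

# Paired t-test on the matched bootstrap runs.
t_stat, p_value = stats.ttest_rel(acc_all_features, acc_body_text)

# Bonferroni correction: with 20 comparisons at alpha = 0.01,
# only p-values below 0.01 / 20 = 0.0005 are treated as significant.
alpha, num_comparisons = 0.01, 20
significant = p_value < alpha / num_comparisons
print(f"t = {t_stat:.3f}, p = {p_value:.6f}, significant: {significant}")
```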
4.4.4.1 H1: Features (All vs. Individual Categories)
For the fake escrow website experiment, we conducted 20 pairwise t-tests comparing the use of all features against the 5 individual feature categories for all 4 kernels. The pairwise t-test results for page and site level performance are shown in Table 4.8. For page level performance, the use of all features significantly outperformed all other feature categories across all 4 kernels at alpha = 0.01, with all p-values less than 0.0001. For
site level classification, the use of all features significantly outperformed individual
feature categories for 17 out of 20 conditions (all p-values less than 0.0005). The use of
all features did not significantly outperform the body text features when using the linear
(p-value 0.0424) and composite kernels (p-value 0.0041), and the HTML features when
using the linear kernel (p-value 0.2074).
Table 4.8: P-Values for Pairwise t-Tests on Accuracy (n=50) for Escrow Websites

Page Level
H1a: Features    Linear      Average     Max         Composite
All vs. BT       <0.0001+    <0.0001+    <0.0001*    <0.0001*
All vs. HTML     <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. URL      <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. Image    <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. Link     <0.0001*    <0.0001*    <0.0001*    <0.0001*

Site Level
H1b: Features    Linear      Average     Max         Composite
All vs. BT       0.0424      <0.0001*    <0.0001*    0.0041
All vs. HTML     0.2074      <0.0001*    <0.0001*    <0.0001*
All vs. URL      <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. Image    <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. Link     <0.0001*    <0.0001*    <0.0001*    <0.0001*

* significant at alpha = 0.01; + result contradicts hypothesis
For the spoof website experiment (shown in Table 4.9), the use of all features for page
level classification also significantly outperformed all other feature categories across all 4
kernels with all p-values less than 0.0005. For site level classification, the use of all
features significantly outperformed individual feature categories for 17 out of 20
conditions (all p-values less than 0.0001). The performance of all features was not significantly better than that of the body text features when using the linear (p-value 0.0034) and composite kernels (p-value 0.0063), or the URL features when using the linear kernel (p-value 0.0017).
Table 4.9: P-Values for Pairwise t-Tests on Accuracy (n=50) for Spoof Websites

Page Level
H1a: Features    Linear      Average     Max         Composite
All vs. BT       <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. HTML     <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. URL      <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. Image    <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. Link     <0.0001*    <0.0001*    <0.0001*    <0.0001*

Site Level
H1b: Features    Linear      Average     Max         Composite
All vs. BT       0.0034      <0.0001*    <0.0001*    0.0063
All vs. HTML     0.0003*     <0.0001*    <0.0001*    <0.0001*
All vs. URL      0.0017      <0.0001*    <0.0001*    <0.0001*
All vs. Image    <0.0001*    <0.0001*    <0.0001*    <0.0001*
All vs. Link     <0.0001*    <0.0001*    <0.0001*    <0.0001*

* significant at alpha = 0.01
The results validate the importance of using a rich feature set encompassing text and
image content attributes along with structure and linkage based features for categorizing
fake websites. We present two examples to illustrate why the extended feature set was
able to garner improved performance. Figure 4.5 shows an example of a fake escrow
website called ShipNanu (www.shipnanu.addr.com). The fraudulent website could not be
categorized correctly using link features. This was because it had over 400 site level in-links (derived from Google) and numerous out-links. Furthermore, ShipNanu also had a
large site map, with numerous inter-connected pages and images. Hence, the website’s
inter-link and intra-link features resembled those of legitimate escrow websites. However,
the site shared content patterns with other fake OES sites, including similarities in body
text (BT), image (IM), HTML (HS), and URL and anchor text (UA) attributes. Thus, while link features could not categorize this site, content features were able to.
Figure 4.5: Fake Escrow Website Detected Using Content Features
Figure 4.6 shows the legitimate website Escrow.com along with two fake replicas.
Since the replicas copied web pages directly from the original Escrow.com, text and
image content features were unable to identify web pages from the replicas as fake.
However, the replicas differed from the original in terms of link features (which allowed
them to be detected as fraudulent). Replica #1 was a full replica with a similar site map
but had only 3 site level in-links, a low number for legitimate sites. Replica #2 had higher
in-links but was a partial copy devoid of a large portion of the original’s FAQ section,
resulting in a less dense site map. The two examples presented in Figures 4.5 and 4.6
exemplify how a rich holistic feature set incorporating a wide array of content and
linkage based features can enhance the detection of fake websites.
Figure 4.6: Escrow Website Replicas Detected Using Linkage Features
4.4.4.2 H2: Kernels (Composite vs. Linear / Average / Max)
For the fake escrow website experiment, the composite kernel significantly
outperformed the linear, average, and max kernels for most feature and information level
settings (as shown in Table 4.10). For site and page level classification, the use of the
composite kernel significantly outperformed the other three kernels for 16 out of 18
conditions (all p-values less than 0.0005). The two non-significant conditions were when
comparing the composite kernel against the linear kernel when using body text and
HTML features. The linear kernel significantly outperformed the composite kernel for
page level classification using body text and HTML features.
Table 4.10: P-Values for Pairwise t-Tests on Accuracy (n=50) for Escrow Websites

Page Level
H2a: Kernels            Body text   HTML        URL         Image       Link        All
Composite vs. Linear    <0.0001+    <0.0001+    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Composite vs. Average   <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Composite vs. Max       <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*

Site Level
H2b: Kernels            Body text   HTML        URL         Image       Link        All
Composite vs. Linear    0.1092+     0.3231      <0.0001*    <0.0001*    <0.0001*    <0.0001*
Composite vs. Average   <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Composite vs. Max       <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*

* significant at alpha = 0.01; + result contradicts hypothesis
The p-values for the spoof site experiment are shown in Table 4.11. The composite
kernel significantly outperformed the linear, average, and max kernels for all features on
page level classification (all p-values less than 0.0005). On the site level classification,
the composite kernel outperformed the average and max kernels for all feature sets. It
also significantly outperformed the linear kernel on the image and link features, but not
on the other feature sets.
Table 4.11: P-Values for Pairwise t-Tests on Accuracy (n=50) for Spoof Site Test Bed

Page Level
H2a: Kernels            Body text   HTML        URL         Image       Link        All
Composite vs. Linear    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Composite vs. Average   <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Composite vs. Max       <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*

Site Level
H2b: Kernels            Body text   HTML        URL         Image       Link        All
Composite vs. Linear    0.0222      0.0119      0.1611      <0.0001*    <0.0001*    0.0416
Composite vs. Average   <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Composite vs. Max       <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*

* significant at alpha = 0.01
The composite kernel had the best overall performance, with 98%-100% accuracy
when using all features. The improved performance of the composite kernel was largely
attributable to its use of both average and maximum similarity. Some fake pages (e.g., those with stylistic tendencies) are better identified based on average similarity, while others (e.g., those with duplicated content) are more easily discriminated using maximum similarity. The
effectiveness of using average and maximum similarity is illustrated by the example
presented in Figure 4.7. The diagram depicts the similarity for two pages (A and B) taken
from a fake OES site compared against all web pages from the fraudulent site
www.bssew.com. The two graphs are the site maps for www.bssew.com, with each web
page represented with a node, and lines between nodes indicating linkage. Each node’s
darkness indicates the level of similarity between that particular www.bssew.com page
and pages A and B (with darker nodes having a higher similarity). The similarity scores
were computed using all features. Page A shares stylistic patterns with many pages in
www.bssew.com, resulting in a high average similarity (evidenced by the predominance
of gray page nodes). In contrast, Page B shares duplicated content with some pages in
www.bssew.com, but little similarity with other pages (indicated by the presence of a few
black page nodes and many lighter ones). The example illustrates how the composite
kernel can effectively categorize unknown escrow website pages (such as pages A and B)
by comparing them against prior known fraudulent escrow websites. The kernel can
check the page of interest for stylistic patterns/similarities that it may share with a large
number of web pages from previous fake OES websites. Alternatively, the composite kernel
can also detect fraud cues stemming from duplicated content that the page of interest may
have rehashed from a specific subset of pages from those prior sites.
Figure 4.7: Similarities for Two Phony Pages Compared against Fake Website
www.bssew.com
4.4.4.3 H3: Information Level (Page vs. Site)
Table 4.12 shows the p-values for the pairwise t-tests conducted on the fake escrow website experiment. The site level performance was generally significantly greater than the page level accuracy (for 21 of 24 conditions). However, site level performance was not significantly better than page level for the linear kernel when using all features, and it actually deteriorated for the composite kernel. As previously stated, we would have expected the site level performance to be far better; however, this was not the case.
Table 4.12: P-Values for Pairwise t-Tests on Accuracy (n=50) for Escrow Websites

H3a-b       Body text   HTML        URL         Image       Link        All
Linear      <0.0001*    <0.0001*    0.4043      <0.0001*    <0.0001*    0.1231
Average     <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Max         <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Composite   <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    0.0261+

* significant at alpha = 0.01; + result contradicts hypothesis
For the spoof site t-tests (shown in Table 4.13), the site level performance was
significantly greater than the page level accuracy (for 23 of 24 conditions). This suggests
that the spoof site errors were not as concentrated within a few websites as compared to
the fake escrow websites, allowing the site level performance to significantly outperform
the page level performance.
Table 4.13: P-Values for Pairwise t-Tests on Accuracy (n=50) for Spoof Site Test Bed

H3a-b       Body text   HTML        URL         Image       Link        All
Linear      <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Average     <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Max         <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*
Composite   <0.0001*    <0.0001*    <0.0001*    <0.0001*    <0.0001*    0.5000

* significant at alpha = 0.01
Figure 4.8 shows the concentration of percentage cumulative page level error for
different feature types using the composite kernel on the fake escrow test bed. The values
depicted are averaged across all 50 bootstrap instances, with the error contribution of the
10 most erroneous sites placed on the x-axis in descending order. We can see that for
most feature types, as many as 50% of the incorrectly classified pages came from 2-3
websites. These sites were consequently the ones with less than 50% of their pages
correctly classified, resulting in incorrect site level assignment. Such a page level error
concentration was responsible for the lack of considerably improved site level
performance, which is why most feature sets only had a 1%-2% improvement in site level
accuracy. In contrast, the image and link features had their errors relatively more evenly
distributed across websites. This explains why their site level performance was generally
4%-7% higher than their page level performance.
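The error-concentration measure underlying Figure 4.8 can be computed directly from per-site misclassification counts; the sketch below uses hypothetical counts purely for illustration:

```python
import numpy as np

# Hypothetical number of misclassified pages per website for one feature set.
errors_per_site = np.array([14, 9, 6, 3, 2, 2, 1, 1, 1, 1])

# Sort sites by error contribution and compute the cumulative share of errors.
sorted_errors = np.sort(errors_per_site)[::-1]
cumulative_share = np.cumsum(sorted_errors) / sorted_errors.sum()
print(cumulative_share[:3])  # ~[0.35, 0.58, 0.73]: a few sites dominate the errors
```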
[Line chart: % of cumulative page level error (y-axis, 0 to 1) versus site number 1-10 (x-axis), plotted for the Body Text, HTML, URL, Image, Link, and All feature sets.]
Figure 4.8: Cumulative Page Level Errors for Features on Escrow Website Test Bed
4.5 Conclusions
In this essay we evaluated the effectiveness of automated approaches for fake website
identification. Our study involved evaluation of various features and kernel
representations for categorization of fraudulent escrow and spoof sites. The results
indicated that the use of the proposed composite kernel coupled with a rich feature set is
capable of effectively identifying fake websites. We attained over 98% accuracy for page
and site level classification when differentiating between legitimate and fake escrow
websites and 100% accuracy for page and site level classification of spoof sites.
In addition to proposing an approach for fake website detection, our analysis revealed
several key findings. Fake website “fraud cues” are inherent in body text, HTML, URL,
link, and image features. Therefore, the use of an extended feature set that employs a rich
set of content and structure/linkage based features can enhance fake website
identification capabilities. We also confirmed that fake websites share stylistic pattern
and content duplication tendencies with other fake and real websites. Escrow fraud sites
typically duplicate content from other fake OES sites (analogous to many web spam
sites) while spoof sites attempt to replicate legitimate websites. We observed that site
level categorization of fake websites is not necessarily easier than page level (especially
for fake OES sites) since most page-level errors are concentrated in a few sites.
We have identified several future directions pertaining to the features and kernels
used to represent fake websites. With respect to features, we intend to assess the impact
of more complex link features, such as the TrustRank scores proposed by Gyongi and Garcia-Molina (2005) and the temporal features introduced by Shen et al. (2006) and Lin et al. (2007). We also plan to explore kernel hierarchies that combine the composite
kernel used here with feature representation level kernels. For example, instead of using
the linear inner product to compare web pages, we could use tree or string kernels for
text, polynomial kernels for images, and graph kernels for linkage information.
CHAPTER 5: A COMPARISON OF TOOLS FOR DETECTING FAKE WEBSITES
5.1 Introduction
In the previous chapter, we compared various features and kernels for fake website
detection. In this chapter, we compare various systems for detecting fake websites. The
systems are compared on two test beds: generated fraud and spoof sites.
The increased popularity of the Internet has attracted opportunists seeking to
capitalize on the asymmetric nature of online information exchange. Consequently many
forms of fake and deceptive websites have appeared (Chua and Wareham, 2004),
including web spam, generated fraud sites, and spoof sites (Figure 5.1). Web spam sites
attempt to deceive search engines to boost their ranking (Gyongi and Molina, 2005).
Their objective is search engine optimization (SEO); web spam is often used for profit.
For instance, the cell phone spam domain in Figure 5.1 is for sale for $350. Since there
has been considerable progress on web spam detection (Gyongi and Molina, 2005), we
focus our attention towards generated fraud and spoof sites. Generated fraud sites are
deceptive sites attempting to appear as legitimate commercial entities. Figure 5.1 shows
an example of a generated fraud site for a phony investment bank called “Troy Inc.” The
objective of such sites is failure-to-ship fraud; they collect unsuspecting users’ money and
disappear (Chua and Wareham, 2004). Generated fraud sites commonly pose as fake
escrow, financial, delivery, or retail companies (Abbasi and Chen, 2007). In contrast,
spoof sites are imitations of real commercial sites, intended to deceive the authentic sites’
customers (Chou et al., 2004). The objective of spoofs is identity theft: capturing users’ account information by having them log into a fake site. Commonly spoofed sites include
eBay, PayPal (shown in Figure 5.1), and various banks (Liu et al., 2006).
Figure 5.1: Fake Website Examples
Fake websites are often very professional looking and difficult to identify as phony
(MacInnes et al., 2005). In response to increasing user awareness, fraudsters are also
becoming more sophisticated (Levy and Arce, 2002). Accordingly, there is a need for
enhanced fake website detection techniques (Chou et al., 2004). Numerous tools have
been proposed; however, they have several shortcomings. Most are reactive lookup systems that rely solely on user-reported blacklists of fake URLs. Few systems using proactive classification techniques have been proposed, and those that have utilize overly simplistic features and classification heuristics. Furthermore, while there has been
considerable focus on spoof site detection tools, generated fraud sites have received little
attention in spite of their increasing prevalence (Abbasi and Chen, 2007). It is unclear
how effective existing tools would be at detecting generated fraud sites. Since generated
sites do not simply mimic popular commercial websites, effectively identifying them may
require more involved methods.
To confront these challenges we propose a classifier system for identifying fake
websites. The AZProtect system is capable of detecting generated fraud and spoof sites. It
uses a rich feature set and composite SVM kernel based model for enhanced fake website
detection capabilities. The system can also be combined with a lookup mechanism for
hybridized detection using a dynamic classifier that is periodically retrained on the
updated blacklist entries. This article presents experimental results comparing the
proposed system against several existing tools for detection of generated fraud and spoof
sites.
5.2 Fake Website Detection Tools
Several systems have been developed for identifying fake websites. These tools
belong to two categories: lookup and classifier systems. A discussion of these categories, including a review of example systems and their advantages and disadvantages, is presented below.
5.2.1 Lookup Systems
Lookup systems use a client-server architecture where the server side maintains a
blacklist of known fake URLs (Li and Helenius, 2007; Zhang et al., 2007). The client-side tool checks the blacklist and provides a warning if a website poses a threat. Lookup
systems employ collaborative sanctioning mechanisms similar to those used in reputation
ranking (Hariharan et al., 2006). The blacklists are generated and updated from two
sources: online communities of practice and system users. Online communities such as
the Anti-Phishing Working Group and the Artists Against 4-1-9 have developed databases
of known generated fraud and spoof websites. Lookup systems also consider URLs
directly reported or rated by system users. Numerous lookup systems are available.
Perhaps the most popular is Microsoft’s IE7 Phishing Filter. This tool uses a client side
whitelist coupled with a server side blacklist gathered from IE7 user reports and online
databases. Similarly, Mozilla Firefox’s FirePhish toolbar and the EarthLink toolbar also
maintain a blacklist of spoof URLs. Firetrust’s Sitehound system stores spoof and
generated fraud site URLs taken from online sources such as the Artists Against 4-1-9.
An advantage of lookup systems is that they typically have high precision, since they are less likely to consider authentic sites fake (Zhang et al., 2007). They are also easier to implement and computationally faster than most classifier systems; comparing URLs against a list of known fakes is fairly simple. Pitfalls include higher levels of false negatives (i.e., failing to identify fake websites). The blacklist is limited to a small number of online resources and may lack coverage. For example, the IE7 Phishing Filter and FirePhish tools only store URLs for spoof sites, making them inept against generated fraud sites. Furthermore, the performance of lookup systems may vary based on the time of day and the interval between report and evaluation time (Zhang et al., 2007), due to the increased likelihood over time of a site being reported and added to the blacklist. This allows fraudsters a better opportunity of succeeding before being blacklisted; 5% of spoof site recipients are defrauded in spite of the availability of a plethora of web browser-integrated lookup systems (Liu et al., 2006).
5.2.2 Classifier Systems
Classifier systems are client-side tools that apply rule or similarity based heuristics to
website content or domain registration information (Wu et al., 2006; Zhang et al., 2007).
Several classifier systems have been developed for fake website detection. SpoofGuard
uses web page features, such as image hashes, password encryption checks, URL
similarities, and domain registration information (Chou et al., 2004). Netcraft’s classifier
relies on domain registration information such as the domain name, host name, host
country and registration date (Wu et al., 2006). eBay’s Account Guard tool compares the
content of the URL of interest with legitimate eBay and PayPal sites (Zhang et al., 2007).
SiteWatcher (now called Reasonable Anti-Phishing) uses visual similarity assessment
based on 40 body text, page style, and image features (Liu et al., 2006). A page is
considered a spoof if its similarity is above a threshold when compared against a client
whitelist.
Classifier systems provide numerous benefits. They can offer better coverage for
spoof and generated fake sites than lookup systems (Abbasi and Chen, 2007). Classifier
systems are also proactive, capable of detecting fakes independent of blacklists. Consequently, classifier systems are not impacted by time of day or the interval between when a URL is visited and its first appearance in an online database (Zhang et al., 2007). Caveats of classifier systems include computational cost; they can take longer to classify web pages than lookup systems. They are also more prone to false positives (Zhang et al., 2007). Generalizability of classification models over time can be another issue, especially if the fake websites are constantly changing and evolving. For instance, the Escrow Fraud online database (http://escrow-fraud.com) has over 190 unique templates for generated fraud sites, with new ones added constantly. In such situations the classification model must also adapt and relearn.
5.2.3 Hybrid Systems and Dynamic Classifiers
Hybrid systems combine classifier and lookup mechanisms. Such tools generally use
simple content and domain registration information in unison with server side blacklists
(Li and Helenius, 2007). URLs on the blacklist are blocked, while others are evaluated by
the classifier. Examples of hybrid systems include Netcraft and eBay Account Guard.
Dynamic hybrid systems using blacklists to update their classifiers could be highly
effective against constantly changing fake website patterns. SpoofGuard does some
updating; it stores image hashes for websites visited, allowing it to check for image
duplication (Chou et al., 2004). Nevertheless, there has been limited work on dynamic classifiers for fake website detection.
5.2.4 Summary of Existing Tools
Many studies have evaluated the effectiveness of fake website detection tools from a
usability perspective (Li and Helenius, 2007; Wu et al., 2006). However, there has been limited evaluation of tools from a detection accuracy perspective. Zhang et al. (2007)
compared 11 tools’ ability to detect 200 spoof URLs and 516 real sites. They found that
SpoofGuard, Netcraft, and IE7 Phishing Filter had the best spoof website detection rates.
IE7 and Netcraft had the best overall performance; SpoofGuard was prone to high rates
of false positives. They also observed that lookup system performance was impacted by
the interval between when a site is reported and evaluated.
Table 5.1: Summary of Fake Website Detection Tools

Tool Name | System Type: Classifier | System Type: Lookup | Website Type | Prior Results (Spoof Sites)
CallingID | Domain registration information | Server-side blacklist | Spoof sites | Overall: 85.9%; Spoof Detection: 23.0%
Cloudmark | None | Server-side blacklist | Spoof sites | Overall: 83.9%; Spoof Detection: 45.0%
EarthLink Toolbar | Domain registration information | Server-side blacklist | Spoof sites | Overall: 90.5%; Spoof Detection: 68.5%
eBay Account Guard | Content similarity heuristics | Server-side blacklist | Spoof sites (primarily eBay and PayPal) | Overall: 83.2%; Spoof Detection: 40.0%
FirePhish | None | Server-side blacklist | Spoof sites | Overall: 89.2%; Spoof Detection: 61.5%
IE7 Phishing Filter | None | Client-side whitelist, server-side blacklist | Spoof sites | Overall: 92.0%; Spoof Detection: 71.5%
Netcraft | Domain registration information | Server-side blacklist | Generated sites, spoof sites | Overall: 91.2%; Spoof Detection: 68.5%
SiteWatcher | Text and image feature similarity, stylistic feature correlation | Client-side whitelist | Spoof sites | N/A
Sitehound | None | Server-side blacklist downloaded by client | Generated sites, spoof sites | N/A
SpoofGuard | Image hashes, password encryption, URL similarities, domain registration information | None | Generated sites, spoof sites | Overall: 67.7%; Spoof Detection: 93.5%
GeoTrust TrustWatch | None | Server-side blacklist | Spoof sites | Overall: 85.1%; Spoof Detection: 46.5%
Table 5.1 shows a summary of existing fake website detection tools. For each tool,
the table depicts the system type, applicable fake website categories, and prior results:
overall accuracy (real and fake sites) and spoof site detection rates (Zhang et al., 2007).
From the table we can make several observations. There has been no prior evaluation on
generated fraud websites. Furthermore, most systems use a lookup mechanism; the few
existing classifier tools perform little content analysis of web pages (e.g., EarthLink,
Netcraft). There is a need for classifiers that use rich feature sets in order to keep pace
with the sophistication of fake websites (Levy and Arce, 2002; Liu et al., 2006). There
has also been limited utilization of hybrid systems that combine classifiers with a lookup
mechanism. Hybrid systems could leverage the enhanced precision of lookup
mechanisms and the coverage of classifiers. Lookup information could also be integrated
into dynamic classifiers capable of learning emerging fake website patterns.
5.3 Proposed Approach
We propose a classifier system that uses a rich feature set and kernel based machine
learning classifier (Figure 5.2). The AZProtect system is capable of classifying generated
fraud sites and spoof sites. Whereas existing systems only evaluate the current page’s URL, the proposed system analyzes multiple web pages from the website in question for improved performance.
Figure 5.2: Proposed AZProtect System Overview
AZProtect utilizes a feature set comprised of nearly 6,000 attributes from 5
information types: body text, HTML design, images, linkage, and URLs. The body text
attributes consist of approximately 2,500 word level (e.g., “bank of” “bank of america”)
and character level (e.g., “pa” “pay”) n-grams while the HTML design features
encompass over 1,000 HTML tag n-grams (e.g., “<html><head>”). The image features
include pixel color frequencies arranged into 1,000 bins as well as 40 image structure
attributes (e.g., image height, width, file extension, file size). The feature set also includes
1,500 token and character level n-grams derived from URLs and anchor text (e.g., “https”
“org”). All n-gram features consist of unigrams, bigrams, and trigrams.
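As an illustration of how such n-gram attributes can be extracted, the sketch below uses scikit-learn's CountVectorizer for word, character, and HTML tag n-grams; the tokenization details and the exact 6,000-attribute set used by AZProtect are not reproduced here:

```python
from sklearn.feature_extraction.text import CountVectorizer

body_text = ["Welcome to Bank of America online banking", "Pay securely with escrow"]
html_source = ["<html><head><title>Login</title></head>", "<html><body><form>"]

# Word-level unigrams, bigrams, and trigrams over the body text.
word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 3))

# Character-level n-grams (e.g., "pa", "pay") over the same text.
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 3))

# HTML tag n-grams, treating each tag as a token (e.g., "<html> <head>").
tag_ngrams = CountVectorizer(token_pattern=r"<[^>]+>", ngram_range=(1, 2))

X_word = word_ngrams.fit_transform(body_text)
X_char = char_ngrams.fit_transform(body_text)
X_tag = tag_ngrams.fit_transform(html_source)
print(X_word.shape, X_char.shape, X_tag.shape)
```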
The Support Vector Machine (SVM) classifier uses a linear composite kernel (Figure
5.3). The kernel function is tailored towards representing the content similarity and
duplication tendencies of fake websites. It compares pages’ feature vectors against
training site pages and considers the average and maximum similarity for pattern and
duplication detection. The kernel also incorporates page linkage and structure information
in each comparison (i.e., page levels, and in/out links). Using the trained SVM model, a
website is considered fake if greater than 50% of its pages are classified as fake. Using
multiple pages is intended to allow for better detection in situations where a single fake
page may not contain sufficient fraud cues (Abbasi and Chen, 2007).
Represent each page $a$ with the vectors:

$$x_a = \{Sim_{ave}(a, b_1), \ldots, Sim_{ave}(a, b_p)\}; \quad y_a = \{Sim_{max}(a, b_1), \ldots, Sim_{max}(a, b_p)\}$$

where:

$$Sim_{ave}(a, b) = 1 - \frac{1}{m}\sum_{k=1}^{m}\left[\left(\frac{|lv_a - lv_k|}{lv_a + lv_k}\right)\left(\frac{|in_a - in_k|}{in_a + in_k}\right)\left(\frac{|out_a - out_k|}{out_a + out_k}\right)\left(\frac{1}{n}\sum_{i=1}^{n}\frac{|a_i - k_i|}{a_i + k_i}\right)\right]$$

$$Sim_{max}(a, b) = \max_{k \,\in\, \text{pages in site } b}\left\{1 - \left(\frac{|lv_a - lv_k|}{lv_a + lv_k}\right)\left(\frac{|in_a - in_k|}{in_a + in_k}\right)\left(\frac{|out_a - out_k|}{out_a + out_k}\right)\left(\frac{1}{n}\sum_{i=1}^{n}\frac{|a_i - k_i|}{a_i + k_i}\right)\right\}$$

for $b \in$ the $p$ web sites in the training set and $k \in$ the $m$ pages in site $b$, where $a_1, \ldots, a_n$ and $k_1, \ldots, k_n$ are page $a$'s and page $k$'s feature vectors, and $lv_a$, $in_a$, and $out_a$ are the page level and number of in/out links for page $a$.

The similarity between two pages is defined as the inner product between their vectors $x_1, x_2$ and $y_1, y_2$:

$$\text{Average Kernel: } K(x_1, x_2) = \frac{\langle x_1, x_2\rangle}{\sqrt{\langle x_1, x_1\rangle \langle x_2, x_2\rangle}}; \quad \text{Max Kernel: } K(y_1, y_2) = \frac{\langle y_1, y_2\rangle}{\sqrt{\langle y_1, y_1\rangle \langle y_2, y_2\rangle}}$$

$$\text{Composite Kernel: } K(x_1 + y_1, x_2 + y_2) = \frac{\langle x_1, x_2\rangle}{\sqrt{\langle x_1, x_1\rangle \langle x_2, x_2\rangle}} + \frac{\langle y_1, y_2\rangle}{\sqrt{\langle y_1, y_1\rangle \langle y_2, y_2\rangle}}$$
Figure 5.3: Linear Composite Kernel used by AZProtect’s SVM Classification Model
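A minimal Python sketch of the similarity computations behind this kernel follows. It assumes non-negative feature vectors and positive page levels and in/out link counts, uses a small epsilon to avoid division by zero, and omits SVM training entirely; it is an illustration, not the actual AZProtect implementation:

```python
import numpy as np

EPS = 1e-12  # guards against zero denominators in count-based features

def page_similarity(a, k, lv_a, lv_k, in_a, in_k, out_a, out_k):
    """Similarity between a page a and one training page k (cf. Figure 5.3)."""
    struct = (abs(lv_a - lv_k) / (lv_a + lv_k + EPS)) \
           * (abs(in_a - in_k) / (in_a + in_k + EPS)) \
           * (abs(out_a - out_k) / (out_a + out_k + EPS))
    content = np.mean(np.abs(a - k) / (a + k + EPS))  # (1/n) sum |a_i - k_i| / (a_i + k_i)
    return 1.0 - struct * content

def site_similarities(a, a_struct, site_pages):
    """Average and maximum similarity of page a against every page of one training site.

    a_struct: (page level, in-links, out-links) of page a.
    site_pages: list of (feature_vector, (lv, in, out)) tuples for the site's pages.
    """
    sims = [page_similarity(a, k, a_struct[0], s[0], a_struct[1], s[1],
                            a_struct[2], s[2]) for k, s in site_pages]
    return float(np.mean(sims)), float(np.max(sims))

def composite_kernel(x1, y1, x2, y2):
    """Sum of the normalized average-similarity and max-similarity kernels."""
    cos = lambda u, v: np.dot(u, v) / np.sqrt(np.dot(u, u) * np.dot(v, v))
    return cos(x1, x2) + cos(y1, y2)
```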
5.4 Experiments and Results
We evaluated 350 fake generated websites and 350 spoof sites over a 6 week period.
The fake websites were taken from 4 online databases (8, 11). Generated fraud sites came from the Artists Against 4-1-9 (http://wiki.aa419.org) and Escrow Fraud Online (http://escrow-fraud.com), while the spoof sites were taken from PhishTank (http://www.phishtank.com) and the Anti-Phishing Working Group (http://www.antiphishing.org). We also evaluated 100 legitimate sites: 50 authentic
websites to complement the 350 generated sites and 50 legitimate websites commonly
copied by the 350 spoofs. Overall, this resulted in two 400 website test beds.
Fake sites were evaluated between 9am and midnight. In order to assess the impact of
evaluation time of day on performance, at least 15 samples were collected for each hour.
We also evaluated the fake websites at different intervals between evaluation and report
time in the online database. This was done since performance for certain lookup systems
improves as the time interval increases (Zhang et al., 2007). As a result, at least 10
evaluation samples each were collected for 0-24 hour intervals. All times were rounded to
the nearest hour.
We evaluated the effectiveness of the proposed AZProtect system in comparison with other state-of-the-art tools. The 7 comparison tools were ones that had either performed well in prior testing (5, 11) or had not been previously evaluated. These included SpoofGuard, Netcraft, eBay
Account Guard, IE7 Phishing Filter, FirePhish, EarthLink Toolbar, and Sitehound. Only
SpoofGuard, Netcraft, and Sitehound were compared against AZProtect on the generated
site test bed, since the remaining tools do not support generated fraud site detection. In
contrast, all 8 tools were evaluated on the spoof site test bed.
Prior to the experimentation, AZProtect’s SVM classifier model was trained on web
pages from over 200 generated fraud and spoof sites as well as 50 legitimate websites
(none of which appeared in the evaluation test bed). SpoofGuard was also run on these
training websites in order to build a repository of image hashes and URL text for
comparison against the evaluation sites. Since AZProtect evaluates multiple pages from
the website of interest, we limited the maximum number of evaluated pages per site to 50
for computational reasons. AZProtect took an average of 2.9 seconds to evaluate a
website, a number that is slightly higher than the 0.5 to 2.0 second times for other tools
(2, 8). The evaluation metrics included overall accuracy, accuracy on legitimate sites, and
accuracy on fake sites. The latter is considered most important given the high cost of
false negatives (Zhang et al., 2007).
5.4.1 Overall Results
Table 5.2 shows the overall results on the generated fraud and spoof website test beds.
AZProtect had the best overall performance and fake website detection accuracy on both
data sets. All p-values on pairwise t-tests were less than 0.0001 (n=400, n=350). Netcraft also performed well, but with 10%-15% lower fake website detection rates. FirePhish, IE7, and SpoofGuard fared decently on the spoof site test bed, while Sitehound performed poorly.
Table 5.2: Overall Results for Tool Accuracy Comparison on Generated and Spoof Sites

              Generated Sites                                       Spoof Sites
System        Overall     Legit Sites   Gen. Sites      Overall     Legit Sites   Spoof Sites
              (n=400)     (n=50)        (n=350)         (n=400)     (n=50)        (n=350)
SpoofGuard    56.25       88.00         51.71           78.50       92.00         76.57
Sitehound     48.75       100.00        41.43           32.75       100.00        23.14
Netcraft      73.75       98.00         70.29           88.00       98.00         86.57
AZProtect     87.75       96.00         86.57           96.25       96.00         96.29
EarthLink     -           -             -               51.00       98.00         44.29
IE7           -           -             -               78.50       100.00        75.43
FirePhish     -           -             -               80.00       100.00        77.14
eBay          -           -             -               60.25       96.00         55.14
5.4.2 Impact of Time of Day and Interval
Figure 5.4a-b shows the results across times of day for various intervals between
evaluation and report time on the 350 generated fraud sites. AZProtect had the best
performance for interval between evaluation and report time and for evaluation time of
day. Netcraft performed second best followed by SpoofGuard and Sitehound. Netcraft’s
combination of classifier and lookup was beneficial; it was able to detect many newly
generated sites by evaluating their domain registration information. As expected, lookup
systems such as Sitehound and Netcraft performed better as the interval between
evaluation and report time increased. This was due to an increased likelihood of the
blacklist being updated as time transpired. The two systems even outperformed AZProtect at larger intervals; however, AZProtect outperformed the comparison techniques for all intervals less than 16 hours. Interestingly, Sitehound performed better when
evaluating websites in the evening. This is because the tool’s server side blacklist gets
daily updates in the evening time, resulting in enhanced performance in subsequent
hours.
Figure 5.4c-d shows the spoof detection results. AZProtect again had the best
performance, with over 90% accuracy for all intervals and times of day. Netcraft, IE7,
FirePhish, and SpoofGuard also performed well for various times of day and intervals.
Lookup systems such as IE7 and FirePhish only improved for time intervals up to 4
hours. Their accuracy leveled off near 80% for higher time intervals because these tools
update their blacklists more frequently. EarthLink and Sitehound had detection rates
under 50% for all time intervals. Sitehound once again performed better in the evening
hours. In contrast, other tools’ performance seemed consistent across time of day. The
eBay tool performed well at identifying fake replicas of eBay and PayPal websites, which
constitute a large portion of spoofs (Wu et al., 2006). Interestingly, the results by time of
day and interval for spoof sites were more stable than on the generated sites, which tend to have greater content variability. In contrast, spoof sites usually replicate a handful of common sites.
[Line and radar charts: % accuracy plotted against the interval between evaluation and report time (in hours) and against evaluation time of day (0900-0000); panels: (a) Interval: Generated Sites, (b) Time of Day: Generated Sites, (c) Interval: Spoof Sites, (d) Time of Day: Spoof Sites. Systems plotted: SpoofGuard, Sitehound, Netcraft, and AZProtect in panels (a)-(b), plus EarthLink, IE7, FirePhish, and eBay in panels (c)-(d).]
Figure 5.4: Impact of Interval between Evaluation and Report Time and Time of Day on
Accuracy for Generated Fraud and Spoof Site Test Beds
5.4.3 Hybrid Systems: Combining Classifier and Lookup Methods
We assessed the effectiveness of combining the AZProtect classifier with a lookup
mechanism on the same two 350 fake website test beds. The lookup component updated
its blacklist every n hours, where n ranged from 1 to 24. The PhishTank and Artists
Against 4-1-9 databases were used as blacklist sources. Three different systems were
compared: the standard AZProtect classifier, a hybrid classifier combining the classifier
and lookup mechanism, and a hybrid classifier that combined the lookup mechanism with
a dynamic classifier (which was updated every n hours with new blacklist URLs). The
standard classifier was run on every URL. The two hybrid classifiers each compared URLs against the blacklist; URLs on the blacklist were considered fake, while the remainder were evaluated by the classifier.
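The decision logic of the two hybrid configurations can be summarized in a few lines of Python; the blacklist, classifier, and retraining hooks below are illustrative placeholders, not the actual AZProtect components:

```python
def evaluate_url(url, blacklist, classify):
    """Hybrid evaluation: blacklisted URLs are flagged immediately;
    all other URLs are passed to the classifier."""
    if url in blacklist:
        return "fake"
    return classify(url)

def refresh(blacklist, new_reports, classify, retrain=None):
    """Runs every n hours: add newly reported fake URLs to the blacklist and,
    in the dynamic variant, retrain the classifier on them as well."""
    blacklist |= set(new_reports)
    if retrain is not None and new_reports:
        classify = retrain(classify, new_reports)
    return blacklist, classify

# Toy usage: a stand-in classifier that flags nothing; the lookup still catches
# the newly reported URL after the blacklist refresh.
blacklist = {"http://known-fake.example"}
classify = lambda url: "legitimate"
blacklist, classify = refresh(blacklist, ["http://new-fraud.example"], classify)
print(evaluate_url("http://new-fraud.example", blacklist, classify))  # 'fake' via lookup
```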
Figure 5.5 shows the percentage fake website detection rates across the 24 values of n
for the three systems. The hybrid classifiers both outperformed the standard classifier. As
expected, using smaller time intervals between blacklist updates led to higher
performance since the lookup mechanism was better able to identify recent fake sites.
The use of a dynamic classifier further improved performance; however, the performance increase was more pronounced on the generated fraud sites. This is because generated sites’ patterns change and evolve over time, while spoof sites are more stagnant. This point is elaborated upon below with an illustrative example.
[Line charts: % accuracy versus the number of hours between blacklist updates for the Classifier, Classifier+Lookup, and Dynamic Classifier+Lookup systems; panels: (a) Generated Fraud Sites, (b) Spoof Sites.]
Figure 5.5: Impact of Hybrid Systems on Fake Website Detection Accuracy
Figure 5.6 shows four generated fraud sites that appeared over a one week period
with an evolving template. The standard classifier was unable to identify these sites as
fake due to new linkage, image, and text patterns not previously seen in the training data.
The earlier two sites (Aug. 24) were almost identical, with the only difference being the
company names. The third site (Aug. 29) had a somewhat different layout while the
fourth received a complete layout overhaul while maintaining body text similarities with
its predecessors. Although the dynamic classifier also misclassified the first site, it was
able to identify the rest as fake (after update). In contrast, Figure 5.7 shows three spoofs
of PayPal. Although there were some differences, their page content was similar given
that they had to appear like authentic PayPal sites. Consequently, the dynamic classifier
was more effective on generated websites.
Figure 5.6: Generated Fraud Site Patterns over Time
Figure 5.7: Spoof Site Patterns over Time
5.5 Conclusions
We have developed a system comprised of a rich feature set and support vector
machine classification model for improved fake website detection performance. As fake
website developers become more innovative, so too must the tools used to protect
unsuspecting Internet users from these vices (Liu et al., 2006). In addition to providing
improved detection accuracy in the short run, hybrid systems combining lookup
mechanisms with dynamic classifiers could embody an effective long term solution.
Further exploration of various forms of hybridization and different types of dynamic
classification models represents a potentially fruitful future endeavor.
CHAPTER 6: FEATURE SELECTION FOR OPINION CLASSIFICATION IN
ONLINE FORUMS AND REVIEWS
6.1 Introduction
In the previous 4 chapters, we focused on the use of information related to the textual
meta-function for enhanced identity and institutional trust. Beginning with this chapter,
we turn our attention to information related to the ideational meta-function of Systemic
Functional Linguistic Theory. The Internet is frequently used as a medium for exchange
of information and opinions, as well as propaganda dissemination. In this essay the use of
sentiment analysis methodologies is proposed for classification of opinions in online
reviews and web forums.
Analysis of web content is becoming increasingly important due to the growing volume of communication via computer mediated communication (CMC) Internet sources such as email, websites, forums, and chat rooms. Sentiment analysis attempts to identify and
analyze opinions and emotions. Hearst (1992) and Wiebe (1994) originally proposed the
idea of mining direction-based text, i.e., text containing opinions, sentiments, affects, and
biases. Traditional forms of content analysis, such as topical analysis, may not be effective for forums. Nigam and Hurst (2004) found that only 3% of USENET sentences contained
topical information. In contrast, web discourse is rich in sentiment related information
(Subasic and Huettner, 2001). Consequently, in recent years, sentiment analysis has been
applied to various forms of web-based discourse (Agarwal et al., 2003; Efron, 2004).
In this essay we propose the application of sentiment analysis techniques to online
reviews and forum postings. Our analysis encompasses classification of sentiments on a
benchmark movie review data set and social discussion forum. We evaluate different
feature sets consisting of syntactic and stylistic features. We also develop the Entropy
Weighted Genetic Algorithm (EWGA) for feature selection. The features and techniques
result in the creation of a sentiment analysis approach geared towards classification of
web discourse sentiments in multiple languages. The results using Support Vector
Machine (SVM) indicate a high level of classification accuracy, demonstrating the
efficacy of this approach for classifying sentiments inherent in online text.
The remainder of this essay is organized as follows. Section 6.2 presents a review of
current research on sentiment classification. Section 6.3 describes research gaps and
questions while Section 6.4 presents our research design. Section 6.5 describes the
EWGA algorithm and our proposed feature set. Section 6.6 presents experiments used to
evaluate the effectiveness of the proposed approach and discussion of the results. Section
6.7 concludes with closing remarks and future directions.
6.2 Related Work
Sentiment analysis is concerned with analysis of direction-based text, i.e. text
containing opinions and emotions. We focus on sentiment classification studies which
attempt to determine whether a text is objective or subjective, or whether a subjective text
contains positive or negative sentiments. Sentiment classification has several important
characteristics including the various tasks, features, techniques, and application domains.
These are summarized in the taxonomy presented in Table 6.1. Table 6.2 shows selected
previous studies dealing with sentiment classification. We discuss the taxonomy and
related studies in detail below.
Table 6.1: A Taxonomy of Sentiment Polarity Classification

Tasks
Category        Description                                                   Label
Classes         Positive/negative sentiments or objective/subjective texts    C1
Level           Document or sentence/phrase level classification              C2
Source/Target   Whether source/target of sentiment is known or extracted     C3

Features
Category        Examples                                                      Label
Syntactic       Word/POS tag n-grams, phrase patterns, punctuation            F1
Semantic        Polarity tags, appraisal groups, semantic orientation         F2
Link Based      Web links, send/reply patterns, and document citations        F3
Stylistic       Lexical and structural measures of style                      F4

Techniques
Category           Examples                                                   Label
Machine Learning   Techniques such as SVM, Naïve Bayes, etc.                  T1
Link Analysis      Citation analysis and message send/reply patterns          T2
Similarity Score   Phrase pattern matching, frequency counts, etc.            T3

Domains
Category        Description                                                   Label
Reviews         Product, movie, and music reviews                             D1
Web Discourse   Web forums and blogs                                          D2
News Articles   Online news articles and web pages                            D3
6.2.1 Tasks
There have been several sentiment polarity classification tasks. Three important
characteristics of the various sentiment polarity classification tasks are the classes,
classification levels, and assumptions about sentiment source and target (topic). The
common two class problem involves classifying sentiments as positive or negative (Pang
et al., 2002; Turney, 2002). Additional variations include classifying messages as
opinionated/subjective or factual/objective (Wiebe et al., 2001; 2004). A closely related
problem is affect classification which attempts to classify emotions instead of sentiments.
Example affect classes include happiness, sadness, anger, horror etc. (Subasic and
Huettner, 2001; Grefenstette et al., 2004; Mishne, 2005).
Table 6.2: Selected Previous Studies in Sentiment Polarity Classification

[The table compares 27 prior studies (Subasic & Huettner, 2001; Tong, 2001; Morinaga et al., 2002; Pang et al., 2002; Turney, 2002; Agarwal et al., 2003; Dave et al., 2003; Nasukawa & Yi, 2003; Riloff et al., 2003; Yi et al., 2003; Yu & Hatzivassiloglou, 2003; Beineke et al., 2004; Efron, 2004; Fei et al., 2004; Gamon, 2004; Grefenstette et al., 2004; Hu & Liu, 2004; Kanayama et al., 2004; Kim & Hovy, 2004; Pang & Lee, 2004; Mullen & Collier, 2004; Nigam & Hurst, 2004; Wiebe et al., 2004; Liu et al., 2005; Mishne, 2005; Whitelaw et al., 2005; Wilson et al., 2005), marking for each study the feature categories used (F1-F4), whether feature reduction/selection was performed, the techniques applied (T1-T3), and the application domains (D1-D3).]
Sentiment polarity classification can be conducted at the document, sentence, or
phrase (part of sentence) level. Document level polarity categorization attempts to
classify sentiments in movie reviews, news articles, or web forum postings (Wiebe et al.,
2001; Pang et al., 2002; Mullen and Collier, 2004; Pang and Lee, 2004; Whitelaw et al.,
2005). Sentence level polarity categorization attempts to classify positive and negative
sentiments for each sentence (Yi et al., 2003; Mullen and Collier, 2004; Pang and Lee,
2004) or whether a sentence is subjective or objective (Riloff et al., 2003). There has also
been work on phrase level categorization in order to capture multiple sentiments that may
be present within a single sentence (Wilson et al., 2005).
In addition to sentiment classes and categorization levels, different assumptions have
also been made about the sentiment sources and targets (Yi et al., 2003). In this essay we
focus on document level sentiment polarity categorization (i.e., distinguishing positive
and negative sentiment texts). However, we also review related sentence level and
subjectivity classification studies due to the relevance of the features and techniques
utilized and the application domains.
6.2.2 Features
There are four feature categories that have been used in previous sentiment analysis
studies. These include syntactic, semantic, link-based, and stylistic features. Along with
semantic features, syntactic attributes are the most commonly used set of features for
sentiment analysis. These include word n-grams (Pang et al., 2002; Gamon, 2004), partof-speech (POS) tags (Pang et al., 2002; Yi et al., 2003; Gamon, 2004), and punctuation.
Additional syntactic features include phrase patterns, which make use of POS tag n-gram
patterns (Nasukawa and Yi, 2003; Yi et al., 2003; Fei et al., 2004). They noted that phrase
patterns such as “n+aj” (noun followed by positive adjective) typically represented
positive sentiment orientation while “n+dj” (noun followed by negative adjective) often
expressed negative sentiment (Fei et al., 2004). Wiebe et al. (2004) used collocations,
where certain parts of fixed word n-grams were replaced with general word tags, thereby
also creating n-gram phrase patterns. For example, the pattern “U-adj as-prep” would be
used to signify all bigrams containing a unique (once occurring) adjective followed by
the preposition “as.” Whitelaw et al. (2005) used a set of modifier features (e.g., very,
mostly, not); the presence of these features transformed appraisal attributes for lexicon
items.
Semantic features incorporate manual/semi-automatic or fully automatic annotation
techniques to add polarity or affect intensity related scores to words and phrases.
Hatzivassiloglou and McKeown (1997) proposed a semantic orientation (SO) method
later extended by Turney (2002) that uses a mutual information calculation to
automatically compute the SO score for each word/phrase. The score is computed by
taking the mutual information between a phrase and the word “excellent” and subtracting
the mutual information between the same phrase and the word “poor.” In addition to
pointwise mutual information, the SO approach was later also evaluated using latent
semantic analysis (Turney and Littman, 2003).
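As a concrete illustration, the SO score can be estimated from co-occurrence counts; the sketch below uses hypothetical corpus counts (Turney's original formulation used search engine hit counts rather than a local corpus):

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information estimated from raw co-occurrence counts."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

def semantic_orientation(c, total):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor')."""
    return (pmi(c["with_excellent"], c["phrase"], c["excellent"], total)
            - pmi(c["with_poor"], c["phrase"], c["poor"], total))

# Hypothetical counts for a phrase such as "low fees" in a review corpus.
counts = {"phrase": 120, "excellent": 5000, "poor": 4000,
          "with_excellent": 45, "with_poor": 8}
print(semantic_orientation(counts, total=1_000_000))  # positive score => positive orientation
```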
Manual or semi-automatically generated sentiment lexicons (e.g., Tong, 2001; Fei et
al., 2004; Wilson et al., 2005) typically use an initial set of automatically generated terms
which are manually filtered and coded with polarity and intensity information. The user
defined tags are incorporated to indicate whether certain phrases convey positive or
negative sentiment. Riloff et al. (2003) used semi-automatic lexicon generation tools to
construct sets of strong subjectivity, weak subjectivity, and objective nouns. Their
approach outperformed the use of other features, including bag-of-words, for
classification of objective versus subjective English documents. Appraisal Groups
(Whitelaw et al., 2005) is another effective method for annotating semantics to
words/phrases. Initial term lists are generated using WordNet, which are then filtered
manually, to construct the lexicon. Developed based on Appraisal Theory (Martin and
White, 2005), each expression is manually classified into various appraisal classes. These
classes include attitude, orientation, graduation, and polarity of phrases. Whitelaw et al.
(2005) were able to get very good accuracy using appraisal groups on a movie review
corpus, outperforming several previous studies (e.g., Mullen and Collier, 2004), the
automated mutual information based approach (Turney, 2002), as well as the use of
syntactic features (Pang et al., 2002). Manually crafted lexicons have also been used for
affect analysis. Subasic and Huettner (2001) used affect lexicons along with fuzzy
semantic typing for affect analysis of news articles and movie reviews. Abbasi and Chen
(2007a, 2007b) used manually constructed affect lexicons for analysis of hate and
violence in web forums.
Other semantic attributes include contextual features representing the semantic
orientation of surrounding text, which have been useful for sentence level sentiment
classification. Riloff et al. (2003) utilized semantic features that considered the
subjectivity and objectivity of text surrounding a sentence. Their attributes measured the
level of subjective and objective clues in the sentence prior to and following the sentence
of interest. Pang and Lee (2004) also leveraged coherence in discourse by considering the
level of subjectivity of sentences in close proximity to the sentence of interest.
Link-based features use link/citation analysis to determine sentiments for web
artifacts and documents. Efron (2004) found that opinion web pages heavily linking to
each other often shared similar sentiments. Agarwal et al. (2003) observed the exact
opposite for USENET newsgroups discussing issues such as abortion and gun control.
They noticed that forum replies tended to be antagonistic. Due to the limited usage of
link-based features, it is unclear how effective they may be for sentiment classification.
Furthermore, unlike web pages and USENET, other forums may not have a clear message
link structure and some forums are serial (no threads).
Stylistic attributes include lexical and structural attributes incorporated in numerous
prior stylometric/authorship studies (e.g., De Vel et al., 2001; Zheng et al., 2006).
However, lexical and structural style markers have seen limited usage in sentiment
analysis research. Wiebe et al. (2004) used hapax legomena (unique/once occurring
words) effectively for subjectivity and opinion discrimination. They observed a
noticeably higher presence of unique words in subjective texts as compared to objective
documents across a Wall Street Journal corpus and noted “Apparently, people are creative
when they are being opinionated” (p. 286). Gamon (2004) used lexical features such as
sentence length for sentiment classification of feedback surveys. Mishne (2005) used
lexical style markers such as words per message, and words per sentence for affect
analysis of web blogs. While it is unclear whether stylistic features are effective
sentiment discriminators for movie/product reviews, style markers have been shown to be
highly prevalent in web discourse (Abbasi and Chen, 2005; Zheng et al., 2006; Schler et
al., 2006).
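Lexical style markers of this kind are simple to compute; the following sketch is an illustration with a handful of representative measures:

```python
import re

def style_markers(message):
    """A few lexical style markers used in stylometric work: words per message,
    average words per sentence, and vocabulary richness (type-token ratio)."""
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    words = re.findall(r"[A-Za-z']+", message.lower())
    return {
        "words_per_message": len(words),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }

print(style_markers("I love this forum. People are creative when they are opinionated!"))
```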
6.2.3 Classification Techniques
Previously used techniques for sentiment classification can be classified into three
categories. These include machine learning algorithms, link analysis methods, and score-based approaches.
Many studies have used machine learning algorithms with support vector machines
(SVM) and Naïve Bayes (NB) being the most commonly used. SVM has been used
extensively for movie reviews (Pang et al., 2002; Pang and Lee, 2004; Whitelaw et al., 2005) while Naïve Bayes has been applied to reviews and web discourse (Pang et al., 2002; Pang and Lee, 2004; Efron, 2004). In comparisons, SVM has outperformed other
classifiers such as NB (Pang et al., 2002). While SVM has become a dominant technique
for text classification, other algorithms such as Winnow (Nigam and Hurst, 2004) and
AdaBoost (Wilson et al., 2005) have also been used in previous sentiment classification
studies.
Studies using link based features and metrics for sentiment classification have often
used link analysis. Efron (2004) used cocitation analysis for sentiment classification of
web site opinions while Agarwal et al. (2003) used message reply link structures to
classify sentiments in USENET newsgroups. An obvious limitation of link analysis
methods is that they are not effective where link structure is not clear or links are sparse
(Efron, 2004).
Score-based methods are typically used in conjunction with semantic features. These
techniques generally classify message sentiments based on the total sum of comprised
positive or negative sentiment features. Phrase pattern matching (Nasukawa and Yi, 2003;
Yi et al., 2003; Fei et al., 2004) requires checking text for manually created polarized
phrase tags (positive and negative). Positive phrases are assigned a plus one while
negative phrases are assigned a minus one. All messages with a positive sum are assigned
positive sentiment while negative messages are assigned to the negative sentiment class.
The semantic orientation approach (Hatzivassiloglou and McKeown, 1997; Turney, 2002)
uses a similar method to score the automatically generated polarized phrase tags. Score-based methods have also been used for affect analysis, where the affect features present
within a message/document are scored based on their degree of intensity for a particular
emotion class (Subasic and Huettner, 2001).
6.2.4 Sentiment Analysis Domains
Sentiment analysis has been applied to numerous domains including reviews, web
discourse, and news articles and documents. Reviews include movie, product, and music
reviews (Morinaga et al., 2002; Pang et al., 2002; Turney, 2002). Sentiment analysis of
movie reviews is considered to be very challenging since movie reviewers often present
lengthy plot summaries and also use complex literary devices such as rhetoric and
sarcasm. Product reviews are also fairly complex since a single review can feature
positive and negative sentiments about particular facets of the product.
Web Discourse sentiment analysis includes evaluation of web forums, newsgroups,
and blogs. These studies assess sentiments about specific issues/topics. Sentiment topics
include abortion, gun control, and politics (Agarwal et al., 2003; Efron, 2004). Robinson
(2005) evaluated sentiments about 9/11 in three forums in the United States, Brazil, and
France. Wiebe et al. (2004) performed subjectivity classification of USENET newsgroup
postings.
Sentiment analysis has also been applied to news articles (Yi et al., 2003; Wilson et
al., 2005). Henley et al. (2004) analyzed newspaper articles for biases pertaining to
violence related reports. They found that there was a significant difference between the
manner in which the Washington Post and the San Francisco Chronicle reported news
stories relating to anti-gay attacks, with the reporting style reflecting newspaper
sentiments. Wiebe et al. (2004) classified objective and subjective news articles in a Wall
Street Journal corpus. Some general conclusions can be drawn from Table 6.2 and the
literature review. Most studies have used syntactic and semantic features. There has also
been little use of feature reduction/selection techniques which may improve classification
accuracy. Furthermore, there has been limited application of sentiment analysis to web
forums.
6.3 Research Gaps and Questions
Based on our review of the previous literature and our conclusions, we have identified several
important research gaps. Firstly, there has been limited previous sentiment analysis work
on web forums, and most studies have focused on sentiment classification in a single
language. Secondly, there has been almost no usage of stylistic feature categories. Finally,
little emphasis has been placed on feature reduction/selection techniques.
6.3.1 Web Forums in Multiple Languages
Most previous sentiment classification of web discourse has focused on financial
forums. Applying such methods to web forums is important in order to develop a viable
set of features for assessing the presence of propaganda in these online communities.
6.3.2 Stylistic Features
Previous work has focused on syntactic and semantic features. There has been little
use of stylistic features such as word length distributions, vocabulary richness measures,
character and word level lexical features, and special character frequencies. Gamon
(2004) and Pang et al. (2002) pointed out that many important features may not seem
intuitively obvious at first. Thus, while prior emphasis has been on adjectives, stylistic
features may uncover latent patterns that can improve classification performance of
sentiments. This may be especially true for web forum discourse, which is rich in stylistic
variation (Abbasi and Chen, 2005; Zheng et al., 2006). Stylistic features have also been
shown to be highly prevalent in other forms of computer mediated communication,
including web blogs (Herring and Paolillo, 2006).
6.3.3 Feature Reduction for Sentiment Classification
Different automated and manual approaches have been used to craft sentiment
classification feature sets. Little emphasis has been given to feature subset selection
techniques. Gamon (2004) and Yi et al. (2003) used log likelihood to select important
attributes from a large initial feature space. Wiebe et al. (2004) evaluated the
effectiveness of various potential subjective elements (PSEs) for subjectivity
classification based on their occurrence distribution across classes. However many
powerful techniques have not been explored. Feature reduction/selection techniques have
two important benefits (Li et al., 2006). They can potentially improve classification
accuracy and also provide greater insight into important class attributes, resulting in a
better understanding of sentiment arguments and characteristics (Guyon and Elisseeff,
2003). Using feature reduction, Gamon (2004) was able to improve accuracy and narrow
in on a key feature subset of sentiment discriminators.
6.3.4 Research Questions
We propose the following research questions:
1) How effectively can sentiment analysis be applied to web forums?
2) Can stylistic features provide further sentiment insight and classification
power?
3) How can feature selection improve classification accuracy and identify key
sentiment attributes?
6.4 Research Design
In order to address these questions, we propose the use of a sentiment classification
feature set consisting of syntactic and stylistic features. Furthermore, utilization of feature
selection techniques such as genetic algorithms (Holland, 1975) and information gain
(Shannon 1948; Quinlan, 1986) is also included to improve classification accuracy and
gain insight into the important features for each sentiment class.
Based on the prevalence of stylistic variation in web discourse, we believe that lexical
and structural style markers can improve the ability to classify web forum sentiments.
Integrated stylistic features include attributes such as word length distributions,
vocabulary richness measures, letter usage frequencies, use of greetings, presence of
requoted content, use of URLs etc.
We also propose the use of an Entropy Weighted Genetic Algorithm (EWGA) that
incorporates the Information Gain (IG) heuristic with a Genetic Algorithm (GA) to
improve feature selection performance. GA is an evolutionary computing search method
(Holland, 1975) that has been used in numerous feature selection applications (Siedlecki
and Sklansky, 1989; Yang and Honavar, 1998; Li et al., 2006; 2007). Oliveira et al.
(2002) successfully applied GA to feature selection for hand-written digit recognition.
Vafaie and Imam (1994) showed that GA outperformed other heuristics such as greedy
search for image recognition feature selection. Like most random search feature selection
methods (Dash and Liu, 1997), it uses a wrapper model where the performance accuracy
is used as the evaluation criterion to improve the feature subset in future generations.
In contrast, IG is a heuristic based on Information Theory (Shannon, 1948; Shannon,
1951). It uses a filter model for ranking features which makes it computationally more
efficient than GA. IG has outperformed numerous feature selection techniques in head-to-head comparisons (Forman, 2003). Since our experiments will use the SVM classifier, we
also plan to compare the proposed EWGA technique against the use of SVM weights for
feature selection. In this method, the SVM weights are used to iteratively reduce the
feature space, thereby improving performance (Koppel et al., 2002). SVM weights have
been shown to be effective for text categorization (Koppel et al., 2002; Mladenic et al.,
2004) and gene selection for cancer classification (Guyon et al., 2002). GA, IG, and SVM
weights have been used in several previous text classification studies as shown in Table
6.3. A review of feature selection for text classification can be found in (Sebastiani,
2002).
Table 6.3: Text Classification Studies using GA, IG, and SVM Weights

Technique     Task                     Study
GA            Stylometric Analysis     Li et al., 2006
IG            Topic Classification     Efron et al., 2003
              Stylometric Analysis     Juola & Baayen, 2003; Koppel & Schler, 2003; Abbasi & Chen, 2006
SVM Weights   Topic Classification     Mladenic et al., 2004
              Gender Categorization    Koppel et al., 2002
A consequence of using an optimal search method such as GA in a wrapper model is
that convergence towards an ideal solution can be slow when dealing with very large
solution spaces. However, as previous researchers have argued, feature selection is
considered an “offline” task that does not need to be repeated constantly (Jain and
Zongker, 1997). This is why wrapper based techniques using genetic algorithms have
been used for gene selection with feature spaces consisting of tens of thousands of genes
(Li et al., 2007). Furthermore, hybrid GAs have previously been used for product design
optimization (Alexouda and Paparrizos, 2001; Balakrishnan et al., 2004) and scheduling
problems (Levine, 1996) to facilitate improved accuracy and convergence efficiency
(Balakrishnan et al., 2004). We developed the EWGA hybrid GA that utilizes the
Information Gain (IG) heuristic with the intention of improving feature selection quality.
More algorithmic details are provided in the next section.
6.5 System Design
We propose the following system design (shown in Figure 6.1). Our design has two
major steps: extracting an initial set of features and performing feature selection. These
steps are used to carry out sentiment classification of forum messages.
6.5.1 Feature Extraction
We incorporated syntactic and stylistic features in our sentiment classification
attribute set. Link based features were not included since our messages were not in
sequential order (insufficient cross-message references). These types of features are only
effective where the test bed consists of entire threads of messages and message
referencing information is available. Semantic features were not used since these
attributes are heavily context dependent (Pang et al., 2002). Such features are topic and
domain specific. For example, the set of positive polarity words describing a good movie
may not be applicable to discussions about racism. Unlike stylistic and syntactic features,
semantic features such as manually crafted lexicons incorporate an inherent feature
selection element via the human involvement. Such human involvement makes semantic
features (e.g., lexicons and dictionaries) very powerful for sentiment analysis. Lexicon
developers will only include features that are considered to be important, and weight
these features based on their significance, thereby reducing the need for feature selection.
For example, Whitelaw et al. (2005) used WordNet to construct an initial set of features,
which were manually filtered and weighted to create the lexicon. We hope to overcome
the lack of semantic features by incorporating feature selection methods intended to
isolate the important subset of stylistic and syntactic features and remove noise.
Figure 6.1: Sentiment Classification System Design
6.5.2 Determining Size of Initial Feature Set
Our initial feature set consisted of 14 different feature categories which included POS
tag n-grams, word n-grams, and punctuation for syntactic features. Style markers
included word and character level lexical features, word length distributions, special
characters, letters, character n-grams, structural features, vocabulary richness measures,
digit n-grams, and function words. The word length distribution includes the frequency of
1 to 20 letter words. Word level lexical features include total words per document,
average word length, average number of words per sentence, average number of words
per paragraph, total number of short words (i.e., ones less than 4 letters) etc. Character
level lexical features include total characters per document, average number of characters
per sentence, average number of characters per paragraph, percentage of all characters
that are in words, and the percentage of alphabetic, digit, and space characters.
Vocabulary richness features include the total number of unique words used, hapax
legomena (number of once occurring words), dis legomena (number of twice occurring
words), and various previously defined statistical measures of richness such as Yule’s K,
Honore’s R, Sichel’s S, Simpson’s D, and Brunet’s W measure. The structural features
encompass the total number of lines, sentences, and paragraphs; as well as whether the
document has a greeting or a signature. Additional structural attributes include whether
there is a separation between paragraphs, whether the paragraphs are indented, the
presence and position of quoted and forwarded content, and whether the document
includes email, URL, and telephone contact information. Further descriptions of the
lexical, vocabulary richness, and structural attributes can be found in (de Vel et al., 2001;
Zheng et al., 2006; Abbasi and Chen, 2005).
Many feature categories are pre-defined in terms of the number of potential features.
For example, there are only a certain number of possible punctuation and stylistic lexical
features (e.g., words per sentence, words per paragraph etc.). In contrast, there are
countless potential n-gram based features. Consequently, some shallow selection criterion
is typically incorporated to reduce the feature space for n-grams. A common approach is
to select features with a minimum usage frequency (Mitra et al., 1997; Jiang et al., 2004).
We used a minimum frequency threshold of 10 for n-gram based features. Less common
features are sparse and likely to cause over-fitting. In addition, we limited n-grams to
bigrams and trigrams, as higher-order n-grams tend to be redundant. Using only up to trigrams
has been shown to be effective for stylometric analysis (Kjell et al., 1994) and sentiment
classification (Pang et al., 2002; Wiebe et al., 2004). Based on this criterion for n-gram
features, Table 6.4 shows the sentiment analysis feature set.
Table 6.4: Sentiment Analysis Feature Set

Category    Feature Group         Quantity   Examples
Syntactic   POS N-grams           varies     frequency of part-of-speech tags (e.g., NP_VB)
            Word N-grams          varies     word n-grams (e.g., senior editor, editor in chief)
            Punctuation           8          occurrence of punctuation marks (e.g., !;:,.?)
Stylistic   Letter N-grams        26         frequency of letters (e.g., a, b, c)
            Character N-grams     varies     character n-grams (e.g., abo, out, ut, ab)
            Word Lexical          8          total words, % char. per word
            Char. Lexical         8          total char., % char. per message
            Word Length           20         frequency distribution of 1-20 letter words
            Vocabulary Richness   8          richness (e.g., hapax legomena, Yule's K)
            Special Characters    20         occurrences of special char. (e.g., @#$%^&*+)
            Digit N-grams         varies     frequency of digits (e.g., 100, 17, 5)
            Structural            14         has greeting, has url, requoted content, etc.
            Function Words        250        frequency of function words (e.g., of, for, to)
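To make the frequency-threshold criterion described above concrete, the following is a minimal Python sketch (not the system's actual implementation) of word n-gram extraction with the minimum corpus frequency of 10 and the restriction to n-grams of at most three tokens; whitespace tokenization and lowercasing are simplifying assumptions.

from collections import Counter

def extract_word_ngrams(documents, n_values=(1, 2, 3), min_freq=10):
    # Count word n-grams up to trigrams across the corpus, then keep only
    # those meeting the minimum frequency threshold described above.
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()   # simplifying assumption: whitespace tokens
        for n in n_values:
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {gram for gram, freq in counts.items() if freq >= min_freq}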
6.5.3 Feature Selection: Entropy Weighted Genetic Algorithm (EWGA)
Most previous hybrid GA variations combine GA with other search heuristics such as
beam search, where the beam search output is used as part of the initial GA population
(Alexouda and Paparrizos, 2001; Balakrishnan et al., 2004). Additional hybridizations
include modification of the GA’s crossover (Aggarwal et al., 1997) and mutation
operators (Balakrishnan et al., 2004). The Entropy Weighted Genetic Algorithm (EWGA)
uses the information gain (IG) heuristic to weight the various sentiment attributes. These
weights are then incorporated into the GA’s initial population and crossover and mutation
operators. The major steps for the EWGA are as follows:
EWGA Steps
1) Derive feature weights using IG.
2) Include IG selected features as part of initial GA solution
population.
3) Evaluate and select solutions based on fitness function.
4) Crossover solution pairs at point that maximizes total IG
difference between the two solutions.
5) Mutate solutions based on feature IG weights.
6) Repeat steps 3-5 until stopping criterion is satisfied.
Figure 6.2: EWGA Illustration
Figure 6.2 shows an illustration of the EWGA process. A detailed description of the
IG, initial population, evaluation and selection, crossover, and mutation steps is presented
below.
6.5.3.1 Information Gain
For information gain (IG) we used the Shannon entropy measure (Shannon, 1949;
1951) in which:
IG(C,A) = H(C) - H(C|A)
where:
IG(C,A)   information gain for feature A;
H(C) = -\sum_{i=1}^{n} p(C=i) \log_2 p(C=i)   entropy across sentiment classes C;
H(C|A) = -\sum_{i=1}^{n} p(C=i|A) \log_2 p(C=i|A)   conditional entropy of C given feature A;
n         total number of sentiment classes.
If the number of positive and negative sentiment messages is equal, H(C) is 1.
Furthermore, the information gain for each attribute A will vary along the range 0-1 with
higher values indicating greater information gain. All features with an information gain
greater than 0.0025 (i.e., IG(C,A)>0.0025) are selected. The use of such a threshold is
consistent with prior work using IG for text feature selection (Yang and Pedersen, 1997).
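As an illustration, a minimal Python sketch of this computation follows (not the dissertation's code). It treats each feature as a binary presence indicator, computes the conditional entropy over both feature values (a standard reading of H(C|A)), and applies the 0.0025 threshold; all function names are illustrative.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy H(C) of a list of class labels, in bits.
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_present, labels):
    # IG(C, A) = H(C) - H(C|A) for a binary feature A.
    # feature_present[i] is True if feature A occurs in document i.
    h_c = entropy(labels)
    h_c_given_a = 0.0
    for value in (True, False):
        subset = [lab for f, lab in zip(feature_present, labels) if f == value]
        if subset:
            h_c_given_a += (len(subset) / len(labels)) * entropy(subset)
    return h_c - h_c_given_a

def select_by_ig(feature_matrix, labels, threshold=0.0025):
    # Keep only the features whose information gain exceeds the 0.0025 cutoff.
    return [j for j in range(len(feature_matrix[0]))
            if information_gain([row[j] > 0 for row in feature_matrix], labels) > threshold]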
6.5.3.2 Solution Structure and Initial Population
We represent each solution in the population using a binary string of length equal to
the total number of features, with each binary string character representing a single
feature. Specifically, 1 represents a selected feature while 0 represents a discarded one.
For example, a solution string representing five candidate features, “10011,” means that
the first, fourth and fifth features are selected, while the other two are discarded (Li et al.,
2006). In the standard GA, the initial population of n strings is randomly generated. In the
EWGA, n-1 solution strings are randomly generated while the IG solution features are
used as the final solution string in the initial population.
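A minimal sketch of this representation and seeding step is given below, assuming the IG-selected feature indices are already available; the names n_features, pop_size, and init_population are illustrative.

import random

def init_population(n_features, pop_size, ig_selected):
    # Each solution is a 0/1 list of length n_features (1 = feature selected).
    # pop_size - 1 members are random; the last member is the IG seed solution.
    population = [[random.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size - 1)]
    ig_seed = [1 if j in set(ig_selected) else 0 for j in range(n_features)]
    population.append(ig_seed)
    return population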
6.5.3.3 Evaluation and Selection
We use the classification accuracy as the fitness function used to evaluate the quality
of each solution. Hence, for each genome in the population, 10-fold cross validation with
SVM is used to assess the fitness of that particular solution. Solutions for the next
iteration are selected probabilistically with better solutions having a higher probability of
selection. While several population replacement strategies exist, we use the generational
replacement method originally defined by Holland (1975) in which the entire population
is replaced every generation. Other replacement alternatives include steady-state methods
where only a fraction of the population is replaced every iteration, while the majority is
passed over to the next generation (Levine, 1996). Generational replacement is used in
order to maintain solution diversity and prevent premature convergence attributable to the
IG seed solution dominating the other solutions (Aggarwal et al., 1997; Balakrishnan et
al., 2004).
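The fitness evaluation and fitness-proportionate selection might look as follows; this is a sketch that substitutes scikit-learn's LinearSVC and cross_val_score for the Weka SMO implementation used in the actual experiments, and the handling of empty or zero-fitness populations is an added assumption.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def fitness(solution, X, y):
    # Fitness = mean 10-fold cross-validation accuracy of a linear SVM
    # trained only on the features switched on in `solution` (a 0/1 vector).
    selected = np.flatnonzero(solution)
    if selected.size == 0:
        return 0.0                                   # empty subsets score zero
    return cross_val_score(LinearSVC(), X[:, selected], y, cv=10).mean()

def select_next_generation(population, fitnesses, rng=np.random.default_rng()):
    # Roulette-wheel selection with full generational replacement, as described above.
    probs = np.asarray(fitnesses, dtype=float)
    if probs.sum() == 0:
        probs = np.ones_like(probs)
    probs = probs / probs.sum()
    idx = rng.choice(len(population), size=len(population), p=probs)
    return [population[i] for i in idx]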
6.5.3.4 Crossover
From the n solution strings in the population (i.e., n/2 pairs), certain string pairs are
randomly selected for crossover based on a crossover probability Pc . In the standard GA,
we use single-point crossover by selecting a pair of strings and swapping substrings at a
randomly determined crossover point x.
Before crossover:   S = 010010      T = 110100
Split at x = 3:     S = 010 | 010   T = 110 | 100
After crossover:    S = 010100      T = 110010
The IG heuristic is utilized in the EWGA crossover procedure in order to improve the
quality of the newly generated solutions. Given a pair of solution strings S and T, the
EWGA crossover method selects a crossover point x that maximizes the difference in
cumulative information gain across strings S and T. Such an approach is intended to
create a more diverse solution population: those with heavier concentrations of features
with higher IG values and those with fewer IG features. The crossover point selection
procedure can be formulated as follows:
x = \arg\max_{x} \left[ \sum_{A=1}^{x} IG(C,A)(S_A - T_A) + \sum_{A=x}^{m} IG(C,A)(T_A - S_A) \right]

where:
IG(C,A)   information gain for feature A;
S_A       Ath character in solution string S;
T_A       Ath character in solution string T;
m         total number of features;
x         crossover point in solution pair S and T, where 1 < x < m.
Maximizing the IG differential between solution pairs in the crossover process allows
the creation of potentially better solutions. Solutions with higher IG contain attributes
that may have greater discriminatory potential while the lower IG solutions help maintain
the diversity balance in the solution population. Such balance is important to avoid
premature convergence of solution populations towards local maxima (Aggarwal et al.,
1997).
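A sketch of the crossover point search follows (illustrative names only). It scans every interior point and splits the strings into the non-overlapping segments [1..x] and [x+1..m], which is one natural reading of the formulation above.

def ewga_crossover_point(ig, s, t):
    # Choose the crossover point x (1 < x < m) maximizing the cumulative IG
    # difference between strings s and t; ig[a] is IG(C, A) for feature a,
    # and s, t are 0/1 lists of equal length m.
    m = len(s)
    best_x, best_score = 1, float("-inf")
    for x in range(1, m):
        score = (sum(ig[a] * (s[a] - t[a]) for a in range(0, x)) +
                 sum(ig[a] * (t[a] - s[a]) for a in range(x, m)))
        if score > best_score:
            best_x, best_score = x, score
    return best_x

def crossover(ig, s, t):
    # Swap the tails of s and t at the IG-maximizing point.
    x = ewga_crossover_point(ig, s, t)
    return s[:x] + t[x:], t[:x] + s[x:]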
6.5.3.5 Mutation
The traditional GA mutation operator randomly mutates individual feature characters
in a solution string based on a mutation probability constant Pm . The EWGA mutation
operator factors the attribute information gain into the mutation probability as shown
below. This is done in order to improve the likelihood of inclusion into the solution string
for features with higher information gain while decreasing the probability of features with
lower information gain. Our mutation operator sets the probability of a bit to mutate from
0 to 1 based on the feature’s information gain, whereas the probability to mutate from 1
to 0 is set to the value one minus the feature’s information gain. Balakrishnan et al.
(2004) demonstrated the potential for modified mutation operators that favored features
with higher weights in their hybrid genetic algorithm geared towards product design
optimization.

P_m(A) = B[IG(C,A)]       if S_A = 0
P_m(A) = B[1 - IG(C,A)]   if S_A = 1

where:
P_m(A)    probability of mutation for feature A;
IG(C,A)   information gain for feature A;
S_A       Ath character in solution string S;
B         constant in the range 0-1.
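A small sketch of this mutation operator is given below (illustrative names; B defaults to the 0.1 value used in the experiments reported later).

import random

def ewga_mutate(solution, ig, b=0.1):
    # Flip each bit with a probability that favors high-IG features:
    # P(0 -> 1) = B * IG(C, A) and P(1 -> 0) = B * (1 - IG(C, A)).
    mutated = list(solution)
    for a, bit in enumerate(mutated):
        p = b * ig[a] if bit == 0 else b * (1.0 - ig[a])
        if random.random() < p:
            mutated[a] = 1 - bit
    return mutated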
6.5.4 Classification
Because our research focus is on sentiment feature extraction and selection, in all
experiments Support Vector Machine (SVM) is used with 10-fold cross-validation to
classify the sentiments. We chose SVM in our experiments because it has outperformed
other machine learning algorithms for various text classification tasks (Pang et al., 2002;
Abbasi and Chen, 2005; Zheng et al., 2006). We use a linear kernel with the Sequential
Minimal Optimization algorithm included in Weka (Witten and Frank, 2005).
6.6 Evaluation
Experiments were conducted on a benchmark movie review data set (Experiment 1)
and a social web forum (Experiment 2). The purpose of Experiment 1 was to evaluate the
effectiveness of the proposed features and selection technique (EWGA) in comparison
with previous document level sentiment classification approaches. The overall accuracy
was the average classification accuracy across all 10 folds where the classification
accuracy was computed as follows:
Classification Accuracy = Number of Correctly Classified Documents / Total Number of Documents
In addition to 10-fold cross validation, bootstrapping was used to randomly select 50
samples for statistical testing, as done in previous research (e.g., Whitelaw et al., 2005).
For each sample, we used 5% of the instances for testing and the other 95% for training.
Pair wise t-tests were performed on the bootstrap values to assess statistical significance.
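A sketch of this evaluation protocol follows, using scikit-learn and SciPy as stand-ins for the tools actually used; the stratified splits and fixed seeds are assumptions added so that the paired t-test compares techniques on identical samples.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def bootstrap_accuracies(X, y, n_samples=50, test_frac=0.05, seed=0):
    # Draw n_samples random 95%/5% train/test splits and record the test
    # accuracy of a linear SVM on each split.
    accs = []
    for i in range(n_samples):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_frac, stratify=y, random_state=seed + i)
        accs.append(LinearSVC().fit(X_tr, y_tr).score(X_te, y_te))
    return np.array(accs)

def compare(accs_a, accs_b):
    # Paired t-test on the 50 bootstrap accuracies of two feature sets or
    # techniques; a p-value below alpha = 0.05 indicates significance.
    t_stat, p_value = ttest_rel(accs_a, accs_b)
    return t_stat, p_value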
6.6.1 Experiment 1: Movie Review Test Bed
In Experiment 1, we conducted two experiments to evaluate the effectiveness of our
features as well as feature selection methods for document level sentiment polarity
classification on a benchmark movie review data set (Pang et al., 2002; Pang and Lee,
2004). This data set has been used for document level sentiment categorization in several
previous studies (e.g. Pang et al., 2002; Mullen and Collier, 2003; Pang and Lee, 2004;
Whitelaw et al., 2005). The test bed consists of 2,000 movie reviews (1,000 positive and
1,000 negative) taken from the IMDb movie review archives. The positive reviews are
comprised of four and five star reviews while the negative reviews are those receiving
one or two stars. For all experiments, SVM was run using 10-fold cross-validation, with
1800 reviews used for training and 200 for testing in each fold. Bootstrapping was
performed by randomly selecting 100 reviews for testing and the remaining 1900 for
training, 50 times. In Experiment 1a we evaluated the effectiveness of syntactic and
stylistic features for sentiment polarity classification. Experiment 1b focused on
evaluating the effectiveness of EWGA for feature selection.
6.6.1.1 Experiment 1a: Evaluation of Features
In order to evaluate the effectiveness of syntactic and stylistic features for movie
review classification, we used a feature set permutation approach (e.g. stylistic, syntactic,
stylistic + syntactic). Stylistic features are unlikely to effectively classify sentiments on
their own. Syntactic features have been used in most previous studies and we suspect that
these are most important. However, stylistic features may be able to supplement syntactic
features; nevertheless this set of features has not been tested sufficiently. Table 6.5 shows
the results for the three feature sets. The bootstrap accuracy and standard deviation were
computed across the 50 samples.
Table 6.5: Experiment 1a Results

Features                10-Fold CV   Bootstrap   Standard Dev.   # Features
Stylistic               73.65%       73.26%      2.832           1,017
Syntactic               83.80%       83.74%      1.593           25,853
Stylistic + Syntactic   87.95%       88.06%      1.133           26,870
The best classification accuracy result using SVM was achieved when using both
syntactic and stylistic features. The combined feature set outperformed the use of only
syntactic or stylistic features. As expected, the results using only syntactic features were
considerably better than the results using just style markers. In addition to improved
accuracy, the results using stylistic and syntactic features had less variation based on the
lower standard deviation. This suggests that using both feature categories in conjunction
results in more consistent performance. In contrast, stylistic features had considerably
higher standard deviation, indicating that their effectiveness varies across messages.
Table 6.6: P-Values for Pair Wise t-tests on Accuracy (n=50)

Features               P-Values
Sty. vs. Syn.          <0.0001*
Sty. vs. Syn. + Sty.   <0.0001*
Syn. vs. Syn. + Sty.   <0.0001*

* P-values significant at alpha = 0.05
Table 6.6 shows the pair wise t-tests conducted on the 50 bootstrap samples to
evaluate the statistical significance of the improved results using stylistic and syntactic
features. As expected, syntactic features outperformed stylistic features when both were
used alone. However, using both feature categories significantly outperformed the use of
either category individually. The results suggest that stylistic features are prevalent in
movie reviews and may be useful for document level sentiment polarity classification.
6.6.1.2 Experiment 1b: Evaluation of Features Selection Techniques
This experiment was concerned with evaluating the effectiveness of feature selection
for sentiment classification. The feature set consisted of all features (syntactic and
stylistic) since Experiment 1a had already demonstrated the superior performance of
using syntactic and stylistic features in unison. We compared the EWGA feature selection
approach to no selection/reduction (baseline), feature selection using information gain
(IG), genetic algorithm (GA), and the SVM weights. Feature selection was performed on
the 1800 training reviews for each fold, while the remaining 200 were used to evaluate
the accuracy for that fold. Thus, the ideal set of features chosen using each selection
technique on the 1800 training reviews was used on the testing messages. Thus, IG was
applied to the training messages for each fold in order to rank and select the features for
that particular fold that would be used on the testing messages. For the GA and EWGA
wrappers, this meant that they were run using SVM with 10-fold cross validation on the
1800 reviews from each fold. The selected feature subset was then used for evaluating the
messages from that particular fold. The overall accuracy was computed as the average
accuracy across all 10 folds (as is standard when using cross-validation). Once again,
SVM was used to classify the message sentiments. The GA and EWGA were each run for
200 iterations, with a population size of 50 for each iteration, using a crossover
probability of 0.6 (Pc = 0.6) and a mutation probability of 0.01 (Pm = 0.01). These
parameter settings are consistent with prior GA research (Alexouda and Paparrizos, 2001;
Balakrishnan et al., 2004). The EWGA mutation operator constant was set to 0.1 (B =
0.1). For the SVM weight (SVMW) approach, we used the method proposed by Koppel
et al. (2002). We iteratively reduced the number of features for each class from 5,000 to
250 in increments of 250 (i.e., decreasing the overall feature set from 10,000 to 500). For each
iteration features were ranked based on the product of their average occurrence frequency
per document and their absolute SVM weight. For all experiments, the number of
features yielding the best result was reported for the SVMW feature selection method.
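A sketch of one reduction step of this SVMW heuristic is given below, assuming a linear SVM from scikit-learn and a document-feature matrix X; taking the top-ranked features separately for positive and negative weight signs is our reading of "for each class", and the names are illustrative.

import numpy as np
from sklearn.svm import LinearSVC

def svmw_reduce(X, y, n_per_class):
    # Train a linear SVM, score each feature by
    # (average per-document frequency) x |SVM weight|, and keep the top
    # n_per_class features for each polarity (positive vs. negative weight).
    clf = LinearSVC().fit(X, y)
    weights = clf.coef_.ravel()                    # one weight per feature (binary task)
    avg_freq = np.asarray(X.mean(axis=0)).ravel()  # average occurrence per document
    score = avg_freq * np.abs(weights)
    pos = [j for j in np.argsort(-score) if weights[j] > 0][:n_per_class]
    neg = [j for j in np.argsort(-score) if weights[j] < 0][:n_per_class]
    return sorted(pos + neg)

# A hypothetical driver would call svmw_reduce repeatedly, shrinking each
# class's list from 5,000 to 250 in steps of 250 and keeping the subset with
# the best cross-validation accuracy.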
Table 6.7 shows the results for the four feature reduction methods and the no feature
selection baseline applied to the movie reviews. The bottom half of Table 6.7 also
provides the results from prior document level sentiment classification studies conducted
on the same test bed. All four feature selection techniques improved the classification
accuracy over the baseline. Consistent with previous research (e.g., Koppel et al., 2002;
Mladenic et al., 2004), the SVM weights approach also performed well, outperforming
IG and GA. The EWGA had the best performance in terms of overall accuracy, resulting
in a 7% improvement in accuracy over the no feature selection baseline and a 2.5%-3%
improvement over the other feature selection methods. Furthermore, the EWGA was also
the most efficient in terms of the number of features used, improving accuracy while
utilizing a smaller subset of the initial feature set. EWGA based feature selection was
able to identify a more concise set of key features as compared to other selection
methods.
Table 6.7: Experiment 1b Results

Techniques                10-Fold CV   Bootstrap   Std. Dev.   # Features
Base                      87.95%       88.06%      4.133       26,870
IG                        89.85%       89.60%      2.631       2,314
GA                        90.05%       89.84%      2.783       2,011
SVMW                      90.20%       89.96%      2.124       2,000
EWGA                      91.70%       91.52%      2.843       1,748
Whitelaw et al., 2005     90.20%       -           -           49,911
Pang & Lee, 2004          87.20%       -           -           -
Mullen & Collier, 2004*   86.00%       -           -           -
Pang et al., 2002*        82.90%       -           -           -

* Applied to earlier version of data set containing 1,300 reviews.
In comparison with prior work, the results indicate that we were able to achieve
higher accuracy than many previous studies on the movie review data set. Most previous
work has had accuracy in the 80-90% range (Pang et al., 2002; Whitelaw et al., 2005)
while our performance was over 91% when using stylistic and syntactic features in
conjunction with EWGA. This is attributable to the prevalence of varying style markers
across sentiment classes as well as the use of feature selection to remove noise and isolate
the most effective sentiment discriminators. As noted by Whitelaw et al. (2005), the Pang
et al. (2002) and Mullen and Collier (2004) studies used an earlier, smaller version of the
test bed, and are therefore not directly comparable. Table 6.8 shows the pair wise t-tests
conducted to evaluate the statistical significance of the improved results using feature
selection (n=50, df=49). EWGA significantly outperformed all other techniques,
including the no feature selection baseline, IG, GA, and SVMW.
Table 6.8: P-Values for Pair Wise t-tests on Accuracy (n=50, df=49)

Techniques       P-Values
Base vs. IG      <0.0001*
Base vs. GA      <0.0001*
Base vs. EWGA    <0.0001*
Base vs. SVMW    <0.0001*
IG vs. GA        0.1028
IG vs. EWGA      <0.0001*
IG vs. SVMW      0.1129
GA vs. EWGA      <0.0001*
GA vs. SVMW      0.3742
SVMW vs. EWGA    <0.0001*

* P-values significant at alpha = 0.05
6.6.1.3 Results Discussion
Figure 6.3 shows some of the important stylistic features for the movie review data
set. The diagram to the left shows the normalized average feature usage across all
positive and negative reviews. The table to the right shows the description for each
feature as well as its IG and SVM weight. The positive movie reviews in our data set tend
to be longer in terms of total number of characters and words (feat.1-2). These reviews
also have higher vocabulary richness, based on the various richness formulas that
measure the uniqueness of words in a document, such as Simpson’s D, Brunet’s W,
Honore’s R, and Yule’s K (feat. 3-6). The negative reviews have greater occurrence of the
function words “no” and “if.”
[Figure 6.3 shows a bar chart of the normalized average usage of features 1-8 across positive and negative movie reviews, alongside the following table of feature descriptions with their IG and SVM weights.]

Feat.   Description   IG      SVM
1       total char.   0.014   -0.163
2       total words   0.012   -0.164
3       Simpson       0.028   -0.430
4       Brunet        0.012   -0.121
5       Honore        0.012   -0.161
6       Yule          0.016   -0.116
7       no            0.025    0.252
8       if            0.018    0.249

Figure 6.3: Key Stylistic Features for Movie Review Data Set
6.6.2 Experiment 2: Online Discussion Forum
We conducted two experiments to evaluate the effectiveness of our features as well as
feature selection methods for sentiment classification of web forum postings. Once again,
SVM was run using 10-fold cross-validation, with 900 messages used for training and
100 for testing in each fold. Bootstrapping was performed by randomly selecting 50
messages for testing and the remaining 950 for training, 50 times. In Experiment 2a we
evaluated the effectiveness of syntactic and stylistic features. Experiment 2b focused on
evaluating the effectiveness of feature selection for forum sentiment analysis.
6.6.2.1 Test Bed
Our test bed consists of a web forum that belongs to the Libertarian National Socialist
Green Party (LNSG). We randomly selected 1,000 polar messages, which were manually tagged.
The polarized messages represented those in favor of (agonists) and against (antagonists)
a particular topic. The number of messages used is consistent with previous classification
studies (Pang et al., 2002). In accordance with previous sentiment classification
experiments, a maximum of 30 messages was used from any single author. This was done
in order to ensure that sentiments were being classified as opposed to authors. For the
sake of simplicity, from here on we will refer to agonistic messages as “positive” and
antagonistic messages as “negative” as these terms are more commonly used to represent
the two sides in most previous sentiment analysis research. Here, we use the terms
positive and negative as indicators of semantic orientation with respect to the specific
topic, however the “positive” messages may also contain sentiments about other topics
(which may be positive or negative) as described by Wiebe et al. (2005). This is similar to
the document level annotations used for product and movie reviews (Pang et al., 2002; Yi
et al., 2003). Using two human annotators, 500 positive (agonistic) and 500 negative
(antagonistic) sentiment messages were incorporated. The message annotation task by the
independent coders had a Kappa (k) value of 0.90, which is considered to be reliable.
6.6.2.2 Experiment 2a: Evaluation of Features
In our first experiment, we repeated the feature set tests previously performed on the
movie review data set in Experiment 1a. Once again, the three permutations of stylistic
and syntactic features were used. Table 6.9 shows the results for the three feature sets.
The best classification accuracy results using SVM were achieved when using both
syntactic and stylistic features. The combined feature set statistically outperformed the
use of only syntactic or stylistic features across both data sets.
Table 6.9: Experiment 2a Results

Features                10-Fold CV   Bootstrap   Standard Dev.   # Features
Stylistic               71.40%       71.08%      3.324           867
Syntactic               87.00%       87.16%      2.439           12,014
Stylistic + Syntactic   90.60%       90.56%      2.042           12,881
Table 6.10: P-Values for Pair Wise t-tests on Accuracy (n=50)

Features               P-Values
Sty. vs. Syn.          <0.0001*
Sty. vs. Syn. + Sty.   <0.0001*
Syn. vs. Syn. + Sty.   <0.0001*

* P-values significant at alpha = 0.05
Table 6.10 shows the pair wise t-tests conducted on the bootstrap samples to evaluate
the statistical significance of the improved results using stylistic and syntactic features.
As expected, syntactic features outperformed stylistic features when both were used
alone. However, using both feature categories significantly outperformed the use of either
category individually. The results suggest that stylistic features are prevalent and
important in web discourse, even when applied to sentiment classification.
6.6.2.3 Experiment 2b: Evaluation of Feature Selection Techniques
This experiment was concerned with evaluating the effectiveness of feature selection
for sentiment classification of web forums. The same experimental settings as
Experiment 1b were used for all techniques. Table 6.11 shows the results for the four
feature reduction methods and the no feature selection baseline applied to the web forum
postings. All four feature selection techniques improved the classification accuracy over
the baseline. The EWGA had the best performance across both test beds in terms of
overall accuracy, resulting in a 3-4% improvement in accuracy over the no feature
selection baseline. Furthermore, the EWGA was also the most efficient in terms of the
number of features used, improving accuracy while utilizing a smaller subset of the initial
feature sets. EWGA based feature selection was able to identify a more concise set of key
features that was 50%-70% smaller than IG and SVMW and 75%-90% smaller than the
baseline. GA also used a smaller number of features however the use of EWGA resulted
in considerably improved accuracy.
Table 6.12 shows the pair wise t-tests conducted on the bootstrap values to evaluate
the statistical significance of the improved results using feature selection. EWGA
outperformed the Baseline and GA for both data sets significantly. In addition, EWGA
provided significantly better performance than IG and SVMW.
Table 6.11: Experiment 2b Results

Technique   10-Fold CV   Bootstrap   Standard Dev.   # Features
Base        90.60%       90.56%      1.831           12,881
IG          91.10%       91.16%      1.564           1,055
GA          90.90%       90.64%      1.453           505
SVMW        91.15%       91.20%      1.656           1,000
EWGA        92.80%       92.84%      1.458           508
Table 6.12: P-Values for Pair Wise t-tests on Accuracy (n=50, df=49)

Techniques       P-Values
Base vs. IG      <0.0384*
Base vs. GA      0.1245
Base vs. EWGA    <0.0001*
Base vs. SVMW    <0.0369*
IG vs. GA        0.0485*
IG vs. EWGA      <0.0001*
IG vs. SVMW      0.2934
GA vs. EWGA      <0.0001*
GA vs. SVMW      0.0461*
SVMW vs. EWGA    <0.0001*

* P-values significant at alpha = 0.05
6.6.3 Results Discussion
Figure 6.4 shows the selection accuracy and number of features selected (out of over
12,800 potential features) for the web forum using EWGA as compared to GA across the
200 iterations (average of 10 folds). The EWGA accuracy declines initially despite being
seeded with the IG solution. This is due to the use of generational replacement, which prevents
prevents the IG solution from dominating the other solutions and creating a stagnant
solution population. As intended, the IG solution features are gradually disseminated to
the remaining solutions in the population until the new solutions begin to improve in
accuracy around the 20th iteration. Overall, the EWGA is able to converge on an
improved solution while only using half of the features originally transferred from IG. It
is interesting to note that EWGA and GA both converge to a similar number of features;
however the EWGA is better able to isolate the more effective sentiment discriminators.
[Figure 6.4 shows two line charts for the U.S. forum, plotted over the 200 iterations (average of 10 folds): “Selection Accuracy: U.S. Forum” (accuracy vs. iteration) and “Features Selected: U.S. Forum” (number of features vs. iteration), each comparing GA and EWGA.]
Figure 6.4: U.S. Forum Results using EWGA and GA
6.6.3.1 Analysis of Key Sentiment Features
We chose to analyze the EWGA features since they provided the highest performance
with the most concise set of features. Thus, the EWGA selected features are likely to be
the most significant discriminators with the least redundancy. Figure 6.5 shows some of
the important stylistic features for the web forum data set. The diagram to the left shows
the normalized average feature usage across all positive and negative sentiment
messages. The table to the right shows the description for each feature as well as its IG
and SVM weight.
[Figure 6.5 shows a bar chart of the normalized average usage of features 1-8 across positive and negative sentiment messages in the U.S. forum, alongside the following table of feature descriptions with their IG and SVM weights.]

Feat.   Description    IG      SVM
1       total char.    0.027    0.243
2       $              0.029    0.130
3       &              0.017    0.141
4       {              0.022    0.126
5       digit count    0.015    0.316
6       therefore      0.021   -0.104
7       however        0.017   -0.120
8       nevertheless   0.014   -0.119
Figure 6.5: Key Stylistic Features for U.S. Forum
The positive sentiment messages (agonists, in favor of racial diversity) tend to be
considerably shorter (feat. 1), containing a few long sentences. These messages also
feature heavier usage of conjunctive function words such as “however”, “therefore”, and
“nevertheless” (feat. 6-8). In contrast, the negative sentiment messages are nearly twice
as long and contain many digits (feat. 5) and special characters (feat. 2-4). Higher digit
usage in the negative messages is due to references to news articles used to stereotype.
Article snippets begin with a date, resulting in the higher digit count. The negative
messages also feature shorter sentences. The stylistic feature usage statistics suggest that
the positive sentiment messages follow more of a debating style with shorter, well-structured arguments. In contrast, the negative sentiment messages tend to contain greater
signs of emotion. The following verbal joust between two members in the U.S. forum
exemplifies the stylistic differences across sentiment classes. It should be noted that some
of the content in the messages has been sanitized for vulgar word usage; however the
stylistic tendencies that are meant to be illustrated remain unchanged.
Negative:
You’re a total %#$*@ idiot!!! You walk around thinking you’re doing humanity a favor.
Sympathizing with such barbaric slime. They use your sympathy as an excuse to fail. They are a
burden to us all!!! Your opinion means nothing.
Positive:
Neither does yours. But at least my opinion is an educated and informed one backed by well-reasoned arguments and careful skepticism about my assumptions. Race is nothing more than a social
classification. What have you done for society that allows you to deem others a burden?
6.7 Conclusions and Future Directions
In this chapter we applied sentiment classification methodologies to online reviews
and forum postings. In addition to syntactic features, a wide array of stylistic attributes
including lexical, structural, and function word style markers were included. We also
developed the Entropy Weighted Genetic Algorithm (EWGA) for efficient feature
selection in order to improve accuracy and identify key features for each sentiment class.
EWGA significantly outperformed the no feature selection baseline and GA on all test
beds. It also significantly outperformed IG and SVMW on both data sets while isolating a
smaller subset of key features. EWGA demonstrated the utility of these key features in
terms of classification performance and for content analysis. Analysis of EWGA selected
stylistic and syntactic features allowed greater insight into writing style and content
differences across sentiment classes. Our approach of using stylistic and syntactic
features in conjunction with the EWGA feature selection method achieved a high level of
accuracy suggesting that these features and techniques may be used in the future to
perform sentiment classification and content analysis of online content.
In the future we would like to evaluate the effectiveness of the proposed sentiment
classification features and techniques for other tasks such as sentence and phrase level
sentiment classification. We also intend to apply the technique to other sentiment
domains (e.g., news articles and product reviews). Moreover, we believe the suggested
feature selection technique may also be appropriate for other forms of text categorization
and plan to apply our technique to topic, style, and genre classification. We also plan to
investigate the effectiveness of other forms of GA hybridization, such as using SVM
weights instead of the IG heuristic.
CHAPTER 7: MINING ONLINE REVIEW SENTIMENTS USING FEATURE
RELATION NETWORKS
7.1 Introduction
In this chapter we propose a multivariate feature selection method capable of
identifying key n-grams for opinion classification. The proposed method is coupled with
a rich set of n-gram features for classification of movie and product reviews. Experiments
are conducted comparing the proposed features and feature selection technique against
existing feature sets and selection methods.
The Internet is rich in directional text (i.e., text containing opinions and emotions).
The web provides volumes of text-based data about consumer preferences, stored in
online review websites, web forums, blogs, etc. Knowledge discovery from these
vociferous archives can provide invaluable insight into
consumer preferences and needs. Sentiment analysis has emerged as a method for mining
opinions from large text archives. It uses machine learning methods combined with
linguistic attributes/features in order to identify the sentiment polarity (e.g., positive,
negative, neutral) and intensity (e.g., low, medium, high) for a particular text.
There are several online applications of sentiment analysis. One such application is
enhancing the consumer shopping experience. Sentiment analysis can allow consumers to
see how certain products are perceived by existing customers (Turney and Littman,
2003). Sentiment analysis is also important for gathering marketing and competitive
business intelligence. Companies are increasingly interested in investigating how their
products and/or competitor products are perceived (Morinaga et al. 2002). This includes
analysis of consumer trends and behavior (Nasukawa and Nagano, 2001) as well as
investor opinions about a company (Das and Chen, 2007). Another important application
of sentiment analysis is user requirements gathering for product design. Designers seek to
gain consumer preference insights regarding product attributes and options via the
Internet (Green et al., 2001). Such information could be useful for improving existing
products or informing new product design.
In spite of its numerous functions, text sentiment analysis is a challenging problem. It
requires the use of large quantities of linguistic features (Argamon et al., 2007; Abbasi
and Chen, 2008). Various types of n-gram features have emerged for capturing sentiment
cues in text. However, few studies have attempted to integrate these heterogeneous n-gram categories into a single feature set. This is because using many n-gram categories in
unison leads to larger feature quantities which can introduce several problems. Noise and
redundancy in the feature space increase the likelihood of over-fitting. It also prevents
many quality features from being incorporated due to computational limitations, resulting
in diminished performance. Furthermore, these larger feature spaces, spanning hundreds
of thousands of features, make many powerful feature selection methods infeasible.
In this essay we propose the use of a rich set of n-gram features spanning character,
word, part-of-speech tag, syntactic, and semantic n-gram categories. The proposed
feature set includes many fixed and variable n-grams. We couple the extended feature set
with a feature selection method capable of efficiently identifying an enhanced subset of
n-grams for opinion classification. The proposed feature relation network (FRN) is a rule-based multivariate n-gram feature selection technique that efficiently removes redundant
or less useful n-grams, allowing for more effective n-gram feature sets. Experimental
results reveal that the extended feature set and proposed feature selection method can
improve opinion classification performance over existing n-gram feature sets and
selection methods.
The remainder of this paper is organized as follows. Section 7.2 provides a review of
related work on features and feature selection methods for sentiment analysis. Based on
this review, Section 7.3 describes research gaps and questions addressed in this essay.
Section 7.4 provides our research design. Section 7.5 includes an experimental evaluation
of the proposed features and selection method in comparison with existing feature sets
and feature selection techniques. Section 7.6 outlines conclusions and future directions.
7.2 Related Work
Opinion mining involves several important tasks. Two such tasks are sentiment
polarity and intensity assignment (Hu and Liu, 2004; Popescu and Etzioni, 2005).
Polarity assignment is concerned with determining the polarity of sentiments, i.e.,
whether a text has a positive, negative, or neutral semantic orientation. Sentiment
intensity assignment looks at whether the positive/negative sentiments are mild or strong
(e.g., differentiating 1-2 star negative reviews or 4-5 star positive reviews). Hence, given
the two phrases “I don’t like you” and “I hate you,” both would be assigned a negative
semantic orientation but the latter would be considered more intense.
Effectively classifying sentiments entails the use of classification methods applied to
linguistic features. Classification methods are techniques capable of assigning appropriate
sentiment polarities and intensities. As input, they require linguistic features capable of
identifying and representing sentiments. The most popular class of features used for
opinion mining is n-grams (Ng et al., 2006; Abbasi and Chen, 2008). Larger n-gram
feature sets require the use of feature selection methods to select appropriate attribute
subsets. Next, we discuss various classification methods, n-gram features, and feature
selection techniques used for sentiment analysis.
7.2.1 Classification Methods for Sentiment Analysis
In this essay, our emphasis is on n-gram features and selection methods. Therefore,
our review of sentiment classification techniques is purposefully brief. Many
classification methods have been employed for opinion mining. Machine learning
methods such as Support Vector Machines (SVM), Winnow, and AdaBoost have been
shown to work well (Argamon et al., 2007). Specifically, Support Vector Machines
(SVM) has outperformed many comparison methods (Abbasi and Chen, 2008; Cui et al.,
2006). Based on prior sentiment analysis results, it provided better performance than
various techniques including Naïve Bayes, Decision Trees, Winnow, etc. (Pang and Lee,
2004; Cui et al., 2006; Abbasi and Chen, 2008).
7.2.2 N-Gram Features for Sentiment Analysis
N-gram features can be classified into two categories: fixed and variable. Fixed n-grams are exact sequences occurring at either the character or token level. In contrast,
variable n-grams are extraction patterns capable of representing more sophisticated
linguistic phenomena. A plethora of fixed and variable n-grams have been used for
opinion mining, including word, part-of-speech (POS), character, legomena, syntactic,
and semantic n-grams. These are described below.
Word n-grams include bag-of-words (called BOWs or word unigrams) and higher
order word n-grams (e.g., bigrams, trigrams, etc.). Word n-grams have been used
effectively in several studies (e.g., Pang et al., 2002). Studies typically only use up to
trigrams (Ng et al., 2006; Abbasi et al., 2008), although some have incorporated 4-grams
as well (e.g., Riloff et al., 2006). Word n-grams often provide a feature set foundation,
with additional feature categories added to them. Wiebe et al. (2004) used word n-grams
in conjunction with legomena n-grams for detecting subjective content in Wall Street
Journal articles. Argamon et al. (2007) added a semantic lexicon based on appraisal
groups to a large set of BOWs for movie review classification. Ng et al. (2006)
incorporated word n-grams with a set of polar adjectives for enhanced opinion
classification. Riloff et al. (2006) used word n-grams in combination with syntactic
phrase patterns. Many of these studies attained benchmark results for subjectivity or
sentiment polarity classification.
Part-of-speech (POS) tag n-grams are very useful for opinion classification.
Adjectives and adverbs have been shown to contain considerable sentiment polarity
information (Fei et al., 2004; Gamon, 2004). In addition to POS tag n-grams, some
studies have employed word plus part-of-speech (POSWord) n-grams. These n-grams
consider a word along with its POS tag (Wiebe et al., 2004). For example, the phrase
“quality of the” can be represented with the POS trigram “noun prep det” or the
POSWord trigram “quality-noun of-prep the-det.” POSWord n-grams are useful for
avoiding overly general POS n-grams by disambiguating word senses in situations
where a word may otherwise have several senses.
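For illustration, the following small sketch emits POS and POSWord n-grams from pre-tagged text; the POS tagger itself is assumed to be external, and the tag names follow the example above.

def pos_ngrams(tagged_tokens, n=3):
    # tagged_tokens: list of (word, tag) pairs from an external POS tagger.
    # Returns parallel lists of POS n-grams (e.g., "noun prep det") and
    # POSWord n-grams (e.g., "quality-noun of-prep the-det").
    pos_grams, posword_grams = [], []
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        pos_grams.append(" ".join(tag for _, tag in window))
        posword_grams.append(" ".join(f"{word}-{tag}" for word, tag in window))
    return pos_grams, posword_grams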
Character n-grams are letter sequences. For example, the word “like” can be
represented with the following two and three letter sequences “li, ik, ke, lik, ike.”
Character n-grams were previously used mostly for style classification, attempting to
differentiate document authorship (Peng et al., 2003). However, recently character level
bigrams and trigrams have been shown to be useful in related affect classification
research attempting to identify emotions in text (Abbasi et al., 2008).
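A one-line sketch of character n-gram extraction matching the example above (restricting to bigrams and trigrams is an assumption):

def char_ngrams(token, n_values=(2, 3)):
    # "like" -> ["li", "ik", "ke", "lik", "ike"]
    return [token[i:i + n] for n in n_values for i in range(len(token) - n + 1)]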
Legomena n-grams are collocations that replace once (hapax legomena) and twice
occurring words (dis legomena) with “HAPAX” and “DIS” tags. Hence, the trigram “I
hate Jim” would be replaced with “I hate HAPAX” provided “Jim” only occurs once in
the corpus. The intuition behind such collocations is to replace sparsely occurring words
with tags that allow the extracted n-grams to be more generalizable (Wiebe et al.,
2001; 2004). Such n-grams have been used for subjectivity classification (i.e.,
determining whether a text is subjective or objective) as well as affect classification
(Wiebe et al., 2004; Abbasi et al., 2008).
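A small sketch of this tagging step is given below; the HAPAX/DIS tag names follow the description above, and whitespace tokenization is an assumption.

from collections import Counter

def apply_legomena_tags(documents):
    # Replace once-occurring words with HAPAX and twice-occurring words with
    # DIS before n-gram extraction, so "I hate Jim" becomes "I hate HAPAX"
    # when "Jim" appears only once in the corpus.
    counts = Counter(word for doc in documents for word in doc.split())
    def tag(word):
        if counts[word] == 1:
            return "HAPAX"
        if counts[word] == 2:
            return "DIS"
        return word
    return [" ".join(tag(word) for word in doc.split()) for doc in documents]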
Syntactic phrase patterns are learned variable n-gram patterns (Riloff et al., 2006).
Riloff and Wiebe (2003) developed a set of syntactic templates and information
extraction patterns reflective of subjective content. Their tool uses predefined templates
and extracts all patterns (i.e., instantiations of those templates) with the greatest
occurrence difference across sentiment classes. For example, the template “<subj>
passive-verb” may produce the pattern “<subj> was satisfied” from the text. Other studies
have also utilized syntactic phrase patterns. Gamon (2004) used context free phrase
patterns taken from parse trees. For example, “DECL::NP VERB NP” represents a
declarative sentence consisting of a noun phrase, a verbal head, and a second noun
phrase. Such phrase patterns can provide useful sentiment analysis features by
representing syntactic phenomena difficult to capture using fixed word n-grams (Wiebe et
al., 2004).
Semantic phrase patterns typically use an initial set of terms or phrases which are
manually or automatically filtered and coded with semantic information (e.g., sentiment
polarity and/or intensity). Riloff et al. (2003) used semi-automatic lexicon generation
tools to construct sets of strong subjectivity, weak subjectivity, and objective nouns. Their
approach outperformed the use of other features, including bag-of-words, for
classification of objective versus subjective English documents. Argamon et al. (2007)
effectively used Appraisal Groups for annotating semantics to words/phrases. Initial term
lists were generated using WordNet, and then filtered manually to construct the lexicon.
Developed based on Appraisal Theory, each expression was manually classified into
various appraisal classes. These classes include attitude, orientation, graduation, and
polarity of phrases. Fei et al. (2004) derived phrase patterns from manually crafted sets of
positive and negative words. For example, the phrase “Roma was defeated” would be
assigned the pattern n+dj which signifies a noun followed by the negative adjective
“defeated.” Many studies have also used WordNet to generate semantic lexicons (Kim &
Hovy, 2004; Mishne, 2005). Burgun and Bodenreider (2001) used WordNet to generate
semantic word classes for disambiguation of medical documents.
Table 7.1 provides a summary of n-gram features used for opinion classification.
Based on the table, we can see that many n-gram categories have been used in prior
opinion mining research. However, few studies have employed large sets of
heterogeneous n-grams. As previously stated, most studies utilized word n-grams in
combination with one other category, such as POS tag, legomena, semantic, or syntactic
n-grams (e.g., Wiebe et al., 2004; Ng et al., 2006; Riloff et al., 2006; Argamon et al.,
2007; Abbasi and Chen, 2008).
Opinion mining could greatly benefit from feature selection methods capable of
identifying important n-grams and allowing the use of larger feature sets. For instance,
the popular 2,000 movie review test bed developed by Pang et al. (2002) has over 49,000
bag-of-words (Argamon et al., 2007). Higher order word n-gram feature spaces can be
even larger, with hundreds of thousands of potential attributes. For example, the short
sentence “I like chocolate.” contains 6 word n-grams, though many of them are
redundant. Adding additional n-gram categories could potentially improve the
representational richness of text sentiment information. However, an extended set of n-gram features requires feature selection methods to help manage the large feature spaces
created from the use of heterogeneous n-grams. As Riloff et al. (2006) noted, using
additional features without appropriate selection mechanisms is analogous to “throwing
the kitchen sink.” Various text feature selection methods are discussed below.
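Before turning to selection methods, the feature-space claim above can be made concrete. The short function below (a simple illustrative sketch, not tied to any particular toolkit) enumerates the six word n-grams in “I like chocolate.”

```python
def word_ngrams(tokens, max_n=3):
    """Enumerate all word n-grams up to length max_n."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

print(word_ngrams(["I", "like", "chocolate"]))
# ['I', 'like', 'chocolate', 'I like', 'like chocolate', 'I like chocolate']
```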
Table 7.1: Summary of N-Gram Features used for Sentiment Analysis

N-Gram Category | Examples | Select Prior Studies
Character N-Grams | q, u, qu, ua, al, li, qua, ual, ali | Abbasi et al., 2008
Word N-Grams | quality, quality of, quality of the | Morinaga et al., 2002; Pang et al., 2002; Wiebe et al., 2004; Ng et al., 2006; Das & Chen, 2007; Abbasi & Chen, 2008
POS Tag N-Grams | noun, noun prep, noun prep det | Pang et al., 2002; Gamon, 2004; Wiebe et al., 2004
Word/POS Tag N-Grams | quality-noun of-prep the-det | Wiebe et al., 2001
Legomena N-Grams | the UNIQUE, of the UNIQUE, different-adj U-noun | Wiebe et al., 2004; Abbasi et al., 2008
Syntactic Phrase Patterns | <subj> passive-verb; DECL::NP VERB NP; <subj> ActInfVP | Riloff & Wiebe, 2003; Gamon, 2004; Riloff et al., 2006
Semantic Phrase Patterns | SYN125 of the; strong-tyranny, weak-aberration; n+aj, n+dj, av+n; POSITIVE of the; APP/Appreciation:ORI/Negative | Burgun & Bodenreider, 2001; Riloff et al., 2003; Fei et al., 2004; Ng et al., 2006; Argamon et al., 2007
7.2.3 Feature Selection for Sentiment Analysis
Various automated and manual approaches have been used to craft sentiment
classification feature sets. However, little emphasis has been given to feature subset
selection techniques. Feature reduction/selection techniques have several important
benefits (Li et al., 2006). They can potentially improve classification accuracy. They can
also narrow in on a key feature subset of sentiment discriminators. Furthermore, feature
selection can provide greater insight into important class attributes, resulting in a better
understanding of positive and negative sentiment cues. There are two categories of
feature selection methods (Guyon et al., 2002): univariate and multivariate. Each category has its own advantages and disadvantages (Guyon and Elisseeff, 2003).
Univariate methods consider attributes individually. Examples include information
gain, chi-squared, log likelihood, and occurrence frequency (Forman, 2003). Univariate
methods are computationally more efficient. It is also easier to interpret the contribution
of individual attributes using univariate methods. However, only evaluating individual
attributes can also be a disadvantage since important attribute interactions are not
considered. Information gain has been shown to work well for various text categorization
tasks, including stylometric analysis (Juola and Baayen, 2003; Koppel and Schler, 2003)
and topic classification (Efron et al., 2003). Forman (2003) performed an extensive
empirical comparison of numerous univariate feature selection methods for topic
classification.
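As a rough illustration of univariate selection, the sketch below ranks binary n-gram presence features with a chi-squared score using scikit-learn; the tiny corpus, labels, and the choice of k are purely hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["great camera love the picture quality",
        "terrible battery would not recommend",
        "love it excellent value",
        "poor quality very disappointed"]
labels = [1, 0, 1, 0]                      # 1 = positive review, 0 = negative review

vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(docs)

selector = SelectKBest(chi2, k=10).fit(X, labels)   # each n-gram is scored independently
ranked = sorted(zip(selector.scores_, vectorizer.get_feature_names_out()), reverse=True)
print(ranked[:5])
```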
In contrast, multivariate methods consider attribute groups or subsets. These
techniques often use a wrapper model for attribute selection, where feature subsets are
evaluated as a group until some stopping condition is reached (Liu and Motoda, 1998).
Examples include decision tree models, recursive feature elimination, and genetic
algorithms. By performing group level evaluation, multivariate methods consider
attribute interactions. Consequently, these techniques are also computationally expensive,
relative to univariate methods. Furthermore it is also more difficult to interpret individual
feature weights. Decision tree models (DTM) use a wrapper where a DTM is built on the
training data and features incorporated by the tree are included in the feature set (Liu and
Motoda, 1998; Abbasi and Chen, 2008). Recursive feature elimination (Guyon et al.,
2002) has been used for topic (Mladenic et al., 2004) and gender classification (Koppel et
al., 2002). This method uses a wrapper model based on an SVM classifier. Each iteration,
the remaining features are ranked based on the absolute values of their SVM weights. A
certain number or percentage of features is retained for the next iteration (Guyon et al.,
2002).
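A minimal sketch of SVM-based recursive feature elimination is shown below; the step size, stopping point, and use of a dense matrix are illustrative assumptions rather than the settings of any particular study.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rfe_svm(X, y, step=5000, min_features=10000):
    """Iteratively drop the features with the smallest absolute SVM weights
    and return the indices of the surviving columns (illustrative sketch)."""
    remaining = np.arange(X.shape[1])
    while remaining.size > min_features:
        clf = LinearSVC(max_iter=5000).fit(X[:, remaining], y)
        weights = np.abs(clf.coef_).sum(axis=0)          # rank by absolute SVM weight
        drop = min(step, remaining.size - min_features)  # remove a block each iteration
        keep = np.argsort(weights)[drop:]                # lowest-ranked features are discarded
        remaining = np.sort(remaining[keep])
    return remaining
```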
Most opinion mining studies have used univariate feature selection methods such as
minimum frequency thresholds (i.e., selecting all features occurring n number of times).
Many studies used the log likelihood ratio (Yi et al., 2003; Gamon, 2004; Ng et al.,
2006). Using log likelihood, Gamon (2004) was able to improve accuracy and narrow in
on a key feature subset of sentiment discriminators. Wiebe et al. (2004) evaluated the
effectiveness of various potential subjective elements (PSEs) for subjectivity
classification based on their occurrence distribution across classes. Abbasi and Chen
(2008) used decision tree models to select key sentiment features for product review
opinion classification. Riloff et al. (2006) used feature subsumption hierarchies: a method intended to select the most suitable set of word n-grams and syntactic n-gram patterns.
This approach uses the idea of performance based feature subsumption to remove
redundant or irrelevant higher order n-grams (Riloff et al., 2006). For instance, only those
word bigrams and trigrams are retained which provide additional information (measured
using some univariate heuristic) over the unigrams they encompass. For example, the
word bigram I LIKE may be subsumed by the unigram LIKE, however BASKET CASE
may be retained since it contains important sentiment information not provided by
BASKET or CASE alone.
Table 7.2 shows select univariate and multivariate feature selection methods used for
text classification. Based on the table, we can see that opinion classification research has
made limited use of feature selection methods, particularly multivariate selection
methods. However, it is unclear how beneficial existing multivariate selection methods
may be, given the large potential feature spaces created using rich heterogeneous n-grams. Large scale feature selection requires addressing relevance and redundancy, something many existing methods fail to do (Yu and Liu, 2004). Redundancy is a big problem, since there are a finite number of attributes that can be incorporated and n-grams tend to be highly redundant by nature. Redundant features could occupy valuable
spots that could otherwise be utilized by features providing additional information and
discriminatory potential.
Table 7.2: Select Univariate and Multivariate Methods used for Text Classification

Category | Method | Text Classification Task | Study
Univariate Methods | Chi Squared | Topic Categorization | Forman, 2003
Univariate Methods | Correlation | Topic Categorization | Forman, 2003
Univariate Methods | Information Gain | Style Categorization | Juola & Baayen, 2003; Koppel et al., 2006
Univariate Methods | Log Likelihood Ratio | Topic Categorization; Opinion Categorization | Forman, 2003; Yi et al., 2003; Gamon, 2004; Ng et al., 2006
Multivariate Methods | Decision Tree Models | Opinion Categorization | Abbasi & Chen, 2008
Multivariate Methods | Feature Subsumption Hierarchy | Opinion Categorization | Riloff et al., 2006
Multivariate Methods | Genetic Algorithm | Style Categorization | Li et al., 2006
Multivariate Methods | Recursive Feature Elimination | Style Categorization; Topic Categorization | Koppel et al., 2002; Mladenic et al., 2004
7.3 Research Gaps and Questions
Based on our review of features and feature selection methods for sentiment analysis, we have identified the following research gaps and questions.
7.3.1 Research Gaps
Most studies have used limited sets of n-gram features, typically employing one or
two categories (e.g., Pang et al., 2002; Ng et al., 2006). Although numerous n-gram
feature categories have been developed, few studies have attempted to combine these rich
heterogeneous n-gram groups into a single classifier. This is partially due to the
computational difficulties associated with incorporating larger sets of n-grams and
performance degradation stemming from noisy feature sets. The lack of appropriate
feature selection methods capable of handling large sets of n-grams is another issue, as
discussed below. Consequently many studies have relied solely on word n-grams.
Feature selection has seen limited usage, despite its usefulness in related text
categorization studies. Most studies have employed methods such as frequency
thresholds or univariate selection methods. Powerful multivariate methods have rarely
been employed. However, it is unclear if existing multivariate methods are suitable for
selecting feature subsets from hundreds of thousands of potential n-grams since these
methods have typically been applied to smaller feature sets (e.g., Guyon et al., 2002; Li et
al., 2006).
7.3.2 Research Questions
We present the following research questions:
• Can the use of an extended n-gram feature set improve opinion classification performance?
  o Over traditional feature sets such as bag-of-words and word n-grams.
• How can the use of a rule-based multivariate feature selection method applied to the extended n-gram feature set further enhance performance?
  o Compared to generic univariate and multivariate feature selection methods.
• What impact will different feature quantities have on the performance of different feature sets and selection methods?
Figure 7.1: Sentiment Analysis Research Design
7.4 Research Design
Figure 7.1 shows our research design. We propose the use of a rich set of n-gram
features, coupled with the Feature Relation Network (FRN) for enhanced sentiment
intensity and polarity classification performance. We intend to compare the extended
feature set against a bag-of-words baseline and word n-grams, which have been shown to
be highly effective in prior sentiment analysis research (Ng et al., 2006). The proposed
FRN feature selection method will be compared against various univariate and
multivariate selection techniques used in prior research, including log likelihood ratio,
decision tree models, and recursive feature elimination. The extended feature set and
FRN feature selection method are discussed in the remainder of this section.
7.4.1 Extended N-Gram Feature Set
We incorporate a rich set of n-gram features, comprised of all the categories discussed
in the literature review (i.e., word, POS, POSWord, legomena, syntactic, and semantic n-grams). The feature set is shown in Table 7.3. The syntactic n-grams were derived using
the Sundance package developed as a collaboration between the Universities of Utah and
Pittsburgh (Riloff et al., 2003; 2006). This tool extracts n-gram instantiations of
predefined pattern templates. Sundance learns n-grams which have the greatest
occurrence difference across user defined classes. For instance, the n-gram “endorsed
<dobj>” is generated from the pattern template “ActVP <dobj>”. The semantic n-grams
were derived using WordNet, following an approach similar to that used by Kim and
Hovy (2004) and Mishne (2005). Words are clustered into semantic categories based on the number of common items in their synsets. New words are added to the cluster with the highest percentage of synonyms in common, provided the percentage is above a certain threshold. Otherwise, the word is added to a new cluster.
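A rough sketch of this synset-overlap clustering is shown below, assuming NLTK's WordNet interface and a hypothetical overlap threshold of 0.3; it is meant only to illustrate the general idea, not to reproduce the exact procedure used here.

```python
from nltk.corpus import wordnet as wn

def synonyms(word):
    return {lemma.lower() for syn in wn.synsets(word) for lemma in syn.lemma_names()}

def cluster_words(words, threshold=0.3):
    clusters = []                                  # each entry: [member words, synonym pool]
    for word in words:
        syns = synonyms(word)
        best, best_overlap = None, 0.0
        for cluster in clusters:
            overlap = len(syns & cluster[1]) / max(len(syns), 1)
            if overlap > best_overlap:
                best, best_overlap = cluster, overlap
        if best is not None and best_overlap >= threshold:
            best[0].add(word)                      # join the closest existing cluster
            best[1].update(syns)
        else:
            clusters.append([{word}, set(syns)])   # otherwise start a new cluster
    return [members for members, _ in clusters]

print(cluster_words(["film", "movie", "camera", "lens"]))
```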
Table 7.3: N-Gram Feature Set

Label | Description | Examples
N-Char | Character-level n-grams | 1-Char: I, L, O, V, E, C, H, O, C, O, L, A, T, E; 2-Char: LO, OV, VE, CH, HO, OC, CO, OL, LA, AT; 3-Char: LOV, OVE, CHO, HOC, OCO, COL, OLA
N-Word | Word-level n-grams | 1-Word: I, LOVE, CHOCOLATE; 2-Word: I LOVE, LOVE DARK, DARK CHOCOLATE; 3-Word: I LOVE DARK, LOVE DARK CHOCOLATE
N-POS | Part-of-speech tag n-grams | 1-POS: I, ADMIRE_VBP, NN; 2-POS: I ADMIRE_VBP, ADMIRE_VBP NN; 3-POS: I ADMIRE_VBP NN
N-POSWord | Word and POS tag n-grams | 1-POSWord: I I, LOVE ADMIRE_VBP, CHOCOLATE NN; 2-POSWord: I I LOVE ADMIRE_VBP; 3-POSWord: I I LOVE ADMIRE_VBP CHOCOLATE NN
N-Legomena | Hapax legomena and dis legomena n-grams | 2-Legomena: LOVE DIS; 3-Legomena: I LOVE DIS
N-Semantic | Semantic class n-grams | 1-Semantic: SYN-Pronoun, SYN-Affection, SYN-Candy; 2-Semantic: SYN-Pronoun SYN-Affection; 3-Semantic: SYN-Pronoun SYN-Affection SYN-Candy
IEP-A/E | Information extraction patterns | IEP-A: <possessive> NP, <subj> AuxVP AdjP, <subj> AuxVP Dobj, ActVP <dobj>, ActVP Prep <np>, NP Prep <np>, PassVP Prep <np>, Subj AuxVP <dobj>; IEP-B: <subj> PassVP, InfVP Prep <np>, InfVP <dobj>; IEP-C: <subj> ActVP; IEP-D: <subj> ActVP Dobj; IEP-E: <subj> ActInfVP, <subj> PassInfVP, ActInfVP <dobj>, PassInfVP <dobj>
7.4.2 Feature Relation Network
Whenever possible, domain knowledge should be incorporated into the feature
selection process (Guyon and Elisseeff, 2002). For text n-grams, the relationship between
n-gram categories can facilitate enhanced feature selection by considering relevance and
redundancy, two factors critical to large scale feature selection (Yu and Liu, 2004). We
propose a rule-based multivariate feature selection method that leverages the
relationships between n-gram features in order to efficiently remove redundant and
irrelevant n-gram features. Comparing all features within a feature set directly with one
another can be an arduous endeavor unless the relationship between features can be
leveraged for efficient comparison between only some logical subset of attributes. Given
large quantities of heterogeneous n-gram features, the Feature Relation Network utilizes
two important n-gram relations: subsumption and parallel relations. These two relations
enable intelligent comparison between features in a manner that facilitates enhanced
removal of redundant and/or irrelevant n-grams.
7.4.2.1 Subsumption Relations
The notion of subsumption was originally proposed by Riloff et al. (2006). A
subsumption relation occurs between two n-gram feature categories where one category
is a more general, lower-order form of the other (Riloff et al., 2006). A subsumes B
(AB) if B is a higher order n-gram category whose n-grams contain the lower order ngrams found in A. For example, word unigrams subsume word bigrams and trigrams,
while word bigrams subsume word trigrams (as shown in Figure 7.2). Given the sentence
“I love chocolate,” there are 6 word n-grams: I, LOVE, CHOCOLATE, I LOVE, LOVE
CHOCOLATE, and I LOVE CHOCOLATE. The unigram LOVE is obviously important,
generally conveying positive sentiment. However what about the bigrams and trigrams?
It depends on their weight, as defined by some heuristic (e.g., log likelihood or
information gain). We only wish to keep higher order n-grams if they are adding
additional information greater than that conveyed by the unigram LOVE. Hence, given
A→B, we keep features from category B if their weight exceeds that of their general
lower-order counterparts found in A by some threshold t (Riloff et al., 2006). For
instance, the bigrams I LOVE and LOVE CHOCOLATE would only be retained if their
weight exceeded that of the unigram LOVE by t (i.e., if they provided additional
information over the more general unigram). Similarly, the trigram I LOVE
CHOCOLATE would only be retained if its weight exceeded that of the unigram LOVE
and any remaining bigrams (e.g., I LOVE and LOVE CHOCOLATE) by t.
Figure 7.2: Subsumption Relations between Word N-Grams
7.4.2.2 Parallel Relations
A parallel relation occurs when two heterogeneous, same-order n-gram feature groups may have some features with similar occurrences. For example, word unigrams (1-Word)
can be associated with many POS tags (1-POS), and vice versa. However, certain word
and POS tags’ occurrences may be highly correlated. Similarly, some POS tags and
semantic class unigrams may be correlated if they are used to represent the same words.
For example, the POS tag ADMIRE_VBP and the semantic class SYN-Affection both
represent words such as “like” and “love.” Given two n-gram feature groups with
potentially correlated attributes, A is considered to be parallel to B (A—B). If two
features from these categories A and B, respectively, have a correlation coefficient greater
than some threshold p, one of the attributes is removed to avoid redundancy. Figure 7.3
shows some examples of bigram categories with parallel relations.
Correlation is a commonly used method for feature selection (Forman, 2003).
However, correlation is generally used as a univariate method by comparing the
occurrences of an attribute with the class labels across instances (Forman, 2003).
Comparing attribute correlations with one another (multivariate feature selection) could
remove redundant attributes. However, comparing every attribute with one another is
computationally infeasible. The FRN allows the incorporation of correlation information
by only comparing select n-grams (ones from parallel relation categories within the
feature relation network).
Figure 7.3: Parallel Relations between Various Bigrams
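The sketch below illustrates how a single parallel-relation check might look; the feature vectors and weights are illustrative, and the threshold value (p = 0.90) follows the setting reported later in this chapter.

```python
import numpy as np

def prune_parallel(col_a, col_b, weight_a, weight_b, p=0.90):
    """If two same-order n-grams from parallel categories are highly correlated
    across documents, drop the one with the lower weight."""
    corr = np.corrcoef(col_a, col_b)[0, 1]          # Pearson correlation across documents
    if corr >= p:
        return "drop_b" if weight_a >= weight_b else "drop_a"
    return "keep_both"

love   = np.array([1, 1, 0, 1, 0, 0])               # presence of the word unigram LOVE
admire = np.array([1, 1, 0, 1, 0, 0])               # presence of the POS tag ADMIRE_VBP
print(prune_parallel(love, admire, weight_a=0.42, weight_b=0.31))   # 'drop_b'
```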
7.4.2.3 The Complete Network
Figure 7.4 shows the entire FRN, comprised of the nodes previously described in
Table 7.3. The network encompasses 23 n-gram feature category nodes and numerous
subsumption and parallel relations between these nodes.
Figure 7.4: The Feature Relation Network
The detailed list of relations is presented in Table 7.4. The order in which the relations
are applied is important to ensure that the redundant and irrelevant attributes are removed
correctly. For instance, Riloff et al. (2006) used a feature subsumption hierarchy where
higher order n-grams were compared against one another. The remaining higher order n-grams (i.e., bigram and trigram features) were then evaluated in comparison with unigrams. In the FRN, subsumption relations are applied prior to parallel relations. Furthermore, subsumption relations between n-gram groups within a feature category are applied prior to across-category relations (i.e., 1-Word→2-Word is applied prior to 1-Word→1-POSWord). Table 7.4 presents the various relations in the order that they are applied. Relations within a feature group are applied in the order listed in the “Relations” column, moving from left to right. For instance, 1-Word→2-Word is applied prior to 2-Word→3-Word.
Table 7.4: List of Relations between N-Gram Feature Groups

Relation Order 1 (subsumption relations within a category):
- N-Char: 1-Char→2-Char, 1-Char→3-Char, 2-Char→3-Char
- N-Word: 1-Word→2-Word, 1-Word→3-Word, 2-Word→3-Word
- N-POS: 1-POS→2-POS, 1-POS→3-POS, 2-POS→3-POS
- N-POSWord: 1-POSWord→2-POSWord, 1-POSWord→3-POSWord, 2-POSWord→3-POSWord
- N-Legomena: 2-Legomena→3-Legomena
- N-Semantic: 1-Semantic→2-Semantic, 1-Semantic→3-Semantic, 2-Semantic→3-Semantic
- IEP-A/E: 1-Word→IEP-A, 1-Word→IEP-C, IEP-C→IEP-D, 2-Word→IEP-B, 3-Word→IEP-E, IEP-B→IEP-E

Relation Order 2 (subsumption relations across categories):
- Char-Word: 1-Char→1-Word, 2-Char→1-Word, 3-Char→1-Word
- Word-POSWord: 1-Word→1-POSWord, 2-Word→2-POSWord, 3-Word→3-POSWord
- POS-POSWord: 1-POS→1-POSWord, 2-POS→2-POSWord, 3-POS→3-POSWord
- Word-Legomena: 1-Word→2-Legomena, 2-Word→3-Legomena

Relation Order 3 (parallel relations):
- Word-POS: 1-Word — 1-POS, 2-Word — 2-POS, 3-Word — 3-POS
- Word-Semantic: 1-Word — 1-Semantic, 2-Word — 2-Semantic, 3-Word — 3-Semantic
- POS-Semantic: 1-POS — 1-Semantic, 2-POS — 2-Semantic, 3-POS — 3-Semantic
- POSWord-Semantic: 1-POSWord — 1-Semantic, 2-POSWord — 2-Semantic, 3-POSWord — 3-Semantic
Figure 7.5 describes the FRN algorithm details. Given feature a from category A, we
first find the feature categories that are subsumed by A (based on the precedence
defined in Table 7.4). Then, all features from these categories containing the substring a
and having the same semantic orientation are retrieved. The semantic orientation of a
feature is defined as the class for which that attribute has the highest weight. Feature
weights can be computed using any univariate heuristic. Here we employ the weighted
log likelihood ratio (WLLR) since it worked well in prior opinion classification research
(e.g., Yi et al., 2003; Ng et al., 2006). Given 5 classes (e.g., 1-5 star reviews), the word
HATE is likely to have the highest WLLR for class 1 or 2. The semantic orientation of
features is compared to avoid having features such as DON’T LIKE get subsumed by the
unigram LIKE (since the two features have an opposing semantic orientation). The
weights for the retrieved features are compared against that of a, and only those features whose weight exceeds that of a by some threshold t are retained.
The parallel relations are enforced as follows. Given feature a from category A, we
find the feature categories that are parallel to A. Features from these categories with
potential co-occurrence with a are retrieved. The correlation coefficient for these features
is computed in comparison with a. If the coefficient is greater than or equal to some
threshold p, one of the features is removed. We remove the feature with the lower weight
(ties are broken arbitrarily). It is important to note that for subsumption and parallel
relations, only features still remaining in the feature set are analyzed and/or retrieved
(i.e., ones with a weight greater than 0).
Let A = {a1, a2, ..., an} and B = {b1, b2, ..., bm} denote two sets of n-grams (e.g., 1-Word).

If A → B (A subsumes B):
    For each a_x in A with w(a_x) > 0:
        Let C ⊆ B be the set of features c_x with w(c_x) > 0 such that a_x is a part of c_x
        For each c_x in C:
            If s(a_x) = s(c_x)            // the two features have the same semantic orientation
                If w(a_x) ≥ w(c_x) − t, then w(c_x) = 0

If A — B (A is parallel to B):
    For each a_x in A with w(a_x) > 0:
        Let C ⊆ B be the set of features c_x with w(c_x) > 0 that are potentially correlated with a_x
        For each c_x in C:
            If Corr(a_x, c_x) ≥ p:
                If w(a_x) ≥ w(c_x), then w(c_x) = 0
                If w(a_x) < w(c_x), then w(a_x) = 0

where Corr(a, b) is the correlation coefficient for features a and b across the m training instances:

Corr(a, b) = \frac{\sum_{x=1}^{m}(a_x - \bar{a})(b_x - \bar{b})}{\sqrt{\sum_{x=1}^{m}(a_x - \bar{a})^2 \sum_{x=1}^{m}(b_x - \bar{b})^2}}

w(a_x) is the maximum weighted log-likelihood ratio (WLLR) for feature a_x across classes s, and s(a_x) is the class attaining that maximum:

w(a_x) = \max_s \left[ P(a \mid s)\log\frac{P(a \mid s)}{P(a \mid \neg s)} \right], \qquad s(a_x) = \arg\max_s \left[ P(a \mid s)\log\frac{P(a \mid s)}{P(a \mid \neg s)} \right]

t and p are predefined thresholds (we used t = 0.05 and p = 0.90).
Figure 7.5: The FRN Algorithm
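The following Python sketch mirrors the logic of Figure 7.5. It is illustrative only: features are plain strings, w and s hold precomputed WLLR weights and semantic orientations, X maps each feature to its binary presence vector across training documents, and the Laplace smoothing inside wllr() is an added assumption to avoid zero probabilities.

```python
import numpy as np

def wllr(presence, labels):
    """Weighted log-likelihood ratio of a binary feature; returns (w(a), s(a)).
    Laplace smoothing is an illustrative addition, not part of the original formula."""
    scores = {}
    for c in set(labels):
        in_c  = [x for x, y in zip(presence, labels) if y == c]
        out_c = [x for x, y in zip(presence, labels) if y != c]
        p_in  = (sum(in_c) + 1) / (len(in_c) + 2)
        p_out = (sum(out_c) + 1) / (len(out_c) + 2)
        scores[c] = p_in * np.log(p_in / p_out)
    best = max(scores, key=scores.get)
    return scores[best], best

def apply_subsumption(A, B, w, s, t=0.05):
    """A subsumes B: zero out higher-order n-grams in B that add less than t
    over a lower-order counterpart in A with the same semantic orientation."""
    for a in A:
        if w[a] <= 0:
            continue
        for b in B:
            if w[b] > 0 and a in b and s[a] == s[b] and w[a] >= w[b] - t:
                w[b] = 0.0

def apply_parallel(A, B, w, X, p=0.90):
    """A is parallel to B: of two highly correlated features, keep the heavier one."""
    for a in A:
        for b in B:
            if w[a] > 0 and w[b] > 0 and np.corrcoef(X[a], X[b])[0, 1] >= p:
                if w[a] >= w[b]:
                    w[b] = 0.0
                else:
                    w[a] = 0.0
```

Applying these two functions over the relation list in Table 7.4, in the stated order, would yield the pruned feature set that FRN passes on to the classifier.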
Figure 7.6 shows an illustration of the Feature Relation Network applied to a 6
sentence test bed (3 positive and 3 negatively oriented sentences). The table in the bottom
left corner shows the feature weights for many key categories (e.g., word, POS, and
semantic n-grams). This table shows the initial weights computed using the weighted log
likelihood ratio (across our 6 sentence sample test bed). It also shows the FRN weights:
removed features get a weight of 0. The FRN is able to remove redundant or less useful
n-grams, keeping only 4 of the top 12. For example, the bigram I LOVE gets subsumed
by the unigram LOVE. Similarly, the semantic class unigram SYN-Affection is parallel
to the POS tag ADMIRE_VBP and therefore removed. Details for each removed n-gram
are provided in the FRN on the right hand side of the diagram. The removed n-grams are
placed next to the subsumption or parallel relation responsible for their removal. These
features correspond to the features whose FRN weight is 0.
Figure 7.6: Example Application of FRN to Six Sentence Test Bed
7.5 Experiments
We conducted opinion classification experiments on three review test beds, shown in
Table 7.5. The first contained digital camera reviews collected from Epinions. This test
bed featured one thru five star reviews. We only used whole star reviews (i.e., no half star
reviews were included). The second test bed encompassed automobile reviews taken
from Edmunds. These reviews were on a continuous 10 point scale. We discretized them
into 5 classes by taking all odd integer reviews. For example, all reviews between 1.0-1.99 were assigned 1 star while reviews between 3.0-3.99 were considered 2 star reviews.
The third test bed was a benchmark movie review data set developed by Pang et al.
(2002). This data set contains reviews taken from Rotten Tomatoes that are either positive
or negative. For each test bed, we used a total of 12,000 reviews.
Table 7.5: Descriptions of Online Review Test Beds

Review Test Bed | Source | # Reviews | # Classes | Class Descriptions
Digital Cameras | www.epinions.com | 12,000 | 5 | 1-5 Stars
Automobiles | www.edmunds.com | 12,000 | 5 | 1,3,5,7,9 Stars
Movies | www.rottentomatoes.com | 12,000 | 2 | Positive and Negative
For each test bed, we ran two experimental settings. In setting A we performed 5-fold
cross validation on 2,000 reviews. These reviews were balanced across classes. Hence,
there were 400 reviews per class for the digital camera and automobile test beds and
1,000 reviews per class for the movie review test bed. In setting B, we used these 2,000
reviews for training, and tested on a second set of 10,000 reviews. This setting was
included in order to allow statistical testing on the results, since the training and testing
data was independent in setting B (Ng et al., 2006; Das and Chen, 2007). All experiments
were run using a linear kernel SVM classifier. Feature presence was used as opposed to frequency since it has yielded better results in past research on n-grams for opinion classification (Pang et al., 2002; Ng et al., 2006). Hence we used binary feature vectors (1 if an n-gram is present in the document, 0 if not).
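A minimal sketch of this setup, using scikit-learn's LinearSVC on binary n-gram presence vectors, is shown below; the two toy reviews and the unigram-to-trigram range are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_docs  = ["great camera sharp pictures love it",
               "awful battery life regret buying it"]
train_stars = [5, 1]

vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
X_train = vectorizer.fit_transform(train_docs)         # 1 if the n-gram is present, else 0

clf = LinearSVC().fit(X_train, train_stars)
X_test = vectorizer.transform(["sharp pictures great value"])
print(clf.predict(X_test))                              # predicted star class
```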
The following two metrics were used. The % within-one accuracy was incorporated
since multi-class opinion classification, involving 3 or more classes, can be challenging
given the relationship and subtle differences between semantically adjacent classes. It is
often difficult even for humans to accurately differentiate between, for instance, one and
two star reviews (Das and Chen, 2007).
% Accuracy = # correctly assigned / # total reviews

% Within One = # assigned within one class of correct / # total reviews
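Both metrics reduce to a few lines of code; the predicted and gold star labels below are made up for illustration.

```python
def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def within_one(predicted, actual):
    return sum(abs(p - a) <= 1 for p, a in zip(predicted, actual)) / len(actual)

pred, gold = [1, 2, 4, 5, 3], [1, 3, 5, 5, 1]
print(accuracy(pred, gold))     # 0.4 -- two of five reviews classified exactly
print(within_one(pred, gold))   # 0.8 -- four of five within one star class
```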
Based on our research questions, two different experiments were conducted. In
experiment 1, we compared the proposed extended n-gram feature set against word n-grams and a bag-of-words baseline. In experiment 2 the proposed Feature Relation
Network (FRN) was compared against previously used univariate and multivariate
feature selection methods, including weighted log likelihood ratio (WLLR), recursive
feature elimination (RFE), and decision tree models (DTM). For each experiment we ran
the two aforementioned settings: A and B. In setting A, 5-fold cross validation (CV) was
performed on the 2,000 balanced reviews. Setting B utilized these 2,000 reviews for
training and performed testing on the remaining 10,000 unbalanced reviews.
7.5.1 Experiment 1a: Comparison of Feature Sets using Cross Validation
We compared three feature sets: bag-of-words (BOW), word n-grams (Word-NG),
and all n-grams (All-NG). The all n-gram feature set included the word, POS, POSWord,
character, legomena, syntactic, and semantic n-grams described in Section 7.4.1. We extracted all features occurring at least 3 times, as done in prior research (Riloff et al.,
2006; Abbasi et al., 2008). The extracted features were ranked using the weighted log
likelihood ratio (WLLR) on the training data for each of the 5 CV folds. WLLR has been
shown to be effective in prior opinion classification studies (Ng et al., 2006).
When comparing feature sets and selection methods, it is difficult to decide upon the
number of features that should be included. Different feature set sizes can wield varying
performance depending on the nature of the features and selection methods employed. In
order to allow a fair comparison between feature sets, we evaluated the top 10,000 to
100,000 features (based on the WLLR weights), in 5,000 feature increments. Hence, 19
feature quantities were used for all three feature sets. The WLLR weights for all features
occurring three times or more were computed on the 1,600 training reviews for each fold.
The total number of bag-of-words typically did not exceed 20,000, so only that many
were evaluated. Such a set up is consistent with experimental designs used in prior
research (e.g., Riloff et al., 2006).
Figure 7.7 shows the 5-fold cross-validation results for all three feature sets across all
three 2,000 review test beds. The table on the left of the figure shows the % accuracy, %
within one, and number of features used to attain the best results. The charts on the right
show the results for all 19 feature quantities (using between 10,000 and 100,000
features). Looking at the best results for each feature set (left side of Figure 7.7), all n-grams outperformed word n-grams and the bag-of-words baseline on all three data sets. The difference was greatest on the automobile reviews, where all n-grams had 3% greater accuracy and nearly 4% higher within-one accuracy in comparison with word n-grams.
Feature Set | Best % Accuracy | % Within One | # Features

Digital Camera Reviews (Epinions)
All-NG | 50.55 | 87.10 | 90,000
Word-NG | 49.90 | 86.25 | 70,000
BOW | 42.05 | 78.60 | 10,000

Automobile Reviews (Edmunds)
All-NG | 56.95 | 89.75 | 35,000
Word-NG | 54.00 | 86.05 | 20,000
BOW | 50.15 | 80.15 | 10,000

Movie Reviews (Rotten Tomatoes)
All-NG | 89.20 | - | 50,000
Word-NG | 88.55 | - | 45,000
BOW | 83.70 | - | 20,000

[Accompanying charts plot accuracy using the top 10,000 to 100,000 features for each test bed.]

Figure 7.7: Results for Feature Sets on 5-Fold Cross Validation Experiment (Setting A)
The right side of Figure 7.7 shows the accuracies for the three feature sets using between 10,000 and 100,000 features. All n-grams had the best overall performance on all test beds. However, it overtook the word n-gram feature set at different feature levels on the digital camera and movie review data sets (i.e., 85,000 and 40,000 features, respectively). This was not surprising since the all n-grams feature set includes many redundant attributes across the various categories. For smaller feature quantities, word n-grams includes a greater number of non-redundant attributes. Consequently, all n-grams uses more features to get the depth required for enhanced opinion classification accuracy. Surprisingly, however, all n-grams dominated the word n-gram feature set far more quickly on the automobile test bed, with enhanced performance even when using as few as 10,000 features.
7.5.2 Experiment 1b: Comparison of Features on 10,000 Review Test Beds
Table 7.6 shows the results attained for all three feature sets across the three 10,000
review test beds. Here, the classifiers were trained on the 2,000 reviews from the
previous experiment. For each feature set we used the number of features which attained
the best results in experiment 1a (left side of Figure 7.7). For example, the same 90,000
features were used for the All-NG feature set on the digital camera review test bed, since
these attained the best results in experiment 1a.
Once again, All-NG outperformed Word-NG and the BOW baseline on all three data
sets. The performance increase on accuracy was nearly 2% for the digital camera and
automobile reviews. Pairwise t-tests on percentage accuracy and within-one accuracy showed that
the improved performance was significant at alpha = 0.01 (n=10000, all p-values <
0.00001).
Table 7.6: Results for Feature Sets on 10,000 Test Review Experiment (Setting B)

Feature Set | Digital Camera Reviews (% Accuracy / % Within One / # Features) | Automobile Reviews (% Accuracy / % Within One / # Features) | Movie Reviews (% Accuracy / # Features)
All-NG | 48.93 / 85.36 / 90,000 | 50.78 / 81.51 / 35,000 | 85.23 / 50,000
Word-NG | 46.86 / 84.09 / 70,000 | 48.88 / 79.23 / 20,000 | 84.19 / 45,000
BOW | 43.14 / 79.95 / 10,000 | 44.20 / 75.52 / 10,000 | 81.23 / 20,000
7.5.3 Experiment 2a: Comparison of Feature Selection Methods using Cross Validation
We ran the Feature Relation Network (FRN) on the All-NG feature set using the
WLLR scores as the feature weights. We used WLLR since it worked better than
information gain (IG) in our initial testing. FRN was compared against several univariate and multivariate feature selection methods: weighted log likelihood ratio
(WLLR), recursive feature elimination (RFE), and decision tree models (DTM). All
comparison feature selection methods were also run on the All-NG feature set. For RFE,
we began with the top 100,000 WLLR features, and decreased the feature set by 5,000
each iteration (until only 10,000 features were left). We evaluated all 19 feature subsets
generated, similar to the approach used by Guyon et al. (2002). For the DTM, we ran 10
iterations each on the training data for all 19 feature set sizes between 10,000 and
100,000 (in 5,000 feature increments). Each iteration, all features selected by DTM were
added to the feature set. The remaining features were reevaluated by the DTM during the
ensuing iterations. The total feature set generated after 10 iterations was evaluated on the
testing data.
Figure 7.8 shows the 5-fold CV results for all four feature selection methods across
the 2,000 review test beds. For DTM, the # of features listed is the amount upon which
the DTM was applied, not the amount actually selected by the DTM. That is, the DTM
attained its best results when selecting a subset of the top 90,000 WLLR features for the
digital camera test bed. Looking at the best overall results (left side of Figure 7.8), FRN
outperformed WLLR, RFE, and DTM on all three test beds.
Selection Method | Best % Accuracy | % Within One | # Features

Digital Cameras (Epinions)
FRN | 52.45 | 87.95 | 100,000
WLLR | 50.55 | 87.10 | 90,000
RFE | 50.85 | 87.30 | 75,000
DTM | 49.65 | 85.75 | 90,000

Automobiles (Edmunds)
FRN | 57.50 | 90.85 | 40,000
WLLR | 56.95 | 89.75 | 35,000
RFE | 55.50 | 89.50 | 95,000
DTM | 55.85 | 89.75 | 30,000

Movie Reviews (Rotten Tomatoes)
FRN | 89.65 | - | 55,000
WLLR | 89.20 | - | 50,000
RFE | 88.45 | - | 85,000
DTM | 85.30 | - | 90,000

[Accompanying charts plot accuracy using the top 10,000 to 100,000 features for each test bed.]

Figure 7.8: Feature Selection Results on 5-Fold Cross Validation Experiment (Setting A)
While FRN substantially outperformed RFE and DTM, its performance gain over WLLR was marginal (especially on the automobile and movie review test beds).
The charts on the right side of Figure 7.8 show the accuracies for the four feature
selection methods using between 10,000 and 100,000 features. FRN clearly had better
performance than RFE and DTM on all three test beds. It also had considerably better
performance than WLLR on the digital camera data set when using more than 90,000
features. However, its performance was only marginally better than WLLR on the
automobile and movie review test beds. WLLR even outperformed FRN for certain
feature sizes on the automobile test bed (e.g., when using between 55,000-65,000
features).
7.5.4 Experiment 2b: Comparison of Selection Methods on 10,000 Review Test Beds
Table 7.7 shows the best results attained for all four methods across the three 10,000
review test beds. Here, the classifiers were trained on the 2,000 reviews from the
previous experiment. For each method we used the number of features which attained the
best results in experiment 2a (as done in experiment 1b). FRN again outperformed all 3
comparison methods. The performance increase was 2%-3% over other methods on the
digital camera and automobile data sets. However, FRN’s performance was almost the
same as WLLR on the movie review test bed. The improved performance compared to
DTM was significant at alpha = 0.01 (n=10,000, all p-values < 0.00001). FRN also significantly outperformed RFE on all three test beds (p-values < 0.00001 on digital camera and automobile test beds, p-value = 0.00246 on movie reviews). In comparison
with WLLR, FRN performed significantly better on the digital camera and automobile
data sets (p-values < 0.00001) but not on the movie review test bed (p-value = 0.1272).
The enhanced performance of FRN over multivariate methods such as RFE and DTM
was attributable to FRN’s ability to efficiently remove large quantities of redundant
attributes by leveraging domain knowledge about n-gram relations.
Table 7.7: Results for Selection Methods on 10,000 Test Review Experiment (Setting B)

Selection Method | Digital Camera Reviews (% Accuracy / % Within One / # Features) | Automobile Reviews (% Accuracy / % Within One / # Features) | Movie Reviews (% Accuracy / # Features)
FRN | 51.21 / 87.12 / 100,000 | 53.47 / 83.75 / 40,000 | 85.25 / 55,000
WLLR | 48.93 / 85.36 / 90,000 | 50.78 / 81.51 / 35,000 | 85.23 / 50,000
RFE | 48.96 / 86.30 / 75,000 | 51.33 / 82.07 / 95,000 | 84.85 / 85,000
DTM | 48.04 / 85.17 / 90,000 | 47.52 / 80.68 / 30,000 | 81.04 / 90,000
7.5.5 Results Discussion
FRN was able to remove irrelevant and redundant attributes, facilitating the inclusion
of additional useful attributes. Figure 7.9 shows the WLLR and FRN feature plots for the
Epinions digital camera test bed. The two plots show the top 200,000 WLLR features in
descending order (based on feature weight) for each feature selection method. The y-axis represents feature weights (i.e., WLLR or FRN scores) while the x-axis denotes the
feature number/index in thousands. The vertical dotted lines show the cutoff points for
the top 100,000 WLLR and FRN features, respectively. FRN removes approximately
50,000 of the top 150,000 attributes in the WLLR feature set. These n-grams are considered redundant or irrelevant by FRN. Hence, the top 100,000 FRN features correspond to the
top 150,000 features in the WLLR feature set. The removed features (denoted by the gray
region) allow FRN to include 50,000 more useful attributes as compared to WLLR. These
additional n-grams allow FRN to attain a 2%-3% performance gain over WLLR.
Figure 7.9: Weights for Top 200,000 N-Grams on Digital Camera Test Bed
Table 7.8 shows features from the Epinions digital camera review test bed. The n-grams shown were all excluded from the WLLR feature set because they were ranked outside the top 100,000. However, they were included in the FRN feature set. FRN’s ability to
remove 50,000 redundant n-grams enabled these useful attributes to be included. We also
note that the n-grams come from various categories. For example, there is an information
extraction pattern representing a subject having flaws and a semantic bigram referring to
the class of words pertaining to “design.”
Table 7.8: N-Grams from the Digital Camera Test Bed Included in the FRN Feature Set

N-Gram Feature | N-Gram Category | WLLR Rank | FRN Rank | Orientation
<subj> AuxVP Dobj: <subj> have flaws | IEPatt | 104,259 | 71,462 | Negative
can’t beat | 2-Word | 111,075 | 73,348 | Positive
start to | 2-Word | 114,546 | 75,455 | Negative
photos aren’t very | 3-Word | 118,384 | 77,334 | Negative
The DT Best QUALITY_NNPJS | 2-POSWord | 121,310 | 80,239 | Positive
SYN-Design maybe | 2-Semantic | 125,761 | 88,414 | Negative
7.6 Conclusions
In this essay we proposed the use of a rich set of n-gram features and a Feature
Relation Network (FRN) for enhanced opinion classification. The proposed features and
selection method improved review classification performance over commonly used n-gram feature sets and selection methods. The extended n-gram feature set and FRN each
individually improved performance by approximately 2% on the 10,000 review test beds.
Collectively, they outperformed the benchmark of word n-grams coupled with WLLR by
over 4% on the digital camera (51.21% versus 46.86%) and automobile (53.47% versus
48.88%) review test beds. We have identified several future directions. The FRN was able
to effectively remove large quantities of redundant and irrelevant attributes, as many as
33% of the top 150,000 n-gram features. Given the computationally efficient nature of
FRN compared to other multivariate methods such as DTM and genetic algorithms, we
believe FRN may be suitable for other text classification problems as well (e.g., topic,
affect, style, and genre classification). We would also like to extend the network by
adding additional feature representation types (i.e., a multidimensional FRN). We used
only feature presence in this essay. Other representations such as occurrence frequency
and various positional/distributional features (e.g., first/last occurrence, compactness,
etc.) could be added for enhanced performance.
CHAPTER 8: AFFECT ANALYSIS OF WEB FORUMS AND BLOGS USING
CORRELATION ENSEMBLES
8.1 Introduction
In the previous chapters, we addressed an important information type related to the
ideational meta-function of Systemic Functional Linguistic Theory: sentiments. Another
related information type is affects: emotive content in text. In this chapter we explore
features and techniques for affect analysis of web forums and blogs.
The need for enhanced information retrieval and knowledge discovery from computer
mediated communication archives has been articulated by many individuals in recent
years. One suggested information access refinement has been to mine directional text:
text containing emotions and opinions (Hearst, 1992; Wiebe, 1994). Affects play an
important role in influencing people’s perceptions and decision making (Picard, 1997).
Analysis of sentiment and affects is particularly important for online discourse, where
such information is often more pervasive than topical content (Subasic and Huettner,
2001; Nigam and Hurst, 2004). With the increased popularity of social computing, the
presence and significance of affective text is likely to grow (Liu et al., 2003). There has
been considerable recent work on sentiment analysis of online forums and product
reviews (Turney and Littman, 2003; Wiebe et al., 2004). However research on analysis of
affects (including emotions and moods) is still relatively sparse (Cho and Lee, 2006).
While recent studies have analyzed the presence of affects in blogs, online stories, chat
dialog, transcripts, song lyrics, etc., it is unclear which features and techniques are most
useful for affective computing of online texts. There is therefore a need to compare
existing features for representing affective content as well as the techniques used for
assigning emotive intensities.
In this essay we compare features and techniques for classification of affective
intensities in online text. The features investigated include a large set of learned n-grams
as well as automatically and manually generated affect lexicons used in prior research.
We also propose a support vector regression correlation ensemble (SVRCE) method for
text-based affect classification. SVRCE combines feature subset ensembles with affect
correlation information to improve affect classification performance. Evaluation of the
various feature representations and the proposed method in comparison with existing
affect analysis techniques found that the use of SVRCE with n-grams is highly effective
for affect classification of online forums, blogs, and stories.
The remainder of this chapter is organized as follows. Section 8.2 provides a review
of related work on textual affect analysis. Section 8.3 outlines our research framework
based on gaps and questions derived from the literature review. Section 8.4 presents an
experimental evaluation of the various features and techniques incorporated in our
framework. Section 8.5 features a brief case study illustrating how the proposed affect
analysis methods can be applied to large CMC archives. Section 8.6 contains concluding
remarks and describes future research directions.
8.2 Related Work
Affect analysis is concerned with the analysis of text containing emotions (Picard,
1997; Subasic and Huettner, 2001). Emotional intelligence, the ability to effectively
recognize emotions automatically, is crucial for learning preference related information
and determining the importance of particular content (Picard et al., 2001). Affect analysis
is associated with sentiment analysis, which looks at the directionality of text, i.e.,
whether a text segment is positively or negatively oriented (Hearst, 1992). However,
there are two major differences between affect analysis and sentiment analysis. Firstly,
affect analysis involves a large number of potential emotions or affect classes (Subasic
and Huettner, 2001). These include happiness, sadness, anger, hate, violence, excitement,
fear, etc. In contrast, sentiment analysis primarily deals with positive, negative, and
neutral sentiment polarities. Secondly, while the sentiments associated with particular
words or phrases are mutually exclusive, text segments can contain multiple affects
(Subasic and Huettner, 2001; Grefenstette et al., 2004b). For example, the sentence “I
can’t stand you!” has only a negative sentiment polarity but simultaneously contains hate
and anger affects. Word level examples include the verb form of “alarm,” which can be
attributed to fear, warning, and excitement affects (Subasic and Huettner, 2001) and the
adjective “gleeful,” which can be assigned to the happiness and excitement affect classes
(Grefenstette et al., 2004b). Additionally, certain affect classes may be correlated
(Subasic and Huettner, 2001). For instance, hate and anger often co-occur in text
segments, resulting in a positive correlation. Similarly, happiness and sadness are
opposing affects that are likely to have a negative correlation. In summary, affect analysis
involves assigning text with emotive intensities across a set of mutually inclusive and
possibly correlated affect classes. Important affect analysis characteristics include the
features used to represent the presence of affects in text, techniques for assigning
affective intensity scores, and the level of text granularity at which the analysis is
performed. Table 8.1 presents a summary of the relevant prior studies based on these
important affect analysis characteristics.
Table 8.1: Related Prior Affect Analysis Studies

Study | Features | Technique(s) | Analysis Level | Test Bed and Results
Donath et al., 1999 | Manual lexicon, punctuation | Posting scoring | Posting | Greek USENET forums; visualization of anger intensities over time
Subasic & Huettner, 2001 | Manual lexicon (fuzzy semantic typing) | Word scoring | Word | Movie reviews and news stories; visualization of 83 affects
Liu et al., 2003 | Language patterns derived from knowledge base | Sentence scoring | Sentence | User study on email browser
Chuang & Wu, 2004 | Manual lexicon | Support vector machine (SVM) | Sentence | Drama broadcast transcripts; 76.44% accuracy for 7 class experiments
Grefenstette et al., 2004a | Manual lexicon, semantic orientation | Manual tagging, point-wise mutual information (PMI) | Word | Candidate affect words; scored intensities across 86 affects
Grefenstette et al., 2004b | Manual lexicon | Word scoring | Word | Political web pages; scored text relating to certain topic
Read, 2004 | Semantic orientation | Point-wise mutual information (PMI) | Sentence | Short stories; 47.14% accuracy for 2 class experiments
Ma et al., 2006 | Manual lexicon (WordNet-Affect database) | Word scoring | Sentence | Instant messaging chat data; no formal evaluation
Mishne, 2005 | BOWs, POS tags, document length, emphasized words, semantic orientation, WordNet lexicon | Support vector machine (SVM) | Posting | LiveJournal blog postings; 60.25% accuracy for 2 class experiments
Cho & Lee, 2006 | Manual lexicon, BOWs | Sentence scoring, support vector machine (SVM) | Song | Korean song lyrics; 77.3% accuracy on 5 class experiments
Mishne & Rijke, 2006 | Word n-grams | Pace regression | Posting | LiveJournal blog postings; average error of 52.53%, correlation coefficient of 0.827 for 2 class experiments
Wu et al., 2006 | Emotion generation and association rules | Separable mixture models | Posting | Student chat dialog; 80.98% accuracy for 3 class experiments
Based on the table, we can make several observations regarding the features and
techniques used in previous affect analysis research. (1) Most prior research has used
either manually generated lexicons, lexicons automatically created using WordNet or
semantic orientation, or generic feature representations such as word and part-of-speech
tag n-grams. It is unclear which of these feature representations is most effective for
affect analysis. (2) Techniques used for assigning affect intensities can be predominantly
categorized into scoring methods and machine learning techniques. However, we are unaware of any prior work attempting to compare various techniques for affect classification. (3) Previous affect classification studies typically utilized between two and
seven affect classes, applied at the word, sentence, or document levels. Despite the
presence of multiple inter-related affects (Subasic and Huettner, 2001; Grefenstette et al.,
2004a), class correlation information was not leveraged for improved affect intensity
assignment. Additionally, regression based methods have seen limited usage despite their
effectiveness in related application domains (Pang and Lee, 2004; Schumaker and Chen,
2006). (4) Prior studies mainly focused on a single application domain, such as movie
reviews, web forums, blogs, chat dialog, song lyrics, stories, etc. Given the differences in
the degree of interaction, language usage, and communication structure across these
domains, it is unclear if an approach suitable for classifying story affects will be
applicable on web forums and blogs. The features and techniques used in prior affect
analysis research are expounded upon in the remainder of the section.
8.2.1 Features for Affect Analysis
The attributes used to represent affects can be classified into lexicon based features
and generic n-gram based features. Considerable prior research has used manually or
automatically generated lexicons. As previously stated, in affect lexicons, the same
word/phrase can be assigned to multiple affect classes. The intensity score for an attribute
is based on its degree of severity towards that particular affect class. Depending upon the
semantic relation between affects, certain classes can have a positive or negative
occurrence correlation (Subasic and Huettner, 2001).
Many studies have incorporated manually developed affect lexicons. Subasic and
Huettner (2001) used Fuzzy Semantic Typing where each feature was assigned to
multiple affect categories with varying intensity and centrality scores depending upon the
word and usage context. For example, the word “rat” was assigned to the disloyalty,
horror, and repulsion affect categories with intensity scores of 0.9, 0.6, and 0.7,
respectively (on a 0.0-1.0 scale where 1.0 was highest). In order to compensate for word-sense ambiguity, their approach also assigned each word-affect pair a centrality score
indicating the likelihood of the word being used for that particular affect class. For
example, the word “rat” was assigned a centrality score of 0.3 for the disloyalty affect
and 0.6 for the repulsion affect (also on a 0.0-1.0 scale), since the usage of “rat” to
convey disloyalty is not as common. Thus, while “rat” was more intense for the
disloyalty affect, it was also less central to this class. In Subasic and Huettner’s (2001)
approach, the intensity and centrality scores were both utilized for determining the
affective composition of a text document. Although the affect scores for specific terms may be inaccurate, the fuzzy logic approach is intended to capture the essence of a
document’s various affect intensities. A similar method for generating manual lexicons
was employed in related work (Grefenstette et al., 2004a; 2004b). Many other studies
have also utilized manually constructed affect lexicons (Chuang and Wu, 2004; Cho and
Lee, 2006). Donath et al. (1999) used a set of keywords relating to anger for analyzing
USENET forums. Ma et al. (2006) incorporated the WordNet-Affect database created by
Valitutti et al. (2004). This database is comprised of manually assigned affect intensities
for words found in the WordNet lexical resource (Fellbaum, 1998). Liu et al. (2003)
manually constructed sentence level language patterns for identification of six affect
classes, including happiness, sadness, anger, fear, etc.
Although manually created affect lexicons can provide powerful insight, their
construction can be time consuming and tedious. As a result, many studies have explored
the use of automated lexicon generation methods such as semantic orientation
(Grefenstette et al., 2004a; Read, 2004; Mishne, 2005) and WordNet lexicons (Mishne,
2005). These methods take a small set of manually generated seed/paradigm words which
accurately reflect the particular affect class, and use automated methods for lexicon
expansion and candidate word scoring.
Based on the work of Turney and Littman (2003), the semantic orientation approach
assesses the intensity of each word based on its frequency of co-occurrence with a set of
core paradigm words reflective of that affect class (Grefenstette et al., 2004a). The
occurrence frequencies for the paradigm words and candidate words are derived from
search engines such as AltaVista (Grefenstette et al., 2004a; Read, 2004; Mishne, 2005)
or Yahoo! (Mishne, 2005). The number of paradigm words used for a particular affect
class is generally five to seven (Grefenstette et al., 2004a; Read, 2004). For example, the
paradigm words for the praise affect may include “acclaim, praise, congratulations,
homage, approval,” (Grefenstette et al., 2004a), and additional lexicon items generated
automatically using semantic orientation include the words “award, honor, extol.” The
semantic orientation approach is typically coupled with a point-wise mutual information
(PMI) scoring mechanism for assigning candidate words intensity scores (Turney and
Littman, 2003). Traditional PMI assigns each word a score based on how often it occurs
in proximity with positive and negative paradigm words; however, it has been modified to be applicable to affect classes (Read, 2004). The affect analysis rendition of PMI
proposed by Grefenstette et al. (2004a) is as follows:

PMI\,Score(word, Class) = \log_2\left(\frac{\prod_{cword \in Class} hits(word\ \mathrm{NEAR}\ cword)}{\prod_{cword \in Class} \log_2(hits(cword))}\right)
where cword is one of the paradigm words chosen for an affect class Class and hits is the
number of pages found by Alta Vista.
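The sketch below mirrors this scoring formula; hits() stands in for the search engine NEAR-query counts used in the original work and is backed here by a purely hypothetical lookup table.

```python
import math

hit_counts = {("award", "praise"): 120000, ("award", "acclaim"): 45000,
              "praise": 8000000, "acclaim": 1500000}          # hypothetical counts

def hits(*terms):
    key = terms if len(terms) > 1 else terms[0]
    return hit_counts.get(key, 1)

def pmi_score(word, paradigm_words):
    num = math.prod(hits(word, cword) for cword in paradigm_words)
    den = math.prod(math.log2(hits(cword)) for cword in paradigm_words)
    return math.log2(num / den)

print(pmi_score("award", ["praise", "acclaim"]))
```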
Another automated affect lexicon generation method is WordNet lexicons. Originally
proposed by Kim and Hovy (2004), this method is similar to semantic orientation.
However, it uses WordNet to expand the seed words associated with a particular affect
class by comparing each candidate word’s synset with the seed word list (Mishne, 2005).
The intensity for a candidate word is proportional to the percentage of its synset also
present in the seed word list for that particular affect class. Word scores are assigned
using the following formula (Kim and Hovy, 2004):

WordNetScore(word, Class) = P(Class) \cdot \frac{1}{count(c)} \sum_{i=1}^{n} count(syn_i, Class)

where Class is an affect class, syn_i is one of the n synonyms of word, and P(Class) is the number of words in Class divided by the total number of words considered.
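An illustrative version of this scoring, using NLTK's WordNet interface and a hypothetical seed list, might look as follows; it follows the spirit of the formula rather than any specific implementation.

```python
from nltk.corpus import wordnet as wn

def wordnet_score(word, class_seeds, total_words):
    """Score a candidate word for an affect class from the fraction of its
    synonyms already present in the class seed list."""
    syns = [lemma.lower() for s in wn.synsets(word) for lemma in s.lemma_names()]
    if not syns:
        return 0.0
    p_class = len(class_seeds) / total_words
    return p_class * sum(1 for s in syns if s in class_seeds) / len(syns)

happiness_seeds = {"happy", "glad", "joy", "cheerful", "gleeful"}
print(wordnet_score("joyful", happiness_seeds, total_words=200))
```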
In addition to lexicon based affect representations, studies have also used generic n-gram features. Mishne (2005) used bag-of-words (BOWs) and part-of-speech (POS) tags
in combination with automatically generated lexicons while Mishne and Rijke (2006)
used word n-grams for affect analysis of blog postings. Cho and Lee (2006) used BOWs
for classifying affects inherent in Korean song lyrics. N-grams have also been shown to
be highly effective in the related area of sentiment classification (Wiebe et al., 2004;
Abbasi and Chen, 2008), especially when combined with machine learning methods
capable of learning n-gram patterns conveying opinions and emotions. While prior
research has used various n-gram and lexicon representations, we are unaware of any
work done to evaluate the effectiveness of various potential affect analysis features.
8.2.2 Techniques for Assigning Affect Intensities
Prior research has utilized scoring and machine learning methods for assigning affect
intensities. Scoring-based methods, which are generally used in conjunction with
lexicons, typically use the average intensity across lexicon items occurring in the text
(i.e., word spotting) (Subasic and Huettner, 2001; Liu et al., 2003; Cho and Lee, 2006).
Sentence level averaging has also been performed in combination with the word-level
PMI scores generated using semantic orientation (Turney and Littman, 2003) as well as
with WordNet lexicons (Kim and Hovy, 2004). Studies that directly developed lexicons
comprised of sentence patterns obviously do not use averaging (at least at the sentence
level), instead simply matching sentences with lexicon entries and assigning intensity
scores accordingly (Liu et al., 2003, Chuang and Wu, 2004).
Machine learning techniques have also been used for assigning affect intensities.
Many studies used support vector machine (SVM) for determining whether a text
segment contained a particular affect class (Chuang and Wu, 2004, Mishne, 2005, Cho
and Lee, 2002). One shortcoming of using SVM is that it can only deal with discrete class
labels, whereas affect intensities can vary along a continuum. Recent work has attempted
to address this problem by using regression based classifiers (Pang and Lee, 2004). For
example, Mishne and Rijke (2006) used word n-grams in unison with Pace Regression
(Witten and Frank, 2005) for assigning affect intensities in LiveJournal blogs.
Nevertheless, regression based learning methods have seen limited usage despite their
effectiveness in related application domains such as using news story text for stock price
prediction (Schumaker and Chen, 2006). Furthermore, although scoring and machine
learning methods have been utilized for classifying affect intensities, there has been no
research done to investigate the effectiveness of these methods.
8.3 Research Design
In this section we highlight affect analysis research gaps based on our review of the
related work. Research questions are then posed based on the relevant gaps identified.
Finally, a research framework is presented in order to address these research questions,
along with some research hypotheses. The framework encompasses various feature
representations and techniques for assigning affective intensities to sentences.
8.3.1 Gaps and Questions
Prior research has utilized manually or automatically generated lexicons as well as
generic n-gram features for representing affective content in text. Since most studies used
a single feature category and did not compare different alternatives, it is unclear which
emotive representation is most effective. Furthermore, prior research has used scoring
based techniques and machine learning methods such as SVM. Regression based methods
capable of assigning continuous intensity scores have not been explored in great detail,
with the exception of Mishne and Rijke (2006). Leveraging the relationship between
mutually inclusive affect classes in combination with powerful regression based machine
learning methods such as Support Vector Regression (SVR) could be highly effective for
accurate assignment of affect intensities. Additionally, most prior affect analysis research
was applied to a single domain (e.g., blogs, stories, etc.). Application across multiple
domains could lend greater validity to the effectiveness of affect analysis features and
techniques. Based on these gaps we present the following research questions:
• Which feature categories are best at accurately assigning affect intensities?
  o Can the use of an extended feature set enhance affect analysis performance over individual generic and lexicon-based feature categories?
• Can a regression ensemble that incorporates affect correlation information outperform existing machine learning and scoring based methods?
• What impact will the application domain have on affect intensity assignment?
8.3.2 Research Framework
Our research framework (shown in Figure 8.1) relates to the features and techniques
used for assigning affect intensity scores. We intend to compare generic n-gram features
with automatically and manually generated lexicons. We also plan to assess the
effectiveness of using an extended feature set encompassing all these attributes in
comparison with individual feature categories. With respect to affect analysis techniques,
we propose a support vector regression (SVR) ensemble that considers affect correlation
information when assigning emotive intensities to sentences. We intend to compare the
SVR correlation ensemble (SVR-CE) with other machine learning and scoring based
methods used in prior research. These include pace regression (Witten and Frank, 2005;
Mishne and Rijke, 2006), semantic orientation (Grefenstette et al., 2004a; Read, 2004),
WordNet (Kim and Hovy, 2004), and manual lexicon scoring (Subasic and Huettner,
2001). We also plan to perform ablation testing to see how the different components of
the proposed SVR-CE method contribute to its overall performance. All testing will be
performed on several test beds encompassing sentences derived from web forums, blogs,
and stories. Features and techniques will be evaluated with respect to their percentage
mean error and correlation coefficients in comparison with a human annotated gold
standard. Further details about the features, techniques, ablation testing, and our research
hypotheses are presented below while the test bed and evaluation metrics are discussed in
greater detail in the ensuing evaluation section.
Figure 8.1: Affect Analysis Research Framework
8.3.2.1 Affect Analysis Features
The n-gram feature set is comprised of word, character, and part-of-speech (POS) tag
n-grams. For each n-gram category we used up to trigrams only (i.e., unigrams, bigrams,
and trigrams), as done in prior related research (Pang et al., 2002; Wiebe et al., 2004).
Word n-grams, including unigrams (e.g., “LIKE”), bigrams (e.g., “I LIKE”, “LIKE
YOU”), and trigrams (e.g., “I LIKE YOU”) as well as POS tag n-grams (e.g., “NP VB”,
“JJ NP VB”) have been used in prior affect analysis research (Mishne, 2005). We also
include character n-grams (e.g., “li”, “ik”, “ike”), which have been useful in related
sentiment classification studies (Abbasi and Chen, 2008). In addition to standard word ngrams, we incorporate hapax legomena and dis legomena collocations (Wiebe et al.,
2004). Such collocations replace once (hapax legomena) and twice occurring words (dis
legomena) with “HAPAX” and “DIS” tags. Hence, the trigram “I hate Jim” would be
replaced with “I hate HAPAX” provided “Jim” only occurs once in the corpus. The
intuition behind such collocations is to remove sparsely occurring words with tags that
will allow the extracted n-grams to be more generalizable, and hence, more useful (Wiebe
et al., 2004). For instance, in the above example, the fact that the writer hates is more
important from an affect analysis perspective than the specific person the hate is directed
towards.
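The sketch below illustrates the collocation tagging and n-gram extraction described above; the toy corpus is an illustrative assumption:

```python
from collections import Counter

def tag_rare_words(corpus_tokens):
    """Replace once- and twice-occurring words with HAPAX / DIS tags
    (Wiebe et al., 2004). corpus_tokens is a list of tokenized sentences."""
    counts = Counter(w for sent in corpus_tokens for w in sent)
    def tag(w):
        if counts[w] == 1:
            return "HAPAX"
        if counts[w] == 2:
            return "DIS"
        return w
    return [[tag(w) for w in sent] for sent in corpus_tokens]

def ngrams(tokens, n):
    """Extract n-grams; n = 1..3 were used in this study."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# "jim" occurs once in this toy corpus, so ("i", "hate", "jim") becomes
# ("i", "hate", "HAPAX"), mirroring the example in the text.
corpus = [["i", "hate", "jim"], ["i", "hate", "mondays"], ["i", "hate", "mondays"]]
print(ngrams(tag_rare_words(corpus)[0], 3))
```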
The lexicons employed are comprised of automated lexicons derived using semantic
orientation and WordNet models as previously done by Grefenstette et al. (2004a) and
Mishne (2005). We selected seven paradigm words for each affect class for input into the
semantic orientation algorithm, as described in section 2.1. For the WordNet models, sets
of up to 50 words were used as the seeds, following the guidelines described by Kim and
Hovy (2004).
Our feature set also consists of a manually crafted word level lexicon. The lexicon is
comprised of over 1,000 affect words for several emotive classes (e.g., happiness,
sadness, anger, hate, violence, etc.). Each word is assigned an intensity and ambiguity
score between 0 and 1. The intensities are assigned based on the word’s degree of
severity or valence for its particular affect category (with 1 being highest). This approach
is consistent with the intensity score assignment methods incorporated in previous studies
that utilized manually crafted lexicons (Donath et al., 1999; Subasic and Huettner, 2001;
Grefenstette et al., 2004a; Chuang and Wu, 2004). The ambiguity score for each word is
the probability of an instance of the feature having semantic congruence with the affect
class represented by that feature. The ambiguity score for each feature is determined by
taking a sample set of instances of the feature’s occurrence and coding each occurrence as
to whether the term usage is relevant to its affect. A maximum of 20 samples was used
per term. Using more instances would have been prohibitively time-consuming, and we observed that the sample size used
was sufficient to accurately capture the probability of an affect being relevant. The
ambiguity score for each word can be computed as the number of correctly appearing
instances divided by the total number of instances sampled for that word. Hence, an
ambiguity value of one suggests that the term always appears in the appropriate affective
connotation. The intensity and ambiguity assignment was done by two independent
coders. Each coder initially assigned values without consulting the other. The coders then
consulted one another in order to resolve tagging differences. The inter-coder reliability
tests revealed a kappa statistic of 0.78 prior to coder discussions and 0.89 after
discrepancy resolution. For situations where the disparity could not be resolved even after
discussions, the two coders’ values were averaged. Table 8.2 shows examples from the
violent affect lexicon. The weight for each term is the product of its intensity and
ambiguity value. This is the value assigned to each occurrence of the term in the text
being analyzed. For example, “lynch” was considered more severe by the coders than
“hang”. Although the two terms represent similar actions, the more severe motivation
behind “lynch” as compared to “hang” resulted in a higher intensity score. Furthermore,
the word “lynch” was also less ambiguous, conveying only a single violent meaning in
the samples analyzed by the coders during the disambiguation procedure.
Table 8.2: Manual Lexicon Examples for the Violence Affect
Term | Intensity | Ambiguity | Weight
hit | 0.210 | 0.800 | 0.168
beat | 0.400 | 0.667 | 0.267
stab | 0.575 | 1.000 | 0.575
hang | 0.800 | 0.650 | 0.520
kill | 0.850 | 0.950 | 0.808
lynch | 1.000 | 1.000 | 1.000
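A small sketch of how these lexicon weights are derived (weight = intensity × ambiguity, with ambiguity computed from coded samples) follows; the sample counts are assumed for illustration, and only the resulting values correspond to the "hang" row of Table 8.2:

```python
def ambiguity_score(relevant_samples, total_samples):
    """Fraction of sampled occurrences in which the term was used in its
    affective sense (at most 20 samples were coded per term)."""
    return relevant_samples / total_samples

def term_weight(intensity, ambiguity):
    """Final lexicon weight: the product of intensity and ambiguity."""
    return intensity * ambiguity

# Assuming 13 of 20 sampled occurrences of "hang" were judged violent:
amb = ambiguity_score(13, 20)                 # 0.650
print(round(term_weight(0.800, amb), 3))      # 0.52
```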
8.3.2.2 Affect Analysis Techniques
Ensemble classifiers use multiple classifiers with each built using different
techniques, training instances, or feature subsets (Dietterich, 2000). Particularly, the
feature subset classifier approach has been shown to be effective for analysis of style and
patterns. Stamatatos and Widmer (2002) used an SVM ensemble for music performer
recognition. They used multiple SVMs each trained using different feature subsets.
Similarly, Cherkauer (1996) used a Neural Network ensemble for imagery analysis. Their
ensemble consisted of 32 neural networks trained on 8 different feature subsets. The
intuition behind using a feature ensemble is that it allows each classifier to act as an
‘expert’ on its particular subset of features (Cherkauer, 1996; Stamatatos and Widmer,
2002), thereby improving performance over simply using a single classifier. We propose
the use of a support vector regression ensemble that incorporates the relationship between
various affect classes in order to enhance affect classification performance. Our ensemble
includes multiple SVR models; each trained using a subset of features most effective for
differentiating emotive intensities for a single affect class. We use the information gain
(IG) heuristic to select the features for each SVR classifier. Since affect intensities are
continuous, discretization must be performed before IG can be applied. We use 5 and 10
class bins (e.g., an intensity value of 0.15 would be placed into class 1 of 5 and 2 of 10
using 5 and 10 class bins). All features with an average information gain greater than a
threshold t are selected, as done in prior research (Yang and Pederson, 1997).
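The sketch below illustrates the discretization and information-gain-based selection described above; scikit-learn's mutual information estimator is used here as a stand-in for the IG heuristic, and the threshold value is left to the caller (both are assumptions):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def discretize(intensities, n_bins):
    """Map continuous intensities in [0, 1] to 1-based class bins
    (e.g., 0.15 falls in bin 1 of 5 and bin 2 of 10)."""
    intensities = np.asarray(intensities)
    return np.minimum((intensities * n_bins).astype(int), n_bins - 1) + 1

def ig_feature_selection(X, intensities, threshold):
    """Keep features whose score, averaged over 5- and 10-bin
    discretizations, exceeds a threshold t (cf. Yang and Pederson, 1997)."""
    scores = np.mean([mutual_info_classif(X, discretize(intensities, b),
                                          discrete_features=True)
                      for b in (5, 10)], axis=0)
    return np.where(scores > threshold)[0]
```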
The support vector regression correlation ensemble (SVRCE) adjusts the affect
intensity prediction for a particular sentence based on the predicted intensities of other
affects. The amount of adjustment is proportional to the level of correlation between
affect classes (i.e., the affect class being predicted and the ones being used to make the
adjustment) as derived from the training data. The SVRCE formulation is shown in
Figure 8.2. The rationale behind SVRCE is that in certain situations, a particular sentence
may get misclassified by a trained model due to a lack of prior exposure to the affective
cues inherent in its text. In such circumstances, leveraging the relationship between affect
classes may help alleviate the magnitude of such erroneous classifications.
Let M = {1, 2, ..., m} denote the set of training instances. The SVR correlation ensemble intensity score for instance i and affect class c can be computed as follows:
$$SVRCE_c(i) = SVR_c(i) + \sum_{a=1}^{n}\Big(Corr(c,a)^2\big(SVR_a(i) - SVR_c(i)\big)K\Big)$$
where:
SVR_c(i) is the prediction for instance i for affect class c using an SVR model trained on M;
the feature subset for SVR_c is selected using the IG heuristic;
c and a are part of the set of n affect classes being investigated, and c ≠ a;
K = 1 if Corr(c, a) > 0, K = −1 otherwise;
Corr(c, a) is the correlation coefficient for affect classes c and a across the m training instances:
$$Corr(c,a) = \frac{\sum_{x=1}^{m}(c_x - \bar{c})(a_x - \bar{a})}{\sqrt{\sum_{x=1}^{m}(c_x - \bar{c})^2\,\sum_{x=1}^{m}(a_x - \bar{a})^2}}$$
where c_x and a_x are the actual intensity values for affects c and a assigned to instance x ∈ M, and c̄ and ā are the average intensity values for affects c and a across the m training instances.
Figure 8.2: SVR Correlation Ensemble for Assigning Affect Intensities
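A compact sketch of the SVRCE formulation in Figure 8.2 is given below, assuming scikit-learn's SVR and precomputed per-affect feature subsets; the class and method names are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

class SVRCorrelationEnsemble:
    """One linear-kernel SVR per affect class, each trained on its own
    feature subset, with predictions adjusted by squared inter-affect
    correlations as in Figure 8.2."""

    def __init__(self, feature_subsets):
        self.feature_subsets = feature_subsets                 # column indices per affect
        self.models = [SVR(kernel="linear") for _ in feature_subsets]

    def fit(self, X, Y):                                       # Y: instances x affects
        self.corr = np.corrcoef(Y, rowvar=False)               # Corr(c, a) from training data
        for c, (model, cols) in enumerate(zip(self.models, self.feature_subsets)):
            model.fit(X[:, cols], Y[:, c])
        return self

    def predict(self, X):
        base = np.column_stack([m.predict(X[:, cols])
                                for m, cols in zip(self.models, self.feature_subsets)])
        adjusted = base.copy()
        n = base.shape[1]
        for c in range(n):
            for a in range(n):
                if a == c:
                    continue
                K = 1.0 if self.corr[c, a] > 0 else -1.0
                adjusted[:, c] += self.corr[c, a] ** 2 * (base[:, a] - base[:, c]) * K
        return np.clip(adjusted, 0.0, 1.0)                     # intensities lie in [0, 1]
```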
We intend to compare the proposed SVRCE method against machine learning and
scoring based methods used in prior affect analysis research. These include the Pace
Regression technique proposed by Witten and Frank (2005), which was used to analyze
affect intensities in weblogs (Mishne and Rijke, 2006), as well as the semantic
orientation, WordNet model, and manual lexicon scoring approaches. In addition to
comparing the proposed SVRCE against other affect analysis techniques, we also intend
to perform ablation testing to better understand the impact different components of our
proposed method have on classification performance. Since SVRCE uses correlation
information and feature subset based ensembles, we plan to compare it against an SVR
ensemble that does not use correlation information as well as an SVR trained using a
single feature set for all affect classes. The hypotheses associated with our research
framework are presented below.
8.3.3 Research Hypotheses
H1: Features
The use of learned generic n-gram features will outperform manually and
automatically crafted affect lexicons. Additionally, using an extended feature set
encompassing all features will outperform individual feature sets.
• H1a: N-Grams > manual lexicon, semantic orientation, WordNet models
• H1b: All features > n-grams, manual lexicons, semantic orientation, WordNet models
H2: Techniques
The proposed SVRCE method will outperform comparison techniques used in prior
studies for affect analysis.
• H2: SVRCE > Pace regression, semantic orientation scores, WordNet model scores, manual lexicon scores
H3: Ablation Testing
The SVRCE method will outperform an SVR ensemble not using correlation
information as well as SVR run using a single feature set. Furthermore, the SVR
ensemble will also significantly outperform SVR run using a single feature set.
• H3a: SVRCE > SVR ensemble, SVR
• H3b: SVR ensemble > SVR
8.4 Evaluation
We conducted experiments to evaluate various affective feature representations along
with different affect analysis techniques, including the proposed support vector regression
correlation ensemble (SVRCE). The experiments were conducted on four test beds
comprised of sentences taken from web forums, blogs, and short stories. This section
encompasses a description of the test beds, experimental design, experimental results, and
outcomes of the hypotheses testing.
8.4.1 Test Bed
Analyzing affect intensities across application domains is important in order to get a
better sense of the effectiveness and generalizability of different features and techniques.
As a result, our test bed consisted of sentences taken from 4 corpora (shown in Table 8.3).
The first test bed was a set of supremacist web forums discussing issues relating to Nazi
and socialist ideologies. The second was comprised of 1000 sentences taken from a
couple of Arabic language Middle Eastern forums discussing issues relating to the war in
Iraq. Analysis of such forums is important to better understand Cyberactivism, social
movements and people’s political sentiments. Additionally, sentences were extracted
from LiveJournal weblogs; a test bed used in prior research (Mishne, 2005; Mishne and
Rijke, 2006). The fourth test bed consisted of sentences taken from Fifty Word Fiction, a
website that posts short stories (Read, 2004).
Two independent coders tagged the sentences for intensities across the four affect
classes used for each test bed (shown in Table 8.3). Each sentence was tagged with an
intensity score between 0 and 1 (with 1 being most intense) for each of the affects. The
tagging followed the same format as the one used for the manual lexicon creation. Each
coder initially assigned values without consulting the other. The coders then consulted
one another in order to resolve tagging differences. For situations where the disparity
could not be resolved even after discussions, the two coders’ values were averaged. The
inter-coder reliability kappa values shown in Table 8.3 are from after discrepancy
resolution (prior to averaging). For the Middle Eastern forums, the coders were unable to
meet to resolve coding differences. For this test bed, the kappa value shown is for the two
coders’ initial tagging.
Table 8.3: Test Bed Description
Test Bed Name | Source URL(s) | # of Sentences | Affect Classes Tagged | Inter-coder Reliability
Fifty Word Short Stories (FWF) | www.tangents.co.uk/50words | 758 | Happiness, sadness, pleasantness, excitement | 0.91
LiveJournal weblogs (LJ) | www.livejournal.com | 1,000 | Happiness, sadness, anger, hate | 0.93
Supremacist Web Forums (SF) | www.stormfront.org, www.nazi.org | 1,000 | Violence, anger, hate, racism | 0.89
Middle Eastern Web Forums (MEF) | www.montada.com, www.alfirdaws.com | 1,000 | Violence, anger, hate, racism | 0.79*
* Kappa value from initial tagging
8.4.2 Experimental Design
Based on our research framework and hypotheses presented in section 3, three
experiments were conducted. The first was intended to compare the performance of
learned n-grams against manually and automatically crafted lexicons. We also
investigated the effectiveness of an extended feature set comprised of n-grams and
lexicons versus individual feature groups. The second experiment compared different
affect analysis techniques, including the proposed SVRCE, Pace regression, and scoring
methods. The final experiment pertained to ablation analysis of the major components of
SVRCE, including the use of correlation information and an ensemble approach to affect
classification. In order to allow statistical testing of results, we ran 50 bootstrap instances
for each condition across all three experiments. In each bootstrap run, 95% of the
sentences were randomly selected for training while the remaining 5% were used for
testing (Argamon et al., 2007). The average results across the 50 bootstrap runs were
reported for each experimental condition. Performance was evaluated using standard
metrics for affect analysis, which include the mean percentage error and the correlation
coefficient (Mishne and Rijke, 2006):
$$\mathrm{Mean\ \%\ Error} = \frac{100}{n}\sum\left|x - y\right|$$
$$Corr(X, Y) = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2\,\sum(y - \bar{y})^2}}$$
Where x and y are the actual and predicted intensity values for one of the n testing
instances denoted by the vectors X and Y .
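Both metrics can be computed directly, as in the sketch below; the gold and predicted values in the usage example are illustrative only:

```python
import numpy as np

def mean_percent_error(actual, predicted):
    """Mean percentage error: (100 / n) * sum |x - y|, with intensities on a
    0-1 scale so that absolute error times 100 reads as a percentage."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return 100.0 * np.mean(np.abs(actual - predicted))

def correlation_coefficient(actual, predicted):
    """Pearson correlation between actual and predicted intensities."""
    return np.corrcoef(actual, predicted)[0, 1]

gold = [0.10, 0.40, 0.75, 0.00]
pred = [0.15, 0.35, 0.60, 0.05]
print(mean_percent_error(gold, pred), correlation_coefficient(gold, pred))
```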
8.4.3 Experiment 1: Comparison of Feature Sets
In this experiment we compared generic n-grams with semantic orientation (SO),
WordNet model (WNet), and the manual lexicon (ML). We also constructed an extended
feature set comprised of n-grams, SO, WNet, and ML (labeled “All”). All feature sets
were evaluated using the support vector regression correlation ensemble (SVRCE).
SVRCE was run using a linear kernel. N-grams were selected using the information gain
heuristic applied at the affect level, as outlined previously. The information gain was
applied to the 95% training data during each of the 50 bootstrap instances; these features
were then used to train the SVRCE classifiers used on the testing data. This resulted in 16
n-gram feature subsets (1 for each affect class across the 4 test beds), and a corresponding
SVRCE model for each feature subset. SO and WNet were run using the formulas
described in sections 2.1 and 3.1.2. For SO, WNet, and ML, the word level scores were
computed for each sentence, resulting in a vector of scores for each sentence. Since
different paradigm/seed words were used for each affect across all four test beds, the
lexicon methods also generated 16 sets of sentence vectors each. Consistent with Mishne
(Mishne, 2005), these vectors were treated as features input into the SVRCE. For the
“All” feature set, the lexicon sentence vectors were merged with the n-gram frequency
vectors.
Table 8.4 and Figure 8.3 show the macro-level experimental results for the mean
percentage error and correlation coefficients across the 5 feature sets applied to all 4 test
beds. The values shown were averaged across the 4 affect classes used within each test
bed. The test bed labels correspond to the abbreviations presented in Table 8.3 under the
column “Test Bed Name.” The n-gram features appeared to have the best performance,
with the lowest mean percentage error and highest correlation coefficient for all four test
beds. The automated (i.e., SO and WNet) and manual lexicons all had fairly similar
performance, with mean errors typically in the 5%-7% range and correlation coefficients
between 0.2-0.5. As anticipated, the use of all features performed well, outperforming the
use of individual lexicons. Surprisingly however, using all features (i.e., n-grams in
conjunction with lexicons) did not outperform the use of n-grams alone. N-grams
outperformed the extended feature set by as much as 0.5% and 0.14 on mean error and
correlation coefficient, respectively. This suggests that the learned n-grams were able to
effectively represent affective patterns in the text. Adding lexicon features introduced
redundancy, and in some instances, noise. Further elaboration regarding the performance
of n-grams in comparison with other feature sets is provided in the hypotheses testing
section (4.6).
Table 8.4: Overall Results for Various Feature Sets
Mean % Error
Features | FWF | LJ | SF | MEF
N-Grams | 4.9527 | 6.6472 | 4.6360 | 3.8066
SO | 6.4928 | 7.1601 | 5.0725 | 4.4742
WNet | 5.9080 | 7.1019 | 4.9646 | 4.5507
ML | 5.5881 | 7.3417 | 4.9767 | 4.6147
All | 5.0184 | 6.9265 | 4.8176 | 4.3522
Correlation Coefficient
Features | FWF | LJ | SF | MEF
N-Grams | 0.4547 | 0.4367 | 0.6627 | 0.7455
SO | 0.2809 | 0.2389 | 0.4558 | 0.5308
WNet | 0.3147 | 0.1993 | 0.4952 | 0.5122
ML | 0.3448 | 0.1810 | 0.5388 | 0.4121
All | 0.4422 | 0.3577 | 0.6238 | 0.6036
Figure 8.3: Macro-level Mean % Error and Correlation Coefficients for Feature Sets
Figure 8.4 shows the micro level results for mean percentage error and correlation
coefficient across the 16 classes incorporated (4 affects x 4 test beds). Each class is
labeled with its test bed and the first letter of its affect. The micro-level results indicate
that the performance differences for various feature sets were fairly consistent across
classes. N-grams had the lowest class-level mean error and the highest correlation
coefficients, followed by the extended feature set. Generally the highest mean errors
occurred on the sadness and hate affects on the short story and blog test beds,
respectively (FWF-S and LJ-HT). The semantic orientation (SO) features had the worst
performance, with especially low correlation coefficients on the supremacist forum test
bed when analyzing the racism affect class (SF-R).
Figure 8.4: Micro-level Mean % Error and Correlation Coefficients for Feature Sets
8.4.4 Experiment 2: Comparison of Techniques
The SVRCE method was compared against scoring and machine learning methods
used in prior studies. The comparison techniques included Pace regression (Mishne and
Rijke, 2006), WordNet (WNet) scores (Kim and Hovy, 2004; Mishne, 2005), the pointwise mutual information scores from the semantic orientation (SO) approach, and the
scores from our manual lexicon (ML). For SO, WNet, and ML, the average word level
intensities were used as the sentence level scores as done in prior affect analysis research
(Subasic and Huettner, 2001; Grefenstette et al., 2004a; Cho and Lee, 2006). SVRCE and
Pace regression were both run using the n-gram features. N-grams were used since they
had the best performance in Experiment 1. Both techniques (i.e., SVRCE and Pace) were
run using identical features; with each using 16 feature subsets selected using the
information gain heuristic as described in Experiment 1. Any scores outside the 0-1 range
were adjusted to fit the possible range of intensities.
Table 8.5 and Figure 8.5 show the macro-level experimental results for the mean
percentage error and correlation coefficients across the 5 techniques. The SVRCE method
had the best performance, with the lowest mean percentage error and highest correlation
coefficient for all four test beds. Pace regression, WordNet (WNet) models and the
manual lexicon (ML) scoring methods were all in the middle while the semantic
orientation scoring method had the worst performance. The results are consistent with
prior research that has also observed large differences between the word level scores
assigned using WNet and SO (Mishne, 2005). The machine learning methods (SVRCE
and Pace) both fared well with respect to their correlation coefficients. Pace also
performed well on the supremacist and Middle Eastern forums in terms of mean
percentage error, but not on the blogs test bed (LJ).
Table 8.5: Results for Experiment 2 (comparison of techniques)
Mean % Error
Technique | FWF | LJ | SF | MEF
SO | 9.5634 | 12.8245 | 8.6590 | 14.8759
WNet | 7.0981 | 7.6321 | 5.9899 | 8.6639
ML | 6.4866 | 7.7012 | 6.7270 | 8.3860
SVRCE | 4.9527 | 6.6472 | 4.6360 | 3.8066
Pace | 7.5748 | 10.6183 | 6.3038 | 5.8473
Correlation Coefficient
Technique | FWF | LJ | SF | MEF
SO | 0.4044 | 0.1271 | 0.4673 | 0.2530
WNet | 0.5005 | 0.2396 | 0.5837 | 0.5224
ML | 0.4805 | 0.2352 | 0.5500 | 0.5251
SVRCE | 0.5797 | 0.4367 | 0.6627 | 0.7455
Pace | 0.4878 | 0.3856 | 0.5692 | 0.6124
Figure 8.5: Macro-level Mean % Error and Correlation Coefficients for Techniques
Figure 8.6 shows the micro level results for mean percentage error and correlation
coefficient across the 16 classes. The micro-level results indicate that the performance
differences for various techniques were fairly consistent across classes. SVRCE had the
lowest mean percentage error and the highest correlation coefficient for most classes. SO
fared especially poorly on the Middle Eastern forums for the racism, hate, and violence
affects (MEF-R, MEF-H, MEF-V), with very high error percentages and low correlation
coefficients.
Figure 8.6: Micro-level Mean % Error and Correlation Coefficients for Techniques
8.4.5 Experiment 3: Ablation Testing
Ablation testing was performed to evaluate the effectiveness of the different SVRCE
components. The SVRCE was compared against a support vector regression ensemble
(SVRE) that does not utilize correlation information, as well as a support vector
regression classifier using only a single feature set (SVR). The SVR was trained using a
single feature set (for each test bed) selected by using all n-grams occurring at least 5
times in the corpus (Jiang et al., 2004). The SVRE and SVRCE were both run using
information gain on the training data to select the 16 feature subsets most representative
of each affect class. The experiment was intended to evaluate the two core components of
SVRCE: (1) its use of feature ensembles to better represent affective content; (2) the use
of correlation information for enhanced affect classification. Table 8.6 and Figure 8.7
show the macro-level results for the mean percentage error and correlation coefficients
for SVRCE, SVRE, and SVR. The SVRCE method had the best performance, with the
lowest mean percentage error and highest correlation coefficient for all four test beds.
SVRCE marginally outperformed SVRE while both techniques outperformed SVR. The
results suggest that use of feature ensembles and correlation information are both useful
for classifying affective intensities.
Table 8.6: Results for Experiment 3 (ablation testing)
Mean % Error
Technique | FWF | LJ | SF | MEF
SVRCE | 4.9527 | 6.6472 | 4.6360 | 3.8066
SVRE | 5.0351 | 6.6501 | 5.0776 | 4.0667
SVR | 5.2379 | 7.7871 | 5.7676 | 5.0460
Correlation Coefficient
Technique | FWF | LJ | SF | MEF
SVRCE | 0.4547 | 0.4367 | 0.6627 | 0.7455
SVRE | 0.4271 | 0.4098 | 0.5990 | 0.7231
SVR | 0.3896 | 0.3267 | 0.5631 | 0.5757
Figure 8.7: Macro-level Mean % Error and Correlation Coefficients for Ablation Testing
8.4.6 Hypotheses Results
We conducted pairwise t-tests on the 50 bootstrap runs for all three experiments.
Given the large number of comparison conditions, a Bonferroni correction was performed
to avoid spurious positive results. All p-values less than 0.0005 were considered
significant at alpha = 0.01.
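A minimal sketch of this testing procedure using SciPy follows; the number of comparisons (which yields the 0.0005 cutoff at alpha = 0.01) is supplied by the caller and is an assumption here:

```python
from scipy import stats

def paired_bootstrap_test(metric_a, metric_b, n_comparisons, alpha=0.01):
    """Paired t-test over matched bootstrap runs with a Bonferroni-corrected
    threshold of alpha / n_comparisons (e.g., 20 comparisons give 0.0005)."""
    t_stat, p_value = stats.ttest_rel(metric_a, metric_b)
    return p_value, p_value < alpha / n_comparisons
```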
8.4.6.1 H1: Feature Comparison
Pairwise t-tests were conducted to compare the effectiveness of the extended and n-gram feature sets with other feature categories. N-grams and the extended feature set both
significantly outperformed the lexicon based representations on all test beds with respect
to mean error and correlation (all p-values < 0.0001). Surprisingly, the extended feature
set did not outperform n-grams. In contrast, the n-gram feature set significantly
outperformed the use of all features (n-grams plus the three lexicons), with all p-values
significant at alpha=0.01 except the correlation coefficient on the FWF test bed (p-value
= 0.0034).
Table 8.7 provides examples of learned n-grams taken from the LiveJournal test bed
for the hate affect. It also shows some related hateful items from the manual lexicon. The
n-grams were able to learn many of the concepts conveyed in the lexicon. Furthermore
the n-grams were able to provide better context for some features and also learn deeper
patterns in several instances. For example, the hate in LiveJournal blogs is often directed
towards specific people and frequently involves places and times. This pattern is captured
by the POS tag n-grams. In contrast, word lexicons cannot accurately represent such
complex patterns. The example illustrates how the n-gram features learned were more
effective than the lexicons employed in this essay.
Table 8.7: Sample Learned N-Grams and Lexicon Items for Hate Affect
Learned N-Grams:
Category | N-Gram
Character N-Grams | uck, ck, fuc
Word N-Grams | terribly, suck, the stupid, the s**t, the f**k
Hapax and Dis Legomena Collocations | HAPAX so awful
POS Tag N-Grams | PERSON_SG, WEEKDAY_NNP, TIME_SG
Lexicon Items: awful, stupid, terrible, sicken, s**t, f**k
8.4.6.2 H2: Technique Comparison
The SVRCE method significantly outperformed all four comparison techniques on
mean percentage error and correlation coefficient across all four test beds. All p-values
were less than 0.0005 and therefore significant at alpha=0.01. The results indicate that the
SVRCE method’s use of ensembles of learned n-gram features combined with affect
correlation information allows the classifier to assign affect intensities with greater
effectiveness than comparison approaches used in prior research.
8.4.6.3 H3: Ablation Tests
Pairwise t-tests were conducted to assess the contribution of the major components
of the SVRCE method. The results of SVRCE versus SVRE revealed that the use of
correlation information significantly enhanced performance on most test beds (significant
for 3 out of 4 test beds on mean error and correlation). The results were not significant for
mean error on the LiveJournal blog test bed (p-value = 0.3452) as well as for correlation
on the Middle Eastern forum data set (p-value = 0.0013). Both SVRCE and SVRE also
significantly outperformed SVR, indicating that the use of feature ensembles is effective
for classifying affect intensities (all p-values less than 0.0001, significant at alpha=0.01).
8.5 Case Study
Many prior studies have used brief case studies to illustrate the utility of their
proposed affect analysis methods (Subasic and Huettner, 2001; Mishne and Rijke, 2006).
In order to demonstrate the usefulness of the SVRCE method coupled with a rich set of
learned n-grams, we analyzed the affective intensities in two popular Middle Eastern web
forums: www.alfirdaws.org/vb and www.montada.com. Analysis of affects in such
forums is important for sociopolitical reasons and to better our understanding of social
phenomena in online communities. We hypothesized that our SVRCE method would be
able to effectively depict the likely intensity differences for appropriate affect classes
between these two web forums.
We used spidering programs to collect the content in both web forums. Table 8.8
shows summary statistics for the content collected from the two forums. The Montada
forum was considerably larger, with over 31,000 authors and a large number of threads
and postings, partially because it had been around for approximately 7 years. Firdaws
was a relatively newer forum, beginning in 2005. Due to the nature of its content and
time duration of existence, this forum had fewer authors and postings.
Table 8.8: Summary Statistics for Two Web Forums Collected
Forum | # Authors | # Threads | # Messages | # Sentences | Duration
Firdaws | 2,189 | 14,526 | 39,775 | 244,917 | 1/2005 – 7/2007
Montada | 31,692 | 114,965 | 869,264 | 2,052,511 | 9/2000 – 7/2007
Figure 8.8 shows the number of posts for each month the forums have been active. Montada was very active in 2002 and 2005, with over 20,000 posts in some months, yet appears to be in a down phase in 2007 (similar to 2004). Firdaws has consistently had between 2,500 and 3,000 posts per month since the second half of 2006.
(Figure panels: Montada Posts By Month; Al-Firdaws Posts By Month)
Figure 8.8: Posting Frequency for Two Web Forums
The SVRCE classifier was employed in conjunction with the n-gram feature set to
analyze affect intensities in the two web forums. Analysis was performed on violence,
hate, racism, and anger affects. We computed the average posting level intensities
(averaged across all sentences in a posting) as well as the total intensity per post (the
summation of sentence intensities in each posting). The analysis was performed on all
postings in each forum (approximately 900,000 postings and 2.3 million sentences). As
shown in Table 8.9, the Firdaws forum had considerably higher affect intensities for all
four affect classes, usually 2-3 times greater than Montada.
Table 8.9: Affect Intensities per Posting across Two Web Forums
Intensity | Forum | Violence | Anger | Hate | Racism
Average per Message | Firdaws | 0.084 | 0.018 | 0.037 | 0.032
Average per Message | Montada | 0.027 | 0.012 | 0.010 | 0.014
Total per Message | Firdaws | 0.523 | 0.127 | 0.178 | 0.191
Total per Message | Montada | 0.246 | 0.105 | 0.092 | 0.134
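A small sketch of the posting-level aggregation behind Table 8.9 (the average and the total of sentence intensities per posting) is given below; the sentence scores in the example are illustrative:

```python
def posting_intensities(sentence_scores):
    """Aggregate sentence-level intensities for one posting: average per
    message and total (sum) per message."""
    total = sum(sentence_scores)
    return total / len(sentence_scores), total

# A posting with four sentence-level violence intensities:
print(posting_intensities([0.0, 0.125, 0.25, 0.125]))   # (0.125, 0.5)
```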
Figure 8.9 depicts the average message violence and hate intensities over time for all
postings in each of the two web forums. The x-axis indicates time while the y-axis shows
the intensities (on a scale of 0 to 1). Each point represents a single message while areas
with greater message concentrations are darker. The blank periods in the diagrams
correspond to periods of posting inactivity in forums (see Figure 8.8 for correspondence).
Based on the diagram we can see that Al-Firdaws has considerably higher violence and
also greater hate intensity across time. Al-Firdaws also appears to have increasing
violence intensity in 2007 (based on the concentration of postings), possibly attributable
to the increased activity in this forum. In contrast, violence and hate intensities are
consistently low in Montada. The results generated using SVRCE and n-gram features
are consistent with existing knowledge regarding these two forums. The case study
illustrates how the proposed features and techniques can be successfully applied towards
affect analysis of computer mediated communication text.
(Figure panels: Firdaws-Violence, Firdaws-Hate, Montada-Violence, Montada-Hate)
Figure 8.9: Temporal View of Intensities in Two Web Forums
8.6 Conclusions
In this chapter we evaluated various features and techniques for affect analysis of
online texts. In addition, the support vector regression correlation ensemble (SVRCE)
was proposed. This method leverages an ensemble of SVR classifiers with each
constructed for a separate affect class. The ensemble of predictions combined with the
correlation between affect classes is leveraged for enhanced affect classification
performance. Experimental results on test beds derived from online forums, blogs, and
stories revealed that the proposed method outperformed existing affect analysis
techniques. The results also suggested that learned n-grams can improve affect
classification performance in comparison with lexicon based representations. However,
combining n-gram and lexicon features did not improve performance due to increased
amounts of noise and redundancy in the extended feature set. A case study was also
performed to illustrate how the proposed features and techniques can be applied to large
cyber communities in order to reveal affective tendencies inherent in these communities’
discourse. To the best of our knowledge, the experiments conducted in this essay are the
first to evaluate features and techniques for affect analysis. Furthermore, we are also
unaware of prior research applied to such a vast array of domains and test beds.
We believe this chapter provides an important stepping stone for future work intended
to further enhance the feature representations and techniques used for classifying affects.
Based on this work, we have identified several future research directions. We intend to
apply the techniques across a larger set of affect classes (e.g., 10-12 affects per test bed).
We are also interested in exploring additional feature representations, such as the use of
richer learned n-grams (e.g., semantic collocations, variable n-gram patterns, etc.). We
also plan to evaluate the effectiveness of real world knowledge bases such as those
employed by Liu et al. (2003).
CHAPTER 9: CYBERGATE: A SYSTEM AND DESIGN FRAMEWORK FOR TEXT
ANALYSIS OF COMPUTER-MEDIATED COMMUNICATION
9.1 Introduction
Up until now, we have considered the textual and ideational meta-functions of
Systemic Functional Linguistic Theory in isolation. In this chapter, we propose a design
framework for text analysis of computer mediated communication. The framework
argues for systems that consider the ideational, textual, and interpersonal meta-functions.
The CyberGate system is developed as an instantiation of the proposed framework.
Computer mediated communication (CMC) has seen tremendous growth due to the
fast propagation of the Internet. Text-based modes of CMC include email, listservs,
forums, chat, and the World Wide Web (Herring, 2002). These CMC modes have
redefined the fabric of organizational culture and interaction. With the persistent
evolution of communication processes and constant advancements in technology, such
metamorphoses are likely to continue. An important trend has been the increased use of
online communities; communities interacting virtually via CMC (Cothrel, 2000). Online
communities provide invaluable support for various business operations including
organizational communication, knowledge dissemination, transfer of goods and services,
and product reviews (Cothrel, 2000). Electronic communities (Wenger and Snyder, 2000)
and networks (Wasko and Faraj, 2005) of practice enable companies to tap into the
wealth of information and expertise available across corporate lines. Virtual teams and
group support systems facilitate organizational operations regardless of physical
boundaries (Fjermestad and Hiltz, 1999; Montoya-Weiss et al., 2001). Internet
marketplaces allow the efficient transfer of goods and services and offer a medium for
consumer feedback equally useful to potential customers and marketing departments
(Turney and Littman, 2003).
In spite of the numerous benefits of CMC, it is not without its pitfalls. Two
characteristics have proven to be particularly problematic: the lack of controls on
information quality and the enormity and complexity of data present in CMC.
Newsgroups and knowledge exchange communities suffer from lurkers and agitators that
decrease the signal to noise ratio in CMC, casting doubts onto the reliability of
information exchanged (Smith and Viegas, 2004; Wasko and Faraj, 2005). Additionally,
online communities encompass very large scale conversations involving thousands of
users (Sack, 2000; Herring, 2002). The enormous information quantities make such
places difficult to navigate and analyze (Viegas and Smith, 2004).
CMC text analysis is the analysis of text-based modes of CMC. There is a need for
analysis techniques that can evaluate, summarize, and present CMC text. Systems
capable of navigation and knowledge discovery can enhance informational transparency
(Sack, 2000; Wellman, 2001). Tools supporting social translucence via the measurement
of social accounting data in CMC may improve information quality and analysis
capabilities, a condition mutually beneficial to online community members and
researchers/analysts (Smith, 2002; Erickson and Kellogg, 2000). Consequently, numerous
CMC systems have been developed to address these needs (Xiong and Donath, 1999;
Fiore and Smith, 2002; Viegas and Smith, 2004). These systems generally visualize data
provided in the message headers such as interaction (send/reply structure) and activity
(posting patterns) based information. Little support is provided for analysis of
information contained in the message body text. In the instances where text analysis is
provided, simple feature representations such as those used in information retrieval
systems are utilized (Mladenic, 1999; Sack, 2000).
CMC text is rich in social cues including emotions, opinions, style, and genres (Yates
and Orlikowski, 2002; Henri, 1992; Hara et al., 2000). Improved CMC text analysis
capabilities based on richer text representations are necessary (Paccagnella, 1997). CMC
analysis systems often neglect message text due to representational and presentation
complexities. Thus, design guidelines for CMC systems supporting text analysis are
needed (Sack, 2000). Using Walls et al.’s (1992) model, this paper proposes a design
framework for CMC text analysis systems. Grounded in System Functional Linguistic
Theory, the framework calls for the development of systems that support various
information types found in CMC text. Based upon it, we developed the CyberGate
system which incorporates various features, feature selection, and visualization
techniques including the Writeprints and Ink Blots techniques.
The remainder of the chapter is organized as follows. We firstly highlight the unique
characteristics of text-based CMC and provide a review of systems developed to support
CMC text analysis. We then describe challenges associated with CMC text and present an
overview of our design framework. Subsequent sections elaborate on the components of
the design framework. A description of the CyberGate system (developed as an
instantiation of our framework) is then offered. The two ensuing sections provide
application examples and experimental evaluations of the CyberGate system and its
underlying framework. We conclude with a summary of our research contributions and
potential future directions.
9.2 Background
Many studies have expounded upon the significance of CMC text analysis for
analyzing organizations (Chia, 2000). Online discourse via computer mediation has
resulted in new forms of communicative practice worthy of in depth analysis (Wilson and
Peterson, 2002). CMC text analysis is important for evaluating the effectiveness and
efficiency of electronic communication in various organizational settings, including
virtual teams and group support systems (Fjermestad and Hiltz, 1999; Montoya-Weiss et
al., 2001). Analysis of CMC text also plays a crucial role in facilitating the measurement
of return on investment for various online communities including electronic communities
and networks of practice (Cothrel, 2000; Wenger and Snyder, 2000; Wasko and Faraj,
2005). CMC and other Internet related technologies are not a source of business value by
themselves. They require the utilization of ancillary IT resources (Barua et al., 2004).
Paccagnella (1997) emphasized the need for such systems supporting CMC analysis (pp.
4-5), stating that “deep, interpretative research on virtual communities could be greatly
helped by an accurate use of new analytic, powerful yet flexible tools, exploiting the
possibility of cheaply collecting, organizing and exploring digital data.” In the remainder
of this section, we describe the unique characteristics of CMC text that differentiate it
from other text documents. We also review prior CMC systems and emphasize the need
for ones supporting enhanced text analysis.
9.2.1 CMC Text
Computer mediated communication text has several unique characteristics which
differentiate it from non-CMC documents (e.g., essays, reports, news articles, resumes,
research papers). Three of these distinct properties are described here. (1) The
communicative nature of CMC text makes it rich in interaction (Sack, 2000), while non-CMC documents are generally devoid of interaction information. Asynchronous and
synchronous forms of CMC both contain high levels of interaction, with the specific
discourse patterns and dynamics varying depending on the communication context and
CMC mode (Herring, 2002; Fu et al., 2008). (2) CMC text also differs from non-CMC
documents with respect to its informational composition. While non-CMC documents
have a high concentration of topical information (Mladenic, 1999), such information is
less pervasive in CMC. Nigam and Hurst (2004) analyzed thousands of messages posted
on USENET (a large collection of newsgroups), and found that only 3% of sentences
contained topical information. In contrast, web discourse is rich in opinion and emotion
related information (Subasic and Huettner, 2001; Nigam and Hurst, 2004). (3) CMC text
and non-CMC documents also differ linguistically, with new CMC technologies bringing
about the emergence of novel language varieties (Wilson and Peterson, 2002). CMC text
encompasses a large spectrum of stylistic, genre-based, and idiosyncratic language usage
attributable to age, gender, educational, cultural, and contextual differences (Sack, 2000;
Herring, 2002).
9.2.2 CMC Text Analysis Features
The richness of CMC has brought about the emergence of many types of CMC text
analysis. These include analysis of participation levels, interaction, social cues, topics,
user roles, linguistic variation, types of questions posed, response complexity, etc. (Henri,
1992; Hara et al., 2000). The features (i.e., attributes) utilized for CMC text analysis can
be broadly categorized as either structural or text-based.
Structural features are attributes based on communication topology. These features
are extracted solely from message headers, without any use of information contained in
the message body (Sack, 2000). Structural features support activity and interaction
analysis. Posting activity related features include number of posts, number of initial
messages, number of replies, number of responses to a particular author’s posts, etc.
(Fiore and Smith, 2002). These features can be used to represent an author's social
accounting metrics (Smith, 2002). Analysis of activity based attributes also provides
insight into different roles played by online community members, such as debaters,
experts, and disseminators (Zhu and Chen, 2001; Viegas and Smith, 2004). Features used
for interaction analysis include the frequency of incoming and outgoing messages. These
features are used as input for the construction of social networks based on who is talking
to whom (Sack, 2000; Smith and Fiore, 2001).
Text features are attributes derived from the message body. Although the
informational richness of CMC text was previously questioned (Daft and Lengel, 1986),
numerous studies have since demonstrated the opulence of CMC text (Yates and
Orlikowski, 2002; Lee, 1994). In addition to topical information, CMC text is rich in
social cues (Henri, 1992), power cues (Panteli, 2002), and genres (Yates and Orlikowski,
2002). Social cues are elements not related to formal content or subject matter (Henri,
1992). Examples include self-introductions, expressions of feeling, greetings, signatures,
jokes, use of symbolic icons, and compliments (Hara et al., 2000). CMC text also
contains evidence of power cues: stylistic indicators of one's position/rank within an
organization or online community (Panteli, 2002). Genres are types of writing based on
purpose and form (e.g., memos, meetings, reports, etc.). Highly prevalent in CMC, they
serve as sources of organizing structures and communicative norms (Yates and
Orlikowski, 2002).
Inclusion of structural and text-based features is critical for CMC text analysis. For
instance, online community sustainability analysis requires the use of communication
activity, interaction, and text content attributes (Butler, 2001). Similarly, Cothrel (2000)
incorporated structural features (activity measures) and text features (discussion topics)
into his model for measuring an online community’s return on investment. He noted that
activity measures describe the general health of a community while discussion topic
metrics “assess the ongoing insights that the community offers into the business’s
products or processes” (p. 19).
9.2.3 CMC Text Analysis Systems
CMC systems can be categorized into two categories based on functionality: those
that support the communication process and those that support analysis of
communication content (Sack, 2000). While it is certainly possible for a single system to
support both functions (e.g., Erickson and Kellogg, 2000), we only focus on the analysis
functionalities provided by these systems due to their relevance to CMC text analysis.
Table 9.1 provides a review of prior CMC systems supporting analysis of text-based
CMC, based on the analysis features included.
Table 9.1: Previous CMC Systems
System Name | Reference | Structural Features | Text-based Features | Feature Descriptions
Chat Circles | Donath et al., 1999 | √ | √ | length, headers
Loom | Donath et al., 1999 | √ | √ | terms, punc., headers
People Garden | Xiong and Donath, 1999 | √ | | headers
Babble | Erickson and Kellogg, 2000 | √ | | headers
Conversation Map | Sack, 2000 | √ | √ | semantic, headers
NetScan | Smith and Fiore, 2001 | √ | | headers
Communication Garden | Zhu and Chen, 2001 | √ | √ | noun phrases, headers
Coterie | Donath, 2002 | √ | | headers
Newsgroup Treemaps | Fiore and Smith, 2002 | √ | | headers
PostHistory | Viegas et al., 2004 | √ | | headers
Social Network Fragments | Viegas et al., 2004 | √ | | headers
Authorlines | Viegas and Smith, 2004 | √ | | headers
Newsgroup Crowds | Viegas and Smith, 2004 | √ | | headers
A plethora of CMC systems have been developed to support structural features.
Several tools visualize posting activity patterns, such as Loom (Donath et al., 1999) and
Authorlines (Viegas and Smith, 2004). PeopleGarden and Communication Garden both
use garden metaphors with flower glyphs to display author and thread activity (Xiong and
Donath, 1999; Zhu and Chen, 2001). The number of petals and thorns, petal colors, and
stem lengths are used to represent activity features such as the total number of posts and
number of threads an author has been active in. Babble (Erickson and Kellogg, 2000) and
Coterie (Donath, 2002) are both geared towards showing activity patterns in persistent
conversation. In these systems, all participants are displayed in a two-dimensional space.
More active authors are shown in the center while participants with fewer postings
gradually shift towards the perimeter. The visual effect is a good method for identifying
active participants versus lurkers (Donath, 2002). Systems displaying interaction
information also exist. Conversation Map visualizes social networks based on send/reply
patterns (Sack, 2000). NetScan displays message and author interactions (Smith and
Fiore, 2001) while Loom shows thread-level interaction structures (Donath et al., 1999).
Previous CMC systems offer limited support for text-based features. Loom (Donath et
al., 1999) shows some content patterns based on message moods. The moods are assigned
based on the occurrence of certain terms and punctuation in the message text. Chat
Circles (Donath et al., 1999) displays messages based on body text length. Conversation
Map (Sack, 2000) and Communication Garden (Zhu and Chen, 2001) provide more in
depth topical analysis. Conversation Map uses computational linguistics to build
semantic networks for discussion topics while Communication Garden performs topic
categorization based on noun phrases.
Text systems are a related class of systems that are used for either information
retrieval (IR systems) or general text categorization and analysis (text mining systems).
However, information retrieval (IR) systems are more concerned with information access
than analysis (Hearst, 1999). Mladenic (1999) presented a review of 29 IR systems, all of
which used bag-of-words to represent text document topics. Similarly, Tan (1999)
reviewed 11 commercial text mining systems and found IBM’s Intelligent Miner to be the
most comprehensive. However this system also utilizes limited feature representations
(i.e., bag-of-words, named entities) and only performs topic categorization and analysis
(Dorre et al., 1999).
Overall, the features used in existing CMC text analysis systems are insufficient to
effectively capture text-based content in CMC (Sack, 2000). Paccagnella (1997)
suggested that computer programs to support CMC text analysis would be helpful, yet do
not exist. He noted numerous ways in which automated systems could benefit CMC text
analysis, including data linking, content analysis, data display, and graphic mapping.
Without appropriate CMC text analysis systems, text features are often overlooked
(Panteli, 2002). There has been limited analysis of CMC text since manual methods are
time consuming (Hara et al., 2000). Cothrel (2000) stated that discussion content is an
essential dimension of online community success measurement, yet proper definition and
measurement remains elusive.
9.3 A Design Framework for CMC Text Analysis
Given the need for CMC text analysis and lack of systems that address this need, an
important and obvious question arises. Why do most CMC systems support structural
features but neglect text content features? There are three major differences that are likely
responsible for the disparity between the numbers of systems representing these feature
types, including feature definitions, extraction, and presentation. Structural features are
well defined, easy to extract, and easy to visualize. Appropriate activity (Fiore and Smith,
2002) and interaction based features have been established in the sociology literature.
These features are also easy to extract and visualize using bar chart variants for activity
frequency (Xiong and Donath, 1999; Viegas and Smith, 2004) and networks for
interaction (Donath et al., 1999, Smith and Fiore, 2001). In contrast text features are not
well defined, difficult to extract, and harder to present to end users. The richness of CMC
text necessitates a complex set of text features (Donath et al., 1999). For example, over
1,000 text features have been used for analyzing style, with no consensus (Rudman,
1997). Additionally, text feature extraction can be challenging due to high noise levels in
CMC text (Nasukawa and Nagano, 2001). Finally, the informational richness of text
requires multiple complementary presentation views (Losiewicz et al., 2000; Keim,
2002). Different techniques have been developed to support various facets of text
visualization with no ideal solution (Wise, 1999; Miller et al., 1998; Huang et al., 2005).
In light of these challenges, Sack (2000) argues for a new CMC system design
philosophy that incorporates automatic text analysis techniques. He states “…it is
necessary to formulate a complementary design philosophy for CMC systems in which
the point is to help participants and observers spot emerging groups and changing
patterns of communication…” (p. 86). Design guidelines are needed due to the lack of
previous tools supporting in-depth CMC text analysis, the complexity associated with
properly representing CMC text, and the lack of consensus regarding appropriate text
features and presentation formats.
9.3.1 Proposed CMC Text Analysis Framework
According to the design science paradigm, design is a product and a process (Walls et
al., 1992; Hevner et al., 2004). Development of a design framework for CMC text
analysis systems requires consideration of the design product and design process. The
design product is the set of requirements and necessary design characteristics that should
guide IT artifact construction. An IT artifact can be a construct, method, model, or
instantiation (Hevner et al., 2004). The design process is the steps and procedures taken
to develop the artifact.
Information systems development typically follows an iterative design process of
building and evaluating (March and Smith, 1995), which is analogous to the generate/test
cycle proposed by Simon (1996). Such an approach is particularly important in design
situations involving complex or vaguely defined user requirements (Markus et al., 2002).
We believe that the ambiguities associated with CMC text analysis also warrant the use of
an iterative design process. Hence, we focus on the design product. Walls et al. (1992)
presented a model for the formulation of information systems design theories (ISDTs).
Their model incorporates four components guiding the design product aspect of an ISDT.
These include the kernel theories, meta-requirements, meta-design, and testable
hypotheses (shown in Table 9.2). The kernel theories govern meta-requirements for the
design product. The meta-design is anticipated to fulfill these meta-requirements by
providing detailed specifications for the class of IT artifacts addressed by the design
product. Testable hypotheses are used to evaluate how well the meta-design satisfies
meta-requirements. A good example of an ISDT design product is the relational database
(Walls et al., 1992). Using relational database theory as a kernel theory, the meta-requirements are the elimination of insertion, update, and deletion anomalies. The meta-design consists of a set of tables in at least third normal form. The testable hypotheses are
theorems and proofs validating the normalized database tables as being devoid of any
anomalies.
Table 9.2: Components of an ISDT Design Product (Walls et al., 1992)
1. Kernel theories: Theories from natural or social sciences governing design requirements
2. Meta-requirements: Describes a class of goals to which the theory applies
3. Meta-design: Describes a class of artifacts hypothesized to meet the meta-requirements
4. Testable hypotheses: Used to test whether the meta-design satisfies the meta-requirements
Using Walls et al.’s model, we propose a design framework for CMC text analysis
systems (shown in Table 9.3). Employing Systemic Functional Linguistic Theory as our
kernel theory, we propose meta-requirements and a meta-design necessary to support
CMC text analysis. We also present hypotheses intended to evaluate how well the meta-design satisfies our meta-requirements. The ensuing sections elaborate upon the
components of our design framework.
Table 9.3: Components of the Proposed Design Framework for CMC Text Analysis
1. Kernel theory: Systemic Functional Linguistic Theory (SFLT)
2. Meta-requirements: Support for the various information types found in CMC text that represent the ideational, textual, and interpersonal meta-functions.
3. Meta-design: The incorporation of a rich set of text features coupled with appropriate feature selection and visualization methods, collectively capable of representing the ideational, textual, and interpersonal meta-functions. Specific meta-design elements are as follows:
   (a) Utilization of an extended feature set comprised of language and processing resources.
   (b) Use of ranking and projection based feature selection techniques.
   (c) Inclusion of multi-dimensional, text overlay, and interaction visualization methods.
4. Testable hypotheses: Empirical evaluation of the features and selection/visualization techniques' ability to accurately represent information types associated with the three meta-functions. Specific testable hypotheses are as follows:
   (a) Ability of the features, feature selection, and visualization methods to characterize information types associated with the three meta-functions.
   (b) Ability of the features, feature selection, and visualization methods to discriminate information types associated with the three meta-functions.
9.4 Kernel Theory
Perhaps the most important characteristic of CMC is the language complexities it
introduces as compared to other forms of text (Wilson and Peterson, 2002). Effective
analysis of CMC text entails the utilization of a language theory that can provide
representational guidelines. Grounded in Functional Linguistics, Systemic Functional
Linguistic Theory (SFLT) provides an appropriate mechanism for representing CMC text
information (Halliday, 2004). SFLT states that language has three meta-functions:
ideational, interpersonal, and textual (Halliday, 2004). The three meta-functions are
intended to provide a comprehensive functional representation of language meaning by
encompassing the physical, mental, and social elements of language (Fairclough, 2003).
The ideational meta-function states that language consists of ideas. According to
Halliday (2004), the ideational meta-function of language suggests that a message is
“about something” or “construing experience” (p. 30). It pertains to the use of “language
as reflection” (Halliday, 2004; p. 29). The ideational meta-function relates to aspects of
the “mental world” which include attitudes, desires, and values (Fairclough, 2003;
Halliday, 2004).
The textual meta-function indicates that language has organization, structure, flow,
cohesion, and continuity (Halliday, 2004). It relates to aspects of the “physical world”
pertaining to the manner in which ideas are communicated (Fairclough, 2003; Halliday,
2004). The textual meta-function therefore serves as a facilitating function enabling the
conveyance of the ideational and interpersonal meta-functions. It can be represented via information types such as styles, genres, and vernaculars (Argamon et al., 2007). For instance, an author bio and a vita may convey similar ideational meaning about one's educational background and career accomplishments using contrasting textual functions,
in this case due to genre differences.
The interpersonal meta-function refers to the fact that language is a medium of
exchange between people (Sack, 2000). It pertains to the use of “language as action”
(Halliday, 2004; p. 30). The interpersonal meta-function is concerned with the enactment
of social relations; it relates to aspects of the “social world” (Fairclough, 2003; Halliday,
2004). It is generally represented using CMC interaction information.
9.5 Meta-Requirements
Analysis of CMC text requires the inclusion of all three language meta-functions
described by SFLT: ideational, textual, and interpersonal. “Any summary of a very large
scale conversation is incomplete if it does not incorporate all three of these meta-functions (ideational, interpersonal, and textual),” (Sack, 2000; p. 75). Therefore,
effective depiction of CMC text entails consideration of information types capable of
representing these three meta-functions.
The ideational meta-function in CMC text can be manifested in the form of various
information types, including topics, events, opinions, and emotions. Topics are the most
commonly represented information type in text (Mladenic, 1999; Tan, 1999). Events are
specific incidents with a temporal dimension. While “hurricane” is a topic, “Hurricane
Katrina” is an event. Event detection has garnered significant attention in recent years,
though it continues to present challenges since effective representation of events in text
remains elusive (Allan et al., 1998). Additional information types representing the
ideational meta-function include opinions and emotions. Opinions include sentiment
polarities (e.g., positive, neutral, negative) and intensities (e.g., high or low) about a
particular target (Pang, 2002). Popular applications of opinion related information include
mining online movie and product reviews for consumer preference information (Turney
and Littman, 2003). CMC text is also rich in emotional information (Picard, 1997).
Emotions encompassed in online communication consist of various affects such as
happiness, sadness, horror, anger, etc. (Subasic and Huettner, 2001).
Styles, genres, and vernaculars are information types representing the textual meta-function. Style is based on the literary choices an author makes, which can be a reflection
of context (who, what, when, why, where) and personal background (education, gender,
etc.). Example styles are formal (use of greetings, structured sentences, paragraphs) and
informal (no sentences, no greetings, erratic punctuation, use of slang). Stylistic
information is utilized in numerous forms of analysis. Authorship analysis identifies and
characterizes individuals based on their writing style (Zheng et al., 2006). Deception
detection attempts to determine if an individual’s writing is deceitful (Zhou et al., 2004),
while power cue identification explores the writing style differences between superiors
and subordinates in organizational settings (Panteli, 2002). Genres are classes of writing.
Genres found in CMC include inquiries, informational messages, memos, reports,
interview transcripts, feedback comments, etc. (Yates and Orlikowski, 2002; Santini,
2004).
The interpersonal meta-function is generally represented by CMC interaction
information (i.e., who is communicating with whom). Interaction information can be
derived from message headers for certain CMC modes such as email and blogs. In email,
the “RE:” in the message subject coupled with the presence of quoted content are salient
interaction cues. However, other CMC modes (e.g., chat rooms, instant messaging, web
forums) require the use of text interaction cues inherent in the body text. Text-based
interaction cues include direct references to fellow users’ names, references to previously
posted content, and conjunction and ellipsis based cues indicating continuation of an
existing conversation between users (Sack, 2000; Fu et al., 2008). Interaction information
is useful for social network analysis and evaluation of conversation streams based on
communication thread patterns (Smith and Fiore, 2001).
It is worth noting that the three meta-functions and their associated information types
are interrelated and should not be considered in isolation from one another. For instance,
an analyst may be concerned with opinions regarding a particular topic, or the stylistic
tendencies of two interacting participants’ text. Table 9.4 below shows examples of information types that represent the three meta-functions, along with their related analysis applications. The following section presents a meta-design for how the three meta-functions can be supported by accurately representing their corresponding information types.
Table 9.4: Various Information Types for the Three Meta-Functions

Ideational
  Topics: Topical Analysis (Mladenic, 1999; Chen et al., 2003)
  Events: Event Detection (Allan et al., 1998)
  Opinions: Sentiment Analysis (Turney & Littman, 2003; Argamon et al., 2007)
  Emotions: Affect Analysis (Picard, 1997; Subasic & Huettner, 2001)
Textual
  Style: Authorship Analysis (Zheng et al., 2006; Abbasi & Chen, 2006); Deception Detection (Zhou et al., 2004); Power Cues (Panteli, 2002)
  Genres: Genre Analysis (Yates & Orlikowski, 2002; Santini, 2004)
  Vernaculars: Semantic Networks (Sack, 2000; Koppel & Schler, 2003)
Interpersonal
  Interaction: Social Networks (Sack, 2000; Viegas et al., 2004); Conversation Streams (Smith & Fiore, 2001)
9.6 Meta-Design
While meta-requirements are derived from the kernel theories, the objective of the
meta-design is to introduce a class of artifacts hypothesized to meet the meta-
requirements (Walls et al., 1992). Three critical elements of any text mining, text
analysis, or information retrieval system are the features, feature selection methods, and
visualization techniques (Tan, 1999; Mladenic, 1999; Chen, 2001; Cunningham, 2002).
For CMC text analysis, the meta-design requires the incorporation of an extended set of
linguistic features (Mladenic, 1999; Cunningham, 2002) capable of representing various
information types associated with the ideational, textual, and interpersonal meta-functions. Feature selection methods present features in a ranked and/or reduced state for
improved knowledge discovery (Guyon and Elisseeff, 2003). Feature selection techniques
are necessary for enhancing the representational richness of various information types
present in CMC text (Mladenic, 1999). Although a large number of linguistic features are
needed for CMC text analysis, only a subset of these may be relevant or useful for a
particular information type (Hearst, 1999; Forman, 2003). Furthermore, visualization
techniques are needed for effective analysis of CMC text (Tan, 1999; Chen, 2001). Such
methods are capable of presenting important CMC text information in a concise and
informative manner (Wise, 1999; Keim, 2002). In the subsequent sections we review the
merits of potential meta-design alternatives for features, feature selection, and
visualization.
9.6.1 Features for CMC Text Analysis
Text features are linguistic attributes used to represent various information types.
They can be classified into two broad categories: language resources and processing
resources (Cunningham, 2002). Language resources are data-only resources such as
lexicons, thesauruses, and word lists. These self-standing features exist independent of
their application context and provide powerful discriminatory potential. However,
language resource construction is often manual, and features may be less generalizable
across information types (Pang et al., 2002).
Processing resources require algorithms for computation. Parts-of-speech tags, n-grams, statistical features (e.g., average word length), and bag-of-words are all examples of processing resources. The majority of processing resource features are context-dependent; they change according to the text corpus. However, the extraction procedures
remain constant, making processing resources highly generalizable across information
types. Consequently, features such as bag-of-words, parts-of-speech, and n-grams are
used to represent numerous information types including topics, events, opinions, style,
and genres (Pang et al., 2002; Santini, 2004).
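To make the distinction concrete, the following is a minimal Python sketch of how a few common processing resources (lexical statistics, bag-of-words, word and POS n-gram, and character n-gram features) might be extracted from a single message. It assumes the NLTK library (with its tokenizer and tagger models installed) and is offered purely as an illustration, not as the extraction code used in this research.

import re
from collections import Counter

import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' models are installed

def processing_features(text):
    """Extract a few illustrative processing-resource features from a message."""
    tokens = nltk.word_tokenize(text)
    words = [t for t in tokens if t.isalpha()]
    pos_tags = [tag for _, tag in nltk.pos_tag(tokens)]

    features = {}
    # Statistical (lexical) features
    features["total_words"] = len(words)
    features["avg_word_length"] = sum(map(len, words)) / max(len(words), 1)
    # Bag-of-words and word bigrams
    features["bag_of_words"] = Counter(w.lower() for w in words)
    features["word_bigrams"] = Counter(zip(words, words[1:]))
    # Part-of-speech tag bigrams
    features["pos_bigrams"] = Counter(zip(pos_tags, pos_tags[1:]))
    # Character trigrams (whitespace collapsed)
    chars = re.sub(r"\s+", " ", text.lower())
    features["char_trigrams"] = Counter(chars[i:i + 3] for i in range(len(chars) - 2))
    return features

print(processing_features("The energy contract was approved by legal counsel today."))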
Using language and processing resources in conjunction can improve text
categorization and analysis capabilities since processing resources provide breadth across
information types while language resources offer depth within specific information types
(Cunningham, 2002). Table 9.5 provides a summary of numerous language and
processing features as well as information types these feature groups have been used to
represent. The table can be read as follows: syntactic language resources, including
function words, punctuation, and special characters have been used to represent opinion,
style, genre, and interaction information in text. While most of the feature descriptions are straightforward, certain categories (e.g., lexical) are more involved. Interested readers can obtain further details about these feature groups from prior studies (Koppel
and Schler, 2003; Zheng et al., 2006).
Table 9.5: Various Linguistic Features used for Text Analysis

Language Resources
  Syntactic (Info. Types: Opinions, Style, Genres, Interaction)
    Function Words: of, for, the, on, who, what, because
    Punctuation: ! ? : "
    Special Characters: $ @ # * &
  Structural (Info. Type: Style)
    Technical Structure: file extensions, font colors, sizes
  Lexicons
    Sentiments (Opinions): positive/negative term lists
    Affect Classes (Emotions): happiness, anger, hate, etc. terms
    Idiosyncrasies (Style): misspelled word lists, vernaculars
    Geographic (Events): lists of places (e.g., states, cities)
    Temporal (Events): time references (e.g., day, month)
  Thesaurus (Info. Types: Opinions, Emotions)
    Synonyms: synonymy information for words

Processing Resources
  Lexical (Info. Type: Style)
    Word Lexical: total words, % characters per word
    Character Lexical: total characters, % numeric characters
    Vocabulary Richness: hapax legomena, Yule's K
    Word Length Dist.: frequency of 1-20 letter words
  Syntactic (Info. Types: Opinions, Style, Genres)
    Character N-grams: at, att, atta, attai
    Digit N-grams: 12, 94, 192
    POS Tag N-grams: NNP VB, VB ADJ
    Word N-grams: went to, to the, went to the
  Semantic (Info. Types: Topics, Events, Opinions, Style, Genres, Interaction)
    Noun Phrases: account, bonds, stocks
    Named Entities: Enron, Cisco, El Paso, California
    Bag-of-Words: all words except function words
  Structural (Info. Type: Style)
    Document Structure: has greeting, URL, quoted content
9.6.2 Feature Selection Techniques for CMC Text Analysis
Two categories of feature selection techniques commonly applied to text are ranking
and projection based methods (Guyon and Elisseeff, 2003). Ranking techniques rank
attributes based on some heuristic (Hearst, 1999). Examples include information gain,
chi-squared, and Pearson’s correlation coefficient (Forman, 2003; Koppel and Schler,
2003). Projection methods are transformation based techniques that utilize dimensionality
reduction (Huang et al., 2005). Examples are principal component analysis (PCA), multi-dimensional scaling (MDS), and self-organizing maps (SOM) (Chen et al., 2003; Huang et al.,
2005). Ranking and projection based methods each have their advantages and
disadvantages.
Ranking methods have been used to analyze several information types, including
topics, style, and opinions (Abbasi and Chen, 2005; Pang et al., 2002). They offer greater
explanatory potential than projection methods since they preserve the original feature set
and simply rank/sort attributes (Seo and Shneiderman, 2005). Ranking methods also offer
simplicity and scalability. However, they typically only consider an individual feature’s
predictive power, resulting in the potential loss of information stemming from feature
interactions (Guyon and Elisseeff, 2003).
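As an illustration of ranking-based selection, the short Python sketch below scores each bag-of-words term by its mutual information with the class label (a common surrogate for information gain) and sorts the full feature set accordingly. It assumes scikit-learn, and the tiny corpus and labels are hypothetical placeholders rather than data from this research.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

docs = ["the energy deal was approved",
        "gas and energy prices rose",
        "pending litigation and legal counsel",
        "the lawsuit goes to litigation"]
labels = [0, 0, 1, 1]  # hypothetical topic labels: 0 = energy, 1 = litigation

vec = CountVectorizer()
X = vec.fit_transform(docs)
scores = mutual_info_classif(X, labels, discrete_features=True)

# Rank terms by score, highest first; the original feature set is preserved, only sorted
ranked = sorted(zip(vec.get_feature_names_out(), scores), key=lambda pair: -pair[1])
print(ranked[:5])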
Projection methods have been used to transform text feature spaces into lower
dimensional projections for style and topic categorization (Allan et al., 2001; Chen et al.,
2003; Abbasi and Chen, 2006). Projection methods are highly robust against noise,
making them useful for text analysis. They can uncover important underlying patterns
(Abbasi and Chen, 2006). However, the transformation process from original features to
projections can also diminish explanatory potential (Seo and Shneiderman, 2005).
Projection methods may describe important high-level patterns but have difficulty
explaining details about specific features.
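The sketch below illustrates the projection-based alternative: terms are projected into a two-dimensional space with multi-dimensional scaling, using the cosine dissimilarity of their document occurrence profiles. scikit-learn is assumed, the documents are placeholders, and the sketch is not intended to reproduce any particular system's projection.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

docs = ["energy prices and energy deals",
        "legal counsel on the energy deal",
        "litigation and legal counsel fees"]

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray().T          # rows = terms, columns = documents
dist = pairwise_distances(X, metric="cosine")    # dissimilarity of term occurrence profiles
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)

for term, (x, y) in zip(vec.get_feature_names_out(), coords):
    print(f"{term}: ({x:.2f}, {y:.2f})")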
The rank-by-feature framework states that systems designed to support complex
analysis tasks should incorporate divergent feature selection methods to enhance analysis
capabilities (Seo and Shneiderman, 2005). For instance, using ranking and projection
methods in unison, i.e., independently applying them to the same data, can facilitate
analysis of overview (projection methods) and specific feature details (ranking methods).
Therefore, CMC text analysis systems should employ both categories of feature selection
techniques. Table 9.6 shows examples of ranking and projection based methods applied
to various information types.
Table 9.6: Examples of Ranking and Projection Methods Applied to Text

Ranking
  Information Gain (Topics): Koppel & Schler, 2003
  Chi-Squared (Topics): Forman, 2003
  Decision Tree Model (Style): Abbasi & Chen, 2005
  Minimum Frequency (Opinions): Pang et al., 2002
Projection
  Principal Component Analysis (Style): Abbasi & Chen, 2006
  Multi-Dimensional Scaling (Topics): Allan et al., 2001
  Self-Organizing Map (Topics): Chen et al., 2003
9.6.3 Visualization Techniques for CMC Text Analysis
CMC text analysis systems should present interaction information using network and
tree representations as done in prior systems (Sack, 2000; Smith and Fiore, 2001).
However, visualization of text information derived from message bodies is challenging
since text cannot easily be described by numbers (Keim, 2002). Visualization of complex
high dimensional information can be enhanced using coordinated views, i.e., multiple
complementary presentation formats (Losiewicz et al., 2000; Andrienko and Andrienko,
2003). Wise (1999) noted that text analysis should “…provide a basis for altered
visualization of the information for different users and purposes…why should we
preconceive that there is only one ‘correct’ visualization of text information in a
document corpus?” (p. 1230). For instance, text itself is one-dimensional, textual features
are multi-dimensional (Huang et al., 2005), and the relation between features and the text
they represent is often established using 2D-3D text overlay (Cunningham, 2002). CMC
text analysis systems can dramatically benefit from complementary presentation formats
including multi-dimensional and text overlay methods (Wise, 1999; Keim, 2002). These
two categories of visualization techniques are described below.
Multi-dimensional techniques used for text visualization include graphs and reduced
dimensionality views. Graphical formats such as radar charts, parallel coordinates, and
scatter plot matrices have been applied to topic, affect, and style information (Subasic and
Huettner, 2001; Huang et al., 2005). Reduced dimensionality visualizations decrease the
feature space to show essential patterns. These techniques are typically used in
conjunction with projection-based feature selection techniques to create two or three
dimensional views. Examples include Writeprints (Abbasi and Chen, 2006),
ThemeRiver© (Havre et al., 2002), and Themescapes™ (Wise, 1999). Text overlay
methods combine text with feature occurrence patterns to provide greater insight. The
Stereoscopic Document View in Topic Islands™ uses wavelet transformations to show
key topical patterns, superimposed onto the document text (Miller et al., 1998). Text
annotation highlights feature occurrences in text (Cunningham, 2002).
Multi-dimensional views are often used to visualize text feature statistics such as
frequency, variance, and similarity (Keim, 2002). While these views provide important
insight and summarization capabilities, they abstract away from the underlying non-numeric content they are intended to represent. Multi-dimensional techniques can tell us what features are important but not how or why. In contrast, text overlay techniques serve an important complementary function. They have greater explanatory potential, allowing
users to see exactly how and where features occur within their proper context. Hence it is
important to include multi-dimensional presentation formats that can summarize feature
statistics as well as text overlay illustrations that can bridge the gap between feature
statistics and their actual occurrences in text.
9.7 Testable Hypotheses
Testable hypotheses are intended to assess whether the meta-design satisfies the meta-requirements (Walls et al., 1992). For the proposed design framework, this entails
evaluating the meta-design’s ability to accurately represent information types associated
with the three meta-functions, as outlined in the meta-requirements. In text mining,
“representation” can imply data characterization or data discrimination (Han and Kamber,
2001). Data characterization is “a summarization of the general characteristics or features
of a target class of data,” (Han and Kamber, 2001; p. 21). From a data characterization
perspective, a good representation is able to derive important patterns, trends, or
phenomena of interest from text information (Tan, 1999). Data discrimination is
clustering or categorization of information types into meaningful classes (Tan, 1999;
Chen, 2001). With respect to data discrimination, a good representation is one capable of
accurately categorizing text into various information classes (Han and Kamber, 2001).
For the proposed CMC text analysis design framework, a suitable meta-design must
incorporate features, feature selection, and visualization techniques capable of effectively
characterizing and discriminating information types used to represent the meta-functions.
Prior CMC systems used application examples or case studies to illustrate their systems’
data characterization capabilities (e.g., Sack, 2000; Erickson and Kellogg, 2000; Zhu and
Chen, 2001; Smith and Fiore, 2001; Viegas and Smith, 2004). In contrast, the
effectiveness of data discrimination is generally assessed using rigorous text
categorization experiments for various information types (Pang et al., 2002; Zheng et al.,
2006; Argamon et al., 2007).
In the following section, we describe the CyberGate system developed as an
instantiation of our design framework. We use CyberGate to evaluate the effectiveness of
our meta-design. A brief application example is used to illustrate the system’s ability to
characterize information types associated with the meta-functions. Text categorization
experiments are also conducted to test the meta-design’s effectiveness for discriminating
information types used to represent the ideational, textual, and interpersonal meta-functions.
9.8 System Design: The CyberGate System
Based on our design framework, we developed the CyberGate system for text
analysis of CMC (Figure 9.1). The system was developed using a cyclical design process
involving several iterations of adding and testing system components (March and Smith,
1995; Simon, 1996). The testing phase included experiments for performance evaluation
and feedback from CMC researchers and analysts. CyberGate supports features for
representing several information types associated with all three meta-functions. It also
uses various feature selection and visualization techniques, including Writeprints and Ink
Blots. We first present an overview of the CyberGate system and then provide details
about the Writeprints and Ink Blots techniques.
Figure 9.1: CyberGate System Design
9.8.1 Information Types and Features
CyberGate supports several information types for representing the ideational, textual,
and interpersonal meta-functions. These include topics, opinions, affects, style, genres,
and interaction information. In order to capture such a breadth of information, several language and processing resources were incorporated (i.e., most of the features shown in Table 9.5). The language resources encompass sentiment and affect lexicons, word lists, and the WordNet thesaurus (Fellbaum, 1998). Embedded processing resources include n-grams, statistical features, parts-of-speech, noun phrases, and named entities (Koppel and
Schler, 2003; Zheng et al., 2006).
9.8.2 Feature Selection
CyberGate uses both ranking and projection based feature selection methods. For
feature ranking it uses information gain and decision tree models (Forman, 2003; Abbasi
and Chen, 2005). PCA and MDS projections are used for dimensionality reduction
(Abbasi and Chen, 2006; Huang et al., 2005). Figure 9.2 shows examples of the feature
selection techniques used in CyberGate. The table on the left (a) shows the complete set
of features while (b) shows the top two dimensions of the PCA projections (Ex and Ey)
and (c) shows the decision tree model rankings.
Figure 9.2: CyberGate Feature Selection Examples: (a) all features, (b) projection, (c) ranking
9.8.3 Visualization
CyberGate includes multi-dimensional and text overlay based visual representations.
Multi-dimensional visualizations include Writeprints, which shows usage variation, and
parallel coordinates, which shows feature occurrences (Figures 9.3a and 9.3b). Each
circle in Writeprints denotes a single message or text window projected using principal
component analysis. The blue polygonal lines in parallel coordinates also represent
messages or text windows. The selected Writeprints point corresponds to the selected
parallel coordinates’ polygonal line. The intersection between a polygonal line and a
vertical axis in parallel coordinates represents the occurrence frequency of that feature in
that particular message. For example, the selected message in Figure 9.3b has high
occurrence of feature #7 (occurs 21 times).
CyberGate also utilizes MDS plots (Figure 9.3d) to show overall feature similarities
and radar charts (Figure 9.3c) for comparing feature occurrence statistics. The radar chart
shown compares the selected author against another author and the mean normalized
usage frequencies for a set of features (which are numbered along the perimeter). The
MDS plot in Figure 9.3d shows features projected based on occurrence similarity for the
bag-of-words features. We can see one large cluster and two smaller ones, in addition to 3-4 features that are on their own. These features (e.g., “services”) do not frequently co-occur with any of the three clusters.
CyberGate’s text overlay techniques are shown in Figure 9.4. Text annotation
highlights key features in the text (Cunningham, 2002). Figure 9.4a shows an example
where the bag-of-words features are highlighted in blue while the selected feature
(“CounselEnron”) is highlighted in red. Ink Blots (Figure 9.4b) superimposes colored
circles (blots) onto text for key features as identified by the underlying feature ranking
method used. The size of the blot indicates the feature weight (based on the feature
ranking technique). Features unique to a particular author have higher weights than ones
that are equally common across authors. The color indicates the author’s usage of the
particular feature (red = high, blue = low, yellow = medium). The selected feature (again
“CounselEnron”) is highlighted with a black circle. This feature is represented with large
red blots, indicating that it has a high weight because it is unique to this author.
Figure 9.3: Multi-dimensional Text Views in CyberGate
a) Writeprints: N-dimensional PCA projections based on feature occurrences. Each circle denotes a single message; the selected message is highlighted in pink. Writeprints show feature usage/occurrence variation patterns; greater variation results in more sporadic patterns.
b) Parallel Coordinates: Parallel vertical lines represent features. Bolded numbers are feature numbers (0-15); smaller numbers above and below the feature lines denote the feature range. Blue polygonal lines represent messages. The selected message is highlighted in red; the selected feature is highlighted in pink (#2).
c) Radar Charts: The chart shows normalized feature usage frequencies. The blue line represents the author's average usage, the red line indicates mean usage across all authors, and the green line is another author being compared against. The numbers represent feature numbers; the selected feature is highlighted (#6).
d) MDS Plots: The MDS algorithm projects features into two-dimensional space based on occurrence similarity. Each circle denotes a feature; closer features have higher co-occurrence. Labels represent feature descriptions. The selected feature is highlighted in pink (the term "services").
Figure 9.4: Text Overlay Views in CyberGate
a) Text Annotation View: Feature occurrences are highlighted in blue. The selected bag-of-words feature is highlighted in red ("CounselEnron").
b) Ink Blots View: Colored circles (blots) are superimposed onto feature occurrence locations in the text. Blot size and color indicate feature importance and usage. The selected feature's blots are highlighted with black circles.
CyberGate also includes graph and tree visualizations for viewing interaction
information in CMC text. Author and thread social networks show the interaction
between author nodes represented using links (Figures 9.5a and 9.5b). Discussion trees
(Figure 9.5c) denote the interaction between subsequent message nodes within a thread.
In addition to deriving interaction information from message headers, CyberGate also
utilizes body text features (Fu et al., 2008). These features include the occurrence of user
names and keywords that serve as indicators of user interaction. The use of text-based
interaction features allows CyberGate to construct interaction patterns even when
structural features are unavailable or insufficient (e.g., chat rooms and web forums).
Figure 9.5: Interaction Views in CyberGate for Representing Interpersonal Information: (a, b) author and thread social networks, (c) discussion tree
9.8.4 Writeprints and Ink Blots
CyberGate includes the Writeprints and Ink Blots techniques, which are the core
components driving the system’s analysis functions. These techniques epitomize the
essence of the proposed design framework: representation of rich features using divergent
feature selection and visualization techniques. Writeprints and Ink Blots can incorporate
an array of features representing various information types. Both techniques also utilize
complementary feature selection and visualization methods. Writeprints uses principal
component analysis (PCA) with a sliding window algorithm to create lower dimensional
plots that accentuate feature usage variation. Ink Blots uses decision tree models (DTM)
to select features which are superimposed onto text to show them as they occur.
Writeprints is better suited to presenting a broad overview across large numbers of
features. Ink Blots is intended to show detailed examples of feature occurrences. Both
techniques can be used for text characterization and discrimination (i.e., analysis and
categorization). Specific details about the two methods are presented below.
9.8.4.1 Writeprints
The steps for the Writeprints technique are listed below:
Writeprints Technique Steps
1) Derive n primary eigenvectors (those with the largest eigenvalues) from the feature usage matrix, where n is determined by the stopping rule or the end user.
2) Extract feature vectors for the sliding window instance.
3) Compute window instance coordinates by multiplying the window feature vectors with the n eigenvectors.
4) Plot window instance points in n-dimensional space.
5) Repeat steps 2-4 for each window.
A sliding window of length L with a jump interval of J characters is run over the
messages. The feature occurrence vector for each window is projected to an n
dimensional space by taking the product of the window feature vector and the n primary
eigenvectors, where n is determined using a stopping rule. Writeprints uses the Kaiser-Guttman stopping rule, under which all eigenvectors with an eigenvalue greater than one are selected (Jackson, 1993), or a user-defined number of eigenvectors. Figure 9.6 illustrates the key steps in the Writeprints process for a sample two-dimensional projection. The product of the window feature vector and the first eigenvector is used to get the x-axis coordinate (ε1), while the product of the feature vector and the second eigenvector produces the y-axis coordinate (ε2). Writeprints is geared towards showing occurrence
variation patterns. These patterns can be used for text categorization of stylistic
information or analysis of information types serving the ideational and textual meta-functions.
Figure 9.6: Writeprints Process Illustration on Two Dimensions
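A minimal Python sketch of these steps is given below. Simple letter-frequency features stand in for the full CyberGate feature set, NumPy's eigendecomposition of the window covariance matrix stands in for the PCA step, and the window length and jump interval are arbitrary illustrative values; the sketch follows the listed steps under these assumptions rather than reproducing the original implementation.

import string
import numpy as np

def window_features(text, L=200, J=50):
    """Letter-frequency vector for each sliding window of length L with jump interval J."""
    letters = string.ascii_lowercase
    vectors = []
    for start in range(0, max(len(text) - L, 0) + 1, J):
        window = text[start:start + L].lower()
        vectors.append([window.count(c) for c in letters])
    return np.array(vectors, dtype=float)

def writeprint(text, L=200, J=50, n=None):
    F = window_features(text, L, J)
    F = F - F.mean(axis=0)                     # center the feature usage matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(F, rowvar=False))
    order = np.argsort(eigvals)[::-1]          # sort eigenvectors by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if n is None:                              # Kaiser-Guttman rule: keep eigenvalues > 1
        n = max(int(np.sum(eigvals > 1.0)), 2)
    return F @ eigvecs[:, :n]                  # window coordinates in n-dimensional space

pattern = writeprint("some longer body of message text goes here " * 50)
print(pattern.shape)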
9.8.4.2 Ink Blots
The steps for the Ink Blots technique are listed below:
Ink Blots Technique Steps
1) Separate input text into two classes (one for class of interest, one class containing
all remaining texts).
2) Extract feature vectors for messages.
3) Input vectors into DTM as binary class problem.
4) For each feature in computed decision tree, determine blot size and color based on
DTM weight and feature usage.
5) Overlay feature blots onto their respective occurrences in text.
6) Repeat steps 1-5 for each class.
The Ink Blots process is shown in Figure 9.7. The Ink Blots technique identifies the
most important features for a given class using a binary class decision tree model (DTM).
DTM efficiently considers feature interactions, unlike methods such as information gain
and log likelihood (Forman, 2003). A class can refer to an author, opinion, emotion, topic
etc. The class of interest is input into the DTM along with a second class containing text
from all other classes. The DTM determines the key features that differentiate the class of
interest from other classes, weighted by their level of entropy reduction. For each
selected feature, the weight assigned by the DTM determines the attribute's
blot size (higher weight = larger blot size). Blot colors are determined based on feature
usage. Red is assigned to features for which the class has the highest usage, while blue is
for features never occurring in the class’ text. All other features are assigned yellow. Let
us assume we have 10 topics of interest for which we would like to identify the key blot
features. For each topic, a DTM is generated comparing that topic against all others (to
determine the topic’s key features). These features are assigned weights and colors based
on their DTM rankings and occurrence frequencies, respectively. The process is repeated
for each topic. Finally, text overlay is performed by superimposing a topic’s blot features
at every location where the features occur. Once each class’ key features have been
extracted and assigned weights and colors, they can be used for categorization and
analysis. For categorization, superimposing a class’ blots onto an unclassified text can
provide insight into whether the text belongs to that particular class. Correct class-text
matches should result in patterns featuring high levels of red and yellow (features that
occur frequently in this class’ texts) and a minimal amount of blue (features rarely or
never occurring in this class’ texts). Ink Blots can also be used to analyze how key class
features occur and interact within a certain piece of text.
Figure 9.7: Ink Blots Process Illustration
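The following Python sketch illustrates the selection and coloring logic for a single class in a one-vs-rest fashion, with scikit-learn's DecisionTreeClassifier standing in for the DTM and bag-of-words counts standing in for the full feature set; the overlay step (drawing the blots over feature occurrence locations in the text) is omitted. It is an illustrative approximation rather than the original implementation.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

def ink_blot_features(docs, labels, class_of_interest):
    vec = CountVectorizer()
    X = vec.fit_transform(docs).toarray()
    y = np.array([1 if lbl == class_of_interest else 0 for lbl in labels])

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)   # binary-class DTM
    terms = vec.get_feature_names_out()
    class_usage = X[y == 1].sum(axis=0)
    other_usage = X[y == 0].sum(axis=0)

    blots = {}
    for i, weight in enumerate(tree.feature_importances_):
        if weight == 0:
            continue                     # feature not used by the decision tree
        if class_usage[i] > other_usage[i]:
            color = "red"                # class of interest has the highest usage
        elif class_usage[i] == 0:
            color = "blue"               # feature never occurs in this class's text
        else:
            color = "yellow"
        blots[terms[i]] = {"size": float(weight), "color": color}
    return blots

# Hypothetical example: key blot features for a "legal" class versus all other messages
print(ink_blot_features(
    ["energy deal approved", "energy prices rose",
     "litigation counsel advised", "lawsuit litigation pending"],
    ["energy", "energy", "legal", "legal"],
    "legal"))

For categorization, the blots computed for each candidate class could then be overlaid on an unclassified message and the classes compared by their red-to-blue area ratios, as described above.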
9.9 A CMC Text Analysis Example Using CyberGate: The Enron Case
We present an application example from the Enron email corpus to illustrate how
CyberGate can be used for data characterization of CMC text. The example utilizes
Writeprints and Ink Blots as well as additional CyberGate views such as parallel
coordinates and MDS plots. The example relates to two Enron employees, neither of
whom was directly involved in the scandal. Author A worked in the sales division while
Author B was in the company’s legal department. Figure 9.8 shows a temporal view of
the two authors’ Writeprints taken across all features (lexical, syntactic, structural,
semantic, n-grams, etc.). Each circle denotes a text window that is colored according to
the point in time at which it occurred. The bright green points represent text windows
from emails written after the scandal had broken out while the red points represent text
windows from emails written before the scandal. Looking at the two patterns, we can see
that Author B has greater overall feature variation as well as a distinct difference in the
spatial location of points prior to the scandal (located more towards the right) as opposed
to afterwards (drifting towards the left). In contrast, Author A has no such difference, with
his newer (green) text points placed directly on top of his older (red) ones. This suggests
that Author B underwent a profound change with respect to the text in his emails, while there do not appear to be any major changes for Author A. In order to further investigate this,
we sampled points from the green and red regions for both authors and analyzed them
using Ink Blots and parallel coordinates.
Figure 9.8: Writeprints for Two Enron Employees (Author A and Author B)
Figure 9.9 shows the Ink Blots and parallel coordinate views for sample points taken
from Author A for text windows prior to and following the scandal. Ink Blots shows the
author’s key features superimposed onto the text. The usage of these features before as
compared to after the scandal seems similar. The parallel coordinates shows the author’s
32 most important bag-of-words, including sales and negotiation related terms. These
features signify the major topical content of the author’s text. Again, the before and after
coordinate patterns seem fairly similar, suggesting little text content deviation attributable
to the scandal.
Figure 9.9: Author A Ink Blots and Parallel Coordinates (before-scandal vs. after-scandal text)
Figure 9.10 shows the Ink Blots and parallel coordinate views for sample text
windows taken from Author B before and after the scandal. The Ink Blots view for the
after-scandal text has considerably greater occurrence of key blot features. While the
emails before the scandal focus on legal aspects of business deals with terms such as
“counterparties” and “negotiations,” the discourse after the scandal mostly revolves
around Author B providing advice and legal counsel to fellow employees. The post-scandal emails are more formal, containing greater usage of email signatures with the author's job title and contact information. The bag-of-words features for these signature
terms (e.g., title, address, phone number) correspond to the first 12 features shown in the
parallel coordinates view. The terms relating to business legalities correspond to the latter
features (e.g., 15-30) in the parallel coordinates view. The parallel coordinates view
exemplifies the stark contrast in Author B’s emails as a result of the scandal. Clearly this
dramatic alteration is attributable to a change in Author B’s job functions.
Figure 9.10: Author B Ink Blots and Parallel Coordinates (before-scandal vs. after-scandal text)
Yates and Orlikowski (2002) stated that “the purpose of a genre is not an individual’s
private motive for communicating, but purpose socially constructed and recognized by
the relevant organizational community…” (p. 15). Important characteristics of a genre
include structural and linguistic features. For Author B, the post scandal emails signify a
shift in genres. The author’s job function changes from working on business contracts to
providing advice and counseling to fellow employees. Figure 9.11a shows the key bag-of-words terms clustered based on occurrence similarity using MDS plots. The large
cluster represents the business legality related terms (features 15-30 in parallel
coordinates above) while the two smaller clusters near the bottom contain the author’s
contact information and job title related terms, respectively. Author B shifts from usage
of terms in the large cluster to the smaller ones. Similarly, the number of employees
interacting with Author B increases considerably after the scandal as the author advises
fellow co-workers (Figure 9.11b and 9.11c).
Figure 9.11: Author B Bag-of-Words Clusters and Social Networks: (a) bag-of-words MDS clusters, (b) before-scandal social network, (c) after-scandal social network
This example illustrates how CyberGate and the proposed underlying framework’s
meta-design can be used for data characterization-based representation of the meta-functions outlined in the meta-requirements. The example utilized a rich set of features:
lexical, syntactic, structural, semantic, and various lexicons. A variety of ranking and
projection based feature selection methods were incorporated (e.g., DTM, PCA, MDS).
Multi-dimensional, text overlay, and interaction visualization techniques were also
employed (Writeprints, MDS Plots, parallel coordinates, Ink Blots, social network
graphs). The meta-design was used to represent information types (e.g., topics, genres,
style, and interaction) associated with the ideational, textual, and interpersonal meta-functions.
9.10 Experimental Evaluation: Text Categorization using CyberGate
Text categorization experiments were conducted using CyberGate. The experiments
were intended to test the meta-design’s effectiveness for discriminating information types
used to represent the ideational, textual, and interpersonal meta-functions. Experiments
evaluating the representation of the ideational and textual meta-function assessed
CyberGate’s features and selection/visualization techniques against comparison features
and techniques. Evaluation of information types representing the interpersonal meta-function compared CyberGate's features against those used in prior systems.
The representation of the ideational meta-function was evaluated by categorizing topics
and opinions. For the textual meta-function, style and genres were tested. In these
experiments, the Writeprints or Ink Blots technique was compared against Support Vector
Machine (SVM). As previously alluded to, Writeprints and Ink Blots both support text
categorization. Writeprints is effective at capturing occurrence variation which can be
useful for categorizing style. Ink Blots is geared towards occurrence frequency which can
be beneficial for genre, topic, and opinion categorization. SVM was incorporated since it
has proven to be a powerful machine learning algorithm for categorization of various information
types including topics (Dumais et al., 1998), style (Zheng et al., 2006), and opinions
(Pang et al., 2002). SVM was run using a linear kernel. In all experiments a subset of the
CyberGate feature set was used based on the information type being evaluated. These
feature subsets were composed of attributes commonly used for categorization of their
respective information types. The same set of features was used for SVM and the
CyberGate technique being evaluated in an experiment. In addition, a baseline
configuration was included in all experiments, comprised of SVM run with bag-of-words
(BOW) features (referred to as “Baseline” from here on). BOWs have been used as the
sole feature representation in virtually all text systems evaluated in prior research
(Mladenic, 1999; Tan, 1999). BOWs are a fairly generalizable processing resource
previously used for categorization of topics, style, opinions, and genres (Dumais et al.,
1998; Pang et al., 2002). However, we do not believe BOW features are sufficient to
effectively capture the various information types inherent in CMC text. Thus, while the
SVM versus Ink Blots/Writeprints comparison was intended to demonstrate the efficacy
of these techniques, the comparison with the baseline was intended to illustrate the
effectiveness of the CyberGate features over those included in standard text systems.
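For reference, the baseline configuration corresponds roughly to the following scikit-learn pipeline: a linear-kernel SVM trained on bag-of-words counts and evaluated with 10-fold cross validation. The documents and labels below are placeholders; the actual Enron, epinions, and forum test beds are described under the individual experiments.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = ["energy contract approved today", "legal counsel advised caution"] * 50   # placeholders
labels = [0, 1] * 50                                                              # placeholder classes

baseline = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
accuracies = cross_val_score(baseline, docs, labels, cv=10)
print(accuracies.mean())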
For representation of the interpersonal meta-function, the ability to accurately
construct CMC interaction patterns was evaluated. As previously alluded to, while prior
CMC systems rely solely on structural features derived from message headers (Sack,
2000), CyberGate uses structural and body text features for constructing interaction
patterns in CMC. This is beneficial in CMC modes where headers are unavailable and
interaction patterns are less obvious. We evaluated the effectiveness of CyberGate's features
against a baseline set comprised of only structural features. The experiments entailed
evaluating the feature sets’ ability to correctly assign user interactions (i.e., which
message or user a given message is responding to).
9.10.1 Research Hypotheses
Table 9.7 presents hypotheses regarding CyberGate’s ability to categorize information
types representing the ideational, textual, and interpersonal meta-functions. For all
experiments, pairwise t-tests were used to evaluate the hypotheses. In order to enable
easier summarization of the results, the p-values from the five ensuing experiments are
included here.
Table 9.7: Hypotheses Testing Results for Text Categorization Experiments

Representation of the Ideational Meta-function (p-values: Setting 1 / Setting 2)
  H1a: Techniques using CyberGate's features will outperform the baseline features for the categorization of topics. (< 0.001* / < 0.001*)
  H1b: CyberGate techniques will outperform SVM for the categorization of topics. (< 0.001+ / < 0.001+)
  H2a: Techniques using CyberGate's features will outperform the baseline features for the categorization of opinions. (< 0.001* / < 0.001*)
  H2b: CyberGate techniques will outperform SVM for the categorization of opinions. (0.086 / 0.062)

Representation of the Textual Meta-function (p-values: Setting 1 / Setting 2)
  H3a: Techniques using CyberGate's features will outperform the baseline features for the categorization of style. (< 0.001* / < 0.001*)
  H3b: CyberGate techniques will outperform SVM for the categorization of style. (< 0.001* / < 0.001*)
  H4a: Techniques using CyberGate's features will outperform the baseline features for the categorization of genres. (< 0.001* / < 0.001*)
  H4b: CyberGate techniques will outperform SVM for the categorization of genres. (0.127 / 0.103)

Representation of the Interpersonal Meta-function (p-values: Test Bed 1 / Test Bed 2)
  H5: CyberGate's features will outperform the baseline features for categorization of interaction patterns. (< 0.001* / < 0.001*)

* P-value significant at alpha = 0.01
+ Result contradicts hypothesis
9.10.2 Information Types Representing the Ideational Meta-function
Experiment 1: Topic Categorization
We extracted emails pertaining to ten topics from the Enron email corpus. The
messages were extracted and tagged by an independent coder who read each message
before deciding upon a topic tag. Only messages that were tagged with a single topic
were included. Example topics include energy, shares, and litigation. For each topic, 100
email messages were used, resulting in a test bed of 1000 emails. In order to gauge the
effectiveness of the coding, a second coder tagged 100 messages from the test bed. The
kappa statistic was computed between the two coders, with a value of 0.83 (which is
considered reliable). Our feature set consisted of bag-of-words and noun phrases. Both
feature representations have been effectively used for topic categorization (Dumais et al.,
1998; Chen et al., 2003). A minimum frequency threshold of three was used to determine
the number of bag-of-words and noun phrases to include (Joachims, 1998).
Two experimental settings were run, one using 5 topics and the other one using all 10
topics. The experiments featured Ink Blots in comparison with SVM and the baseline
(SVM with only BOWs). All techniques were run using 10-fold cross validation. For Ink
Blots, this meant that the DTM and occurrence analysis used to assign each topic class its
blot sizes and colors was run on 90% of the data each fold, while the other 10% was used
for evaluation. Each anonymous message was assigned to the class with the highest ratio of red to blue blot area.
Table 9.8: Topic Categorization Results (accuracy)

Topic Setting    SVM      Ink Blots   Baseline
5 Topics         95.70    92.25       88.75
10 Topics        93.25    90.10       86.55
Table 9.8 shows the topic categorization results. Both techniques using the richer
feature representation achieved over 90% accuracy, significantly outperforming the
baseline (p-values < 0.001). However, SVM significantly outperformed the Ink Blot
technique for the 5 and 10 topic experiment settings (p-values < 0.001). Error analysis on
Ink Blots’ misclassified messages revealed that the higher performance of SVM was
likely attributable to its ability to better classify the small percentage of messages that
were in the gray area between topics (e.g., messages primarily talking about energy, but
also mentioning litigation). A second coder tagged these 60 misclassified messages, with
a kappa statistic of only 0.65 with the original coding. The considerably lower inter-coder
reliability of these messages as compared to the 0.83 overall kappa value supports our
conclusion.
Experiment 2: Opinion Classification
The objective of the opinion classification experiment was to test the effectiveness of
the CyberGate features and techniques for capturing sentiment polarities. The test bed
consisted of 2,000 digital camera reviews from www.epinions.com. The 2,000 reviews
were composed of 1,000 positive (4-5 star) and 1,000 negative (1-2 star), with 500
reviews for each star level (i.e., 1, 2, 4, and 5). Two problem scenarios were tested: (1)
classifying 1 star versus 5 star reviews (extreme polarity) and (2) classifying 1+2 star
versus 4+5 star reviews (milder polarity). The feature set encompassed a lexicon of 3,000
positively or negatively oriented adjectives (Turney and Littman, 2003) and word n-grams
(Pang et al., 2002). Once again SVM, Ink Blots, and the baseline were run using 10-fold
cross validation for each experiment setting (mild and extreme polarity).
The experimental results are presented in Table 9.9. SVM marginally outperformed
Ink Blots; however, the enhanced performance was not statistically significant (p-values on pairwise t-tests > 0.05). SVM and Ink Blots both significantly outperformed the baseline (p-values on pairwise t-tests < 0.001) by a margin of over 10%, highlighting the importance of representational richness for opinion categorization. The overall
accuracies for both SVM and Ink Blots were consistent with previous work which has
been in the 85%-90% range (e.g., Pang et al., 2002). Once again the improved
performance of SVM was attributable to its ability to better detect messages containing
sentiments with less polarity. In many cases it was more difficult for the Ink Blot
technique to detect the overall orientation of these messages. This is evidenced by the fact
that the Ink Blot technique’s accuracy dropped more when switching from extreme to
mild polarity as compared to SVM.
Table 9.9: Opinion Classification Results

Sentiment Setting    SVM      Ink Blots   Baseline
Extreme Polarity     93.00    92.20       83.00
Mild Polarity        89.40    86.80       77.10
9.10.3 Information Types Representing the Textual Meta-function
Experiment 3: Style Classification
We conducted authorship classification experiments to test the effectiveness of our
features and techniques for capturing style. The objective of the experiments was to
correctly categorize individuals based on their writing style. Our test bed consisted of
authors from the Enron email corpus. The experiments involved an entity resolution
classification task in which half of the messages were used for training (treated as the known
entity) and half for testing (considered an anonymous entity). The objective in such a task
is to match anonymous entities to the correct known entities based on stylistic tendencies.
The experiments were run using 25 and 50 authors. Thus, in the 25 author setting, we had
25 “known entities” and 25 “anonymous entities,” each set constructed using one half of
the messages. The feature set consisted of lexical, syntactic, structural, and semantic
features. Lexical features included word and character level measures (e.g., words per
sentence, characters per word, etc.). The syntactic features used were function words,
punctuation marks, and POS and word n-grams. The structural features encompassed the
use of greetings, quoted content, hyperlinks etc. Semantic features used included noun
phrases and named entities. These feature categories are described in greater detail in
Table 9.5. The effectiveness of these features for capturing style has previously been
demonstrated (Zheng et al., 2006). The Writeprints technique was used in comparison
with SVM. Writeprints’ ability to capture feature variation patterns is conducive to
stylistic classification. For Writeprints, the sliding window was run over each entity
creating an n-dimensional pattern. The anonymous patterns were each compared against
the known patterns, with the anonymous entity being assigned to the known entity with
the most similar pattern. Similarity was determined based on the average n-dimensional
Euclidean distance between the two patterns’ points. For SVM, 100 text tiles were created
for each known and anonymous entity. The feature vectors for these 100 tiles were used
for the training (known entity) and testing data (anonymous entity). The anonymous
entities were classified as the known entity assigned the highest number of tiles by SVM
during the testing phase. The experimental results are shown in Table 9.10.
Table 9.10: Style Classification Results

Author Setting   SVM      Writeprints   Baseline
25 Authors       84.00    92.00         62.00
50 Authors       80.00    90.00         51.00
Writeprints outperformed SVM by 8%-10% for both experimental settings. The
enhanced performance was statistically significant for 25 and 50 authors. Furthermore,
the Writeprints accuracies are an improvement over prior research (Zheng et al., 2006).
Writeprints and SVM both also significantly outperformed the baseline by 20%-30%.
This is attributable to CyberGate’s use of features that can effectively capture stylistic
information usage and variation.
Experiment 4: Genre Classification
For genre classification, a test bed of 3000 forum postings from the Sun Technology
Forum (forum.java.sun.com) was used. Categorization of genres in such forums can be
useful for studying knowledge transfer patterns in electronic networks of practice (Wasko
and Faraj, 2005). The genres categorized included questions, informative messages, and
general messages (uninformative comments), with 1000 messages used for each genre.
Two experimental settings were run: (1) questions (1000 messages) versus non-questions
(500 informative, 500 comments) and (2) all three genres (1000 messages each). The
feature set consisted of lexical, syntactic, structural, semantic, and n-gram features. The
Ink Blot technique was compared against SVM and the BOW baseline. Each technique
was run using 10-fold cross validation (same settings as the topic and opinion
categorization experiments).
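As a sketch of the evaluation protocol referenced here, the fragment below runs 10-fold
cross-validation for a linear SVM over bag-of-words features using scikit-learn; the
in-memory posting list, label names, and library choice are illustrative assumptions
rather than the actual experimental code.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def genre_cv_accuracy(postings, labels, folds=10):
    """postings: list of message strings; labels: e.g. 'question',
    'informative', or 'comment' (hypothetical in-memory test bed)."""
    # Vocabulary is fit on the full corpus for brevity; a Pipeline would avoid leakage.
    X = CountVectorizer(lowercase=True).fit_transform(postings)  # BOW features
    scores = cross_val_score(LinearSVC(), X, labels, cv=folds, scoring="accuracy")
    return scores.mean()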
The experimental results are presented in Table 9.11. Both SVM and Ink Blots
significantly outperformed the baseline (p-values < 0.001). Ink Blots outperformed SVM;
however the margin was not statistically significant (p-values > 0.05). The overall
accuracies for both SVM and Ink Blots were consistent with prior results dealing with 2-3
genres (Santini, 2004), validating the efficacy of the underlying features and techniques
for genre categorization.
Table 9.11: Genre Classification Results (% accuracy)

                                       Techniques
Genre Setting                      SVM       Ink Blots    Baseline
Questions vs. Non-questions        98.10     98.55        90.10
All Three Genres                   96.40     96.50        86.00
9.10.4 Information Types Representing the Interpersonal Meta-function
Experiment 5: Interaction Classification
For interaction classification, we used two test beds: four conversation threads taken
from the Sun Java Technology forum (1200 messages posted by 120 users) and three
threads taken from the LNSG social discussion forum (400 messages posted by 100
users). Two independent coders tagged the test beds for message interactions. The coders
carefully read each message to determine which prior posting (if any) it referenced or
responded to. The inter-coder reliability had kappa statistics of 0.88 and 0.81 for the two
test beds, respectively.
The CyberGate feature set consisted of structural features (taken from the message
headers) as well as function words, bag-of-words, noun phrases, and named entities
derived from body text. These features are intended to represent various interaction cues,
including direct address (reference to user names) and lexical relation (reference to
keywords from prior postings). The baseline feature set consisted of only structural
features, as used in prior systems (Donath et al., 1999; Smith and Fiore, 2001).
CyberGate assigns direct address relations when a user name is referenced (in which case
a relation is assigned between the message author and the referenced user). Lexical
relations are assigned using a modified vector space model (Fu et al., 2008). The
structural interaction cues are matched by comparing message titles and quoted content
with the title and content of prior postings. Consistent with prior research, the F-measure
was used to evaluate the effectiveness of each feature set (Fu et al., 2008).
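The cue-matching step can be sketched as follows. This is a simplified illustration of
the direct address and structural cues only (the modified vector space model for lexical
relations is omitted), and the message fields are a hypothetical schema rather than the
CyberGate data structures; precision, recall, and the F-measure can then be computed by
comparing the returned pairs against the coder-tagged interactions.

import re

def _norm_title(title):
    # Strip leading "Re:" prefixes and normalize case for title matching.
    return re.sub(r"^(re:\s*)+", "", title.strip().lower())

def assign_interactions(messages):
    """messages: chronologically ordered dicts with 'id', 'author', 'title',
    'text', and 'quoted' fields (hypothetical schema). Returns a list of
    (msg_id, earlier_msg_id) pairs for detected interactions."""
    relations = []
    for i, msg in enumerate(messages):
        for prior in messages[:i]:
            # Direct address cue: the earlier author's user name is referenced.
            if re.search(r"\b%s\b" % re.escape(prior["author"]), msg["text"], re.I):
                relations.append((msg["id"], prior["id"]))
            # Structural cues: matching reply title, or quoted content drawn
            # from the earlier posting.
            elif _norm_title(msg["title"]) == _norm_title(prior["title"]) \
                    or (msg.get("quoted") and msg["quoted"] in prior["text"]):
                relations.append((msg["id"], prior["id"]))
    return relations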
Table 9.12: Interaction Classification Results (F-measure)

                            Features
Test Bed                CyberGate    Baseline
Sun Java Forum          86.00        77.40
LNSG Forum              77.11        55.55
The experimental results are presented in Table 9.12. CyberGate’s extended feature
set significantly outperformed the baseline (p-values < 0.001). The performance
difference was more pronounced on the LNSG forum. Users in this forum make less use
of structural features when interacting with one another, instead preferring to rely on
text-based interaction cues. The results illustrate the importance of using richer features for
representing CMC interactions.
9.10.5 Results Discussion
CyberGate’s feature set was better at representing information types associated with
the three meta-functions as compared to baseline feature sets commonly used in prior
systems. The extended feature set effectively represented topic, opinion, style, genre, and
interaction information. It significantly outperformed the BOW and structural feature
baselines. The CyberGate techniques also performed well, with accuracies generally over
90%. SVM appeared to perform better on information types supporting the ideational
meta-function. Writeprints and Ink Blots outperformed SVM in experiments on
information types representing the textual meta-function. For instance, SVM had
significantly higher accuracy for topic classification, while Writeprints and Ink Blots
performed better on style and genre classification.
The objective of the experiments was to test the meta-design’s effectiveness for
discriminating information types associated with the ideational, textual, and interpersonal
meta-functions. CyberGate’s extended feature set enhanced representation of the three
meta-functions. The Writeprints and Ink Blot techniques were also successful in
discriminating information types related to the ideational and textual meta-functions,
though more so for the textual meta-function. The results suggest that an extended feature
set (using language and processing resources) and complementary feature selection and
visualization techniques can enhance the discrimination and representation of
information types reflective of the three meta-functions.
In the previous section, an application example was used to illustrate the meta-design's
ability to characterize information types associated with the meta-functions. This
section presented text categorization experiments to assess the data discrimination
capabilities of features, feature selection, and visualization techniques based on the
meta-design. Collectively, the results support the claim that the meta-design satisfies
the meta-requirements. Systems using the proposed design framework may foster better analysis of
CMC text by representing the ideational, textual, and interpersonal meta-functions (Sack,
2000). CMC systems supporting only a subset of the language meta-functions are likely
to lose the deeper understanding that arises from the synergy created by representing the
three meta-functions in unison.
9.11 Conclusions
Our major research contributions are two-fold. Firstly, using Walls et al.’s (1992)
model, we developed a design framework for systems supporting CMC text analysis. The
framework advocates the development of systems that support all three language meta-
functions described by Systemic Functional Linguistic Theory. The framework also
provides guidelines for the choice of appropriate features, feature selection, and
visualization techniques necessary to effectively represent the ideational, textual, and
interpersonal meta-functions. Secondly, we developed the CyberGate system based on
our design framework. CyberGate includes an array of language and processing resources
capable of representing various information types. It also incorporates a variety of ranking-
and projection-based feature selection methods and various complementary visualization
formats, including the Writeprints and Ink Blots techniques. Using CyberGate’s features
and techniques, text categorization experiments and an application example were used to
validate the meta-design elements of the proposed design framework.
Our design framework and the resulting CyberGate system are not without their
shortcomings. There are likely to be additional information types, and feature selection
and visualization techniques that could have been considered but were omitted.
Nevertheless, we believe that the proposed framework has important implications for
practitioners, including CMC system users and developers. Our intention and hope is that
future research will improve upon our design framework, resulting in CMC systems with
enhanced text analysis capabilities.
CHAPTER 10: CONCLUSION
10.1 Contributions
In Chapter 2 we proposed the use of stylometric analysis techniques to help identify
individuals based on writing style. We incorporated a rich set of stylistic features
including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We
also developed the Writeprint technique for identification and similarity detection of
anonymous identities. The Writeprint technique and extended feature set were evaluated
on a test bed encompassing four online data sets spanning different domains: email,
instant messaging, feedback comments, and program code. Writeprints outperformed
benchmark techniques including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve
transforms on the identification and similarity detection tasks. The extended
feature set also significantly outperformed a baseline set of features commonly used in
previous research.
In Chapter 3 we evaluated the use of Writeprints and comparison stylometric methods
to help identify online traders based on the writing style traces inherent in their posted
feedback comments. Experiments conducted to assess the scalability (number of traders)
and robustness (against intentional obfuscation) of Writeprints were promising. The
results indicated that the proposed method may help mitigate the effects of easy
identity changes and reputation manipulation in electronic markets.
In Chapter 4 we evaluated the effectiveness of various features and kernels for
detecting fake escrow and spoof websites. Our analysis included a rich set of features
extracted from web page text, image, and link information. We also proposed a composite
kernel specifically tailored to represent the properties of fake websites, including content
duplication and structural attributes. Experiments were conducted to assess the proposed
extended feature set and composite kernel on two test beds, each comprised of
approximately 100,000 web pages taken from hundreds of fake escrow and spoof
websites. The combination of the extended feature set and the composite kernel enabled
us to attain high performance levels. The results suggested that automated web-based
systems utilizing rich feature sets and customized kernel functions may be highly
effective for detecting fake websites.
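While the exact kernel is specified in Chapter 4, the general form of a composite kernel
can be sketched as a weighted combination of per-view Gram matrices passed to an SVM with
a precomputed kernel; the two feature views and the weights below are illustrative
assumptions, not the kernel actually used.

import numpy as np
from sklearn.svm import SVC

def linear_gram(X):
    """Linear-kernel Gram matrix for a single feature view."""
    return X @ X.T

def composite_gram(views, weights):
    """Weighted sum of per-view Gram matrices (a simple composite kernel)."""
    return sum(w * linear_gram(X) for X, w in zip(views, weights))

def train_composite_svm(X_text, X_struct, y, weights=(0.7, 0.3)):
    """X_text, X_struct: (n_pages x d_i) arrays for two hypothetical feature
    views; y: 1 = fake, 0 = legitimate. Prediction on new pages would require
    the corresponding test-versus-train composite kernel matrix."""
    K_train = composite_gram([X_text, X_struct], weights)
    return SVC(kernel="precomputed").fit(K_train, y)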
Leveraging our work from the previous chapter, in Chapter 5 we proposed a fake
website detection system. The AZProtect system used a Support Vector Machine (SVM)
classifier coupled with a rich set of features derived from website text, linkage, and image
content. In experiments conducted on hundreds of generated fraud and spoof sites,
AZProtect consistently outperformed seven comparison tools. Combining the proposed
classifier with a lookup mechanism allowed the creation of a dynamic hybrid system
which further enhanced performance.
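The hybrid arrangement can be sketched as a simple cascade, assuming a hypothetical
blacklist/whitelist lookup and a fitted probabilistic classifier; this illustrates the
idea rather than the AZProtect logic itself.

def classify_site(url, features, blacklist, whitelist, clf, threshold=0.5):
    """Hypothetical hybrid check: list lookup first, learned classifier as
    fallback. clf is any fitted classifier exposing predict_proba, with
    class 1 assumed to denote 'fake'."""
    if url in blacklist:
        return "fake"
    if url in whitelist:
        return "legitimate"
    fake_prob = clf.predict_proba([features])[0][1]  # probability of the fake class
    return "fake" if fake_prob >= threshold else "legitimate"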
In Chapter 6 we used sentiment analysis methodologies for classification of web
forum and online review opinions. The utility of stylistic and syntactic features was
evaluated. Furthermore, the Entropy Weighted Genetic Algorithm (EWGA) was
developed, which is a hybridized genetic algorithm that incorporates the information gain
heuristic for feature selection. EWGA was designed to improve performance and provide a
better assessment of the key features. The experimental results using EWGA indicated
high performance levels on movie review and web forum test beds, indicating the utility
of these features and techniques for document level classification of sentiments.
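In the spirit of (though not identical to) EWGA, the sketch below shows how
information-gain-style scores can bias a genetic algorithm's initialization and mutation
for feature selection, with a cross-validated SVM as the fitness function; the population
size, rates, and the use of mutual information as a stand-in for information gain are all
illustrative assumptions.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def ig_biased_ga(X, y, pop_size=20, generations=10, seed=0):
    """Toy feature-selection GA biased by information-gain-style scores.
    X: dense (n x d) feature matrix; y: class labels."""
    rng = np.random.default_rng(seed)
    n_feats = X.shape[1]
    ig = mutual_info_classif(X, y)              # stand-in for information gain
    bias = ig / (ig.max() + 1e-9)               # 0..1 score per feature

    def fitness(mask):
        if not mask.any():
            return 0.0
        return cross_val_score(LinearSVC(), X[:, mask], y, cv=5).mean()

    pop = rng.random((pop_size, n_feats)) < 0.5 * bias        # IG-biased initialization
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(scores)[-(pop_size // 2):]]  # keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, n_feats))               # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flips = rng.random(n_feats) < 0.01 * (1 + bias)   # IG-aware mutation rate
            children.append(np.where(flips, ~child, child))
        pop = np.vstack([parents, np.array(children)])
    return pop[int(np.argmax([fitness(ind) for ind in pop]))]  # boolean feature mask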
In Chapter 7 we proposed the use of an extended set of n-gram features in
conjunction with a rule based multivariate feature selection method for enhanced opinion
classification of online reviews. The feature set encompassed an array of fixed and
variable n-gram categories, including syntactic and semantic n-gram features. The
proposed feature selection method leveraged domain knowledge to efficiently remove
irrelevant and redundant attributes from large n-gram spaces. Experimental results on
three online review test beds revealed that the extended feature set and feature selection
method significantly outperformed existing features and attribute selection techniques in
their ability to classify sentiment polarity and intensity.
In Chapter 8 we compared several feature representations for affect analysis,
including learned n-grams and various automatically and manually crafted affect
lexicons. We also proposed the support vector regression correlation ensemble (SVRCE)
method for enhanced classification of affect intensities. SVRCE uses an ensemble of
classifiers each trained using a feature subset tailored towards classifying a single affect
class. The ensemble was combined with affect correlation information to enable better
prediction of emotive intensities. Experiments were conducted on four test beds
encompassing web forums, blogs, and online stories. The results revealed that learned
n-grams were more effective than lexicon-based affect representations. The findings also
indicated that SVRCE outperformed comparison techniques, including Pace regression,
semantic orientation, and WordNet models. Ablation testing showed that the improved
performance of SVRCE was attributable to its use of feature ensembles as well as affect
correlation information.
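A minimal sketch of the correlation-ensemble idea (not the exact SVRCE formulation): one
support vector regression model is trained per affect class on its own feature subset,
and each raw prediction is then blended with correlation-weighted evidence from the other
affects; the blending weight and the feature subsets are illustrative assumptions.

import numpy as np
from sklearn.svm import SVR

def train_svrce(X, Y, feature_subsets):
    """X: (n x d) features; Y: (n x k) affect intensities; feature_subsets:
    one index array of tailored features per affect class (assumed given)."""
    models = [SVR().fit(X[:, idx], Y[:, j]) for j, idx in enumerate(feature_subsets)]
    corr = np.corrcoef(Y, rowvar=False)          # affect correlation matrix
    return models, corr

def predict_svrce(models, corr, feature_subsets, X, alpha=0.3):
    """Blend each affect's raw SVR prediction with correlation-weighted
    evidence from the other affects (alpha is an illustrative weight)."""
    raw = np.column_stack([m.predict(X[:, idx])
                           for m, idx in zip(models, feature_subsets)])
    k = raw.shape[1]
    adjusted = np.empty_like(raw)
    for j in range(k):
        others = [i for i in range(k) if i != j]
        w = corr[j, others]
        cross = raw[:, others] @ w / (np.abs(w).sum() + 1e-9)
        adjusted[:, j] = (1 - alpha) * raw[:, j] + alpha * cross
    return adjusted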
Using the knowledge gained from previous chapters, we proposed a design
framework for CMC text analysis systems in Chapter 9. Grounded in Systemic
Functional Linguistic Theory, the framework advocates the development of systems
capable of representing the rich array of information types inherent in CMC text. It also
provided guidelines regarding the choice of features, feature selection, and visualization
techniques that CMC text analysis systems should employ. The CyberGate system was
developed as an instantiation of the design framework. CyberGate incorporates a rich
feature set and complementary feature selection and visualization methods, including the
Writeprints and Ink Blots techniques. An application example was used to illustrate the
system’s ability to discern important patterns in CMC text. Furthermore, results from
numerous experiments conducted in comparison with benchmark methods confirmed the
viability of CyberGate’s features and techniques. The results revealed that the CyberGate
system and its underlying design framework can dramatically improve CMC text analysis
capabilities over those provided by existing systems.
10.2 Relevance to MIS
Many MIS studies have expounded upon the significance of CMC text analysis for
analyzing organizations (Chia, 2000). CMC text analysis is important for evaluating the
effectiveness and efficiency of electronic communication in various organizational
settings, including virtual teams and group support systems (Fjermestad and Hiltz, 1999;
Montoya-Weiss et al., 2001). Analysis of CMC text also plays a crucial role in facilitating
the measurement of return on investment for various online communities including
electronic communities and networks of practice (Cothrel, 2000; Wenger and Snyder,
2000; Wasko and Faraj, 2005).
MIS consists of two important research paradigms: behavioral science and design
science. According to the design science paradigm, design is a product and a process
(Walls et al., 1992; Hevner et al., 2004). The design product is the set of requirements and
necessary design characteristics that should guide IT artifact construction. An IT artifact
can be a construct, method, model, or instantiation (Hevner et al., 2004). The design
process is the steps and procedures taken to develop the artifact. This dissertation
presents a collection of essays pertaining to the design product aspect of design science.
Our work investigates issues related to the construction of IT artifacts supporting analysis
of computer-mediated communication text. Using Systemic Functional Linguistic Theory
as a kernel theory, we introduced features, feature selection, and visualization methods
capable of discriminating and characterizing CMC text. Chapters 2-5 focused on
information related to the textual meta-function of SFLT and how it could be used for
enhanced identity and institutional trust. In these chapters, we analyzed various CMC
modes prevalent in organizational settings and often associated with cybercrime,
including email, web forums, chat, feedback comments, and websites. Chapters 6-8
emphasized the ideational meta-function and its application to online opinion mining. In
these chapters we demonstrated the utility of using a rich representation of ideational
information types for enhanced consumer sentiment analysis regarding products, movies,
and social issues. Chapter 9 used Walls et al.'s (1992) model to introduce an information
systems design framework for the construction of CMC text analysis systems. That
chapter leveraged knowledge gained from the preceding chapters, culminating in a design
framework and a system developed as an instantiation of that framework.
10.3 Future Directions
Future research will continue to broaden and deepen work relating to online
authentication, analysis of CMC text, and information visualization.
(1) Extending Research on Online Authentication: Future work on authentication in
electronic markets and CMC will look to combine stylometric models with graph-based
models of user interaction. Such a combined approach is intended to take
advantage of the high accuracy of stylometric models in unison with the scalability of
graph-based models, which can be applied to hundreds of thousands of online market
traders. Graph models of website linkage may also be useful for further improving the
performance of machine learning based fake website detection techniques, which
currently rely heavily on content based cues.
(2) Analyzing Power Cues in Computer Mediated Communication: Computer
mediated communication is rich in genres and social cues, which have a major impact on
the communication dynamics. Recent literature has suggested that an equally important
factor that affects online communication is the power structure/relation between the
sender and recipient of a message. That is, people interact very differently depending
upon whether they are communicating with a superior (e.g., their boss), someone on the
same level (e.g., colleague or team member), or a subordinate (e.g., assistant or support
staff). Of particular interest is analysis of key textual features (i.e., power cues) inherent
in text-based CMC that are capable of identifying the power relations for various CMC
interactions. The construction of a set of power cue attributes could help provide insight
into the communication dynamics of various organizations, online communities, and
electronic networks of practice.
(3) Developing and Evaluating Techniques for Visualizing Electronic Markets: The
work on CMC visualization has several interesting potential future directions.
Application of the visualization framework and system proposed in Chapter 9 towards
better representing electronic market and CMC information could be useful for
improving online trust. Specifically, developing an online authentication system that
allows internet buyers and sellers to visualize information regarding potential online
traders as well as escrow websites could help alleviate the information asymmetry
problem. The system could potentially integrate analysis and visualization of textual
content and linkage/interaction information. Evaluating the impact of such a socially
translucent system on the users’ perceived trust regarding online buyers and sellers and
escrow websites would be of particular interest.
REFERENCES
Abbasi, A., and Chen, H. “Identification and Comparison of Extremist-Group Web
Forum Messages using Authorship Analysis,” IEEE Intelligent Systems (20:5), 2005,
pp. 67-75.
Abbasi, A. and Chen, H. "Visualizing Authorship for Identification", In the 4th IEEE
Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA,
2006.
Abbasi, A. and Chen, H. “Detecting Fake Escrow Websites using Rich Fraud Cues and
Kernel Based Methods,” In Proceedings of the 7th Annual Workshop on Information
Technologies and Systems (WITS), Montreal, Canada, 2007.
Abbasi, A., and Chen, H. “Writeprints: A Stylometric Approach to Identity-Level
Identification and Similarity Detection in Cyberspace,” ACM Transactions on
Information Systems, (26:2), 2008.
Abbasi, A. and Chen, H. “CyberGate: A System and Design Framework for Text Analysis
of Computer Mediated Communication,” MIS Quarterly, forthcoming, 2008.
Abbasi, A., Chen, H., Thoms, S., and Fu, T. “Affect Analysis of Web Forums and Blogs
using Correlation Ensembles,” IEEE Transactions on Knowledge and Data
Engineering, forthcoming, 2008.
Airoldi, E., and Malin, B. “Data Mining Challenges for Electronic Safety: The Case of
Fraudulent Intent Detection in E-Mails,” Workshop on Privacy and Security Aspects
of Data Mining, 2004.
Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y. “Topic Detection and
Tracking Pilot Study: Final Report,” In Proceedings of the DARPA Broadcast News
Transcription and Understanding Workshop, 1998, pp. 194-218.
Allan, J., Leuski, A., Swan R. C., and Byrd, D. “Evaluating Combinations of Ranked
Lists and Visualizations of Inter-Document Similarity,” Information Processing and
Management, (37:3), 2001, pp. 435-458.
Alexouda, G., and Paparrizos, K. “A Genetic Algorithm Approach to the Product Line
Design Problem using the Seller's Return Criterion: An Extensive Comparative
Computational Study,” European Journal of Operational Research, (134), 2001, pp.
165–178.
Aggarwal, C. C., Orlin, J., and Tai, R. P. “Optimized Crossover for the Independent Set
Problem,” Operations Research, (45:2), 1997, pp.226-234.
Agrawal, R., Rajagopalan, S., Srikant, R. and Xu, Y. “Mining Newsgroups using
Networks Arising from Social Behavior,” In Proceedings of the 12th International
World Wide Web Conference, 2003, pp. 529-535.
Andrienko, N. and Andrienko, G. “Informed Spatial Decisions through Coordinated
Views,” Information Visualization, (2:4), 2003, pp. 270-285.
Arasu, A., Cho, J., Garcia-Molina, H., Paepcke, A., and Raghavan, S. “Searching the
Web,” ACM Transactions on Internet Technology, (1:1), 2001, pp. 2-43.
Argamon, S., Koppel, M., and Avneri, G. “Routing Documents According to Style,” First
International Workshop on Innovative Information, 1998.
Argamon, S., Whitelaw, C., Chase, P., Raj Hota, S., Garg, N., and Levitan, S. “Stylistic
Text Classification using Functional Lexical Features,” Journal of the American
Society for Information Science and Technology, (58:6), 2007, pp. 802-822.
Argamon, S., Saric, M., and Stein, S. S. “Style Mining of Electronic Messages for
Multiple Authorship Discrimination: First Results,” in Proceedings of the ninth ACM
SIGKDD International conference on Knowledge discovery and data mining, 2003.
Ba, S. and Pavlou, P. A. “Evidence of the Effect of Trust Building Technology in Electronic
Markets: Price Premiums and Buyer Behavior,” MIS Quarterly, (26:3), 2002, pp.
243-268.
Baayen, R. H., Halteren, H. V., and Tweedie, F. J. “Outside the Cave of Shadows: Using
Syntactic Annotation to Enhance Authorship Attribution,” Literary and Linguistic
Computing, (11:3), 1996, pp. 121-132.
Baayen, R. H., Halteren, H. v., Neijt, A., and Tweedie, F. “An Experiment in Authorship
Attribution,” In Proceedings of the 6th International Conference on the Statistical
Analysis of Textual Data (JADT), 2002.
Balakrishnan, P. V., Gupta, R., and Jacob, V. S. “Development of Hybrid Genetic
Algorithms for Product Line Designs,” IEEE Transactions on Systems, Man, and
Cybernetics, (34:1), 2004, pp. 468-483.
Baldwin, R. G. Image Pixel Analysis using Java, Online Press, Austin, Texas, 2005.
Barua, A., Konana, P. and Whinston, A. “An Empirical Investigation of Net-Enabled
Business Value,” MIS Quarterly, (28:4), 2004, pp. 585-620.
Beineke, P., Hastie, T. and Vaithyanathan, S. “The Sentimental Factor: Improving Review
Classification via Human-Provided Information,” In Proceedings of the 12th Annual
Meeting of the Association for Computational Linguistics, 2004, pp 263-280.
Berry, R. E. and Meekings, B. A. E. “A Style Analysis of C Programs,” Communications
of the ACM, (28:1), 1985, pp. 80-88.
Binongo, J. N. G., and Smith, M. W. A. “The Application of Principal Component
Analysis to Stylometry,” Literary and Linguistic Computing, (14:4), 1999, pp. 445-466.
Bolton, G. E., Katok, E., and Ockenfels, A. “How Effective are Electronic Reputation
Mechanisms? An Experimental Investigation,” Management Science, (50:11), 2004,
pp. 1587-1602.
Brynjolfsson, E. and Smith, M. D. “Frictionless Commerce? A Comparison of Internet
and Conventional Retailers,” Management Science, (46:4), 2000, pp. 563-585.
Burgun, A. and Bodenreider, O. “Comparing Terms, Concepts, and Semantic Classes in
WordNet and the Unified Medical Language System,” In Proceedings of the North
American Association of Computational Linguistics Workshop, 2001, pp. 77-82.
Burrows, J. F. “Word Patterns and Story Shapes: The Statistical Analysis of Narrative
Style,” Literary and Linguistic Computing, (2:2), 1987, pp. 61-70.
Butler, B. S. “Membership Size, Communication Activity, and Sustainability: A
Resource-Based Model of Online Social Structures,” Information Systems Research,
(12:4), 2001, pp. 346-362.
Chaski, C. E. “Empirical Evaluation of Language-based Author Identification
Techniques,” Forensic Linguistics, (8:1), 2001, pp. 1-65.
Chaski, C. E. “Who’s at the Keyboard? Authorship Attribution in Digital Evidence
Investigation,” International Journal of Digital Evidence (4:1), 2005, pp. 1-13.
Chen, H. Knowledge Management Systems. A Text Mining Perspective, Knowledge
Computing Corporation, 2001.
Chen, H., Lally, A.M., Zhu, B., and Chau, M. “HelpfulMed: Intelligent Searching for
Medical Information over the Internet,” Journal of the American Society for
Information Science and Technology (54:7), 2003, pp. 683-694.
Cherkauer, K. J. “Human Expert-level Performance on a Scientific Image Analysis Task
by a System Using Combined Artificial Neural Networks,” In Chan, P. (Ed.), Working
Notes of the AAAI Workshop on Integrating Multiple Learned Models, 1996, pp. 15-21.
Chia, R. “Discourse Analysis as Organizational Analysis,” Organization, (7:3), 2000, pp.
513-518.
Cho, Y. H. and Lee, K. J. “Automatic Affect Recognition using Natural Language
Processing Techniques and Manually Built Affect Lexicon,” IEICE Transactions on
Information Systems, (E89:12), 2006, pp. 2964-2971.
Chou, N. Ledesma, R., Teraguchi, Y., Boneh, D. and Mitchell, J. C. “Client-side Defense
Against Web-based Identity Theft,” In Proceedings of the Network and Distributed
System Security Symposium, San Diego, CA., 2004.
Chua, C. E. H. and Wareham, J. “Fighting Internet Auction Fraud: An Assessment and
Proposal,” IEEE Computer, (37:10), 2004, pp. 31–37.
Corney, M., Vel, O. d., Anderson, A., and Mohay, G. “Gender-preferential Text Mining of
E-mail Discourse,” Paper presented at the 18th Annual Computer Security
Applications Conference (2002 ACSAC), Las Vegas, Nevada, USA, 2002.
Cothrel, J, P. “Measuring the Success of an Online Community,” Strategy and Leadership
(20:2), 2000, pp. 17-21.
Cui, H., Mittal, V., and Datar, M. “Comparative Experiments on Sentiment Classification
for Online Product Reviews,” In Proceedings of the Twenty First AAAI Conference on
Artificial Intelligence, Boston, Massachusetts, 2006, pp. 1265-1270.
Cunningham, H. “GATE, A General Architecture for Text Engineering,” Computers and
the Humanities (36), 2002, pp. 223-254.
Daft, R, L., and Lengel, R, H. “Organizational Information Requirements, Media
Richness and Structural Design,” Management Science (32:5), 1986, pp. 554-571.
Das, S. R. and Chen, M. Y. “Yahoo! for Amazon: Sentiment Extraction from Small Talk
on the Web,” Management Science (53:9), 2007, pp. 1375-1388.
Dash, M. and Liu, H. “Feature Selection for Classification,” Intelligent Data Analysis,
(1), 1997, pp. 131-156.
Dave, K., Lawrence, S. and Pennock, D. M. “Mining the Peanut Gallery: Opinion
Extraction and Semantic Classification of Product Reviews,” In Proceedings of the
12th International Conference on the World Wide Web, 2003, pp. 519-528.
Dellarocas, C. “The Digitization of Word of Mouth: Promise and Challenges of Online
Feedback Mechanisms,” Management Science, (49:10), 2003, pp. 1407-1424.
De Vel, O., Anderson, A., Corney, M., and Mohay, G. “Mining E-mail Content for Author
Identification Forensics,” ACM SIGMOD Record, (30:4), 2001, pp. 55-64.
Diederich, J., Kindermann, J., Leopold, E., and Paass, G. “Authorship Attribution with
Support Vector Machines,” Applied Intelligence (19), 2003, pp. 109-123.
Dietterich TG. “Ensemble Methods in Machine Learning,” In Proceedings of the First
International Workshop on Multiple Classifier Systems, 2000, pp. 1-15.
Diligenti, M., Coetzee, F. M., Lawrence, S., Giles, C. L., and Gori, M. “Focused
Crawling using Context Graphs,” In Proceedings of the 26th Conference on Very
Large Databases, Cairo, Egypt, 2000.
Ding, H., and Samadzadeh, H. M. “Extraction of Java Program Fingerprints for Software
Authorship Identification,” Journal of Systems and Software, (72:1), 2004, pp. 49-57.
Donath, J. “Identity and Deception in the Virtual Community,” In Communities in
Cyberspace, London, Routledge Press, 1999.
Donath, J., Karahalio, K. and Viegas, F. “Visualizing Conversation,” in Proceedings of
the 32nd Conference on Computer-Human Interaction (CHI' 02), Chicago, USA,
1999.
Donath, J. “A Semantic Approach to Visualizing Online Conversations,” Communications of
the ACM, (45:4), 2002, pp. 45-49.
Dorre, J., Gerstl, P., and Seiffert, R. “Text Mining: Finding Nuggets in Mountains of
Textual Data,” In Proceedings of the 5th ACM SIGKDD International Conference,
1999, pp. 398-401.
Drost, I. and Scheffer, T. “Thwarting the Nigritude Ultramarine: Learning to Identify
Link Spam,” In Proceedings of the European Conference on Machine Learning
(ECML), pp. 96-107, 2005.
Dumais, S., Platt, J., Heckerman, D., and Sahami, M. “Inductive Learning Algorithms
and Representations for Text Categorization,” In Proceedings of the Seventh ACM-CIKM,
1998, pp. 148-155.
Efron, M. “Cultural Orientations: Classifying Subjective Documents by Cocitation
Analysis,” In Proceedings of the AAAI Fall Symposium Series on Style and Meaning
in Language, Art, Music, and Design, 2004, pp. 41-48.
Efron, M., Marchionini, G., and Zhiang, J. “Implications of the Recursive Representation
Problem for Automatic Concept Identification in Online Government Information,” In
Proceedings of the ASIST SIG-CR Workshop, 2003.
Erickson, T. and Kellogg, W. A. “Social Translucence: An Approach to Designing
Systems that Support Social Processes,” ACM Transactions on Computer-Human
Interaction (7:1), 2000 pp. 59-83.
Ester, M., Grob, M., and Kriegel, H. “Focused Web Crawling: A Generic Framework for
Specifying the User Interest and for Adaptive Crawling Strategies,” In Proceedings of
the International Conference on Very Large Databases, 2001.
Fetterly, D., Manasse, M., and Najork, M. “Spam, Damn Spam, and Statistics,” In
Proceedings of the Seventh International Workshop on the Web and Databases, 2004.
Furnkranz, J. “Hypertext Ensembles: A Case Study in Hypertext Classification,”
Information Fusion, (3), 2002, pp. 299-312.
Fairclough, N. Analysing Discourse: Textual Analysis for Social Research, Routledge,
New York, NY, 2003.
Fei, Z., Liu, J., and Wu, G. “Sentiment Classification using Phrase Patterns,” In
Proceedings of the 4th IEEE International Conference on Computer Information
Technology, 2004, pp. 1147-1152.
Fellbaum, C. Wordnet: An Electronic Lexical Database, The MIT Press, Cambridge, MA,
1998.
Fiore, A, T., and Smith, M, A. “Tree Map Visualizations of News Groups,” Poster
Presented at IEEE Symposium on Information Visualization, 2002, Boston,
Massachusetts.
Fjermestad, J. and Hiltz, S. R. “An Assessment of Group Support Systems Experimental
Research: Methodology and Results,” Journal of Management Information Systems
(15:3), 1999, pp. 7-49.
Forman, G. “An Extensive Empirical Study of Feature Selection Metrics for Text
Classification,” Journal of Machine Learning Research (3), 2003, pp. 1289-1305.
Forsyth, R. S., and Holmes, D. I. “Feature Finding for Text Classification,” Literary and
Linguistic Computing, (11:4), 1996.
Friedman, E. and Resnick, P. “The Social Cost of Cheap Pseudonyms,” Journal of
Economic Management Strategy, (10:1), 2001, pp. 173-199.
Fu, T., Abbasi, A., and Chen, H. “A Hybrid Approach to Interactional Coherence Analysis
in Web Forums,” Journal of the American Society for Information Science and
Technology, (59:7), 2008.
Gamon, M. “Sentiment Classification on Customer Feedback Data: Noisy Data, Large
Feature Vectors, and the Role of Linguistic Analysis,” In Proceedings of the 20th
International Conference on Computational Linguistics, 2004, pp. 841-847.
Garson, G. D. Public Information Technology and E-Governance: Managing the Virtual
State, Jones and Bartlett Publishers, Boston, MA, 2006.
Grandison, T. and Sloman, M. “A Survey of Trust in Internet Applications,” IEEE
Communications Surveys and Tutorials, (4:4), 2000, pp. 2-16.
Gray, A., Sallis, P., and MacDonell, S. “Software Forensics: Extending Authorship
Analysis Techniques to Computer Programs,” In Proceedings of the Third Biannual
Conference of the International Association of Forensic Linguists, Durham, NC,
1997, pp. 1-8.
Green, P. E., Krieger, A. M., and Wind, Y. “Thirty Years of Conjoint Analysis: Reflections
and Prospects,” Interfaces (31:3), 2001, pp. 56-73.
Grefenstette, G. Qu, Y., Evans, D. A. and Shanahan, J. G. “Validating the Coverage of
Lexical Resources for Affect Analysis and Automatically Classifying New Words
Along Semantic Axes,” In Yan Qu, James Shanahan, and Janyce Wiebe, (eds),
Exploring Attitude and Affect in Text: Theories and Applications, AAAI-2004 Spring
Symposium Series, 2004a, pp. 71–78.
Grefenstette, G., Qu, Y., Shanahan, J. G. and Evans, D. A. "Coupling Niche Browsers and
Affect Analysis for an Opinion Mining Application,” In Proceedings of the 12th
International Conference Recherche d'Information Assistee par Ordinateur, 2004b,
pp. 186-194.
Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. “Gene Selection for Cancer
Classification using Support Vector Machines,” Machine Learning (46), 2002, pp.
389-422.
Guyon, I., and Elisseeff, A. “An Introduction to Variable and Feature Selection,” Journal
of Machine Learning Research (3), 2003, pp. 1157-1182.
Gyongi, Z. and Garcia-Molina, H. “Spam: It’s not Just for Inboxes Anymore,” IEEE
Computer, (38:10), 2005, pp. 28-34.
Halliday, M.A.K. An Introduction to Functional Grammar, 3rd (ed). Revised by Christian
Matthiessen, London: Hodder Arnold, 2004.
Han, J., and Kamber, M. Data Mining: Concepts and Techniques, San Francisco:
Academic Press, 2001.
Hara, N., Bonk, C, J., and Angeli, C. “Content Analysis of Online Discussion in an
Applied Educational Psychology Course,” Instructional Science (28), 2000, pp. 115-152.
Hariharan, P., Asgharpour, F, and Jean Camp, L. “NetTrust – Recommendation System
for Embedding Trust in a Virtual Realm,” In Proceedings of the ACM Conference on
Recommender Systems, Minneapolis, Minnesota, 2007.
Hatzivassiloglou, V. and McKeown, K. R. “Predicting the Semantic Orientation of
Adjectives.,” In Proceedings of the Association for Computational Linguistics, 1997,
pp. 174-181.
Hayne, C, S., Pollard, E, C., and Rice, E, R. “Identification of Comment Authorship in
Anonymous Group Support Systems,” Journal of Management Information Systems,
(20:1), 2003, pp. 301-329.
Hayne, C. S., and Rice, E. R. “Attribution Accuracy when using Anonymity in Group
Support Systems,” International Journal of Human-Computer Studies, (47:3), 1997,
pp. 429-452.
Havre, S., Hetzler, E., Whitney, P. and Nowell, L. “ThemeRiver: Visualizing Thematic
Changes in Large Document Collections,” IEEE Transactions on Visualization and
Computer Graphics, (8:1), 2002, pp. 9-20.
Hearst, M. A. "Direction-based Text Interpretation as an Information Access
Refinement,” In P. Jacobs (Ed.), Text-Based Intelligent Systems: Current Research
and Practice in Information Extraction and Retrieval. Mahwah, NJ: Lawrence
Erlbaum Associates, 1992.
Hearst, M. A. “Untangling Text Data Mining,” in Proceedings of the Association for
Computational Linguistics, 1999, pp. 3-10.
Henley, N. M., Miller, M. D., Beazley, J. A., Nguyen, D. N., Kaminsky, D., and Sanders,
R. “Frequency and Specificity of Referents to Violence in News Reports of Anti-Gay
Attacks,” Discourse & Society, (13:1), pp. 75-104.
Henri, F. “Computer Conferencing and Content Analysis,” in Collaborative Learning
through Computer Conferencing: The Najaden papers, A.R. Kaye, (ed), 1992, pp.
115-136.
Herring, S. C. “Computer-Mediated Communication on the Internet,” Annual Review of
Information Science and Technology (36:1), 2002, pp. 109-168.
Herring, S., Job-Sluder, K., Scheckler, R., and Barab, S. ”Searching for Safety Online:
Managing ‘Trolling’ in a Feminist Forum,” The Information Society, 2002, (18:5), pp.
371-384.
Herring, S. and Paolillo, J. C. ”Gender and Genre Variations in Weblogs,” Journal of
Sociolinguistics, (10:4), 2006, pp. 439-450.
Hevner, A, R., March, S, T., Park, J., and Ram, S. “Design Science in Information
Systems Research,” MIS Quarterly (28:1), 2004, pp. 75-105.
Hoar, S. B. “Trends in Cybercrime: The Darkside of the Internet,” Criminal Justice,
(20:3), 2005, pp. 4-13.
Holland, J. Adaptation in Natural and Artificial Systems. Ann Arbor, University of
Michigan Press, 1975.
Holmes, D. I. “A Stylometric Analysis of Mormon Scripture and Related Texts,” Journal
of Royal Statistical Society, (155), pp. 91-120, 1992.
Holzer, R., Malin, B., and Sweeney, L. “Email Alias Detection using Social Network
Analysis,” In Proceedings of the Third International Workshop on Link Discovery,
Chicago, IL: ACM Press, 2005, pp. 52-57.
Hu, X., Lin, Z., Whinston A. B., and Zhang H., “Hope or Hype: On the Viability of
Escrow Services as Trusted Third Parties in Online Auction Environments,”
Information Systems Research, (15:3), 2004, pp. 236-249.
Hu, M. and Liu, B. “Mining and Summarizing Customer Reviews,” In Proceedings of the
Tenth ACM SIGKDD Conference, Seattle, Washington, 2004, pp. 168-177.
Huang, S., Ward, M, O., and Rundensteiner, E, A. “Exploration of Dimensionality
Reduction For Text Visualization,” in Proceedings of The Third International
Conference on Coordinated and Multiple Views in Exploratory Visualization
(CMV’05), 2005.
Jackson, D. “Stopping Rules in Principal Component Analysis: A Comparison of
Heuristical and Statistical Approaches," Ecology (74:8), 1993, pp. 2204-2214.
Jain, A. and Zongker, D. “Feature Selection: Evaluation, Application, and Small Sample
Performance,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
(19:2), 1997, pp. 153-158.
Jiang, M., Jensen, E., Beitzel, S., and Argamon, S. “Choosing the Right Bigrams for
Information Retrieval,” In Proceedings of the Meeting of the International Federation of
Classification Societies, 2004.
Joachims, T. “Text Categorization with Support Vector Machines: Learning with Many
Relevant Features,” In Proceedings of the 10th European Conference on Machine
Learning, 1998, pp. 137-142.
Joachims, T., Cristianini, N., Shawe-Taylor, J. “Composite Kernels for Hypertext
Categorisation,” In Proceedings of the 18th International Conference on Machine
Learning, 2001, pp. 250-257.
Josang, A., Ismail, R., and Boyd, C. “A Survey of Trust and Reputation Systems for
Online Service Provision,” Decision Support Systems, (43:2), 2007, pp. 618-644.
Juola, P. “What can we do with Small Corpora? Document Categorization via Cross-Entropy,” In Proceedings of the Interdisciplinary Workshop on Similarity and
Categorization, Edinburgh, UK, 1997.
Juola, P. “The Time Course of Language Change,” Computers and the Humanities,
(37:1), 2003, pp. 77-96.
Juola, P. and Baayen, H. “A Controlled-corpus Experiment in Authorship Identification
by Cross-Entropy,” Literary and Linguistic Computing (20), 2005, pp. 59-67.
Kacmarcik, G. and Gamon, M. “Obfuscating Document Stylometry to Preserve Author
Anonymity,” In Proceedings of the Conference on Linguistics, Sydney, Australia:
Association for Computational Linguistics, 2006, pp. 444-451.
Kanayama, H., Nasukawa, T., and Watanabe, H. ”Deeper Sentiment Analysis using
Machine Translation Technology,” In Proceedings of the 20th International
Conference on Computational Linguistics, 2004, pp. 494-500.
Keim, D, A. “Information Visualization and Visual Data Mining,” IEEE Transactions on
Visualization and Computer Graphics (7:1), 2002, pp. 100-107.
Keselj, V., Peng, F., Cercone, N., and Thomas, C. “N-Gram Based Author Profiles for
Authorship Attribution,” In Proceedings of the Pacific Association for Computational
Linguistics, Nova Scotia, Canada: 2003, pp. 255-264.
Khmelev, D. V. “Disputed Authorship Resolution using Relative Entropy for Markov
Chains of Letters in Human Language Texts,” Journal of Quantitative Linguistics,
(7:3), 2000, pp. 115-126.
Khmelev, D. V. and Tweedie, F. J. “Using Markov Chains for Identification of Writers,”
Literary and Linguistic Computing, (16:3), 2001, pp. 299-307.
Kim, S. and Hovy, E. “Determining the Sentiment of Opinions,” In Proceedings of the
Twentieth International Conference on Computational Linguistics, 2004, pp. 1367-1373.
Kirby, M. and Sirovich, L. “Application of the Karhunen-Loeve Procedure for the
Characterization of Human Faces,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, (12:1), pp. 103-108.
Kjell, B., Woods, W.A., and Frieder, O. “Discrimination of Authorship using
Visualization,” Information Processing and Management, (30:1), 1994, pp. 141-150.
Kolari, P. Finin, T. and Joshi, A. “SVMs for the Blogosphere: Blog Identification and
Splog Detection,” In AAAI Spring Symposium on Computational Approaches to
Analysing Weblogs, 2006.
Koppel, M., Argamon, S., and Shimoni, A.R. “Automatically Categorizing Written Texts
by Author Gender,” Literary and Linguistic Computing (17:4), 2002, pp. 401-412.
Koppel, M. and Schler, J. “Exploiting Stylistic Idiosyncrasies for Authorship
Attribution,” In Proceedings of IJCAI’03 Workshop on Computational Approaches to
Style Analysis and Synthesis, Acapulco, Mexico, 2003.
Koppel, M., Akiva, N., and Dagan, Ido. “Feature Instability as a Criterion for Selecting
Potential Style Markers,” Journal of the American Society for Information Science
and Technology, (57:11), 2006, pp. 1519-1525.
Krsul, Ivan., and Spafford, H, E. “Authorship Analysis: Identifying the Author of a
Program,” Computers and Security (16:3), 1997, pp. 233-257.
Lee, A, S. “Electronic Mail as a Medium of Rich Communication: An Empirical
Investigation using Hermeneutic Interpretation,” MIS Quarterly, 1994, pp. 143-157.
Levine, D. “Application of a Hybrid Genetic Algorithm to Airline Crew Scheduling,”
Computers and Operations Research, (23:6), pp. 547-558.
Levy, E. and Arce, I. “Criminals Become Tech Savvy,” IEEE Security and Privacy, (2:2),
2002, pp. 65-68.
Li, J., Zheng, R., and Chen, H. “From Fingerprint to Writeprint,” Communications of the
ACM (49:4), 2006, pp. 76-82.
Li, J., Su, H., Chen, H., and Futscher, B. “Optimal Search-Based Gene Subset Selection
for Gene Array Cancer Classification,” IEEE Transactions on Information Technology
in Biomedicine, (11:4), 2007, pp. 398-405.
Li, L. and Helenius, M. “Usability Evaluation of Anti-Phishing Toolbars,” Journal in
Computer Virology, (3:2), 2007, pp. 163-184.
Lin, F. C., Shi, H. and Wang, X. “Splog Detection using Self-similarity Analysis on Blog
Temporal Dynamics,” In Proceedings of the 3rd International Workshop on
Adversarial Information Retrieval on the Web (AIRWeb), 2007.
Liu, H. and Motada, H. Feature Extraction, Construction, and Selection – Data Mining
Perspective, 1998, Kluwer Academic Publishers.
Liu, B., Hu, M., and Cheng, J. “Opinion Observer: Analyzing and Comparing Opinions
on the Web,” In Proceedings of the 14th International World Wide Web Conference,
2005, pp. 342-351.
Liu, W., Deng, X., Huang, G. and Fu, A. Y. “An Antiphishing Strategy Based on Visual
Similarity Assessment,” IEEE Internet Computing, (10:2), 2006, pp. 58-65.
Liu, H., Lieberman, H., and Selker, T. “A Model of Textual Affect Sensing using Real-world Knowledge,” In Proceedings of the 8th International Conference on Intelligent
User Interfaces, Miami, Fl., 2003.
Losiewicz, P., Oard, D. and Kostoff, R. N. “Textual Data Mining to Support Science and
Technology Management,” Journal of Intelligent Information Systems, (15), 2000,
pp. 99-119.
Ma, H., Prendinger, and Ishizuka, M. “Emotion Estimation and Reasoning Based on
Affective Textual Interaction,” In Proceedings of the First International Confernece
on Affective Computing and Intelligent Interaction, 2005, pp. 622-628.
MacInnes, I., Damani, M., and Laska, J. “Electronic Commerce Fraud: Towards an
Understanding of the Phenomenon,” In Proceedings of the Hawaii International
Conference on Systems Sciences (HICSS), 2005.
March, S. T. and Smith, G. “Design and Natural Science Research on Information
Technology,” Decision Support Systems (15:4), 1995, pp. 251-266.
Markus, M, L., Majchrzak, A., and Gasser, L. “A Design Theory for Systems that
Support Emergent Knowledge Processes,” MIS Quarterly (26:3), 2002, pp. 179-212.
Martin, J. R. and White, P. R. R. The Language of Evaluation: Appraisal in English.
Palgrave, London, 2005.
Martindale, C., and McKenzie, D. “On the Utility of Content Analysis in Author
Attribution: The Federalist,” Computers and the Humanities, (29), 1995, pp. 259-270.
McDonald, D., Chen, H., Hua S., and Marshall, B. “Extracting Gene Pathway Relations
using a Hybrid Grammar: The Arizona Relation Parser,” Bioinformatics (20:18),
2004, pp. 3370-3378.
McKnight, D. H., Choudhury, V., and Kacmar, C. “The Impact of Initial Consumer Trust
on Intentions to Transact with a Website: A Trust Building Model,” Journal of
Strategic Information Systems, (11), 2002, pp. 297-323.
Menczer, F. “Lexical and Semantic Clustering by Web Links,” Journal of the American
Society for Information Science and Technology, (55:14), 2004, pp.1261-1269.
Menczer, F., Pant, G., and Srinivasan, M. E. “Topical Web Crawlers: Evaluating Adaptive
Algorithms,” ACM Transactions on Internet Technology, (4:4), 2004, pp. 378-419.
Merriam, T. V. N., and Matthews, R. A. J. “Neural Computation in Stylometry II: An
Application to the Works of Shakespeare and Marlowe,” Literary and Linguistic
Computing, (9:1-6), 1994.
Metaxas, P. T. and DeStefano, J. “Web Spam, Propaganda and Trust,” In Proceedings of
the 1st International Workshop on Adversarial Information Retrieval on the Web
(AIRWeb), 2005.
Miller, N. E., Wong, P. C., Brewster, M. and Foote, H. “Topic Islands: A Wavelet-based
Text Visualization System,” In Proceedings of IEEE Visualization, Research Triangle
Park, NC, USA. 1998
Mishne, G. “Experiments with Mood Classification,” In Proceedings of the Stylistic
Analysis of Text for Information Access Workshop, 2005.
Mishne, G., Carmel, D., and Lempel, R. “Blocking Blog Spam with Language Model
Disagreement,” In Proceedings of the 1st International Workshop on Adversarial
Information Retrieval on the Web (AIRWeb), 2005.
Mishne, G. and Rijke, M. D. “Capturing Global Mood Levels using Blog Posts,” In
Proceedings of the AAAI Spring Symposium on Computational Approaches to
Analysing Weblogs, 2006.
Mitra, M., Buckley, C., Singhal, A., and Cardie, C. “An Analysis of Statistical and
Syntactic Phrases,” In Proceedings of the 5th International Conference Recherche
d'Information Assistee par Ordinateur, Montreal, Canada, 1997, pp. 200-214.
Mladenic, D., “Text-Learning and Related Intelligent Agents: A Survey,” IEEE Intelligent
Systems (14:4), 1999, pp. 44-54.
Mladenic, D., Brank, J., Grobelnik, M., and Milic-Frayling, N. “Feature Selection using
Linear Classifier Weights: Interaction with Classification Models,” In Proceedings of
the 27th ACM SIGIR Conference on Research and Development in Information
Retrieval, Sheffield, UK, 2004, pp. 234-241.
Morinaga, S., Yamanishi, K., Tateishi, K., and Fukushima, T. “Mining Product
Reputations on the Web,” In Proceedings of the Eighth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining , Edmonton, Canada, 2002,
pp. 341-349.
Montoya-Weiss, M., Massey, A. P., and Song, M. “Getting it Together: Temporal
Coordination and Conflict Management in Global Virtual Teams,” Academy of
Management Journal, (44:6), 2001, pp. 1251-1262.
Moores, T., and Dhillon, G. “Software Piracy: A View from Hong Kong,”
Communications of the ACM, (43:12), 2000, pp. 88-93.
Morzy, M. “New Algorithms for Mining the Reputation of Participants of Online
Auctions,” 2005, pp. 112-121.
Mosteller, F., and Wallace, D. L. Applied Bayesian and Classical Inference: The Case of
the Federalist Papers (2 ed.): Springer-Verlag, 1964.
Mui, L., Mohtashemi, M., and Halberstadt, A. “A Computational Model of Trust and
Reputation,” In Proceedings of the Thirty-Fifth Hawaii International Conference on
System Sciences, Hawaii: IEEE Computer Society Press, 2002, pp. 2431-2439.
Mullen, T., and Collier, N. “Sentiment Analysis using Support Vector Machines with
Diverse Information Sources,” In Proceedings of the Empirical Methods in Natural
Language Processing, 2004, pp. 412-418.
Muller, K., Mika, S., Ratsch, G., Tsuda, K., and Scholkopf, B. “An Introduction to
Kernel-based Learning Algorithms,” IEEE Transactions on Neural Networks, (12:2),
2001, pp. 181-201.
Nasukawa, T. and Nagano, T. “Text Analysis and Knowledge Mining System,” IBM
Systems Journal (40:4), 2001, pp. 967-984.
Ng, V., Dasgupta, S., and Arifin, S. M. N. “Examining the Role of Linguistic Knowledge
Sources in the Automatic Identification and Classification of Reviews,” In
Proceedings of the COLING/ACL Conference, Sydney, Australia, 2006, pp. 611-618.
Nigam, K., and Hurst, M. “Towards a Robust Metric of Opinion,” In Proceedings of the
AAAI Spring Symposium on Exploring Attitude and Affect in Text, 2004.
Novak, J., Raghavan, P., and Tomkins, A. “Anti-aliasing on the Web,” In Proceedings of
the Thirteenth International World Wide Web Conference, New York, NY: ACM
Press, 2004, pp. 30-39.
Ntoulas, A., Najork, M., Manasse, M. and Fetterly, D. “Detecting Spam Web Pages
through Content Analysis,” In Proceedings of the International World Wide Web
Conference (WWW), 2006, pp. 83-92.
Oliveira, L. S., Sabourin, R., Bortolozzi, F., and Suen, C.Y. “Feature Selection using
Multi-objective Genetic Algorithms for Handwritten Digit Recognition,” In
Proceedings of the 16th International Conference on Pattern Recognition, 2002, pp.
568-571.
Oman, W, P., and Cook, R, C. “Programming style Authorship Analysis,” in Proceedings
of the 17th Annual ACM Computer Science Conference 1989, pp.320-326.
Paccagnella, L. “Getting the Seats of Your Pants Dirty: Strategies for Ethnographic
Research on Virtual Communities,” Journal of Computer Mediated Communication
(3:1), 1997.
Pan, Y. “ID Identification in Online Communities,” Working Paper, 2006.
Pang, B., Lee, L., and Vaithyanathan, S. "Thumbs Up? Sentiment Classification using
Machine Learning Techniques", In Proceedings of the Empirical Methods in Natural
Language Processing, 2002, pp. 79-86.
Pang, B., and Lee, L. “A Sentimental Education: Sentiment Analysis using Subjectivity
Summarization Based on Minimum Cuts,” In Proceedings of the 42nd Annual Meeting
of the Association for Computational Linguistics, 2004, pp. 271-278.
Panteli, N. “Richness, Power Cues and Email Text,” Information and Management, 2002,
pp. 75-86.
Pavlou, P. A. and Gefen, D. “Building Effective Online Marketplaces with Institution-based
Trust,” Information Systems Research, (15:1), 2004, pp. 37-59.
Peng, F., Schuurmans, D., Keselj, V., and Wang, S. “Automated Authorship Attribution
with Character Level Language Models,” In Proceedings of the 10th Conference of
the European Chapter of the Association for Computational Linguistics, 2003.
Picard, R. W. Affective Computing, MIT Press, Cambridge, MA., 1997.
Picard, R. W., Vyzas, E. and Healey, J. “Toward Machine Emotional Intelligence:
Analysis of Affective Physiological State,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, (23:10), 2001, pp. 1179-1191.
Platt, J. “Fast Training on SVMs using Sequential Minimal Optimization,” In
B.Schoelkopf, C. Burges, and A. Smola (eds.) Advances in Kernel Methods: Support
Vector Learning. MIT Press, Cambridge, MA, 1999, pp. 185-208.
Popescu, A. and Etzioni, O. “Extracting Product Features and Opinions from Reviews,”
In Proceedings of the HLT/EMNLP Conference, Vancouver, Canada, 2005, pp. 339-346.
Quinlan, J. R. “Induction of Decision Trees,” Machine Learning, 1(1), 1986, pp. 81-106.
Rao, J. R. and Rohatgi, P. “Can Pseudonymity Really Guarantee Privacy?” In Proceedings
of the Ninth USENIX Security Symposium, Denver, CO: USENIX Association, 2000,
pp. 85-96.
Rasmusson L. and Jansson, S. “Simulated Social Control for Secure Internet Commerce,”
In Proceedings of the New Security Paradigm Workshop, Lake Arrowhead, CA: ACM
Press, 1996, pp. 18-25.
Read, J. “Recognizing Affect in Text using Point-wise Mutual Information,” Masters
Thesis, 2004.
Resnick, P., Zeckhauser, R., Friedman, E., and Kuwabara, K. “Reputation Systems,”
Communications of the ACM, (43:12), 2000, pp. 45-48.
Resnick, P., Zeckhauser, R., Swanson, J., and Lockwood, K. “The Value of Reputation on
eBay: A Controlled Experiment,” Experimental Economics, (9:2), 2006, pp. 79-101.
Riloff, E. and Wiebe, J. “Learning Extraction Patterns for Subjective Expressions,” In
Proceedings of the Conference on Empirical Methods in Natural Language
Processing, 2003, pp. 105-112.
Riloff, E., Wiebe, J., and Wilson, T. “Learning Subjective Nouns using Extraction Pattern
Bootstrapping,” In Proceedings of the Seventh Conference on Natural Language
Learning Conference, Edmonton, Canada, 2003, pp. 25-32.
Riloff, E., Patwardhan, S., and Wiebe, J. “Feature Subsumption for Opinion Analysis,” In
Proceedings of the Conference on Empirical Methods in Natural Language
Processing, Sydney, Australia, 2006, pp. 440-448.
Robinson, L. “Debating the Events of September 11th: Discursive and Interactional
Dynamics in Three Online Fora,” Journal of Computer-Mediated Communication,
(10:4), 2005.
Rudman, J. “The State of Authorship Attribution Studies: Some Problems and Solutions,”
Computers and the Humanities (31), 1997, pp. 351-365.
Sack, W. “Conversation Map: An Interface for Very Large-Scale Conversations,” Journal
of Management Information Systems (17:3), 2000, pp. 73-92.
Salvetti, F. and Nicolov, N. “Weblog Classification for Fast Splog Filtering: A URL
Language Model Segmentation Approach,” In Proceedings of the Human Language
Technology Conference, 2006, pp. 137-140.
Santini, M. “A Shallow Approach to Syntactic Feature Extraction for Genre
Classification,” in Proceedings of the 7th Annual Colloquium for the UK Special
Interest Group for Computational Linguistics (CLUK 04), 2004.
Schumaker, R. and Chen, H. “Textual Analysis of Stock Market Prediction using
Financial News Articles,” Americas Conference on Information Systems, Acapulco,
Mexico, 2006.
Sebastiani, F. “Machine Learning in Automated Text Categorization,” ACM Computing
Surveys, (34:1), 2002, pp. 1-47.
Seo, J., and Shneiderman, B. “A Rank-by-Feature Framework for the Interactive
Exploration of Multidimensional Data,” Information Visualization, (4), 2005, pp. 99-113.
Shannon, C. E. “A Mathematical Theory of Communication,” Bell System Technical
Journal, (27:4), 1948, pp. 379–423.
Shannon, C. E. “Prediction and Entropy of Printed English,” Bell System Technical
Journal, (30:1), 1951, pp. 50–64.
Shen, G., Gao, B., Liu, T. Y., Feng, G., Song, S., and Li, H. “Detecting Link Spam using
Temporal Information,” In Proceedings of the International Conference on Data
Mining (ICDM), 2006.
Siedlecki, W. and Sklansky, J. “A Note on Genetic Algorithms for Large-Scale Feature
Selection,” Pattern Recognition Letters, (10:5), 1989, pp. 335-347.
Simon, H, A. The Sciences of the Artificial, 3rd (ED), MIT Press, Cambridge, MA, 1996.
Smith, M, A., and Fiore, A, T. “Visualization Components for Persistent Conversations,”
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
Seattle, Washington, United States, 2001, pp. 136-143.
Smith, M. “Tools for Navigating Large Social Cyberspaces,” Communications of ACM
(45:4), 2002, pp. 51-55.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G. “Automatic Text Categorization in
Terms of Genre and Author,” Association for Computational Linguistics, 2001, pp. 471-495.
Stamatatos, E., and Widmer, G. “Music Performer Recognition Using an Ensemble of
Simple Classifiers,” Proceedings of the 15th European Conference on Artificial
Intelligence (ECAI’2002), 2002, Lyon, France.
Subasic, P., and Huettner, A. “Affect Analysis of Text Using Fuzzy Semantic Typing,”
IEEE Transactions on Fuzzy Systems (9:4), 2001, pp. 483-496.
Sullivan, B. “Seduced Into Scams: Online Lovers Often Duped,” MSNBC, July 28, 2005.
Sun, A., Lim, E., Ng, W., and Srivastava, J. “Blocking Reduction Strategies in
Hierarchical Text Classification,” IEEE Transactions on Knowledge and Data
Engineering, (16), 2004, pp. 1305-1308.
Tan, A. “Text Mining: The State of the Art and the Challenges,” In Proceedings of the
PAKDD Workshop on Knowledge Discovery and Data Mining, 1999.
Tan, Y. and Wang, J. “A Support Vector Machine with a Hybrid Kernel and Minimal
Vapnik-Chervonenkis Dimension,” IEEE Transactions on Knowledge and Data
Engineering, (16), 2004, pp. 385-395.
Tong, R. “An Operational System for Detecting and Tracking Opinions in On-line
Discussion,” In Proceeding of the ACM SIGIR Workshop on Operational Text
Classification, 2001, pp. 1-6.
Turney, P. D. “Thumbs Up or Thumbs Down? Semantic Orientation Applied to
Unsupervised Classification of Reviews,” In Proceedings of the 40th Annual Meetings
of the Association for Computational Linguistics, 2002, pp. 417-424.
Turney, P. D., and Littman, M. L. “Measuring Praise and Criticism: Inference of
Semantic Orientation from Association,” ACM Transactions on Information Systems
(21:4), 2003, pp. 315-346.
Tweedie, F. J., Singh, S., and Holmes, D. I. “Neural Network Applications in Stylometry:
The Federalist Papers,” Computers and the Humanities, (30:1), 1996, pp. 1-10.
Uenohara, M. and Kanade, T. “Use of Fourier and Karhunen-Loeve Decomposition for
Fast Pattern Matching with a Large Set of Features,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, (19:8), 1997, pp. 891-898.
Urvoy, T., Lavergne, T., and Filoche, P. “Tracking Web Spam with Hidden Style
Similarity,” In Proceedings of the 2nd International Workshop on Adversarial
Information Retrieval on the Web (AIRWeb), 2006.
Vafaie, H. and Imam, I. F. “Feature Selection Methods: Genetic Algorithms vs. Greedy-Like
Search,” In Proceedings of the International Conference on Fuzzy and Intelligent
Control Systems, 1994.
Valitutti, A., Strapparava, C., and Stock, O. “Developing Affective Lexical Resources,”
PsychNology Journal, (2:1), 2004, pp. 61-83.
Viegas, F.B., and Smith, M. “Newsgroup Crowds and AuthorLines: Visualizing the
Activity of Individuals in Conversational Cyberspaces,” in Proceedings of the 37th
Hawaii International Conference on System Sciences (HICSS, 04), Hawaii, USA,
2004.
Walls, J. G., Widmeyer, G. R., and El Sawy, O. A. “Building an Information System
Design Theory for Vigilant EIS,” Information Systems Research (3:1), 1992, pp. 36-59.
Wang, H., Fan, W., Yu, P. S., and Han, J. “Mining Concept-Drifting Data Streams Using
Ensemble Classifiers,” In Proceedings of the SIGKDD Conference, 2003.
Wasko, M. M., and Faraj, S. “Why Should I Share? Examining Social Capital and
Knowledge Contribution in Electronic Networks of Practice,” MIS Quarterly (29:1),
2005, pp. 35-57.
Watanabe, S. Pattern Recognition: Human and Mechanical. John Wiley and Sons, Inc.,
New York, NY, 1985.
Webb, A. Statistical Pattern Recognition. John Wiley and Sons, Inc., New York, NY, 2002.
Wellman, B. “Computer Networks as Social Networks,” Science (293), 2001, pp. 2031-2034.
Wenger, E. C., and Snyder, W. M. “Communities of Practice: The Organizational
Frontier,” Harvard Business Review, 2000.
Whitelaw, C., Garg, N., and Argamon, S. “Using Appraisal Groups for Sentiment
Analysis,” In Proceedings of the ACM Fourteenth Conference on Information and
Knowledge Management, 2005, pp. 625-631.
Wiebe, J. “Tracking Point of View in Narrative,” Computational Linguistics, (20:2),
1994, pp. 233-287.
Wiebe, J., Wilson, T., and Bell, M. “Identifying Collocations for Recognizing Opinions,”
In Proceedings of the ACL/EACL Workshop on Collocation, Toulouse, France, 2001.
Wiebe, J., Wilson, T., Bruce, R., Bell, M., and Martin, M. “Learning Subjective
Language,” Computational Linguistics, (30:3), 2004, pp. 277-308.
Wiebe, J., Wilson, T., and Cardie, C. “Annotating Expressions of Opinions and Emotions
in Language,” Language Resources and Evaluation, (1:2), 2005.
Wilson, T., Wiebe, J., and Hoffmann, P. “Recognizing Contextual Polarity in Phrase-Level
Sentiment Analysis,” In Proceedings of the Human Language Technology Conference
and Conference on Empirical Methods in Natural Language Processing, 2005.
Wilson, S. M. and Peterson, L. C. “The Anthropology of Online Communities,” Annual
Review of Anthropology, (31), 2002, pp. 449-467.
Wise, J.A. “The Ecological Approach to Text Visualization,” Journal of the American
Society for Information Science and Technology (50:13), 1999, pp. 1224-1233.
Witten, I. H., and Frank, E. Data Mining: Practical Machine Learning Tools and
Techniques. 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
Wu, M., Miller, R. C., and Garfinkel, S. L. “Do Security Toolbars Actually Prevent
Phishing Attacks?,” In Proceedings of the Conference on Human Factors in
Computing Systems, Montreal, Canada, 2006, pp. 601-610.
Wu, B. and Davison, B. D. “Detecting Semantic Cloaking on the Web,” In Proceedings
of the World Wide Web Conference (WWW), 2006, pp. 819-828.
Wu, C., Chuang, Z. and Lin, Y. “Emotion Recognition from Text using Semantic Labels
and Separable Mixture Models,” ACM Transactions on Asian Language Information
Processing, (5:2), 2006, pp. 165-182.
Xiong, R., and Donath, J. “PeopleGarden: Creating Data Portraits for Users,” in
Proceedings of UIST, 1999.
Yang, Y. and Pedersen, J. O. “A Comparative Study on Feature Selection in Text
Categorization,” In Proceedings of the Fourteenth International Conference on
Machine Learning, San Francisco, CA: Morgan Kaufmann Publishers, 1997, pp. 412-420.
Yang, J. and Honavar, V. “Feature Subset Selection using a Genetic Algorithm,” IEEE
Intelligent Systems, (13:2), 1998, pp. 44-49.
Yates, J., and Orlikowski, W. J. “Genre Systems: Structuring Interaction through
Communicative Norms,” The Journal of Business Communication, (39:1), 2002, pp.
13-35.
Yi, J., Nasukawa, T., Bunescu, R. and Niblack, W. “Sentiment Analyzer: Extracting
Sentiments about a Given Topic using Natural Language Processing Techniques,” In
Proceedings of the Third IEEE International Conference on Data Mining, 2003, pp.
427-434.
Yi, J., and Niblack, W. “Sentiment Mining in WebFountain,” In Proceedings of the 21st
International Conference on Data Engineering, 2005, pp. 1073-1083.
Yu, L. and Liu, H. “Efficient Feature Selection via Analysis of Relevance and
Redundancy,” Journal of Machine Learning Research (5), 2004, pp. 1205-1224.
Yu, H. and Hatzivassiloglou, V. “Towards Answering Opinion Questions: Separating
Facts from Opinions and Identifying the Polarity of Opinion Sentences,” In
Proceedings of the Conference on Empirical Methods in Natural Language
Processing, 2003.
Yu, H., Han, J., and Chang, K. C. “PEBL: Web Page Classification without Negative
Examples,” IEEE Transactions on Knowledge and Data Engineering, (16), 2004, pp.
70-81.
Yule, G. U. “On Sentence Length as a Statistical Characteristic of Style in Prose,”
Biometrika, (30), 1938.
Yule, G. U. The Statistical Study of Literary Vocabulary. Cambridge University Press,
1944.
Zdziarski, J., Yang, W. and Judge, P. “Approaches to Phishing Identification using Match
and Probabilistic Digital Fingerprinting Techniques,” Paper presented at the MIT
Spam Conference, 2006.
Zhang, Y., Egelman, S., Cranor, L., and Hong, J. “Phinding Phish: Evaluating
Anti-Phishing Tools,” In Proceedings of the 14th Annual Network and Distributed
System Security Symposium (NDSS), 2007.
Zheng, R., Qin, Y., Huang, Z., and Chen, H. “A Framework for Authorship Analysis of
Online Messages: Writing-style Features and Techniques,” Journal of the American
Society for Information Science and Technology (57:3), 2006, pp. 378-393.
Zhu, B. and Chen, H. “Social Visualization for Computer-Mediated Communications: A
Knowledge Management Perspective,” in Proceedings of the Eleventh Workshop on
Information Technologies and Systems 2001, Baton Rouge, LA, USA.