AN AUTOMATIC TEXT MINING FRAMEWORK FOR KNOWLEDGE DISCOVERY ON THE WEB
by
Wingyan Chung
Copyright © Wingyan Chung 2004
A Dissertation Submitted to the Faculty of the
COMMITTEE ON BUSINESS ADMINISTRATION
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
WITH A MAJOR IN MANAGEMENT
In the Graduate College
THE UNIVERSITY OF ARIZONA
2004
THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE
As members of the Final Examination Committee, we certify that we have read the dissertation prepared by Wingyan Chung entitled "An Automatic Text Mining Framework for Knowledge Discovery on the Web" and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

Hsinchun Chen, Ph.D.            Date
Jay F. Nunamaker Jr., Ph.D.     Date
J. Leon Zhao, Ph.D.             Date
Richard T. Snodgrass, Ph.D.     Date
D. Terence Langendoen, Ph.D.    Date
Final approval and acceptance of this dissertation is contingent upon
the candidate's submission of the final copy of the dissertation to the
Graduate College.
I hereby certify that I have read this dissertation prepared under my
direction and recommend that it be accepted as fulfilling the
dissertation requirement.
Hsinchun Chen, Ph.D.
Dissertation Director
Date
STATEMENT BY AUTHOR
This dissertation has been submitted in partial fulfillment of requirements for an
advanced degree at The University of Arizona and is deposited in the University Library
to be made available to borrowers under rules of the Library.
Brief quotations from this dissertation are allowable without special permission,
provided that accurate acknowledgment of source is made. Requests for permission for
extended quotation from or reproduction of this manuscript in whole or in part must be
granted by the copyright holder.
SIGNED:
ACKNOWLEDGMENTS
This research was supported by funding from the United States of America National
Science Foundation and National Institute of Justice, and from the Department of
Management Information Systems at the University of Arizona. I thank my dissertation
advisor and mentor, Dr. Hsinchun Chen, who provided me with material support,
intellectual guidance, and effective coaching. I am deeply indebted to him. I want to
thank my dissertation committee members, Dr. Jay F. Nunamaker Jr., Dr. J. Leon Zhao,
Dr. Richard T. Snodgrass, and Dr. D. Terence Langendoen for their teaching, comments,
and guidance. In particular, I benefited much from Dr. Snodgrass's insightful and
constructive comments. I also want to thank Dr. Olivia R. L. Sheng for her guidance and
participation as a member of my doctoral written preliminary examination committee.
I am grateful for assistance from a number of excellent individuals at the University
of Arizona's Artificial Intelligence Lab. I thank the following persons for their efforts in
the CBizPort project discussed in Chapter 4: Mark Chen, Alan Yip, Michael Chau, Dan
McDonald, Byron Marshall, Thian-Huat Ong, Zan Huang, Yiwen Zhang, Gang Wang,
Jialun Qin, Yilu Zhou, Ada Leung, Chienting Lin, and Victoria Zhang. I thank Jennifer J.
Xu and Chun-Ju Tseng for their help in programming parts of the BIE system discussed
in Chapter 5 and Ann Lally for expertly evaluating the BIE system. I am grateful to Edna
O. F. Reid for contributing her expert knowledge to the work discussed in Chapter 6. I
also want to thank all the 102 human subjects who participated in the experiments
described in Chapters 4 to 6.
I am very grateful to Mrs. Barbara Sears for her careful editing of this dissertation and
many of my papers. She has greatly improved my writing.
I want especially to thank my parents and relatives for their love, care and support
over the past years. I am most grateful to my wife, Christina, for her patience,
understanding, love, care and enduring support. This research would not have been
possible without her support. I thank my baby daughter, Lydia, who rejuvenates my life
with exceeding joy, happiness, hope, and love.
DEDICATION
I dedicate this dissertation to Mrs. Helen Thornton (1919-2004) who, to my sorrow,
passed away on February 23, 2004. Helen is dear to all my family, always enthusiastic to
share the gospel and always willing to help people in need. I won't forget my first
Thanksgiving dinner at her home. Her kind concern in a soft voice asking me "how is
your study?" enduringly touches my heart.
"Beneath Helen's gruff exterior was a soft heart of conviction and compassion. Helen
loved her church, her Lord, her pastor and lost people. She was truly a dedicated
Christian who gave her money and herself sacrificially to serve the Lord. She had friends
all over the world who she personally led to Christ and will meet someday in glory."
(Hart, 2004)
I find it very difficult to accept the fact that she has left us. But I know one day I will
happily meet her again. May the Lord bless her and her labor.
TABLE OF CONTENTS

LIST OF ILLUSTRATIONS ..... 11
LIST OF TABLES ..... 12
ABSTRACT ..... 13
CHAPTER 1. INTRODUCTION ..... 15
1.1 Background ..... 15
1.2 Research Questions ..... 17
1.3 The Dissertation ..... 18
1.4 Structure and Writing Style of the Dissertation ..... 19
CHAPTER 2. LITERATURE REVIEW ..... 20
2.1 Knowledge and Knowledge Management ..... 21
2.1.1 Definition of Knowledge ..... 21
2.1.2 Classification of Knowledge ..... 24
2.1.3 Knowledge Management ..... 27
2.2 Human-Computer Interaction ..... 28
2.2.1 Evolution of Human-Computer Interaction ..... 29
2.2.2 HCI and the Web ..... 31
2.2.2.1 Searching ..... 32
2.2.2.2 Browsing ..... 33
2.2.3 Knowledge Discovery Processes ..... 35
2.3 Text Mining for Web Analysis ..... 38
2.3.1 Automatic Text Processing ..... 39
2.3.1.1 Logic-based Model ..... 40
2.3.1.2 Vector Space Model ..... 40
2.3.1.3 Probabilistic Models ..... 42
2.3.2 From Machine Learning to Data Mining ..... 42
2.3.2.1 Machine Learning ..... 43
2.3.2.2 Data Mining ..... 45
2.3.3 Web Mining ..... 48
2.3.3.1 Resource Discovery and Collection on the Web ..... 49
2.3.3.2 Pattern Extraction from the Web ..... 51
2.3.4 Text Mining for Business Intelligence ..... 53
2.3.4.1 Text Mining Processes ..... 53
2.3.4.2 Business Intelligence Tools and Techniques ..... 54
2.3.5 Summary ..... 56
2.4 Summary of the Literature Review ..... 58
CHAPTER 3. RESEARCH FORMULATION AND FRAMEWORK ..... 59
3.1 Research Gaps ..... 59
3.2 An Automatic Text Mining Framework ..... 60
3.2.1 Components of the Framework ..... 61
3.2.1.1 Collection ..... 62
3.2.1.2 Conversion ..... 63
3.2.1.3 Extraction ..... 64
3.2.1.4 Analysis ..... 64
3.2.1.5 Visualization ..... 65
3.2.2 HCI Issues ..... 66
3.2.3 Structure of the Framework ..... 67
3.3 Principles of Applying the Framework ..... 67
3.4 Comparison with Existing Text Mining Frameworks ..... 70
3.5 Evaluating the Framework ..... 71
3.5.1 Research Methodology ..... 71
3.5.2 Domain of Study: Business Intelligence ..... 72
3.5.3 Structure of the Empirical Studies ..... 74
3.6 Dissertation Chapters ..... 75
CHAPTER 4. BUILDING A BUSINESS INTELLIGENCE SEARCH PORTAL FOR INTEGRATED ANALYSIS OF HETEROGENEOUS INFORMATION ..... 78
4.1 Background ..... 78
4.2 Related Work ..... 80
4.2.1 Approaches to Information Seeking on the Web ..... 80
4.2.1.1 System-centered Approach ..... 80
4.2.1.2 User-centered Approach ..... 81
4.2.2 Web Searching in a Heterogeneous Environment ..... 83
4.2.2.1 English Search Engines ..... 83
4.2.2.2 Chinese Search Engines ..... 85
4.3 Research Questions ..... 89
4.4 Application of the Framework ..... 90
4.4.1 User Interface ..... 92
4.4.2 Encoding Converter ..... 93
4.4.3 Information Sources ..... 94
4.4.4 Summarizer ..... 95
4.4.5 Categorizer ..... 96
4.4.5.1 Mutual Information Approach ..... 96
4.4.5.2 Chinese Phrase Lexicon ..... 98
4.5 Evaluation Methodology ..... 103
4.5.1 Objectives and Experimental Tasks ..... 103
4.5.2 Hypotheses ..... 105
4.5.2.1 Hypotheses on CBizPort's Enhanced Analysis Capabilities ..... 107
4.5.2.2 Hypotheses on Search Engine Performance Comparison ..... 107
4.5.2.3 Hypotheses on Users' Subjective Evaluations ..... 108
4.5.2.4 Additional Hypotheses ..... 109
4.5.3 Experimental Design ..... 109
4.6 Experimental Results and Implications ..... 113
4.6.1 CBizPort's Assistance in Human Analysis ..... 113
4.6.2 Search Engine Performance Comparison ..... 116
4.6.3 Users' Subjective Evaluations and Verbal Comments ..... 117
4.6.4 Results of Testing Additional Hypotheses ..... 120
4.6.5 Implications of the Results ..... 120
4.7 Conclusions ..... 121
CHAPTER 5. APPLYING WEB PAGE VISUALIZATION TECHNIQUES TO DISCOVERING BUSINESS INTELLIGENCE FROM SEARCH ENGINE RESULTS ..... 123
5.1 Background ..... 123
5.2 Related Work ..... 126
5.2.1 Commercial BI Tools ..... 126
5.2.2 Browsing the World Wide Web ..... 127
5.2.2.1 Hypertext and Browsing ..... 127
5.2.2.2 Visual Displays of Textual Information ..... 129
5.2.3 Document Visualization ..... 130
5.2.3.1 Document Analysis ..... 131
5.2.3.2 Algorithms ..... 132
5.2.3.2.1 Hierarchical Clustering ..... 132
5.2.3.2.2 Partitional Clustering ..... 133
5.2.3.2.3 Multidimensional Scaling ..... 136
5.2.3.3 Visualization ..... 137
5.2.3.3.1 Knowledge Map ..... 137
5.2.3.3.2 Kohonen Self-organizing Map Visualization ..... 139
5.3 Research Questions ..... 140
5.4 Application of the Framework ..... 141
5.4.1 Data Collection ..... 142
5.4.1.1 Identifying Key Terms ..... 142
5.4.1.2 Meta-searching ..... 144
5.4.2 Automatic Parsing and Indexing ..... 145
5.4.3 Co-occurrence Analysis ..... 146
5.4.4 Web Community Identification ..... 148
5.4.5 Knowledge Map Creation ..... 151
5.5 Evaluation Methodology ..... 153
5.5.1 Objectives ..... 154
5.5.2 Experimental Tasks ..... 157
5.5.3 Experimental Design and Hypotheses ..... 158
5.5.4 Performance Measures ..... 159
5.5.5 Hypotheses ..... 160
5.6 Experimental Results and Implications ..... 162
5.6.1 Comparison between Web Community and Result List ..... 163
5.6.2 Comparison between Web Community and Knowledge Map ..... 166
5.6.3 Comparison between Knowledge Map and Kartoo Map ..... 167
5.6.4 Discussion ..... 169
5.7 Conclusions ..... 170
CHAPTER 6. USING WEB PAGE CLASSIFICATION TECHNIQUES FOR BUSINESS STAKEHOLDER ANALYSIS ON THE WEB ..... 173
6.1 Background ..... 173
6.1.1 Collaborative Commerce ..... 173
6.1.2 Understanding Business Relationships on the Web ..... 174
6.2 Related Work ..... 175
6.2.1 Stakeholder Analysis ..... 176
6.2.2 Tools and Approaches for Exploiting Business Intelligence ..... 178
6.2.3 Web Page Classification Techniques ..... 180
6.3 Research Questions ..... 183
6.4 Application of the Framework ..... 184
6.4.1 Building a Research Testbed ..... 185
6.4.2 Creation of a Domain Lexicon ..... 188
6.4.3 Automatic Stakeholder Classification ..... 189
6.4.3.1 Manual Tagging ..... 190
6.4.3.2 Feature Selection ..... 190
6.4.3.3 Automatic Stakeholder Classification ..... 194
6.5 Evaluation Methodology ..... 196
6.5.1 Experimental Design ..... 196
6.5.2 Hypotheses and Experimental Procedures ..... 199
6.6 Experimental Results and Implications ..... 201
6.6.1 Algorithm Comparison ..... 201
6.6.2 Effectiveness of the Framework ..... 204
6.6.3 Comparing the Framework with Human Judgment ..... 204
6.6.4 Studying the Use of Features ..... 206
6.6.5 Users' Subjective Comments ..... 206
6.7 Conclusions ..... 207
CHAPTER 7. CONCLUSIONS AND FUTURE DIRECTIONS ..... 209
7.1 Conclusions ..... 210
7.2 Contributions ..... 212
7.3 Relevance to Business, Management, and MIS ..... 213
7.4 Limitations ..... 214
7.5 Future Directions ..... 215
APPENDIX A: DOCUMENTS RELATED TO CHAPTER 4 ..... 217
A.1 Approval Letter from the University Human Subjects Committee ..... 217
A.2 Subject's Disclaimer Form ..... 218
A.3 Questionnaire for CBizPort Evaluation ..... 219
APPENDIX B: DOCUMENTS RELATED TO CHAPTER 5 ..... 227
B.1 Approval Letter from the University Human Subjects Committee ..... 227
B.2 Subject's Disclaimer Form ..... 228
B.3 Questionnaire for User Study on Internet Browsing ..... 229
APPENDIX C: DOCUMENTS RELATED TO CHAPTER 6 ..... 238
C.1 Approval Letter from the University Human Subjects Committee ..... 238
C.2 Subject's Disclaimer Form ..... 239
C.3 Questionnaire for Web-based Business Stakeholder Analysis ..... 240
REFERENCES ..... 248
LIST OF ILLUSTRATIONS

Figure 2.1: The hierarchy of understanding ..... 23
Figure 3.1: An automatic text mining framework for knowledge discovery on the Web ..... 61
Figure 3.2: Application of text mining techniques to knowledge discovery processes ..... 76
Figure 4.1: Framework components used to develop CBizPort ..... 91
Figure 4.2: System architecture of CBizPort ..... 93
Figure 4.3: Screen shots of various functions of CBizPort ..... 100
Figure 4.4: Search page of CBizPort (Traditional Chinese version) ..... 101
Figure 4.5: Search page of CBizPort (Simplified Chinese version) ..... 101
Figure 4.6: Result page of CBizPort ..... 102
Figure 4.7: Web page summarizer ..... 102
Figure 4.8: Web page categorizer ..... 103
Figure 5.1: A typical document visualization process ..... 131
Figure 5.2: Framework components used to develop BIE ..... 141
Figure 5.3: System architecture of BIE ..... 142
Figure 5.4: User interface of the Business Intelligence Explorer ..... 144
Figure 5.5: Formulae used in co-occurrence analysis ..... 148
Figure 5.6: Formulae used to compute normalized cut ..... 151
Figure 5.7: Steps in using a genetic algorithm for recursive Web graph partitioning ..... 153
Figure 5.8: A simplified example of GA graph partitioning ..... 153
Figure 5.9: Result list browsing method ..... 155
Figure 5.10: Web community browsing method ..... 155
Figure 5.11: Knowledge map browsing method ..... 156
Figure 5.12: Kartoo map browsing method ..... 156
Figure 6.1: Framework components used to develop BSA ..... 185
Figure 6.2: System architecture of BSA ..... 187
Figure 6.3: A business stakeholder Web page of ClearForest ..... 193
Figure 6.4: Formulae and procedure in the thresholding method ..... 194
Figure 6.5: Front page of Business Stakeholder Analyzer ..... 198
Figure 6.6: Business stakeholders of Siebel ..... 198
LIST OF TABLES

Table 2.1: Definitions of the word "search" ..... 33
Table 2.2: Definitions of the word "browse" ..... 34
Table 3.1: Detailed applications and evaluations of the framework ..... 77
Table 4.1: Comparing major Chinese search engines ..... 88
Table 4.2: Information sources of CBizPort ..... 94
Table 4.3: Hypotheses tested in the experiment ..... 106
Table 4.4: Definitions of 15 dimensions of information quality and expert ratings ..... 112
Table 4.5: Searching and browsing performance of CBizPort and benchmark search engines ..... 114
Table 4.6: Results of users' subjective evaluations ..... 114
Table 4.7: Results of hypothesis testing ..... 115
Table 4.8: Subjects' profiles ..... 115
Table 4.9: A summary of subjects' verbal comments ..... 119
Table 5.1: A search of "knowledge management" on various search engines (September 2002) ..... 125
Table 5.2: A task by data type taxonomy for viewing collections of items ..... 129
Table 5.3: Summary of key statistics ..... 163
Table 5.4: p-values of various t-tests ..... 164
Table 5.5: Number of subjects who expressed a preference for Knowledge Map or Kartoo ..... 169
Table 6.1: Stakeholder types* considered in previous research ..... 178
Table 6.2: Companies selected as training examples ..... 188
Table 6.3: Stakeholder types used in manual tagging of Web pages ..... 189
Table 6.4: Examples of terms indicative of the partner/supplier/sponsor stakeholder type ..... 189
Table 6.5: Companies selected as testing examples ..... 195
Table 6.6: Hypotheses tested in this study ..... 199
Table 6.7: Results of hypothesis testing ..... 202
Table 6.8: Subjects' profiles ..... 203
Table 6.9: Within-class accuracies achieved by different methods ..... 203
Table 6.10: Subjects' preferences toward automatic stakeholder classification ..... 203
ABSTRACT
As the World Wide Web proliferates, the amounts of data and information available
have outpaced human ability to analyze them. Information overload is becoming ever
more serious. Effectively and efficiently discovering knowledge on the Web has become
a challenge.
This dissertation investigates an automatic text mining framework for knowledge
discovery on the Web. It consists of five generic steps: collection, conversion, extraction,
analysis, and visualization. Input to and output of the framework are respectively Web
data and knowledge discovered after applying the steps. Combinations of data and text
mining techniques were used to assist human analysis in different scenarios. The research
question was how knowledge discovery can be enhanced by using the
framework. Three empirical studies applying the framework to business intelligence
applications were conducted.
First, the framework was applied to building a business intelligence search portal that
provides meta-searching, Web page summarization, and result categorization. The portal
was found to perform comparably to existing search engines in searching and browsing.
Users liked its search and analysis capabilities. Thus, the framework can be used to
analyze and integrate information distributed in heterogeneous sources.
Second, the framework was applied to developing two browsing methods for
clustering and visualizing business Web pages. In terms of precision, recall and accuracy,
both outperformed list and map displays of search engine results. Users strongly favored
the methods' usability and quality. Thus, the framework facilitated exploration of
business intelligence from numerous results.
Third, the framework was applied to classifying Web pages into different business
stakeholder types. Experimental results showed that the framework could effectively help
classify certain frequently appearing stakeholder types (e.g., partners). Users strongly
preferred the efficiency and capability of this application. Thus, the framework helped
identify and extract business stakeholder relationships.
In conclusion, our framework alleviated information overload and enhanced human
analysis on the Web effectively and efficiently. The research thereby contributes to
developing a useful and comprehensive framework for knowledge discovery on the Web
and to achieving better understanding of human-computer interaction.
CHAPTER 1. INTRODUCTION
1.1
Background
Advances in electronic network and information technology support ubiquitous
access to and convenient storage of information. They have changed human lives
fundamentally by bringing far-apart people close and distributed information together
(Negroponte, 2003). Work that previously required actual travel, hard copies or access to
physical libraries can now be done through the global electronic network. People connect
to the network that provides real-time communication and instant transactions, distributes
electronic copies of documents, and serves as digital libraries for information seekers.
The Internet has emerged as the largest global electronic network and digital library
in the world. As a major interface for accessing the Internet, the World Wide Web (or the
Web) consists of over 2 billion documents and is estimated to be growing by 7.3 million
pages per day (Lyman and Varian, 2000). Although Web growth has been slowing for the
past two years, empirical evidence suggests that it still has not reached its full potential,
as shown by the internationalization of the Web and the comparatively limited progress in making information more discoverable than merely available (O'Neill et al., 2003).¹
¹ The Web Characterization Project (http://wcp.oclc.org/) conducts an annual Web sample to analyze trends in the size and content of the Web. Analysis based on the sample is publicly available. The sample is obtained by creating a list of randomly generated IPv4 addresses, and then attempting to connect to Port 80 at each address to identify the presence of public Web services. If an HTTP service is identified, harvesting software captures the site and stores it for future analysis. The scope of the sample is confined to publicly available Web content only.
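The sampling procedure described in this footnote can be illustrated with a short, hypothetical sketch; the code below is not the Web Characterization Project's actual harvesting software, only an assumed illustration of probing randomly generated IPv4 addresses on Port 80.

# Hypothetical illustration of the sampling procedure described above:
# generate random IPv4 addresses and test whether an HTTP service answers on port 80.
import random
import socket

def random_ipv4():
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def has_public_web_service(address, timeout=1.0):
    try:
        # A successful TCP connection on port 80 suggests a public Web service.
        with socket.create_connection((address, 80), timeout=timeout):
            return True
    except OSError:
        return False

candidates = [random_ipv4() for _ in range(20)]
responding = [ip for ip in candidates if has_public_web_service(ip)]
print(f"{len(responding)} of {len(candidates)} random addresses responded on port 80")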
As the Web provides a seemingly unlimited capacity to facilitate information storage
and retrieval in today's world, it also poses problems for information seekers. Convenient
storage of information on the Web has made information exploration difficult. The flood of information, a problem commonly known as "information overload" (Bowman et al., 1994), makes it hard for users to get a big picture of the vast
information space. Another closely related problem is heterogeneity and unmonitored
quality of information on the Web (Bowman et al., 1994; Graham and Metaxas, 2003).
Because information is distributed in different (or heterogeneous) Web repositories, users
cannot easily obtain a comprehensive coverage of the topic they are studying. The quality
of information varies greatly because anyone can post any information on the Web.
Furthermore, while it is easy to access a large number of Web repositories nowadays, it is
difficult to identify the relationships among interconnected Web resources. For example,
a company's manager may not know its business stakeholders on the Web.
As businesses increasingly consider the Internet to be a major source of information
(Futures-Group, 1998), they need a higher level of understanding (e.g., knowledge) than simple information retrieval and data access. Effectively and efficiently discovering knowledge (business intelligence) from the vast amount of information on the Web thus has challenged researchers and practitioners. Better understanding of the interaction between
business analysts and information technologies is needed.
Historically, information technologies (IT) have been playing a key role in facilitating human understanding and analysis of information. In the 1970s, researchers in the
information retrieval field (e.g., (Salton et al., 1975a; van Rijsbergen, 1979)) developed
important techniques for automating text processing and laid a firm foundation for later
research. In the 1980s, many efforts were made to achieve machine learning of human
behaviors (e.g., (Jain and Dubes, 1988; Minsky, 1982; Simon, 1983)), resulting in
promising techniques in artificial intelligence, pattern recognition, information clustering
and summarization. In the 1990s, information visualization, Web mining, knowledge
management systems emerged as key technologies to enhance human cognition of a large
amount of information, especially information on the Web.
1.2
Research Questions
Despite abundant IT, the amount of information overwhelms technologies. Manual
processing and analysis are still needed to obtain higher levels of understanding, and
managers and business analysts still are required to manually sift through large amounts
of information retrieved from Internet search engines. In the 21st century, an automatic
framework is expected to facilitate such analysis by means of information technologies
that not only support information retrieval and access but also facilitate analysis,
integration and discovery.
In contrast to the vibrant growth of the Web, human-computer interaction (HCI)
aspects of employing IT to assist business analysts to discover knowledge (business
intelligence) on the Web have not been widely explored. Against this background, we
undertake three research questions in this dissertation:
1. How can an automatic text mining framework be applied to addressing the problems
of knowledge discovery on the Web?
2. How effectively and efficiently does such a framework assist human beings in
discovering knowledge on the Web?
3. What lessons can be learned from applying such a framework in the context of
human-computer interaction (HCI)?
1.3
The Dissertation
This dissertation research investigated application of an automatic text mining
framework to assisting human analysts in discovering knowledge on the Web. The
framework involves collection, conversion, extraction, analysis, and visualization of Web
data. Combinations of data and text mining techniques in the framework were used to
provide flexibility in helping human analysis in different scenarios.
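To make the flow of the five steps concrete, the following minimal sketch expresses them as a simple pipeline. The Python code and every name in it (WebDocument, collect, convert, extract, analyze, visualize, and the trivial processing they perform) are illustrative assumptions introduced only for exposition; they are not the systems developed in this dissertation.

# Minimal, hypothetical sketch of the five generic steps of the framework:
# collection, conversion, extraction, analysis, and visualization.
from dataclasses import dataclass, field

@dataclass
class WebDocument:
    url: str
    raw_bytes: bytes
    text: str = ""
    terms: list = field(default_factory=list)

def collect(seed_urls):
    # Collection: fetch Web data (stubbed with canned bytes here).
    return [WebDocument(url=u, raw_bytes=b"<html>web mining discovers knowledge</html>")
            for u in seed_urls]

def convert(doc):
    # Conversion: decode raw bytes into plain text.
    doc.text = doc.raw_bytes.decode("utf-8", errors="ignore")
    return doc

def extract(doc):
    # Extraction: pull simple word features out of the text.
    doc.terms = [t.lower() for t in doc.text.split() if t.isalpha()]
    return doc

def analyze(docs):
    # Analysis: count term occurrences across the collection.
    counts = {}
    for d in docs:
        for t in d.terms:
            counts[t] = counts.get(t, 0) + 1
    return counts

def visualize(counts, top_n=10):
    # Visualization: print the most frequent terms (a stand-in for a real display).
    for term, n in sorted(counts.items(), key=lambda kv: -kv[1])[:top_n]:
        print(f"{term}: {n}")

if __name__ == "__main__":
    docs = [extract(convert(d)) for d in collect(["http://example.com"])]
    visualize(analyze(docs))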
To demonstrate the usability and value of the framework, the research conducted
three empirical studies applying automatic text mining to business intelligence
applications. These addressed problems of information exploration, information
heterogeneity and unmonitored quality, and relationship extraction from information on
the Web to examine whether the framework benefited human analysis and led to better
understanding of HCI.
While other approaches might be used, we chose to study the automatic text mining
framework because most resources on the Web are currently text-based. We decided not
to focus on only one technique in our framework, believing that different techniques provide different advantages and that combining them could yield better performance in knowledge discovery.
1.4
Structure and Writing Style of the Dissertation
The following chapters are organized as follows. Chapter 2 surveys the literature in
three areas related to this research: knowledge and knowledge management, human-computer interaction, and text mining for Web analysis. Chapter 3 identifies gaps found in previous research, presents an automatic text mining framework for knowledge discovery on the Web, and explains the domain of study. Chapters 4 to 6 describe three empirical
studies conducted to validate the framework and answer the research questions. Chapter 7
concludes the dissertation, summarizes its contributions, and discusses future directions.
Although the author is accountable for more than 95% of the work reported in this
dissertation, he acknowledges contributions from the individuals and organizations
mentioned in the "Acknowledgments" section (page 4) by using the pronoun "we"
instead of "I."
CHAPTER 2. LITERATURE REVIEW
In today's organizations, knowledge is considered a crucial resource, like manpower,
land, and money (Drucker, 2002). However, knowledge is intangible, often embedded in
and carried through different entities such as documents, policies, Web pages, business
stakeholders, and organizational culture (Alavi and Leidner, 2001). In particular, the
World Wide Web has become a major knowledge repository, providing convenient
information retrieval but also posing challenges to discovering hidden knowledge.
Human analysis and information technologies (IT) play important roles in discovering
knowledge on the Web. Understanding related technologies and the interplay between
humans and IT provides useful guidance to effective knowledge discovery on the Web.
We begin our review in this chapter by describing the nature, classification, and
management of knowledge, with an aim to establish the background of this research. We
then review previous research in the field of human-computer interaction that provides
insights on human analysis needs in knowledge discovery. Next, we survey text mining
technologies applied to Web analysis, with a focus on business intelligence applications.
The purpose is to investigate how human analysis can be supported by automated
techniques. For each area, we survey related work, explain its relevance to the
dissertation research, and identify gaps that need to be closed. Important linkages among
different areas are described.
2.1
Knowledge and Knowledge Management
As business activities are increasingly conducted over the Web, an understanding of
different views of "knowledge" can help to reveal the underlying assumptions about
knowledge management, an important process in today's networked organizations (Hacki
and Lighton, 2001).² Various perspectives have been proposed from which to define
"knowledge" and its role in organizations.
2.1.1 Definition of Knowledge
There is a rich literature discussing knowledge. The notion of knowledge first appears
in religious and philosophical writings, which laid foundations for later developments in
science, technology, and management. Religious literature such as the Christian Bible
describes "knowledge" as a gift from God (Solomon, 940 B.C.). Classical Greek
philosophers, including Socrates, Plato, and Aristotle, established the rationalist tradition
(later on elaborated by Descartes) and posited that knowledge came from prior learning
and innate ideas. Such a view was in contrast to the empiricist tradition, led by Bacon and
Locke, that considers knowledge to be induced based on data and experience from the
external world (Bacon, 1620; Jurafsky and Martin, 2000). Similarly to the empiricist tradition, the Chinese philosophical view of knowledge is based on empirical investigation of the natural environment, though it is tightly tied to human morality (Confucius, 500 B.C.).³ The two traditions were echoed by Pemberton, who identifies the subjectivist and the objectivist approaches to studying knowledge (Pemberton, 1998). The former approach considers knowledge to be limited to personal experience, thus knowledge sharing depends on shared mental models. The latter approach proposes that people can know a world of objects or concepts outside their limited experience, and is most familiar to information scientists of the modern world. More detailed reviews of the philosophical evolution of knowledge can be found in Polanyi (1962), Hospers (1967), Dancy (1985), Hallis (1985), Moser and Nat (1985), and Winograd and Flores (1986).

In the information science and technology literature, knowledge is commonly distinguished from data and information (e.g., Ackoff (1989), Maglitta (1995), Vance (1997), Bellinger et al. (2000), and Chen (2001)). Data are raw facts, information is processed data, and knowledge is information analyzed in a person's mind or information made actionable. A hierarchy of understanding has been proposed (Nunamaker et al., 2001; Briggs et al., 2002) that captures the context and nature of different understandings (see Figure 2.1). The hierarchy encompasses a broad range of understandings but has been criticized for a lack of scrupulous evaluation (Alavi and Leidner, 2001). An alternative to the hierarchy of understanding was presented by Tuomi (1999). He considered that the hierarchy of data-information-knowledge should be turned around because knowledge is the basic driving force of the other two types of understanding. In his view, data can emerge only if a meaning structure, or semantics, is first fixed and then used to represent information (e.g., information stored in a semantically well-defined computer database) and there are no "isolated facts" unless someone has created them using his or her knowledge. However, the view implicitly assumes that human beings possess the knowledge required to create the semantics and structure of information and data. Such an assumption may not always hold in the context of knowledge discovery on the Web, because users typically may not have substantial knowledge of the tasks at hand (e.g., searching for information on an unfamiliar topic on the Web).

² According to Hacki & Lighton (2001), a "networked company" acts as an orchestrator to draw together participating companies by facilitating the exchange of information among them. The orchestrator chooses both its partners and the standard to exchange information about customers and products, thus having a better position to manage and profit from their growth. Other characteristics of a networked company include: rigorous performance standards maintained mostly through customer evaluations and partner incentives built into the network; the sharing of benefits generated by the network with all partners; an on-line presence for all key business processes; and the development and dynamic testing of new opportunities with network partners. Examples of successful networked companies include Cisco, Charles Schwab, and eBay.com.

³ From "The Great Learning": "Wishing to be sincere in their thoughts, they first extended to the utmost their knowledge. Such extension of knowledge lay in the investigation of things." (Available online at http://uweb.superlink.net/~fsu/daxue.html, retrieved on January 29, 2004)
Level                Meaning
Data                 Understanding of symbols
Information          Understanding of relationships among data
Knowledge            Understanding of patterns, processes, and context
Wisdom and judgment  Understanding of the principles, causes, and consequences that give rise to intellectual and ethical positions

Figure 2.1: The hierarchy of understanding (noise detection ranges from easy at the data level to difficult at the wisdom level)
Alternative perspectives of knowledge have also been proposed. Huber defines
knowledge as a justified belief that increases an individual's capacity to take effective
action (Huber, 1991; Alavi and Leidner, 1999; Nonaka, 1994). Various researchers view
knowledge as a state of mind, an object, a process, a condition of having access to
information, or a capability (Alavi and Leidner, 2001).
In the management literature, knowledge is considered an important resource of the
firm (Nonaka and Takeuchi, 1995; Spender, 1996). This perspective originates from the
resource-based theory of the firm (Penrose, 1959; Wernerfelt, 1984; Barney, 1991). It
postulates that a firm's knowledge affects how its tangible resources (e.g., raw materials,
manpower) are combined to provide services and products. Thus knowledge has an
economic value to a firm. Peter Drucker, widely regarded as the father of modern management, views knowledge as the key economic resource and the dominant source of competitive advantage in a post-capitalist society (Drucker, 1993; Drucker, 1999). In his
opinion, power comes from transmitting information to make it productive, not from
hiding it (Drucker, 1995). Effective management of knowledge is therefore crucial to the
success of a firm.
2.1.2 Classification of Knowledge
Just as knowledge has been defined differently, there are also different classifications
of knowledge, ranging from general to organizational classification schemes.
Understanding different types of knowledge resources helps businesses to manage their
intangible knowledge assets better.
A widely accepted classification system is the tacit and explicit dimensions of
knowledge (Polanyi, 1967), which have been applied to studying the creation of
knowledge in organizations (Nonaka, 1994). Tacit knowledge is based on one's
experience and involvement in a certain context (e.g., skills of persuading others to buy a
product). Explicit knowledge is codified and communicated in symbolic form (e.g., the
manual of how to operate a machine contains explicit knowledge about the machine).
Apart from these two types of knowledge, there exist other classification schemes, as
summarized in the knowledge taxonomies in Alavi and Leidner (2001). Individual
knowledge is created by and inherent in the individual (e.g., personal experience gained
from using a machine) whereas social knowledge is created by and inherent in collective
actions of a group (e.g., discipline in an army). Declarative knowledge deals with the
content (know-what) of a certain object (e.g., what a piece of computer software does), while
procedural knowledge deals with the practical use (know-how) of the object (e.g., how to
use the computer software to solve a problem). Causal knowledge is the understanding of
the reasons (know-why) behind an event (e.g., understanding why a computer program
works). Conditional knowledge is the understanding of the timing (know-when) of an
event (e.g., when to use the computer program to solve the problem). Relational
knowledge is the understanding of how an object relates to another (know-with) (e.g.,
how the program interacts with a certain operating system). Pragmatic knowledge is the
insights useful to an organization (e.g., best practices, business frameworks or models,
project experiences).
Organizational knowledge resources have been classified into schematic resources
and content resources, based on a hierarchical framework derived from a Delphi study in
which 122 knowledge management experts participated (Holsapple and Joshi, 2001).
Schematic resources depend on the organization for their existence. They include the
culture, infrastructure, purpose, and strategy of the organization. Collectively, they
establish the organization's identity. Content resources exist independently of the
organization to which they belong. They include participants' knowledge (e.g., an
employee's knowledge of a job) and artifacts (e.g., policy manual). Many of these
resources are in a textual format, which can facilitate efficient storage, retrieval, and
transfer especially on the Web. However, it is difficult to discover knowledge from a vast
amount of text. A study found that 74% of respondents believed that their organization's
best knowledge was inaccessible (Gazeau, 1998). We believe that much knowledge
useful to organizations is embedded on the Web, which has become one of the top five
sources of business information nowadays (Futures-Group, 1998). That knowledge, if
properly extracted and discovered, can be a precious resource for an organization to compete
in its environment.
2.1.3 Knowledge Management
The management of knowledge resources has recently been identified as an important
goal of organizations (Spender, 1996; Alavi and Leidner, 2001), as exemplified by the
fact that over a quarter of Fortune 500 companies have employed a Chief Knowledge
Officer (Steward, 1998). Rooted in a long history (Wiig, 2000) and widely heralded in
the 1990s, knowledge management (KM) is regarded as the process of creating,
capturing, organizing, retrieving, sharing, and applying knowledge for business functions
and decisions (O'Leary, 1998; Alavi and Leidner, 2001; Chen, 2001; Morrow, 2001). KM
is distinguished from data management and information management by the presence of
humans, expertise, and context in exercising knowledge (Blair, 2002).
KM projects aim to make knowledge visible through maps, yellow pages and
hypertext tools, to encourage knowledge sharing within organizations, and to build a
knowledge infrastructure such as a network connecting people, tools, and systems
(Davenport and Prusak, 1998). It implies that both technologies and people are important
to devising strategies for knowledge management. Such an implication is confirmed by a
three-stage Delphi study of 2,073 KM practitioners and managers (King et al., 2002).
Strategic and operational management were found to include the most important (13 out
of 20) KM issues.
Two strategies for knowledge management have been proposed: the codification
strategy and the personalization strategy (Hansen et al., 1999). The codification strategy
captures knowledge from
individuals and artifacts, and codifies and stores it into
knowledge repositories to allow reuse (Zack, 1999). Computers are used to store the
knowledge. In contrast, the personalization strategy emphasizes knowledge sharing
through direct person-to-person contacts. Computers are used to help people
communicate knowledge, not to store it. Both strategies require significant efforts to
capture and transfer knowledge, either through technologies or personal contacts.
Traditionally, humans play a crucial role in transforming lower level understandings
(e.g., data and information) into knowledge. It has been argued that such transformation is
idiosyncratic and personal, and thus a technology-driven paradigm should not be applied
to knowledge management (Morrow, 2001). However, with the advanced information and communication technologies developed in the 1980s and 1990s, automating parts of the human analysis process may become possible (Fayyad et al., 1996). Therefore, it
is useful to review different human analysis needs and then investigate how the analysis
can be automated. The discipline of human-computer interaction (HCI) provides a wealth
of research in human models and theories.
2.2 Human-Computer Interaction
Human-computer interaction (HCI) is a discipline concerned with the design,
evaluation and implementation of interactive computing systems for human use and with
the study of major phenomena surrounding them (Hewett et al., 1996). The goal is to
systematically apply knowledge about human purposes, capabilities and limitations, and
machine capabilities and limitations to enable human beings to do things that could not
be done before (Norman, 1995b). Various models and theories in human information
processing have been proposed over the past three decades (Carroll, 1997; Olson and
Olson, 2003).
2.2.1 Evolution of Human-Computer Interaction
HCI originated from psychology but borrowed computer science concepts to study
human information processing. The term "software psychology" was used in the 1970s to
represent what we nowadays refer to as HCI (Shneiderman, 1980). A computer metaphor
for human mental processing was firmly established in the early development of HCI
(Neisser, 1967; Lindsay and Norman, 1977). The study of human information processing
has been categorized into studies of sensation, perception, cognition, and motor control
(Norman, 1995a). Sensation refers to human responses to sight, sound, touch, olfaction,
and taste (with the latter two playing little role in HCI). Cognition is subdivided into
attention, categorization, learning and expertise, problem solving, performance, memory,
and mental models. The study of cognition⁴ has been mostly influenced by such
disciplines as artificial intelligence, linguistics, philosophy, and psychology (Laird et al.,
1987; Simon, 1981; Winston, 1984).
⁴ The frequently mentioned "information overload" problem is related to human cognition. In many tasks, there is a high demand on cognitive resources, which are often in short supply. Information processing typically places a high cognitive load on humans. Thus the design of information systems needs to consider the cognitive load that may be produced. Factors correlating directly with cognitive load include learning time, fatigue, stress, proneness to error, and inability to "time-share" (Norman, 1995a).
In the early 1980s, much effort was put into human mental modeling (e.g., Anderson (1983), Fodor (1983), Gentner and Stevens (1983), and Gardner (1985)). The cornerstone
in these efforts was The Psychology of Human-Computer Interaction (Card et al., 1983),
in which the authors tried to create a scientific base for an applied psychology concerned
with the human users of interactive computer systems. They developed a model called
GOMS that represents a user's cognitive structure in terms of goals, operators (actions
that a user takes), methods (sequences of sub-goals and operators carried out to achieve
goals), and selection rules (for choosing among different possible methods for reaching a
particular goal). GOMS predicts the methods that a skilled person will employ to carry
out editing tasks and the time that will be taken. In spite of promise for a scientifically
grounded HCI design (Newell and Card, 1985), the actual impact has been narrow.
GOMS has been criticized as too low level, too limited in scope, and too difficult to apply
(Olson and Olson, 1990).
Despite the drawbacks of GOMS, cognitive modeling still received much attention
during the late 1980s (e.g., Minsky (1986)). Norman presents a task performance model
that includes seven activities: goals, intention, action specification, execution, perception,
interpretation, and evaluation (Norman, 1986). Although not quantitative, the model
provides insights for subsequent model development. In particular, human information
processing in an electronic environment (e.g., online retrieval systems, the Internet)
started to draw attention from researchers.
2.2.2 HCI and the Web
In the 1990s, two streams in HCI emerged: (1) sociological and anthropological
aspects of computer use in human lives; and (2) human information seeking in an
electronic environment. In the former stream, HCI researchers started to study cognitive,
developmental, and cultural psychology in computer design. Activity theory, a major
approach in this stream, focuses on how human interactions affect individual, social and
cultural development (Bodker, 1991). It sees computer users as individuals continually
being shaped by their ongoing pattern of activities. Cognitive artifact analysis, another
major approach, tries to carefully study the ways that artifacts (e.g., text categorization
tools, visualization aids) are used and to understand what features are responsible for
their success, and why (Carroll et al., 1991; Norman, 1991).
A manifestation of the attention to social and organizational use of computers is the
emergence of computer supported collaborative work (CSCW) as a field. CSCW tools
that HCI researchers have studied include electronic mail (Andersen et al., 1995),
electronic meeting support (Nunamaker et al., 1991b), conferencing tools (Finn et al.,
1997), messengers (Erickson et al., 1999), and groupware (Grudin, 1994).
In the human information seeking stream of HCI research, researchers typically adopt
a process model. Their work has been extended to study the electronic environment (e.g.,
the Internet). The process consists of various stages of problem identification, problem
definition, problem resolution, and solution presentation (Wilson, 1999). Variations of
the process model also can be found in the literature (Marchionini, 1995; Kuhlthau, 1998;
Sutcliffe and Maiden, 1998). For example, Bates' model for information search, called
"berrypicking," captures the idea of an evolving multi-step search process as opposed to a
system that supports submitting single queries alone (Bates, 1989). Kuhlthau found that
high school students began research assignments by conducting general browsing and
performed more directed search as their understanding of the subject increased
(Kuhlthau, 1991). Ellis studied the patterns of academic information-seeking behavior
and found six features of social scientists' individual information-seeking patterns (Ellis,
1989). These features are starting, chaining, browsing, differentiating, monitoring, and
extracting.
Given the proliferation of the Web in the 1990s, two areas of information seeking have attracted much attention from researchers: searching and browsing.
2.2.2.1 Searching
Searching has been one of the most frequently used functions of Web portals. Table
2.1 shows definitions of the word "search" from three dictionaries that reveal the extent
of search behavior.
Table 2.1: Definitions of the word "search"
To look carefully for something or someone (Sinclair et al., 1998)
To examine (a place, vehicle, or person) thoroughly in order to find something or
someone (Hornby and Cowie, 1987)
To look into or over carefully or thoroughly in an effort to find or discover something
(Mish et al., 2003)
Sutcliffe and Ennis succinctly described four stages in their process model of
information searching: problem identification, need articulation, query formulation and
results evaluation (Sutcliffe and Ennis, 1998). By "information searching," they consider
a range of behaviors from goal-directed information searching, where the user has a specific target in mind, to more serendipitous or exploratory information browsing when the only goal is to explore the information repository. Depending on the degree of goal-directedness, searching and browsing can occur differently in each of the four stages. In
directed searching, the user first decomposes his goal into smaller problems, then
expresses his needs as concepts and higher level semantics. He next formulates queries
using such supports as Boolean query languages and syntax directed editors and finally
evaluates the results by serial search or systematic sampling. In contrast, browsing tends
to be less goal-directed and has been examined by many researchers.
2.2.2.2 Browsing
Browsing is an activity in which users of the World Wide Web frequently engage.
Table 2.2 shows definitions of the word "browse" from three dictionaries.
Table 2.2: Definitions of the word "browse"
To look at things in a fairly casual way (Sinclair et al., 1998)
To read (parts of a book or books) without any definite plan, for interest or
enjoyment (Hornby and Cowie, 1987)
To look over or through an aggregate of things casually especially in search of
something of interest (Mish et al., 2003)
All these definitions convey a meaning of "casual reading." But they do not explain
the operational details of browsing. On the other hand, Marchionini and Shneiderman
(1988) define browse as "an exploratory, information seeking strategy that depends upon
serendipity ... especially appropriate for ill-defined problems and for exploring new task
domains." Chang and Rice (1993) state that browsing is a direct application of human
perception to information seeking, in both electronic and non-electronic environments.
Spence (1999) defines "browse" as the registration (or elicitation or assessment) of
content such that browsing answers the question "what's there?" but without integrating
the result into some structure or map. In his framework for navigation, browsing is
similar to the act of perception (Solso, 1988) in which the result of perception is held
momentarily in sensory storage. A typical example is the scanning of a restaurant menu
to see what is available.
While Spence (1999) considers that no search is involved in browsing, Carmel et al.
(1992) deem that the act of searching is present in browsing. Based on a cognitive study
of hypertext browsing, they have identified three browsing strategies: scan-browse
(scanning for interesting information without review), review-browse (scanning and
reviewing to integrate information into the user's mental model in the presence of
transient browse goals), and search-oriented browse (scanning and reviewing information
relevant to a fixed task). According to Carmel et al. (1992), users search for specific
information when they adopt a search-oriented strategy in browsing. In a cognitive study,
they found that most subjects primarily used review-browse interspersed with search-oriented browse. Thus, the component of search is included in browsing.
Similarly, Marchionini (1987) states that browsing connotes an informal search
process characterized by the absence of planning, using techniques ranging from random
and informal to systematic and formal. Considering the motives and operational details of
browsing from previous research, "browse" can be considered an exploratory information
seeking process characterized by the absence of planning, with a view to forming a
mental model of the content being browsed.
Sutcliffe and Ennis (1998) describe exploratory browsing in their information
searching process model. The user first transforms his general information need into a
problem, then articulates his needs as search terms or hyperlinks that appear on the
system interface, searches using the terms or explores the hyperlinks using such browse
supports as concept maps, automatic summarization, and hypertext, and finally evaluates
the results by scanning through them.
2.2.3 Knowledge Discovery Processes
Through the processes of searching and browsing in HCI, users can achieve higher
levels of understanding from data and information, often in the forms of intelligence or
knowledge. Previous research has described processes involved in knowledge discovery
that have different depths of analyses.
An intensive two-week study of Web-use activities revealed that knowledge workers
engaged in a range of complementary modes of Web information seeking in their daily
work (Choo et al., 2000). The 34 study participants, coming from seven companies and
holding jobs as IT technical specialists or analysts, marketing staff, consultants, etc.,
primarily utilized the Web for business purposes. Experimental findings confirmed that knowledge workers performed multiple analyses such as browsing, differentiating, monitoring, and extracting in the business domain. Value-adding processes such as
information seeking, monitoring, and extracting were observed.
Rooted in military strategy (Cronin, 2000; Nolan, 1999), the competitive intelligence
(CI) field also provides insights into various value-adding processes in knowledge
discovery. Taylor proposes a value-added CI spectrum consisting of four major phases:
organizing processes (grouping, classifying, relating, formatting, signaling, displaying);
analyzing processes (separating, evaluating, validating, comparing, interpreting,
synthesizing); judgmental processes (presenting options, presenting advantages,
presenting disadvantages); and decision processes (matching goals, compromising,
bargaining, choosing) (Taylor, 1986). Some authors add "evaluation" as a feedback loop
(Fuld et al., 2002). Through the different phases, transformations take place in the order
of data, information, informing knowledge, productive knowledge, and action.
An empirical study of CI implementation helps to identify four phases (Westney and
Ghoshal, 1994) similar to Taylor's CI spectrum. The data management phase consists of
acquisition, classification, storage, retrieval, editing, verification and quality control,
presentation, aggregation, distribution, and assessment. The analysis phase consists of
synthesis, hypothesis, and assumption building and testing. The implication and action
phases respectively concern how analysts should respond and what tasks should be
performed.
From our literature review, we summarize knowledge discovery into three processes:
• Information seeking: the process of locating useful information from a large amount of data or information that is potentially relevant. Information seekers often rely on searching and browsing to identify relevant information;
• Intelligence generation: the process of acquisition, interpretation, collation, assessment, and exploitation of the information obtained (Davies, 2002). Typically, such higher-level processes as filtering, classification, categorization, summarization, and visualization are involved;
• Relationship extraction⁵: the process of deriving patterns and relationships from data and information. In today's networked organizations (Hacki and Lighton, 2001), this process helps analysts to identify relationships previously unknown to the information seekers. The product often is personalized for them and provides contextual meaning to the tasks at hand.

⁵ According to Choo et al. (2000), extracting is the activity of systematically working through a particular source or sources in order to identify material of interest. As a form of retrospective searching, extracting may be achieved by directly consulting the source, or by indirectly looking through bibliographies, indexes, or online databases. Retrospective searching tends to be labor intensive, and is more likely when there is a need for comprehensive or historical information on a topic.
To enable human analysts to discover knowledge from textual information (the major
medium of communication on the Web), text mining emerged as a growing discipline in
the late 1990s (Trybula, 1999). The next section reviews the literature of technologies in
text mining.
2.3 Text Mining for Web Analysis
Text mining is the process of finding interesting or useful patterns in textual data and
information, and combines many of the techniques of information extraction, information
retrieval, natural language processing, and document summarization (Hearst, 1999;
Trybula, 1999). It provides a means of developing knowledge links and knowledge
sharing among people within organizations. Though the field is in its infancy, it has been
anticipated to have explosive growth in order to address growing information challenges
in organizations (Trybula, 1999).
Text mining evolved from the field of automatic text processing that emerged in the
1970s, and continues to be influenced by related fields of machine learning in the 1980s,
and data mining, knowledge discovery and Web mining in the 1990s. In recent years,
businesses increasingly rely on text mining to discover intelligence on the Web. The
following sections review the foundations, evolution, and current status of applying text
mining to Web analysis.
2.3.1 Automatic Text Processing
Researchers foresaw automatic text processing in the 1940s and 1950s because, for
the first time in human history, machines with interchangeable parts (e.g., automobile,
movie camera, telephone) could be constructed economically. Such economy of scale would not have been possible in previous centuries because the costs were too high, even though innovative ideas in computing had been proposed by researchers such as Leibniz and
Babbage. In 1945, Bush^ envisioned a device called "memex" in which a person could
store all his books, records, and communications, and which was so mechanized that it
might be consulted rapidly and flexibly (Bush, 1945). Subsequently, researchers
embarked on developing approaches and techniques to automate text processing (Sparck-Jones and Willett, 1997). Examples of such work include automatic indexing (Joyce and Needham, 1958; Luhn, 1961; Maron, 1961), probabilistic information retrieval (Maron and Kuhns, 1960), automatic information retrieval (Fairthorne, 1961), term association
mapping (Doyle, 1961), studying the indexing value of statistically derived vocabularies
(Doyle, 1962), and automatic language processing (especially machine translation) (Borko, 1967). Although these works are considered pioneering efforts in the field, it was not until the 1970s that more formal information retrieval models and techniques were proposed.
^ As Director of the Office of Scientific Research and Development in the US Government during World War II, Dr. Vannevar Bush coordinated the activities of some six thousand leading American scientists in the application of science to warfare.
Information retrieval models developed from disciplines such as logic-based inference, statistics, and set theory are largely quantitative, because they generally presuppose a careful formal analysis of the problem and its assumptions (Robertson, 1977).
2.3.1.1 Logic-based Model
A logic-based model uses Boolean logic to link query terms by operators AND, OR,
and NOT, and its retrieval system returns documents that have combinations of terms
satisfying the logical constraints of the query. While such a model is simple, it has
several limitations (Salton et al., 1983). Users need extensive training in order to
formulate queries that yield highly relevant results. The number of results also varies
greatly and cannot be controlled by users. Unranked results and the inability to specify the relative importance of different components of the query further undermine the usefulness of the model.
2.3.1.2 Vector Space Model
The Vector Space model has been the most influential model in the information retrieval field and in the development of operational information retrieval systems. Developed by Gerard Salton
(widely regarded as the father of information retrieval) and his colleagues (Salton et al.,
1975a), the Vector Space model employs a geometric interpretation of information
retrieval. Indexing terms are regarded as the coordinates of a multidimensional information space. Documents and queries are represented by vectors in which the i-th element denotes the weight of the i-th term, with the weight determined by the product of term frequency and inverse document frequency (Salton et al., 1975b), as shown below:

Term weight = Term frequency × log(Number of documents in the collection / Number of documents having the term)
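As an illustration of this weighting scheme, the sketch below computes the weight of a term over a small, hypothetical collection of tokenized documents.

```python
import math

# Toy collection of tokenized documents (hypothetical).
docs = [
    ["web", "mining", "web", "pages"],
    ["text", "mining", "for", "business", "intelligence"],
    ["business", "web", "portals"],
]

def term_weight(term, doc, docs):
    """Term frequency x inverse document frequency, as in the formula above."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)          # documents having the term
    idf = math.log(len(docs) / df) if df else 0.0   # log(N / df)
    return tf * idf

# Weight of "web" in the first document: tf = 2, df = 2, N = 3.
print(term_weight("web", docs[0], docs))
```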
Using the model, the process of indexing is thus seen as separating documents from
each other in the multidimensional term space, with similar documents having shorter
distances between one another. An automatic document retrieval system called SMART
was developed to show the robustness of the model (Salton, 1971). Matching a query
against clusters of documents has been shown to provide higher retrieval effectiveness
than a matching operation that fails to consider the similarity relationships between
documents (van Rijsbergen and Sparck-Jones, 1973).
The importance of the Vector Space model lies in the foundation it provides for a variety of retrieval operations, including indexing, relevance feedback, searching, document classification, document clustering, and document visualization. The model's advantages include its simplicity, its ease of implementation, and its ability to overcome the limitations of a Boolean model. Its disadvantages include the incorrect assumption that terms are independent of each other and the difficulty of specifying phrasal relationships within it.
2.3.1.3 Probabilistic Models
Probabilistic models originated from the work of Maron and Kuhns (1960) and their
strength has been demonstrated in a study of schemes for the weighting of query terms
using relevance information (Robertson and Sparck-Jones, 1976). The idea was to use
information about distribution of query terms in existing documents that have been
assessed for relevance to determine the probability of relevance of previously unjudged
documents. Since then, probabilistic approaches to information retrieval have developed rapidly (van Rijsbergen, 1979). One extensively studied approach is the inference network. For example, a Bayesian inference network
has been used to encode information into a network structure of various types of
document nodes (Turtle and Croft, 1992). Probabilities are associated with nodes and the
terms appearing in the documents. The outcome is an overall estimate of the probability
that a particular document satisfies a particular search need.
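A concrete instance of this idea is the relevance weight of Robertson and Sparck-Jones (1976). The sketch below computes one common formulation of that weight from relevance counts; the counts are hypothetical and the 0.5 offsets are the usual smoothing correction.

```python
import math

def rsj_weight(N, R, n, r):
    """Relevance weight of a query term.
    N: documents in the collection, R: judged relevant documents,
    n: documents containing the term, r: relevant documents containing the term.
    The 0.5 offsets are a standard smoothing correction."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# Hypothetical counts: 1,000 documents, 20 judged relevant,
# the term occurs in 100 documents, 15 of them relevant.
print(rsj_weight(N=1000, R=20, n=100, r=15))
```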
2.3.2 From Machine Learning to Data Mining
The proliferation of computers in the 1980s significantly increased the amount of
information stored in databases worldwide. That amount was estimated to double every
20 months in 1992 (Frawley et al., 1992) and every 9 months in 2002 (Fayyad and
Uthurusamy, 2002). Unfortunately, the rate of growth has far outpaced humans' ability to
analyze and understand data, resulting in tremendous unused "data tombs." In response,
techniques and approaches have been developed to provide analysis capability and to
extract knowledge from data. Machine learning and data mining were major efforts in the
1980s and early 1990s respectively.
2.3.2.1 Machine Learning
Extending from the work in probabilistic modeling, machine learning tries to enable
computers to adapt to new circumstances and to detect and extrapolate patterns (Russell
and Norvig, 1995). It has been defined as "any process by which a system improves its
performance" (Simon, 1983) and as "the study of computer algorithms that improve
automatically through experience" (Mitchell, 1997). Major approaches include neural
networks, symbolic learning, and genetic algorithms.
Neural networks use the human nervous system as a metaphor, making predictions based on learned information (Rumelhart et al., 1994). A neural network consists of a
graph of nodes and links, respectively representing neurons and synapses in a human
being. Through intensive computation and repeated iterations of learning from input
examples, knowledge is automatically acquired and stored in its mesh-like network.
Interconnected, weighted links can then be used to predict the output values given new
input values. Various types of neural networks have been developed. The
feedforward/backpropagation neural network is a fully-connected and multi-layered
network that is activated from the input layer, through the hidden layer, and to the output
layer (Rumelhart et al., 1986). Each layer consists of a number of nodes that are linked to
nodes on the next layer. Input examples are passed through the network and adjustments
are made upon each pass. The operations are repeated until the network stabilizes or a
certain number of iterations have been reached. Other types of neural networks include
the Kohonen self-organizing feature map (Kohonen, 1995), a two-layer network with a
set of input neurons and an output layer of neurons organized according to input values,
and the Hopfield network (Hopfield, 1982), in which each neuron is connected to each
other neuron, forming a self-connected network with a weight matrix.
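As a rough, self-contained illustration of the feedforward activation described above (not code from the dissertation), the sketch below propagates an input vector through one hidden layer with randomly initialized weights; training by backpropagation would iteratively adjust these weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny fully connected network: 3 inputs -> 4 hidden nodes -> 2 outputs.
W_hidden = rng.normal(size=(3, 4))   # weighted links, input layer -> hidden layer
W_output = rng.normal(size=(4, 2))   # weighted links, hidden layer -> output layer

def forward(x):
    """Activate the network from the input layer, through the hidden layer,
    to the output layer."""
    hidden = sigmoid(x @ W_hidden)
    return sigmoid(hidden @ W_output)

print(forward(np.array([0.2, 0.7, 0.1])))
```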
Symbolic learning techniques employ learning strategies such as rote learning, learning by being told, learning by analogy, learning from examples, and learning from discovery to induce a description or to identify patterns of the data being studied (Carbonell et al., 1983). Among the symbolic learning techniques developed, the ID3 decision-tree building algorithm (Quinlan, 1983) and its descendants such as C4.5 (Quinlan, 1993) are widely used for inductive learning. ID3 relies on an information-economics approach to minimize the uncertainty of information in building a decision tree to classify objects into distinct classes.
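A minimal sketch of the information-theoretic criterion behind ID3, computing the information gain of splitting on one attribute of a hypothetical labeled data set:

```python
import math
from collections import Counter

def entropy(labels):
    """Uncertainty (in bits) of a class label distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Reduction in entropy obtained by splitting on one attribute.
    Each example is (attribute_values_dict, class_label)."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [label for attrs, label in examples if attrs[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical training examples: does a page mention a partner? is it long?
data = [({"partner": "yes", "long": "no"}, "stakeholder"),
        ({"partner": "yes", "long": "yes"}, "stakeholder"),
        ({"partner": "no", "long": "yes"}, "other"),
        ({"partner": "no", "long": "no"}, "other")]
print(information_gain(data, "partner"))  # 1.0: this attribute fully separates the classes
```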
Genetic algorithms (GA) are evolutionary, stochastic algorithms that model the
natural processes of Darwinian survival of the fittest (Holland, 1975). Objects are
represented as chromosomes in a population that undergoes genetic operations such as
mutation and crossover to reproduce succeeding generations. Based on a fitness function,
the best chromosomes are selected to reproduce while less-fit chromosomes are removed.
The process converges after a large number of generations, and the best chromosome
represents the optimal solution.
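A toy sketch of these genetic operations (selection, crossover, and mutation) maximizing a simple bit-counting fitness function; all parameters are chosen arbitrarily for illustration.

```python
import random

random.seed(42)
LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(chromosome):
    # Toy fitness: number of 1-bits (the optimum is the all-ones chromosome).
    return sum(chromosome)

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Selection: keep the fitter half of the population.
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]
    # Crossover and mutation reproduce the next generation.
    children = []
    while len(children) < POP_SIZE - len(parents):
        a, b = random.sample(parents, 2)
        cut = random.randint(1, LENGTH - 1)
        child = a[:cut] + b[cut:]
        child = [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in child]
        children.append(child)
    population = parents + children

best = max(population, key=fitness)
print(fitness(best), best)  # the best chromosome approximates the optimal solution
```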
Some of the earliest efforts in applying neural networks to document retrieval include Preece (1981), Belew (1989), and Lin et al. (1991). The use of GA in document retrieval has also been explored (Gordon, 1988). Applications of neural networks, symbolic
learning, and genetic algorithms to information retrieval have been demonstrated in Chen
(1995).
2.3.2.2 Data Mining
Evolved from such fields as machine learning, statistics, artificial intelligence, and pattern recognition, knowledge discovery in databases (KDD) has been defined as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad et al., 1996). It is a multistaged process of
extracting previously unanticipated knowledge from large databases and applying the
results to decision making (Benoit, 2002). Data mining, a major step in KDD, involves
application of specific algorithms for identifying interesting structures in data, where
structure designates patterns, statistical or predictive models of the data, and relationships
among parts of the data (Fayyad et al., 1996; Fayyad and Uthurusamy, 2002). In the
context of automatic text processing, data mining helps to extract meaningful textual
patterns in the form of document clusters, meaningful terms, key sentences serving as
summary, and data models used for prediction and classification. Some major data
mining algorithms and their applications to automatic text processing are described
below.
Clustering is the process of grouping data items into naturally-occurring classes of
similar characteristics. Various clustering algorithms have been proposed (Jain and
Dubes, 1988). In information retrieval, textual terms and documents are often clustered to
assist analysis and understanding. Recognizing that humans rarely use the same term to describe a certain thing or phenomenon (known as the vocabulary (difference) problem^ (Furnas et al., 1987)), Chen et al. (1997) propose a statistics-based, algorithmic approach
called concept space to automatically construct networks of concepts to characterize
document databases. Central to the approach are a co-occurrence analysis based on an
asymmetric cluster function (Chen and Lynch, 1992) and associative retrieval using
spreading activation algorithms. The approach was also used to automatically generate a thesaurus to help users explore a large knowledge network of concepts. Symbolic branch-and-bound search and connectionist Hopfield net spreading activation were applied to the exploration (Chen and Ng, 1995).
^ In an experiment, Furnas et al. (1987) found a surprisingly large variability in spontaneous word choice for objects in five domains. In every case, two people favored the same term with probability less than 20%. This fundamental property of human language limits the success of various designs of methodologies for vocabulary-driven interaction (e.g., information retrieval systems).
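As a rough simplification of such co-occurrence analysis (not the concept space algorithm or its asymmetric cluster function), the sketch below computes an asymmetric association weight between terms of a hypothetical corpus based on shared documents.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical tokenized documents.
docs = [
    ["web", "mining", "business", "intelligence"],
    ["business", "intelligence", "tools"],
    ["web", "mining", "tools"],
]

doc_freq = defaultdict(int)          # term -> number of documents containing it
co_occurrence = defaultdict(int)     # (term_a, term_b) -> number of shared documents

for doc in docs:
    terms = set(doc)
    for term in terms:
        doc_freq[term] += 1
    for a, b in permutations(terms, 2):
        co_occurrence[(a, b)] += 1

def association(a, b):
    """Asymmetric association: how strongly term a suggests term b."""
    return co_occurrence[(a, b)] / doc_freq[a]

print(association("mining", "web"))       # 1.0: "mining" always co-occurs with "web"
print(association("tools", "business"))   # 0.5
```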
Summarization provides a compact description for a subset of data. Simple mean and
standard deviation of numerical data are examples of summarization. Text summarization
often relies on linguistic and statistical information to extract key phrases or sentences
from a document. Hearst (1994) has developed the TextTiling algorithm that partitions
expository text into coherent multi-paragraph discourse units that reflect the subtopic
structure of the texts. Using domain-independent lexical frequency and distribution
information, the algorithm recognizes the interactions of multiple simultaneous themes.
McDonald and Chen (2002) extend the TextTiling algorithm by using sentence-selection
heuristics to extract key sentences as summary for Web pages. Salton et al. (1997) use
inter-document link generation techniques to identify intra-document links between
passages of a document. Such knowledge of text structure was applied to automatic text
summarization that yielded satisfactory results. Other text summarization techniques and
evaluation issues have been reported in Mani and Maybury (1999).
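As a simplified illustration of sentence-selection summarization (not the TextTiling or sentence-selection heuristics cited above), the sketch below scores the sentences of a hypothetical text by the frequency of their non-stopword terms and keeps the top-ranked ones in their original order.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "for", "on"}  # illustrative subset

def summarize(text, n_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        terms = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[t] for t in terms) / (len(terms) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original ordering of the selected sentences.
    return " ".join(s for s in sentences if s in ranked)

text = ("Text mining finds useful patterns in text. "
        "The Web is a major source of text for business analysis. "
        "Summarization selects key sentences from a page. "
        "Cats sleep most of the day.")
print(summarize(text))
```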
Classification (or categorization) is the process of assigning data items into
predefined categories based on characteristics of the items. Text categorization labels
natural language texts with thematic categories from a predefined set. Techniques and
approaches applicable to text categorization include decision trees, regression models,
neural networks (see Section 2.3.2.1), Rocchio learning algorithm, and support vector
machines. A decision tree text classifier is a tree in which non-leaf nodes are labeled by
terms and leaf nodes are labeled by categories (Mitchell, 1997). It classifies a document
by recursively testing the weights that terms in non-leaf nodes have in the document
vector until a leaf node is reached. Regression models try to find a function that fits the training data and approximates the original, real-valued function (Fuhr and Pfeifer, 1994). For instance, linear least-squares fit regression represents each document as two
vectors (an input vector of weighted terms and an output vector of category weights),
minimizes the error on the training set, and categorizes new documents by determining
their output vectors (Yang and Chute, 1994). The Rocchio learning algorithm computes a classifier using Rocchio's relevance feedback formula (Rocchio, 1971), adapted to text categorization (first proposed in Hull (1994)). It rewards the closeness of a test document to the centroid of positive training examples and its distance from the centroid of negative training examples. Based on statistical learning theory (Vapnik, 1995), support vector machines (SVMs) attempt to find the best decision surface in a high-dimensional space to separate
positive examples from negative examples (Joachims, 1998). Term selection and
parameter tuning are often not needed in SVM. A more complete survey of text
categorization techniques can be found in (Sebastiani, 2002).
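A rough sketch of a Rocchio-style centroid classifier over toy term-weight vectors follows; the data are hypothetical, and beta and gamma are the usual centroid weighting parameters, with arbitrary values here.

```python
import numpy as np

def centroid(vectors):
    return np.mean(vectors, axis=0)

def rocchio_score(doc, positive, negative, beta=1.0, gamma=0.75):
    """Score a document against the positive-minus-negative centroid profile."""
    profile = beta * centroid(positive) - gamma * centroid(negative)
    # Cosine similarity between the document vector and the class profile.
    return float(doc @ profile / (np.linalg.norm(doc) * np.linalg.norm(profile)))

# Toy term-weight vectors (e.g., tf-idf over a tiny vocabulary).
positive = np.array([[1.0, 0.8, 0.0], [0.9, 1.0, 0.1]])   # training docs in the category
negative = np.array([[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]])   # training docs outside it
test_doc = np.array([0.8, 0.7, 0.05])

print(rocchio_score(test_doc, positive, negative))  # high similarity -> assign the category
```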
2.3.3 Web Mining
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services (Etzioni, 1996). It involves the
tasks of resource discovery on the Web, information extraction from Web resources, and
uncovering general patterns at individual Web sites and across multiple sites. Machine
learning techniques (such as those discussed previously) have been applied to Web
mining (Chen and Chau, 2004).
2.3.3.1 Resource Discovery and Collection on the Web
The discovery and collection of resources on the Web has long been a challenge to
researchers and practitioners. Search engines rely on a Web page collection program
(often called a spider or crawler) to automatically harvest resources on the Web. A core
of the program is the algorithm that guides the crawler to fetch Web pages, and various
techniques and applications have been developed to perform this task (Chen et al., 1998a;
Pant and Menczer, 2002; Srinivasan et al., 2002; Chau and Chen, 2003).
Given the exponential growth of the Web, it is difficult for any single search engine
to provide comprehensive coverage of search results. An empirical study has shown that
13 of 15 popular commercial search engines exhibited bias in their search results
(Mowshowitz and Kawaguchi, 2002). "Bias" exists when some results obtained from a
search engine occur more frequently or prominently with respect to the norm (formed by
pooling the results of a basket of search engines), while other results occur less frequently
or prominently. Another study found that users missed over 77% of the references they
would find most relevant because no search engine could return more than 45% of
relevant results (Selberg and Etzioni, 1995). Moreover, it was estimated that any single
search engine on the Web could cover only about 16% of the entire Web and its resource
collection could not catch up with the Web's exponential growth rate (Lawrence and
Giles, 1999).
Meta-searching has been shown to be a highly effective method of resource discovery
and collection on the Web. By sending queries to multiple search engines and collating
the set of top-ranked results from each search engine, meta-search engines can greatly
reduce bias in search results and improve coverage. Chen et al. (2001) showed that the
approach of integrating meta-searching with textual clustering tools achieved high
precision in searching the Web. In particular, their use of Arizona Noun Phraser (Tolle
and Chen, 2000) as a phrase indexing tool and of a self-organizing map as a Web page
categorization tool (Kohonen, 1995) showed promising results. Mowshowitz and
Kawaguchi (2002) concluded from their study that the only realistic way to counter the
adverse effects of search engine bias is to perform meta-searching. In addition, many
commercial meta-search engines allow the searching of various large search engines and
provide added functionality. MetaCrawler (http://www.metacrawler.com/)
provides analysis of relevance rankings from source search engines and elimination of
duplicates (Selberg and Etzioni, 1997). Vivisimo (http://www.vivisimo.com/)
automatically clusters the search results into different groups (Palmer et al., 2001).
Kartoo (http://www.kartoo.com/) graphically presents results as a network of
nodes that are linked by lines showing common terms appearing in the pair of result
pages.
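The sketch below illustrates the collation logic at the heart of meta-searching: ranked result lists from several engines are merged, duplicates collapse onto a single entry, and pages are re-ranked by how many engines returned them and how highly. The per-engine query functions are stand-ins; a real implementation would call each engine's own interface.

```python
from collections import defaultdict

def meta_search(query, engines, top_k=10):
    """Merge ranked result lists (lists of URLs) from several engines,
    scoring each URL by the sum of its reciprocal ranks."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query)[:top_k], start=1):
            scores[url] += 1.0 / rank   # rewards pages ranked highly by many engines
    return sorted(scores, key=scores.get, reverse=True)

# Stand-in engines returning hard-coded result lists for illustration only.
engine_a = lambda q: ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
engine_b = lambda q: ["http://example.com/b", "http://example.com/a", "http://example.com/d"]

print(meta_search("business intelligence tools", [engine_a, engine_b]))
```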
2.3.3.2 Pattern Extraction from the Web
To extract information and uncover patterns from Web pages or sites, three categories
of Web mining have been identified: Web content mining, Web structure mining, and
Web usage mining (Kosala and Blockeel, 2000).
Web content mining refers to the discovery of useful information from Web contents,
such as text, image, video, audio, and so on. It originates from the study of automatic
information retrieval and indexing where a Web page is represented by a vector of
weights of key terms (Salton and McGill, 1983). Previous works on Web content mining
include Web page categorization (Chen et al., 1996), clustering (Zamir and Etzioni,
1999), rule and pattern extraction (Hurst, 2001), and concept retrieval (Chen et al.,
1998b; Schatz, 2002).
Web structure mining refers to the analysis of link structures that model the Web. It
originates from the studies of social networks (Granovetter, 1973; Wasserman and Faust,
1994; Scott, 2000), bibliometrics (Borgman, 2002), and citation analysis (Garfield, 1972;
Small, 1973). Previous works on Web structure mining focus on resource discovery
(Chakrabarti et al., 1999b), Web page ranking (Brin and Page, 1998; Lempel, 2001),
authority identification (Kleinberg, 1999; Mendelzon and Rafiei, 2000), exploring the
underlying structure of the Web (Chakrabarti et al., 1999b; Broder et al., 2000; Kleinberg
and Lawrence, 2001), and locating communities on the Web (Gibson et al., 1998; Kumar
et al., 1999; Flake et al., 2000). Knowledge can be extracted from the Web by randomly
sampling the Web and using different methods to analyze Web link structure. It has been
pointed out that studies using these methods should be repeated at different times to
reveal evolution of the Web (Henzinger and Lawrence, 2004).
Many efforts have been made to combine Web content mining and Web structure
mining to improve the quality of analysis. For example, using a similarity metric that
incorporated textual information, hyperlink structure and co-citation relations, He et al.
(2001) proposed an unsupervised clustering method that was shown to identify relevant
topics effectively. The clustering method employed a graph-partitioning method based on
normalized cut criterion first developed for image segmentation (Shi and Malik, 2000).
Bharat and Henzinger (1998) augmented a connectivity-analysis-based algorithm with content analysis and improved precision by at least 45% over pure connectivity analysis. Chakrabarti et al. (1999a) augmented the HITS algorithm by
considering the anchor texts and showed that their system could be used to compile large
topic taxonomies automatically. Analyzing the textual similarity among linked
documents, Menczer (2004) developed a model that explains how a network of
documents evolves over time without central control.
Web usage mining studies techniques that can predict user behavior while the user
interacts with the Web. Web usage data (e.g., click sequence, length of time spent on
browsing a Web site/page) recorded in Web server log files are typically analyzed to
reveal users' behavioral patterns or interests. Knowledge of Web usage can contribute to
building e-commerce recommender systems (Pazzani, 1999), Web-based personalization
and collaboration (Adomavicius and Tuzhilin, 2001; Mobasher et al., 2000; Pazzani and
Billsus, 1997), Web interface design (Marchionini, 2002), and decision support (Chen
and Cooper, 2001).
2.3.4 Text Mining for Business Intelligence
As most resources on the Web are text-based, automated tools and techniques have
been developed to exploit textual information. For instance, Fuld et al. (2003) have
noticed that more business intelligence tools are now compatible with the Web. Although
text expresses a vast, rich range of information, it encodes this information in a form that
is difficult to decipher automatically (Hearst, 1999). Therefore, researchers have
identified text mining as a potential solution. Compared with data mining, text mining
focuses on knowledge discovery in textual documents and involves multiple processes.
2.3.4.1 Text Mining Processes
Trybula (1999) proposes a framework for knowledge discernment in text documents.
The framework includes several processes to transform textual data into knowledge: (1)
Information acquisition: The text is gathered from textbases at various sources, through
finding, gathering, cleaning, transforming, and organizing. Manuscripts are compiled into
a preprocessed textbase. (2) Extraction: The purpose of extraction is to provide a means
of categorizing the information so that relationships can be identified. Activities include
language identification, feature extraction, lexical analysis, syntactic evaluation, and
semantic analysis. (3) Mining: It involves clustering in order to provide a manageable
size of textbase relationships that can be evaluated during information searches. (4)
Presentation: Visualizations or textual summarizations are used to facilitate browsing and
knowledge discovery.
Although Trybula's framework covers important areas of text mining, it has several
limitations for text mining on the Web. First, there needs to be more preprocessing of
documents on the Web, because they exist in many formats such as HTML, XML, and
dynamically-generated Web pages. Second, efficient and effective methods are needed to
collect Web pages because they are often voluminous. Human collection does not scale to
the growth of the Web. Third, information on the Web comes from heterogeneous
sources and requires better integration and more discrimination. Fourth, more mining and
visualization options other than clustering are needed to reveal hidden patterns in noisy
Web data. In the commercial world, text mining software and tools have been developed
for discovering knowledge in the form of business intelligence.
2.3.4.2 Business Intelligence Tools and Techniques
Business intelligence (BI) tools enable organizations to understand their internal and external environments through the systematic acquisition, collation, analysis, interpretation, and exploitation of information. Two classes of intelligence tools have been defined (Carvalho and Ferreira, 2001). The first class of these is used to manipulate
massive operational data and to extract essential business information from them.
Examples include decision support systems, executive information systems, online-
analytical processing (OLAP), data warehouses and data mining systems. They are built
on database management systems and are used to reveal trends and patterns that would
otherwise be buried in their huge operational databases (Choo, 1998). The second class of
tools, sometimes called competitive intelligence tools, aim at systematically collecting
and analyzing information from the competitive environment to assist organizational
decision making. They mainly gather information from public sources such as the Web.
Fuld et al. (2003) found that the global interest in intelligence technology has
increased significantly over the past five years. They compared 16 BI tools based on a
five-stage intelligence cycle: (1) planning and direction, (2) published information
collection, (3) source collection from humans, (4) analysis, and (5) reporting and
information sharing. It was found that the tools have become more open to the Web,
through which businesses nowadays share information and perform transactions. There is
no "one-size-fits-all solution" because different tools are used for different purposes.
In terms of the weaknesses of BI tools, automated search capability in many tools can
lead to information overload. Despite improvements in analysis capability over the past
year (Fuld et al., 2002), there is still a long way to go to assist qualitative analysis
effectively. Most tools that claim to do analysis simply provide different views of
a collection of information (e.g., comparisons between different products or companies).
More advanced tools use text mining technology or rule-based systems to determine
relationships among people, places, and organizations using a user-defined dictionary or
dynamically generated semantic taxonomy. According to Fuld et al. (2003), because
existing BI tools are not capable of illustrating the landscape of a large number of
documents collected from the Web, their actual value to analysis is questionable. In
addition, only a few improvements have been made to reporting and information sharing
functions, although many tools integrate their reports with Microsoft Office products and
present them in a textual format.
2.3.5 Summary
Various text mining techniques have been developed to collect and discover resources
on the Web, to extract and organize useful information, to uncover hidden patterns, and
to present new insights. From our literature review, we believe the following text mining
techniques would be applicable to knowledge discovery on the Web.
Web content mining and Web structure mining can be used to collect and analyze
relevant resources on the Web. The use of domain spidering and meta-spidering helps
collect Web pages from relevant sources identified through related seed URLs or
keywords for querying multiple search engines. Analyzing the content of a large amount
of text on the Web provides statistics on occurrence of textual patterns that can be
extracted as meaningful entities. In addition, Web link structure reveals the social
relationships existing among resources on the Web. Techniques that exploit Web link
structure can be used to extract these relationships.
Web page summarization, a process of automatically generating a compact
representation of a Web page based on the page features and their relative importance,
can be used to facilitate understanding of search engine results. The summary helps to
save the time of analyzing the page content, especially when the page is long.
Web page classification, a process of automatically assigning Web pages into predefined categories, can be used to assign pages into meaningful classes. Analysts can rely
on this process to save their time when studying pages of different types.
Web page clustering, a process of identifying naturally-occurring subgroups among a
set of Web pages, can be used to discover trends and patterns within a large number of
pages. For example, clustering Web pages of many companies can help analysts obtain
an overview of the market situation. It facilitates the identification of market trends and
groups of similar market players.
Web page visualization, a process of transforming a high-dimensional representation
of a set of Web pages into a two- or three-dimensional representation that can be
perceived by human eyes, can be used to represent important knowledge as pictures.
Such abstraction is crucial to business analysis due to the large volume of data and
information involved. The results complement well the human visual ability of parallel information processing.
These techniques represent major categories of approaches to automating important
parts of human analyses. The following section summarizes the multiple areas reviewed in this chapter.
2.4 Summary of the Literature Review
In this chapter, we have reviewed three areas related to the dissertation: knowledge
and knowledge management, human-computer interaction, and text mining. Although
these areas appear to be disjoint, important relations exist in the context of knowledge
discovery on the Web. Widely studied in different disciplines, knowledge is fundamental
to human civilization. As the Web facilitates the storage and communication of human
knowledge in textual format, it also embeds important knowledge in voluminous and
heterogeneous Web resources. Effectively and efficiently discovering such knowledge
becomes a challenge. Traditionally, human beings play a central role in transforming data
and information into knowledge. Various processes such as information seeking,
intelligence generation, and relationship extraction have been studied in HCI research.
However, human analyses are not efficient and not scalable to the rapid growth of the
Web. They also are error-prone especially when the data size is large. Text mining
provides useful techniques and approaches to enhance human analysis on the Web. It has
evolved from the field of automatic text processing and was influenced by such fields as
machine learning, data mining, and Web mining. Several categories of text mining
techniques (identified in Section 2.3.5) can enhance human analysis, thereby facilitating
knowledge discovery on the Web. From the review, we believe that text mining holds
great promise for supporting human analyses. The interplay between various knowledge discovery processes and text mining techniques is interesting and deserves further
investigation.
CHAPTER 3. RESEARCH FORMULATION AND FRAMEWORK
After a review of multiple disciplines related to this dissertation, this chapter
identifies gaps found in previous research and proposes our framework to close them. We
explain the rationale, components, and applications of the framework, justify the choice
of our domain of study, and describe the structure of the empirical studies.
3.1 Research Gaps
From our literature review, several gaps were found. First, human analysis is precise
but not efficient and not scalable to the astonishing growth of the Web. Currently, many
analysis activities such as information seeking, intelligence generation, and relationship
extraction are done manually. Such efforts need to be augmented by more efficient and
scalable approaches, so that humans can spend their time and effort on other valuable
work.
Second, a number of text mining technologies exist but there has not yet been a
comprehensive framework to address such problems of knowledge discovery on the Web
as information overload, heterogeneity and unmonitored quality of information, and
difficulties of identifying relationships on the Web. Text mining technologies hold the
promise for alleviating these problems by augmenting human analysis. However,
applying these technologies effectively requires consideration of several factors related to
the Web itself, such as the use of collection methods, Web page parsing and information
extraction, the presence of hyperlinks, and language differences in heterogeneous
information sources. Existing text mining frameworks (e.g., Trybula (1999), Nasukawa
and Nagano (2001)) do not address these issues.
Third, the HCI aspects of using an automatic text mining framework for knowledge
discovery on the Web have not been widely explored. Although the discipline of HCI has
been developing for decades, studies of how automated approaches benefit human
analysis on the Web are lacking.
3.2 An Automatic Text Mining Framework
This research investigated a general question that embraces the more specific
questions stated in Section 1.2 and Chapters 4 to 6: How can knowledge discovery on the
Web be enhanced by using text mining techniques? We believe that there is no single
"silver bullet" to solve all the existing complex problems. Rather, an integrated
framework combining different text mining techniques is needed.
In this section, we describe an automatic text mining framework for knowledge
discovery on the Web. The rationale underlying our framework is to capture strengths of
different text mining techniques and to complement their weaknesses, thereby effectively
assisting human analysts as they tackle problems of knowledge discovery on the Web. In
the following, we explain components of the framework (see Figure 3.1), principles of
applying the framework, comparisons with existing frameworks, and evaluation of the
framework. Detailed applications of the components are elaborated in the contexts of the
three empirical studies.
[Figure 3.1 shows the five steps of the framework — collection (meta-searching/meta-spidering with keywords; domain spidering with links; covering the Web and the hidden Web behind databases), conversion (language identification; HTML/XML parsing; domain/DB-specific parsing), extraction (indexing of words/phrases; link extraction; lexical/syntactic entity extraction), analysis, and visualization. Beneath the steps are the intermediate collections: Web pages and documents, a tagged collection, and indexes and relationships (data and text bases), followed by similarities/classes/clusters and hierarchies/maps/graphs (knowledge bases), with the user interacting with the framework.]
Figure 3.1: An automatic text mining framework for knowledge discovery on the Web
3.2.1 Components of the Framework
The framework consists of five steps: collection, conversion, extraction, analysis, and
visualization. Input to and output from the framework are, respectively, Web data and
knowledge discovered after applying the steps. Each step allows human knowledge to
guide the application of techniques (e.g., heuristics for parsing, weighting in calculating
similarities, keywords for meta-searching/meta-spidering). Below the steps shown in
Figure 3.1 are collections of processed results: Web pages and documents; a tagged
collection; indexes and relationships; similarities, classes, and clusters; and hierarchies,
maps, and graphs. As we move from left to right across these collections, the degree of context and the difficulty of detecting noise in the results increase (refer to Figure 2.1 for a
related diagram). The three left-hand collections are labeled "data and text bases" and the
two right-hand collections are labeled "knowledge bases." The former mainly contain
raw data and processed textual information while the latter contain knowledge discovered
from data and text bases. We explain each step in the following sections.
3.2.1.1 Collection
The purpose of this step is to acquire raw data for creating search testbeds. Data in the
form of textual Web pages (e.g., HTML, XML, JSP, ASP, etc.) are collected. Several
types of data are found in these pages: textual content (the text that can be seen on an
Internet browser), hyperlinks (embedded behind anchor text), and structural content
(textual mark-up tags that indicate the types of content on the pages).
To collect these data, meta-searching/meta-spidering and domain spidering are used.
Meta-spidering is an enhanced version of meta-searching (discussed in Section 2.3.3.1)
using keywords as inputs. These keywords can be identified by human experts or by
reviewing related literature. In addition to obtaining results from multiple search engines
and collating the set of top-ranked results, the process follows the links of the results and
downloads appropriate Web pages for further processing. Data in the hidden Web (i.e.,
Web sites behind a firewall or protected by passwords) can be collected through meta-
spidering. Domain spidering uses a set of seed URLs (provided by experts or identified
in reputable sources) as starting pages. A crawler (discussed in Section 2.3.3.1) follows
links in these pages to fetch pages automatically. Oftentimes, a breadth-first search
strategy is used because it generally provides good coverage of resources on the topic
being studied. The result of this step is a collection of Web pages and documents that
contain much noisy data.
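A minimal sketch of breadth-first domain spidering, assuming helper functions fetch_page(url) and extract_links(html) are supplied by the caller; the helpers and the toy Web used in the demonstration are placeholders, not part of the framework's actual implementation.

```python
from collections import deque

def domain_spider(seed_urls, fetch_page, extract_links, max_pages=500):
    """Breadth-first crawl starting from expert-provided seed URLs."""
    queue = deque(seed_urls)
    visited, collected = set(), {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)              # placeholder: download the page
        if html is None:
            continue
        collected[url] = html               # raw, still-noisy Web data
        for link in extract_links(html):    # placeholder: parse out hyperlinks
            if link not in visited:
                queue.append(link)          # FIFO queue gives breadth-first order
    return collected

# Tiny demonstration with stand-in helpers (no real network access).
toy_web = {"http://example.com/": ["http://example.com/a", "http://example.com/b"],
           "http://example.com/a": [], "http://example.com/b": []}
pages = domain_spider(["http://example.com/"],
                      fetch_page=lambda u: "<html>%s</html>" % u if u in toy_web else None,
                      extract_links=lambda html: toy_web.get(html[6:-7], []))
print(sorted(pages))
```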
3.2.1.2 Conversion
Because collected raw data often contain irrelevant details (i.e., the data are noisy),
several steps may be needed to convert them into more organized collections and to filter
out unrelated items. Language identification (mentioned in the framework by Trybula
(1999)) is used mainly for Web pages in which more than one language may exist or
English may not be the primary language. Heuristics (such as reading the meta-tags about
language encoding) may be needed. HTML/XML parsing tries to extract meaningful
entities based on HTML or XML mark-up tags (e.g., <H1>, <TITLE>, <A HREF="http://www.arizona.edu/">). Domain/database specific parsing tries
to add in domain knowledge or database schematic knowledge to improve the accuracy
of entity extraction. For example, knowledge about major business intelligence
companies can be used to capture hyperlinks appearing in Web pages. Further analysis
can be done to study the relationships among the interlinked companies. The result of this
step is a collection of Web pages that is tagged with the above-mentioned semantic
details (e.g., language, meaning of entities, domain knowledge) with more contextual
information than the results from the previous step.
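A minimal sketch of the kind of parsing performed in this step, using only Python's standard library; the sample HTML is hypothetical, and reading the meta charset attribute stands in for the language-encoding heuristic mentioned above.

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Pull a few semantic details out of raw HTML: the title, hyperlinks with
    their anchor texts, and any charset hint found in meta tags."""
    def __init__(self):
        super().__init__()
        self.title, self.links, self.charset = "", [], None
        self._tag = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self._tag = tag
        if tag == "a" and "href" in attrs:
            self.links.append({"href": attrs["href"], "anchor": ""})
        elif tag == "meta" and attrs.get("charset"):
            self.charset = attrs["charset"]   # crude language-encoding heuristic

    def handle_data(self, data):
        if self._tag == "title":
            self.title += data
        elif self._tag == "a" and self.links:
            self.links[-1]["anchor"] += data

parser = PageParser()
parser.feed('<html><head><meta charset="big5"><title>BI news</title></head>'
            '<body><a href="http://www.arizona.edu/">University of Arizona</a></body></html>')
print(parser.title, parser.charset, parser.links)
```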
3.2.1.3 Extraction
This step aims to extract entities automatically as inputs for analysis and
visualization. Indexing is the process of extracting words or phrases from textual documents. A list of stop words is typically used to remove non-semantic-bearing
terms (e.g., "of," "the," "a"), which can be identified in the literature (e.g., van
Rijsbergen (1979)). Link extraction identifies hyperlinks within Web pages. Anchor texts
of these links are often extracted to provide further details about the linkage relationships.
Lexical or syntactic entities can be extracted to provide richer context of the Web pages
(i.e., entity extraction). An example of a lexical entity is a company name (e.g., "Siebel,"
"ClearForest") appearing on a Web page. The results of this step are indexes to Web
pages and relationships between entities and Web pages (e.g., indicating which terms
appear on which pages, showing the stakeholder relationship between a business and its
partner). They provide more contextual information to users by showing the relationships
among entities. Noise in data is much reduced from the previous steps.
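A minimal sketch of the indexing part of this step, assuming the pages have already been reduced to plain text; the URLs and the stop word list are illustrative only.

```python
import re
from collections import defaultdict

STOP_WORDS = {"of", "the", "a", "an", "and", "to", "in", "for", "is"}  # illustrative subset

def build_index(pages):
    """pages: {url: plain_text}. Returns term -> set of URLs, i.e., which terms
    appear on which pages."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            if term not in STOP_WORDS:        # drop non-semantic-bearing terms
                index[term].add(url)
    return index

pages = {
    "http://example.com/siebel": "Siebel is a partner of the company",
    "http://example.com/clearforest": "ClearForest provides text mining tools",
}
index = build_index(pages)
print(sorted(index["partner"]))   # pages on which the term "partner" appears
```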
3.2.1.4 Analysis
Once the indexes, relationships, and entities have been extracted in the previous step,
several analyses can be performed to discover knowledge or previously hidden patterns.
Co-occurrence analysis tries to identify frequently occurring pairs of terms and similar Web pages. Pairwise comparison between pages is often performed.
Classification/categorization helps analysts to categorize Web pages into predefined
classes so as to facilitate understanding of individual pages or an entire set of pages. Web page
classification has been studied in previous research (Glover et al., 2002; Lee et al., 2002;
Kwon and Lee, 2003). Clustering organizes similar Web pages into naturally occurring
groups to help detect patterns. Related works include (Jain and Dubes, 1988; Chen et al.,
1998b; Roussinov and Chen, 2001). Summarization provides the gist of a Web page and
has been studied in Hearst (1994) and McDonald and Chen (2002). Link or network
analysis reveals the relationships or communities hidden in a group of interrelated Web
pages (e.g., Menczer (2004)). Depending on the contexts and needs (to be discussed in
Section 3.3), these functions are selectively applied to individual empirical studies by
using appropriate techniques. The results of this step are similarities (e.g., a similarity
matrix among pairs of Web pages), classes (e.g., classes of stakeholders), and clusters
(e.g., groups of closely related Web pages). They are more abstract than the results from
previous steps while supporting the use of structured analysis techniques (e.g., visualization
techniques).
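As a rough illustration of the similarity output of this step (not any of the specific techniques cited above), the sketch below computes pairwise cosine similarities between hypothetical term-weight vectors of a few pages; such a matrix can feed clustering or visualization.

```python
import numpy as np

# Hypothetical term-weight vectors (rows = Web pages, columns = terms).
page_vectors = np.array([
    [0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8],
])

def similarity_matrix(vectors):
    """Pairwise cosine similarity between all pages."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

print(np.round(similarity_matrix(page_vectors), 2))
# High off-diagonal values indicate closely related pages (candidates for one cluster).
```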
3.2.1.5 Visualization
In some applications (e.g., understanding the market environment of an industry), it
would be worthwhile to graphically present a high-level overview of the results.
Visualization appears to be a promising way to accomplish this. Three kinds of
visualization can be performed on results from the previous step. Structure visualization
reveals the underlying structure of the set of Web pages, often in the form of hierarchies.
Spatial navigation presents information (abstracted from voluminous data) in a two- or
three-dimensional space, allowing users to move around in different directions to explore
the details. A specific instance of spatial navigation is map browsing, in which a user
navigates on a map to look for relevant information. Web pages (or other entities) can be
presented as points on a map (i.e., placing entities on map), allowing analysts to study
relationships among pages. Often, the distances among the points are used to reflect
similarity among the pages. For example, the Kohonen self-organizing map has been
used to visualize large numbers of Web pages (Shneiderman, 1996; Chen et al., 1998b;
Spence, 2001; Yang et al., 2003). Various frameworks and a taxonomy for information
visualization have been proposed (Shneiderman, 1996; Spence, 2001; Yang et al., 2003).
The results of this step include hierarchies (e.g., hierarchically related Web pages or
sites), maps (e.g., Web sites placed as points on a map), and graphs (e.g., interconnected
Web sites represented as graphs). They can be perceived graphically, supporting the
understanding of a large amount of information.
3.2.2 HCI Issues
To study how the framework benefits human analyses, several issues related to the
information retrieval (Baeza-Yates and Ribeiro-Neto, 1999), human-computer interaction
(Perlman, 2002) and data quality fields (Wang et al., 1995) were studied. These include
accuracy, information quality, usability, effectiveness, and efficiency. Human subjects
were recruited to provide feedback on the application of the framework to knowledge
discovery processes. Their performances were measured objectively. Precise definitions
of the performance measures are given in the contexts of the three empirical studies in
Chapters 4 to 6.
3.2.3 Structure of the Framework
As shown in Figure 3.1, the order of the five steps reflects a gradual change from data
to information to knowledge, ensuring that the results of one step are input to the
subsequent step. However, it generally is not possible to point out which collections
contain information or knowledge, because human perception and interpretation of these
collections are involved. Nevertheless, we believe that data and information are obtained,
processed, and converted in the collection, conversion, and extraction steps whereas
much knowledge is discovered in the analysis and visualization steps. The framework is
expected to facilitate human discovery of knowledge from voluminous and
heterogeneous Web data.
3.3 Principles of Applying the Framework
This section describes the principles that guide the application of certain techniques
and steps in our framework to addressing the needs in certain contexts.
A range of collection, conversion, and extraction techniques are required to alleviate
the information overload problem on the Web. Meta-searching/meta-spidering should be
used when many potentially relevant information sources are available or relevant
information can be obtained by using some special functions of search engines (e.g.,
searching pages having hyperlinks pointing to a Web page). It can reduce the bias caused
by relying on a small number of information sources (Mowshowitz and Kawaguchi,
2002), thereby contributing to increased quality of the information collected. Meta-searching/meta-spidering also helps collate relevant information distributed in
heterogeneous repositories. Domain spidering should be used when building a high
quality, domain-specific Web page collection, which can then be used for extracting
meaningful entities (e.g., noun phrases) in the selected domain. If the collection consists
mainly of non-English content, language identification heuristics should be used (e.g.,
recognizing the markup tag for language encoding on the Web page's HTML source
code) and pattern extraction techniques are needed (e.g., the mutual information approach
(Ong and Chen, 1999)). Parsing a collection requires knowledge of the structure of Web
pages (e.g., HTML tags) and knowledge of the domain (e.g., terms indicative of business
stakeholder types). Typically, it is required to produce a machine-readable collection for
further processing. In addition to extracting words or phrases, hyperlinks should be
extracted if they represent important relationships between Web pages or sites (e.g.,
business partnership).
Analysis and visualization techniques should be used to transform the collected
information into knowledge. When pre-defined categories are available (e.g., groups of
Web pages identified by common keywords), classification/categorization techniques
should be used. Clustering/summarization techniques should be used when the amount of
information present is large (e.g., the displayed text occupies more space than a computer
screen's size). Co-occurrence analysis should be used when a similarity matrix of pairs of
Web pages is needed to enable understanding of the underlying structure of the collection
and to provide input for clustering and visualization. Link/network analysis is used when
hyperlinks on Web pages reveal important social relationships (e.g., business
stakeholders). Visualization techniques should be used to facilitate information
exploration, especially when the topics to be explored are broad.
Because human cognitive resources are limited, analysis and visualization techniques
should be selectively applied. A general guideline for avoiding "technique overload," a
situation where users are overwhelmed with too many analysis or visualization functions,
is to provide no more than three analysis or visualization functions in an application. The
empirical studies described in Chapters 4 to 6 provide examples of how such a selection
can be made.
HCI issues such as accuracy, usability, effectiveness, and efficiency should be studied
because they reflect the performance level gained by applying the framework.
Information quality should be studied when information sources are heterogeneous (e.g.,
information comes from different regions with vastly different economic status and Web
usage) and hence may not guarantee high-quality information.
3.4 Comparison with Existing Text Mining Frameworks
Compared with all existing text mining frameworks known to us, the proposed framework recognizes the special needs of collecting and analyzing Web data. While
Trybula's framework (Trybula, 1999) touches on issues of finding and gathering data, it
does not address the voluminous and heterogeneous nature of Web data. Chen's text
mining pyramid (Chen, 2001) focuses on analysis techniques, thus not addressing
collection of Web data. The framework proposed by Nasukawa and Nagano (2001)
assumes the use of operational data stored in business databases and hence does not deal
with data collection and conversion on the Web. In contrast, different spidering
techniques in the proposed framework provide broader and deeper exploration of a
domain's content.
Conversion and extraction methods in our framework provide more comprehensive details specific to the Web, such as hyperlinks, anchor texts, and meta-contents, than Trybula's framework, which considers clustering only in its mining stage. Although
Chen's pyramid considers co-occurrence analysis and categorization, it does not include
analyses of hyperlinks, network, and Web page structure. Nasukawa and Nagano's
framework mainly relies on natural language processing techniques to extract concepts
from textual documents and is not tailored to the processing of noisy Web data. In
contrast, our framework encompasses a wider range of analysis and visualization
techniques, taking into account the noisiness and heterogeneity of Web data. Such
techniques as structure visualization and spatial navigation are not found in Trybula's and
Nasukawa and Nagano's frameworks. Moreover, neither Trybula's framework nor Chen's pyramid addresses HCI issues important to understanding the benefits and impacts on humans.
3.5 Evaluating the Framework
We explain in this section the methodology and the domain of study used to evaluate
the proposed framework. The structure of each empirical study is also described.
3.5.1 Research Methodology
To tackle the complicated and multi-faceted problems related to the Web, this
research adopted a system development research methodology (Nunamaker et al., 1991a),
a multi-methodological approach in which conceptual frameworks or mathematical
models are proposed. Proof-of-concept prototypes are developed in order to test and
measure the underlying concepts by means of observation, experimentation, or case
studies. While building systems in itself is not research, the lessons learned and findings
from empirical studies contribute to advancing science and technology.
Framework development focused on textual media because they are a major channel
for embedding and transferring knowledge on the Web. Other media, though beyond our
scope, may be studied in future research. The "automatic" nature of our framework is
emphasized because the use of information technologies enables automated analyses.
Although it has been commented that information technologies may never automate some
parts of business intelligence analysis (Fuld et al., 2003) or knowledge management
activities (Morrow, 2001), our empirical studies show how several important analysis
needs can be satisfied through the framework.
3.5.2 Domain of Study: Business Intelligence
The shared application domain of Chapters 4 to 6, business intelligence, was chosen
for three reasons.
First, BI is becoming increasingly important in today's organizations, as reflected by
the money and effort being invested. According to Fuld et al. (2003), more than 40% of
individuals surveyed (persons who downloaded the Intelligence Software Report (Fuld et
al., 2002)) have organized BI programs^, and over one-third to nearly one-half of
companies interviewed plan to establish a BI process. The Internet has become one of the
top five sources of business information (Futures-Group, 1998). Web portals have been
developed to collect and analyze business intelligence (Marshall et al., 2004). Managers
and business analysts should benefit from using the proposed framework in the BI
process.
^ The report uses the term "Competitive Intelligence" to refer to BI.
Second, collecting and analyzing business intelligence has become a profession. A
key example is the Society of Competitive Intelligence Professionals (SCIP), established
in 1986. SCIP's mission is to enhance the skills of knowledge professionals in order to
help their companies achieve and maintain a competitive advantage. Currently, SCIP has
more than 50 chapters worldwide, with members in more than 50 nations. A new
academic journal called Journal of Competitive Intelligence and Management
(http://www.scip.org/jcim.asp) was launched in 2003.^ Use of the proposed
framework to discover business intelligence on the Web should benefit the BI
community.
^ The Journal of Competitive Intelligence and Management (JCIM) is a quarterly, international, blind refereed journal edited under the auspices of the Society of Competitive Intelligence Professionals (SCIP). JCIM is the premier voice of the Competitive Intelligence (CI) profession and the main venue for scholarly material covering all aspects of the CI and management field. Its primary aim is to further the development and professionalization of CI and to encourage greater understanding of the management of competition by publishing original, high quality, scholarly material in an easily readable format with an eye toward practical applications.
Third, the vibrant growth of electronic commerce has called for better approaches to
knowledge discovery on the Web. A 2003 report by financial firm Morgan Stanley shows
that overall e-business transaction revenue grew steadily from 1996 to 2003 (Morgan-Stanley, 2003). In 2002, it was worth approximately $2.3 billion worldwide. Like
physical cities, "information cities" have emerged over the past five years (Sairamesh et
al., 2004). They provide a range of services to millions of people wishing to do business
over the Internet. For example, eBay now has more than 75,000 suppliers and millions of
users routinely searching, bidding, coming, going, buying, selling, socializing, and
accessing the site. Other examples include Yahoo and AOL's digital cities. In the face of
such significant trends, information overload and other problems of knowledge discovery
on the Web can only be aggravated and the proposed framework should assist individuals
who wish to gain insight into the voluminous data on the Web.
3.5.3 Structure of the Empirical Studies
Each of Chapters 4 to 6 is an empirical study conducted to evaluate the framework
and is structured as follows. The Background section provides details related to the
specific context of the study. The Related Work section reviews literature on the areas of
study and surveys relevant text mining techniques. The Research Questions section lists
and explains the specific questions addressed by the study. The Application of the
Framework section describes how the automatic text mining framework was applied to
answering the research questions. The Evaluation Methodology section describes how the
framework was evaluated. The Experimental Results and Implications section reports and
discusses findings from experiments in the study. Finally, the Conclusions section
presents insights gained from the study.
3.6 Dissertation Chapters
The colored ellipses on the chart shown in Figure 3.2 refer to the three empirical
studies in Chapters 4 to 6. The chart summarizes the application of text mining
techniques to knowledge discovery processes. On the vertical axis, text mining
techniques (summarized in Section 2.3.5) are arranged according to increasing level of
abstraction of their results. Five categories of text mining techniques are identified:
pattern extraction, summarization, classification, clustering, and visualization, with each
category representing a set of techniques that support certain text mining functions. For
example, the mutual information approach can be used to extract textual patterns from a
large corpus as meaningful phrases. Neural networks can be used to classify Web pages
into different types. Various clustering and visualization techniques can be used to
explore business intelligence from a large number of Web pages. On the horizontal axis,
three knowledge discovery processes (summarized in Section 2.2.3) are arranged in
increasing order of frequency of use.^ Each process transforms data and information into knowledge and can be facilitated by text mining techniques.
^ Here, the frequency of use is considered in a general sense. We believe that Web users engage most frequently in information seeking (e.g., using search engines) while relatively fewer users perform intelligence gathering and relationship extraction. However, considering the special duties of certain workers (e.g., analysts responsible for gathering business intelligence), some users may engage more frequently in knowledge discovery processes lying to the right of the axis.
The positions of the ellipses on the chart indicate which text mining techniques were applied to which knowledge discovery processes. In Chapter 4, we applied summarization and pattern extraction techniques to information seeking on the Web. An
intelligent Web search portal incorporating meta-searching, Web page summarization,
and categorization of business information was developed. In Chapter 5, we applied
information clustering and visualization techniques to intelligence gathering on the Web.
Two browsing methods were developed and implemented to support exploration of a
large number of business Web pages. In Chapter 6, we applied classification techniques
to extracting business stakeholder relationships on the Web. Although any technique can be applied to any process, we chose to study these applications because they address
the needs for knowledge discovery in their particular contexts.
[Figure 3.2 is a two-dimensional chart: the vertical axis lists the text mining techniques in increasing level of abstraction (pattern extraction, summarization, classification, clustering, visualization), and the horizontal axis lists the knowledge discovery processes (information seeking, intelligence gathering, relationship extraction). The colored ellipses mark the empirical studies conducted in this dissertation, e.g., "Ch. 4: Building a BI Search Portal."]
Figure 3.2: Application of text mining techniques to knowledge discovery processes
Table 3.1 shows detailed applications and evaluations of the framework in the three
empirical studies. Most of the components of collection, conversion, and extraction were
applied, while analysis and visualization components were selectively applied to specific
studies that focused on certain knowledge discovery problems. Most HCI issues were
addressed in the evaluations.
Table 3.1: Detailed applications and evaluations of the framework

Components applied and HCI issues evaluated across Chapters 4, 5, and 6:
Collection: meta-searching/meta-spidering; domain spidering
Conversion: language identification; HTML/XML parsing; domain/database-specific parsing
Extraction: indexing (word/phrase); link extraction; entity extraction (lexical/syntactic)
Analysis: co-occurrence analysis; classification/categorization; clustering/summarization; link/network analysis
Visualization: structure visualization; spatial navigation; placing entities on map
HCI issues: accuracy; information quality; usability; effectiveness; efficiency
CHAPTER 4. BUILDING A BUSINESS INTELLIGENCE SEARCH
PORTAL FOR INTEGRATED ANALYSIS OF HETEROGENEOUS
INFORMATION
Relevant information for business analysis is often distributed among heterogeneous
sources on the Web. Although search engines are available to support searching these
sources, they typically provide more results from the regions they focus on than from
other regions. For example, Google, despite its claim to serve global users, provides
mainly English contents from the United States. Search engines in China, Taiwan, and
Hong Kong serve primarily their own regions, even though the same language is used
there. Worse still, the quality of search engines' results varies greatly, as anyone can post any material on the Web. This chapter investigates how our framework can benefit
information seeking from heterogeneous sources on the Web.
4.1 Background
As electronic commerce grows in popularity worldwide, business analysts need to
access more diverse information, some of which may be scattered in different regions. A
report published in September 2002 shows that the majority of the total global online
population (63.5%) lives in non-English-speaking regions (Global-Reach, 2002).
Moreover, that population was estimated to grow from 403.5 million in 2002 to 657
million in 2004 (a growth rate of 62.8%), while the population of English-speaking users was expected to grow only from 230.6 million to 280 million during the same period (a growth rate of 21.4%). These statistics imply a high potential for Web growth in many non-English-speaking regions.
The Chinese e-commerce environment provides a good example. Chinese is the
primary language for people in mainland China, Hong Kong, and Taiwan, where
emerging economies are bringing tremendous growth to the Internet population. In
mainland China, the number of Internet users has been growing at 65% every 6 months
since 1997 (CNNIC, 2002). Taiwan and Hong Kong lead the regions by having the
highest Internet penetration rates in the world (ACNielsen, 2002). The need for searching
and browsing Chinese business information on the Internet is growing just as quickly.
To facilitate business analysis in such an environment, we describe in this chapter how
we applied our framework to information seeking and analysis of Chinese business
information on the Web. Based on the framework, an intelligent search portal called the
Chinese Business Intelligence Portal was developed. An experiment was conducted to
compare the portal with existing Chinese search engines. We report the experimental
findings and discuss issues related to human interaction and analysis with automated
systems.
4.2 Related Work
In this section, we review various issues related to Internet searching and browsing in
a heterogeneous environment. These include approaches to information seeking on the
Web and Web searching in a multilingual world.
4.2.1 Approaches to Information Seeking on the Web
As the Internet evolves to be a major information-seeking platform, the human-computer interaction aspect has been addressed in recent research. Two approaches are
found in previous research, namely, a system-centered approach and a user-centered
approach.
4.2.1.1 System-centered Approach
The system-centered approach aims to use information technologies to assist human
beings in their information-seeking process. The use of information retrieval systems
(most notably search engines) is one major strategy. Because different search engines
have different methods of page collection, indexing, and ranking, they may introduce
systematic bias in their search results. Meta-searching (reviewed in Section 2.3.3.1) has
been proposed as a promising method to alleviate the problem.
In addition, post-retrieval analysis provides added value to results returned by search engines. Previews and overviews of retrieved Web pages are important elements in post-retrieval analysis. A preview is extracted from, and acts as a surrogate for, a single object
of interest (Greene et al., 2000). Document summarization techniques provide previews
of individual Web pages in the form of indicative summaries (Firmin and Chrzanowski,
1999), query-biased summaries (Tombros and Sanderson, 1998), or generic summaries
(McDonald and Chen, 2002). An overview is constructed from and represents a
collection of objects of interest (Greene et al., 2000). Document categorization techniques
such as the self-organizing map algorithm (Kohonen, 1995) have been used to categorize
and search the Internet (Chen et al., 1996; Chen et al., 1998b). Document visualization
techniques have also been used to amplify human cognition in browsing Internet search
results (Gloor, 1991; Furnas and Zacks, 1994; Lin, 1997). Despite the potential
advantages of meta-searching and of information previews and overviews, these techniques have rarely been applied to the development of non-English search engines.
4.2.1.2 User-centered Approach
The user-centered approach to information seeking concerns the behavioral and cognitive aspects of information seekers. Under this approach, human information seeking has been described as a behavior that includes questions, dialogue, and social and cognitive situations associated with a user's interaction with an information retrieval system (Saracevic et al., 1988; Kuhlthau et al., 1992; Kuhlthau, 1993). The information-seeking process involves user judgments, search tactics or moves, interactive feedback
loops, and cycles (Spink, 1992; Spink and Saracevic, 1997). Previous research has dealt
with issues relating to user cognitive structure (Ingwersen, 1992) and factors affecting the user-intermediary interaction process (Saracevic, 1996). However, relatively little
research was done to study the perception of information seekers in the context of
Internet information seeking in a heterogeneous environment, such as a multilingual
world (an example of such research is found in Spink et al. (2002)). Considering the
multiple cross-regional information sources that are typically used, two issues deserve
more attention: the quality of information sources and regional impacts.
Information quality is considered to be an important aspect of evaluating the quality
of a Web site (Loiacono, 2002). It is a multi-faceted concept that has been explored in
recent research (Ballou and Pazer, 1985; Redman, 1996; Wang and Strong, 1996; Huang
et al., 1999). A Web site with high information quality is expected to facilitate searching
and browsing. To evaluate information quality, a set of 16 dimensions was developed
(Wang and Strong, 1996) and tested in Pipino et al. (2002). These dimensions have mainly been used to evaluate the quality of information within organizations or companies, not the quality of information obtained from search engines. Previous research
assumed that equal weightings were applied to these dimensions (Kahn et al., 2002).
However, such an assumption may not be valid for evaluating information of domains
that emphasize different dimensions differently.
As a language can be used in more than one region or country, regional impacts arise
because of different cultural, social and economic environments. For example, Chinese is
used differently and has different encodings and vocabularies in Taiwan, Hong Kong and
mainland China. Spink et al. (2002) compared the searching behaviors of FAST search
engine users (who are largely European) with those of Excite search engine users (who
are largely American) and found that FAST users input queries more frequently while
Excite users focused more on e-commerce topics. These results suggest a potential for
regional differences in the public Web, arising from
possible cultural and social
differences. However, their studies focused only on query and topic differences and did
not reveal differences in search-engine effectiveness. In the context of Web searching in a
multilingual world, the evaluation of regional impacts should improve understanding of
optimal design of search engines and portals.
4.2.2 Web Searching in a Heterogeneous Environment
As more non-English speaking people use the Internet to search and browse
information, major search engines have been trying to expand their services to non-English speakers. Also, regional search engines are emerging to provide more localized
searching. In addition to English, they typically accept queries in a user's native language
and return pages from the regions being served. The following presents a survey of major
search engines in English and Chinese, the most popular languages used on the Web
(Global-Reach, 2002). The features, contents, and functions are discussed.
4.2.2.1 English Search Engines
English has been the primary language of Web content since the inception of the
Internet, although the proportion of native English users is declining (Bowen, 2001). At first, English search engines served the English-speaking communities, but they have
gradually been expanded to provide searching for non-English Web content.
With over 3 billion Web pages indexed, Google (http://www.google.com/)
was rated the most popular search site in the United States (with 29.5% audience reach)
in January 2003 (Sullivan, 2002). Google currently supports searching in 78 languages
and has local sites in 40 countries. It allows users to specify a language and a country for
searching (Google, 2002). A translation service among 6 European languages (English,
Spanish, Portuguese, French, German, and Italian) has recently been provided.
With an audience reach of 28.9% (Sullivan, 2002), close to Google's, Yahoo (http://www.yahoo.com/) has local sites in 25 different languages or regions (9 in
Europe, 9 in Asia Pacific, 7 in the Americas). It also supports searching services in some
regions having more than one language (e.g., French and English in Canada).
As one of the first search engines on the Web, AltaVista (http://www.altavista.com/) provides searching of its over 1.1 billion items in
25 languages. It also provides a translation service among 9 languages, including 6
European languages (English, Spanish, Portuguese, French, German, Italian) and 3
Oriental languages (Chinese, Korean, Japanese). It recently launched a new search
assistance tool called Prisma that provides a broad range of suggested terms for refining a
search query (Sherman, 2002). At present only available for English search terms, the
tool will soon be expanded to other languages.
Powered by FAST technology, AlltheWeb (http://www.alltheweb.com/)
has indexed over 2.1 billion Web pages and allows users to specify one of 49 languages
in which Web pages can be returned. Compared with Yahoo and AltaVista, AlltheWeb
focuses on Web searching and does not include other services like Internet shopping,
email, translation, finance, and entertainment.
Supported by Microsoft, a multinational software company, MSN (http://www.msn.com/) provides Web searching in 15 languages and has local
sites in 32 countries or regions across five major continents (MSN, 2002). Due to its
connection with Microsoft Windows products (such as Internet Explorer), MSN has
advantages over other search engines in terms of audience reach (e.g., being the default
homepage of Internet Explorer).
4.2.2.2 Chinese Search Engines
Chinese is the primary language used by people in mainland China, Taiwan, and
Hong Kong. Language encoding, vocabularies, economies and societies of the three
regions differ significantly. Regional search engines therefore have been developed to
support Internet searching.
In mainland China, the major search engines include Sina and Baidu. Baidu
(http://www.baidu.com/) currently powers over 80% of Internet search services
in China, including ChinaRen, 163.net, etc. The database of Baidu stores over 60 million
Web pages collected from mainland China, Hong Kong, Taiwan and Singapore, and
grows by several hundred thousand Web pages per day. Sina
(http://www.sina.com.cn/) is an Internet portal providing comprehensive
services such as Web searching, email, news, business directory, entertainment, weather
forecast, etc. From our review of search engines in mainland China, we found that Baidu
has better search capabilities than the others, as shown by its content coverage. Sina has a
wider scope of functions than Baidu.
In Taiwan, the two major Internet search portals are Openfind and Yam. Openfind
(http://www.openfind.com.tw/), established in 1998, is one of the largest
portals in Taiwan. In addition to basic searching, Openfind suggests terms that are highly
associated with users' queries to help them refine their search. It also allows users to find
more related items from each search result and highlights the query terms in the results.
Established in 1995, Yam (http://www.yam.com/) provides comprehensive online
services. Its four major focuses are content, communication, community, and commerce
(4C). Yam's search engine allows users to search various media: Web sites, Web pages,
news, Internet forum messages, and activities (in 18 Taiwan cities or regions). We found
that Openfind has better functionality and content coverage, but Yam is better established in the local market (e.g., it powers the search function of the Taiwan government's Web sites).
In Hong Kong, due to its bilingual culture, people rely on both English and Chinese
when accessing and searching the Internet. Major search portals include Yahoo Hong
Kong and Timway. Of these, Yahoo Hong Kong (http://hk.yahoo.com/) is one
of the most popular. Yahoo Hong Kong's search engine returns results in different categories: Web sites, Web pages, and news. Headquartered in Hong Kong, Timway
(http://www.timway.com/) provides services such as Web searching, Web
directory, email, news, forums, etc. Its database stores over 30,000 Hong Kong Web sites
and over 10 million Web pages. Although Timway claims to be the search engine for
Hong Kong people, its content coverage is smaller than that of Yahoo Hong Kong. The
functions of the two search engines are similar.
Table 4.1 summarizes the content coverage and functionality of the major search
engines in the three Chinese regions. It shows that these search engines have similar
types of content but their sizes and functions differ. Most search engines only search for
information about their own regions. Some search engines have different versions for
different regions, but users need to visit different Web sites to perform searching. Thus,
their Web page collections are not comprehensive with respect to the Greater China
regions. Furthermore, none of the existing search engines uses meta-searching to collate
and integrate different business information sources, or provides post-retrieval analysis
for assessment and exploitation of business information.
Table 4.1: Comparing major Chinese search engines

Search engines compared: Baidu and Sina (mainland China); Timway and Yahoo Hong Kong (Hong Kong); Yam and Openfind (Taiwan).
Content dimensions: Web pages and news on IT, business, government, financial, and medical topics; general content; size of collection.
Functionality dimensions: encoding conversion; links to related resources; membership services; newsgroup search; Web directory; search for Web sites; search for stock prices; search by time period; search for news; multimedia search (image, music, software, etc.); term suggestion; user interface.
In general, English search engines are better developed than Chinese search engines
in terms of their coverage and functions. The reasons are two-fold. First, English search engines rely on mature techniques developed in the information retrieval field, whereas Chinese information retrieval techniques are less mature. Second, the word segmentation problem contributes to the different levels of development of the technologies. For English, words are
segmented by spaces. For Chinese, words (or characters) are not clearly segmented,
making it hard to extract meaningful semantic units from a text. To overcome problems
caused by the nature of a specific language, a generic approach is needed to build search
engines in any language. From previous research (such as Kwok (1997) and Ong and
Chen (1999)), we conclude that a statistical approach is more generic than a linguistic
approach because the former is not affected by linguistic differences.
4.3 Research Questions
From our literature review, three research gaps were identified. First, the rapid growth
of non-English Web content aggravates information overload for Internet searching and
browsing in a heterogeneous environment. However, technologies for non-English Web
searching are not as mature and well developed as those for English Web searching.
Second, human perception of the information quality and regional impacts of cross-regional information sources has not been explored in previous research. Third, how
human analysis can be assisted by automated information preview and overview has not
been widely explored. The three research questions addressed in this study are:
1. How can we apply our automatic text mining framework to Internet searching and
browsing in a heterogeneous environment such that it can be used to extract
meaningful phrases from any human language, to integrate information from
different sources, and to provide automatic summarization and categorization of
search results?
2. How can human analysis be made more effective (as measured by accuracy of tasks
performed and users' subjective evaluation) by using an automated information-seeking tool developed using the framework?
3. What is the human perception of the improvement in information quality and
regional impacts (measured by users' subjective evaluation) brought about by the tool
(mentioned in Question 2) in comparison with existing search engines?
4.4 Application of the Framework
We have applied our framework to building an intelligent search portal called the
Chinese Business Intelligence Portal (CBizPort). The portal integrates information from
heterogeneous sources and provides post-retrieval analysis capabilities. Meta-searching,
pattern extraction, and summarization were major components of the portal. The
following describes the portal's functionality as well as how our framework was used to
develop it.
CBizPort is a meta-search portal for business information of Greater China. The
domain of Chinese business was selected because of the growing importance of the
Chinese language on the Web and the emerging roles of Chinese economies. "Greater
China" is composed of three regions - mainland China, Taiwan, and Hong Kong. With
the rapid growth of regional economies and global economic integration, an efficient one-stop portal for searching and browsing cross-regional business information is needed.
Because Chinese business information sources are numerous, diverse, and have varying
quality, information overload becomes an issue. Users are more concerned with business
intelligence than business information. Professionals such as business consultants,
marketing executives and financial analysts are heavily involved in the discovery of BI.
The quality of their work relies mainly on the capability of the tools they use to obtain
business information. Since existing Chinese search engines provide business information
rather than business intelligence, there is a need for a better Chinese search portal that
integrates results from the three regions.
Figure 4.2 shows CBizPort's system architecture, which was developed based on the
highlighted components (in blue ovals) shown in Figure 4.1 (modified from Figure 3.1).
We discuss these components below.
[Figure 4.1 depicts the framework pipeline: collection (meta-searching/meta-spidering and domain spidering of the Web and of the hidden Web behind databases), conversion (HTML/XML and domain/database-specific parsing of Web pages and documents into a tagged collection), extraction (indexes and relationships stored in data and text bases), analysis (similarities, classes, clusters), and visualization (hierarchies, maps, graphs stored in knowledge bases), with the components used to develop CBizPort highlighted.]
Figure 4.1: Framework components used to develop CBizPort
4.4.1 User Interface
CBizPort has two versions of user interface (Simplified Chinese and Traditional
Chinese) that have the same look and feel. Each version uses its own character encoding
when processing queries. The encoding converter is used to convert all Chinese
characters into the encoding of the interface version. On the search page (Figures 4.3 - 4.4), the major component is the meta-searching area, on top of which is a keyword input
box. Users can input multiple keywords on different lines and can choose among eight
carefully selected information sources (see Table 4.2) from the three regions by checking
the boxes. A one-sentence description is provided for each information source. On the
result page (Figure 4.6), we display the top 20 results from each information source. The
results are organized according to the information sources on one Web page. Users can
browse the set of results from a particular source by clicking on the bookmark at the top right-hand side of the page (e.g., "HKTDCmeta," "Baidu," and "Yahoo Hong Kong" in
Figure 4.6). Users can also click on the "Analyze results" button to use the categorizer or
choose a number of sentences provided to summarize the Web page.
[Figure 4.2 shows the three-tier architecture of CBizPort. The front end consists of the search page and the result page, together with a folder display of categorized results and a summarization view of a requested Web page. The middleware contains the encoding converter (GB2312 and Big5), which converts queries and results into the appropriate encoding. The back end comprises the meta-search engine, the categorizer, the Web page summarizer, and the Chinese phrase lexicon; the meta-search engine forwards queries to, and collects results from, the underlying search engines: Baidu and CSRC in mainland China (GB2312 encoding); Yam, GIO, and PCHome in Taiwan (Big5 encoding); and Yahoo Hong Kong, HKTDC, and the Hong Kong Government in Hong Kong (Big5 encoding).]
Figure 4.2: System architecture of CBizPort
4.4.2 Encoding Converter
The encoding converter relies on a conversion dictionary with 6,737 Chinese
characters in each of the two encodings (Big5 and GB2312). The dictionary includes the
most commonly used characters in the Chinese language. Encoding conversion is
performed when the portal sends out queries to other search engines having encoding
different from its own or when the portal collates results from those search engines. We
can consider the Simplified Chinese version of CBizPort as an example. Before the portal
sends out queries from the Simplified Chinese interface to search engines in Traditional
Chinese, the encoding converter converts the queries from GB2312 encoding to Big5
encoding. Upon getting results in Big5 encoding, the encoding converter is used again to
convert the results to GB2312 encoding before displaying them on the result pages.
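To make the conversion step concrete, the following is a minimal sketch of a dictionary-based converter, not CBizPort's actual code. It assumes a hypothetical two-column mapping file, big5_gb_pairs.txt, listing each Traditional character next to its Simplified counterpart, in the spirit of the 6,737-character conversion dictionary described above; for simplicity the sketch works on Unicode strings rather than raw Big5/GB2312 byte streams.

```python
# Minimal sketch of a dictionary-based Traditional <-> Simplified converter
# (illustrative only; the mapping file name and format are assumptions).

def load_mapping(path="big5_gb_pairs.txt"):
    trad_to_simp, simp_to_trad = {}, {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            trad, simp = parts
            trad_to_simp[trad] = simp
            simp_to_trad[simp] = trad
    return trad_to_simp, simp_to_trad

def convert(text, table):
    # Characters not in the dictionary (ASCII, punctuation) pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)

if __name__ == "__main__":
    trad_to_simp, simp_to_trad = load_mapping()
    query = "知識管理"                    # a query typed in the Traditional Chinese interface
    print(convert(query, trad_to_simp))   # converted before being sent to Simplified-encoding engines
```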
4.4.3 Information Sources
The eight information sources selected for meta-searching are major Chinese search
engines or business-related portals from the three regions (see Table 4.2). Business-related portals, including commercial and government Web sites, were selected because of their high quality of Internet searching and browsing and their important role in the business domains of their regions. With high authoritativeness, accuracy, relevance, and timeliness, these information sources serve to provide high-quality information, thus addressing
the unmonitored quality of information on the Web.
Table 4.2: Information sources of CBizPort

Mainland China
- Baidu: a general search engine for mainland China
- China Securities Regulatory Commission: a portal containing news and financial reports of the listed companies in mainland China
Hong Kong
- Yahoo Hong Kong: a general search engine for Hong Kong
- Hong Kong Trade Development Council: a business portal providing information about local companies, products, and trading opportunities
- Hong Kong Government Information Center: a portal with government publications, services and policies, business statistics, etc.
Taiwan
- Yam: a general search engine for Taiwan
- PCHome: an IT news portal with hundreds of online publications in business and IT areas
- Taiwan Government Information Office: a government portal with business and legal information
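The meta-searching flow over these sources can be sketched as follows. This is an illustrative outline only: the query-URL templates, the parse_results helper, and the error handling are assumptions, not the actual CBizPort meta-search engine.

```python
# Illustrative meta-search dispatcher for sources like those in Table 4.2
# (URL templates and result parsing are placeholders).
import urllib.parse
import urllib.request

SOURCES = {
    "Baidu":           "http://www.baidu.com/s?wd={q}",        # hypothetical query templates
    "Yahoo Hong Kong": "http://hk.yahoo.com/search?p={q}",
    "Yam":             "http://www.yam.com/search?q={q}",
    # ... remaining sources omitted for brevity
}

def parse_results(html):
    """Placeholder: extract (title, url, snippet) tuples from a result page."""
    return []

def meta_search(query, selected_sources, top_n=20):
    results = {}
    for name in selected_sources:
        url = SOURCES[name].format(q=urllib.parse.quote(query))
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode(errors="replace")
        except OSError:
            results[name] = []          # unreachable source: show an empty bookmark
            continue
        results[name] = parse_results(html)[:top_n]   # keep the top 20 results per source
    return results
```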
4.4.4 Summarizer
The CBizPort summarizer was modified from an English summarizer called
TXTRACTOR that uses sentence-selection heuristics to rank text segments (McDonald
and Chen, 2002) (An overview of text summarization techniques can be found in Section
2.3.2.2). These heuristics strive to reduce redundancy of information in a query-based
summary (Carbonell and Goldstein, 1998). The summarization takes place in three main
steps: (1) sentence evaluation, (2) segmentation or topic identification, and (3) segment
ranking and extraction. First, a Web page to be summarized is fetched from the remote
server and parsed to extract its full text. All sentences are extracted by identifying
punctuation marks acting as periods, such as "。", "!", and "?". Important information
such as presence of cue phrases (e.g., "therefore", "in summary"), sentence lengths and
positions are also extracted for ranking the sentences. Second, the Text-Tiling algorithm
(Hearst, 1994) is used to analyze the Web page and determine where the topic boundaries
are located. A Jaccard similarity function is used to compare the similarity of different
blocks of sentences. Third, document segments identified in the previous step are ranked
according to the ranking scores obtained in the first step and key sentences are extracted
as the summary. The CBizPort summarizer can flexibly summarize Web pages using one to
five sentence(s). Users can invoke it by choosing the number of sentences for
summarization in a pull-down menu under each result. Then, a new window is activated
(shown in Figure 4.7) that displays the summary and the original Web page.
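The three steps can be illustrated with a simplified sketch. This is not the TXTRACTOR implementation; the cue-phrase list, block size, scoring weights, and similarity threshold below are illustrative assumptions, and real TextTiling is considerably more involved.

```python
# Simplified extractive-summarization sketch: score sentences, find topic
# boundaries where adjacent sentence blocks share little vocabulary (a crude
# TextTiling-style step using Jaccard similarity), then pick key sentences.
import re

CUE_PHRASES = ("in summary", "therefore", "in conclusion")    # illustrative cue phrases

def split_sentences(text):
    return [s.strip() for s in re.split(r"[。！？.!?]", text) if s.strip()]

def score(sentence, position, total):
    s = 1.0 if any(cue in sentence.lower() for cue in CUE_PHRASES) else 0.0
    s += 1.0 - position / total                      # earlier sentences weighted higher
    s += min(len(sentence.split()), 25) / 25.0       # favor reasonably long sentences
    return s

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def summarize(text, n_sentences=3, block=3):
    sents = split_sentences(text)
    if not sents:
        return []
    scores = [score(s, i, len(sents)) for i, s in enumerate(sents)]
    # Step 2: mark a topic boundary where adjacent blocks are lexically dissimilar.
    boundaries = [0]
    for i in range(block, len(sents), block):
        left = " ".join(sents[max(0, i - block):i])
        right = " ".join(sents[i:i + block])
        if jaccard(left, right) < 0.1:               # illustrative threshold
            boundaries.append(i)
    boundaries.append(len(sents))
    segments = [range(boundaries[j], boundaries[j + 1]) for j in range(len(boundaries) - 1)]
    # Step 3: rank segments by their best sentence and take one key sentence from
    # each of the top segments, preserving document order in the summary.
    segments.sort(key=lambda seg: max(scores[i] for i in seg), reverse=True)
    chosen = [max(seg, key=lambda i: scores[i]) for seg in segments[:n_sentences]]
    return [sents[i] for i in sorted(chosen)]
```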
4.4.5 Categorizer
The CBizPort categorizer organizes the Web pages into various folders labeled by the
key phrases appearing in the page summaries or titles (see Figure 4.8). It relies on a
Chinese phrase lexicon to extract phrases from Web page summaries obtained from the
eight search engines or portals. The lexicon for Simplified Chinese CBizPort is different
from that for Traditional Chinese because the terms and expressions are likely to be
different in the two contexts.
4.4.5.1 Mutual Information Approach
To create the lexicons, we collected a large number of Chinese business Web pages
and extracted meaningful phrases from them using the mutual information approach,
which is a statistical method that identifies as meaningful phrases significant patterns
from a large amount of text in any language (Church and Hanks, 1989; Ong and Chen,
1999). The approach is an iterative process of identifying significant lexical patterns by
examining the frequencies of word co-occurrences in a large amount of text. Three steps
are involved: tokenization, filtering and phrase extraction. First, in the tokenization step,
each word (or token) in the text is identified by recognizing the delimiter separating it
from another word. In English (or many other European languages), a word is delimited
from another by a space. In Chinese (or many other oriental languages), in which the
smallest meaning-bearing unit is a character, the delimiter is identified as the boundary of
each word (or character). Second, in the filtering step, a list of stop words is used to
remove non-semantic-bearing expressions and a list of included words is used to retain
good expressions (words or phrases). Regular expressions can be used in the two lists to
specify patterns of words. Stop words such as the Chinese equivalents of "'s" and "and" and function words with no standalone meaning are removed. The included-word list, which has priority over the stop-word list, gives users the flexibility to retain words that appear in the stop-word list. For example, the Chinese phrase meaning "aim" can be listed in the included words although the character glossed as "'s" that it contains appears in the stop-word list. Third, in the phrase extraction
step, statistics of patterns of the words extracted from the above steps are computed and
compared against thresholds to decide whether certain patterns are extracted as
meaningful phrases. The mutual information (MI) algorithm is used to compute how
frequently a pattern appears in the corpus, relative to its sub-patterns. Based on the
algorithm, the MI of a pattern c (MI_c) can be found by

    MI_c = f_c / (f_left + f_right - f_c)

where f stands for the frequency of the corresponding set of words: f_c is the frequency of pattern c, and f_left and f_right are the frequencies of its left and right sub-patterns. Intuitively, MI_c represents the probability of co-occurrence of pattern c, relative to its left sub-pattern and right sub-pattern. Phrases with high MI are likely to be extracted and used in automatic indexing.
For example, if the Chinese phrase for "knowledge management" appears in the corpus 100 times, its left sub-pattern appears 110 times, and its right sub-pattern appears 105 times, then the mutual information (MI) for the pattern is 100 / (110 + 105 - 100) = 0.87.
We also employed an updateable PAT-Tree data structure developed in Ong and
Chen (1999) that supports online frequency update after removing extracted patterns to
facilitate subsequent extraction. Hence, repetitive removal of sub-patterns is not
necessary. To decide whether a pattern is extracted as a phrase, a frequency distribution
analysis is performed to filter out patterns that have MI values lower than certain
thresholds. The extracted patterns are then manually examined (1) to remove non-semantic-bearing phrases, which are added to the stop-word list; (2) to identify from the
original corpus meaningful phrases that should have been extracted and add them to the
included word list; and (3) to adjust the MI threshold values if needed. The process is
repeated (typically two or three times) to obtain phrases used to index the Web
pages. Since the approach does not rely on the specific nature of a language, we believe it has generic applicability.
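The statistic itself is easy to compute once n-gram frequencies are available. The sketch below is a simplified stand-in for the PAT-tree-based implementation of Ong and Chen (1999): it counts character n-grams directly, and the maximum pattern length, frequency floor, and MI threshold are illustrative assumptions.

```python
# Sketch of mutual-information phrase extraction over character n-grams.
# MI(c) = f(c) / (f(left sub-pattern) + f(right sub-pattern) - f(c)).
from collections import Counter

def ngram_counts(tokens, max_n=4):
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def mutual_information(pattern, counts):
    f_c = counts[pattern]
    f_left = counts[pattern[:-1]]        # left sub-pattern (drop the last token)
    f_right = counts[pattern[1:]]        # right sub-pattern (drop the first token)
    return f_c / (f_left + f_right - f_c)

def extract_phrases(tokens, min_freq=5, min_mi=0.25, max_n=4):
    """tokens: for Chinese text, a list of individual characters; for English, words."""
    counts = ngram_counts(tokens, max_n)
    phrases = []
    for pattern, freq in counts.items():
        if len(pattern) < 2 or freq < min_freq:
            continue
        mi = mutual_information(pattern, counts)
        if mi >= min_mi:
            phrases.append(("".join(pattern), freq, round(mi, 2)))
    return sorted(phrases, key=lambda p: p[2], reverse=True)
```

Plugging the frequencies from the example above (100, 110, and 105) into the same formula gives 100 / 115, or approximately 0.87.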
4.4.5.2 Chinese Phrase Lexicon
For creating the Simplified Chinese lexicon, over 100,000 Web pages in GB2312
encoding were collected from major business portals such as Sohu.com, Sina Tech, and
Sina Finance in mainland China. For creating the Traditional Chinese lexicon, over
200,000 Web pages in Big5 encoding were collected from major business or news portals
in Hong Kong and Taiwan (e.g., HKTDC, HK Government, Taiwan United Daily News
Finance Section, Central Daily News). The Simplified Chinese lexicon has about 38,000
phrases and the Traditional Chinese lexicon has about 22,000 phrases.
Using the Chinese phrase lexicon, the categorizer performed full-text indexing on the
title and summary of each result (or Web page) and extracted the top 20 (or fewer)
phrases from the results. Phrases occurring in the text of more Web pages were ranked
higher. A folder was then used to represent a phrase, and the categorizer assigned the Web
pages to respective folders based on the occurrences of the phrase in the text. A Web
page can be assigned to more than one folder if it contains more than one of the extracted
phrases. The number of Web pages in each folder is also shown. After clicking on a
folder, users can see the titles of the Web pages assigned to that folder. Further clicking
on a title will open the Web page in a new window.
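The folder-assignment step can be sketched as follows. The lexicon and result records are toy examples for illustration; the actual categorizer works on the Chinese phrase lexicons and the meta-search results described above.

```python
# Sketch of lexicon-based categorization of search results into phrase folders.
from collections import defaultdict

def categorize(results, lexicon, top_k=20):
    """results: list of dicts with 'title' and 'summary'; lexicon: iterable of phrases."""
    pages_per_phrase = defaultdict(set)
    for idx, page in enumerate(results):
        text = (page["title"] + " " + page["summary"]).lower()
        for phrase in lexicon:
            if phrase.lower() in text:
                pages_per_phrase[phrase].add(idx)
    # Phrases occurring in more pages are ranked higher; keep the top 20 (or fewer).
    top_phrases = sorted(pages_per_phrase,
                         key=lambda p: len(pages_per_phrase[p]), reverse=True)[:top_k]
    # A page can appear in more than one folder if it contains several top phrases.
    return {phrase: sorted(pages_per_phrase[phrase]) for phrase in top_phrases}

if __name__ == "__main__":
    lexicon = ["knowledge management", "electronic commerce", "trade"]     # toy lexicon
    results = [
        {"title": "Trade fairs in Hong Kong", "summary": "Electronic commerce and trade news"},
        {"title": "Knowledge management portal", "summary": "Business intelligence for trade"},
    ]
    for phrase, pages in categorize(results, lexicon).items():
        print(f"{phrase} ({len(pages)} docs): {pages}")
```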
[Figure 4.3 presents annotated screen shots of CBizPort's search page, result page, summarizer, and categorizer.]
Figure 4.3: Screen shots of various functions of CBizPort
Figures 4.2 - 4.7 illustrate how CBizPort works. An online demo of the system is
available at http://ai.bpa.arizona.edu/go/dl/cbizport.html.
Figure 4.4: Search page of CBizPort (Traditional Chinese version)
Figure 4.5: Search page of CBizPort (Simplified Chinese version)
Figure 4.6: Result page of CBizPort
Figure 4.7: Web page summarizer
The summarizer returns key sentences shown on the left frame of the Web page as a summary
Figure 4.8: Web page categorizer
The categorizer groups results into different categories (shown as folders) according to keywords
4.5 Evaluation Methodology
In this section, we describe the methodology used to evaluate CBizPort. We present
the objectives, experimental tasks, hypotheses, and design as follows.
4.5.1 Objectives and Experimental Tasks
Our evaluation objectives were three-fold: (1) to evaluate the performance of the
summarizer as a preview function and the categorizer as an overview function to study how effectively they can assist human analysis; (2) to compare CBizPort with regional Chinese
search engines to study its effectiveness and usability; and (3) to evaluate human
perception of the information quality and regional impacts of CBizPort, in comparison with
existing regional Chinese search engines.
To evaluate how the search engines assist human analysis, scenario-based search
tasks and browse tasks consistent with TREC standards were designed (Voorhees and
Harman, 1997). Sponsored by the National Institute of Standards and Technology (NIST)
and the Defense Advanced Research Projects Agency (DARPA), TREC strives to provide
a common task evaluation that allows cross-system comparisons. An example of a search
task is "find two cities in mainland China that Motorola has set up its manufacturing
operations." An example of a browse task is "describe, in a number of distinct themes,
the economic impacts of removing trade barriers between mainland China and Taiwan
towards Hong Kong" (see Figure 4.4). The theme identification method was used to
evaluate performance in browse tasks (Chen et al., 2001). Appendix A.3 provides the
complete questionnaire used in the experiment.
To achieve objective (1), we compared the performance of CBizPort when using its summarizer or categorizer with its performance when using neither. To achieve objective (2), we selected a search
engine from
each of the three regions as a benchmark against which to compare
CBizPort. Based on our literature review, we used Sina, Yahoo Hong Kong, and
Openfind as our benchmarks. Although Yahoo Hong Kong was already included as one of CBizPort's meta-search sources, we chose it again as a benchmark search engine because of its
familiarity among Hong Kong people and its rich content. To achieve objective (3), we
designed tasks that required users to search for information from regions different from
their places of origin (i.e., heterogeneous information) to compare the performances of
CBizPort and benchmark search engines. In addition, qualitative data in the form of
subjects' comments and actions were recorded to provide more details about their
behaviors and feedback.
4.5.2 Hypotheses
Four groups of hypotheses were tested (see Table 4.3). To compare the effectiveness
of the systems, we used accuracy for search tasks, and precision and recall for browse
tasks. Accuracy refers to how well the system helped users fmd exact answers to search
tasks. Precision measured how well the system helped users fmd relevant results and
avoid irrelevant results in browse tasks. Recall measured how well the system helped
users fmd all the relevant results in browse tasks. A single measure called F value was
used to balance between recall and precision (Shaw et al., 1997). It gives an intuition of
the performance achieved by the expert and subjects simultaneously. The formulae used
to calculate the above metrics are stated below.
    Accuracy = (Number of correctly answered parts) / (Total number of parts)

    Precision = (Number of relevant results identified by the subject) / (Number of all results identified by the subject)

    Recall = (Number of relevant results identified by the subject) / (Number of relevant results identified by the expert)

    F value = (2 x Recall x Precision) / (Recall + Precision)
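A direct transcription of these formulae into code is shown below; the counts in the usage example are illustrative, not data from the experiment.

```python
# Task-performance metrics used in the experiment (direct transcription of the formulae).

def accuracy(correct_parts, total_parts):
    return correct_parts / total_parts

def precision(relevant_by_subject, all_by_subject):
    return relevant_by_subject / all_by_subject

def recall(relevant_by_subject, relevant_by_expert):
    return relevant_by_subject / relevant_by_expert

def f_value(r, p):
    return 2 * r * p / (r + p) if (r + p) else 0.0

if __name__ == "__main__":
    p = precision(relevant_by_subject=4, all_by_subject=7)      # illustrative counts
    r = recall(relevant_by_subject=4, relevant_by_expert=10)
    print(round(p, 2), round(r, 2), round(f_value(r, p), 2))     # 0.57 0.4 0.47
```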
Table 4.3: Hypotheses tested in the experiment

1. CBizPort's Assistance in Human Analysis
H1.1  CBizPort's summarizer significantly improves the effectiveness of searching
H1.2  CBizPort's summarizer significantly improves the effectiveness of browsing
H1.3  CBizPort's categorizer significantly improves the effectiveness of searching
H1.4  CBizPort's categorizer significantly improves the effectiveness of browsing
2. Search Engine Performance Comparison
H2.1  For general search tasks, CBizPort performs similarly to regional Chinese search engines in terms of effectiveness
H2.2  For general browse tasks, CBizPort performs similarly to regional Chinese search engines in terms of effectiveness
H2.3  For cross-regional search tasks, CBizPort is more effective than regional Chinese search engines
H2.4  For cross-regional browse tasks, CBizPort is more effective than regional Chinese search engines
H2.5  A combination of CBizPort and a regional Chinese search engine is more effective than CBizPort in searching
H2.6  A combination of CBizPort and a regional Chinese search engine is more effective than CBizPort in browsing
H2.7  A combination of CBizPort and a regional Chinese search engine is more effective than CBizPort's summarizer in searching
H2.8  A combination of CBizPort and a regional Chinese search engine is more effective than CBizPort's summarizer in browsing
H2.9  A combination of CBizPort and a regional Chinese search engine is more effective than CBizPort's categorizer in searching
H2.10 A combination of CBizPort and a regional Chinese search engine is more effective than CBizPort's categorizer in browsing
H2.11 For general search tasks, a combination of CBizPort and a regional Chinese search engine is more effective than a regional Chinese search engine
H2.12 For general browse tasks, a combination of CBizPort and a regional Chinese search engine is more effective than a regional Chinese search engine
H2.13 For cross-regional search tasks, a combination of CBizPort and a regional Chinese search engine is more effective than a regional Chinese search engine
H2.14 For cross-regional browse tasks, a combination of CBizPort and a regional Chinese search engine is more effective than a regional Chinese search engine
3. Users' Subjective Evaluations
H3.1  CBizPort provides a higher information quality than regional Chinese search engines
H3.1a In terms of presentation quality and clarity, CBizPort provides a higher information quality than regional Chinese search engines
H3.1b In terms of coverage and reliability, the information quality of CBizPort is similar to that of regional Chinese search engines
H3.1c In terms of usability and analysis quality, CBizPort provides a higher information quality than regional Chinese search engines
H3.2  CBizPort has a better cross-regional searching capability than regional Chinese search engines
H3.3  CBizPort users achieve a higher overall satisfaction than regional Chinese search engines' users
4. Additional Hypotheses
HA1   Search performance of the three regional Chinese search engines is not significantly different
HA2   Browse performance of the three regional Chinese search engines is not significantly different
4.5.2.1 Hypotheses on CBizPort's Enhanced Analysis Capabilities
In H1.1 - H1.4, we hypothesized that the use of CBizPort's summarizer and
categorizer could significantly improve the searching and browsing performance of
CBizPort because the summarizer could extract key sentences from Web pages, thereby
saving users' time in browsing, and the categorizer could classify Web pages into groups,
thereby providing analysis capability that is not widely found in regional Chinese search
engines.
4.5.2.2 Hypotheses on Search Engine Performance Comparison
In H2.1 and H2.2, we hypothesized that CBizPort would perform similarly to regional
Chinese search engines for general search and browse tasks because the two systems
have comparable advantages. ("General search and browse tasks" refer to tasks that may
or may not ask for information about a particular region that is different from a subject's
place of origin.) CBizPort was good at integrating results from different information
sources while regional Chinese search engines provided deep coverage of the regions
they served. In H2.3 and H2.4, we believed that CBizPort's ability to integrate
information sources from the three regions provided more comprehensive coverage of
search results. ("Cross-regional search and browse tasks" refers to tasks that require a
subject to find information about a particular region that is different from his/her place of
origin.)
In H2.5 - H2.14, we believed that a combination of CBizPort and a regional Chinese
search engine could compensate for the insufficiencies of both systems and provide the highest
quality of searching and browsing. Since we expected to obtain significantly different
results from the two systems, combining the results from them would significantly
increase recall but create only a small change in precision. Through this arrangement, we
tried to mimic a situation in which each subject was allowed to use CBizPort and a
benchmark search engine together to solve the same problem.
4.5.2.3 Hypotheses on Users' Subjective Evaluations
In H3.1 - H3.3, we believed that CBizPort has better information quality because,
unlike commercial search engines, CBizPort supports searching of various high-quality
and authoritative information sources (see Table 4.2). Also, we believed that CBizPort
performed similarly to regional search engines in the dimensions classified under
"Coverage and reliability" (HB.lb) because the former provides comprehensive coverage
of the three regions while the latter mainly have regional coverage. In H3.2, we
hypothesized that CBizPort had a better cross-regional searching capability because of its
ability to integrate results from the three regions. Based on the cited advantages of
CBizPort, we therefore believed that CBizPort's users would achieve higher overall satisfaction
ratings (H3.3).
4.5.2.4 Additional Hypotheses
In this experiment, we assumed that the three chosen benchmark search engines
(Sina, Yahoo Hong Kong, and Openfind) belonged to the same category called "regional
Chinese search engines" and had similar searching and browsing capabilities. Such an
assumption was tested in HA1 and HA2. We expected that the performances of the three
benchmarks would not be significantly different from each other, thus allowing us to
compare CBizPort with the entire category (but not individual search engines).
4.5.3 Experimental Design
Thirty Chinese students at the University of Arizona, ten from each region, were recruited
as subjects of the experiment. Each of them received a fixed amount of money as an
incentive for their voluntary participation in our experiment. The number of subjects was
the same for all regions, as we wanted to maintain equal influence of regional impacts.
Each subject's name, age range, gender, education level, and computer literacy were
recorded, but were kept confidential in accordance with the Institutional Review Board
(IRB) Guidebook (available at http://ohrp.osophs.dhhs.gov/irb/irb_guidebook.htm). Appendices A.1 and A.2 respectively provide the approval letter and
disclaimer form approved by the University of Arizona Human Subject Protection
Program.
" The IRB Guidebook can be found at
http://ohrp.osophs.dhhs.gov/irb/irb_guidebook.htm.
110
The experiment required each subject to perform 5 search tasks and 5 browse tasks. A
time limit of 4 minutes was imposed on each search task and 5 minutes on each browse
task. Among the 10 tasks, 3 search tasks and 3 browse tasks were performed using
CBizPort (either general search capability, or general search plus summarizer, or general
search plus categorizer was used), and 2 search tasks and 2 browse tasks were performed
using the benchmark search engine from the region of a subject's origin (one task was
about information seeking within the region of the subjects' place of origin and the other
was about seeking information from a region different from the subject's place of origin). All
tasks were randomly assigned to different questions to avoid bias due to task content. A
pilot test involving three subjects was conducted to evaluate the appropriateness of the
tasks before they were actually used in the experiment. In the pilot test, we found that the
subjects used all the time assigned for most search and browse tasks regardless of the
system they used. Limited by the duration of the whole experiment (approximately one
hour), we decided not to allocate more time to the tasks and focused only on studying the
effectiveness and usability (but not efficiency) of the systems.
During the experiment, a subject used each of the two systems to perform the tasks.
The order in which the systems were used was randomly assigned to the subjects to avoid
bias due to system sequence. To avoid bias arising from the fact that subjects might favor
CBizPort because it was developed at the University of Arizona, the experimenters did
not show any preference when introducing the two systems and also maintained an
unbiased attitude toward each system. We believe such bias may still have existed but did not pose a major problem to the findings. As each subject was asked to perform similar tasks
using the two systems, a one-factor repeated-measures design was used, because it gives
greater precision than designs that employ only between-subjects factors (Myers and
Well, 1995). All verbal comments were analyzed using protocol analysis (Ericsson and
Simon, 1993).
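One way to run the repeated-measures comparison is as a paired analysis of each subject's scores under the two systems. The sketch below uses SciPy's paired t-test purely as an illustration of such an analysis; it is not stated here which specific test the study used, and the score arrays are fabricated placeholders, not the experimental data.

```python
# Illustrative paired comparison for a repeated-measures design: each subject
# contributes a score under both systems, so the test operates on paired
# differences. The numbers below are placeholders only.
from scipy import stats

cbizport_accuracy  = [0.4, 0.2, 0.6, 0.4, 0.0, 0.6, 0.2, 0.4]   # hypothetical per-subject accuracy
benchmark_accuracy = [0.2, 0.4, 0.4, 0.2, 0.2, 0.6, 0.0, 0.4]

t_stat, p_value = stats.ttest_rel(cbizport_accuracy, benchmark_accuracy)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```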
After finishing the tasks with a system, a subject needed to rate the system on: (1) the
information quality provided by the system; (2) the ability to retrieve cross-regional
information; and (3) his/her overall satisfaction with the system. To measure information
quality, we modified the 16-dimension construct developed in Wang and Strong (1996)
by dropping the dimension of "security," which is not relevant because the information provided by the systems is already public. In addition, because there are different levels
of importance in the remaining 15 dimensions, we invited our experts (as described
below) to provide ratings on the relative importance of different dimensions. Such ratings
were used to weight the different dimensions of information quality for the Chinese
business domain. Their ratings as well as the definitions of the 15 dimensions categorized
into three categories are shown in Table 4.4.
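The weighting can be illustrated with a short sketch: the expert importance ratings act as weights over a subject's 1-to-7 ratings on the individual dimensions. The dimension names and numbers below are an illustrative subset on the 1-to-3 importance scale, not the full set of weights.

```python
# Sketch of a weighted information-quality score: expert importance ratings
# (1-3 scale) weight a subject's 1-7 ratings on each dimension.
# The weights shown are an illustrative subset, not the full Table 4.4 column.
EXPERT_WEIGHTS = {
    "Accessibility": 3,
    "Believability": 2,
    "Timeliness": 3,
}

def weighted_quality(subject_ratings, weights=EXPERT_WEIGHTS):
    """subject_ratings: dict of dimension -> rating on a 1-7 scale."""
    total_weight = sum(weights[d] for d in subject_ratings)
    return sum(subject_ratings[d] * weights[d] for d in subject_ratings) / total_weight

if __name__ == "__main__":
    ratings = {"Accessibility": 5, "Believability": 4, "Timeliness": 6}    # illustrative ratings
    print(round(weighted_quality(ratings), 2))                             # 5.12
```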
Three experts, one from each region, also were recruited to provide answers for all
browse tasks. The expert from mainland China has an MBA degree. The Taiwan expert
holds an M.S. degree in management information systems and is pursuing a Ph.D. degree
in MIS. The Hong Kong expert is a Ph.D. candidate in Marketing Management and
worked for four years in market research in Hong Kong.
Table 4.4: Definitions of 15 dimensions of information quality and expert ratings

Presentation quality and clarity
- Accessibility: the extent to which information is available, or easily and quickly retrievable (rating: 3)
- Concise Representation: the extent to which information is compactly represented (rating: 3)
- Consistent Representation: the extent to which information is presented in the same format (rating: 2)
- Ease of Manipulation: the extent to which information is easy to manipulate and apply to different tasks (rating: 2)
- Interpretability: the extent to which information is in appropriate languages, symbols, and units, and the definitions are clear (rating: 2.67)

Coverage and reliability
- Appropriate Amount of Information: the extent to which the volume of information is appropriate for the task at hand (rating: 2.67)
- Believability: the extent to which information is regarded as true and credible (rating: 2)
- Completeness: the extent to which information is not missing and is of sufficient breadth and depth for the task at hand (rating: 2.33)
- Free-of-Error: the extent to which information is correct and reliable (rating: 2.67)
- Objectivity: the extent to which information is unbiased, unprejudiced, and impartial (rating: 2.33)

Usability and analysis quality
- Relevancy: the extent to which information is applicable and helpful for the task at hand (rating: 3)
- Reputation: the extent to which information is highly regarded in terms of its source or content (rating: 2.33)
- Timeliness: the extent to which information is sufficiently up-to-date for the task at hand (rating: 3)
- Understandability: the extent to which information is easily comprehended (rating: 2.33)
- Value-Added: the extent to which information is beneficial and provides advantages from its use (rating: 3)

* Expert rating: 3 = extremely important, 2 = very important, 1 = important
Each expert was assigned three browse tasks that were related to businesses of his/her
own place of origin. To increase the quality of the experts' judgment, they were first
required to provide a version of answers they had decided on after using both CBizPort
and other search engines, and to organize the answers into themes. After the data from all
subjects had been collected, the experts read subjects' answers and modified the original
answers if needed. The final version of experts' answers was obtained after this two-step
process and was used to evaluate the performance of the systems.
4.6 Experimental Results and Implications
In this section, we describe and analyze the results of our user evaluation study. Table
4.5 summarizes the system performance in search tasks (measured by accuracy) and
browse tasks (measured by precision, recall and F value). Table 4.6 shows the mean
ratings on various dimensions. Table 4.7 shows the p-values and results of testing various
hypotheses, and Table 4.8 summarizes subjects' profiles. The presented figures are rounded to two significant digits.
4.6.1 CBizPort's Assistance in Human Analysis
The results of testing hypotheses H1.1 - H1.4 show that there was no significant
difference between the accuracy, precision and recall when using or not using CBizPort's
summarizer or categorizer. We believe this could be attributed to the processing speed
and time constraint on the tasks. The summarizer processing time included the time to
fetch and process Web pages from remote servers, some of which might have slow response times or prevent automatic spidering, thus undermining the performance of the summarizer (especially when the Web pages contained the answers to the tasks). In
addition, although CBizPort's summarizer and categorizer could provide analysis
capabilities, the limited time of the experiment might not have been long enough to demonstrate the power of the two functions fully.
Table 4.5: Searching and browsing performance of CBizPort and benchmark search engines

CBizPort, basic searching (with neither summarizer nor categorizer):
  Search: accuracy, mean 37%, std. dev. 45%
  Browse: precision, mean 59%, std. dev. 43%; recall, mean 23%, std. dev. 22%; F value, mean 31%, std. dev. 25%
CBizPort, basic searching with summarizer only:
  Search: accuracy, mean 25%, std. dev. 41%
  Browse: precision, mean 51%, std. dev. 44%; recall, mean 26%, std. dev. 27%; F value, mean 33%, std. dev. 32%
CBizPort, basic searching with categorizer only:
  Search: accuracy, mean 35%, std. dev. 48%
  Browse: precision, mean 53%, std. dev. 45%; recall, mean 27%, std. dev. 29%; F value, mean 33%, std. dev. 31%
Benchmark search engines, general searching:
  Search: accuracy, mean 40%, std. dev. 50%
  Browse: precision, mean 56%, std. dev. 45%; recall, mean 22%, std. dev. 23%; F value, mean 29%, std. dev. 28%
Benchmark search engines, cross-regional searching:
  Search: accuracy, mean 28%, std. dev. 43%
  Browse: precision, mean 66%, std. dev. 42%; recall, mean 26%, std. dev. 23%; F value, mean 34%, std. dev. 25%
Combination (CBizPort + benchmark), randomly assigned*:
  Search: accuracy, mean 65%, std. dev. 48%
  Browse: precision, mean 77%, std. dev. 32%; recall, mean 43%, std. dev. 26%; F value, mean 52%, std. dev. 27%

* The random assignment for the combination was to randomly use one of the three CBizPort settings plus one of the two benchmark search engine settings for the same task.
Table 4.6: Results of users' subjective evaluations

Dimension: CBizPort mean rating* (std. dev.); benchmark search engine mean rating* (std. dev.)
Information quality (overall): 4.5 (1.1); 4.4 (1.2)
- Presentation quality and clarity: 4.6 (1.1); 4.4 (1.3)
- Coverage and reliability: 4.5 (1.1); 4.3 (1.4)
- Usability and analysis quality: 4.4 (1.3); 4.4 (1.2)
Cross-regional searching capability: 4.5 (1.3); 4.1 (1.6)
Overall satisfaction: 4.4 (1.3); 4.0 (1.7)

* The range of rating is from 1 to 7, with 7 being the best
Table 4.7: Results of hypothesis testing

H1: Enhanced Analysis Capabilities
  Search tasks: H1.1 CBiz+Summ > CBiz (p = 0.33, not confirmed); H1.3 CBiz+Categ > CBiz (p = 0.88, not confirmed)
  Browse tasks (F value): H1.2 CBiz+Summ > CBiz (p = 0.81, not confirmed); H1.4 CBiz+Categ > CBiz (p = 0.85, not confirmed)
H2: Search Engine Performance Comparison
  Search tasks: H2.1 CBiz = Bench, general (p = 0.78, confirmed); H2.3 CBiz > Bench, cross-regional (p = 0.49, not confirmed); H2.5 Combined > CBiz (p = 0.028*, confirmed); H2.7 Combined > CBiz+Summ (p = 0.00*, confirmed); H2.9 Combined > CBiz+Categ (p = 0.019*, confirmed); H2.11 Combined > Bench, general (p = 0.061, not confirmed); H2.13 Combined > Bench, cross-regional (p = 0.0040*, confirmed)
  Browse tasks (F value): H2.2 CBiz = Bench, general (p = 0.76, confirmed); H2.4 CBiz > Bench, cross-regional (p = 0.67, not confirmed); H2.6 Combined > CBiz (p = 0.0070*, confirmed); H2.8 Combined > CBiz+Summ (p = 0.0020*, confirmed); H2.10 Combined > CBiz+Categ (p = 0.025*, confirmed); H2.12 Combined > Bench, general (p = 0.0010*, confirmed); H2.14 Combined > Bench, cross-regional (p = 0.010*, confirmed)
H3: Users' Subjective Evaluations
  H3.1 Information quality: CBiz > Bench (p = 0.65, not confirmed); H3.1a presentation quality and clarity: CBiz > Bench (p = 0.60, not confirmed); H3.1b coverage and reliability: CBiz = Bench (p = 0.50, confirmed); H3.1c usability and analysis quality: CBiz > Bench (p = 0.95, not confirmed); H3.2 cross-regional searching: CBiz > Bench (p = 0.37, not confirmed); H3.3 satisfaction: CBiz > Bench (p = 0.71, not confirmed)
Additional Hypotheses
  HA1 Search performance: Sina = Openfind = YahooHK (p = 1.0, confirmed); HA2 Browse performance: Sina = Openfind = YahooHK (p = 0.72, confirmed)

Note: * alpha error = 5%. For details of the hypotheses, please refer to Table 4.3.
Table 4.8: Subjects' profiles

Attribute           Subjects' Profile
Computer literacy   Average computer literacy = 5 (range: 1-7, with 7 being "excellent")
Gender              17 subjects are male, 13 are female
Education level     10 subjects are undergraduate students, 9 subjects have earned a bachelor's degree, 10 subjects have earned a master's degree, and 1 subject did not provide education information
Age                 15 subjects are aged 18-25, 12 subjects are aged 26-30, 1 subject is aged 30-35, and 2 subjects did not provide age information
Despite non-significant results, we found from subjects' verbal comments that the
summarizer and categorizer actually helped their searching. Eleven subjects explicitly
mentioned that the summarizer and categorizer could facilitate their understanding and
searching of results. For example, subject #5 said: "CBizPort's summarizer and
categorizer are much more helpful than YahooHK's general search." Subject #26 also
said that the summarizer and categorizer "can easily extract most useful information."
We believe that CBizPort's summarizer and categorizer provide helpful analysis
capabilities for users' search and browse tasks, thus addressing Research Question
2 stated in Section 4.3.
4.6.2 Search Engine Performance Comparison
The results of testing hypotheses H2.1 - H2.14 show that 11 of the 14 hypotheses were
confirmed, while H2.3, H2.4, and H2.11 were not. As the p-values of H2.1 - H2.4 are very high (ranging from 0.49 to 0.78), we found that CBizPort performed
similarly to regional Chinese search engines in both general and cross-regional search
and browse tasks. The fact that hypotheses H2.3 and H2.4 were not confirmed might
result from the additional processing needed for cross-regional search and browse tasks in
which subjects tended to issue more queries. As a prototype system, CBizPort did not
process the queries as quickly as benchmark search engines that were professionally
developed. Moreover, CBizPort searched from different information sources while
benchmark search engines searched only from their own databases. The slower speed of
CBizPort may thus explain why hypotheses H2.3 and H2.4 were not confirmed.
Nevertheless, benchmark search engines did not outperform CBizPort. Therefore, we
conclude that CBizPort's searching and browsing performance is comparable to
that of regional Chinese search engines. We believe that further improvements need to
be made to CBizPort to enhance its performance.
In H2.5 - H2.14, the p-values were mostly below 0.05 (except for H2.11, where the p-value was 0.061, very close to 0.05), indicating that a combination of the two systems
performed significantly better than any other setting in most of the search tasks and all
browse tasks. The accuracy of search tasks and recall of browse tasks were increased
significantly. An unexpected result was that the precision of browse tasks done with the
combination also increased, because the increase in the number of correct themes was
slightly higher than the increase in the number of distinct themes obtained by the
combination. As the results only showed that a combination of two systems performed
better than one system alone, we cannot tell whether CBizPort or the benchmark
search engines contributed more to the significant improvement. Further studies are thus
needed to test whether CBizPort can significantly augment regional Chinese search
engines.
4.6.3 Users' Subjective Evaluations and Verbal Comments
The results of testing hypotheses H3.1 - H3.3 show that there was no significant
difference between the two systems' ratings. However, among all three subjective
evaluation criteria (information quality, cross-regional searching capability, and overall
satisfaction), CBizPort obtained the highest average scores (see Table 4.6).
Subjects' verbal comments, summarized in Table 4.9, revealed more about the
differences in the two systems' performance. Nine subjects agreed that CBizPort
generally performed better than the benchmark search engine, four subjects said that
the benchmark search engines performed better, and seven subjects did not explicitly say
which system performed better, or said that both systems performed similarly.
Many positive comments were made about CBizPort.
Seven subjects said that
CBizPort was user-friendly and could obtain more precise and relevant results than
the benchmark search engines. For example, subject #15 said: "Sina gives many results that
are not focused, and is poor at searching for Hong Kong and Taiwan results." Four
subjects said that they liked CBizPort's large variety of options in searching for
information from different regions and search engines. For example, subject #2 said:
"YahooHK is more limited when searching certain terms in a specific region ... while
CBizPort can fulfill what YahooHK couldn't do."
In contrast, benchmark search engines received relatively fewer positive comments.
Four subjects mentioned that they were familiar with the user interfaces and functions
because of their popularity. For example, subject #27 said: "I am familiar with the format
of Openfind. So that's the reason that I am more satisfied with it than CBizPort." Three
subjects complained about CBizPort's slow processing speed. This is understandable
because CBizPort is currently an experimental prototype that does not have the
professional operations and rich contents of benchmark search engines.
Therefore, from
the results of testing H3.1 - H3.3, we conclude that users'
subjective evaluations on information quality, cross-regional searching capability
and overall satisfaction of CBizPort are comparable to those of regional Chinese
search engines. We also found that subjects' verbal comments strongly favored
CBizPort's analysis functions, cross-regional searching capabilities and user-friendliness,
while regional Chinese search engines had more efficient operation and were more popular.
Despite this, we believe that CBizPort's information quality and cross-regional searching
capability need to be improved to significantly outperform existing regional Chinese
search engines.
Table 4.9: A summary of subjects' verbal comments

Portal: CBizPort
  Strengths:
  - Provided useful tools to enhance the searching ability
  - Allowed summarization and categorization that other search engines couldn't provide
  - More user-friendly because it allowed users to choose from different regions and data sources
  - Allowed users to type in more than one keyword in the text area
  Weaknesses:
  - The processing speed was sometimes slow
  - Too many results coming from different regions might overwhelm users
  - Users were not familiar with the categorizer
  - Insufficient time in the experiment made users feel that CBizPort's functions were not as useful as they should have been

Portal: Benchmark Search Engines
  Strengths:
  - Provided higher search speed generally
  - Users were more familiar with their interfaces and functions
  - Provided other functions (e.g., attractive images, news) that were appealing to users
  Weaknesses:
  - Search results had less variety and were sometimes less precise and relevant
  - Analysis functions are limited
  - Did not provide much information about the regions of the Web pages
4.6.4 Results of Testing Additional Hypotheses
The non-significant results of HA1 and HA2 confirmed our belief that the three
benchmark search engines have similar performance and thus could be treated as a group
in comparisons with CBizPort.
4.6.5 Implications of the Results
Three implications can be drawn from our experimental results. First, the results of
testing CBizPort's functions suggest that our framework can be used for Internet
searching and browsing in a heterogeneous environment. As Fuld et al. (2003) have
pointed out, existing business intelligence tools generally lack analysis capabilities; our
framework addresses this need by providing summarization, categorization, and meta-searching for obtaining business intelligence. However, given the non-significant results,
we are not sure whether these capabilities can provide better information quality and
analysis in a heterogeneous environment.
Second, although it is not our intent to create a new model of human information
seeking, the results point to the importance of using preview and overview in
assisting human information seeking in the context of Internet searching and browsing.
Future research on developing human information seeking models can pay more attention
to such assistance. Relevant research questions include: "How should a model of
information seeking be developed that explains the interaction between human and
automated assistance to Internet searching?" "How can such a model be applied to
information seeking involving individual differences?" "What kinds of information
overview and preview work best with humans in Internet information seeking?"
Third, cross-regional searching capability is important for building Internet search
portals in a heterogeneous environment because a language may be used in regions with
different cultural, social and economic characteristics. We have applied our framework to
developing a portal that yielded comparable performance to regional Chinese search
engines; subjects expressed a strong preference for CBizPort's cross-regional searching
capability. Apart from Chinese, some languages are widely used in regions that have
different needs for Internet searching. For example, Spanish is currently the second most
popular language in the United States and is the main language in more than 20 regions
including Latin American and South American countries. Arabic is widely used in Middle Eastern and
North African countries. Having the capability for effective cross-regional searching is a
promising direction for next-generation Internet searching and browsing.
4.7 Conclusions
In this chapter, we have applied our automatic text mining framework to building the
Chinese Business Intelligence Portal (CBizPort) that facilitates searching and browsing
in a heterogeneous environment. We have conducted a systematic evaluation to test
CBizPort's ability to assist human analysis of Chinese business intelligence. Our
experimental results show that CBizPort's analysis functions did not enable the portal to
achieve significantly better searching and browsing performance, despite subjects' many
positive comments. While CBizPort's searching and browsing performance was
comparable to that of regional Chinese search engines, a combination of the two systems
performed significantly better than using either one alone for search and browse tasks.
In addition, users' subjective evaluations on information quality, cross-regional
searching capability and overall satisfaction of CBizPort were comparable to those of regional
Chinese search engines. Subjects' verbal comments indicated that CBizPort performed
better than regional Chinese search engines in terms of analysis functions, cross-regional
searching capabilities and user-friendliness, while regional Chinese search engines had
more efficient operation and were more popular. Overall, we believe that improvements
are needed in applying the framework to addressing the heterogeneity and unmonitored
quality of information on the Web.
CHAPTER 5. APPLYING WEB PAGE VISUALIZATION
TECHNIQUES TO DISCOVERING BUSINESS INTELLIGENCE
FROM SEARCH ENGINE RESULTS
Nowadays, information overload often hinders the discovery of business intelligence
on the Web. Business analysts need to sift through large amounts of irrelevant information to
obtain insights and knowledge. The process is time-consuming and tedious. In response
to users' queries, search engines provide voluminous data that are poorly organized and
overwhelm users. Existing business intelligence tools suffer from a lack of analysis and
visualization capabilities because many of them do not reveal the underlying structure of the
data. This chapter examines the use of clustering and visualization techniques to assist
analysts to explore business intelligence on the Web. Through the experimental findings,
we demonstrate how our framework alleviated the problems and enabled exploration of
business intelligence.
5.1 Background
A study found that most of the world's information has been stored in computer
hard drives or servers (Lyman and Varian, 2000), many of which form the repository of
the Internet. (The study found that the world produces between 635,000 and 2.12 million
terabytes of unique information per year.) Consequently, business analysts often experience
information overload, given that the Internet is one of the top five sources of business
information (FuturesGroup, 1998). For example, a business analyst in the database technology field might find
it difficult to answer the following questions:
• What is the overall landscape of the database technology industry (we define "landscape" as a picture depicting an aggregate of businesses in a certain industry or field)?
• What are the different subgroups inside the community of database technology companies?
• To which group of communities in the entire competitive environment does our company belong?
• Which 10 competitors in our field most resemble us?
Answers to these questions reveal business intelligence, which is obtained through
"the acquisition, interpretation, collation, assessment, and exploitation of information"
(Davies, 2002) in the business domain.
Web search engines are commonly used to locate information for business analysis,
and business analysts usually retrieve a large number of Web pages from a simple search.
For example, Table 5.1 shows the number of results obtained from various search engines
in a search of "knowledge management." Overwhelmed by so many results, business
analysts often browse only individual pages, unable to find Web communities from all
results or to visualize the landscape that surrounds them. A textual list display of search
engines presents numerous results in a linear manner, making it hard to separate relevant
results from irrelevant ones. This result list display often leads to information overload.
Table 5.1: A search of "knowledge management" on various search engines (September 2002)

Search Engine    Number of Results
Google           3,860,000
Alltheweb        1,599,427
Lycos            14,948,890
Wisenut          2,439,622
AltaVista        4,690,123
Teoma            2,837,000
Community search engines have been developed to provide more focused searching
on the Web (SearchEngineWatch, 2001). They allow volunteers to contribute their
opinions in building the Web directory. Examples of such community search engines
include Open Directory (http://dmoz.org/), Zeal (http://www.zeal.com/),
Hotrate (http://www.hotrate.com/), and Xoron (http://www.xoron.com/).
The advantages of community search engines are their wide coverage of Internet user
groups and Web communities. Human judgment of the relevance of Web communities
is usually precise, since it is based on the volunteers' experience and acquired knowledge.
However, the quality of the Web directories depends highly on the capability of those
volunteer participants and the sizes of communities. As the Web continues to grow in
size and diversity, it is not easy for community search engines to maintain the quality of
their Web directories, so these search engines are still vulnerable to the information
overload problem.
In addition to textual result lists and community search engines, new browsing
methods are needed by business analysts to enable automatic visualization of landscapes
and discovery of communities on the Web. These new methods can potentially assist
better analysis while reducing information overload. To develop such new browsing
methods, we need to review previous research on the following issues:
• How do commercial business intelligence tools perform?
• What are the characteristics of browsing on the World Wide Web? In particular, what are the mechanics of and human factors involved in it?
• How can information visualization techniques help in discovering underlying patterns from documents (e.g., Web pages)? In particular, what analysis techniques, algorithms and visualization metaphors are used in the literature?
5.2 Related Work
In this section, we survey commercial BI tools and review previous research on browsing
the World Wide Web and on document visualization.
5.2.1 Commercial BI Tools
An overview of business intelligence (BI) tools can be found in Section 2.3.4.2. Here
we provide more details on commercial BI tools. A closer look at these tools reveals their
weaknesses in content collection, analysis, and the interfaces used to display large amounts of
information. In general, many tools simply provide different views of the collected
information (e.g., Market Signal Analyzer, a product of Docere Intelligence Inc., http://www.docere.se/, and
BrandPulse, a product of Intelliseek Inc./Planetfeedback, http://www.planetfeedback.com/biz) but do not support more
thorough analysis. Some more advanced tools use text-mining and rule-based techniques
to process collected information. For example, ClearResearch Suite, a product of ClearForest
Corporation (http://www.clearforest.com/), extracts information from documents and shows a
visual layout of relationships between entities such as people, companies, and events. However,
such analysis capability is not commonly provided by BI tools. In terms of the interface used to
display the results, many BI tools integrate their reports with Microsoft Office products and present
them in a textual format. Due to limited analysis capability, they are not capable of
illustrating the landscape of a large number of documents, thus hindering browsing and
analysis on the Web.
5.2.2 Browsing the World Wide Web
5.2.2.1 Hypertext and Browsing
Hypertext is the dominant approach to browsing the World Wide Web (see Section
2.2.2.2 for a review on "Browsing"). First introduced in the 1960s (Nelson, 1965),
hypertext is defined as non-sequential writing in which each node (representing a Web
page on the Web) on a directed graph (representing the Web) contains some amount of
text or other information and is connected by directed links to other nodes (Nielsen,
1990). Users can navigate the hypertext network by clicking on the links to the nodes.
Navigation facilities such as overview, backtracking, interaction history, timestamps, and
footprints can be built in hypertext. Due to its flexibility, hypertext is used in Internet
browsers (such as Microsoft Internet Explorer or Netscape Navigator) to provide
navigation on the Web. Users no longer need to follow a fixed sequence to browse for
information but can move freely through the information space according to their own
needs.
Despite its benefits, hypertext may lead to users' disorientation while navigating the
information space. A study found that 56 percent of the readers of a hypertext document
expressed confusion about where they were (Nielsen and Lyngbaek, 1989). The problem
is more serious in textual displays of Web pages, where only a limited amount of information
can be presented on a computer screen. Users need to click on the links many times to
browse through the whole set of Web pages related to their tasks. If a user is searching
for a broad topic on the Web with a typical search engine, he needs to browse through the
long lists of result pages in order to locate relevant results. Therefore, traditional result
list display of hypertext for Web browsing is likely to lead to the problem of user
disorientation. New browsing methods that can avoid such disorientation and reduce
information overload are needed.
5.2.2.2 Visual Displays of Textual Information
To deal with the problems of result list display of hypertext, researchers in human-computer interaction and information retrieval have proposed frameworks and techniques
to create visual displays for textual information. Shneiderman proposes a task by data
type taxonomy (TTT) for information visualizations (Shneiderman, 1996). Table 5.2
shows the seven tasks and seven data types involved in the process of viewing collections
of items. Any task can be performed using any data type.
Table 5.2: A task by data type taxonomy for viewing collections of items

Tasks
Overview            Gain an overview of the entire collection
Zoom                Zoom in on items of interest
Filter              Filter out uninteresting items
Details-on-demand   Select an item or group and get details when needed
Extract             Allow extraction of sub-collections and of the query parameters
History             Keep a history of actions to support undo, replay, and progressive refinement
Relate              View relationships among items

Data types
One-dimensional     Linear data types such as textual documents, program source code and alphabetical lists of names organized in a sequential manner
Two-dimensional     Planar or map data such as geographic maps, floor plans, or newspaper layouts
Three-dimensional   Real-world objects such as molecules, the human body, and buildings
Temporal            Time lines used in medical records, project management, or historical presentation; items have start and finish times and may overlap
Multidimensional    Relational and statistical databases in which items with n attributes become points in an n-dimensional space
Tree                Hierarchies or tree structures that are collections of items with each item having a link to one parent item (except the root)
Network             Items are linked to an arbitrary number of other items
Traditional result list display of hypertext belongs to the one-dimensional data type.
While still widely used in many Web search engines or information retrieval systems,
a result list allows only limited room for browsing (e.g., scrolling a long list of results). In
contrast, data types such as two-dimensional data, tree data, and network data allow more
browsing tasks to be done and support human visual capabilities more effectively. Four
types of visual display format are identified in Lin (1997): hierarchical displays, network
displays, scatter displays and map displays. Compared with the data types in TTT,
hierarchical displays are similar to tree data type, network displays are similar to network
data type, and both scatter displays and map displays are similar to the 2-dimensional
data type. Among these four displays, hierarchical (tree) displays were shown to be an
effective information access tool, particularly for browsing (Cutting et al., 1992). Scatter
displays most faithfully reflected the underlying structures of data among the first three
displays (Lin, 1997). Map displays could provide a view of the entire collection of
items at a distance, according to one of the earliest researchers who proposed the use of
map displays for information retrieval (Doyle, 1961). In summary, hierarchical and map
displays of Web search results can potentially alleviate the problems of traditional result
list display of hypertext. However, they are not widely used in existing search engines.
5.2.3 Document Visualization
Document visualization primarily concerns the task of getting insight into information
obtained from one or more documents without users having read those documents (Wise
et al., 1995). Most processes of document visualization involve three stages: analysis,
algorithms, and visualization (see Figure 5.1) (Spence, 2001). In the analysis stage,
essential features of a text collection are extracted according to users' interests expressed
as keywords. In the algorithms stage, an efficient and flexible structure of the document
set is created by clustering and projecting the high-dimensional structure into a two or
three-dimensional space. In the visualization stage, the data are presented to users and
made sensitive to interaction. The following subsections review in more detail techniques
used in the three stages of document visualization.
Figure 5.1: A typical document visualization process. Analysis: extract useful attributes from the documents. Algorithms: cluster similar documents and reduce the dimensionality of the original representation. Visualization: display the encoded data in a visual format.
5.2.3.1 Document Analysis
Document analysis relies on meta-searching and Web mining techniques to
respectively collect and analyze the documents. As a highly effective method of resource
discovery and collection on the Web, meta-searching has been reviewed in Section
2.3.3.1. The three categories of Web mining, Web content, structure, and usage mining,
have been reviewed in Section 2.3.3.2. Here we point out that the concept of "Web
community" in Web structure mining is very close to the concept of "cluster" in Web
content mining. Both involve grouping Web pages having similar attributes, such as link
structure or content information. For the purpose of finding Web pages that form
communities, the approach used by He et al. (reviewed in Section 2.3.3.2) has the
advantage of combining both Web content information and Web structure information for
clustering.
5.2.3.2 Algorithms
Many algorithms for creating meaningful structure for a document set are available.
In particular, cluster algorithms and multidimensional scaling algorithms are frequently
used in visualization. Cluster algorithms classify objects into meaningful disjoint subsets
or partitions (see Section 2.3.2.2 for an overview of clustering). An important objective
of cluster algorithms is to achieve high homogeneity within each cluster and large
disassociation between different clusters. Two categories of cluster algorithms are
commonly used: hierarchical and partitional (Jain and Dubes, 1988).
5.2.3.2.1 Hierarchical Clustering
Hierarchical clustering is a procedure for transforming a proximity matrix into a
sequence of nested partitions (Grabmeier and Rudolph, 2002). In general, the method
takes input from a population of n objects and a proximity matrix of these objects.
Starting with the n one-element clusters, the method combines a pair of clusters into one
cluster, thereby reducing the total number of clusters by one. The process of combining
clusters is repeated until only one cluster remains. Variations of the way to combine pair
of clusters in this general procedure result in different hierarchical clustering algorithms.
The single-link method combines the pair of clusters with the greatest proximity (or
smallest distance) and forms maximally connected subgraphs. In contrast, the complete-link method forms maximally complete subgraphs by combining the pair of clusters that
have the maximum proximity among all of the minimum cluster proximities. Other
similar methods of clustering include average link method, centroid method, weighted
average method, unweighted centroid method, weighted centroid method, and Ward's
method (Ward, 1963). They differ from single-link and complete-link methods by
applying different weightings to the proximity updating formula proposed by Lance and
Williams (1967), thus avoiding the extremes of those two methods. The strengths of
hierarchical clustering are its computational efficiency and its ability to present results in
the form of a taxonomy (dendrogram) that allows analysts to see how objects are
organized into clusters. The weaknesses include the adverse chaining effect found in the
single-link method, the tendency for the hierarchical structure to change dramatically
with small changes in the rank orders of proximities, and the vulnerability of the complete-link method to ties in the proximity matrix (ties refer to the presence of two or more
edges having the same proximity value) (p.79, (Jain and Dubes, 1988)).
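As a concrete illustration of the two variants discussed above, the following is a minimal Python sketch that builds single-link and complete-link hierarchies from a small distance matrix and cuts each dendrogram into a fixed number of clusters. It assumes SciPy is available; the toy distance matrix and the choice of two clusters are illustrative and not taken from any system described in this dissertation.

# A minimal sketch of agglomerative (hierarchical) clustering, assuming SciPy is available.
# The toy distance matrix and the requested number of clusters are illustrative only.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Symmetric pairwise distance matrix for five objects (toy data).
D = np.array([
    [0.0, 0.2, 0.9, 0.8, 0.7],
    [0.2, 0.0, 0.8, 0.9, 0.6],
    [0.9, 0.8, 0.0, 0.1, 0.3],
    [0.8, 0.9, 0.1, 0.0, 0.2],
    [0.7, 0.6, 0.3, 0.2, 0.0],
])

condensed = squareform(D)  # SciPy expects the condensed (upper-triangular) form

for method in ("single", "complete"):
    Z = linkage(condensed, method=method)            # nested sequence of merges (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two flat clusters
    print(method, labels)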
5.2.3.2.2 Partitional Clustering
Partitional clustering assigns objects into groups such that objects in a cluster are
more similar to each other than to objects in different clusters. The number of groups may
or may not be specified beforehand. In general, partitional clustering starts with an initial
partition with K clusters. Then a new partition is generated by assigning objects to their
closest cluster centers and new cluster centroids are computed. This step is repeated until
an optimal value of the criterion function is obtained. Then the clusters are merged or
split and the whole process is repeated until the cluster membership stabilizes. Typically,
a clustering criterion is adopted to guide the search for optimal grouping. The square-error clustering criterion tries to minimize the sum of within-cluster variations (Gordon
and Henderson, 1977). A graph-theoretic criterion called normalized cut treats clustering
as graph partitioning and computes the normalized cost of cutting a graph. The criterion's
use in image segmentation (Shi and Malik, 2000) and Web page clustering (He et al.,
2001) has achieved good results. Although partitional clustering tries to achieve optimal
result, it is usually difficult to evaluate all possible partitions because the number of
partitions is extremely large. In fact, finding optimal graph partitioning has been shown
to be NP-complete (Garey and Johnson, 1979). Therefore, heuristics are needed to find
good values of the selected criterion. Examples of such heuristics include genetic
algorithms, taboo search, scatter search, and simulated annealing. Genetic algorithms
(GA) simulate the evolutionary process of species that reproduce sexually and are
essentially a parallel hill-climbing technique that performs global searching for the
optimal value (Holland, 1975) (see Section 2.3.2.1 for an overview of GA). Taboo search
modifies a solution locally and repeatedly while memorizing these modifications to avoid
visiting the same solution twice or in a cyclic manner (Glover, 1986). Scatter search is
based on a population of solutions (integer vectors) that evolves through selection, linear
combination, integer vector transformation and culling to produce a new population of
solutions (Glover, 1977). Simulated annealing continually moves in the direction of
increasing value but allows the search to take some downhill steps to escape the local
maximum (van Laarhoven and Aarts, 1988). It has been pointed out that the
implementations of these general solving methods are increasingly similar and are
grouped under the name "adaptive memory programming" (Taillard et al., 2001).
Among these techniques, GA performs best when the search space is very large
because of its global searching capability. Being a general optimization technique, GA
has been successfully used in information retrieval to find the best document description
(Gordon, 1988), to identify new documents (Chen et al., 1998c), to spider Web pages
(Chen et al., 1998a), to learn from user interests in Web searching (Nick and Themis,
2001), and to partition graphs (Maini et al., 1994; Bui and Moon, 1996).
Both hierarchical and partitional clustering have their strengths and weaknesses.
While no theory regarding the best clustering method for a particular application exists
(p.88, (Jain and Dubes, 1988)), factors such as computational efficiency, quality of
clusters formed, and visual impact can be considered. Grabmeier and Rudolph pointed
out that hierarchical clustering is good for initial partitioning, but that partitional
clustering seeks to achieve optimization (Grabmeier and Rudolph, 2002). Thus partitional
clustering brings about higher-quality partitions. However, hierarchical clustering is often
efficient and provides a visual dendrogram. For use in visualization and Web browsing, it
appears that the combination of hierarchical and partitional cluster methods will provide a
better clustering quality.
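To make the generic partitional procedure described above concrete, the following is a minimal k-means-style sketch in Python (NumPy only). The toy data, the number of clusters, and the convergence test are illustrative assumptions; note that the systems described later in this chapter use a genetic algorithm with a normalized cut criterion rather than k-means, so this sketch only illustrates the general square-error procedure.

# A minimal sketch of partitional (k-means-style) clustering with a square-error criterion.
# Toy data and parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # 100 objects with 2 features (toy data)
K = 3                                   # number of clusters, fixed beforehand

centroids = X[rng.choice(len(X), K, replace=False)]   # initial partition: K random seeds
for _ in range(100):
    # Assign each object to its closest cluster centre.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute cluster centroids (keep the old centroid if a cluster becomes empty).
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])
    if np.allclose(new_centroids, centroids):   # cluster membership has stabilised
        break
    centroids = new_centroids

square_error = ((X - centroids[labels]) ** 2).sum()   # within-cluster variation being minimised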
5.2.3.2.3 Multidimensional Scaling
Multidimensional scaling (MDS) algorithms comprise a family of techniques that
portray a data structure in a spatial fashion (Young, 1987). MDS constructs a geometric
representation of the data (such as a similarity matrix), usually in a Euclidean space of
low dimensionality. Based on earlier work on multidimensional psychophysics
(Richardson, 1938) and point mutual distances (Young and Householder, 1938),
Torgerson provided the first systematic procedure for determining a multidimensional
map of points (metric solutions) from errorful interpoint distances (Torgerson, 1952).
Kruskal introduced nonmetric multidimensional scaling by using his least square
monotonic transformation and the method of steepest descent (Kruskal, 1964). Takane,
Young and de Leeuw consolidated many earlier developments into a single approach that
is capable of either metric or nonmetric analysis using either a weighted or an unweighted
Euclidean model (Takane et al., 1977). Their approach, ALSCAL, has become the
standard MDS procedure in many statistical packages such as SPSS.
Apart from its theoretical aspect, MDS has been applied in many different domains.
He and Hui (2002) applied MDS to display author cluster maps in their author co-citation
analysis. Eom and Farris (1996) applied MDS to author co-citation in decision support
systems (DSS) literature from 1971 through 1990 in order to find the fields contributing to
DSS. McQuaid et al. (1999) used MDS to visualize relationships between documents on
group memory in an attempt to overcome information overload. Kanai and Hakozaki
(2000) used MDS to lay out 3D thumbnail objects in an information space to visualize
user preferences. Kealy (2001) applied MDS to study changes in knowledge maps of
groups over time to determine the influence of a computer-based collaborative learning
environment on conceptual understanding. Although much has been done to visualize
relationships of objects in different domains using MDS, no attempts to apply it to the
discovery of business intelligence on the Web have been found. In addition, no existing
search engine applies MDS to facilitate Web browsing.
5.2.3.3 Visualization
Visualization is the process of displaying encoded data in a visual format that can be
perceived by human eyes. The output often takes the form of a knowledge map, which is
a knowledge representation that reveals underlying relationships of its knowledge
sources, such as Web page content, newsgroup messages, business market trends,
newspaper articles, and other textual and numerical information.
5.2.3.3.1 Knowledge Map
Early work on creating knowledge maps involves manual drawing of blocks and
connecting lines representing concepts and relationships respectively, such as Concept
Map (Novak and Gowin, 1984) and Mind Map (Buzan and Buzan, 1993). As the Web
was introduced in the 1990s and has become a major knowledge repository,
automatically generated knowledge maps have been proposed. Based on automatic
indexing techniques (Salton, 1989), CYBERMAP provides a personalized overview of a
subset of all documents of interest to a user, based on his/her profile (Gloor, 1991). It
clusters documents into distinct groups called HYPERDRAWERs, among which
similarity links are created automatically using a keyword similarity function and a
dynamically adjustable threshold value. Because CYBERMAP does not use the explicit
link structure (e.g., hyperlinks) of the documents, it may lose valuable information about
link structure in hypertext documents (e.g., Web pages).
The Galaxy of News system displays news articles and titles in a three-dimensional
space and allows users to move in a continuous fashion (Rennison, 1994). Users can
control the zooming level to decide in how much detail they want to browse the news
content. The problems of the system are its use of a single color (white text, black
background) to display content and serious overlapping of texts when users choose to
browse more content. Another three-dimensional colored display, called Themescape,
generates a landscape showing a set of news articles on a map with contour lines (Wise et
al., 1995). Documents in a themescape are represented by small points and those with
similar content are placed close together by using proprietary lexical algorithms. Peaks
represent a concentration of closely related documents; valleys contain fewer documents
and more unique content. Themescape was implemented as Cartia's NewsMap
(http://www.cartia.com/) to show articles related to the financial industry. It
allows users to specify a focus circle, to flag certain points, and to display details of
articles. However, when many articles are placed close to a peak, it is difficult for users
to distinguish them without viewing details of all articles.
5.2.3.3.2 Kohonen Self-organizing Map Visualization
A neural network technique, called Kohonen's self-organizing map (SOM), takes a
set of input objects and maps them onto the nodes of a two-dimensional grid (Kohonen,
1995). Lin et al. (1991) used a single-layer SOM to cluster concepts in a small database
of 140 abstracts related to artificial intelligence. Using manually indexed descriptors as
concepts, they found that SOM was able to create a semantic map that captured the
relationships between concepts. Chen et al. (1996) applied SOM to generate a
hierarchical knowledge map automatically by categorizing around 110,000 Web pages
contained in the Yahoo! entertainment subcategory. In a subsequent experiment, Chen et
al. (1998b) showed that SOM performed very well with broad browsing tasks and
subjects liked the visual and graphical aspects of the map. Lin (1997) applied SOM to
display 1,287 documents found in DIALOG's INSPEC database. He used the 637 most
frequently occurring words to index the documents. Yang et al. (2002) showed that
both fisheye and fractal views can increase the effectiveness of the SOM visualization.
Kartoo, a commercial search engine, presents search results as interconnected objects on
a map (http://www.kartoo.com/). Each page shows ten results represented as
circles of varying sizes corresponding to their relevance to the search query. The circles
are interconnected by lines showing common keywords for the results. Different details
(such as summary of results, related keywords) are provided while users are moving the
mouse over the screen. However, the placement of results on the screen does not bear
specific meaning, nor does it reflect similarities between Web pages.
5.3 Research Questions
The literature review in Section 5.2 reveals three research gaps. First, commercial BI
tools suffer from a lack of analysis and visualization capabilities. There is a need to
develop better methods to enable visualization of landscape and discovery of
communities from public sources such as the Web. Second, hierarchical and map displays
were shown to be effective ways of accessing and browsing information. However, they
have not been widely used to discover business intelligence on the Web. Third, none of
the existing search engines allows users to visualize the relationships among the search
results in terms of their relative closeness. We state our three research questions as
follows.
1. How can our automatic text mining framework be used to assist in exploring business
intelligence on the Web?
2. How can hierarchical and map displays of information be used in the framework?
3. What are the effectiveness, efficiency, and usability of using the framework in
exploring business intelligence on the Web, in comparison with a textual result list
and a graphical result map used by search engines?
5.4 Application of the Framework
We have applied our automatic text mining framework to developing a system, called
Business Intelligence Explorer (BIE), that assists in the discovery and exploration of
business intelligence from a large number of Web pages. Figure 5.2 (modified from Figure
3.1) shows the framework components (in blue ovals) used to develop BIE. Summarized
in Figure 5.3, the specific processes included: data collection, automatic parsing and
indexing, co-occurrence analysis, Web community identification, and knowledge map
creation.
Figure 5.2: Framework components used to develop BIE. The figure shows the framework's stages (collection, conversion/extraction, analysis, and visualization) applied to the Web and the hidden Web, with intermediate products: Web pages and documents, a tagged collection, indexes and relationships (data and text bases), similarities/classes/clusters, and hierarchies/maps/graphs (knowledge bases).
Figure 5.3: System architecture of BIE. Business Web pages on the nine topics (e.g., knowledge management, database technology, CRM, ERP) are collected from the Web by meta-searching AltaVista, AlltheWeb, Yahoo, MSN, LookSmart, Teoma, and Wisenut; automatic parsing and indexing (parser, indexer, noun phraser) produces page indexes; co-occurrence analysis produces similarities; Web community identification produces page clusters; and knowledge map creation generates the final maps.
5.4.1 Data Collection
Data were collected in two steps: identifying key terms and meta-searching. Key
topics identified in the first step were used as input queries in meta-searching.
5.4.1.1 Identifying Key Terms
The purpose of this step was to identify key terms that could be used as business
intelligence queries to search for Web pages. These queries are all related to "business
intelligence" because we wanted to demonstrate the capability of our framework in
discovering business intelligence on the Web. To identify the queries, we first entered the
term "business intelligence" into the INSPEC literature indexing system. INSPEC is one
of the leading English-language bibliographic information services providing access to
the world's scientific and technical literature. It is used by IT professionals, business
practitioners and researchers to search for business and technical articles. The INSPEC
system returned 281 article abstracts published between 1969 and 2002, with a majority
of articles (230 articles) published in the last 5 years. The earliest one was written by H.
Luhn (1969) on the topic "A business intelligence system." He was considered to be a
pioneer in developing business intelligence systems. Based on the keywords appearing in
the titles and abstracts, we identified the following nine key terms: knowledge
management, information management, database technology, customer relationship
management, enterprise resource planning, supply chain management, e-commerce
solutions, data warehousing, business intelligence technology. These became the nine key
topics on business intelligence and were shown on the front page of the system's user interface (Figure 5.4).
Figure 5.4: User interface of the Business Intelligence Explorer. The front page lists the nine business intelligence topics; users select a topic on the side and then choose one of the three browsing methods (result list display, hierarchical Web community display, or knowledge map display) from the tabs above.
5.4.1.2 Meta-searching
Using the nine business topics, we performed meta-searching on seven major search
engines: AltaVista, AlltheWeb, Yahoo, MSN, LookSmart, Teoma, and Wisenut. They
were the major search engines also used by Kartoo, which was compared with the
knowledge map of our system. Kartoo is a new meta-search engine that presents results
in a map format. We aimed to create a collection that was comparable to the one used by
Kartoo. From each of the seven search engines, we collected the top 100 results. As page
redirection was used in the front page of many Web sites, our spider automatically
followed these URLs to fetch the pages that were redirected. Since we were interested in
only business Web sites, URLs from educational, government, and military domains
(with host domain "edu", "gov", and "mil" respectively) were removed. Further filtering
was applied to remove non-English Web sites, academic Web sites that did not use the
"edu" domain name, Web directories and search engines, online magazines, newsletters,
general news articles, discussion forums, case studies, etc. In total we collected 3,149
Web pages from 2,860 distinct Web sites, or about 350 Web pages for each of the nine
topics. Each Web page represented one Web site.
5.4.2 Automatic Parsing and Indexing
Since Web pages contain both textual content and HTML tag information, we needed
to parse out this information to facilitate further analysis. In this step, we used a parser to
automatically extract key words and hyperlinks from the Web pages collected in the
previous step. A stop word list of 444 words was used to remove non-semantic-bearing
words (e.g. "the," "a," "of," "and"). Using HTML tags (such as <TITLE>, <H1>, <IMG
SRC= 'car.gif
alt= 'Ford'>), the parser also identified types of words and
indexed the words appearing in each Web page. Four types of word were identified (in
descending order of importance): title, heading, content text, and image alternate text. If a
word belonged to more than one type, then the most important type was used to represent
that term in the Web page. The word type information was used in the co-occurrence
analysis step (discussed below).
Then we used Arizona Noun Phraser (AZNP) to extract and index all the noun
phrases from each Web page automatically based on part-of-speech tagging and linguistic
rules (Tolle and Chen, 2000). Developed at the University of Arizona, AZNP has three
components. The tokenizer takes the full text of each Web page as input and creates
output that conforms to the Penn Treebank word tokenization rules by separating all
punctuation and symbols from text (Marcus, 1999). The tagger module assigns a part-of-speech to every word in the Web page. The last module, called the phrase generation
module, converts the words and associated part-of-speech tags into noun phrases by
matching tag patterns to noun phrase patterns given by linguistic rules. For example, the
phrase strategic knowledge management will be considered a valid noun phrase because
it matches the noun phrase rule: adjective + noun + noun. Then, we treated each key
word or noun phrase as a subject descriptor.
Based on a revised automatic indexing technique (Salton, 1989), we computed the
importance of each descriptor or term in representing the content of the Web page. We
measured the term's level of importance by term frequency
and inverse Web page
frequency. Term frequency measured how often a particular term occurs in a Web page.
Inverse Web page frequency indicated the specificity of the term and allowed terms to
acquire different strengths or levels of importance based on their specificity. A term
could be a one-, two-, or three-word phrase.
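A minimal Python sketch of this weighting is shown below. It assumes the weighting takes the form d_ij = tf_ij x log((N / df_j) x w_j) x (term type factor)_j, as reconstructed in Figure 5.5; the base-2 logarithm and the toy numbers are our assumptions rather than details taken from the original implementation.

# A minimal sketch of the term weighting described above: term frequency multiplied by an
# inverse Web page frequency that also rewards longer (more specific) phrases, scaled by a
# term type factor (title > heading > content text > image alternate text).
# The log base and the example numbers are assumptions for illustration.
import math

TERM_TYPE = {"title": 1, "heading": 2, "content": 3, "image_alt": 4}

def term_type_factor(term_type: str) -> float:
    # 1 + (10 - 2 * type) / 10  ->  title 1.8, heading 1.6, content 1.4, image alt 1.2
    return 1.0 + (10 - 2 * TERM_TYPE[term_type]) / 10.0

def term_importance(tf: int, df: int, n_pages: int, n_words: int, term_type: str) -> float:
    """Importance d_ij of term j in Web page i."""
    return tf * math.log2((n_pages / df) * n_words) * term_type_factor(term_type)

# Example: the phrase "strategic knowledge management" (3 words) occurring twice in the
# title of a page, in a collection of 3,149 pages where 40 pages contain the phrase.
print(term_importance(tf=2, df=40, n_pages=3149, n_words=3, term_type="title"))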
5.4.3 Co-occurrence Analysis
Co-occurrence analysis converted raw data (indexes and weights) obtained from the
previous step into a matrix that showed the similarity between every pair of Web sites.
The similarity between every pair of Web sites combined their content and structural
(connectivity) information. He et al. (2001) computed the similarity between every pair
of Web pages by a combination of hyperlink structure, textual information, and co-citation. However, their algorithm placed a stronger emphasis on co-citation than
hyperlink and textual information. When no hyperlink existed between a pair of Web
pages, the similarity weight included only co-citation weight, even if their textual content
was very similar. The same situation arose when no common word appeared in the
pair of Web pages, even if many hyperlinks existed between them.
In order to impose a more flexible weighting on the three types of information, we
modified He et al.'s algorithm to compute the similarity. Figure 5.5 shows the formulae
used in our co-occurrence analysis. We normalized each of the three parts in computing
the similarity and assigned a weighting factor to each of them independently. We
computed the similarity of textual information (Sij) by an asymmetric similarity function,
which was shown to perform better than the cosine function (Chen and Lynch, 1992). When
computing the term importance value (dij), we included a term type factor that reflected
the importance of the term inside a Web page. Using the formulae shown in Figure 5.5, a
similarity matrix for every pair of Web sites in each of the nine business intelligence
topics was generated.
Similarity between site i and site j:

    W_ij = alpha * A_ij / ||A|| + beta * S_ij / ||S|| + (1 - alpha - beta) * C_ij / ||C||

where A, S, C are the matrices of the A_ij, S_ij, and C_ij values respectively; alpha and beta are parameters between 0 and 1, with 0 < alpha + beta < 1.

A_ij = 1 if site i has a hyperlink to site j; A_ij = 0 otherwise.

S_ij = asymmetric similarity score between site i and site j (Chen and Lynch, 1992):

    S_ij = sim(D_i, D_j) = ( sum_{k=1..p} d_jk ) / ( sum_{k=1..n} d_ik )

and S_ji = sim(D_j, D_i) is defined analogously with the denominator summed over the m terms in D_j, where
    n = total number of terms in D_i
    m = total number of terms in D_j
    p = total number of terms that appear in both D_i and D_j

    d_ij = tf_ij * log( (N / df_j) * w_j ) * (term type factor)_j
    tf_ij = number of occurrences of term j in Web page i
    df_j = number of Web pages containing term j
    w_j = number of words in term j
    N = total number of Web pages in the collection
    (term type factor)_j = 1 + (10 - 2 * type_j) / 10, where type_j is the minimum (most important) of: 1 if term j appears in the title, 2 if it appears in a heading, 3 if it appears in the content text, and 4 if it appears in the image alternate text.

C_ij = number of Web sites pointing to both site i and site j (co-citation matrix)

Figure 5.5: Formulae used in co-occurrence analysis
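The following Python sketch illustrates the combined similarity computation of Figure 5.5 at the matrix level. The choice of the Frobenius norm for ||.||, the values of alpha and beta, and the toy matrices are illustrative assumptions, not details taken from the original system.

# A minimal sketch of W = alpha*A/||A|| + beta*S/||S|| + (1 - alpha - beta)*C/||C||.
# The norm choice (Frobenius), alpha, beta, and the toy matrices are illustrative assumptions.
import numpy as np

def combined_similarity(A, S, C, alpha=0.3, beta=0.4):
    """Combine hyperlink (A), textual (S), and co-citation (C) information into one matrix."""
    assert 0 < alpha + beta < 1
    norm = lambda M: np.linalg.norm(M) or 1.0   # avoid division by zero for an all-zero matrix
    return alpha * A / norm(A) + beta * S / norm(S) + (1 - alpha - beta) * C / norm(C)

# Toy matrices for three Web sites.
A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)          # hyperlinks
S = np.array([[1.0, 0.6, 0.1], [0.5, 1.0, 0.2], [0.1, 0.3, 1.0]])     # asymmetric text similarity
C = np.array([[0, 2, 1], [2, 0, 0], [1, 0, 0]], dtype=float)          # co-citation counts

W = combined_similarity(A, S, C)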
5.4.4 Web Community Identification
We define a Web community as a group of Web sites that exhibit high similarity in
their textual and structural information. The term "Web community" originally was used
by researchers in Web structure mining to refer to a set of Web sites that are closely
related through the existence of links among them (Gibson et al., 1998; Kumar et al.,
1999; Flake et al., 2000; Flake et al., 2002). In contrast, researchers in Web content
mining prefer to use the term "cluster" to refer to a set of Web pages that are closely
related through the co-occurrence of keywords or phrases among them (Chen et al., 1996;
Schatz, 2002). We chose to use "Web community" to refer to our groups of similar Web
sites because it connotes the meanings of common interests and mutual references (but
not just a group of similar objects as connoted by "cluster").
To identify Web communities for each business intelligence topic, we modeled the
set of Web sites as a graph consisting of nodes (Web sites) and edges (similarities). Based
on previous research, hierarchical and partitional clustering had been found applicable for
different purposes but it was not likely that any one method was the best (Hartigan,
1985). Moreover, contradictory results have been obtained from previous studies on
comparing clustering methods (Milligan, 1981). We therefore chose a
combination of hierarchical and partitional clustering so as to obtain the benefits of both
methods. We used a partitional cluster method to partition the Web graph recursively in
order to create a hierarchy of clusters. This way, we could obtain the clustering quality of
the partitional method while being able to show the results in a visual dendrogram.
However, partitional clustering is computationally intensive. As graph partitioning
had been shown to be NP-complete (Garey and Johnson, 1979), search heuristics were
required to find good solutions. To obtain high quality of clustering using partitional
methods, we used an optimization technique that finds the "best" partition point in each
cluster task. Being a global search technique, genetic algorithms (GA) can perform a
more thorough space search than some other optimization techniques (such as taboo
search and simulated annealing). GA is also suitable for a large search space such as the
Web graph. Therefore, we selected GA as the optimization technique in our graph
partitioning. During each iteration, the algorithm tries to find a way to bipartition the
graph such that a certain criterion (the fitness function) is optimized. Based on previous
work on Web page clustering (He et al., 2001) and image segmentation (Shi and Malik,
2000), we used a normalized cut criterion as the GA's fitness function. The normalized
cut criterion measures both the total dissimilarity between different partitions as well as
the total similarity within the partitions. It has been shown to outperform the minimum
cut criterion, which favors cutting small sets of isolated nodes in the graphs (Wu and
Leahy, 1993). Figure 5.6 shows the formulae used to compute the normalized cut in the
partitioning of a graph into two parts: A and B.
In our GA partitioning, we used Nasso(x) as the fitness function and the GA tried to
maximize the function value. Figure 5.7 shows the steps in the GA and Figure 5.8
illustrates how the GA worked by a simplified example, in which we set the maximum
number of levels in the hierarchy to be 2 and the maximum number of nodes in the
bottom level to be 5. Ten nodes were initially partitioned into graph A and graph B,
which were on the first level of the hierarchy. Since graph B contained less than 5 nodes,
the partitioning stopped there. Graph A continued to be partitioned to create graphs C and
D, which were on the second level of the hierarchy. Then the whole procedure stopped
because the maximum number of levels and the maximum number of nodes in the bottom
level had been reached. The graphs (A, B, C, D) partitioned in the process were
considered to be Web communities. In our actual Web site partitioning, the maximum
number of levels was 5 and the maximum number of nodes in the bottom level was 30.
Web communities were labeled by the top 10 phrases having the highest term importance
value (d_ij, shown in Figure 5.5). Manual selection among these 10 phrases was used to
obtain the one that best described the community of Web sites.
A cut on a graph G = (V, E) is the removal of a set of edges such that the graph is split into two disconnected sub-graphs:

    cut(A, B) = sum over u in A, v in B of w_uv

where w_uv = similarity between node u and node v, and A, B are two distinct partitions of G. x is a binary vector showing which nodes belong to which partition: when the k-th digit is 0, the k-th node belongs to partition A; when the k-th digit is 1, the k-th node belongs to partition B (for example, x = 01001101).

The normalized cut is a fraction of the total edge connections to all the nodes in the graph:

    Ncut(x) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)

where assoc(A, V) = sum over u in A, t in V of w_ut is the total connection from nodes in A to all nodes in the graph, and assoc(B, V) is similarly defined. It was also shown in Shi and Malik (2000) that Ncut(x) = 2 - Nasso(x), where

    Nasso(x) = assoc(A, A) / assoc(A, V) + assoc(B, B) / assoc(B, V)

reflects how tightly on average nodes within a group are connected to each other. In fact, minimizing the dissimilarity between the two partitions (i.e., minimizing Ncut(x)) and maximizing the similarity within the partitions (i.e., maximizing Nasso(x)) are identical.

Figure 5.6: Formulae used to compute normalized cut
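A short Python sketch of these two measures follows; the toy weight matrix and partition vector are illustrative, and the code assumes non-negative similarity weights.

# A minimal sketch of the cut, Ncut, and Nasso measures from Figure 5.6.
# W is a symmetric similarity (weight) matrix; x is a binary vector (0 = partition A, 1 = partition B).
import numpy as np

def ncut_and_nasso(W, x):
    x = np.asarray(x, dtype=bool)
    A, B = ~x, x
    cut_ab = W[np.ix_(A, B)].sum()          # total weight of edges between A and B
    assoc_av = W[A, :].sum()                # connections from A to all nodes
    assoc_bv = W[B, :].sum()                # connections from B to all nodes
    assoc_aa = W[np.ix_(A, A)].sum()        # connections within A
    assoc_bb = W[np.ix_(B, B)].sum()        # connections within B
    ncut = cut_ab / assoc_av + cut_ab / assoc_bv
    nasso = assoc_aa / assoc_av + assoc_bb / assoc_bv
    return ncut, nasso                      # note: ncut == 2 - nasso

W = np.array([[0, .8, .1, 0],
              [.8, 0, .2, .1],
              [.1, .2, 0, .9],
              [0, .1, .9, 0]])
print(ncut_and_nasso(W, [0, 0, 1, 1]))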
5.4.5 Knowledge Map Creation
In this step, we used Multidimensional Scaling (MDS) to transform a high-dimensional
similarity matrix into a 2-dimensional representation of points and displayed them on a
map. As described in our literature review, MDS has been applied in different domains
for visualizing the underlying structure of data. We used Torgerson's classical MDS
procedure which does not require iterative improvement (Torgerson, 1952). The
procedure was shown to work with non-Euclidean distance matrix (such as the one we
used here) by giving approximation of the coordinates (Young, 1987). We detail the
MDS procedure as follows.
1. Convert the similarity matrix into a dissimilarity matrix by subtracting each
element from the maximum value in the original matrix. Call the new dissimilarity
matrix D.
2. Calculate matrix B, which contains the scalar products, by using the cosine law. Each
element in B is given by

b_{ij} = -\frac{1}{2}\left( d_{ij}^2 - \frac{1}{n}\sum_{g=1}^{n} d_{gj}^2 - \frac{1}{n}\sum_{h=1}^{n} d_{ih}^2 + \frac{1}{n^2}\sum_{g=1}^{n}\sum_{h=1}^{n} d_{gh}^2 \right)

where d_{ij} is an element in D and n = number of nodes in the Web graph.
3. Perform a singular value decomposition on B and use the following formulae to
find the coordinates of the points:

B = U V U'    (1)
X = U V^{1/2}    (2)

where U has the eigenvectors of B in its columns and V has the corresponding eigenvalues
on its diagonal. Combining (1) and (2), we have B = X X'. We then used the first
two column vectors of X to obtain the 2-dimensional coordinates of points, which were
used to place the Web sites on our knowledge maps.
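The following is a minimal numpy sketch of this classical MDS procedure, included for illustration only; the function name is ours, and an eigendecomposition of the symmetric matrix B is used in place of a general singular value decomposition.

    import numpy as np

    def classical_mds(S, k=2):
        """Torgerson's classical MDS: similarity matrix S -> k-dimensional coordinates."""
        D = S.max() - S                        # step 1: dissimilarity matrix D
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
        B = -0.5 * J @ (D ** 2) @ J            # step 2: scalar products (equivalent to the cosine-law formula)
        vals, vecs = np.linalg.eigh(B)         # step 3: eigendecomposition (SVD of the symmetric B)
        order = np.argsort(vals)[::-1][:k]     # keep the k largest eigenvalues
        U, V = vecs[:, order], np.clip(vals[order], 0, None)
        return U * np.sqrt(V)                  # X = U V^(1/2); the first two columns give the 2-D map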
0. Set up parameters for convergence:
- GA convergence criteria: population size, maximum number of generations, cross-over and mutation probabilities
- Hierarchical clustering convergence criteria: maximum number of levels in hierarchy, minimum number of nodes
1. A chromosome is represented by a binary vector which has the same number of digits as the number of nodes in the
graph to be bipartitioned. Each gene represents a Web site on a partition. The chromosome is equivalent to the
binary vector x in Figure 5.6.
2. Initialize the population randomly.
3. Evaluate the fitness of each chromosome using the Nassoc measure.
4. Spin the roulette wheel to select individual chromosomes for reproduction based on the probability of their fitness
relative to the total fitness value of all chromosomes.
5. Apply cross-over and mutation operators.
6. Evaluate the fitness and find the highest fitness value in this generation.
7. Check if GA convergence criteria are met. If not, go back to step 3. Otherwise, use the best chromosome to guide the
graph partitioning, store the results in a tree structure, and proceed to step 8.
8. Check if the hierarchical clustering convergence criteria are met. If not, go back to step 2 and apply GA to partition
each of the two graphs partitioned in step 7. Otherwise, stop the whole procedure and return the results.
Figure 5.7: Steps in using a genetic algorithm for recursive Web graph partitioning
Figure 5.8: A simplified example of GA graph partitioning
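For illustration, the following is a compact sketch of the bipartitioning loop outlined in Figure 5.7, with assumed default parameter values; the recursive hierarchical step (step 8) and the exact convergence checks are omitted, and the function names are ours.

    import numpy as np

    def nassoc(W, x):
        """Nassoc(x) = assoc(A,A)/assoc(A,V) + assoc(B,B)/assoc(B,V); 0 for degenerate splits."""
        x = np.asarray(x, dtype=bool)
        if x.all() or not x.any():
            return 0.0
        A, B = ~x, x
        return (W[np.ix_(A, A)].sum() / W[A, :].sum()
                + W[np.ix_(B, B)].sum() / W[B, :].sum())

    def ga_bipartition(W, pop_size=50, generations=100, p_cross=0.8, p_mut=0.01, seed=None):
        """Evolve a binary partition vector (chromosome) that maximizes Nassoc over W."""
        rng = np.random.default_rng(seed)
        n = W.shape[0]
        pop = rng.integers(0, 2, size=(pop_size, n))            # step 2: random population
        best = pop[0]
        for _ in range(generations):                            # steps 3-7
            fit = np.array([nassoc(W, x) for x in pop])         # step 3: Nassoc fitness
            best = pop[int(fit.argmax())].copy()
            probs = (fit + 1e-12) / (fit + 1e-12).sum()         # step 4: roulette-wheel selection
            parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
            for i in range(0, pop_size - 1, 2):                 # step 5: one-point cross-over
                if rng.random() < p_cross:
                    cut = int(rng.integers(1, n))
                    parents[i, cut:], parents[i + 1, cut:] = (
                        parents[i + 1, cut:].copy(), parents[i, cut:].copy())
            mutate = rng.random(parents.shape) < p_mut          # step 5: mutation
            pop = np.where(mutate, 1 - parents, parents)
        return best          # step 7: the best chromosome guides the graph bipartition

Recursive partitioning, as in step 8 of Figure 5.7, would call ga_bipartition again on each resulting sub-graph until the maximum number of levels or the node-count threshold is reached.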
5.5 Evaluation Methodology
This section describes the evaluation methodology used to compare among different
browsing methods. Details of the evaluation objectives, variables involved, experimental
tasks, hypotheses, subject profile, and performance measurement are described.
5.5.1 Objectives
The objectives of the evaluation were to understand the effectiveness, efficiency and
usability of the two browsing methods developed based on our framework, namely, Web
community and knowledge map; and to compare them with a textual result list display
and a powerful commercial searching and browsing tool called Kartoo
(http://www.kartoo.com/). We chose Kartoo because it is the only tool we found
that displays results in a graphical map format that is most nearly comparable to our
knowledge map. Our textual result list display resembles the display of results presented
by typical search engines such as Google.
To make the comparison with other browsing methods fairer, we did not provide a
search function, which was not the focus of our study. Kartoo displays search results in a
map format, with circles representing Web sites and lines linking the Web sites through
common keywords. The independent variables studied in the experiment were four
browsing methods (result list (RL), Web community (WC), knowledge map (KM), and
Kartoo map (KT)). The dependent variables were three system attributes (effectiveness,
efficiency, and usability). Figures 5.9-5.12 show the screenshots of the four browsing
methods.
[Screenshot: the result list display. Annotations indicate that the current topic is shown at the top; the title, summary, URL, and remarks of each result are shown; and buttons allow users to turn to other pages.]
Figure 5.9: Result list browsing method
[Screenshot: the Web community display. Annotations indicate that groups of Web sites are organized in hierarchical communities; clicking any node immediately below the root opens that sub-tree; a back button allows users to traverse upward in the tree; a panel shows details on demand (labels, title, summary, URL); and a button lets users open a Web site once they have specified it.]
Figure 5.10: Web community browsing method
[Screenshot: the knowledge map display. Annotations indicate that the closeness of any two points reflects their similarity; users can control the number of Web sites to be displayed; and a panel shows details (title, URL, summary) of the Web site being examined.]
Figure 5.11: Knowledge map browsing method
[Screenshot: the Kartoo map display. Annotations indicate that zooming buttons allow zoom-in and zoom-out; navigation buttons allow browsing in four directions; common keywords are shown on lines linking Web sites; Web site URLs are shown on circles; and details of a Web site are shown when users move the mouse over its circle.]
Figure 5.12: Kartoo map browsing method
5.5.2 Experimental Tasks
Two experimental tasks related to one of the nine business intelligence topics were
designed for each browsing method. Task 1 incorporated the closed-ended tasks used in
the Text Retrieval Conferences (TREC) (Voorhees and Harman, 1997) and an exact
answer to each task was expected. In the task, subjects were given the names of two
companies and were asked to find their URLs and major business areas in four minutes.
An example was "use the Web community display, select the topic Supply Chain
Management, find the URL and the major business areas of Gensym Corporation." Task
2 was designed according to the open-ended tasks used in TREC and topics relevant to
each task were expected to be found. Appendix B.3 provides the complete questionnaire
used in the experiment.
In the task, subjects were given a broad issue in a business intelligence topic and were
asked to find the titles and URLs of Web sites related to that issue in eight minutes. They
could find as many Web sites as they wanted. An example was "use the result list display,
select the topic Customer Relationship Management, find the titles and URLs of Web
sites that are related to CRM benchmarking." Thirty University of Arizona students were
recruited to participate voluntarily in the experiment. Half of them were female. Most of
them were aged below 30 and were studying in the business school.
5.5.3 Experimental Design and Hypotheses
In our experiment of browsing methods, three comparisons were performed. Web
community (WC) was compared with result list (RL) because we wanted to determine
whether clustering and hierarchical display helped subjects visualize the business
landscape. Knowledge map (KM) was compared with Web community because we
wanted to study the visual effects and accuracy of analysis algorithms used by different
browsing methods. Knowledge map was compared with Kartoo map (KT) to examine
point placement and the visual effects of using different browsing methods. Other
pairwise comparisons could be performed but we were mainly interested in these three.
As each subject was asked to perform the same set of tasks using the four browsing
methods, a one-factor repeated-measures design was used, because it gave greater
precision than designs that employ only between-subjects factors (Myers and Well, 1995,
p. 280). The whole experiment took about an hour and was divided into four
sections. In each section, subjects used one of the four browsing methods to perform two
tasks as described in Section 5.5.2. The task contents were different for different
browsing methods but their natures were the same (i.e., closed-ended for task 1 and open-ended for task 2). The functionalities of each browsing method were explained to subjects
before they used it. During the experiment, the experimenter also provided necessary
assistance to subjects and recorded their behavior, verbal protocols and other
observational data. Appendices B.1 and B.2 respectively provide the approval letter and
disclaimer form approved by the University of Arizona Human Subjects Committee.
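For illustration, the following is a sketch of the kind of one-tailed paired t-test that could produce the p-values reported later in Table 5.4, assuming per-subject scores for two browsing methods are available; scipy is assumed, and the variable names are hypothetical.

    from scipy import stats

    def one_tailed_paired_ttest(scores_a, scores_b):
        """Test whether method A scores higher than method B for the same subjects."""
        t, p_two_tailed = stats.ttest_rel(scores_a, scores_b)
        p_one_tailed = p_two_tailed / 2 if t > 0 else 1 - p_two_tailed / 2
        return t, p_one_tailed

    # Hypothetical use for H1 (WC > RL) on the accuracy scores of the 30 subjects:
    # t, p = one_tailed_paired_ttest(wc_accuracy, rl_accuracy)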
5.5.4 Performance Measures
The performance measures used in this experiment were effectiveness, efficiency, and
usability. The effectiveness of a browsing method has three components: accuracy,
precision, and recall. Accuracy refers to how well the browsing method helps users find
exact answers to closed-ended tasks (Task 1). Precision measures how well the browsing
method helps users find relevant results and avoid irrelevant results in open-ended tasks
(Task 2). Recall measures how well a browsing method helps users find all the relevant
results in open-ended tasks (Task 2). A single measure called F value was used to
combine recall and precision (Shaw et al., 1997). An expert who had a master's degree in
library and information science and five years of content management and Internet
searching experience was recruited to provide answers to the experimental tasks for
judging the effectiveness. She used our tool, Kartoo, and other search tools such as
Google to manually identify all relevant results. The formulae to obtain the above
measurements are stated below.
Accuracy = Number of correctly answered questions / Total number of questions

Precision = Number of relevant results identified by the subject / Number of all results identified by the subject

Recall = Number of relevant results identified by the subject / Number of relevant results identified by the expert

F value = (2 x Recall x Precision) / (Recall + Precision)
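A minimal sketch of these measures for a single subject follows; the function and argument names are illustrative.

    def effectiveness_measures(num_correct, num_questions,
                               subject_relevant, subject_total, expert_relevant):
        """Accuracy, precision, recall, and F value as defined above."""
        accuracy = num_correct / num_questions
        precision = subject_relevant / subject_total
        recall = subject_relevant / expert_relevant
        f_value = (2 * recall * precision) / (recall + precision) if (recall + precision) else 0.0
        return accuracy, precision, recall, f_value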
Efficiency refers to the amount of time users required to use a browsing method to
finish the tasks. Usability refers to how satisfied users are with using a browsing method.
This was obtained from subjects' overall ratings of the browsing method on a five-point
Likert scale where 1 means "poor" and 5 means "excellent". One feature of each
browsing method was selected for rating (using the same 5-point scale): textual list of
RL, labels of WC, placement of points in KM, and placement of circles in KT. Subjects
were encouraged to speak during the experiment so the experimenter could record the
reasons behind their behaviors and feelings. They also could type their comments
regarding the strengths and weaknesses of the methods into a computer file. All verbal
comments were analyzed using protocol analysis (Ericsson and Simon, 1993). Moreover,
we asked users to compare KM with KT in the following aspects: user friendliness,
graphical user interface, quality of results, and meaning of placement of points. For each
aspect, subjects provided verbal comments and indicated whether KM or KT performed
better.
5.5.5 Hypotheses
Nine hypotheses, grouped by the three system attributes, were tested and are listed
below.
HI: Web community is more effective than result list.
H2: Knowledge map is more effective than Web community.
H3: Knowledge map is more effective than Kartoo map.
H4: Web community is more efficient than result list.
H5: Knowledge map is more efficient than Web community.
H6: Knowledge map is more efficient than Kartoo map.
H7: Web community obtains higher users' ratings than result list.
H8: Knowledge map obtains higher users' ratings than Web community.
H9: Knowledge map's placement of Web sites is more meaningful than Kartoo map's
placement.
The rationales behind the hypotheses are now described. HI, H4, and H7 are about
the comparison between Web community (WC) and result list (RL). Since WC optimized
the partitional clustering process with a genetic algorithm and displayed the results in a
hierarchical format with guiding labels, it could group similar Web sites together. The
visual effects of WC also helped to reduce the degree of information overload that was
believed to be serious in RL. Therefore, users of WC should perform the tasks more
accurately and quickly with higher satisfaction. Thus, we believed that WC would
perform better than RL in these three system attributes.
H2, H5, and H8 are about the comparison between Knowledge map (KM) and Web
community (WC). Because KM employed multidimensional scaling to calculate the
positions of Web sites on the map, subjects could compare the physical distances among
different points to perceive the similarity among Web sites. In contrast, because WC put
similar Web sites inside a community, subjects could not distinguish among Web sites
within the same community. Furthermore, KM allowed subjects to adjust the zooming
levels, the number of Web sites to be shown, and the areas to be browsed. We had
anticipated that KM would be more flexible than WC, which had a predefined structure
that could not be changed and in which users could not alter the members of
communities. In sum, we believed that KM would perform better than WC in these three
system attributes.
H3, H6, and H9 are about the comparison between Knowledge map (KM) and Kartoo
map (KT). KM presented the Web sites as title-labeled points on a map while KT
presented Web sites as URL-labeled circles with interlinked lines and keywords. KM
used MDS to determine the coordinates of points but it was unknown how KT
determined the position of the circles on the map. We believed that the relative closeness
of Web sites and the cleaner interface of KM would enable subjects to find precise and
relevant results more quickly than KT. In contrast, KT's placement of circles did not
explicitly correspond with Web sites' similarity and the links, keywords, and circles
flashed while users were moving the mouse on the screen. We thought KT to be less
intuitive than KM and therefore believed that KM would perform better than KT in these
three system attributes.
5.6 Experimental Results and Implications
This section describes the results of our user study comparing four browsing methods.
Table 5.3 summarizes the means, standard deviations and rankings of different
performance measures for different browsing methods. Table 5.4 shows the p-values of
various t-tests in testing the hypotheses.
Table 5.3: Summary of key statistics
Performance Measure               Average    S.D.     Ranking
Result List
  Accuracy                        37%        26%      2nd
  Precision                       84%        28.5%    2nd
  Recall                          6.5%       5.7%     4th
  F value                         9.8%       10%      4th
  Total time (minutes)            12         0.00     4th
  Rating on result listing        2.3        0.96
  Overall rating                  2.7        0.92     3rd
Web Community
  Accuracy                        100%       0.00%    1st
  Precision                       92%        5.7%     1st
  Recall                          65%        14%      2nd
  F value                         75%        11%      1st
  Total time (minutes)            9.3        1.7      2nd
  Rating on node labeling         4.7        0.54
  Overall rating                  4.2        0.62     1st
Knowledge Map
  Accuracy                        100%       0.00%    1st
  Precision                       66%        17%      3rd
  Recall                          70%        18%      1st
  F value                         67%        15%      2nd
  Total time (minutes)            9.3        0.61     1st
  Rating on point placement       4.0        0.91
  Overall rating                  3.8        0.95     2nd
Kartoo Map
  Accuracy                        22%        25%      3rd
  Precision                       38%        26%      4th
  Recall                          13%        6.6%     3rd
  F value                         18%        8.9%     3rd
  Total time (minutes)            12         0.45     3rd
  Placement rating                2.2        1.0

* All ratings are on a five-point discrete Likert scale, where 1 means "poor" or "not
helpful", and 5 means "excellent" or "very helpful".
5.6.1 Comparison between Web Community and Result List
From the testing results of HI, H4 and H7, we found that Web community (WC)
performed significantly better than result list (RL) in terms of all performance measures
we considered. We believe that the advantages of clustering and visualization were the
main reasons for our observations. In terms of effectiveness (accuracy, precision, recall, F
value), the main reason for the superior performance of WC was its ability to group
similar Web sites in the same community accurately, while RL's one-dimensional list did
not. Subjects could locate a set of results related to a topic in WC once they had found the
community.
Table 5.4: p-values of various t-tests
Effectiveness
Hypothesis          H1: WC > RL      H2: KM > WC        H3: KM > KT
Accuracy            0.00 (>)(a)      Undefined(c)       0.00 (>)
Precision           0.15 (>*)(b)     0.00 (>)           0.00 (>)
Recall              0.00 (>)         0.11 (>*)          0.00 (>)
F value             0.00 (>)         0.01 (<)           0.00 (>)
Conclusion          Confirmed        Not confirmed      Confirmed

Efficiency
Hypothesis          H4: WC > RL(d)   H5: KM > WC        H6: KM > KT
Time for task 1     0.00 (>)         0.00 (>)           0.00 (>)
Time for task 2     0.00 (>)         0.00 (<)           0.19 (>*)
Total time          0.00 (>)         0.97 (>*)          0.00 (>)
Conclusion          Confirmed        Not confirmed      Confirmed

Usability
Hypothesis                          H7: WC > RL      H8: KM > WC         H9: KM > KT
Overall rating                      0.00 (>)         0.03 (<)
Rating on feature                   0.00 (>)         0.00 (<)
Combined rating                     0.00 (>)         0.00 (<)
Rating on placement of Web sites                                         0.00 (>)
Conclusion                          Confirmed        Not confirmed(e)    Confirmed
Notes:
a. A greater-than sign (>) means that the method on the left of the hypothesis yields a higher mean than the
method on the right. A smaller-than sign (<) means the reverse.
b. A star (*) means that t-test result was not significant at a 95% confidence level.
c. The correlation and t cannot be computed because the standard error of the difference is 0.
d. For H4-H6, a greater-than sign (>) means that the method on the left yields a higher efficiency (or
smaller amount of time) than the method on the right. A smaller-than sign (<) means the reverse.
e. H8 was not confirmed, but the opposite of H8 was confirmed.
In terms of efficiency (time spent on tasks), we believe that WC's hierarchical
structure allowed subjects to visualize the landscape of the entire collection of Web sites.
Subjects using RL needed to spend a lot of time opening the result pages sequentially. In
contrast, subjects relying on WC's visual effects could quickly locate the key labels
related to the topics they were searching for, thus saving time.
In terms of usability, subjects rated WC significantly higher than RL. This was because
of WC's visualization effects, clustering into hierarchical groups, labeling, and provision of
details on demand. As subject #19 said "Once I spot the label, I can move to the relevant
topics very easily ... (WC) save(s) time, (I) don't need to read all the summaries and
Web pages to decide which are relevant." Fifteen subjects had similar comments.
Regarding visualization effects, subject #10 said that "visualization helps to navigate
faster and easier". Eight subjects said that the tree structure was helpful and seven
subjects said that labels were clear. We therefore believe that clustering and visualization
of WC contributed to its higher rating.
In addition, subjects also commented on the strengths and weaknesses of RL and WC.
They were familiar with RL's display of results because it was similar to typical search
engines' displays. However, RL's result list provided too much information and might
have created information overload. Subject #4 said that RL required "too much reading at
one time" and that it was "hard to search for a specific word or phrase." Eight subjects
said that it was hard to find specific words, four subjects said too much information was
provided, and three subjects complained that they could not use the search function to
search for a specific keyword. Regarding the weaknesses of WC, ten subjects said that
when they browsed at the root level the labels overlapped and looked crowded. To
summarize, we found that Web community performed significantly better than
result lists in terms of effectiveness, efficiency, and usability.
5.6.2 Comparison between Web Community and Knowledge Map
Hypotheses H2, H5 and H8 were used to compare knowledge map (KM) against Web
community (WC). Surprisingly, we found that all three of these hypotheses were not
confirmed and the opposite of H8 was confirmed. In other words, KM performed very
similarly to WC in terms of effectiveness and efficiency. We suggest three reasons for
such results. First, both KM and WC displayed results in visual formats (tree or two-dimensional map). Thus they both facilitated subjects' browsing by visualization.
Second, both KM and WC employed the same Web site similarity data to perform further
analysis. In WC, similar Web sites were grouped under the same nodes. In KM, similar
Web sites were placed close to each other on the map. Third, both methods provided a
landscape of the entire collection while allowing details on demand. In WC, labels on
nodes provided a rough overview of the landscape while subjects could click on the
bottom nodes to view Web site details. In KM, Web site titles and their location provided
a landscape of the entire collection while subjects were allowed to click on a point to
view Web site details.
In terms of usability (H8), WC was rated significantly higher than KM, contrary to
our expectation. We believe that KM's inadequate zooming function contributed to the
result. By default, the top ten results were displayed on KM. When subjects performed
the tasks, they could increase the number of Web sites shown on screen. This would also
increase the chance that titles overlapped. Unfortunately, KM's zooming function was
restricted to between 1 and 5 times the original size. Thus, if subjects chose to display
more than fifty results, they were presented with too many overlapping labels. On the
other hand, WC's hierarchical display could automatically resize the sub-tree being
displayed. As subjects zoomed in on lower levels close to the bottom, they could get a
clearer picture of the results. Therefore, the overlapping-label problem did not appear in
those lower levels. In contrast, when subjects used KM, they got more and more
information as they increased the number of Web sites shown on screen. This also
increased the chance of information overload. As subject #30 commented on KM: "(It is)
easy to forget the Web sites I searched. It makes me do some repeating work". To
summarize, we found that knowledge map has effectiveness and efficiency similar to
those of Web community because of their similar functionality. But knowledge map
was rated less usable than Web community because of its inadequate zooming
function.
5.6.3 Comparison between Knowledge Map and Kartoo Map
Hypotheses H3, H6 and H9, and subjects' verbal comments were used to compare
knowledge map (KM) against Kartoo Map (KT). We found that KM performed
significantly better than KT in terms of effectiveness, efficiency and users' ratings on the
meaning conveyed by placement of points. We suggest that the main reasons contributing
to the superior performance of KM were its clean interface, intuitive communication of
the meaning of placement of points, and provision of details on demand. As subject #9
summarized succinctly: "This is an intelligent tool and has features superior to any other
search as it gives a visual picture of the topic and all topics closely related to the one
under search. The map is intuitive and helps steer the user to the right topics or the ones
that are close." Twenty subjects agreed that the closeness of points and their relationships
with similarity helped in browsing. Five subjects said that finding results was quick using
KM and four subjects said that the display was clear.
Subjects' verbal comments revealed that KM had better user friendliness, quality of
results and placement of Web sites on the screen while KT had a better graphical user
interface. Table 5.5 shows their preferences on the four aspects. Subjects pointed out the
inaccuracy of results and problems of user interface of KT. Unlike KM that displayed the
titles of the search results, KT displayed Web site URLs that often did not bear semantic
meaning. When a URL did not contain the search terms a subject was looking for, he or
she could not decide whether the URL was relevant unless the Web site was opened to
browse its content or the pointer was moved (by controlling the mouse) over the circle to
see the title and summary. But when the pointer was moved away from the circle, the title
and summary could no longer be seen because KT displayed only the details of the Web
site to which the pointer had been moved.
In addition, Kartoo provided much information that might have confused subjects. As
subject #18 pointed out: "too many lines in Kartoo confused the users when we used it".
In general, subjects had difficulty adapting to the complicated display of Kartoo.
However, Kartoo, being a commercial search engine, had a more professional graphical
user interface that most subjects preferred. To summarize, we found that knowledge
map performed significantly better than Kartoo in terms of effectiveness, efficiency,
and users' rating on placement of Web sites because of KM's accurate placement of
Web sites and its clean interface. From users' verbal comments, we found that KM
was more user-friendly, produced quality results, and conveyed better meaning of
the placement of Web sites because of accurate placement and clear presentation.
KT was considered to have a better graphical user interface because of its
professional graphical design.
Table 5.5: Number of subjects who expressed a preference for Knowledge Map or Kartoo
System attributes                        Knowledge Map is better    Kartoo is better    Similar/undecided
User Friendliness                        20                         7                   3
Graphical User Interface                 2                          27                  1
Quality of results                       26                         0                   4
Meaning of placement of Web sites        25                         0                   5
5.6.4 Discussion
The experimental results supported our belief that WC and KM would help to reduce
information overload and discover business intelligence on the Web. We concluded that
appropriate use of visualization is the main contributor to superior performance. We can
explain it by using Shneiderman's taxonomy for information visualizations
(Shneiderman, 1996). Among the seven data types (1D, 2D, 3D, temporal, network, tree,
multidimensional), KM represents a 2D/network data type while WC represents a 2D/tree
data type. Both methods employ techniques to perform some of the seven visualization
tasks (overview, zoom, filter, detail on demand, extract, related, history) appropriately in
order to reduce information overload. WC's clustering provides an overview of the entire
collection of Web sites and its tree structure provides details on demand. KM's intelligent
placement of Web sites as points on a map also provides an overview of the Web sites. Its
intuitive depiction of the relationship between physical distance on the map and Web site
similarity helps users relate easily to different Web sites. Moreover, both WC and KM
extract information (title, summary, URL) from the Web sites that users click on. KM
allows users to zoom into finer levels; WC allows users successively to open nodes
containing communities of Web sites. KM filters information by allowing users to change
the number of Web sites displayed; WC filters
information by not displaying the
communities that users do not select. However, neither WC nor KM provides history of
actions performed by users. Although this did not affect the significance of results due to
the short durations of tasks, providing the history might have helped users' browsing, as
reflected by a comment from subject #30, "(it is) easy to forget the websites I searched. It
makes me do some repeating work."
5.7 Conclusions
In this chapter, we have applied our automatic text mining framework to developing
the Business Intelligence Explorer for exploring and discovering business intelligence on
the Web. The system adopted meta-searching, co-occurrence analysis, clustering, and
visualization techniques to address the problem of information overload on the Web.
Based on the framework, we have developed two new browsing methods, namely, Web
community (WC) and knowledge map (KM), to help business analysts visualize the
landscape of search engine results and discover Web communities. A genetic algorithm
was applied in WC to identify communities of Web sites and a multidimensional
scaling algorithm was applied in KM to transform the high-dimensional similarity
matrix to two-dimensional (2D) coordinates. The results of WC were presented in a
hierarchical format while the results of KM were presented in a 2D map format.
Experimental results show that WC performed significantly better than result lists
(RL) in terms of effectiveness, efficiency and usability because of WC's accurate
clustering and appealing visualization. However, KM performed similarly to WC in
terms of effectiveness and efficiency because of their similarity in visualization features
and their use of the same similarity matrix in analysis. Contrary to our expectation, WC
obtained significantly higher users' ratings than KM because of KM's inadequate
zooming function. When comparing KM against Kartoo search engine (KT), we found
that KM performed significantly better in terms of effectiveness, efficiency, and users'
ratings on placement of Web sites because of KM's accurate placement of Web sites and
clean interface. Users' comments indicated that KM had better user friendliness, higher-quality results, and more meaningful placement of Web sites because of its accurate
placement and clear presentation. KT was considered to have a better graphical user
interface because of its professional art design.
Overall, the encouraging results show that our framework is promising for alleviating
information overload in business analysis and discovering business intelligence on the
Web. It is potentially useful to business analysts for whom it could effectively and
efficiently extract knowledge, in the form of patterns, from large amounts of information.
Our Web community and knowledge map browsing methods are particularly suitable for
discovering the landscape of a large number of business Web sites or search results. As
the traditional result list browsing method does not have this capability, the two browsing
methods have potential to be adopted by search engines as alternatives to result list
display.
CHAPTER 6. USING WEB PAGE CLASSIFICATION TECHNIQUES
FOR BUSINESS STAKEHOLDER ANALYSIS ON THE WEB
As the Web is used increasingly to share and disseminate information, business
analysts and managers are challenged to understand stakeholder relationships that are
often hidden among interconnected Web resources. Extracting such relationships would
provide insights on understanding the competitive environment, where customer
relationship management and collaborative commerce are becoming more important
nowadays. Traditional stakeholder analysis approaches offer theoretical frameworks for
studying the phenomena. But they typically assume a manual approach to analysis that
does not scale up to accommodate the rapid growth of the Web. This chapter examines
how our framework can automate business stakeholder analysis on the Web. By
incorporating human knowledge and machine-learned information of Web pages, we
demonstrate the framework's capability to extract and classify complicated business
relationships, thereby helping analysts to better understand the competitive environment.
6.1 Background
6.1.1 Collaborative Commerce
The current networked business environment has greatly facilitated information
sharing and partner collaboration (Applegate, 2003). Business stakeholders increasingly
rely on the Internet to conduct collaborative activities, such as analyzing business
relationships and researching development opportunities. To automate business
processes, "collaborative commerce" has recently been proposed (Kownslar, 2002;
Scandar, 2003). It integrates business processes (such as sales support, vendor
management, and demand planning) between partners through electronic sharing of
information (Li and Du, 2003).
A survey of more than 300 business executives found that companies that use
collaborative commerce technology to enable cross-enterprise business processes and
information exchange across their trade partners were as much as 70 percent more
profitable than those that did not integrate with trading partners (Ferreira and Blonkvist,
2002). One of the tactics in collaborative commerce is knowledge sharing about
stakeholder relationships through a company's Web sites and pages. Important clues to
the knowledge are often expressed in textual content or annotated hyperlinks.
6.1.2 Understanding Business Relationships on the Web
As the Web is used increasingly to share and disseminate information, the problem of
information overload hinders analysis of stakeholder relationships. Business analysts may
not be aware of many of a company's stakeholders, who may have current or future
relationships. Knowledge is often hidden in interconnected Web resources, posing
challenges to identifying and classifying various business stakeholders on the Web.
Although traditional stakeholder analysis approaches offer theoretical foundations for
understanding business relationships, they are largely manually-driven and not scalable to
the rapid growth and change of the Web. The emergence of electronic commerce has
further complicated business relationships, so there is a need for better approaches to
uncover knowledge that may improve understanding of business relationships in
collaborative commerce.
In this chapter, we have applied our automatic text mining framework to business
stakeholder analysis on the Web. Web page classification techniques were used to
classify Web pages into stakeholder types. A business stakeholder analysis system was
developed to incorporate both human knowledge and business Web site content. An
experiment involving algorithm comparison, feature comparison, and user study was
conducted. Using the framework, we aim to facilitate business stakeholder analysis on the
Web. At a company's (i.e., microscopic) level, individual business stakeholders can be
identified and classified. At the competitive environment's (i.e., macroscopic) level,
groups of business stakeholders can be formed to conduct further analysis.
6.2 Related Work
In this section, we review the theoretical foundations of stakeholder research,
describe tools for exploiting stakeholder information on the Web, and discuss different
techniques for Web page classification.
6.2.1 Stakeholder Analysis
It is useful to review stakeholder theories in the context of the changing view of
firms, which has evolved over the past two centuries as society and technology have
progressed over time (Freeman, 1984). In the 19th century, the "Production View"
considered firms to be vehicles of production. Managers needed only to worry about
satisfying suppliers and customers in order to keep the production line running. On
entering the 20th century, development of transportation systems favored the
concentration of production in urban areas. The "Managerial View," in which managers
had to satisfy owners and employees in addition to suppliers and customers, emerged. In
1963, the term "stakeholder" was first proposed in the Stanford Research Institute
(although the history of stakeholder concept can be traced back many decades earlier
(Smith, 1759; Berle and Means, 1932; Barnard, 1938)). Managers increasingly faced
internal and external pressure due to increased awareness of government's role, growing
competition from overseas, emphasis on consumer rights, and rising environmental
concerns. The "Stakeholder View" emerged. It called for managers' attention to various
stakeholders and became the dominant view of the firm between the 1960s and the 1980s.
Many researchers have proposed stakeholder frameworks and theories. Widely
considered a landmark in stakeholder management, Freeman's book describes three
levels of stakeholder management (Freeman, 1984): rational, process, and transactional
levels. The framework provides a solid foundation for future research. However, it does
not provide a systematic method for identifying and classifying stakeholders. Drawing
from results of 78 field studies of corporate social performance (CSP) in major Canadian
companies between 1983 and 1993, Clarkson (1995) proposed a framework for analyzing
and evaluating CSP. The research summarizes a list of typical stakeholder issues and
points out the importance of addressing stakeholders' needs rather than just shareholders'
needs. However, having mainly been conducted in the 1980s, this research does not
consider more complex relationships in the e-commerce environment that emerged in the
mid-1990s. Mitchell et al. (1997) proposed a stakeholder typology that identifies
stakeholders by combinations of stakeholder attributes (power, legitimacy, urgency).
They hypothesized that stakeholder salience increases with the number of attributes that
the stakeholder possesses. Although their theory was confirmed by empirical results of a
survey on 80 large U.S. firms (Agle et al., 1999), it applies only to understanding
stakeholders' salience but does not provide a practical classification system for
stakeholders.
Recognizing that the Internet enables virtually any individual or organization to relate
to any company, the "E-commerce view" of firms appeared in the 1990s. Stakeholders
who previously could not affect the firm can be virtually identified on the Web. Table 1
summarizes stakeholder types considered in recent research. Due to the ubiquitous nature
of the Internet, firms nowadays compete in a new environment. Traditional theories and
frameworks that assume only a manual approach to stakeholder analysis (e.g., (Elias and
Cavana, 2000; Reid, 2003)) may need to be augmented by Web-based, automatic
approaches to environmental scanning, stakeholder classification and analysis. In
particular, business intelligence (BI), obtained from the business environment, is likely to
help in stakeholder analysis and automated tools have been developed to exploit BI.
Table 6.1: Stakeholder types* considered in previous research
[Table matrix not reproduced. The studies compared, and the number of stakeholder types each considered: Reid (2003), 10 types; Elias & Cavana (2000), 9; Agle et al. (1999), 5; Donaldson & Preston (1995), 8; Clarkson (1995), 5.]
* P = Partners/suppliers, E = Employees/Unions, C = Customers, S = Shareholders/investors, U =
Education/research institutions, M = Media/Portals, G = Public/Government, R = Recruiters, V =
Reviewers, O = Competitors, T = Trade associations, F = Financial institutions, I = Political groups, N =
Special Interest Groups/Communities (Note that a class "Unknown" is not included here), No. = Column or
row sum
6.2.2 Tools and Approaches for Exploiting Business Intelligence
Web content and structural information are important for understanding BI and
stakeholders (see Section 2.3.4.2 for an overview of BI tools and techniques). Web
content mining helps analysts to identify key terms used to describe business
stakeholders. Web structure mining facilitates understanding of how the macroscopic
environment relates to certain Web sites or pages. Examples of Web mining techniques
include Google's PageRank algorithm (Brin and Page, 1998), Hyperlink Induced Topic
Search (HITS) algorithm (Kleinberg, 1999), and the Web Impact Factors algorithm
(Ingwersen, 1998). The external-link pages can be seen to mirror social communication
phenomena, such as strategic or tactical referral behavior, and pragmatic or common
semantic interest in particular sites on the Web (Ingwersen, 1998). Such macroscopic
information is important to understanding a business's competitive environment.
Previous research has developed automated systems to exploit Web content and link
structure information for discovering business intelligence. The Flexible Organizer for
Competitive Intelligence (Ong et al., 2001) performs online searches on selected search
engines and clusters the results into different folders using a fuzzy ARAM algorithm,
which allows human judgment to be added in the clustering process. However, its
accuracy has not been clearly demonstrated. WebMon helps users monitor specified Web
pages for most recent changes and updates in information (Tan et al., 2002). Four types
of monitoring are provided: date, keyword, link and portion. Despite the authors' claim to
having discovered BI, WebMon simply performs checking on Web page changes (e.g.,
changes in list, table, and plain text) without providing deeper analysis. INSYDER is an
information assistant that provides four visualization views (result list, scatterplot,
barchart, and tilebars) to facilitate browsing of business information on the Web.
However, experimental results showed that the visualizations provided by INSYDER
performed worse than a traditional result list in terms of effectiveness and efficiency
(Reiterer et al., 2000).
Another research effort tried to identify a company's non-customer Web communities
using back link analysis (Reid, 2003). The company's stakeholders were manually
classified into different types based on the anchor text and the context in which the
company's hyperlink appears on the stakeholders' Web pages. Despite the in-depth
analysis, the approach is labor intensive, was applied only to studying one company's
stakeholders, and may not work without deep domain knowledge. To facilitate
stakeholder analysis on the Web, a company called V-fluence provides software tools
that analyze a range of Web information sources such as Web sites, Usenet groups, and
search engine directories (Byrne, 2003). Information including site traffic, in-link,
content depth, and usage ranking was used to study stakeholders. However, the analysis
provided was shallow because only simple statistics were studied. It was also unclear
how stakeholder types were determined.
As shown in previous research, Web content and structural information are important
for understanding BI and stakeholders. However, existing tools lack analysis capability to
provide such understanding (Fuld et al., 2002). There is a need to automate stakeholder
classification, a primary step for stakeholder analysis. A promising way to alleviate the
problem is automatic classification of Web pages having direct linkages (through
hyperlinks) to the company. A general overview of classification techniques has been
provided in Section 2.3.2.2.
6.2.3 Web Page Classification Techniques
Web page classification is the process of assigning pages to predefined categories.
Machine learning has been widely used to automate this process. Major approaches
include k-nearest neighbor, neural network, Support Vector Machines, and Naïve
Bayesian network (Chen and Chau, 2004). An important step in these approaches is to
form the set of features to be used for classification. The following reviews previous
research in this field with emphases on feature selection and technique application.
Web page textual content has been considered an important feature for classification.
Kwon and Lee (2003) used a k-nearest neighbor approach to classifying selected Web
pages that was extended to classification of Web sites. HTML tags were used to weigh
importance of features (each represented by a single word) selected based on mutual
information. Although the approach yielded improvements over using only home pages
for classification, it depended greatly on heuristics judgment (e.g., number of Web pages
selected, value of k, weights assigned to features, etc.). The generalizability of the
approach is thus questionable. Mladenic (1998) used a Naive Bayesian classifier and the
Yahoo! Directory to automatically classify Web pages, each of which was represented by
feature vectors containing n-grams (up to 5 words) with stop words removed. Low
accuracies ranging from 25% to 50% were achieved and could be improved if structural
features were also used.
Structural features of Web pages have been used in Web page classification.
Fürnkranz (1999) employed the structure of an HTML representation and the structure of
the Web to represent Web pages that were classified by a RIPPER learning algorithm.
Average recall and precision of 78% and 87%, respectively, were achieved. The research
points out that Web structural information is useful for Web page classification.
However, textual content is likely to improve the performance when the Web pages have
not been widely cited.
Textual and structural content of Web pages have recently been used in classification.
To represent a page, Glover et al. (2002) used anchor texts and nearby words from pages
that linked to the target page. An entropy-based feature ranking method was used to
select features, represented as one- to three-word terms. An SVM classifier for binary
classification achieved accuracies ranging from 66.25% to 89.3% in positive examples and
over 90% in negative examples. The data-driven feature selection method led to high
description power but the research did not explore the use in multi-category
classification, a typical requirement in stakeholder classification. Lee et al. (2002) applied
neural networks to filtering
pornographic Web content. A handcrafted pornography
lexicon of indicative terms was used to identify the terms' frequencies of occurrence in
Web page title, displayed contents, meta-contents (keyword, description), and image
tooltip. Kohonen's SOM and Fuzzy ART neural networks were compared in binary
classification, where SOM achieved a slightly higher accuracy than ART (92.8% vs.
87.9%), but SOM took significantly more time in the training process. Manual
selection of terms in the lexicon yielded precise and intuitive results but was very labor
intensive and unable to adapt to changes. An automatic approach to feature selection has
thus been shown to be preferable in such broad areas as stakeholder classification.
In summary, previous research in Web page classification has considered a number of
features and methods for selecting features. The features include: (1) page textual
content: full text, page title, headings; (2) link related textual content: anchor text,
extended anchor text, URL strings; and (3) page structural information: number of words,
number of page out-links, inbound outlinks (i.e., links that point to its own company),
outbound outlinks (i.e., links that point to external Web sites). The feature selection
methods include: (1) human judgment; (2) feature ratios and thresholding; and (3) use of
a domain lexicon. It is promising to employ both textual content and structural
information for Web page classification in which many machine learning techniques have
been developed. When applied to Web-based business stakeholder analysis, Web page
classification helps to discover companies' interest groups on the Web and to enable
companies to better understand the competitive environment. Surprisingly, this area has
not been widely explored.
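For illustration only, the following sketch shows one way textual and structural features of a stakeholder Web page could be combined into a single feature vector; the field names and feature choices are assumptions, not the dissertation's implementation.

    from dataclasses import dataclass

    @dataclass
    class PageFeatures:
        """Hypothetical combined representation of a stakeholder Web page."""
        term_counts: dict            # counts of lexicon terms found in the page text
        anchor_terms: dict           # counts of lexicon terms found in anchor text
        num_words: int               # page structural information
        num_outlinks: int
        num_inbound_outlinks: int    # links pointing to the page's own company
        num_outbound_outlinks: int   # links pointing to external Web sites

        def to_vector(self, lexicon):
            """Flatten into a numeric vector ordered by the domain lexicon."""
            text_part = [self.term_counts.get(t, 0) for t in lexicon]
            anchor_part = [self.anchor_terms.get(t, 0) for t in lexicon]
            structure_part = [self.num_words, self.num_outlinks,
                              self.num_inbound_outlinks, self.num_outbound_outlinks]
            return text_part + anchor_part + structure_part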
6.3 Research Questions
Electronic commerce has greatly facilitated transaction processing but increased
sharing of information on the Web complicates understanding of business stakeholders.
Although previous research in stakeholder analysis provides rich theoretical background,
conclusions drawn from old data (collected before the mid-1990s) may not reflect rapid
developments in e-commerce. Existing stakeholder analysis approaches are manually
driven and do not scale up to rapid growth and change of the Web. Moreover, most
business intelligence tools lack stakeholder analysis capability. While various Web page
classification techniques have been developed, they have not been applied to business
stakeholder classification. We therefore state three research questions.
1. How can we apply our automatic text mining framework to business stakeholder
analysis on the Web?
2. How can Web page textual content and structural information be used in the
framework?
3. What are the effectiveness (measured by accuracy) and efficiency (measured by time
requirement) of the framework for business stakeholder classification on the Web?
6.4 Application of the Framework
This section describes the application of our automatic text mining framework and the
testbed we developed to address Research Questions 1 and 2 in Section 6.3. We aimed to
analyze business stakeholders on the Web automatically through the use of machine
learning techniques and human expert knowledge.
The rationale for applying the framework is two-fold. First, business stakeholders
who have an interest in a company should have identifiable clues (e.g., textual
descriptions, hyperlinks) that can be used to distinguish their stakeholder types. Second,
as reviewed in Section 6.2.2, Web content and structural information is important to
understanding clues to stakeholder classification. Relying on such clues, we tried to
automate stakeholder classification, an important step in business stakeholder analysis.
Specifically, two generic steps were taken: (1) formation of a domain lexicon; and (2)
automatic stakeholder classification. Applying the framework to business stakeholder
analysis, we have built a business stakeholder testbed containing Web pages of
knowledge management companies. Figure 6.1 (modified from Figure 3.1) shows the
framework components (in blue ovals) used to develop the system, Business Stakeholder
Analyzer (BSA), whose system architecture is shown in Figure 6.2. In the following,
we describe the processes involved in building the research testbed, domain lexicon
formation, and automatic stakeholder classification.
[Diagram: the text mining framework pipeline (collection, conversion, extraction, analysis, visualization), showing components such as meta-searching/meta-spidering, domain spidering, and language identification operating on HTML/XML pages, Web sites, and the hidden Web, and producing tagged collections, indexes and relationships, similarities, classes, clusters, hierarchies, maps, and graphs stored in data, text, and knowledge bases.]
Figure 6.1: Framework components used to develop BSA
6.4.1 Building a Research Testbed
In order to test the usability and value of the framework, we have built a research
testbed consisting of Web pages of business stakeholders of the top 100 knowledge
management companies identified by the Knowledge Management World (KM World)
Web site (McKellar, 2003). KM World (http://www.kmworld.com/) is a major
Web portal providing news, publications, online resources, and solutions to more than
56,000 subscribers in the knowledge management systems market. To identify such
stakeholders, we used the back-link search function of the Google search engine
(http://www.google.com/) to search for Web pages having hyperlinks pointing to
the companies' Web sites. This method has been successfully used to analyze the non-customer online communities of a company (Reid, 2003). To illustrate the method, we
can type "link: www. siebel. com" in Google's search box to find the Web pages
pointing to Siebel's Web site (the host company). A relationship exists between Siebel
and the results because the hyperlinks imply underlying stakeholder relations with the
enterprise.
For each host company, we considered only the first 100 results returned from Google
in order to limit the scope of analysis. We removed results that came from the same host
company (i.e., self links) and used only the first result, if more than one result came from
the same Web site (by recognizing the domain name of the results' URLs). After
filtering, we obtained 3,713 results in total. On average, we identified 37 stakeholders for
each host company.
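A sketch of this filtering step follows, for illustration; the function and argument names are assumptions.

    from urllib.parse import urlparse

    def filter_backlink_results(result_urls, host_domain, limit=100):
        """Keep at most `limit` results, drop self links, and keep only the first
        result from each external Web site (recognized by its domain name)."""
        kept, seen_domains = [], set()
        for url in result_urls[:limit]:
            full_url = url if "://" in url else "http://" + url
            domain = urlparse(full_url).netloc.lower()
            if host_domain.lower() in domain:      # self link back to the host company
                continue
            if domain in seen_domains:             # already have a result from this site
                continue
            seen_domains.add(domain)
            kept.append(url)
        return kept

    # Hypothetical usage:
    # stakeholders = filter_backlink_results(google_results, "siebel.com")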
[Diagram: BSA system architecture. Data collection uses Google back-link search to meta-search the Web for business Web pages of stakeholders of the KM World top 100 knowledge management companies; a Web page parser and indexer performs automatic parsing and indexing; tagging and extraction combine manual tagging by a BI expert, lexicon creation, and feature selection to produce classification feature vectors; and automatic classification (using a neural network) assigns each page to a stakeholder type such as partner/supplier/sponsor, customer, employee, shareholder, government, competitor, community, education/research institution, media/reviewer, portal, or unknown.]
Figure 6.2: System architecture of BSA
Among the results of the 100 companies, we randomly selected the results of 9
companies, listed in Table 6.2, for building a domain lexicon (see Section 6.4.2) and for
training the algorithms (see Section 6.4.3). The HTML pages of these 414 results were
automatically spidered, parsed, and indexed to extract textual terms. Pages were filtered
out if (a) hyperlinks of host companies did not exist in the pages, (b) they contained too
little text (fewer than 20 words), or (c) they mainly contained non-English content. After
filtering, 283 Web pages were stored in our database for analysis. Among them, only 142
contained metadata in their HTML codes.
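A sketch of these filtering rules follows, for illustration; the language check is assumed to be supplied by a separate language identification step, and the function name is ours.

    def keep_page(html_text, plain_text, host_url, is_english):
        """Apply the three filtering rules used to build the testbed:
        (a) the page must link to the host company, (b) it must contain at
        least 20 words of text, and (c) it must be mainly in English."""
        has_host_link = host_url.lower() in html_text.lower()   # (a) hyperlink to host present
        enough_text = len(plain_text.split()) >= 20             # (b) not too little text
        return has_host_link and enough_text and is_english     # (c) English content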
Table 6.2: Companies selected as training examples
Company                   URL
Autonomy                  http://www.autonomy.com
ClearForest               http://www.clearforest.com
Documentum                http://www.documentum.com
Fujitsu Software          http://www.fsc.fujitsu.com
Information Builders      http://www.informationbuilders.com
Plumtree                  http://www.plumtree.com
SiteScape                 http://www.sitescape.com
TheBrain                  http://www.thebrain.com
West Group                http://west.thomson.com
6.4.2 Creation of a Domain Lexicon
A domain lexicon was created by an expert in business intelligence* to include one-,
two-, and three-word terms that were indicative of business stakeholder types. Two steps
were involved: (1) Generation of a list of stakeholder types; (2) Extraction of terms that
are indicative of stakeholder types. Based on our review in Table 6.1, the list of
stakeholder types used in manual tagging of Web pages is shown in Table 6.3.
Next, the expert manually read through all the 414 Web pages of the nine companies'
business stakeholders to identify terms that indicate stakeholder types. For example,
terms such as "news," "newsletter," and "news archives" are indicative of the media type.
Table 6.4 provides further examples of terms indicative of the partner/supplier/sponsor
type. After analyzing all the pages, we extracted a total of 329 terms (67 one-word terms,
84 two-word terms, and 178 three-word terms) that would constitute our lexicon.
The expert holds doctoral and master's degrees in library science and a postgraduate certificate in MIS. She was president of the Society of Competitive Intelligence Professionals, Singapore (SCIPSgp) and an Associate Professor with Nanyang Business School (NBS), Nanyang Technological University (NTU), Singapore. She is one of the founding members of SCIPSgp. Formerly, she was an entrepreneur with an Internet start-up in Malaysia.
Table 6.3: Stakeholder types used in manual tagging of Web pages
Group | Description | Type
Transactional (internal environment) | Actor that the enterprise interacts with and influences | Partner/supplier/sponsor; Customer; Employee; Shareholder
Contextual (external environment) | Distant actor that the enterprise has no power or influence over | Government; Competitor; Community (special interest groups); Education/research institution; Media/reviewer; Portal creator/owner
Other | Cannot identify a stakeholder type | Unknown
Table 6.4: Examples of terms indicative of the partner/supplier/sponsor stakeholder type
Number of words | Terms | Illustrative Web page results
1 | Partner; Alliances | "Colbenson is a partner of Autonomy ..." (http://www.colbenson.com/Partners/partners.html); "Satyam strategic alliances includes Documentum ..." (http://www.satyam.com/alliances/alliances.html)
2 | For partners | "For Partners, Resellers Directory ..." (Autonomy is one of Sun's reseller partners) (http://be.sun.com/products/wheretobuy/solacc.html)
3 | New strategic partnership | "Sybase and Fujitsu Software Corporation Have Forged New Strategic Partnership ..." (http://www.sybase.com/detail/1,6904,1025592,00.html)
6.4.3 Automatic Stakeholder Classification
The purpose of this step is to automatically classify Web pages (business
stakeholders) linking to companies' Web sites into different stakeholder types based on
machine learning and human expert knowledge. Three steps were involved: (1) manual
tagging, (2) feature selection, and (3) automatic stakeholder classification.
6.4.3.1 Manual Tagging
The expert (mentioned in Section 6.4.2) manually classified each of the 414
stakeholder pages of the nine selected companies into one of the 11 stakeholder types
(listed in Table 6.3) selected based on our review of stakeholder research (see Section 6.2.1
and Table 6.1). Each Web page was reviewed on the basis of terms and clues indicative
of its type. If the texts or HTML tags on a Web page did not carry pre-specified terms
(those already in the domain lexicon) uniquely identifying a stakeholder type, generic
terms were arbitrarily assigned by the analyst on the basis of the manifest information of
the Web page.
All the Web pages for a specific host company were classified according to the
aforementioned list. Before proceeding to the next company, the expert reviewed the
entire list of terms for problems and inconsistencies. If the list needed modification as a
result of inconsistencies, then another round of classification was undertaken using the
modified list. This iterative and labor-intensive refining process was conducted to ensure
the consistency of outcomes. Results from the manual tagging were used to guide the
process of machine learning that used the same results as training examples.
6.4.3.2 Feature Selection
To prepare for automatic classification, we considered two sets of features of business
stakeholders' Web pages: structural content features and textual content features.
Structural content features contain occurrences of lexicon terms in different parts of the
Web page. To identify such occurrences, an HTML parser automatically extracted all
one-, two-, and three-word terms from the pages' full-text content. A list of 462 stop
words was used to remove non-semantic-bearing words (e.g. "the," "a," "of," "and").
Using HTML tags, the parser identified positions in which the terms appeared on the
page. We have considered terms appearing in page title, extended anchor text (the anchor
text plus 50 words surrounding it), and page full text, because they could reflect the
importance of the terms and have been successfully used in previous research (Lee et al.,
2002; Kwon and Lee, 2003).
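To make the structural content features concrete, the following sketch marks, for each lexicon term, whether it occurs in the page title, in the extended anchor text, or in the page full text. It is a simplified stand-in for the HTML parser described above, under the assumption that the three text regions have already been extracted; the tokenizer, function names, and example strings are illustrative only.

    import re

    def tokenize(text):
        """Lowercase word tokens; a stand-in for the HTML parser's term extraction."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def structural_features(title, extended_anchor, full_text, lexicon):
        """Binary indicators of whether each lexicon term appears in the title,
        the extended anchor text, or the full text of a stakeholder page."""
        regions = {
            "title": " ".join(tokenize(title)),
            "anchor": " ".join(tokenize(extended_anchor)),
            "fulltext": " ".join(tokenize(full_text)),
        }
        features = {}
        for term in lexicon:
            t = " ".join(tokenize(term))
            for name, text in regions.items():
                features[f"{name}:{term}"] = int(f" {t} " in f" {text} ")
        return features

    # Hypothetical example
    lexicon = ["partner", "news archives", "strategic alliance"]
    print(structural_features(
        title="Acme and Siebel announce strategic alliance",
        extended_anchor="I just saw a demo by Siebel, our partner in CRM",
        full_text="Acme's partner program and news archives",
        lexicon=lexicon))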
Figure 6.3 shows an example of the HTML source code and a screen shot of the Web page of a business stakeholder of ClearForest, a company that provides content management software. The title is "David Schatsky: Search and Discovery in the Post-Cold War Era." The extended anchor text includes "ClearForest" and the terms surrounding it: "I just saw a demo by" and "a company that provides tools for analyzing unstructured textual information." The parser automatically checked the terms appearing in the three chosen positions (title, extended anchor text, and full text) to see if they appeared in the lexicon.
Textual content features are the frequencies of occurrences of important one-, two-,
and three-word terms appearing in the business stakeholder pages. By considering terms
appearing in multiple categories of stakeholders, we modified the thresholding method
used in (Glover et al., 2002) to select important terms from a large number of extracted
terms. Figure 6.4 shows the formulae and procedure used in the method. Terms with high
feature ratios were selected as features for classification.
Through the procedure, we could retain features that had high discriminating power
among the stakeholder categories. Examples of such retained features included "portals,"
"companies," "knowledge," "coalition of the," "portals research," "building web
services," "onlinetrade links to," and "system design." Features that rarely appeared in
different categories were removed automatically. The rationale behind the procedure was
to provide the algorithms with high quality features as input, thereby enhancing the
performance of classification. Because of its statistical nature, we believed that the
selected features could help to differentiate the pages into stakeholder types. This
automated method also reduced the need for labor-intensive human work.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>David Schatsky: Search and Discovery in the Post-Cold War
Era</title>
</head>
<body>
<p>I just saw a demo by <a href="http://www.clearforest.com">
ClearForest, </a> a company that provides tools for analyzing
unstructured textual information. It's truly amazing, and truly the
search tool for the post-Cold War era. ... </p>
</body>
</html>
[Screen shot: the stakeholder page is an analyst weblog entry, "Search and Discovery in the Post-Cold War Era" (David Schatsky, Jupiter Research, January 22, 2003), viewed in a Web browser; its text contains the hyperlink to www.clearforest.com shown in the HTML source above.]
Figure 6.3: A business stakeholder Web page of ClearForest
Step 1: Suppose we have n categories of stakeholders and m features of Web pages. For each stakeholder category and a specific feature f_i, calculate the feature ratios R_{f_i} and \bar{R}_{f_i} as follows:

R_{f_i} = \sum_{j=1}^{n} |C_{j,f_i}| / |C_j|        \bar{R}_{f_i} = \sum_{j=1}^{n} |\bar{C}_{j,f_i}| / |\bar{C}_j|

where
C_j = Web pages in class j, j \in [1 ... n]
C_{j,f_i} = Web pages in class j that contain feature f_i, where i \in [1 ... m]
\bar{C}_{j,f_i} = Web pages not in class j that contain feature f_i
\bar{C}_j = all the Web pages not belonging to class j

Step 2: Sort R_{f_i} and \bar{R}_{f_i}. For each sorted list, select the top K features that have the highest values of the feature ratio, where K is approximately the number of structural content features. (In our testbed, K was equal to 1000, which was close to the number of structural content features (3 x 329 = 987), because there were 329 terms in our domain lexicon and we considered 3 positions (title, extended anchor text, and full text) in which the terms could appear in Web pages.) This assigns approximately equal importance to structural and textual content features. Two lists of features are then obtained.

Step 3: Merge the two lists and remove duplicate features.
Figure 6.4: Formulae and procedure in the thresholding method
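A rough Python rendering of this thresholding procedure is given below, assuming each page has already been reduced to a (stakeholder type, set of extracted terms) pair. Step 3 is read here as merging the two top-K lists and removing duplicates, which is consistent with the 1,334 textual content features reported in Section 6.4.3.3 (two lists of 1,000 minus their overlap); the function and variable names are illustrative only.

    from collections import defaultdict

    def select_textual_features(pages, k=1000):
        """Feature-ratio thresholding (a sketch of the Figure 6.4 procedure).
        `pages` is a list of (class_label, set_of_terms) pairs."""
        classes = {label for label, _ in pages}
        terms = set().union(*(t for _, t in pages)) if pages else set()
        R, R_bar = defaultdict(float), defaultdict(float)
        for j in classes:
            in_j = [t for label, t in pages if label == j]
            out_j = [t for label, t in pages if label != j]
            for f in terms:
                if in_j:
                    R[f] += sum(f in t for t in in_j) / len(in_j)
                if out_j:
                    R_bar[f] += sum(f in t for t in out_j) / len(out_j)
        top_r = sorted(terms, key=lambda f: R[f], reverse=True)[:k]
        top_r_bar = sorted(terms, key=lambda f: R_bar[f], reverse=True)[:k]
        # Step 3: merge the two lists and remove duplicate features.
        return list(dict.fromkeys(top_r + top_r_bar))

    # Hypothetical toy corpus
    pages = [("partner", {"alliance", "reseller"}),
             ("media", {"news", "newsletter", "alliance"}),
             ("partner", {"alliance", "partner program"})]
    print(select_textual_features(pages, k=2))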
6.4.3.3 Automatic Stakeholder Classification
Two machine learning algorithms, feedforward/backpropagation neural network and
Support Vector Machines (Cristianini and Shawe-Taylor, 2000), were used to classify
business stakeholder pages automatically into their respective stakeholder types (see
Sections 2.3.2.1 and 2.3.2.2 respectively for reviews of the two algorithms). Neural
network, a computing system using the human brain's mesh-like network of
interconnected neurons as a metaphor, has been shown to be robust in classification and
has wide applicability in different domains (Lippman, 1987) and Web-page filtering (Lee
et al., 2002). Support Vector Machines (SVM), a machine learning algorithm that tries to
minimize structural risk in classification, has been successfully applied to text
categorization (Joachims, 1998) and Web-page classification (Glover et al., 2002).
The structural and textual content features selected in the previous step were used as
input to the algorithms. Each stakeholder page was represented as a feature vector
containing 987 structural content features (binary variables indicating whether certain
lexicon terms appeared in the page title, extended anchor text, and full text) and 1,334
textual content features (frequencies of occurrences of the selected features - 663 words
and 671 two- or three-word phrases). We used 283 pages of the nine companies'
stakeholders to train the algorithms. The model and weights obtained from the training
were used to predict the types of business stakeholder pages of 10 testing companies
listed in Table 6.5. In this process, we assumed that meaningful classification could be
obtained from the business stakeholders who provided on their Web pages explicit
information about relationships with the host companies.
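The classification step can be approximated with off-the-shelf learners, as in the sketch below, which trains a linear SVM and a small feedforward network on combined feature vectors (987 binary structural indicators followed by 1,334 textual term frequencies). The original study used its own feedforward/backpropagation neural network and SVM implementations; the scikit-learn estimators and the randomly generated placeholder data here are assumptions made purely for illustration.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)

    def fake_features(n_pages):
        """Placeholder feature vectors standing in for the real parsed pages."""
        structural = rng.integers(0, 2, size=(n_pages, 987))    # binary indicators
        textual = rng.poisson(0.3, size=(n_pages, 1334))        # term frequencies
        return np.hstack([structural, textual])

    X_train = fake_features(283)                 # 283 training pages
    y_train = rng.integers(0, 11, size=283)      # 11 stakeholder types

    svm = LinearSVC().fit(X_train, y_train)
    nn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X_train, y_train)

    X_test = fake_features(5)                    # pages of a testing company
    print(svm.predict(X_test))
    print(nn.predict(X_test))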
Table 6.5: Companies selected as testing examples
Company | URL
Applied Semantics | http://www.appliedsemantics.com
Computer Associates | http://www.cai.com
Dialog | http://www.dialog.com
Factiva | http://www.factiva.com
Intelliseek | http://www.intelliseek.com
Kamoon | http://www.kamoon.com
Siebel | http://www.siebel.com
Stratify | http://www.stratify.com
Tacit Knowledge Systems | http://www.tacit.com
WebMethods | http://www.webmethods.com
6.5 Evaluation Methodology
This section describes the methodology used to evaluate the performance of our
framework. We explain below the experimental design, hypotheses, and experimental
procedures.
6.5.1 Experimental Design
The experiment consisted of algorithm comparison, feature comparison, and a user
evaluation study. In algorithm comparison, we compared the performance of neural
network (NN) with the performance of SVM in automatic stakeholder classification. In
addition, we also created a baseline method that randomly classified the stakeholders into
certain stakeholder types. Stakeholder pages of companies listed in Table 6.5 were
classified and used to test the performance of different methods. Performance referred to
effectiveness, measured by the overall accuracy and within-class accuracy defined below,
and efficiency measured by the time used (in minutes).
Overall accuracy = (1/n) \sum_{i=1}^{n} (Number of correctly classified stakeholders in sample i) / (Number of all classified stakeholders in sample i)

where n = number of stakeholder samples used for testing

Within-class accuracy(class x) = (Number of stakeholders correctly classified as class x) / (Number of all stakeholders belonging to class x)
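For concreteness, the two measures can be computed as follows; the helper names and the toy label lists are hypothetical.

    def overall_accuracy(samples):
        """`samples` is a list of (true_labels, predicted_labels) pairs, one per
        test sample; returns the mean per-sample classification accuracy."""
        ratios = [sum(t == p for t, p in zip(true, pred)) / len(true)
                  for true, pred in samples]
        return sum(ratios) / len(samples)

    def within_class_accuracy(true, pred, x):
        """Fraction of stakeholders of class `x` that were classified as `x`."""
        members = [(t, p) for t, p in zip(true, pred) if t == x]
        return sum(t == p for t, p in members) / len(members) if members else 0.0

    # Hypothetical example with two five-page samples
    samples = [(["partner", "media", "portal", "partner", "unknown"],
                ["partner", "media", "partner", "partner", "unknown"]),
               (["customer", "partner", "media", "media", "portal"],
                ["partner", "partner", "media", "portal", "portal"])]
    print(overall_accuracy(samples))                    # 0.7
    true, pred = samples[1]
    print(within_class_accuracy(true, pred, "media"))   # 0.5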
In feature comparison, we compared structural content features, textual content
features, and a combination of the two sets of features. We were interested in knowing
how different feature types contributed to performance using the same measures.
In the user evaluation study, we recruited 36 University of Arizona Business College
students as subjects to perform manual stakeholder classification. All the students were
registered in an undergraduate course called "Introduction to Business Information
Systems" and had prior experience in using the Internet to find business information.
Each subject was introduced to stakeholder analysis and was asked to use our system
named "Business Stakeholder Analyzer (BSA)" to browse companies' stakeholder lists.
Appendices C.l and C.2 respectively provide the approval letter and disclaimer form
approved by the University of Arizona Human Subject Protection Program. Appendix
C.3 provides the complete questionnaire used in the experiment.
We randomly selected three companies (Intelliseek, Siebel, and WebMethods) from
Table 6.5 to be the targets of analysis. Each subject was randomly assigned one of these
three companies' 10 stakeholder pages to perform classification. Figure 6.5 and Figure
6.6 show screen shots of BSA and of the stakeholders of Siebel. The subject could find
definitions of the stakeholders from BSA's front page and was also provided with a hard
copy of the stakeholder list. Their assignment was to classify the ten stakeholders into
their respective stakeholder types. We recorded the time the subject used to complete the
task. Upon finishing the task, the subject filled in a post-study questionnaire related to
demographic information and opinions on automatic stakeholder classification.
[Figure 6.5 reproduces the front page of the Business Stakeholder Analyzer. It introduces collaborative commerce and Web-based stakeholder analysis, and defines the 11 stakeholder types (partners/suppliers/sponsors, customers, employees, shareholders, government, competitors, communities, educational/research institutions, media/reviewers, portals, and unknown) into which subjects classified each company Web page.]
Figure 6.5: Front page of Business Stakeholder Analyzer
[Figure 6.6 shows BSA listing the Web pages (business stakeholders) that link to Siebel; each entry gives the page title, a snippet, and the URL, e.g., channel partner/reseller pages, alliance pages, a CRM special report, and customer success stories.]
Figure 6.6: Business stakeholders of Siebel
Table 6.6: Hypotheses tested in this study
Algorithm comparison
H1a: NN and SVM achieve similar effectiveness when structural content features are used.
H1b: NN and SVM achieve similar effectiveness when textual content features are used.
H1c: NN and SVM achieve similar effectiveness when both structural and textual content features are used.
Comparing against the baseline method
H2a: NN achieves higher effectiveness than random classification when structural content features are used.
H2b: NN achieves higher effectiveness than random classification when textual content features are used.
H2c: NN achieves higher effectiveness than random classification when both structural and textual content features are used.
H2d: SVM achieves higher effectiveness than random classification when structural content features are used.
H2e: SVM achieves higher effectiveness than random classification when textual content features are used.
H2f: SVM achieves higher effectiveness than random classification when both structural and textual content features are used.
Comparing against human judgment
H3a: Human judgment in stakeholder classification achieves similar effectiveness to using NN (with both structural and textual content features).
H3b: Human judgment in stakeholder classification is less efficient than using NN (with both structural and textual content features).
H3c: Human judgment in stakeholder classification achieves similar effectiveness to using SVM (with both structural and textual content features).
H3d: Human judgment in stakeholder classification is less efficient than using SVM (with both structural and textual content features).
Evaluating different feature types
H4a: Using structural content features for automatic stakeholder classification yields similar effectiveness to using textual content features in NN.
H4b: Using structural content features for automatic stakeholder classification yields similar effectiveness to using textual content features in SVM.
H5a: Using a combination of structural and textual content features for automatic stakeholder classification with NN is more effective than using only structural content features.
H5b: Using a combination of structural and textual content features for automatic stakeholder classification with NN is more effective than using only textual content features.
H5c: Using a combination of structural and textual content features for automatic stakeholder classification with SVM is more effective than using only structural content features.
H5d: Using a combination of structural and textual content features for automatic stakeholder classification with SVM is more effective than using only textual content features.
6.5.2 Hypotheses and Experimental Procedures
Table 6.6 shows the five groups of hypotheses tested in this study. The first group
(HI) hypothesized that NN and SVM would achieve similar effectiveness when the same
set of features was used because both techniques were robust in Web-page classification.
To test the hypotheses, we created 30 sets of stakeholder pages by randomly selecting
groups of 5 stakeholder pages of each of the 10 companies listed in Table 6.5. Structural
content features, textual content features, and combined features of these pages were used
as input to the algorithms that classified the pages into different stakeholder types.
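The dissertation does not state which statistical test produced the p-values reported in Table 6.7; a paired t-test over the 30 per-sample accuracies is one common choice and is sketched below with made-up accuracy values.

    from scipy import stats

    # Hypothetical per-sample overall accuracies for the 30 test sets
    # (the raw values are not published in the dissertation).
    nn_acc  = [0.22, 0.18, 0.25, 0.11, 0.20, 0.16] * 5   # stand-in for NN, combined features
    svm_acc = [0.43, 0.50, 0.47, 0.39, 0.46, 0.41] * 5   # stand-in for SVM, combined features

    t_stat, p_value = stats.ttest_rel(nn_acc, svm_acc)   # paired comparison on the same samples
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")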
The second group of hypotheses (H2) hypothesized that the two algorithms would
perform better than the baseline method because both algorithms incorporated human
knowledge and machine learning capability into the classification, thereby adding value
to business stakeholder analysis. To test these hypotheses, we compared the results
obtained from the baseline method with those obtained from the two algorithms.
The third group (H3) hypothesized that human judgment in stakeholder classification
would achieve effectiveness similar to that of machine learning, but that the former is less
efficient. The rationale was that both machine learning algorithms and human analysts
could make use of the Web page's textual and structural content in classifying
stakeholders. However, a human approach might be less efficient than an automatic
approach. Section 6.5.1 describes the procedure for carrying out this user evaluation
study.
The fourth and fifth groups (H4 and H5) examined the use of different types of
features in automatic stakeholder classification. H4 hypothesized that using structural
content features for automatic stakeholder classification would yield effectiveness similar
to that of using textual content features because both types of content convey important
information about stakeholders. H5 hypothesized that using a combination of structural
and textual content features for automatic stakeholder classification would be more
effective than using either set of the features alone, because it could reflect the Web-page
information more completely.
6.6 Experimental Results and Implications
This section reports and discusses the findings of our study. Table 6.7 details the
results of hypothesis testing. Table 6.8 lists subjects' profiles. Table 6.9 lists the within-class accuracies achieved by different methods (NN, SVM, baseline, and human
judgment). Table 6.10 summarizes subjects' preferences toward automatic business
stakeholder classification.
6.6.1 Algorithm Comparison
Hypotheses H1a, H1b, and H1c were not confirmed, because NN performed significantly
differently than SVM when the same set of features was used. Although both
algorithms were robust in automatic classification, our findings
revealed that their
performances actually varied according to the types of features used. NN performed
better than SVM when structural content features were used. Information about the
positions where lexicon terms appeared in Web pages (structural information) contributed
more to NN's performance. In contrast, SVM performed better than NN when textual
content features or a combination of both feature sets were used. Information about
textual content of Web pages (textual content) contributed more to SVM's performance.
We believe more studies would be needed to identify optimal feature sets for each
algorithm.
Table 6.7: Results of hypothesis testing
Hypothesis | Method 1: Mean (S.D.) | Method 2: Mean (S.D.) | p-value | Result
H1a (Structural) [1] | NN 0.50 (0.25) | SVM 0.25 (0.23) | 0.00 | Not Confirmed
H1b (Textual) | NN 0.31 (0.21) | SVM 0.43 (0.24) | 0.00 | Not Confirmed
H1c (Combined) | NN 0.19 (0.16) | SVM 0.44 (0.24) | 0.00 | Not Confirmed
H2a (Structural) | NN 0.50 (0.25) | Random 0.08 (0.15) | 0.00 | Confirmed
H2b (Textual) | NN 0.31 (0.21) | Random 0.08 (0.15) | 0.00 | Confirmed
H2c (Combined) | NN 0.19 (0.16) | Random 0.08 (0.15) | 0.01 | Confirmed
H2d (Structural) | SVM 0.25 (0.23) | Random 0.08 (0.15) | 0.02 | Confirmed
H2e (Textual) | SVM 0.43 (0.24) | Random 0.08 (0.15) | 0.00 | Confirmed
H2f (Combined) | SVM 0.44 (0.24) | Random 0.08 (0.15) | 0.00 | Confirmed
H3a | Human [4] 0.56 (0.20) | NN 0.33 (0.10) | 0.00 | Not Confirmed
H3b [2] | Human 21.68 (7.81) | NN 0.33 (0.00) | 0.00 | Confirmed
H3c | Human 0.56 (0.20) | SVM 0.43 (0.10) | 0.00 | Not Confirmed
H3d [2] | Human 21.68 (7.81) | SVM 0.017 (0.00) | 0.00 | Confirmed
H4a (NN) | Structural 0.50 (0.25) | Textual 0.31 (0.21) | 0.00 | Not Confirmed
H4b (SVM) | Structural 0.25 (0.23) | Textual 0.43 (0.24) | 0.00 | Not Confirmed
H5a | NN (Combined) 0.19 (0.16) | NN (Structural) 0.50 (0.25) | 0.00 | Not Confirmed [3]
H5b | NN (Combined) 0.19 (0.16) | NN (Textual) 0.31 (0.21) | 0.01 | Not Confirmed [3]
H5c | SVM (Combined) 0.44 (0.24) | SVM (Structural) 0.25 (0.23) | 0.00 | Confirmed
H5d | SVM (Combined) 0.44 (0.24) | SVM (Textual) 0.43 (0.24) | 0.33 | Not Confirmed
Notes:
1. Effectiveness, ranging from 0 to 1, was measured by the overall accuracy.
2. For hypotheses H3b and H3d, efficiency was measured by the time used (in minutes).
3. The opposite of the hypothesis was confirmed.
4. The number of student subjects was 36.
Table 6.8: Subjects' profiles
Dimension | Profiles
Computer usage | On average, subjects spend 15 to 20 hours per week using a computer
Gender | 18 males, 18 females (total participants: 36)
Education | 32 undergraduate students, 2 with an associate degree, 2 with a bachelor's degree (all participants are business college students registered in an undergraduate MIS class)
Age | 31 subjects aged between 18 and 25; 3 subjects aged between 26 and 30; 2 subjects aged between 31 and 35
Table 6.9: Within-class accuracies achieved by different methods
Stakeholder type | Frequency of occurrences | NN: Structural | NN: Textual | NN: Combined | SVM: Structural | SVM: Textual | SVM: Combined | Random | Frequency of occurrences (student study) | Student
Partners/
0.59
37
0.62
0.86
0.70
0.97
0.97
0.08
156
0.97
suppliers/
sponsor
Customers
4
3
2
1
0
15
6
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0.25
0
0
0
0
0
0.17
0
0
0
0
0
0
0
0
0
0
0
0
0.27
0
0
0
0
0
0
0.27
0
0
0
0
0
0
0.13
0.17
24
0
0
0
0
0
12
0.42
0
0
0
0
0
0.17
51
0.69
0.22
0
0
0.33
0.33
0.04
108
0.53
23
0.52
0
0
0
0.22
1
0
0
0
0
0
Average
13
0.17
0.10
0.10
0.09
0.16
* No stakeholder belonging to the type "competitor" was found.
0.22
0
0.16
0.13
0
0.05
60
0
32.7
0.42
0
0.19
Employees
Shareholders
Government
Competitors*
Communities
Education/
Research
institutions
Media/
Reviewer
Portal
Unknown
Table 6.10: Subjects' preferences toward automatic stakeholder classification
Dimension | Mean* | Standard deviation
Because there is much information on the Web, an automatic approach to business stakeholder analysis is needed. | 2.3 | 0.97
An automatic approach to business stakeholder analysis will help business analysts to identify and classify business relationships on the Web. | 2.0 | 0.91
An automatic approach to business stakeholder analysis will save the time of business analysts. | 1.6 | 0.76
* We used a 7-point Likert scale in which "1" represents "strongly agree" and "7" represents "strongly disagree."
6.6.2 Effectiveness of the Framework
The results of testing hypotheses H2a-f were all positive, confirming our belief that
the use of any combination of features and techniques in automatic stakeholder
classification would outperform the baseline method significantly. Our framework
has integrated human knowledge with machine-learned information related to stakeholder
types. Automatic Web-page classification of business stakeholder pages could alleviate
information overload and make analysis more efficient. The experimental findings
showed that our framework was significantly better than random classification. We
therefore conclude that it is promising to apply the framework to automating business
stakeholder analysis on the Web.
6.6.3 Comparing the Framework with Human Judgment
Hypotheses H3a and H3c were not confirmed. Overall, humans were more effective
than NN or SVM, because they could rely on more clues in performing classification
(e.g., contextual factors, graphics and other media on Web pages, and combinations of
them). By using their experience in Internet browsing and searching, humans also could
narrow down the potential types to which a stakeholder might belong. In contrast, the
algorithms lacked such rich experience to aid classification. However, as shown in Table
6.9, the overall within-class accuracies were similar between humans and algorithms
(ranging from 0.16 to 0.19).
Taking a closer look at the within-class accuracies, we found that the algorithms outperformed human judgment in classifying certain stakeholder types. SVM (textual/combined) achieved the best within-class accuracies of the partner/sponsor/supplier (0.97) and community (0.27) types. NN (structural) achieved the best within-class accuracies of the media/reviewer (0.69) and portal (0.52) types. These stakeholder types were among the most frequently occurring. In contrast, humans achieved lower within-class accuracies in all these types. Nevertheless, humans achieved the best within-class accuracies of the customer (0.42) and educational/research institution (0.17) types, because they could rely on their own experiences as customers and were students familiar with school Web sites. Therefore, we conclude that humans achieved significantly higher overall accuracy than either of the two algorithms, but they performed less well than the algorithms in classifying certain stakeholder types.
Hypotheses H3b and H3d were confirmed, demonstrating the high efficiency of using
the automatic framework to facilitate stakeholder analysis. The subjects took an average
22 minutes to finish the task. Their times to completion varied considerably, ranging from the
longest of 42 minutes to the shortest of 11 minutes. In contrast, the machine-learning
algorithms took from a few seconds to less than a minute to finish the classification. Such
encouraging results have led us to conclude that our framework, with its high
efficiency, could significantly augment human work.
6.6.4 Studying the Use of Features
To our surprise, hypotheses H4a-b, H5a-b, and H5d were not confirmed, showing that
different feature sets yielded different performances of the algorithms. Structural content
features enabled NN to achieve significantly better effectiveness than textual content
features. On the other hand, textual content features enabled SVM to achieve
significantly better effectiveness than structural content features. Furthermore, a
combination of the two feature sets made NN less effective than using any one set alone.
Another observation about combined features was that structural content features did not
add value to the performance of SVM, as shown by the insignificant result of testing H5d.
However, we do not know exactly why the feature sets affected the two algorithms differently. Future research can explore
this issue further by studying the effect of features and the nature of the algorithms in more
depth.
Hypothesis H5c was confirmed, once again leading us to believe that structural
content features did not add value to the performance of SVM. Future research can study
how to improve the use of structural features in SVM.
6.6.5 Users' Subjective Comments
Strong preferences toward an automatic approach to business stakeholder analysis
were shown in the user study (see Table 6.10). The approach was perceived to be
necessary to alleviate information overload on the Web (rating = 2.28 where 1 =
"Strongly agree") and to be able to help business analysts identify and classify business
relationships (rating = 2.03). Regarding the efficiency issue, subjects showed an
overwhelming agreement on the statement that the framework would save the time of
business analysts (rating = 1.64). None of the subjects gave a rating of "6" or "7"
("Strongly disagree") to any of the above three statements. Some subjects also provided
favorable comments, such as "It would be very helpful!" "That's cool!" and "I want to
use it." From these results, we conclude that the framework was perceived very
favorably as helping business analysts identify and classify business stakeholders. It
also confirmed our belief that the framework can facilitate human interaction in business
stakeholder analysis.
6.7 Conclusions
As the Web is used increasingly to share and disseminate company and industry
information, understanding stakeholder relationships has become an important area of
business analysis. In this chapter, we have applied our automatic text mining framework
to business stakeholder analysis on the Web. Human expert knowledge and machine-learned information about business stakeholders have been integrated to enable effective
and efficient analysis. Results of our experiment involving algorithm comparison, feature
comparison, and user evaluation showed that the framework is promising in terms of
effectiveness and efficiency. The framework
also achieved excellent within-class
accuracies in classifying certain frequently appearing stakeholder types. Subjects in our
user study strongly agreed that such a framework would save business analysts' time and
help in stakeholder analysis. There is a strong potential to use the framework to augment
traditional stakeholder classification.
CHAPTER 7. CONCLUSIONS AND FUTURE DIRECTIONS
Over the past decade, the Internet has emerged as the global platform for information
storage, communication, and business transactions. The world has witnessed fundamental
changes in human lives attributable to the Internet. People enjoy sharing and getting
information on the World Wide Web. Web sites such as eBay and Yahoo! have become
large "information cities" serving millions of users every day (Sairamesh et al., 2004).
Individuals and consumers are empowered through the knowledge they gain from richer
content and more convenient access to information.
As the Web grows exponentially in information content, so does the difficulty of
discovering knowledge from it. Information overload is becoming ever more serious.
Voluminous results returned by search engines challenge understanding of the structures
and patterns behind them. Business relationships become more complicated as more
stakeholders participate in e-commerce activities. Useful knowledge for business analysis
is thus embedded in voluminous and interconnected Web resources.
This dissertation research investigated how an automatic text mining framework can
address some of these problems. A central thesis is to determine whether such a
framework can effectively and efficiently enhance knowledge discovery from a large
amount of Web textual data. The framework involves collection, conversion, extraction,
analysis, and visualization of Web data by means of various data and text mining
techniques. Three empirical studies were conducted to demonstrate the usability and
value of the framework. In Chapter 4, the framework was applied to building an
intelligent Web search portal that, in addition to textual lists of results, provides post-retrieval analysis. Knowledge in the form of integrated and analyzed information from
different Chinese business regions was discovered. In Chapter 5, the framework was
applied to developing two browsing methods for clustering and visualizing business Web
pages. Knowledge about the business landscape and community was discovered. In Chapter
6, the framework was applied to classifying business stakeholder Web pages into
different types. Knowledge embedded in interconnected pages was discovered to show
business relationships. In the experiment of each study, at least thirty human subjects
participated to provide subjective judgments and ratings on related issues. Objective
metrics such as precision, recall, and accuracy were used to measure performance of each
system.
7.1 Conclusions
A general conclusion is that the proposed framework helped alleviate information
overload and enhance human analysis on the Web effectively and efficiently.
The results of evaluating CBizPort indicate that the portal performed comparably
with existing regional Chinese search engines in searching and browsing. Subjects liked
the summarizer and the categorizer, which helped extract useful information. Their
favorable comments supported the usefulness of CBizPort's analysis capability.
However, insignificant testing results suggested further improvements to the framework
to enhance searching for information distributed in heterogeneous sources.
The results of comparing the browsing methods (Web community and knowledge
map) implemented in Business Intelligence Explorer with result list and Kartoo search
engine are very positive. The clustered results visualized on screen alleviated information
overload and helped users explore business intelligence effectively and efficiently. Web
community and knowledge map respectively had better user ratings than result list and
Kartoo. Thus the framework facilitated exploration of business intelligence on the Web
effectively and efficiently.
The results of the experiment comparing algorithms, human analysis, and a baseline
method in business stakeholder classification showed that the framework could classify
certain frequently appearing stakeholder types (e.g., partners/suppliers/sponsors,
media/reviewer) effectively and efficiently. The framework was found to assist in
identification and extraction of business stakeholder relationships on the Web. Subjects
strongly agreed that the framework could save analysts' time and could help identify and
classify stakeholder relationships.
In summary, several insights regarding human-computer interaction have been
obtained from the studies. The framework helped to meet analysis needs that would
otherwise require substantial human effort. Such needs include summarizing, classifying,
visualizing, exploring the information landscape, and extracting relationships. Each
empirical study incorporated a thorough review of related areas of study so as to ensure
appropriate use of the framework, which appeared to free humans to perform other value-added work.
7.2 Contributions
Overall, the dissertation research has contributed to developing and validating a
useful and comprehensive framework for knowledge discovery on the Web. The
integration and application of these techniques together with appropriate human
intervention are new.
Chapter 4 presents a new integration and application of summarization and pattern
extraction techniques to building CBizPort, an intelligent Web portal for searching and
browsing in a heterogeneous environment. CBizPort provides meta-searching and post-retrieval analyses that are not found in existing Chinese search engines. Chapter 5
presents a new application of clustering and visualization techniques to exploring
business intelligence on the Web, an application found neither in existing business
intelligence tools nor in previous research. Chapter 6 presents a novel application of
classification techniques to business stakeholder analysis that was empirically shown to
be able to augment traditional business stakeholder analysis. Such application is new to
stakeholder researchers and commercial tool developers.
In addition, a better understanding of HCI on the Web has been achieved through this
research. Chapter 4 contributes to understanding the needs of information seeking from
heterogeneous sources. Chapter 5 promotes understanding of human analysis needs by
means of visual analysis. Chapter 6 extends current stakeholder research by providing a
new perspective for automated analysis.
7.3 Relevance to Business, Management, and MIS
Because the dissertation deals with two profound concepts - knowledge and the Web,
it is highly relevant to the business, management, and management information systems
(MIS) disciplines.
As knowledge management (reviewed in Section 2.1.3) becomes a more and more
important function in today's networked organizations, effective and efficient discovery
of knowledge on the Web assumes a crucial role to their success. This dissertation
research provides knowledge management practitioners with a novel framework for
managing rapidly growing enterprise content on the Web that can serve to guide future
research in knowledge management.
The discipline of MIS, which deals with the application of information technology to
solving real business problems, is relevant to this research. Multiple related disciplines
have been surveyed in Chapter 2. Various examples of IT application have been provided
in Chapters 4 to 6.
The BI applications described in this dissertation are especially important to
businesses relying on the Internet. Such applications serve to facilitate human analysis,
gather business intelligence, and discover hidden knowledge. Strategic advantages can be
gained by using the framework.
7.4 Limitations
Several limitations of this research are discussed below.
There are technical limitations related to system development. CBizPort (described in
Chapter 4) is a prototype system that lacks the professional operations and technical
support that benchmark search engines enjoy. The BI Explorer (described in Chapter 5) is
limited by high computational intensity in co-occurrence analysis, which requires O(n²)
computational time, and by cumbersome identification of Web communities, which
involves
recursive application of a genetic algorithm for optimization. The
implementation of the neural network algorithm (described in Section 6.4.3.3) has not been
optimized, so its runtime efficiency was lower than that of SVM.
The research testbeds adopted were limited to data from public sources such as search
engines and business Web sites that are typically noisy and error-prone. For example, the
Web pages used as corpus to extract Chinese phrases (described in Section 4.4.5.2) may
contain many irrelevant terms that were undesirably included in the lexicon. Relying on
Google alone to identify stakeholder relationships (see Section 6.4.1) may have ignored
relationships not captured by the data.
Reliance on student subjects in the laboratory experiment of each empirical study
inevitably reduced external validity of the findings. Inability to recruit BI professionals to
use and comment on the systems was a major limitation. A limited experimental period
also restricted optimal testing of the systems' functionality, especially in the CBizPort
study.
7.5 Future Directions
This section discusses ongoing work and future directions related to this research.
In addition to developing CBizPort, the framework has been used to develop Internet
search portals in the Spanish business and Arabic medical domains. Regional and
language-specific issues will be addressed. Additional analysis techniques can be used to
enhance Internet searching and browsing in a heterogeneous environment.
The framework can be applied to the terrorism domain, amidst growing concern for
national security following the terrorist attacks (Popp et al., 2004). Tools and techniques
enabling deeper analysis and visualization can be developed (Chen et al., 2004).
Currently, we are developing a methodology for collecting and analyzing information on
the "Dark Web," the alternate side of the Web that is used to help to achieve the evil
goals of terrorists and extremists. Results are expected to contribute to advancing the
field of intelligence and security informatics. Other domains that are worth exploring
include bioinformatics (Doom et al., 2004) and medical informatics (Srinivasan, 2004)
which deal with knowledge discovery from voluminous textual data.
As an extension of the work described in Chapter 5, new visualization metaphors will
be developed for Web browsing. Metaphors that exploit the nature of the Web as well as
features of a specific domain may bring a more satisfactory and pleasurable browsing
experience to users. The map display and hierarchical display used in the study are only
two of many potentially available visualization metaphors. Others such as 3D displays
and animation could be further studied.
Several lines of research will be pursued to extend the work of Web-based business
stakeholder analysis. As classification is a beginning step of business stakeholder
analysis, a promising direction is automation of the next steps of such analysis. With
more expert participation and more Web page data, type-specific stakeholder analysis can
be performed. For example, partner relationships are often important in developing
business strategies. Gaining more specific knowledge about such relationships through
automatic approaches is expected to help. In addition, stakeholder relationships form
patterns over time. Tracing such patterns with automatic techniques such as visualization
is likely to uncover knowledge about the competitive environment. Another interesting
direction is to automate cross-regional business stakeholder analysis. Multinational
business partnerships and cooperation can be analyzed through explicit information
posted on the Web. Related human-computer interaction issues can be explored.
A promising future direction is to develop new text mining and visualization
techniques to facilitate more effective and efficient knowledge discovery. Their
relationships with knowledge management, stakeholder analysis, and HCI are interesting.
Theoretical and technical issues will be studied. Such efforts are expected to advance the
disciplines of science, technology, and management.
APPENDIX A: DOCUMENTS RELATED TO CHAPTER 4
A.1 Approval Letter from the University Human Subjects Committee
THE UNIVERSITY OF ARIZONA
Human Subjects Protection Program
Tucson, Arizona

19 November 2002

Wingyan Chung, M.S.
Advisor: Hsinchun Chen, Ph.D.
Management Information Systems
McClelland Hall, Room 430
PO Box 210108

RE: USER STUDY OF THE CHINESE BUSINESS INTELLIGENCE PORTAL

Dear Mr. Chung:

We received documents concerning your above cited project. Regulations published by the U.S. Department of Health and Human Services [45 CFR Part 46.101(b)(2)] exempt this type of research from review by our Institutional Review Board. Note: A copy of your disclaimer form, with IRB approval stamp affixed, is enclosed for duplication and use in enrolling subjects.

Exempt status is granted with the understanding that no further changes or additions will be made either to the procedures followed or to the consenting instrument used (copies of which we have on file) without the review and approval of the Human Subjects Committee and your College or Departmental Review Committee. Any research related physical or psychological harm to any subject must also be reported to each committee.

Thank you for informing us of your work. If you have any questions concerning the above, please contact this office.

Sincerely,

David G. Johnson, M.D.
Chairman
Human Subjects Committee

DGJ/js
cc: Departmental/College Review Committee
A.2 Subject's Disclaimer Form
SUBJECT'S DISCLAIMER FORM

[IRB approval stamp]

Title of Study: User Study on the Chinese Business Intelligence Portal

You are being invited to voluntarily participate in the above-titled research study. The study is on Web browsing using different browsing tools. You are eligible to participate because you are over age 18.

If you agree to participate, your participation will involve searching and browsing information on the Web. The experiment will take place in the Artificial Intelligence Lab, McClelland Hall. During the experiment, written notes will be made in order to help the investigator review what is said. Your name will not appear on these notes. A questionnaire will be used to record your answers to the experiment tasks on Web site browsing. Your name will not appear on this questionnaire. The whole experiment takes about an hour.

Any questions you have will be answered and you may withdraw from the study at any time. There are no known risks from your participation and no direct benefit from your participation is expected. There is no cost to you except for your time and you will receive $10 after completing all experiment tasks.

Only the research investigators in this user study will have access to your name and the information that you provide. Records containing research-related data will not be identified using your name. In order to maintain your confidentiality, your name will not be revealed in any reports that result from this project. Questionnaire information will be saved in a secure place.

You can obtain further information from the investigators at (520) 621-2748. If you have questions concerning your rights as a research subject, you may call the University of Arizona Human Subjects Protection Program office at (520) 626-6721.

By participating in the user study, you are giving permission for the investigators to use your information for research purposes.

Thank you.

(for) Wingyan Chung, Zan Huang, Gang Wang, Yiwen Zhang
A.3 Questionnaire for CBizPort Evaluation
Questionnaire for CBizPort Evaluation
Thank you for participating in this experiment.
In this experiment, you will be asked to perform search tasks
and browsing tasks using the Yahoo HK search engine and the Chinese
Business Intelligence Portal (CBizPort). The whole experiment
takes about an hour and includes a total of 10 tasks. You are
welcome to ask any questions during the experiment.
Scores (Tasks 1 to 10): ___ ___ ___ ___ ___ ___ ___ ___ ___ ___
Section 1
In this section, you will use Yahoo HK to perform all the tasks. The URL of Yahoo HK is
http://hk.yahoo.com/
Task 1 (4 min.)
Under the Companies Ordinance in Hong Kong, what is the application fee (HK$) for
establishing a local private company having a share capital in Hong Kong? (S1)
Answers:
Task 2 (5 min.)
In what different aspects does the Hong Kong Government assist the development of
local small and medium enterprises (SME)? Summarize your answers as a number of
distinct themes (short phrases or sentences). (A relevant theme should describe the
actions and policies that the Hong Kong Government has taken to assist the development of
SME.) (B1)
Answers:
Task 3 (4 min.)
What is the total sales amount (in quantity) of CPU chips in mainland China in 2001?
(S3)
Answers:
Task 4 (5 min.)
Summarize the current situation of the desktop computer manufacturing industry in China
as a number of distinct themes (short phrases or sentences). (A relevant theme should
describe the sales figures, market situations, and technological developments.) (B3)
Answers:
Post-study Survey
Please rate the following dimensions regarding the information quality provided by
Yahoo HK.
Dimension | Definition | Your satisfaction (1 = very dissatisfied, 7 = very satisfied, N = no comment)
Accessibility | The extent to which information is available, or easily and quickly retrievable | 1 2 3 4 5 6 7 N
Appropriate amount of information | The extent to which the volume of information is appropriate for the task at hand | 1 2 3 4 5 6 7 N
Believability | The extent to which information is regarded as true and credible | 1 2 3 4 5 6 7 N
Completeness | The extent to which information is not missing and is of sufficient breadth and depth for the task at hand | 1 2 3 4 5 6 7 N
Concise Representation | The extent to which information is compactly represented | 1 2 3 4 5 6 7 N
Consistent Representation | The extent to which information is presented in the same format | 1 2 3 4 5 6 7 N
Ease of Manipulation | The extent to which information is easy to manipulate and apply to different tasks | 1 2 3 4 5 6 7 N
Free-of-error | The extent to which information is correct and reliable | 1 2 3 4 5 6 7 N
Interpretability | The extent to which information is in appropriate languages, symbols, and units, and the definitions are clear | 1 2 3 4 5 6 7 N
Objectivity | The extent to which information is unbiased, unprejudiced, and impartial | 1 2 3 4 5 6 7 N
Relevancy | The extent to which information is applicable and helpful for the task at hand | 1 2 3 4 5 6 7 N
Reputation | The extent to which information is highly regarded in terms of its source or content | 1 2 3 4 5 6 7 N
Timeliness | The extent to which information is sufficiently up-to-date for the task at hand | 1 2 3 4 5 6 7 N
Understandability | The extent to which information is easily comprehended | 1 2 3 4 5 6 7 N
Value-Added | The extent to which information is beneficial and provides advantages from its use | 1 2 3 4 5 6 7 N
Please provide your ratings below.
Statement | Your satisfaction (1 = very dissatisfied, 7 = very satisfied, N = no comment)
I am satisfied with the system's capability of searching Web pages from Chinese regions that are culturally different from my origin | 1 2 3 4 5 6 7 N
Overall speaking, I am satisfied with the system | 1 2 3 4 5 6 7 N
Section 2
In this section, you will use CBizPort to perform all the tasks. In task 5 and 6, you are
only allowed to use the basic search capability of CBizPort (with neither summarizer nor
analyzer). In task 7 and 8, you are only allowed to use the basic search capability and
summarizer of CBizPort (without analyzer). In task 9 and 10, you are only allowed to use
the basic search capability and analyzer of CBizPort (without summarizer). The URL of
CBizPort is http://ail7.bpa.arizona.edu:8080/big5biz/index.html
Task 5 (CBizPort with basic searching only, give a specific answer) - 4 min.
In which cities of Mainland China and Taiwan has Motorola established manufacturing
centers? (S4)
Answers:
Task 6 (CBizPort with basic searching only, find as many themes as possible) - 5 min.
Compare Shanghai and Beijing with respect to the different aspects (transportation,
application process, rent) of establishing the regional headquarters of a foreign computer chip
manufacturing company. Summarize your findings in a number of distinct themes. (A
relevant theme should describe the relative advantages of the two cities and compare
them in meaningful dimensions.) (B4)
Answers:
Task 7 (basic searching + summarizer only) - 4 min.
It is more common than before for Taiwan employees to work in mainland China.
State the formula used to calculate the taxable income of these employees. (S5)
Answers:
Task 8 (basic searching + summarizer only) - 5 min.
Summarize the current situation of the computer motherboard manufacturing industry in
Taiwan in a number of distinct themes. (A relevant theme should describe the sales
figures, market situations, and technological developments.) (B5)
Answers:
Task 9 (basic searching + analyzer only) - 4 min.
Which company announced the smallest cell phone chips in August, 2002? (S9)
Answers:
Task 10 (basic searching + analyzer only) - 5 min.
List at least three new materials used in computer chips manufacturing industry. (B9)
Answers:
Post-study Survey
Please rate the following dimensions regarding the information quality provided by
CBizPort.
Dimensions and Definitions
(Rate your satisfaction with each dimension on the scale 1  2  3  4  5  6  7  N, where 1 = very dissatisfied, 7 = very satisfied, and N = no comment; a short tabulation sketch follows the table.)
Accessibility: The extent to which information is available, or easily and quickly retrievable
Appropriate amount of information: The extent to which the volume of information is appropriate for the task at hand
Believability: The extent to which information is regarded as true and credible
Completeness: The extent to which information is not missing and is of sufficient breadth and depth for the task at hand
Concise Representation: The extent to which information is compactly represented
Consistent Representation: The extent to which information is presented in the same format
Ease of Manipulation: The extent to which information is easy to manipulate and apply to different tasks
Free-of-error: The extent to which information is correct and reliable
Interpretability: The extent to which information is in appropriate languages, symbols, and units, and the definitions are clear
Objectivity: The extent to which information is unbiased, unprejudiced, and impartial
Relevancy: The extent to which information is applicable and helpful for the task at hand
Reputation: The extent to which information is highly regarded in terms of its source or content
Timeliness: The extent to which information is sufficiently up-to-date for the task at hand
Understandability: The extent to which information is easily comprehended
Value-Added: The extent to which information is beneficial and provides advantages from its use
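The sketch below is illustrative only and is not part of the original questionnaire. It assumes each completed survey is recorded as a Python dictionary mapping a dimension name to a rating of "1" through "7" or "N", and it computes the mean rating per dimension while treating "N" (no comment) as a missing value.

def summarize(responses):
    """Return the mean satisfaction rating per dimension, ignoring 'N' (no comment) answers."""
    by_dimension = {}
    for response in responses:                      # one dict per subject
        for dimension, rating in response.items():
            if rating != "N":                       # 'N' is treated as missing
                by_dimension.setdefault(dimension, []).append(int(rating))
    return {d: sum(v) / len(v) for d, v in by_dimension.items()}

# Hypothetical example: two subjects rating three of the dimensions
sample = [
    {"Accessibility": "6", "Timeliness": "5", "Value-Added": "N"},
    {"Accessibility": "7", "Timeliness": "4", "Value-Added": "6"},
]
print(summarize(sample))    # {'Accessibility': 6.5, 'Timeliness': 4.5, 'Value-Added': 6.0}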
Please provide your ratings below.
Statement: I am satisfied with the system's capability of searching Web pages from Chinese regions that are culturally different from my origin.
Your satisfaction (1 = very dissatisfied, 7 = very satisfied; N = no comment):  1  2  3  4  5  6  7  N
Statement: Overall speaking, I am satisfied with the system.
Your satisfaction (1 = very dissatisfied, 7 = very satisfied; N = no comment):  1  2  3  4  5  6  7  N
Briefly compare the search tools in Section 1 and Section 2:
Personal Information
Computer literacy:   Poor   1   2   3   4   5   6   7   Excellent
Gender: M / F
Age range: 18-25 , 26-30 , 31-35 , 36-40 , 41 or above
Education: Undergrad. Student , Bachelor earned , Master earned , Doctor earned
Thank you very much!
APPENDIX B: DOCUMENTS RELATED TO CHAPTER 5
B.1 Approval Letter from the University Human Subjects Committee
THE UNIVERSITY OF ARIZONA
Human Subjects Protection Program
Tucson, Arizona
16 September 2002
Wingyan Chung, M.S.
Advisor: Hsinchun Chen, Ph.D.
Department of Management Information Systems
McClelland Hall, Room 430
PO BOX 210108
RE: A KNOWLEDGE MAP APPROACH TO THE DISCOVERY OF BUSINESS INTELLIGENCE ON THE WEB
Dear Mr. Chung:
We received documents concerning your above cited project. Regulations published by the U.S.
Department of Health and Human Services [45 CFR Part 46.101(b)(2)] exempt this type of research
from review by our Institutional Review Board. Note: A copy of your disclaimer form, with IRB
approval stamp affixed, is enclosed for duplication and use in enrolling subjects.
Exempt status is granted with the understanding that no further changes or additions will be made
either to the procedures followed or to the consenting instrument used (copies of which we have on
file) without the review and approval of the Human Subjects Committee and your College or
Departmental Review Committee. Any research related physical or psychological harm to any
subject must also be reported to each committee.
Thank you for informing us of your work. If you have any questions concerning the above, please
contact this office.
Sincerely,
Rebecca Dahl, R.N., Ph.D.
Director
Human Subjects Protection Program
RD/js
cc: Departmental/College Review Committee
B.2 Subject's Disclaimer Form
[IRB approval stamp: Approved by University of Arizona IRB]
SUBJECT'S DISCLAIMER FORM
Title of Study: User Study on Internet Browsing
You are being invited to voluntarily participate in the above-titled research study. The study is on Web
browsing using different browsing tools. You are eligible to participate because you are over age 18.
If you agree to participate, your participation will involve browsing Web site information on 9 business topics.
The experiment will take place in Room 430X, McClelland Hall. During the experiment, written notes will be
made in order to help the investigator review what is said. Your name will not appear on these notes. A
questionnaire will be used to record your answers to the experiment tasks on Web site browsing. Your name
will not appear in this questionnaire. The whole experiment takes about an hour.
Any questions you have will be answered and you may withdraw from the study at any time. There are no
known risks from your participation and no direct benefit from your participation is expected. There is no cost
to you except for your time and you will receive $5 after completing all experiment tasks. In addition, the best
10 participants will get an extra bonus ($10 or $20) on top of the $5 you receive.
Only the principal investigator will have access to your name and the information that you provide. Records
containing research-related data will not be identified using your name. In order to maintain your
confidentiality, your name will not be revealed in any reports that result from this project. Questionnaire
information will be saved in a secure place.
You can obtain further information from the principal investigator, Wingyan Chung (M.S.), at (520) 621-2748.
If you have questions concerning your rights as a research subject, you may call the University of Arizona
Human Subjects Protection Program office at (520) 626-6721.
By participating in the user study, you are giving permission for the investigator to use your information for
research purposes.
Thank you.
B.3 Questionnaire for User Study on Internet Browsing
User Study on Internet Browsing
Number:
Date:
Thank you for participating in this experiment. We are now conducting a user study on
Web browsing using different browsing tools. Please read the following description of
the experiment.
1. The objective of this user study is to evaluate the effectiveness, efficiency and
usability of a prototype system called "Business Intelligence Explorer". The
system is designed to display Web site information in 3 different settings: a
result list display, a hierarchically organized Web community display, and a
knowledge map display.
2. In this experiment, you will be presented with the results on 9 topics related to
business intelligence. You will be asked to perform browsing tasks using each of
the 3 displays. Another commercial searching tool (Kartoo.com) will be used for
comparison.
3. During the experiment, you may browse the actual Web sites using Internet
Explorer or Netscape. However, you are not allowed to use other search tools
(such as search engines) to answer the questions.
4. Please try your best to find answers to the questions. If you cannot find the
answers for some questions, just leave them blank. Please finish one question at a
time. Do not go back to previous questions.
5. The evaluation for each browsing tool takes about 15 minutes. The whole
experiment takes about an hour. Please feel free to ask any questions during the
experiment.
Think aloud: You are encouraged to state explicitly the reasons/rationale
for each choice you make as you interact with the system, so that we can
better understand user behavior.
Part I. Tasks using Result List display
In this part, you are going to use the Result list display to browse and search for
information. Please complete the following two tasks.
1. Select the category "database technology". Write down the URLs and the major
business areas for the following companies:
No.   Company                            URL                  Major Business Areas
1     International Sybase User Group
2     The Database Technology Group
2. Select the category "customer relationship management". Write down the titles
and URLs of the Web sites that are related to "customer relationship management
benchmarking" (You can write down the information of as many Web sites as you
find).
Post Study Survey (Part I. Result List Display)
1. How familiar are you with "database technology"?
   Not familiar   1   2   3   4   5   Very familiar
2. How familiar are you with "customer relationship management"?
   Not familiar   1   2   3   4   5   Very familiar
3. How helpful is the list of result pages to your browsing?
   Not helpful   1   2   3   4   5   Very helpful
4. Overall speaking, how do you rate the result list display for Web browsing?
   Poor   1   2   3   4   5   Excellent
5. Please write down your comments and reasons on the result list display for Web
browsing:
a. Strengths:
Reasons:
b. Weaknesses:
Reasons:
Part II. Tasks using Web Community display
In this part, you are going to use the Web community display to browse and search for
information. Please complete the following two tasks.
1. Select the category "Supply chain management". Write down the URLs and the
major business areas for the following companies:
No.   Company              URL                  Major Business Areas
1     Gensym Corporation
2     MRO Software
2. Select the category "Information management". Write down the titles and URLs
of the Web sites that are related to "digital libraries" (You can write down the
information of as many Web sites as you find).
Post Study Survey (Part II. Web community Display)
1. How familiar are you with "supply chain management"?
   Not familiar   1   2   3   4   5   Very familiar
2. How familiar are you with "information management"?
   Not familiar   1   2   3   4   5   Very familiar
3. How helpful are the labels in the Web community display to your browsing?
   Not helpful   1   2   3   4   5   Very helpful
4. Overall speaking, how do you rate the Web community display?
   Poor   1   2   3   4   5   Excellent
5. Please write down your comments and reasons on the Web community display for
Web browsing:
a. Strengths:
Reasons:
b. Weaknesses:
Reasons:
Part III(a). Tasks using Knowledge Map display
In this part, you are going to use the Knowledge Map display to browse and search for
information. Please complete the following two tasks.
1. Select the category "e-commerce solution". Write down the URLs and the major
business areas for the following companies:
No.   Company                                                            URL    Major Business Areas
1     NetStores - A comprehensive E-Commerce Solution for Your Website
2     GBwebs E-commerce Web Hosting Design Services
2. Select the category "knowledge management". Write down the titles and URLs of
the Web sites that are related to "ITtoolbox Knowledge Management" (You can
write down the information of as many Web sites as you find).
Post Study Survey (Part III(a). Knowledge Map Display)
1. How familiar are you with "e-commerce solution"?
   Not familiar   1   2   3   4   5   Very familiar
2. How familiar are you with "knowledge management"?
   Not familiar   1   2   3   4   5   Very familiar
3. How helpful is the placement of Web sites on the screen to your browsing (i.e.,
   "placement" refers to the closeness of points and its relationship with their
   similarity)?
   Not helpful   1   2   3   4   5   Very helpful
4. Overall speaking, how do you rate the knowledge map display for Web browsing?
   Poor   1   2   3   4   5   Excellent
5. Please write down your comments and reasons on the knowledge map display for
Web browsing:
a. Strengths:
Reasons:
b. Weaknesses:
Reasons:
Part III(b). Tasks using Kartoo Map display
In this part, you are going to use the Kartoo Map display to browse and search for
information. Please complete the following two tasks.
1. Write down the URLs and major business areas of the following companies.
No.   Company                            URL                  Major Business Areas
1     Prominent Web Solution Provider
2     Many Internet Service Providers
2. Write down the titles and URLs of the Web sites that are similar to or closely
related to "Quest data mining group" (You can write down the information of as
many Web sites as you find).
How helpful is the placement of Web sites on the screen to your browsing (i.e.,
"placement" refers to the closeness of points and its relationship with their similarity)?
   Not helpful   1   2   3   4   5   Very helpful
Please compare the knowledge map display with Kartoo.com map display in the
following aspects:
User Friendliness:
Graphical user interfaces:
Quality (accuracy, relevance, etc.) of results:
Meaning of placement of points
Other aspects:
Demographic Information
Please fill in some brief information about yourself:
1. Date of experiment:
2. Gender: M / F
3. Education level (check one)
   High school graduate,   University level,   Bachelor degree earned,
   Master degree earned,   Doctorate earned,   Others
4. Age range (check one)
   18-25,   26-30,   31-35,   36-40,   41-50,   51 or above
Thank you very much for participating in this experiment!!
APPENDIX C: DOCUMENTS RELATED TO CHAPTER 6
C.1 Approval Letter from the University Human Subjects Committee
THE UNIVERSITY OF ARIZONA
TUCSON, ARIZONA
2 December 2003
Wingyan Chung, M.S.
Advisor: Hsinchun Chen, Ph.D.
Department of Management Information Systems
McClelland Hall, Room 430
PO BOX 210108
RE:
USER STUDY OF THE WEB-BASED STAKEHOLDER ANALYSIS STUDY
Dear Mr. Chung:
We received documents concerning your above cited project. Regulations published by the U.S.
Department of Health and Human Services [45 CFR Part 46.101(b)(2)] exempt this type of research
from review by our Institutional Review Board. Note: A copy of your disclaimer form, with IRB
approval stamp affixed, is enclosed for duplication and use in enrolling subjects.
Exempt status is granted with the understanding that no further changes or additions will be made
either to the procedures followed or to the consenting instrument used (copies of which we have on
file) without the review and approval of the Human Subjects Committee and your College or
Departmental Review Committee. Any research related physical or psychological harm to any
subject must also be reported to each committee.
Thank you for informing us of your work. If you have any questions concerning the above, please
contact this office.
Sincerely,
Rebecca Dahl, R.N., Ph.D.
Director
Human Subjects Protection Program
cc: Departmental/College Review Committee
C.2 Subject's Disclaimer Form
[IRB approval stamp: Approved by University of Arizona IRB]
SUBJECT'S DISCLAIMER FORM
Title of Study: User Study of the Web-based Stakeholder Analysis Study
You are being invited to voluntarily participate in the above-titled research study. The study is on Web-based
stakeholder classification using different browsing tools. You are eligible to participate because you are over
age 18.
If you agree to participate, your participation will involve classifying pages on the Web. You will also be asked
to read a scenario and respond to study questions as well as fill out a post-study questionnaire. The experiment
will take place in the Artificial Intelligence Lab, McClelland Hall. During the experiment, written notes will be
made in order to help the investigator review what is said. Your name will not appear on these notes. A
questionnaire will be used to record your answers to the experiment tasks on Web site browsing. Your name
will not appear in this questionnaire. The whole experiment takes about an hour.
Any questions you have will be answered and you may withdraw from the study at any time. There are no
known risks from your participation and no direct benefit from your participation is expected. There is no cost
to you except for your time and you will receive $10 after completing all experiment tasks.
Only the research investigators in this user study will have access to your name and the information that you
provide. Records containing research-related data will not be identified using your name. In order to maintain
your confidentiality, your name will not be revealed in any reports that result from this project. Questionnaire
information will be saved in a secure place.
You can obtain further information from the investigators at (520) 621-2748. If you have questions concerning
your rights as a research subject, you may call the University of Arizona Human Subjects Protection Program
office at (520) 626-6721.
By participating in the user study, you are giving permission for the investigators to use your information for
research purposes.
Thank you.
Wingyan Chung
C.3 Questionnaire for Web-based Business Stakeholder Analysis
Questionnaire for Web-based Stakeholder Analysis
Participant Number:
Date:
Thank you for participating in this study.
In this study, you will be asked to classify business Web
pages that are shown on the screen. The whole study takes about
30 minutes. You are welcome to ask any questions during the
study.
Instruction: Please classify each result shown on the screen into one of the following
11 categories. If more than one category can be assigned to a result, please choose
the best one.
1. Partners/suppliers/sponsors - those who provide tangible or intangible resources to
support another company's operations.
2. Customers - individuals or organizations that purchase a company's products or services.
3. Employees - people who are paid by a company to perform some tasks or to achieve
some objectives of the company.
4. Shareholders - individuals or organizations that invest in a company and own stocks of
the company.
5. Governments - organizations which officially govern the society in which the company
exists.
6. Competitors - companies or organizations which compete with a company by selling
products or services similar to those sold by the company.
7. Communities - groups or individuals whose activities take place in the same environment
in which a company operates.
8. Educational/research institutions - organizations which exist to educate people or to
generate research outcomes.
9. Media/reviewers - organizations or individuals who report news about a company or
review products or services of the company.
10. Portals - Web sites that provide comprehensive information and functionality to users.
11. Unknown - stakeholder groups that cannot be classified into any one of the above
groups. (A compact summary of this category scheme is sketched below.)
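The sketch below is illustrative only and is not part of the study materials: it records the eleven categories above as a Python dictionary keyed by the category numbers used in the list, and tallies a hypothetical subject's classifications for the ten results in one stakeholder group.

from collections import Counter

STAKEHOLDER_CATEGORIES = {
    1: "Partners/suppliers/sponsors",
    2: "Customers",
    3: "Employees",
    4: "Shareholders",
    5: "Governments",
    6: "Competitors",
    7: "Communities",
    8: "Educational/research institutions",
    9: "Media/reviewers",
    10: "Portals",
    11: "Unknown",
}

def tally(labels):
    """Count how many results a subject assigned to each category (labels are 1-11)."""
    counts = Counter(labels)
    return {STAKEHOLDER_CATEGORIES[c]: n for c, n in sorted(counts.items())}

# Hypothetical classifications for the ten results shown to one subject
print(tally([1, 1, 6, 9, 10, 1, 8, 11, 9, 1]))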
Intelliseek
The following pages (stakeholder group A) link to Intelliseek
No.   Web Page                                                           Your Classification
1     Resume Searching Software
      Recruiters Network, ...
      http://www.recruitersnetwork.com/software/resume.htm - Cached
2     BlogPulse Key Phrase Citations [BETA] Automated Trend Discovery...
      Go to Blogpulse.com, ...
      http://www.blogpulse.com/03 08 14/2 keyPhrases.html - Cached
3     E-lynks alphabetized "R" links
      Scroll down the menus; E-lynks at the right, Superlynks at the left. Alphabetized
      A Lynks. Addictions. Advertising. Aerospace. Agricultural Resources. ...
      http://www.e-lynks.com/r.htm - Cached
4     Meta Search Engines
      Meta Search Engines. Jian Liu, 02/2001 [email protected] url:
      http://www.indiana.edu/~librcsd/search/. What is a meta search engine. ...
      http://www.indiana.edu/~librcsd/search/ - Cached
5     CareerProNews.com offers career classroom activity through Career...
      Career Classroom Activity A career classroom activity that will help
      your students plan their future is Career Explorer/CX Online. ...
      http://www.careerpronews.com/careerclassroomactivity.htm - Cached
6     CONVERA - Partners
      Convera and Intelliseek Partner to Broaden Product Offerings Partnership Brings
      Federated Search Capabilities to Convera; Indexing Technology to Intelliseek. ...
      http://www.convera.com/partners/partners_pr_040703.asp - Cached
7     MCC Media - Research
      Research. Following up on Nick Arnett's patent-pending work that
      was the basis of Opion Inc., now part of PlanetFeedback, a unit...
      http://www.mccmedia.com/research.html - Cached
8     Web Farming Newsletter - July 1999
      Link to WFC Website. Newsletter - July, 1999. In This Issue. New Netscape
      Service. Web Scraping in VB. Ask Jeeves Flies. Drucker Plugs Outside Info. ...
      http://www.webfarming.com/new/NL199907.html - Cached
9     What's New Jan. - Aug. 1999 on Search Engine Showdown
      Search Engine Showdown Statistics Directories Reviews Others Read More Features
      Strategies News Searches Multi-Search Engines Phone Numbers What's New Jan. ...
      http://www.searchengineshowdown.com/new99a.shtml - Cached
10    LLRX.com - MetaSearch Engines
      Front Page Bookstore Archives About Subscribe Comments, Navigation. ...
      http://www.llrx.com/features/metasearch.htm - Cached
Instruction: Please classify each result shown on the screen into one of the following
11 categories. If more than one category can be assigned to a result, please choose
the best one.
1. Partners/suppliers/sponsors - those who provide tangible or intangible resources to
support another company's operations.
2. Customers - individuals or organizations that purchase a company's products or services.
3. Employees - people who are paid by a company to perform some tasks or to achieve
some objectives of the company.
4. Shareholders - individuals or organizations that invest in a company and own stocks of
the company.
5. Governments - organizations which officially govern the society in which the company
exists.
6. Competitors - companies or organizations which compete with a company by selling
products or services similar to those sold by the company.
7. Communities - groups or individuals whose activities take place in the same environment
in which a company operates.
8. Educational/research institutions - organizations which exist to educate people or to
generate research outcomes.
9. Media/reviewers - organizations or individuals who report news about a company or
review products or services of the company.
10. Portals - Web sites that provide comprehensive information and functionality to users.
11. Unknown - stakeholder groups that cannot be classified into any one of the above
groups.
Siebel
The following pages (stakeholder group B) link to Siebel
No.   Web Page                                                           Your classification
1     CIC | About CIC: Channel Partners/Resellers
      Our channel partners (resellers, ISV's, and alliances) include worldwide leaders
      in their respective vertical market segments and industries including retail
      http://www.cic.com/about/partners/ - Cached
2     Greenbrier & Russel Alliances
      Want to understand what makes someone tick? Take a close look at the
      company they keep. Their friends. Their business relationships. ...
      http://www.gr.com/who/alliance.asp - Cached
3     The Quest for Interoperability
      The Quest for Interoperability - CRM SPECIAL REPORT ...
      http://www.crmdaily.com/perl/story/20142.html - Cached
4     SAS | Success Stories
      Home, Products and Solutions, Success Stories, Partners, Company, Customer
      Support, Worldwide Sites. ...
      http://www.sas.com/success/compaq.html - Cached
5     Enterprise Software / Lessons learned from a CRM success story - ...
      Tech Update Enterprise Software. Lessons learned from a CRM success
      story By David Southgate TechRepublic March 19, 2003 ...
      http://techupdate.zdnet.com/techupdate/stories/main/0,14179,2911582,00.html - Cached
6     Sun Country
      Site map & site overview; How many links are on this page? How many categories
      are on this page? Expand all sections; Close all sections. ...
      http://resources.solaris-x86.org/ - Cached
7     AICC Membership
      AICC Membership. ( Site Map | Search | Home |. The AICC's international
      membership includes: Major airframe manufacturers and their suppliers;
      http://www.aicc.org/pages/aicc1.htm - Cached
8     Information Technology Industry Council
      MEMBER COMPANIES. Agilent Technologies, Inc. Amazon.com AOL Time Warner
      Apple Canon USA Inc. Cisco Systems, Inc. Corning Dell Computer...
      http://www.itic.org/vote_guide/107/members.html - Cached
9     Hummingbird Ltd. Hummingbird OEM Customers
      Global Sites
      http://www.hummingbird.com/solutions/oem/customers.html - Cached
10    Software alliances
      ILOG, ...
      http://www.ilog.com/partners/directory/software.cfm - Cached
Instruction: Please classify each result shown on the screen into one of the following
11 categories. If more than one category can be assigned to a result, please choose
the best one.
1. Partners/suppliers/sponsors - those who provide tangible or intangible resources to
support another company's operations.
2. Customers - individuals or organizations that purchase a company's products or services.
3. Employees - people who are paid by a company to perform some tasks or to achieve
some objectives of the company.
4. Shareholders - individuals or organizations that invest in a company and own stocks of
the company.
5. Governments - organizations which officially govern the society in which the company
exists.
6. Competitors - companies or organizations which compete with a company by selling
products or services similar to those sold by the company.
7. Communities - groups or individuals whose activities take place in the same environment
in which a company operates.
8. Educational/research institutions - organizations which exist to educate people or to
generate research outcomes.
9. Media/reviewers - organizations or individuals who report news about a company or
review products or services of the company.
10. Portals - Web sites that provide comprehensive information and functionality to users.
11. Unknown - stakeholder groups that cannot be classified into any one of the above
groups.
WebMethods
The following pages (stakeholder group C) link to WebMethods
No.   Web Page                                                           Your classification
1     CIO Forum Financial Services 2004 | Confirmed Suppliers
      Home, Event Overview, Attendees, Conference, Press and News, General
      Information, Richmond Events. Personal Event Management. Event attendees ...
      http://www.cioforum.com/sponsorlist/ - Cached
2     WEBMETHODS INTRODUCES WEBMETHODS FOR TRADING NETWORKS
      announce message. ...
      http://lists.oasis-open.org/archives/announce/200009/msg00010.html - Cached
3     WebMethods, JBoss On Verge of Total Integration
      Computer Digital Expo. The event for search engine marketing & optimization.
      dc.internet.com/news/article.php/2195411. Back to Article. ...
      http://dc.internet.com/news/print.php/2195411 - Cached
4     Our Alliances || Deloitte's Alliance Partners
      There's nothing like good teamwork. In today's business world, going
      it alone is nearly impossible. Few companies have all of the ...
      http://www.dc.com/Expertise/Alliances/index.asp?pageaction=printable - Cached
5     Omgeo STP Partners
      back to STP Partners. As the leading independent provider of integration
      software, webMethods, Inc. (Nasdaq: WEBM - news) delivers ...
      http://www.omgeo.com/profile-webmethods.html - Cached
6     XML.com: webMethods B2B
      XML.com, Advertisement. XML.com WebServices.XML.com O'Reilly Network oreilly.com.
      Resources | Buyer's Guide | FAQs | Newsletter | Tech Jobs | Safari Bookshelf, ...
      http://www.xml.com/pub/p/34 - Cached
7     AskaChEq
      Go back HOME. Materials & Chemicals. Subject Index
      (coloured boxes indicate main categories): ...
      http://www.askache.com/AskaChEq.htm - Cached
8     Alodar Systems' Partners
      Our Partners. webMethods, The Business Integrator Company. webMethods
      is a leading provider of integration software. The webMethods ...
      http://www.alodar.com/partners.html - Cached
9     A Shifting Landscape - Computerworld
      Computerworld ...
      http://www.computerworld.com/databasetopics/data/story/0,10801,70084,00.html - Cached
10    F5 Networks - webMethods
      F5 Networks F5 Networks. ...
      http://www.f5.com/solutions/alliance/partner_webmethods/ - Cached
Post-Study Questionnaire
Participant Number:
Please fill in the following information.
Because there is much information on the Web, an automatic approach to business
stakeholder analysis is needed.
Strongly Disagree   1   2   3   4   5   6   7   Strongly Agree
An automatic approach to business stakeholder analysis will help business analysts to
identify and classify business relationships on the Web.
Strongly Disagree   1   2   3   4   5   6   7   Strongly Agree
An automatic approach to business stakeholder analysis will save the time of business
analysts.
Strongly Disagree   1   2   3   4   5   6   7   Strongly Agree
Demographic Information
Number of hours spent on using computer per week (please check):
Less than 5 hours,   5 hours to less than 10 hours,   10 hours to less than 15 hours,
15 hours to less than 20 hours,   20 hours to less than 25 hours,   25 hours to less than 30 hours,
30 hours to less than 35 hours,   Equal to or more than 40 hours
Gender: M / F
Age range: 18-25 , 26-30 , 31-35 , 36-40 , 41-50 , 51-60 , 60 or above
Education: Undergrad. Student , Associate degree earned , Bachelor earned , Master earned ,
Doctorate earned
Thank you very much!
REFERENCES
Ackoff, R.L. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16 3-9.
ACNielsen (2002). Nielsen//NetRatings reports a record half billion people worldwide
now have home Internet access. [Online]. Available at
http://asiapacific.acnielsen.com.au/news.asp?newsID=74.
Adomavicius, G., & Tuzhilin, A. (2001). Using data mining methods to build customer
profiles. IEEE Computer, 34(2), 74-82.
Agle, B.R., Mitchell, R.K., & Sonnefeld, J.A. (1999). Who Matters to CEOs? An
Investigation of Stakeholders Attributes and Salience, Corporate Performance, and
CEO Values. Academy of Management Journal, 42(5), 507-525.
Alavi, M., & Leidner, D.E. (1999). Knowledge management systems: Issues, challenges,
and benefits. Communications of the Association for Information Systems, 1(7).
Alavi, M., & Leidner, D.E. (2001). Review: Knowledge Management and Knowledge
Management Systems: Conceptual Foundations and Research Issues. MIS Quarterly,
25(1), 107-136.
Anderson, J. (1983). The Architecture of Cognition. Harvard University Press.
Anderson, R.H., Bikson, T.K., Law, S.A., & Mitchell, B.M. (1995). Universal access to
e-mail: feasibility and societal implications. Santa Monica, CA: Rand.
Applegate, L.M. (2003). Building Businesses in a Networked Economy. In: Proceedings
of MIS Fall Conference on Managing Information Technologies in Networked
Organizations, Tucson, Arizona, USA.
Bacon, F. (1620). Novum Organum. Oxford, UK: Clarendon Press.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York:
ACM Press.
Ballou, D.P., & Pazer, H.L. (1985). Modeling data and process quality in multi-input,
multi-output information systems. Management Science, 31(2), 150-162.
Barnard, C. (1938). The Function of the Executive. Cambridge: Harvard University Press.
Barney, J.B. (1991). Firm resources and sustained competitive advantage. Journal of
Management, 17 99-120.
Bates, M.J. (1989). The design of browsing and berrypicking techniques for the on-line
search interface. Online Review, 13(5), 407-431.
Belew, R.K. (1989). Adaptive information retrieval: using a connectionist representation
to retrieve and learn about documents. In: Proceedings of the 12th Annual
249
International ACM SIGIR Conference on Research and Development in Information
Retrieval (pp. 11-20), Cambridge, Massachusetts, United States.
Bellinger, G., Castro, D., & Mills, A. (2000). Data, Information, Knowledge, and
Wisdom. http://www.outsights.com/systems/dikw/dikw.htm.
Benoit, G. "Data mining," in: Annual Review of Information Science and Technology,
M.E. Williams (ed.), Information Today, Inc., Medford, NJ, 2002, pp. 265-310.
Berle, A., & Means, G. (1932). The Modern Corporation and Private Property. New
York: Commerce Clearing House.
Bharat, K., & Henzinger, M.R. (1998). Improved Algorithms for Topic Distillation in
Hyperlinked Environments. In: Proceedings of the 21st International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 104-111),
Melbourne, Australia.
Blair, D.C. (2002). Knowledge management; Hype, hope, or help? Journal of the
American Society for Information Science and Technology, 53(12), 1019-1028.
Bødker, S. (1991). Through the Interface: A Human Activity Approach to User Interface
Design. Hillsdale, NJ: Erlbaum.
Borgman, C.L. "Scholarly communication and bibliometrics," in; Annual Review of
Information Science and Technology, M.E. Williams (ed.), Information Today, Inc.,
Medford, NJ, 2002, pp. 3-72.
Borko, H. (1967). Automated Language Processing: The State of the Art. New York, NY;
John Wiley & Sons, Inc.
Bowen, T.S. (2001). English could snowball on Net, Technology Research News.
[Online]. Available at
http://www.tmmag.com/Stories/2QQl/112101/English could snowball on Net 1121
01.html.
Bowman, C.M., Danzig, P.B., Manber, U., & Schwartz, F. (1994). Scalable Internet
resource discovery: research problems and approaches. Communications of the ACM,
37(8), 98-107.
Briggs, R.O., Vreede, G.-J.D., Nunamaker, J.F., & Sprague, R. (2002). Special Issue;
decision-making and a hierarchy of understanding. Journal of Management
Information Systems, 18(4), 5-10.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search
engine. In; Proceedings of the 7th International WWW Conference, Brisbane,
Australia.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins,
A., & Wiener, J.L. (2000). Graph structure in the Web. In: Proceedings of the 9th
International World Wide Web Conference (pp. 309-320), Amsterdam, The
Netherlands: Elsevier Science.
Bui, T.N., & Moon, B.R. (1996). Genetic algorithm and graph partitioning. IEEE
Transactions on Computers, 45(7), 841-855.
Bush, V. (1945). As we may think. Atlantic Monthly, 176 101-178.
Buzan, T., & Buzan, B. (1993). The Mind Map Book: How to Use Radiant Thinking to
Maximize Your Brain's Untapped Potential. New York: Plume Books (Penguin).
Byrne, J. (2003). Answering the Questions of the Universe: Who am I and How to I Fit
In? In: Ragan Annual PR Conference, Chicago, Illinois: http://www.vfliience.com/iateractive/monitoring.html.
Carbonell, J., & Goldstein, J. (1998). The use of MMR: diversity-based reranking for
reordering documents and producing summaries. In: Proceedings of the 21st Annual
International ACM-SIGIR Conference on Research and Development in Information
Retrieval (pp. 335-336), Melbourne, Australia: ACM Press.
Carbonell, J.G., Michalski, R.S., & Mitchell, T.M. "An overview of machine learning,"
in: Machine Learning: An Artificial Intelligence Approach, R.S. Michalski, J.G.
Carbonell and T.M. Mitchell (eds.), Tioga, Palo Alto, CA, 1983, pp. 3-23.
Card, S.K., Moran, T.P., & Newell, A. (1983). The Psychology of Human-Computer
Interaction. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Carmel, E., Crawford, S., & Chen, H. (1992). Browsing in hypertext: a cognitive study.
IEEE Transactions on Systems, Man, and Cybernetics, 22(5), 865-884.
Carroll, J., Kellogg, W., & Rosson, M.B. "The task-artifact cycle," in: Designing
Interaction: Psychology at the Human-Computer Interface, J. Carroll (ed.),
Cambridge University Press, 1991.
Carroll, J.M. (1997). Human-computer interaction: psychology as a science of design.
Annual Review of Psychology, 48 61-83.
Carvalho, R., & Ferreira, M. (2001). Using information technology to support knowledge
conversion processes. Information Research, 7(1).
Chakrabarti, S., Dom, B., Gibson, D., Kleinberg, J., Kumar, S.R., Raghavan, P.,
Rajagopalan, S., & Tomkins, A. (1999a). Hypersearching the Web. Scientific
American, (June).
Chakrabarti, S., Dom, B., Kumar, S.R., Raghavan, P., Rajagopalan, S., Tomkins, A., Gibson, D.,
& Kleinberg, J. (1999b). Mining the Web's link structure. IEEE Computer, 32(8), 60-67.
Chang, S.J., & Rice, R.E. "Browsing: a multidimensional framework," in: Annual Review
of Information Science and Technology, M.E. Williams (ed.). Information Today,
Inc., Medford, NJ, 1993, pp. 231-276.
Chau, M., & Chen, H. (2003). Comparison of three vertical search spiders. IEEE
Computer, 36(5), 56-62.
Chen, H. (1995). Machine learning for information retrieval: neural networks, symbolic
learning, and genetic algorithms. Journal of the American Society for Information
Science, 46(3), 194-216.
Chen, H. (2001). Knowledge Management Systems: A Text Mining Perspective. Tucson,
AZ: The University of Arizona.
Chen, H., & Chau, M. "Web mining: machine learning for Web applications," in: Annual
Review of Information Science and Technology (ARIST), M.E. Wilhams (ed.),
Information Today, Inc., Medford, NJ, 2004.
Chen, H., Chung, W., Xu, J.J., Wang, G., Chau, M., & Qin, Y. (2004). Crime data
mining; a general framework and some examples. IEEE Computer, 37(4), 50-56.
Chen, H., Chung, Y., Ramsey, M., & Yang, C. (1998a). A smart itsy bitsy spider for the
Web. Journal of the American Society for Information Science, 49(7), 604-618.
Chen, H., Fan, H., Chau, M., & Zeng, D. (2001). MetaSpider: meta-searching and
categorization on the Web. Journal of the American Society for Information Science
and Technology, 52(13), 1134-1147.
Chen, H., Houston, A., Sewell, R., & Schatz, B. (1998b). Internet browsing and
searching: user evaluation of category map and concept space techniques. Journal of
the American Society for Information Science, Special Issue on AI Techniques for
Emerging Information Systems Applications, 49(7), 582-603.
Chen, H., & Lynch, K.J. (1992). Automatic construction of networks of concepts
characterizing document databases. IEEE Transactions on Systems, Man, and
Cybernetics, 22(5), 885-902.
Chen, H., & Ng, T. (1995). An algorithmic approach to concept exploration in a large
knowledge network (automatic thesaurus consultation); symbolic branch-and-bound
search vs. connectionist Hopfield net activation. Journal of the American Society for
Information Science, 46(5), 348-369.
Chen, H., Ng, T.D., Martinez, J., & Schatz, B.R. (1997). A concept space approach to
addressing the vocabulary problem in scientific information retrieval; an experiment
on the Worm Community System. Journal of the American Society for Information
Science, 48(1), 17-31.
Chen, H., Schuffels, C., & Orwig, R. (1996). Internet categorization and search: a self-organizing approach. Journal of Visual Communication and Image Representation,
7(1), 88-102.
Chen, H., Shankaranarayanan, G., & She, L. (1998c). A machine learning approach to
inductive query by examples: an experiment using relevant feedback, IDS, genetic
algorithms, and simulated annealing. Journal of the American Society for Information
Science, 49(8), 693-705.
Chen, H.M., & Cooper, M.D. (2001). Using clustering techniques to detect usage patterns
in a Web-based information system. Journal of the American Society for Information
Science and Technology, 52(11), 888-904.
Choo, C.W. (1998). The Knowing Organization. Oxford: Oxford University Press.
Choo, C.W., Detlor, B., & Tumbull, D. (2000). Information seeking on the web: an
integrated model of browsing and searching. First Monday, 5(2).
Church, K., & Hanks, P. (1989). Word association norms, mutual information, and
lexicography. In: Proceedings of the 27th Annual Meeting of Association for
Computational Linguistics (pp. 76-83), Vancouver, BC, Canada.
Clarkson, M.B.E. (1995). A Stakeholder Framework for Analyzing and Evaluating
Corporate Social Performance. Academy of Management Review, 20(1), 92-117.
CNNIC (2002). Analysis Report on the Growth of the Internet in China, China Internet
Network Information Center. [Online]. Available at
http://www.cnnic.net.cn/develst/2002-7e/6.shtml.
Confucius (500 B.C.). The Great Learning. China.
Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines
and Other Kernel-based Learning Methods. Cambridge, U.K.: Cambridge University
Press.
Cronin, B. (2000). Strategic intelligence and networked business. Journal of Information
Science, 26 133-138.
Cutting, D.R., Karger, D.R., Pederson, J.O., & Tukey, J.W. (1992). Scatter/gather: a
cluster-based approach to browsing large document collections. In: Proceedings of
the Fifteenth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (pp. 318-329), New York: ACM Press.
Dancy, J. (1985). Introduction to Contemporary Epistemology. New York, NY: Basil
Blackwell.
Davenport, T.H., & Prusak, L. (1998). Working Knowledge: How organizations manage
what they know. Boston, Massachusetts: Harvard Business School Press.
Davies, P.H.J. "Intelligence, information technology, and information warfare," in:
Annual Review of Information Science and Technology, M.E. Williams (ed.).
Information Today, Inc., Medford,NJ, 2002, pp. 313-352.
Doom, T., Raymer, M., & Krane, D. (2004). Bioinformatics. IEEE Potentials, 23(1), 24-27.
Doyle, L.B. (1961). Semantic road maps for literature searcher. Journal of the
Association of Computing Machinery, 8(4), 553-578.
Doyle, L.B. (1962). Indexing and Abstracting by Association - Part 1. Santa Monica, CA:
System Development Corporation.
Drucker, P. (1993). Post-Capitalist Society (1st edition). New York, NY: Harper
Business.
Drucker, P. "The post-capitalist executive," in: Managing in a Time of Great Change,
Penguin, New York, NY, 1995.
Drucker, P. (1999). Management Challenges for the 21st century. Oxford, England:
Butterworth-Heinemann.
Drucker, P. (2002). Managing in the Next Society (1st edition). New York, NY: St.
Martin's Press.
Elias, A.A., & Cavana, R.Y. (2000). Stakeholder Analysis for Systems Thinking and
Modeling. In: Proceedings of the 35th Annual Conference of the Operational
Research Society of New Zealand, Wellington, New Zealand.
Ellis, D. (1989). A behavioral approach to information retrieval system design. Journal of
Documentation, 45(3), 171-212.
Eom, S.B., & Farris, R.S. (1996). The contributions of organizational science to the
development of decision support systems research subspecialties. Journal of the
American Society for Information Science, 47(12), 941-952.
Erickson, T., Smith, D.N., Kellogg, W.A., Laff, M.R., Richards, J.T., & Bradner, E.
(1999). Socially translucent systems: social proxies, persistent conversation, and the
design of "babble'. In: Proceedings of the ACM Conference on Computer-Human
Interactions (pp. 72-79): ACM Press.
Ericsson, K.A., & Simon, H.A. (1993). Protocol analysis: verbal reports as data.
Cambridge, MA: MIT Press.
Etzioni, O. (1996). The World-Wide Web: Quagmire or Gold Mine? Communications of
the ACM, 39(11), 65-68.
Fairthorne, R.A. (1961). Towards Information Retrieval. London: Butterworths.
Fayyad, U.M., Piatetcky-Shapiro, G., & Smyth, P. (1996). The KDD process for
extracting useful knowledge from volumes of data. Communications of the ACM,
39(11), 27-34.
Fayyad, U.M., & Uthurusamy, R. (2002). Evolving data mining into solutions for
insights. Communications of the ACM, 45(8), 28-31.
Ferreira, J., & Blonkvist, B. (2002). Directions in Collaborative Commerce; Managing
the Extended Enterprise, Deloitte Research, New York, NY.
Finn, K., Sellen, A., & Wilbur, S. (1997). Video Mediated Communication. Hillsdale, NJ:
Erlbaum.
Firmin, T., & Chrzanowski, M.J. (1999). An evaluation of automatic text summarization
systems. Cambridge: The MIT Press, pp. 325-336.
Flake, G.W., Lawrence, S., & Giles, C.L. (2000). Efficient identification of Web
communities. In: Proceedings of the Sixth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (pp. 150-160), Boston, MA, USA: ACM
Press.
Flake, G.W., Lawrence, S., Giles, C.L., & Coetzee, F.M. (2002). Self-organization and
identification of Web communities. IEEE Computer, 25(3), 66-71.
Fodor, J. (1983). Modularity of Mind. MIT Press.
Frawley, W.J., Piatetsky-Shapiro, G., & Matheus, C.J. (1992). Knowledge discovery in
databases: An overview. AI Magazine, 13 57-70.
Freeman, E. (1984). Strategic Management: A Stakeholder Approach. Marshfield, MA:
Pitman.
Fuhr, N., & Pfeifer, U. (1994). Probablistic information retrieval as combination of
abstraction inductive learning and probablistic assumptions. ACM Transactions on
Information Systems, 12(1), 92-115.
Fuld, L.M., Sawka, K., Carmichael, J., Kim, J., & Hynes, K. (2002). Intelligence
Software Report™ 2002. Cambridge, MA, USA: Fuld & Company Inc.
Fuld, L.M., Singh, A., Rothwell, K., & Kim, J. (2003). Intelligence Software Report™
2003: Leveraging the Web. Cambridge, MA, USA: Fuld & Company Inc.
Furnas, G.W., Landauer, T.K., Gomez, L.M., & Dumais, S.T. (1987). The vocabulary
problem in human-system communication: an analysis and a solution.
Communications of the ACM, 30(11), 964-972.
Furnas, G.W., & Zacks, J. (1994). Multitrees: enriching and reusing hierarchical
structure. In: Proceedings of the ACM CHI'94 Conference on Human Factors in
Computing Systems (pp. 330-336), Boston, MA, USA: ACM Press.
Furnkranz, J. (1999). Exploiting Structural Information for Text Classification on the
WWW. In: Proceedings of the Third Symposium on Intelligent Data Analysis (pp.
487-497), Amsterdam, Netherlands: Springer-Verlag.
Futures-Group "Ostriches & Eagles 1997," in: The Futures Group Articles, 1998.
Gardner, H. (1985). The Mind's New Science: A History of the Cognitive Revolution
Basic Books.
Garey, M., & Johnson, D. (1979). Computers and Intractability: A Guide to the Theory of
NP-Completeness W. H. Freeman.
Garfield, E. (1972). Citation analysis as a tool in journal evaluation. Science, 178 471-479.
Gazeau, M. (1998). Le Management de la Connaissance. Etats de Veille, (Juin), 1-8.
Gentner, D., & Stevens, A. (1983). Mental Models Lawrence Erlbaum Associates.
Gibson, D., Kleinberg, J., & Raghavan, P. (1998). Inferring Web communities from link
topology. In: Proceedings of the Ninth ACM Conference on Hypertext and
Hypermedia: Links, Objects, Time and Space - Structure in Hypermedia Systems,
Pittsburgh, PA.
Global-Reach (2002). Global Internet Statistics. [Online]. Available at
http://www.glreach.com/globstats/.
Gloor, P.A. (1991). CYBERMAP: Yet another way of navigating in hyperspace. In:
Proceedings of the Third Annual ACM Conference on Hypertext (pp. 107-121), San
Antonio, Texas, USA: ACM Press.
Glover, E.J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D.M., & Flake, G.W. (2002).
Using Web structure for classifying and describing Web pages. In: Proceedings of the
11th International World Wide Web Conference, Honolulu, Hawaii, USA.
Glover, F. (1977). Heuristics for integer programming using surrogate constraints.
Decision Sciences, 8 156-166.
Glover, F. (1986). Future paths for integer programming and links to artificial
intelhgence. Computers and Operations Research, 13 533-549.
Google (2002). Language tools. [Online]. Available at
http://www.google.com/language_tools?hl=en.
Gordon, A.D., & Henderson, J.T. (1977). Algorithm for Euclidean sum of squares
classification. Biometrics, 33 355-362.
Gordon, M. (1988). Probabilistic and genetic algorithms for document retrieval.
Communications of the ACM, 31(10), 1208-1218.
Grabmeier, J., & Rudolph, A. (2002). Techniques of cluster algorithms in data mining.
Data Mining and Knowledge Discovery, 6 303-360.
Graham, L., & Metaxas, P.T. (2003). "Of course it's true; I saw it on the Internet!":
Critical thinking in the Internet era. Communications of the ACM, 46(5), 70-75.
Granovetter, M. (1973). The strength of weak ties. American Journal of Sociology, 78(6),
1360-1380.
Greene, S., Marchionini, G., Plaisant, C., & Shneiderman, B. (2000). Previews and
overviews in digital libraries: designing surrogates to support visual information
seeking. Journal of the American Society for Information Science, 51(4), 380-393.
Grudin, J. (1994). Groupware and social dynamics: eight challenges for developers.
Communications of the ACM, 37(1), 92-105.
Hacki, R., & Lighton, J. (2001). The future of the networked company. The McKinsey
Quarterly, 3 26-39.
Hollis, M. (1985). Invitation to Philosophy. Oxford: Basil Blackwell.
Hansen, M.T., Nohria, N., & Tierney, T. (1999). What's your strategy for managing
knowledge? Harvard Business Review, 77(2), 106-116.
Hart, R.L. (2004). Memorial Service for Helen Thornton (February 28, 2004). Tucson,
AZ: First Southern Baptist Church.
Hartigan, J.A. (1985). Statistical theory in clustering. Journal of Classification, 2 63-76.
He, X., Ding, C., Zha, H., & Simon, H. (2001). Automatic topic identification using
Webpage clustering. In: Proceedings of2001 IEEE International Conference on Data
Mining (pp. 195-202), Los Alamitos, CA.
He, Y., & Hui, S.C. (2002). Mining a Web citation database for author co-citation
analysis. Information Processing and Management, 38(4), 491-508.
Hearst, M.A. (1994). Multi-paragraph segmentation of expository text. In: Proceedings of
the 32th Annual Meeting of the Association for Computational Linguistics (pp. 9-16),
Las Cruces, New Mexico.
Hearst, M.A. (1999). Untangling text data mining. In: Proceedings of the 37th Annual
Meeting of the Association for Computational Linguistics, College Park, MD: The
Association for Computational Linguistics.
Henzinger, M.R., & Lawrence, S. (2004). Extracting knowledge from the World Wide
Web. Proceedings of the National Academy of Sciences of the United States of
America.
Hewett, T.T., Baecker, R., Card, S., Carey, T., Gasen, J., Mantei, M., Perlman, G.,
Strong, G., & Verplank, W. (1996). ACM SIGCHI Curricula for Human-Computer
Interaction. New York, NY, USA: The Association for Computing Machinery, Inc.
Holland, J.H. (1975). Adaption in Natural and Artificial Systems. Ann Arbor, MI: The
University of Michigan Press.
Holsapple, C.W., & Joshi, K.D. (2001). Organizational knowledge resources. Decision
Support Systems, 31(1), 39-54.
Hopfield, J.J. (1982). Neural networks as physical systems with emergent collective
computational abilities. In: Proceedings of the National Academy of Sciences of USA
(pp. 2554-2558).
Hornby, A.S., & Cowie, A.B. (1987). Oxford Advanced Learner's English-Chinese
Dictionary. Hong Kong: Oxford University Press.
Hospers, J. (1967). An Introduction to Philosophical Analysis (2nd edition). London:
Routledge & Kegan Paul.
Huang, K., Lee, Y.W., & Wang, R.Y. (1999). Quality Information and Knowledge.
Upper Saddle River, NJ, USA: Prentice Hall.
Huber, G. (1991). Organizational learning: The contributing processes and the literatures.
Organization Science, 2(1), 88-115.
Hull, D.A. (1994). Improving text retrieval for the routing problem using latent semantic
indexing. In: Proceedings of the 17th ACM International Conference on Research
and Development in Information Retrieval (pp. 282-289), Dublin, Ireland: ACM
Press.
Hurst, M. (2001). Layout and language: challenges for table understanding on the Web.
In: Proceedings of the 1st International Workshop on Web Document Analysis (pp.
27-30), Seattle, WA, USA.
Ingwersen, P. (1992). Information retrieval interaction. London: Taylor Graham.
Ingwersen, P. (1998). The Calculation of Web Impact Factors. Journal of
Documentation, 54(2), 236-243.
Jain, A.K., & Dubes, R.C. (1988). Algorithms for Clustering Data. Englewood Cliffs, NJ,
USA: Prentice-Hall.
Joachims, T. (1998). Text categorization with support vector machines: Learning with
many relevant features. In: Proceedings of the Tenth European Conference on
Machine Learning
137-142), Chemnitz, Germany: Springer Verlag.
Joyce, T., & Needham, R.M. (1958). The thesaurus approach to information retrieval.
American Documentation, 9 192-197.
Jurafsky, D., & Martin, J.H. "Chapter 10. Parsing with Context-free Grammars," in:
Speech and Language Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition, Prentice Hall, Upper Saddle
River, NJ, 2000.
Kahn, B.K., Strong, D.M., & Wang, R.Y. (2002). Information quality benchmarks:
product and service performance. Communications of the ACM, 45(4), 184-192.
Kanai, H., & Hakozaki, K. (2000). A browsing system for a database using visualization
of user preferences. In: Proceedings of the 2000 IEEE International Conference on
Computer Visualization and Graphics, Los Alamitos, CA, USA: IEEE Computer
Society.
Kealy, W.A. (2001). Knowledge maps and their use in computer-based collaborative
learning. Journal of Educational Computing Research, 25(4), 325-349.
King, W.R., Marks, P.V., & McCoy, S. (2002). The most important issues in knowledge
management. Communications of the ACM, 45(9), 93-97.
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the
Association of Computing Machinery, 46(5), 604-632.
Kleinberg, J., & Lawrence, S. (2001). The Structure of the Web. Science, 294 1849-1850.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Kosala, R., & Blockeel, H. (2000). Web mining research: a survey. ACM SIGKDD
Explorations, 2(1), 1-15.
Kownslar, S. (2002). Collaborative Commerce. ACM Ubiquity, 3(32).
Kruskal, J.B. (1964). Nonmetric multidimensional scaling: a numerical method.
Psychometrika, 29(2), 115-129.
Kuhlthau, C. (1993). A principle of uncertainty for information seeking. Journal of
Documentation, 49(4), 339-355.
Kuhlthau, C. (1998). Longditudinal case studies of the information search process of
users in libraries. Library and Information Science Research, 10(3), 257-304.
Kuhlthau, C., Spink, A., & Cool, C. (1992). Exploration into stages in the information
search process in on-line IR: Communication between users and intermediaries. In:
Proceedings of the Annual Meeting of the American Society for Information Science,
(pp. 67-71).
Kuhlthau, C.C. (1991). Inside the search process: Information seeking from the user's
perspective. Journal of the American Society for Information Science, 42(5), 361-371.
Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling emerging
cyber-communities automatically. In: Proceedings of the 8th WWW Conference (pp.
403-415), Amsterdam: Elsevier Science.
Kwok, K. (1997). Comparing Representations in Chinese Information Retrieval. In:
Proceedings of ACM SIGIR (pp. 34-41), Philadelphia, PA, USA.
Kwon, O.-W., & Lee, J.-H. (2003). Text categorization based on k-nearest neighbor
approach for Web site classification. Information Processing & Management, 39(1),
25-44.
Laird, J.E., Newell, A., & Rosenbloom, P.S. (1987). Soar: An architecture for general
intelligence. Artificial Intelligence, 33(1), 1-64.
Lance, G.N., & Williams, W.T. (1967). A general theory of classificatory sorting
strategies: II. Clustering systems. Computer Journal, 10 271-277.
Lawrence, S., & Giles, C.L. (1999). Accessibility of information on the Web. Nature, 400
107-109.
Lee, P.Y., Hui, S.C., Cheuk, A., & Fong, M. (2002). Neural Networks for Web Content
Filtering. IEEE Intelligent Systems, 17(5), 48-57.
Lempel, R., and Moran, S. (2001). SALSA: The Stochastic Approach for Link-Structure
Analysis. ACM Transactions on Information Systems, 19(2), 131-160.
Li, E.Y., & Du, T.C. (2003). Emerging Issues in Collaborative Commerce; Call for
papers. Decision Support Systems, 35(2), 257-258.
Lin, X. (1997). Map displays for information retrieval. Journal of the American Society
for Information Science, 48(1), 40-54.
Lin, X., Soergel, D., & Marchionini, G. (1991). A self-organizing semantic map for
information retrieval. In: Proceedings of the Fourteenth Annual International
ACM/SIGIR Conference on Research and Development in Information Retrieval (pp.
262-269), Chicago, IL: ACM Press.
Lindsay, P.H., & Norman, D.A. (1977). Human Information Processing: An Introduction
to Psychology (2nd edition) International Thomson Publishing.
Lippman, R.P. (1987). Introduction to Computing with Neural Networks. IEEE ASSP
Magazine, 4(2), 4-22.
Loiacono, E. (2002). WebQual™: A Web Site Quality Instrument. In: Proceedings of
International Conference on Information Systems (ICIS) Doctoral Consortium,
Charlotte, NC, USA.
Luhn, H.P. "The automatic derivation of information retrieval encodements from
machine-readable texts," in: Information Retrieval and Machine Translation, A. Kent
(ed.), Interscience Publication, New York, 1961, pp. 1021-1028.
Luhn, H.P. "A business intelligence system," in: Pioneer of information science, selected
works, Macmillan, London, UK, 1969, pp. 132-139.
Lyman, P., & Varian, H. (2000). How much information. University of California,
Berkeley. [Online]. Available at http://www.sims.berkeley.edu/how-much-info.
Maglitta, J. (1995). Smarten Up! Computerworld, 29(23), 84-86.
Maini, H., Mehrotra, K., Mohan, C., & Ranka, S. (1994). Genetic algorithms for graph
partitioning and incremental graph partitioning (CRPC-TR94504), Center for
Research on Parallel Computation, Rice University.
Mani, I., & Maybury, M.T. (1999). Advances in Automatic Text Summarization.
Cambridge, MA: MIT Press.
Marchionini, G. (1987). An invitation to browse: designing full text systems for novice
users. Canadian Journal of Information Science, 12(3), 69-79.
Marchionini, G. (1995). Information seeking in electronic environments. New York:
Cambridge University Press.
Marchionini, G. (2002). Co-evolution of user and organizational interfaces: a longitudinal
case study of WWW dissemination of national statistics. Journal of the American
Society for Information Science and Technology, 53(14), 1192-1209.
Marchionini, G., & Shneiderman, B. (1988). Finding facts vs. browsing knowledge in
hypertext systems. IEEE Computer, 21(1), 70-80.
Marcus, M. (1999). Treebank tokenization. University of Pennsylvania. [Online].
Available at http://www.cis.upenn.edu/~treebank/tokenization.html.
Maron, M.E. (1961). Automatic indexing: an experimental inquiry. Journal of the
Association of Computing Machinery, 8(3), 404-417.
Maron, M.E., & Kuhns, J.L. (1960). On relevance, probabilistic indexing and information
retrieval. Journal of the Association of Computing Machinery, 7(3), 216-244.
Marshall, B., McDonald, D., Chen, H., & Chung, W. (2004). EBizPort: Collecting and
Analyzing Business Intelligence Information. Journal of the American Society for
Information Science and Technology, (Accepted for publication, forthcoming).
McDonald, D., & Chen, H. (2002). Using sentence selection heuristics to rank text
segments in TXTRACTOR. In: Proceedings of the second ACM/IEEE-CS Joint
Conference on Digital Libraries (pp. 28-35), Portland, OR, USA: ACM/IEEE-CS.
McKellar, H. (2003). KMWorld's 100 Companies that Matter in Knowledge Management
2003, KM World. [Online]. Available at http://www.kmworld.com/100.cfm.
McQuaid, M.J., Ong, T.H., Chen, H., & Nunamaker, J.F. (1999). Multidimensional
scaling for group memory visualization. Decision Support Systems, 27(1-2), 163-176.
Menczer, F. (2004). Evolution of document networks. Proceedings of the National
Academy of Sciences of the United States of America.
Mendelzon, A.O., & Rafiei, D. (2000). What do the neighbours think? Computing Web
page reputations. IEEE Data Engineering Bulletin, 23(3), 9-16.
Milligan, G.W. (1981). A Monte-Carlo study of 30 internal criterion measures for cluster analysis. Psychometrika, 46 187-195.
Minsky, M. (1982). Why people think computers can't? AI Magazine, 3(4), 3-15.
Minsky, M. (1986). Society of Mind. Simon and Schuster.
Mish, F., Withgott, J., & Morse, J. (2003). Merriam-Webster's Collegiate Dictionary.
Springfield, MA: Merriam-Webster, Inc.
Mitchell, R.K., Agle, B.R., & Wood, D.J. (1997). Toward a Theory of Stakeholder
Identification and Salience: Defining the Principle of Who and What Really Counts.
Academy of Management Review, 22(4), 853-886.
Mitchell, T.M. (1997). Machine Learning. New York: McGraw-Hill.
Mladenic, D. (1998). Turning Yahoo into an Automatic Web Page Classifier. In:
Proceedings of the 13th European Conference on Artificial Intelligence (pp. 473-474),
Brighton, UK.
Mobasher, B., Dai, H., Luo, T., Nakagawa, M., Sun, Y., & Wiltshire, J. (2000).
Discovery of aggregate usage profiles for Web personalization. In: Proceedings of the
Workshop on Web Mining for E-Commerce - Challenges and Opportunities, Boston,
MA.
Morgan-Stanley (2003). Digital World versus New World.
Morrow, N.M. "Knowledge management: An introduction," in: Annual Review of
Information Science and Technology, M.E. Williams (ed.). Information Today,
Medford, NJ, 2001, pp. 381-422.
Moser, P.K., & Nat, A.V. (1985). Human Knowledge. Oxford: Oxford University Press.
Mowshowitz, A., & Kawaguchi, A. (2002). Bias on the Web. Communications of the
ACM, 45(9), 56-60.
MSN (2002). MSN Search Worldwide Sites, Microsoft Corporation. [Online]. Available
at http://search.msn.com/worldwide.asp.
Myers, J., & Well, A. (1995). Research Design and Statistical Analysis. Hillsdale, NJ,
USA: Lawrence Erlbaum Associates, Publishers.
Nasukawa, T., & Nagano, T. (2001). Text analysis and knowledge mining system. IBM
Systems Journal, 40(4), 967-984.
Negroponte, N. (2003). Keynote Speech: Geo-Digital - The Impact of
Telecommunications on Nations. In: The National Conference on Digital Government
Research, Boston, Massachusetts.
Neisser, U. (1967). Cognitive Psychology. New York, NY: Appleton-Century-Crofts.
Nelson, T.H. (1965). Complex information processing: a file structure for the complex,
the changing and the indeterminate. In: Proceedings of the Twentieth National
Conference (pp. 84-100), New York: Association for Computing Machinery.
Newell, A., & Card, S.K. (1985). The prospects for psychological science in human-computer interaction. Human-Computer Interaction, 1 209-242.
Nick, Z., & Themis, P. (2001). Web search using a genetic algorithm. IEEE Internet
Computing, 16(2), 18-26.
Nielsen, J. (1990). The art of navigating through hypertext. Communications of the ACM,
33(3), 296-310.
Nielsen, J., & Lyngbaek, U. (1989). Two field studies of hypermedia usability. In:
Proceedings of the Hypertext II Conference, York, U.K.
Nolan, J. (1999). Confidential: Uncover Your Competitor's Secrets Legally and Quickly
and Protect Your Own. New York: Harper Business.
Nonaka, I. (1994). A Dynamic Theory of Organizational Knowledge Creation.
Organization Science, 5(1), 14-37.
Nonaka, I., & Takeuchi, H. (1995). The Knowledge-Creating Company: How Japanese
Companies Create the Dynamics of Innovation. New York, NY: Oxford University
Press.
Norman, D.A. "Cognitive engineering," in: User Centered System Design, D.A. Norman
and S.W. Draper (eds.), Erlbaum, Hillsdale, NJ, 1986, pp. 31-61.
Norman, D.A. "Cognitive artifacts," in: Designing Interaction: Psychology at the
Human-Computer Interface, J. Carroll (ed.), Cambridge University Press, 1991.
Norman, D.A. "Chapter 9: Human information processing," in: Readings in HumanComputer Interaction: Toward the Year 2000 (2nd edition), R.M. Baecker, W.
Buxton, S. Greenberg and J. Grudin (eds.), Morgan Kaufmann, San Francisco, CA,
1995a, pp. 573-586.
Norman, D.A. "Introduction to human-computer interaction," in: Readings in HumanComputer Interaction: Toward the Year 2000 (2nd edition), R.M. Baecker, W.
Buxton, S. Greenberg and J. Grudin (eds.), Morgan Kaufmann, San Francisco, CA,
1995b, pp.1-3.
Novak, J.D., & Gowin, D.B. (1984). Learning How to Learn. New York: Cambridge
University Press.
Nunamaker, J.F., Chen, M., & Purdin, T. (1991a). Systems development in information
systems research. Journal of Management Information Systems, 7(3), 89-106.
Nunamaker, J.F., Dennis, A.R., Valacich, J.S., Vogel, D.R., & George, J.F. (1991b).
Electronic meeting systems to support group work. Communications of the ACM,
34(7), 40-61.
Nunamaker, J.F., Romano, N.C., & Briggs, R.O. (2001). A framework for collaboration
and knowledge management. In: Proceedings of the 34th Annual Hawaii
International Conference on System Sciences (pp. 461-472), Hawaii.
O'Leary, D. (1998). Enterprise knowledge management. IEEE Computer, 31(3), 54-61.
Olson, G., & Olson, J. (2003). Human-computer interaction: psychological aspects of the
human use of computing. Annual Review of Psychology, 54 491-516.
Olson, J., & Olson, G. (1990). The growth of cognitive modeling in human-computer
interaction since GOMS. Human-Computer Interaction, 5 221-265.
O'Neill, E.T., Lavoie, B.F., & Bennett, R. (2003). Trends in the Evolution of the Public
Web 1998 - 2002. Digital Library Magazine, 9(4).
Ong, H.-L., Tan, A.-H., Ng, J., Pan, H., & Li, Q.-X. (2001). FOCI: Flexible Organizer for
Competitive Intelligence. In: Proceedings of the Tenth International Conference on
Information and Knowledge Management (pp. 523-525), Atlanta, Georgia, USA.
Ong, T.-H., & Chen, H. (1999). Updateable PAT-array approach for Chinese key phrase
extraction using mutual information: a linguistic foundation for knowledge
management. In: Proceedings of the Second Asian Digital Library Conference (pp.
63-84), Taipei, Taiwan.
Palmer, C., Pesenti, J., Valdes-Perez, R., Christel, M., Hauptmann, A., Ng, D., &
Wactlar, H. (2001). Demonstration of hierarchical document clustering of digital
library retrieval results. In: Proceedings of the 1st ACM/IEEE Joint Conference on
Digital Libraries, Roanoke, VA, USA.
Pant, G., & Menczer, F. (2002). MySpiders: evolve your own intelligent Web crawlers.
Autonomous Agents and Multi-Agent Systems, 5(2), 221-229.
Pazzani, M. (1999). A framework for collaborative, content-based and demographic
filtering. Artificial Intelligence Review, 13(5), 393-408.
Pazzani, M., & Billsus, D. (1997). Learning and revising user profiles: the identification
of interesting web sites. Machine Learning, 27 313-331.
Pemberton, M.J. (1998). Knowledge management and the epistemic tradition. Record
Management Quarterly, 32(3), 58-62.
Penrose, E.T. (1959). The Theory of the Growth of the Firm. New York, NY: Wiley.
Perlman, G. (2002). Web-Based User Interface Evaluation with Questionnaires. [Online].
Available at http://www.acm.org/~perlman/question.html.
Pipino, L.L., Lee, Y.W., & Wang, R.Y. (2002). Data quality assessment.
Communications of the ACM, 45(4), 211-218.
Polanyi, M. (1962). Personal Knowledge: Toward a Post-Critical Philosophy. New York:
Harper Torchbooks.
Polanyi, M. (1967). The Tacit Dimension. London: Routledge and Kegan Paul.
Popp, R., Armour, T., Senator, T., & Numrych, K. (2004). Countering terrorism through
information technology. Communications of the ACM, 47(3), 36-43.
Preece, S.E. (1981). A Spreading Activation Network Model for Information Retrieval
(Ph.D. Thesis), Department of Computer Science, University of Illinois at Urbana-Champaign.
Quinlan, J.R. "Learning efficient classification procedures and their application to chess
end games," in: Machine Learning: An Artificial Intelligence Approach, R.S.
Michalski, J.G. Carbonell and T.M. Mitchell (eds.), Tioga, Palo Alto, CA, 1983, pp.
463-482.
Quinlan, J.R. (1993). C4.5: Programs for machine learning. Los Altos, CA: Morgan
Kaufmann.
Redman, T.C. (1996). Quality for the Information Age. Boston, MA, USA: Artech House.
Reid, E.O.F. (2003). Identifying a Company's Non-Customer Online Communities: a
Proto-typology. In: Proceedings of the 36th Hawaii International Conference on
System Sciences (HICSS-36), Island of Hawaii, HI, USA: IEEE Computer Society.
Reiterer, H., Mußler, G., Mann, T.M., & Handschuh, S. (2000). INSYDER - an
information assistant for business intelligence. In: Proceedings of the 23rd Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval (pp. 112-119), Athens, Greece: ACM.
Rennison, E. (1994). Galaxy of news: an approach to visualizing and understanding
expansive news landscapes. In: Proceedings of ACM Symposium on User Interface
Software and Technology (pp. 3-12).
Richardson, M.W. (1938). Multidimensional psychophysics (Abstract). Psychological
Bulletin, 35 659.
Robertson, S.E. (1977). Theories and models in information retrieval. Journal of
Documentation, 33(2), 126-148.
Robertson, S.E., & Sparck-Jones, K. (1976). Relevance weighting of search terms.
Journal of the American Society for Information Science, 27(3), 129-146.
Rocchio, J.J. "Relevance feedback in information retrieval," in: The Smart Retrieval
System - Experiments in Automatic Document Processing, G. Salton (ed.), Prentice-Hall, Inc., Englewood Cliffs, NJ, 1971, pp. 313-323.
Roussinov, D., & Chen, H. (2001). Information navigation on the Web by clustering and
summarizing query results. Information Processing and Management, 37(6), 789-816.
Rumelhart, D.E., Hinton, G., & Williams, R. "Learning Internal Representations by Error
Propagation," in: Parallel Distributed Processing, D.E. Rumelhart and J. McClelland
(eds.), MIT Press, Cambridge, MA, 1986, pp. 318-363.
Rumelhart, D.E., Widrow, B., & Lehr, M.A. (1994). The basic ideas in neural networks.
Communications of the ACM, 37(3), 87-92.
Russell, S., & Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper
Saddle River, NJ: Prentice Hall.
Sairamesh, J., Lee, A., & Anania, L. (2004). Information cities. Communications of the
ACM, 47(2), 28-31.
Salton, G. (1971). The SMART Retrieval System - Experiments in Automatic Document
Processing. Englewood Cliffs, NJ: Prentice Hall Inc.
Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and
Retrieval of Information by Computer. Reading, MA: Addison-Wesley.
Salton, G., Fox, E.A., & Wu, H. (1983). Extended Boolean information retrieval.
Communications of the ACM, 26(12), 1022-1036.
Salton, G., & McGill, M.J. (1983). An Introduction to Modern Information Retrieval.
NY: McGraw-Hill.
Salton, G., Singhal, A., Mitra, M., & Buckley, C. (1997). Automatic text structuring and
summarization. Information Processing and Management, 33(2), 193-207.
Salton, G., Wong, A., & Yang, C.S. (1975a). A vector space model for automatic
indexing. Communications of the ACM, 18 613-620.
Salton, G., Yang, C.S., & Yu, C.T. (1975b). A theory of term importance in automatic
text analysis. Journal of the American Society for Information Science, 26(1), 33-44.
Saracevic, T. (1996). Modeling interaction in IR. Review and proposal. In: Proceedings
of the Annual Meeting of the American Society for Information Science (pp. 3-9).
Saracevic, T., Kantor, P., Chamis, A.Y., & Trivison, D. (1988). A study of information
seeking and retrieving. I. Background and methodology. II. Users, questions and
effectiveness. III. Searchers, searches and overlap. Journal of the American Society for
Information Science, 39(3), 161-216.
Scandar, J. (2003). Ready for Collaborative Commerce? in: Line56: The E-Business
Executive Daily. [Online]. Available at
http://www.line56.com/articles/default.asp?articleid=5024.
Schatz, B. (2002). The Interspace: concept navigation across distributed communities.
IEEE Computer, 35(1), 54-62.
Scott, J. (2000). Social Network Analysis: A Handbook. Sage Publications.
SearchEngineWatch (2001). Community-based search engines, SearchEngineWatch.com.
[Online]. Available at http://searchenginewatch.com/links/community.html.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM
Computing Surveys, 34(1), 1-47.
Selberg, E., & Etzioni, O. (1995). Multi-service search and comparison using the
MetaCrawler. In: Proceedings of the 4th World Wide Web Conference, Boston, MA,
USA.
Selberg, E., & Etzioni, O. (1997). The MetaCrawler architecture for resource aggregation
on the Web. IEEE Expert, 12(1), 8-14.
Shaw, W.M.J., Burgin, R., & Howell, P. (1997). Performance standards and evaluations
in information retrieval test collections: cluster-based retrieval models. Information
Processing and Management, 33(1), 1-14.
Sherman, C. (2002). AltaVista Introduces Prisma Results, SearchEngineWatch.com.
[Online]. Available at http://www.searchenginewatch.com/searchday/02/sd0702avprisma.html.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(8), 888-905.
Shneiderman, B. (1980). Software Psychology: Human Factors in Computer and
Information Systems. Cambridge, MA: Winthrop.
Shneiderman, B. (1996). The eyes have it: a task by data type taxonomy for information
visualizations. In: Proceedings of Visual Languages (pp. 336-343), Boulder, CO,
USA: IEEE Computer Society.
Simon, H. (1981). The Sciences of the Artificial. Cambridge, MA: MIT Press.
Simon, H. "Why should machine learn?," in: Machine Learning: An Artificial
Intelligence Approach, R.S. Michalski, J.G. Carbonell and T.M. Mitchell (eds.),
Tioga Publishing Company, Palo Alto, CA, 1983.
Sinclair, J.M., Fox, G., Bullon, S., & Manning, E. (1998). Collins Cobuild English
Dictionary. London, UK: HarperCollins Publishers.
Small, H. (1973). Co-citation in the scientific literature: a new measure of the relationship
between two documents. Journal of the American Society for Information Science, 24
265-269.
Smith, A. (1759). The Theory of Moral Sentiments (1976 Edition by E.G. West ed.).
Indianapolis: Liberty Fund Inc.
Solomon, K. "Proverbs 2:6," in: New International Version of the Holy Bible,
International Bible Society, East Brunswick, NJ, 940 B.C., p. 473.
Solso, R.L. (1988). Cognitive Psychology (5th ed.). Boston: Allyn and Bacon.
Sparck-Jones, K., & Willett, P. (eds.) Readings in Information Retrieval. Morgan
Kaufmann Publishers, Inc., San Francisco, CA, 1997.
Spence, R. (1999). A framework for navigation. International Journal of Human-Computer Studies, 51(5), 919-945.
Spence, R. (2001). Information Visualization. ACM Press.
Spender, J.C. (1996). Making Knowledge the Basis of a Dynamic Theory of the Firm.
Strategic Management Journal, 17 45-62.
Spink, A. (1992). Recognition of stages in the user's information-seeking during online
searching by novice searchers. Online Review, 16(5), 297-301.
Spink, A., Ozmutlu, S., Ozmutlu, H.C., & Jansen, B.J. (2002). U.S. versus European Web
Searching Trends. SIGIR Forum, 36(2).
Spink, A., & Saracevic, T. (1997). Interaction in IR: Selection and effectiveness of search
terms. Journal of the American Society for Information Science, 48(8), 741-761.
Srinivasan, P. (2004). Text mining: generating hypotheses from Medline. Journal of the
American Society for Information Science and Technology, 55(5), 396-413.
Srinivasan, P., Mitchell, J., Bodenreider, O., Pant, G., & Menczer, F. (2002). Crawling
agents for retrieving biomedical information. In: Proceedings of NETTAB 2002
Workshop on Agents in Bioinformatics, Bologna, Italy.
Stewart, T.A. (1998). Is this job really necessary? Fortune (January 12).
Sullivan, D. (2002). Nielsen//NetRatings: Search Engine Ratings. [Online]. Available at
http://searchenginewatch.com/reports/netratings.html.
Sutcliffe, A.G., & Ennis, M. (1998). Towards a cognitive theory of Information Retrieval.
Interacting with Computers (Special Edition on HCI & Information Retrieval), 10
321-351.
Sutcliffe, A.G., & Maiden, N.A.M. (1998). The Theory of Domain Knowledge for
Requirements Engineering. IEEE Transactions on Software Engineering, 24(3), 174-196.
Taillard, D., Gambardella, L., Gendreau, M., & Potvin, J. (2001). Adaptive memory
programming: a unified view of metaheuristics. European Journal of Operational
Research, 135.
Takane, Y., Young, F.W., & de Leeuw, J. (1977). Nonmetric individual differences
multidimensional scaling: An alternative least squares method with optimal scaling
features. Psychometrika, 42(1), 7-67.
Tan, B., Foo, S., & Hui, S.C. (2002). Web Information Monitoring for Competitive
Intelligence. Cybernetics & Systems, 33(3), 225-251.
Taylor, R.S. (1986). Value-added Processes in Information Systems. Norwood, NJ:
Ablex.
Tolle, K.M., & Chen, H. (2000). Comparing noun phrasing techniques for use with
medical digital library tools. Journal of the American Society for Information Science
(Special Issue on Digital Libraries), 51(4), 352-370.
Tombros, A., & Sanderson, M. (1998). Advantages of Query Biased Summaries in
Information Retrieval. In: Proceedings of the 21st Annual International ACM-SIGIR
Conference on Research and Development in Information Retrieval (pp. 2-10),
Melbourne, Australia: ACM Press.
Torgerson, W.S. (1952). Multidimensional scaling: I. Theory and Method. Psychometrika, 17(4), 401-419.
Trybula, W.J. "Text mining," in: Annual Review of Information Science and Technology,
M.E. Williams (ed.), Information Today, Inc., Medford, NJ, 1999, pp. 385-419.
Tuomi, I. (1999). Data is More Than Knowledge: Implications of the Reversed
Knowledge Hierarchy for Knowledge Management and Organizational Memory.
Journal of Management Information Systems, 16(3), 107-121.
Turtle, H.R., & Croft, W.B. (1992). Inference networks for document retrieval. In:
Proceedings of the 13th International Conference on Research and Development in
Information Retrieval (pp. 1-24), Brussels, Belgium: ACM Press.
van Laarhoven, P.J.M., & Aarts, E.H.L. (1988). Simulated Annealing: Theory and
Applications. Dordrecht: D. Reidel Publishing Company.
van Rijsbergen, C.J. (1979). Information Retrieval (2nd ed.). London:
Butterworths.
van Rijsbergen, C.J., & Sparck-Jones, K. (1973). A test for the separation of relevant and
non-relevant documents in experimental test collections. Journal of Documentation,
29(3), 251-257.
Vance, D.M. (1997). Information, knowledge and wisdom: The epistemic hierarchy and
computer-based information systems. In: Proceedings of the 1997 America's
Conference on Information Systems (AMCIS), Indianapolis, Indiana.
Vapnik, V.N. (1995). The Nature of Statistical Learning Theory. New York, NY:
Springer-Verlag.
Voorhees, E., & Harman, D. (1997). Overview of the Sixth Text Retrieval Conference
(TREC-6). In: NIST Special Publication 500-240: The Sixth Text Retrieval
Conference (TREC-6), Gaithersburg, MD, USA: National Institute of Standards and
Technology.
Wang, R.Y., Storey, V.C., & Firth, C.P. (1995). A framework for analysis of data quality
research. IEEE Transactions on Data and Knowledge Engineering, 7(4), 623-640.
Wang, R.Y., & Strong, D.M. (1996). Beyond accuracy: what data quality means to data
consumers. Journal of Management Information Systems, 12(4), 5-34.
Ward, J.H. (1963). Hierarchical grouping to optimize an objective function. Journal of
the American Statistical Association, 58 236-244.
Wasserman, S., & Faust, K. (1994). Social Network Analysis: methods and applications.
Cambridge University Press.
Wernerfelt, B. (1984). A resource-based view of the firm. Strategic Management
Journal, 5 171-180.
Westney, E., & Ghoshal, S. "Building a competitor intelligence organization: adding
value in an information function," in: Information Technology and the Corporation in
the 1990s: Research Studies, T.J. Allen and M.S. Scott (eds.), Oxford University
Press, New York, 1994, pp. 430-453.
Wiig, K.M. "Knowledge management: An emerging discipline rooted in a long history,"
in: Knowledge Horizons: The Present and the Promise of Knowledge Management,
C. Daniele and C. Despres (eds.), Butterworth-Heinemann, Boston, MA, 2000.
Wilson, T.D. (1999). Models of information behavior research. Journal of Documentation, 55(3), 249-270.
Winograd, T., & Flores, F. (1986). Understanding Computers and Cognition: A New
Foundation for Design. Norwood, NJ: Ablex Publishing Corporation.
Winston, P.H. (1984). Artificial Intelligence. Reading, MA: Addison-Wesley.
Wise, J.A., Thomas, J.J., Pennock, K., Lantrip, D., Pottier, M., Schur, A., & Crow, V.
(1995). Visualizing the non-visual: spatial analysis and interaction with information
from text documents. In: IEEE, Proceedings of Information Visualization (pp. 51-58).
Wu, Z., & Leahy, R. (1993). An optimal graph theoretic approach to data clustering:
theory and its application to image segmentation. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 15(11), 1101-1113.
Yang, C.C., Chen, H., & Hong, K. (2002). Internet browsing: Visualizing category map
by fisheye and fractal views. In: Proceedings of the IEEE International Conference
on Information Technology: Coding and Computing (pp. 34-39), Los Alamitos, CA,
USA.
Yang, C.C., Chen, H., & Hong, K. (2003). Visualization of large category map for
Internet browsing. Decision Support Systems, 35(1), 89-102.
Yang, Y., & Chute, C.G. (1994). An example-based mapping method for text
categorization and retrieval. ACM Transactions on Information Systems, 12(3), 252-277.
Young, F.W. (1987). Multidimensional Scaling: History, Theory, and Applications.
Hillsdale, NJ, USA: Lawrence Erlbaum Associates, Publishers.
Young, G., & Householder, A.S. (1938). Discussion of a set of points in terms of their
mutual distances. Psychometrika, 3(1), 19-22.
Zack, M. (1999). Managing codified knowledge. Sloan Management Review, 40(4), 45-48.
Zamir, O., & Etzioni, O. (1999). Grouper: A dynamic clustering interface to Web search
results. In: Proceedings of the 8th World Wide Web Conference, Toronto, Canada.