Using Roget’s Thesaurus to Determine the
Similarity of Texts
Jeremy Ellman
A thesis submitted in partial fulfilment of the
requirements of the University of Sunderland
for the degree of Doctor of Philosophy
June 2000
Using Roget’s Thesaurus to determine the similarity of texts
Jeremy Ellman
This thesis addresses the problem of extracting a representation of a text's meaning from its
content. The solution investigated is based on the use of Roget’s thesaurus as an external
knowledge source and can be used to analyse texts of any length or complexity. The
resulting document representation can then be compared to others, producing a new
method for text similarity assessment.
All coherent texts contain embedded sequences of words that are related in meaning.
These sequences can be detected by identifying simple relationships between the relevant
thesaural entries in which the words are found. The identification of initial sequences
drives the addition of further related words into conceptually related “lexical chains”.
Although they differ in content, it is shown that the distribution of the links in these
“lexical chains” is independent of the type of text in which they are embedded, and
therefore this technique is of general applicability.
Every coherent text contains many lexical chains of different lengths and strengths. These
may be used to represent the broad subject matter of a text. By identifying the key
concept of each chain, and relating this to its prominence, we may produce an attribute-value
vector of concepts and their strengths. This may then be used to identify other texts as
closer or further away in meaning.
This thesis describes the creation of a tool suitable for the detection of lexical chains in
large texts, and the design and implementation of algorithms to measure text similarity.
The performance of the algorithms has been compared with human judgements and
experimentally verified. The results show that lexical chain based similarity matching is
capable of producing a ranking between a source text and several examples equivalent to
that produced by human subjects. This illustrates the utility of Roget’s thesaurus as a
resource for the determination of lexical chains.
Acknowledgements

I would like to thank Bill Black of UMIST and Mark Stairmand for first interesting me in
lexical chains, and for initial discussions on this thesis.
My former employers, The MARI Group, were most generous in encouraging me to start
this research, and funding its first three years.
I am extremely grateful to the staff and students of the University of Sunderland who used
their class time to participate in my experiments.
I would like to acknowledge advice from Dr Sharon McDonald on experimental design,
and Dr Malcolm Farrow on statistical analysis, and critical insights from my second
supervisor Prof. Gilbert Cockton.
I am also most grateful to Addison Wesley Longman Limited for permission to use The
Original Roget's Thesaurus of English Words and Phrases Copyright © 1987 by
Longman Group UK Ltd. in portions of this work.
I would also like to thank the Karpeles library for permission to include an image of Dr.
Peter Mark Roget’s original work.
The work described in this thesis would not have been possible without software written
and supported by the open source community.
Finally, I owe a huge debt to my supervisor, Prof. John Tait.
Table of Contents
Acknowledgements ......................................................................................................iii
Chapter 1. Introduction.......................................................................................1
1.1 Introduction ............................................................................................................. 1
1.2 Research problem and research questions ............................................................... 2
1.3 Justification for the research.................................................................................... 3
1.4 Methodology............................................................................................................ 4
1.5 Thesis Overview...................................................................................................... 5
1.6 Definition of Terms ................................................................................................. 6
1.7 Delimitation of Scope and Key Assumptions ....................................................... 11
1.8 Summary................................................................................................................ 12
Chapter 2. Literature Review............................................................................13
2.1 Introduction ........................................................................................................... 13
2.2 Information Retrieval ............................................................................................ 13
2.3 Case Based Reasoning........................................................................................... 25
2.4 Natural Language Processing ................................................................................ 28
2.5 Conclusions ........................................................................................................... 42
Chapter 3. Hesperus: A System for Comparing the Similarity of Texts Using
Lexical Chains ..................................................................................................45
3.1. Introduction .......................................................................................................... 45
3.2 Hesperus: A System for comparing Text Similarity using Lexical Chains........... 47
3.3 A program to analyse lexical chains in a text using Roget's Thesaurus................ 48
3.4 An Example ........................................................................................................... 58
3.5 The Generic Document Profile.............................................................................. 62
3.6 Using the Generic Document Profile to Determine the Similarity of Texts ......... 64
3.7 Adherence to Zipf’s Law....................................................................................... 65
3.8 Visualisation of Results......................................................................................... 66
3.9 Conclusion............................................................................................................. 68
Chapter 4. The General Nature of Lexical Links.............................................70
4.1 Introduction ........................................................................................................... 70
4.2 Selection of the Experimental Texts...................................................................... 72
4.3 Reading Complexity of the Texts.......................................................................... 73
4.4 Determination of the Lexical Cohesive Relationships .......................................... 75
4.5 Analysis 1: Link Distribution between Documents............................................... 75
4.6 Analysis 2: Link Distributions Change across Different Document Types .......... 77
4.7 Related Work......................................................................................................... 79
4.8 Conformance to Zipf’s Law .................................................................................. 81
4.9 Conclusion............................................................................................................. 82
Chapter 5. Word Sense Disambiguation and Hesperus.................................84
5.1: Introduction .......................................................................................................... 84
5.2. The Problem of evaluating the effects of Word Sense Disambiguation in Hesperus
..................................................................................................................................... 86
5.3. The motivation for HESPERUS participating in Senseval as SUSS ................... 86
5.4. SUSS: The Sunderland University Senseval System ........................................... 87
5.5. Local Disambiguator .......................................................................................... 101
5.6. Conclusion.......................................................................................................... 106
Chapter 6. Evaluating Hesperus ....................................................................108
6.1 Introduction ......................................................................................................... 108
6.2 Hypotheses .......................................................................................................... 113
6.3 Text Similarity Experiments................................................................................ 115
6.4 Discussion and Conclusion.................................................................................. 139
Chapter 7. Conclusions and Further Work ...................................................143
7.1 Introduction ......................................................................................................... 143
7.2 Conclusions about the research hypotheses ........................................................ 145
7.3 Contributions ....................................................................................................... 146
7.4 Future Work......................................................................................................... 147
7.5 Summary.............................................................................................................. 153
Bibliography ....................................................................................................167
Appendix I. Experimental Examples on Rosetta Stone ..............................170
Appendix II. Experimental Texts....................................................................179
Appendix III. Lexical Chain Visibility Algorithm ............................................199
Appendix IV. Basic Statistics of the Experimental Data ............................200
Appendix V. Help information given to Experimental Subjects ..................202
Appendix VI. Roget’s Thesaurus – A brief Overview. ..................................205
Appendix VII. Papers published related to this thesis. ................................211
Index of Figures
Figure 1-1: Conceptual organisation of the chapters of the thesis ...................................... 6
Figure 2-1: A classification of text retrieval techniques. .................................................. 16
Figure 2-2: Case Based Reasoning Cycle ......................................................................... 26
Figure 3-2: Hesperus System Architecture ........................................................ 48
Figure 5-2 : SUSS System design ..................................................................................... 89
Figure 5-3: Distribution of the senses of “Shake”............................................................. 99
Figure 6-1: Source text Topic Screen.............................................................................. 120
Figure 6-2: Source and Example Text Comparison ........................................................ 121
Figure 7-1: Books on Chains in Hereford Cathedral Library.......................................... 143
Figure VII-1:Major Headings in Roget's Thesaurus ....................................................... 206
Figure VII-2 Sub divisions of “Abstract Relations”........................................................ 206
Figure VII-3 Sub divisions of “Space”............................................................................ 206
Figure VII-4 Sub divisions of “Matter” ......................................................................... 206
Figure VII-5 Sub divisions of “Emotion” ....................................................................... 206
Figure VII-6 Sub divisions of “Volition”........................................................................ 207
Figure VII-7: Sub divisions relating to "Existence"........................................................ 207
Figure VII-8 : An extract from Roget's thesaurus ........................................................... 207
Index of Algorithms
Algorithm 3-1: Creation of the E-Roget............................................................................ 52
Algorithm 3-2: Create Lexical Chains. ............................................................................. 57
Algorithm 5-1: SUSS Algorithm Processing Phase .......................................................... 90
Algorithm 5-2: Generate Links. ...................................................................................... 104
Algorithm 5-3: Local Word Disambiguation. ................................................................. 105
Algorithm 6-1: Reducing the number of example texts to five....................................... 117
Algorithm III-1: An Algorithm to make the lexical chains in a text visible. .................. 199
Index of Graphs
Graph 3-1: Zipf Law: GDP Profile Values Vs Rank......................................................... 66
Graph 4-1: Link Type Vs Book Title ................................................................................. 76
Graph 4-2: Identical Links (%) Vs Inter-word Distance ................................................... 77
Graph 4-3: Percentage of Category Links Vs Inter-word Distance .................................. 78
Graph 4-4: Percentage of Group Links Vs Inter-word Distance....................................... 78
Graph 4-5 : Non-Self Triggers (Beeferman et al. 1997) ................................................... 79
Graph 4-6 : Self Triggers (Beeferman et al.1997)............................................................. 80
Graph 4-7 : Moby Dick. Number of Same Category words Vs Rank............................... 82
Graph 6-1 : Copyright S1 ratings .................................................................................... 127
Graph 6-2: AI: S1 ratings ................................................................................................ 129
Graph 6-3: Rosetta S1 ratings ......................................................................................... 131
Graph 6-4: Socialism: S1 ratings..................................................................................... 133
Graph 6-5: Ballot: S1 ratings........................................................................................... 135
Graph 6-6: Breakdance: S1 ratings ................................................................................. 137
Graph VI-1: Distribution of Polysemic Words in Roget’s Thesaurus ............................ 209
Graph VI-2: Frequency of Collocations in Roget’s Thesaurus ....................................... 210
Index of Tables
Table 2-1: Common steps in Information Retrieval. (From Robertson 1994, p3) ............ 15
Table 2-2: Comparison of IR and Textual CBR (reproduced from Lenz 1998) ............... 28
Table 2-3: Sub-domains of NLP. (Reproduced from Liddy 1998) ................................... 29
Table 2-4: Correlation of similarity measurements........................................................... 34
Table 3-1: Thesaural Relations Vs Mean Words ............................................................. 56
Table 3-2: Value of the different lexical links................................................................... 58
Table 3-3: Quotation from Einstein 1939 (cited by StOnge 1995) ................................... 59
Table 3-4: An example lexical chain embedded in a text ................................................. 60
Table 3-5: Senses of “TRAIN”.......................................................................................... 61
Table 3-6: An Example Generic Document Profile .......................................................... 63
Table 3-7: Link Type indications ...................................................................................... 67
Table 4-1: Texts Selected .................................................................................................. 73
Table 4-2: Reading Complexity of the Texts .................................................................... 74
Table 4-3: Samples of Trigger Pairs (Beeferman et al.1997)............................................ 80
Table 5-1 Disambiguation Success (% accuracy) Vs Word by method............................ 97
Table 6-1: Rejection Criteria for example texts .............................................................. 118
Table 6-2 Copyright: Experimental Comparative Results .............................................. 128
Table 6-3: Copyright: Numeric Rank of similarity scores .............................................. 128
Table 6-4: Copyright: Hesperus and SWISH Spearman Rank Correlation. ................... 129
Table 6-5: AI: Experimental Comparative Results ......................................................... 130
Table 6-6: AI: Hesperus and SWISH Spearman Rank Correlation. ............................... 130
Table 6-7: Rosetta: Experimental Comparative Results ................................................. 132
Table 6-8: Rosetta: Hesperus and SWISH Spearman Rank Correlation......................... 132
Table 6-9: Socialism: Experimental Comparative Results.............................................. 134
Table 6-10: Socialism: Hesperus and SWISH Spearman Rank Correlation................... 134
Table 6-11: Ballot: Experimental Comparative Results.................................................. 136
Table 6-12: AI: Hesperus and SWISH Spearman Rank Correlation. ............................. 136
Table 6-13: Breakdance: Experimental Comparative Results ........................................ 138
Table 6-14: Breakdance: Hesperus and SWISH Spearman Rank Correlation................ 138
Table 6-15 Hesperus: Table of significance of experimental results ......................... 139
Table II-1: Generic Document Profile for “Rosetta Stone” from MS Encarta................ 179
Table II-2 Generic Document Profile for “Copyright” from MS Encarta....................... 184
Table II-3 Generic Document Profile for “Socialism” from MS Encarta ...................... 189
Table II-4: Generic Document Profile for “Ballot” from MS Encarta ........................... 193
Table II-5 Generic Document Profile for “AI” from MS Encarta .................................. 196
Table II-6 Generic Document Profile for “Breakdance” from MS Encarta .................. 198
Figure 1: Extract from Roget’s Thesaurus1
Reproduced with the kind permission of the Karpeles Library.
Chapter 1. Introduction
1.1 Introduction
This study describes a new method to determine the similarity of two texts based on the
words they contain that are related in meaning. These words may be linked into chains
that contribute towards the cohesion of the text. These “lexical chains” (Morris and Hirst
1991) are identified using Roget's thesaurus. Roget has not been used previously in a
computer program to identify lexical chains. Neither has it been suggested that the
similarity of whole texts may be compared in this way. Thus, the study brings together
ideas of similarity judgements, text cohesion, and Roget’s thesaurus.
Similarity judgements are an essential component of human thought and inference
processes (Sloman and Rips 1998). Since no situation is exactly like another, people must
be able to generalise their experiences in order to apply them in new circumstances (Hahn
and Chater 1998, Schank and Abelson 1977). These may be basic responses to simple
stimuli as in classical Pavlovian conditioning, or making legal judgements based on
previous, similar cases. Similarity is at the core of the artificial intelligence problem
solving method known as “Case Based Reasoning” (Aamodt and Plaza 1994), which
seeks to identify problem solutions based on their similarity to past successes.
Cohesion is that property of a text that allows it to be read as a unified entity, as opposed
to a series of unconnected sentences. Halliday and Hasan (1976, 1989) have identified
many devices used to make text cohesive. These include linguistic phenomena, such as
anaphora, cataphora, ellipsis, co-extension, and chains of words. Lexical chains may be
composed of identical or similar words.
Roget's Thesaurus is a well-known scholarly work and writer's tool. It is used by authors
to find related and relevant words. It contains thousands of words organised by their
similarity to each other in a conceptual four-level hierarchy. Roget, and the semantic
information in its hierarchy, has been used in Information Retrieval (Spärck-Jones 1986;
Boyd et al. 1993), Word Sense Disambiguation (Yarowsky 1992), and in Text Cohesion
(Morris and Hirst 1991). Roget’s Thesaurus is described further in Appendix VII.
Morris and Hirst (1991) proposed using Roget's thesaurus to identify the lexical chains in
a text. This would suggest its structure, which is an essential step in recognising its deeper
meaning. Morris and Hirst recognised that lexical chains provided a semantic context for
the interpretation of words and sentences. This study defines a method representing that
semantic context, which may then be used for estimating the similarity of two texts.
1.2 Research problem and research questions
This study addresses the problem of extracting a representation of a text's meaning from
its content using Roget's thesaurus as an external knowledge source. The resulting
document representation can then be compared to others, giving rise to a new method for
text similarity assessment. Specifically, the study tries to ascertain:
1. Whether a text similarity measure may usefully be constructed from a text's
lexical chains as identified using Roget's Thesaurus.
2. If the text similarity measure defined provides a better approximation of human
judgements than purely statistical methods.
3. Whether the text representation considered is suitable for the analysis of texts of
different lengths and complexities.
4. Whether the measure may be improved by including word sense disambiguation
at current levels of accuracy.
There are three motivations for this work:
Firstly, natural language approaches to Information Retrieval have consistently performed
no better than statistical and heuristic methods (Strzalkowski 1999). Such statistical
methods consider documents to be described by a representative set of keywords (Baeza-Yates and Ribeiro-Neto 1999). However, natural language texts contain sentences that
have a grammatical structure and use a varied vocabulary rich in synonyms. These
elements contribute to a text’s meaning, which is not considered in statistical IR. Since
people use meaning when considering text similarity it is important to demonstrate a
robust application of Natural Language Processing that considers it, even in the most
superficial sense, and demonstrates improved performance on text similarity matching.
Secondly, the study considers the potential advantages and disadvantages associated with
the use of Roget's thesaurus in identifying lexical chains. Several other computer
implementations of lexical chains (see Chapter 2) have used Princeton's WordNet (Miller
et al. 1990, Fellbaum 1998). Since these have not performed better than rival statistical
measures it is important to decide whether the technique does not work well, or whether it
may be improved by a different knowledge source.
Thirdly, all coherent texts contain embedded sequences of words related in meaning.
These sequences can be detected by identifying simple relationships between the relevant
thesaural entries in which the words are found. The identification of initial sequences
drives the addition of further related words into conceptually related lexical chains.
Although they differ in content, it is not known whether the distribution of the links in
these lexical chains is also dependent on the type of text in which they are embedded.
Consequently, this needs to be determined if this technique is to be of general applicability.
Every coherent text contains many lexical chains of different lengths and strengths. We
may use these to represent the broad subject matter of a text. This is done by identifying
the key concept of each chain, and relating this to its magnitude, giving an attribute value
vector of concepts and their strengths. We can then use this to identify other texts as
closer or further away in meaning.
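The idea can be sketched in a few lines. This is a minimal illustration only: the actual profile construction and comparison measure are defined in Chapter 3, and the cosine measure, concept names, and strengths below are assumptions made for the sake of the example.

```python
from math import sqrt

def profile(chains):
    """Collapse lexical chains, given here as (key_concept, strength)
    pairs, into an attribute-value vector of concepts and strengths."""
    vec = {}
    for concept, strength in chains:
        vec[concept] = vec.get(concept, 0.0) + strength
    return vec

def similarity(p, q):
    """Compare two profiles; cosine similarity is used purely for
    illustration of 'closer or further away in meaning'."""
    dot = sum(p[c] * q[c] for c in set(p) & set(q))
    norm = (sqrt(sum(v * v for v in p.values()))
            * sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# toy profiles: texts sharing strong concepts score as more similar
a = profile([("travel", 3.0), ("vehicle", 2.0)])
b = profile([("travel", 2.0), ("music", 1.0)])
c = profile([("music", 4.0)])
```

Here `similarity(a, b)` exceeds `similarity(a, c)`, since the first pair of profiles shares the "travel" concept while the second pair shares none.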
1.3 Justification for the research
The purpose of this study is to ascertain whether chains of words in texts that are related
by virtue of their position in Roget's thesaurus could be used to define a general measure
of a text's subject matter. This metric needs to be sufficient to allow the similarity of two
texts to be compared more accurately than measures that consider texts as composed of
unrelated terms.
This study is important for three reasons: firstly, it examines the suitability of Roget's
thesaurus as an external knowledge source. Other work on lexical chains (Stairmand
1996, and StOnge 1995) expressed the view that the general relationships between related
words included in Roget may be more useful than those found in Princeton's WordNet
(Fellbaum et al. 1998).
Secondly, it develops a robust, but shallow method of estimating a text's subject matter
from its content. Most Natural Language Processing (NLP) methods are not capable of
analysing unconstrained text (Lewis and Spärck-Jones 1996). Additional robust methods
could increase the practical utility of NLP, for example by making them applicable to real
world tasks, such as analysing on-line information from the World Wide Web (Berners-Lee et al. 1994, Ellman and Tait 1996).
Thirdly, it has significant potential in textual case based reasoning (T-CBR). T-CBR
(Lenz et al. 1998) applies the problem solving methodology (Watson 1997) of case based
reasoning (CBR) to knowledge bases stored as texts (see Chapter 2). Current applications
in T-CBR all use purpose built techniques to analyse a text's contents. This study will
describe a method of determining a text's similarity that is applicable to any subject
domain. Since similarity assessment is a core element of CBR, a generic method could
significantly ease the burden of building a T-CBR system by avoiding writing a text
analyser specifically for each new application.
1.4 Methodology
This is an experimental study in Computational Linguistics with an emphasis on
comparative evaluation. The core research method used was the construction of a lexical
chaining program that uses Roget's thesaurus as a knowledge source. This program is
known as “Hesperus”. Hesperus' performance was evaluated against judgements of text
similarity made by human subjects in an experiment based in a realistic setting. The
results were also compared to those given by a statistically based Information Retrieval
program. The approach to word sense ambiguity was analysed by participating in an
international word sense disambiguation benchmarking competition called Senseval
(Kilgarriff and Rosenzweig 2000).
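The comparison between Hesperus' similarity rankings and those of human subjects (reported in Chapter 6) uses Spearman rank correlation. A minimal sketch of that statistic, assuming rankings without ties, is:

```python
def spearman(rank_x, rank_y):
    """Spearman rank correlation between two rankings of the same
    items (no ties): rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# a hypothetical system ranking compared with a human ranking of
# five example texts; only the top two items are transposed
human = [1, 2, 3, 4, 5]
system = [2, 1, 3, 4, 5]
rho = spearman(human, system)
```

Identical rankings give rho = 1, reversed rankings give rho = -1, and the near-agreement above scores close to 1.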
Hesperus is made up of a lexical chaining program, and a computationally tractable
version of Roget's thesaurus as a knowledge source. The lexical chaining program is an
enhanced version of one described by StOnge (1995), which is based on Morris and Hirst
(1991). Enhancements include storing lexical chains according to their prominence in the
text. The procedure that calculates the importance of a lexical chain is based on
Stairmand's (1996) work. Computational complexity is also controlled using a sliding
window approach (Schütze 1992). Details are given in Chapter 3.
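The sliding window idea can be sketched as follows. The `related` predicate stands in for the thesaural relationships actually used, and the greedy first-fit policy is an illustrative simplification, not the chaining algorithm of Chapter 3.

```python
def chain_words(words, related, window=5):
    """Greedy lexical chaining with a sliding window: a word joins the
    first chain containing a related word seen within the last `window`
    positions; otherwise it starts a new chain.  Restricting the search
    to the window keeps the cost of each step bounded."""
    chains = []  # each chain is a list of (position, word) pairs
    for pos, word in enumerate(words):
        for chain in chains:
            recent = [w for p, w in chain if pos - p <= window]
            if any(related(word, w) for w in recent):
                chain.append((pos, word))
                break
        else:
            chains.append([(pos, word)])
    return [[w for _, w in chain] for chain in chains]

# toy relatedness: words sharing an initial letter count as related
result = chain_words(["car", "cat", "dog", "cow"],
                     lambda a, b: a[0] == b[0], window=2)
```

With a window of 2, "cow" still joins the first chain through the nearby "cat", while "dog" starts a chain of its own.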
A machine-readable version of Roget's thesaurus was not available to Morris and Hirst
(1991). Since then the 1911 version of Roget's thesaurus has been made available by
Project Gutenberg (1999). Whilst this is machine-readable it is not machine tractable, as it
is one large block of text. It was made machine tractable by splitting it into multiple files,
and then using an Information Retrieval program to make these accessible. An identical
procedure was applied to the 1987 Roget when permission had been granted to use this
for research purposes.
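The conversion step might be sketched like this. The entry format shown is a simplified assumption, not the actual Project Gutenberg file layout, and the real system used an Information Retrieval program over multiple files rather than an in-memory dictionary.

```python
import re

# a toy excerpt in an assumed 'NNN. Heading.' format
RAW = """450. Absence of Thought.
vacancy; inanity.
451. Thought.
reflection; cogitation; consideration."""

def split_entries(raw):
    """Split one flat block of thesaurus text into numbered categories,
    assuming each category starts with 'NNN. Heading.' on its own line."""
    entries, current = {}, None
    for line in raw.splitlines():
        m = re.match(r"(\d+)\.\s+(.*)", line)
        if m:
            current = int(m.group(1))
            entries[current] = {"heading": m.group(2), "words": []}
        elif current is not None:
            entries[current]["words"] += [w.strip(" .") for w in line.split(";")]
    return entries

def build_index(entries):
    """Invert the category word lists into a word -> [categories] index,
    making individual entries directly accessible."""
    index = {}
    for num, entry in entries.items():
        for word in entry["words"]:
            index.setdefault(word, []).append(num)
    return index

entries = split_entries(RAW)
index = build_index(entries)
```

The point of the exercise is the same as in the thesis: turning one large block of text into separately addressable, searchable entries is what makes the resource machine tractable rather than merely machine readable.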
1.5 Thesis Overview
Following the introduction contained in this chapter, the remaining components of this
thesis are as follows.
Chapter 2 surveys the state of the art in areas related to this thesis. Following brief links
into the parent disciplines of Natural Language Processing, Information Retrieval, and
Textual Case Based Reasoning attention is focussed on other work in lexical chains, and
in text and concept similarity assessment. Other possible approaches to the text similarity
problem are also considered, such as those that are not knowledge based.
Chapter 3 describes the lexical chaining program Hesperus, and its implementation.
Procedures are also given for converting a text-based thesaurus into a resource suitable
for lexical chaining, for text similarity assessment, and for visualising lexical chains in a text.
Chapter 4 examines the general nature of the approach. That is, we raise the issue of text
genre, and whether lexical chains (and hence Generic Document Profiles) derived from
simple texts may be compared to those from ones that are more complex. This question is
answered by analysing several book length texts that differed in complexity according to
a standard readability metric and comparing the frequency and types of the thesaural links found.
Chapter 5 considers the problem of word sense ambiguity, and possible approaches to it
that could be used in Hesperus. These ideas were tested within the context of an
international contest known as Senseval (Kilgarriff and Rosenzweig 2000) which was
held to evaluate different approaches to word sense disambiguation. Appropriate concepts
were then migrated into the Hesperus system design, where their effectiveness could be assessed.
Chapter 6 describes a fully randomised experiment designed to evaluate Hesperus, the
text similarity program. This involved generating a benchmark set of data whose
similarity was assessed by people. Hesperus was then operated under several conditions,
and its similarity judgements compared against those of the human subjects. This gave an
indication of their efficacy.
The final chapter summarises findings, draws conclusions, and makes suggestions for
further research. The conceptual relationships between the chapters are shown in Figure 1-1.
Figure 1-1: Conceptual organisation of the chapters of the thesis (Literature Review; Hesperus: a system for comparing the similarity of texts using lexical chains; The General Nature of Lexical Links; Word Sense Disambiguation and Hesperus; Evaluating Hesperus)
1.6 Definition of Terms
Definitions adopted by researchers are rarely uniform, so essential or unusual terms are
defined here to eliminate future confusion.
Roget's Thesaurus
This study depends critically on Roget's thesaurus. However, since Roget's thesaurus was
first published in 1852 there have been many different editions1. This study uses the
version of 1911 in Chapters 3 and 4, and that of 1987 thereafter. Nonetheless, the lessons
from the study are general, because of the common nature of Roget’s thesauri. To
understand this requires a brief introduction to the structure of Roget.
There are three essential components common to any Roget’s thesaurus:
1. A four-level2 hierarchical structure that includes approximately one
thousand concepts or classes.
2. A large number of words organised into these classes.
3. An index that identifies to which classes a word belongs.
The index (3) is highly dependent on the body of words that it is produced from (2). This
varies considerably from pocket editions of Roget (such as that of 1911) which have
approximately 60,000 words, to large desktop editions that have 250,000 words.
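As a sketch of how these three components fit together, the following toy example builds the index (3) directly from the body of words (2). The class numbers and word entries are invented for illustration and are not taken from any edition of Roget:

```python
# A minimal sketch (not the thesis's implementation) of the Roget
# components: words organised into numbered classes, and an index
# from each word to the classes that contain it.

# (2) Words organised into classes (invented toy entries).
classes = {
    264: ["travel", "journey", "train"],
    267: ["velocity", "speed", "rapidity"],
}

# (3) The index is derived directly from the body of words in (2),
# which is why its coverage varies with the size of the edition.
def build_index(classes):
    index = {}
    for class_no, words in classes.items():
        for word in words:
            index.setdefault(word, []).append(class_no)
    return index

index = build_index(classes)
```

A word found in more than one class would simply accumulate several class numbers in its index entry, which is the ambiguity discussed later in Section 2.2.2.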
The performance of algorithms developed here depends on the presence of words in the
index. Consequently the exact edition of Roget is important, and better performance is
seen with larger editions of Roget.
The class structure of Roget varies little. Roget originally used 1000 classes, but
subsequently added approximately 16 sub classes. Later lexicographers reduced this to
990 classes, which is a variation of ±1.5%.
The algorithms developed here exploit Roget’s class structure. However, given the small
variation in that structure they are applicable to any version of Roget. Strictly though, the
work reported here may only be reproduced exactly with the 1911 edition for the
experiments reported in Chapter 3, and the 1987 edition subsequently3. Nonetheless, the
1 There were twenty-four in Roget's lifetime alone (Encarta 1997).
2 The four named levels in the hierarchy, and up to three levels underneath these, identifiable either by syntactic category or punctuation (especially semicolons).
3 However, the paper edition of 1962 is used for reference throughout.
principles identified are applicable to any Roget’s thesaurus, albeit with some tuning.
Further information about Roget’s thesaurus is given in Appendix VII.
Similarity
This thesis is about “similarity”, but what does that mean? A standard definition would be
“of the same kind, nature, or amount; having a resemblance” (COD9), and most authors
imply this (e.g. Lee 1997). Hahn and Chater (1998) offer the following intuition (that they
note as being vague), which will be used here:
1. Similarity is some function of common properties.
2. Similarity is graded.
3. Similarity is maximal for identity.
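One minimal function satisfying all three intuitions is set overlap. The sketch below is an illustration only, not a measure used in this thesis: it depends only on common properties (1), is graded between 0 and 1 (2), and is maximal for identical property sets (3):

```python
def similarity(a, b):
    """Jaccard overlap of two property sets: a function of common
    properties, graded in [0, 1], and maximal (1.0) for identity."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0       # two empty sets are trivially identical
    return len(a & b) / len(a | b)
```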
Word Meaning and Word Sense
There is no universally acceptable definition of word meaning. One view taken by a
logical positivist would be that the meaning of a word is determined by the truth value of
the proposition of which it forms a part. The view of a pragmatist would be that a word’s
meaning is given by its context of use, and the intention of whoever uttered or wrote it.
An everyday view would be that a word’s meaning is given by its dictionary entry. This is
possibly one of the least satisfactory definitions, as it implies that lexicographers dictate
the meaning of words in a language, whereas their aim is to interpret it.
The dictionary definition of meaning does have the advantage of operational adequacy.
That is, it can be implemented in a computer program so as to offer users appropriate
interpretations as desired.
The view is taken in this thesis that word meaning and word senses are given by their
entries in the dictionary specified. If no dictionary is mentioned, then Roget’s thesaurus is
implied. That is, we assume that a word sense corresponds to a Roget category even
though Roget only defines words by association with others, rather than giving formal definitions.
Now there are two classic problems that affect work with natural language texts: several
different words can often be used to express the same concept, and one word can be used
for several different concepts. Let us call these problems “synonymy” and “word sense ambiguity”.
In synonymy, a word is equivalent to another in some (but possibly not all) senses. An
example would be “wedding” and “marriage” in the ceremony sense4, where both are
equivalent in meaning, although “wedding” cannot be substituted for “marriage” in all its
dictionary senses. For example, both words can be used interchangeably in (1), but not in (2):
1. The wedding took place in church.
2. The marriage had irretrievably broken down.
Synonymy and word sense ambiguity lead to the classic information retrieval “vocabulary
problem” (Furnas et al. 1987, Blair and Maron 1985) in which users will enter different
terms for desired objects or actions from that envisaged by a system's designer.
Ambiguous words (see Chapter 5) are often divided into “homographic” and
“polysemous” senses. Homographic words have identical spellings, but completely
different meanings that are often the result of different derivations. These correspond to
major sub entries in a dictionary. An example would be “dram” in the sense of a small
drink of spirits, as opposed to computer memory specification e.g. “24meg dram”.
Polysemous words have many meanings. Here “polysemy” refers to different sense
variations within one major dictionary entry. For example, the “onion” in cheese and
onion crisps is derived from, as opposed to identical to, the onion that may be grown in
the garden.
Identifying which word sense(s) were intended by an author is known as the word sense
disambiguation problem.
Lexical Chains
The term “lexical chain” is due to Morris and Hirst (1991). They used the term to identify
sequences of related words in a text. Their work was based on the stricter definitions of
Halliday and Hasan (1976, 1989) who defined the earlier term “cohesive chain” as being
a set of terms that are semantically related. Halliday and Hasan (1989) also specified that
these chains were of two types: “identity chains”, where every member refers to the
same thing (that is, they are co-referential), and “similarity chains”. In similarity chains,
the terms are related by co-classification, or co-extension, that is, they refer to members
of the same class of things or events. Halliday and Hasan (1989, p84) stress that the
4 Where “wedding” is derived from Old English, and “marriage” from Middle English through Old French (COD9).
distinction between identity and similarity chains is important. However, it is not possible
to maintain this distinction computationally. Therefore, this study will follow Morris and Hirst
(1991) in its use of the term lexical chain.
An Example text with Lexical Chains marked
To better clarify the notion, we are going to briefly consider an example text with lexical
chains indicated. The following quotation from Einstein was considered by StOnge
(1995). It is sufficiently brief to look at in some detail and also permits some comparison
of the two works (Section 3.4).
“We suppose a very long train travelling along the rails with the constant velocity v
and in the direction indicated in Figure 1. People travelling in this train will with
advantage use the train as a rigid reference-body; they regard all events in
reference to the train. Then every event which takes place along the line also takes
place at a particular point of the train. Also, the definition of simultaneity can be
given relative to the train in exactly the same way as with respect to the
embankment.” (Einstein 1939, cited by StOnge 1995)
StOnge (1995) manually identified three lexical chains in this text. These are indicated by
subscripts in the text5, and then listed below.
We suppose a very long train1 travelling2 along the rails1 with the constant velocity2
v and in the direction2 indicated in Figure 1. People travelling2 in this train1 will
with advantage use the train1 as a rigid reference-body3; they regard all events in
reference3 to the train1. Then every event which takes place along the line1 also
takes place at a particular point1 of the train1. Also, the definition of simultaneity
can be given relative to the train1 in exactly the same way as with respect to the
embankment1. (Einstein)
1. {train, rails, train, train, train, line, point, train, train, embankment}
2. {travelling, velocity, direction, travelling}
3. {reference-body, reference}
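The chains above can be represented directly as word lists. As a crude proxy for exposition (an assumption made here for illustration, not the scoring developed later in the thesis), a chain's strength can be taken as its number of links:

```python
# St-Onge's three chains from the Einstein passage, as word lists
# (repetitions included, as in the annotated text above).
chains = [
    ["train", "rails", "train", "train", "train", "line",
     "point", "train", "train", "embankment"],
    ["travelling", "velocity", "direction", "travelling"],
    ["reference-body", "reference"],
]

# Illustrative strength: simply the number of links in each chain.
strengths = [len(chain) for chain in chains]
```

On this simple measure the "train" chain dominates the passage, matching the intuition that rail travel is its central theme.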
It is important to remember that there is no absolute truth in the selection of any particular
5 Reproduced from StOnge (1995). Colour has been added to the chains StOnge identifies for clarity of exposition.
chain of words in a text. The relationships between words that form part of a coherent
theme in the text may or may not be detectable by reference to an external thesaurus (see
Section 3.4). Furthermore, the presence of a word in a thesaurus may itself be problematic
due to its possible multiple interpretations (see Chapter 6). Nevertheless, the example
shows the type of semantic relationship found in a lexical chain.
Lexical Chains and their component Links
Halliday and Hasan (1976, 1989) use the word “tie”6 to indicate relations in a text that can
be joined into a cohesive chain. This study consistently uses “links” to describe elements
of a lexical chain since this is the primary sense, as judged by position of the dictionary
entry. Links in a lexical chain (hence lexical links) should be considered as equivalent to
Halliday and Hasan's ties.
Lexical Chains and the Generic Document Profile
Lexical chains are used to compute a semantic representation of a text that we call its
“Generic Document Profile” (Chapter 3). This is an attribute value vector of Roget
categories whose strengths are determined using the lexical chains identified in the text.
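A minimal sketch of such a profile follows. The Roget category numbers are invented, and chain length stands in for the strength calculation actually developed in Chapter 3; both are assumptions made for illustration:

```python
# Sketch of a Generic Document Profile: an attribute-value vector
# mapping a Roget category (the key concept of each chain) to a
# strength. Category numbers and length-based strength are
# illustrative assumptions, not the thesis's exact scoring.
def profile(chains, category_of):
    gdp = {}
    for chain in chains:
        cat = category_of[chain[0]]          # key concept of the chain
        gdp[cat] = gdp.get(cat, 0) + len(chain)
    return gdp

category_of = {"train": 264, "travelling": 267, "reference-body": 466}
chains = [["train", "rails", "train"],
          ["travelling", "velocity"],
          ["reference-body", "reference"]]
gdp = profile(chains, category_of)
```

Two such vectors can then be compared with any standard similarity coefficient, which is how one text is judged closer or further in meaning from another.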
Ecological Validity
The term “ecologically valid” is widely used in psychology to describe experiments that
try to ensure that a task's content and features are representative of the larger
circumstances of a person's activities. The term is due to Brunswik, and is discussed at
length by Hammond (1998).
We consider the experiments described in Chapter 6 as ecologically valid as they are
carried out in the subjects’ usual environment, using computer hardware and software
with which they are familiar.
1.7 Delimitation of Scope and Key Assumptions
There are inevitably some limitations inherent in the approach used. These will be
described here as they may restrict the ability to extrapolate from the results. As with
other work based on lexical chains, Hesperus can only process words that are found (or
may be morphologically reduced to those) in the thesaurus. Thus, document themes that
are tightly bound to proper names are not catered for in the program as it stands. This
6 This sense of “tie” is that of something (typically a beam) holding parts of a structure together. This sense is less commonly used, as indicated by its sixth position in the dictionary entry (COD9).
might for example be addressed by incorporating work on named entities (Chinchor 1998)
being done within the MUC program by Humphreys et al. (1998) on LaSIE, or Black et
al. (1998) on FACILE.
This study is based on a particular interpretation of lexical chains that may be derived
using an external thesaurus. Halliday and Hasan (1976, 1989) identified other cohesive
relations, such as those based on pronouns. Being able to automatically identify the
contribution these made to the sense of a text would clearly be useful, but its
implementation (e.g. Azzam, Humphreys, and Gaizauskas 1999) is beyond the scope of
this study. We claim that using thesaurally derived lexical chains is sufficient for crude
similarity assessment.
Hesperus is a research tool. Its performance has not been optimised based on intermediate
results. The premise is that if the technique works then it may be improved in the future
by tuning system parameters based on empirical data.
There are inevitable limitations with the text similarity experiment reported in Chapter 6.
The subjects were undergraduate and masters students at the University of Sunderland
following courses in Computing and Information Systems. Their assessments of similarity
may not be representative of the general population. Similar limitations of scope of course
apply to the vast majority of work in the psychological literature.
1.8 Summary
This chapter has presented the major subject of the thesis: the use of thesaurally related
lexical chains in determining text similarity. This has introduced lexical chains,
especially those dependent on relations that may be determined using Roget's Thesaurus.
The thesis methodology has also been specified. This is experimental, with an emphasis
on comparative evaluation. We shall now proceed with a discussion of the background to
the research followed by detailed description of the study.
Chapter 2. Literature Review
2.1 Introduction
The objective of this chapter is to provide a basis from the published literature for the
work outlined in Chapter 1. This will be done by reviewing the relevant areas to identify
germane research issues. These research issues come from Information Retrieval, Case
Based Reasoning, and Natural Language Processing, and include thesauri, similarity,
word sense ambiguity, and lexical chains. We now proceed to examine each of these
fields in turn.
2.2 Information Retrieval
2.2.1 Introduction
Information retrieval (IR) “deals with the representation, storage, organisation of and
access to information items” (Baeza-Yates and Ribeiro-Neto 1999, p1). That is, texts,
images, or other forms of information are often collected together for later use. Then, as
the collection size grows, automatic means of identifying entries useful to the enquirer are
required. This section will only consider text retrieval since this is most directly related to
the topic of this thesis, although retrieval of images and multimedia are active research
areas (e.g. Bertino, Catania, and Ferrari 1999).
As a mature field1, there are many general descriptions of IR, its methods, philosophy,
and origins (van Rijsbergen 1979, Salton and McGill 1983, Frakes and Baeza-Yates 1992,
Baeza-Yates and Ribeiro-Neto 1999).
IR may be broadly divided into manual or automatic techniques for indexing and
subsequent retrieval. Manual methods have the advantage of accuracy, but the
disadvantage of cost, in terms of person time required to assign information to an
appropriate category. An example of the successful application of manual methods would
be the Internet search engine Yahoo, which is one of the most popular (Baeza-Yates and
Ribeiro-Neto 1999). This uses a classification system, into which Internet documents are
manually classified. Information seekers retrieve information by searching this
classification hierarchy.
1 The Journal of Documentation, which is associated with IR, made its first appearance in 1946.
Automatic techniques have the advantage that a machine may analyse information for
storage and later retrieval. Thus, the volume of information processed in a given time far
exceeds human capabilities. The cost of processing information automatically is also
decreasing, as it is linked to the exponentially falling cost of computer time and storage.
Automatic methods also have the advantage of consistency, whereas human indexing is
subject to individual judgement.
The disadvantage of automatic methods is that they rarely achieve human levels of
performance. They are far more likely to identify erroneous information as relevant to a
person’s query, or to ignore pertinent information.
Several measures exist for evaluating and comparing IR systems (see van Rijsbergen
1979 for formal definitions). The two measures that are most often cited are known as
recall and precision. Recall is the proportion of relevant documents that are retrieved, and
precision is the proportion of retrieved documents that are relevant at a given cut-off in a ranked list.
Automatic techniques may be divided into methods that directly exploit knowledge about
the contents of the information in order to improve retrieval performance and those that
rely solely on its statistical characterisation. We shall term the former “knowledge based”
methods, and the latter “statistical” methods.
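The recall and precision measures defined earlier can be sketched over sets of document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Recall: proportion of the relevant documents that were retrieved.
    Precision: proportion of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return precision, recall
```

For example, retrieving four documents of which two are relevant, out of three relevant documents in the collection, gives precision 0.5 and recall 2/3.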
FERRET (Mauldin 1991) is an example of a knowledge-based IR system. It exploited
script-based language parsing, and used four levels of lexical knowledge. These included
a hand-coded lexicon, extracts from Webster’s 7th dictionary to account for synonyms,
rules to recognise near synonyms, and special rules to identify names. In a limited domain
of 1065 astronomy articles, Mauldin (1991) claimed performance increases of 30% recall,
and 250% increase in precision when compared to a standard information retrieval
technique (Boolean keyword query). The cost of this increase in precision was a
processing time that Mauldin (1991) notes “was limited to eight minutes per page”.
Automatic statistical methods have two tremendous advantages over knowledge based
methods. Firstly, they are capable of processing vast quantities of data in a limited time,
and secondly, they are largely language independent. Mauldin (1997) describes Lycos, an
Internet search engine that uses statistical methodology.
Lycos is unusual in that it provides automated summaries of Internet documents. Other
Internet search engines do not. Mauldin (1997) describes how these summaries are
created using statistical IR techniques that identify the 100 words that most characterise
the text. He also discusses how both computational and financial cost was a factor in
deciding which approaches could be used in Lycos, since a commercial web service needs
to be capable of searching millions of documents per day.
Internet search services such as AltaVista, Excite, and Lycos that use statistical
approaches are equally applicable to any language. This differentiates them from
knowledge based approaches that are largely restricted to English, and widens their
appeal to a global audience.
In giving an overall review of IR, Robertson (1994) points out that many steps are present
in nearly all free-text systems. These are summarised in Table 2-1, which is derived from
Robertson (1994, p3).
Table 2-1: Common steps in Information Retrieval. (From Robertson 1994, p3)
Free-text indexing: Creates an index entry from part of the item only (e.g. title or abstract). This exploits a previous manual selection of the important words to describe an item.
Word identification: There must be a set of rules for this, dealing not only with word separators such as blank characters and punctuation, but also with upper-lower case, embedded hyphens or hyphens at the end of lines, numbers etc.
Stop words: Most systems identify and exclude certain common words (the list is usually manually prepared).
Stemming: Stemming or suffix stripping refers to the abbreviation of index and query terms to reduce index size, and increase retrieval effectiveness.
Dictionary operations: These identify phrases, acronyms, and synonyms, possibly using a thesaurus.
Inverted file generation: This is one of the principal data storage techniques. It allows the identification of the containing document from a recognised query term.
A classification of text retrieval techniques proposed by Belkin and Croft (1987) is shown
below in Figure 2-1.
Figure 2-1: A classification of text retrieval techniques. Reproduced from Belkin and Croft (1987)
Detailed descriptions of these methods are widely available (e.g., see van Rijsbergen
1979, Baeza-Yates and Ribeiro-Neto 1999).
At its broadest, text based Information Retrieval consists of collecting a set of texts,
indexing them, and then querying the index to identify relevant documents. It could be
said that one of the principal objectives of IR is the optimisation of query document
similarity to maximise precision and recall. Enhancements to this approach are being
explored, especially based on visualisation (Nowell et al. 1996), knowledge discovery
(Crimmins et al. 1999), and, within the Internet, citation count (Brin and Page 1998).
However, the following sections will pursue query document similarity, as it is most
relevant to the work outlined in Chapter 1.
A fundamental issue in query document similarity is known as “The Vocabulary
Problem” (Furnas et al. 1987). Simply put, this occurs when information seekers do not
use exactly the same words as those employed by information providers. They may for
example, use synonyms, or equivalent phrases. The converse of the vocabulary problem is
the issue of word sense ambiguity. Here users find inappropriate documents that include
the search terms they specified, but in a sense other than the one they intended.
The vocabulary problem may be addressed by exploiting thesauri (Qiu and Frei 1995), or
modifying queries (Xu and Croft 1996). In the following sections, we firstly consider
lexical ambiguity in IR to put the vocabulary problem in perspective. Next, we consider
standard measures of similarity measurement, as these are clearly related to the approach
outlined in Chapter 1. We then discuss query modification and the use of thesauri.
Finally, we consider IR evaluation. This is critical if we wish to know how a modified
technique has affected precision and recall.
2.2.2 Lexical Ambiguity and Information Retrieval
Approximately one third of the words in Roget’s thesaurus are found in more than one
entry, and could be considered ambiguous (see Appendix VII, and Section 2.4.2). This
33% ambiguity level is in broad accord with figures reported by Ide and Véronis (1998)
from a variety of sources. Consequently, an IR query containing an ambiguous word such
as plant (e.g. manufacturing plant, as compared to plant life) may identify documents as
relevant that contain the term in a sense other than that in the query. Although Krovetz and
Croft (1992) have argued that ambiguity is diminished in subject specific document
collections, it is a factor in heterogeneous document collections such as TREC, and the
Web. As such, there is a question as to whether automatic sense disambiguation is
desirable in IR, since it is an unresolved research problem.
Sanderson (1994) has carried out detailed experiments on the effects of lexical
disambiguation on Information Retrieval performance. Sanderson (1994) used the
artificial ambiguity technique that he credits to Yarowsky. Here, word pairs that are
unrelated are merged into pseudo words. For example if the word pair is “kalashnikov”
and “banana”, every occurrence of both in the document collection is replaced with
artificial term “kalashnikov/banana”. The advantage of this approach is that the original text
can be used to identify which word sense was intended (i.e. either “kalashnikov” or “banana”).
Sanderson (1994) tested the effect on IR performance of varying degrees of
disambiguation. He did this by creating pseudo words of between two and ten words in
length. The corpus used was the Reuters 22713. This is a collection of 22713 articles from
the Reuters news wire in 1986. Sanderson varied the accuracy of his artificial
disambiguator, and examined the effect on the E measure2 (van Rijsbergen 1979). He
2 E is a compound precision-recall performance measure (van Rijsbergen 1979).
concluded that IR systems are insensitive to lexical ambiguity, but very sensitive to
erroneous disambiguation. This needs to be more than 90% accurate to improve IR performance.
Information Retrieval is not the same problem as text similarity matching, as the latter is a
subset of the former. Nonetheless, Sanderson’s point that inaccurate disambiguation may
degrade performance to a greater extent than no disambiguation is still applicable. A
research question remains as to what the level is, in text similarity matching, and whether
disambiguation technology exceeds this threshold or not.
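Sanderson's pseudo-word construction described in the previous section can be sketched as a simple token substitution: every occurrence of either member of an unrelated word pair is replaced by the merged artificial term, introducing controlled ambiguity while the original text still records which sense was intended.

```python
# Sketch of the pseudo-word technique: merge an unrelated word pair
# into one artificial ambiguous term throughout a token sequence.
def pseudoword_text(tokens, pair):
    merged = "/".join(pair)
    return [merged if token in pair else token for token in tokens]

tokens = ["a", "banana", "and", "a", "kalashnikov"]
merged_tokens = pseudoword_text(tokens, ("kalashnikov", "banana"))
```

A disambiguator can then be scored by how often it recovers the original member of the pair at each merged position.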
2.2.3 Similarity and its measurement
Calculation and measurement of similarity is often of importance in IR. Similarity
measures allow documents to be ranked in order of relevance to a query, and are also used
to cluster similar documents together. Many similarity formulae have been proposed,
although none has gained universal acceptance (Bartell, Cottrel, and Belew 1998).
Examples include the Dice, Jaccard, and Cosine Coefficients (see van Rijsbergen 1979,
Salton and McGill 1983 for original references).
Similarity coefficients are combined with term weights for effective searching. Weights
exploit the relative frequency of search terms in the document collection, so that less
common terms contribute a greater weight to the similarity calculation. That is, they
exploit the heuristic that rare terms are better indicators of a relevant document than
common ones.
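As an illustration of a similarity coefficient combined with term weights, the sketch below computes a cosine coefficient over weighted term-frequency vectors. The weighting scheme here is a placeholder for exposition (any term absent from the weight table counts as weight 1.0), not a specific published formula:

```python
import math

# Cosine coefficient over weighted term vectors: rarer terms, given
# higher weights, contribute more to the similarity score.
def cosine(doc_a, doc_b, weight):
    terms = set(doc_a) | set(doc_b)
    a = [doc_a.get(t, 0) * weight.get(t, 1.0) for t in terms]
    b = [doc_b.get(t, 0) * weight.get(t, 1.0) for t in terms]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Identical documents score 1.0 and documents sharing no terms score 0.0, with graded values in between, mirroring the intuitions about similarity set out in Chapter 1.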
Zobel and Moffat (1998) report an assessment of eight query document similarity
measures, nine ways of choosing document weights, two methods of calculating
document term weights, and six ways of setting relative term frequencies. Zobel and
Moffat (1998) report that these combine to give a considerable number of possible similarity measures.
Zobel and Moffat (1998) report experiments on Disk 2 of the TREC collection to
determine which similarity formula was most effective. Whilst the various combinations
have been in common use for many years, they have rarely been compared on one large
document collection. Zobel and Moffat’s (1998) results were inconclusive. Some
formulae performed better for some queries, but none was consistently superior.
Bartell, Cottrel, and Belew (1998) report experiments on a system that automatically
adjusts the parameters in a similarity matching formula. These adjustments are based on
initial user judgements of the desired ranking of a limited training document set. This
method is claimed to equal or exceed the performance of all “classic” similarity measures.
Parameter adjustments to similarity formulae, such as Bartell, Cottrel, and Belew (1998),
may be compared to other methods of query modification. These may also alter the
weights of search terms, as we will see in the next section.
2.2.4 Query modification methods.
It is well known that retrieval performance is enhanced when, after a first retrieval
attempt, the user indicates the most relevant documents and the query is repeated. This
procedure is known as relevance feedback (van Rijsbergen 1979, Salton and McGill 1983,
Baeza-Yates and Ribeiro-Neto 1999). One effect of relevance feedback is to augment the
user’s query with distinctive terms from the relevant documents indicated that do not
occur in the original query.
Query expansion has long been suggested as a way of coping with the word mismatch, or
vocabulary, problem in IR (Xu and Croft 1996). Xu and Croft (1996) note that there is
little evidence that a general purpose thesaurus can improve search effectiveness, and
propose local and global methods of document analysis. In local document analysis, the
results of an initial query are analysed, whereas in global document analysis, the entire
corpus of documents is analysed. Xu and Croft (1996) report a global analysis to discover
phrases that co-occur within a window of one to three sentences. For example, airline
pilot might be associated with plane, air, and traffic. These concepts are then stored in an
INQUERY (e.g. Allan et al. 1998) database. Queries are expanded when they are run
against this database.
Xu and Croft (1996) also discuss a related approach based on local document analysis.
Here, INQUERY is used to retrieve the top n ranked paragraphs. Concepts (noun phrases)
are then ranked using a variant of the tf*idf3 measure. The top ranked concepts may then
be used to augment the user’s query. Although this approach does require an analysis of
the document collection, this is only needed once.
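The tf*idf heuristic referred to above can be sketched directly from its definition, term frequency times the inverse of the term's document frequency in the collection:

```python
import math

# Sketch of tf*idf: a term's frequency in a document, times the log
# of the inverse proportion of collection documents containing it.
def tf_idf(term, doc_tokens, collection):
    tf = doc_tokens.count(term)
    df = sum(1 for doc in collection if term in doc)
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

# Toy collection: "plant" is common, "music" is rare, so "music"
# scores higher as an index term.
collection = [["plant", "life"], ["plant", "factory"], ["music"]]
```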
3 The tf*idf heuristic states that the frequency of a term in a text times the inverse of its occurrence in the collection indicates its importance. See Baeza-Yates and Ribeiro-Neto 1999, p29.
Xu and Croft (1996) found that local analysis is more effective than global. However, a
combination of global techniques on the local set of documents is more effective and
predictable than simple local feedback.
A related technique due to Gauch and Wang (1997) automatically generates a similarity
thesaurus (Section 2.2.5). They use linguistic corpus analysis techniques to produce a
matrix of term-term similarities. These are then used to automatically expand queries
within SMART (Buckley 1995). Performance improvements of up to 23% are claimed for
the TREC-5 data.
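In the spirit of such a similarity thesaurus, a query might be expanded from a term-term similarity matrix as below. The matrix values are invented for illustration, not derived from any corpus analysis:

```python
# Toy term-term similarity matrix (values invented): each query term
# is augmented with its most similar terms above a threshold.
sim = {
    "plane": {"air": 0.8, "traffic": 0.6, "banana": 0.1},
}

def expand(query, sim, threshold=0.5):
    expanded = list(query)
    for term in query:
        for other, score in sim.get(term, {}).items():
            if score >= threshold and other not in expanded:
                expanded.append(other)
    return expanded
```

The threshold controls the trade-off noted throughout this section: expanding too freely reintroduces the vocabulary problem in reverse, by matching documents on weakly related terms.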
Wilbur and Coffee (1994) considered two kinds of queries that may be applied to a
database. The first was written by a searcher to express an information need. The second
was a request for documents most similar to a document that the searcher had already
judged as being relevant. They found the similarity based query to be more effective than
the one expressing the information need. This provided a justification for document
neighbouring procedures (pre-computation of closely related documents). Wilbur and
Coffee (1994) showed that this feedback-based method provides significant improvement
in recall over traditional linear searching methods, and even appears superior to
traditional feedback methods in overall performance.
2.2.5 Thesauri
Thesauri are valuable resources for Information Retrieval systems. They contain a
number of terms and synonyms organised into a classified hierarchy. The principal
purpose of an IR thesaurus is to provide a controlled vocabulary. That is, to limit the
number of words used in indexing documents by replacing equivalent terms by their
synonyms, or class identifiers. This reduces the size of the index, with a consequent effect
on storage and retrieval efficiency. A similar operation is also applied to user’s queries, so
that Information Retrieval performance is not compromised.
Roget's thesaurus (Appendix VII) is very different from a typical IR thesaurus (Srinivasan
1992), since it was designed as a general tool to help writers express themselves, whilst
IR thesauri are usually domain specific and contain synonyms rather than the broader
word relationships used in Roget.
Thesauri have been of interest to IR for many years. Srinivasan, writing in 1992, refers to
a then considerable literature on thesaurus construction. Like much of IR, methods may be
manual or automatic, and the automatic techniques may be divided into statistical and
knowledge based approaches.
A manual thesaurus is built by subject experts who collect terms and their synonyms, and
group them hierarchically. This is clearly costly in terms of person time to build, and,
once built, manual thesauri need to be maintained as new terminology comes into use.
In principle, a less costly alternative would be an automatically constructed statistical
thesaurus. In this method, the whole document collection is analysed for correlated
terms, which are terms that co-occur in relation to a common context. These make up the
thesaurus and are used subsequently to expand user queries. The objective here is to
collect terms that differentiate most amongst the candidate documents.
Grefenstette (1994) has described SEXTANT, a program that constructs thesauri
automatically using knowledge-poor techniques. The techniques are knowledge-poor in
that they do not depend on domain knowledge. Thesauri are constructed by analysing a
large number of texts in one domain. The text corpus is split into individual words or
tokens and then tagged with part of speech information. The word senses are
disambiguated using a separate statistical program. Dependencies between words are then
identified using local lexical-syntactic relationships, such as identifying nouns that have
modifying adjectives. Noun similarity between the dependent fragments is then calculated
using a weighted Jaccard measurement. This list of related nouns is then pruned retaining
only those which Grefenstette (1994) terms “reciprocal near neighbors”.
Grefenstette (1994) presents several examples of thesauri produced using his methods.
These are claimed to resemble hand-built thesauri. However, whilst knowledge-poor,
Grefenstette’s (1994) method requires some linguistic knowledge.
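Grefenstette's full weighting scheme is more elaborate, but the weighted Jaccard measure itself can be sketched as below. The noun context profiles (attributes and weights) are hypothetical, standing in for the syntactic contexts SEXTANT extracts.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity between two attribute->weight mappings.

    Attributes stand for the syntactic contexts (e.g. modifying adjectives)
    in which each noun was observed; weights for how strongly each context
    is associated with the noun.
    """
    attrs = set(a) | set(b)
    num = sum(min(a.get(x, 0.0), b.get(x, 0.0)) for x in attrs)
    den = sum(max(a.get(x, 0.0), b.get(x, 0.0)) for x in attrs)
    return num / den if den else 0.0

# Hypothetical context profiles for two nouns (weights are illustrative).
car = {"fast": 0.9, "red": 0.4, "parked": 0.7}
truck = {"fast": 0.5, "heavy": 0.8, "parked": 0.6}
print(round(weighted_jaccard(car, truck), 3))
```

Nouns sharing many strongly weighted contexts score near 1; disjoint profiles score 0, after which reciprocal-near-neighbour pruning would be applied.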
A technique that requires no knowledge of language is Latent Semantic Indexing (Dumais
et al. 1997). Latent Semantic Indexing (LSI) is a statistical technique designed to improve
Information Retrieval by addressing the vocabulary problem where different terms are
often used to refer to the same concept (Furnas et al. 1987). This is done by automatically
recognising that related terms are frequently found in similar contexts. Thus, for example
“laptop” and “portable” are often found near to “computer”.
LSI works by constructing word co-occurrence matrices for a document set using a
technique called “singular value decomposition”. This produces an emergent set of virtual
“concepts” and their strengths that is used for query-document and document-document
similarity in an extension of the vector-space model (Salton and McGill 1983). That is,
the document set is indexed using the concepts identified. This index is then searched by
mapping user queries into this reduced dimensionality concept space prior to using
conventional similarity matching.
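The reduced “concept” space can be illustrated with a truncated singular value decomposition. The tiny term-document matrix below is invented for the example and far smaller than any realistic collection.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents);
# the data are invented for illustration.
terms = ["laptop", "portable", "computer", "fruit", "banana"]
A = np.array([
    [1, 0, 0],   # laptop   appears only in document 0
    [0, 1, 0],   # portable appears only in document 1
    [1, 1, 0],   # computer appears in both computing documents
    [0, 0, 1],   # fruit    appears only in document 2
    [0, 0, 1],   # banana   appears only in document 2
], dtype=float)

# Truncated SVD: keep only the k strongest emergent "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]          # terms mapped into concept space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "laptop" and "portable" never co-occur, yet both occur with "computer",
# so they end up close together in the reduced concept space.
print(cos(term_vecs[0], term_vecs[1]))   # high
print(cos(term_vecs[0], term_vecs[4]))   # near zero
```

Queries would be projected into the same reduced space before conventional similarity matching, which is what lets LSI relate terms that never co-occur directly.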
Success has been reported with LSI in several areas including the TREC-3 information
filtering task (Dumais, 1995), and cross language information retrieval (Dumais et al.
1997). This is particularly interesting since term-term similarity is of little use in cross
language IR as the query term is unlikely to be identical to the document term if they are
in different languages.
The principal problem with LSI is that it is essentially a machine learning technique that
needs to analyse many documents for success. This may lead to a combinatorial explosion
whose time requirements would make LSI impractical for large document collections.
Rada and Bicknell (1989) report experiments on ranking documents with a thesaurus.
Their work was in the medical domain and used MeSH (Medical Subject Headings) as the
thesaurus to encode queries and documents from the MEDLINE bibliographic database.
Rada and Bicknell (1989) treated MeSH as a semantic net (Quillian 1968). Their particular
contribution was to introduce a procedure called DISTANCE that counted edges between
connected terms, differentially weighting those that were less specific (“broader than”).
Rada and Bicknell noted that it was not practical to calculate distances between all
documents and the query, since MEDLINE contained five million documents. Rather,
they applied their procedure to the output from searching MEDLINE in the usual way.
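The DISTANCE idea of counting differentially weighted edges can be sketched as a weighted shortest-path search. The concepts and weights below are invented, not taken from MeSH; the higher weight on the less specific link follows the spirit of Rada and Bicknell's scheme.

```python
import heapq

# Hypothetical fragment of a MeSH-like hierarchy (invented terms/weights).
# The less specific ("broader than") link is weighted more heavily.
edges = {
    ("disease", "infection"): 1.0,
    ("infection", "viral_infection"): 1.0,
    ("infection", "bacterial_infection"): 1.0,
    ("disease", "neoplasm"): 1.5,       # less specific link, weighted higher
}

graph = {}
for (a, b), w in edges.items():
    graph.setdefault(a, []).append((b, w))
    graph.setdefault(b, []).append((a, w))   # links traversable both ways

def distance(start, goal):
    """Weighted shortest path between two concepts (Dijkstra's algorithm)."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")

print(distance("viral_infection", "bacterial_infection"))  # 2.0
```

Smaller distances indicate closer concepts, so documents can be ranked by their distance from the encoded query.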
WordNet (Miller et al. 1990, 1993, Fellbaum, 1998) is an on-line lexical database. It is
hierarchically structured, and may be used as a general semantic thesaurus of English.
Richardson and Smeaton (1995) used a WordNet based distance function to re-rank a set
of documents retrieved by traditional means within the TREC context. Their results were
disappointing, possibly as a result of deficiencies within WordNet, or due to problems of
word sense ambiguity (see sections 2.2.6, 2.4.2).
Gonzalo et al. (1998) report that indexing with WordNet thesaural categories (“synsets”)
can improve IR retrieval performance by up to 29%. Their experiments were based on
SEMCOR, a subset of the Brown corpus (Francis and Kucera 1979) that has been
manually tagged with WordNet senses.
2.2.6 IR Evaluation and document collections.
Experimental evaluation is an essential aspect of IR. Much of this has focused on
experimental test collections where document and query relevance is marked up by hand.
This serves as a benchmark against which to evaluate systems.
Document collections are the main data source for IR evaluation and research. Because
specialists in the subject of a collection manually code the characteristics of a sample of
texts, collections provide a baseline against which to assess the performance of systems
and algorithms, and also permit systems and algorithms to be compared with one
another. Sanderson (1996) has produced a comparison of
many of the better known collections. The Cranfield collection for example is made up of
1400 abstracts relating to aeronautics (approx. 1.5Mb). There are 325 natural language
text queries included. Human experts have identified which of the articles are relevant to
which query.
Sanderson (1996) cites concerns that the small size of many IR test collections may
influence the applicability of IR findings. In particular Blair and Maron (1985) found that
retrieval effectiveness did vary with the size of document collection. To address the issue
of collection size, Sanderson (1996) used the “Reuters 22713” for his work on lexical
ambiguity (Section 2.2.2). The Reuters 22713 is a collection of articles from the Reuters
newswire that was collected by Carnegie Group in 1988, and subsequently modified by
Lewis (1992) for his text categorisation research.
TREC (Text REtrieval Conference) is the modern test bed for IR system comparison. The
basis of TREC is that a central organisation builds the test collection, and researchers
around the world use it to test their own methods and systems, reporting back to the
conference with results presented in a standardised way. The TREC collection is far
larger than any previous test collection, being about 2.5Gb of heterogeneous data. This
includes the WSJ collection as a subset of the TREC-3 category B data. This consists of
550 megabytes of articles from the Wall Street Journal.
There have been eight TREC conferences covering a variety of tasks since 1991. Systems
that use automated statistical techniques have achieved consistent success in TREC.
Examples include SMART (e.g. Buckley et al. 1998), INQUERY (Allan et al. 1998), and
OKAPI (e.g. Robertson, Walker, and Beaulieu 1998).
Zobel (1998) has pointed out that, as no one has categorised the TREC collections for
relevance, precise calculations of recall and precision may not be made. The many
terabytes of information on the Internet World Wide Web will pose problems that have
not been addressed in the past by Information Retrieval (Ellman and Tait 1996). One
possible help here would be an understanding of the statistical properties of language
known as Zipf’s Law.
2.2.7 Zipf’s Law
Zipf (1949) observed that there is a regular power law relationship between the frequency
of some event as a function of its rank. Zipf (1949) presented many areas where this
relationship can be observed including the size of cities, and, most importantly, word
frequency. Zipf (1949) noted that if the frequency of occurrence of each word in a text is
ranked, then the frequency of the second most common word will be a half that of the
most common word and that of the third most common will be a third of the most
common word, and so on. That is:

Frequency of Rank N ≈ (Frequency of Rank 1) / N
From this it follows that if the Log of Word Frequency is plotted against Log of Rank a
straight line with a slope of minus one will be obtained. This implies that the frequency of
the most common word is equal to the rank of the word whose frequency of occurrence is
equal to one. Zipf (1949) showed the same phenomena occur for English and several
other languages. Furthermore, Zipf (1949) noted the same phenomena in many other
areas of language, such as the distance between identical words in a text (Zipf 1949, p41.)
Data on Zipf’s law are summarised by Li (2000).
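Zipf's prediction can be checked numerically. In this minimal sketch the frequency of the most common word is an invented figure; the point is the rank-frequency relationship and the slope of minus one on log-log axes.

```python
import math

# Zipf's law predicts frequency(rank n) ~ frequency(rank 1) / n.
f1 = 12000                      # frequency of the most common word (illustrative)
expected = [f1 / n for n in range(1, 6)]
print(expected)                 # [12000.0, 6000.0, 4000.0, 3000.0, 2400.0]

# On a log-log plot the points lie on a line of slope -1:
slope = (math.log(expected[4]) - math.log(expected[0])) / (math.log(5) - math.log(1))
print(round(slope, 6))          # -1.0
```

Counting word frequencies in any sizeable corpus and plotting log frequency against log rank gives an empirical line close to this ideal one.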
2.2.8 Summary
Information Retrieval is concerned with identifying documents in a collection that are
similar to a user’s query. Methods may be characterised as manual, or automatic. Manual
methods are costly, and there is considerable IR research in automatic methods. These
may be statistical, or knowledge based. Although the latter may give better results in
small document collections, they are too expensive to cope alone with large ones.
However, they may be used in combination as a post-processing step with purely
statistical methods. Such automatic query processing has been shown to improve IR
performance.
The vocabulary problem, where a searcher’s word does not match that in a relevant
document, is fundamental in IR. This may be addressed using a thesaurus, and such
thesauri may be automatically created. Generic thesauri have not been shown to be of
value in IR, although domain specific ones may be useful.
2.3 Case Based Reasoning
2.3.1 Introduction
Case Based Reasoning is a problem solving method that seeks to solve new problems by
reference to previous successful solutions to related ones. CBR has been used in
applications such as fault diagnosis, and building construction. (Kolodner 1983, Aamodt
and Plaza 1994, Watson 1997, Lenz et al. 1998)
CBR uses a particular development methodology in which the system’s performance
improves as more appropriate cases are added to the store of known problems and
solutions – known as the case base.
Aamodt and Plaza (1994) described this method as a CBR cycle, which is represented
graphically in Figure 2-2 below.
Figure 2-2: Case Based Reasoning Cycle
Aamodt and Plaza (1994) describe the CBR cycle as being made up of four steps:
① RETRIEVE the most similar cases
② REUSE the knowledge in those cases to solve the problem
③ REVISE the proposed solution
④ RETAIN the parts of the experience likely to be useful for future work
CBR is frequently applied to the development of Knowledge Based Systems, as it
potentially eliminates at least three problems with that area (Watson and Marir 1994).
Firstly, CBR does not require a model or understanding of the target domain. Secondly,
CBR does not require an explicit knowledge elicitation phase with its dependence on
skilled knowledge engineers, although it does require a set of appropriate cases to initially
populate its case base. Thirdly, CBR may be efficient, since collections of cases may be
stored using database technology (Shimazu et al. 1993). This is considerably more
efficient than the storage of large rule bases in flat files typically associated with
Knowledge Based Systems.
The key issues when building a CBR system are:
① REPRESENTING a case to capture its true meaning.
② INDEXING cases for rapid retrieval.
③ ASSESSING the similarity between the test case and the stored cases.
④ ADAPTING a previously successful solution to a new problem.
The similarity between the problems faced by CBR and IR has long been noted (e.g.
Callan and Croft 1993, Rissland and Daniels 1996).
2.3.2 Textual Case Based Reasoning
Lenz, Hübner and Kunze (1998) have coined the term “Textual CBR” for systems that
seek to apply case based reasoning technology to textual documents as opposed to highly
structured cases. T-CBR systems face a number of problems associated with textual
representation that differentiate them from other CBR systems that represent problems
and their stored solutions as simple data structures. These problems include the
well-known issues of structural and semantic ambiguity that apply to many areas of natural
language processing.
Example areas where text cases are used to derive solutions from previous successes
include case law, and medical reports.
The Law is a natural application area for Case Based Reasoning (Ashley and Rissland
1988), since previous cases are often used to supplement reasoning deductively with legal
rules. Furthermore, lawyers and judges reason analogically with precedent cases.
Consequently, KBS style rule predicates are simply not sufficiently well defined for the
inference of correct legal decisions.
Legal cases are encoded as text, so the law is a special candidate for text based CBR. This
has been explored by Ashley who developed a program known as HYPO. This explored
adversarial, case-based reasoning with cases and hypotheticals in the legal domain
(Ashley 1990, reviewed in Rissland and Daniels 1996).
Less specific approaches to T-CBR have been used by, for example, Kunze and Hübner
(1998) and Burke et al. (1997). Kunze and Hübner (1998) used a combination of shallow
NLP techniques and semi-structured documents in the FallQ document management
project and in the ExperienceBook project, which provided support to UNIX system
administrators. Burke et al. (1997) exploit the question-answer format in their FAQ
Finder system, which finds FAQs on the Internet that correspond to a user's question. FAQ Finder
uses a combination of statistical methods, and shallow WordNet based semantics.
Lenz (1998) has compared Textual CBR with methods used in Information Retrieval. His
results are summarised in Table 2-2 below.
Table 2-2: Comparison of IR and Textual CBR (reproduced from Lenz 1998)

Representation of documents. IR: sets of index terms obtained from statistical
evaluations. Textual CBR: sets of features established during knowledge acquisition.
Similarity measure. IR: based on term frequency weighting. Textual CBR: based on
domain theory.
Application to new domains. Textual CBR: requires knowledge acquisition.
Domain knowledge. IR: not considered.
Non-textual information. IR: cannot be used. Textual CBR: can be integrated.
Well designed: not yet addressed sufficiently.
Lenz’s (1998) view of IR is certainly a caricature of extreme positions, since document
representations are not always statistical, similarity measures are not always based on
term frequency, and non-textual information can be integrated. However, the views on IR
shown in Table 2-2 do represent the majority of conventional Information Retrieval
systems, which, for example, consistently perform better in TREC (Voorhees and Harman
1998). Lenz (1998) also recognises the mutual importance to IR and T-CBR of what he
terms the “Ambiguity problem” and the “Paraphrase problem”. In the ambiguity
problem, one or more keywords in compared documents may be highly similar, but the
actual meanings of the documents may differ, whereas in the paraphrase problem the same
meaning may be expressed using completely different expressions. These phenomena
were termed lexical ambiguity and the vocabulary problem in Section 2.2, which covered
Information Retrieval.
2.3.3 Summary
Case Based Reasoning is a problem solving method that relies on identifying the
similarity between a problem and a previous one for which a solution has been identified.
T-CBR is a sub-domain that uses textual data. Generic approaches to text similarity may
be useful in themselves, or may augment domain specific knowledge.
2.4 Natural Language Processing
2.4.1 Introduction
Natural Language Processing (NLP) is also commonly known as computational
linguistics. It “is a discipline between linguistics and computer science which is
concerned with the computational aspects of the human language faculty” (Radev 1997).
NLP is commonly divided into different levels of analysis. Liddy (1998) gives the
following succinct breakdown of the field’s sub areas:
Table 2-3: Sub-domains of NLP. (Reproduced from Liddy 1998)

Phonological: interpretation of speech sounds within and across words
Morphological: componential analysis of words, including prefixes, suffixes and roots
Lexical: word level analysis including lexical meaning and part of speech analysis
Syntactic: analysis of words in a sentence in order to uncover the grammatical structure
of the sentence
Semantic: determining the possible meanings of a sentence, including disambiguation of
words in context
Discourse: interpreting structure and meaning conveyed by texts larger than a sentence
Pragmatic: understanding the purposeful use of language in situations, particularly those
aspects of language which require world knowledge
Like IR, NLP is a well-established research area complete with textbooks (e.g. Allen
1995, Charniak and Wilks 1976, and many more), collections of important papers (Grosz,
Spärck-Jones, and Webber 1986), and sets of Frequently Asked Questions (Radev 1996)
that point to many more resources.
NLP encompasses very many areas of active research, including tagging a text’s
grammatical parts of speech (e.g. Brill 1992); the derivation of a word’s base form from
inflected variants (morphology, e.g. Antworth 1993); identifying the syntactic structures
of sentences (parsing); the generation of coherent natural language for program output;
and related studies of semantics and pragmatics.
A comprehensive account of text processing must either address the component areas of
syntax, semantics, and pragmatics with their unresolved problems, or avoid them. Since
all of these problems are sufficiently difficult as to be areas of scientific inquiry in their
own right there have been few approaches to processing whole texts in NLP, as that
requires solutions to many unresolved problems.
There are two basic reasons for considering whole texts from an NLP perspective. One
reason would be to generate texts as program output, whilst the other would be to analyse
text to determine its meaning. Text generation is simpler than analysis, as it is possible to
output ambiguous words or phrases and assume that readers will interpret the correct
sense from context. This does not resolve the issues of style or argument structure, which
text generation approaches must address. Examples of such approaches include text
grammars (van Dijk 1977), and Rhetorical Structure Theory (Mann, and Thompson
1988). Text grammars are used in restricted domains such as business correspondence
(Tait and Ellman 1999) where ad hoc associations between text structures can be
encoded. Rhetorical Structure Theory by contrast looks at formal relationships between
classes of discourse elements that can be justified at the speech act level (e.g. Austin
1962, Searle 1969, Ellman 1983). This leads to discourse-based approaches to natural
language generation (Dale et al. 1998).
In the next section we will examine work in NLP that is relevant to the idea sketched in
Chapter 1. That is, determining text similarity using lexical chains. In outline, we will
proceed as follows: Firstly, in order to determine if there is a coherence relation between
words, we need to identify the sense in which a word is being used. This problem is
related to the earlier discussion on lexical ambiguity in Section 2.2.2. Once a word sense
has been identified, we need to calculate the strength of the connection to a candidate
lexical chain. This is done through a discussion of semantic similarity. Again this
discussion parallels that on text similarity in Section 2.2.3. Next we consider sense tagged
corpora, which are used for evaluation and comparison of alternate approaches to word
sense disambiguation. This has a clear relationship with the issue of evaluation and test
collections in Section 2.2.6. Finally we look more comprehensively at work on lexical
chains as this forms the backbone of this study.
2.4.2 Word Senses and Classification Schemes
Many words have multiple senses irrespective of which classification scheme is used.
This leads to the word sense disambiguation problem. Word sense disambiguation is an
active sub field of NLP. Recent overviews and insights into the literature are given by Ng
and Zelle (1997), and especially Ide and Véronis (1998) who give approximately 200
references into the area.
Multiple word senses may be classified as homonymous or polysemous for which
definitions were given in Section 1.6. Collocate ambiguity is also important. We will look
briefly at examples of each.
For an example of collocate ambiguity consider the word “cone” in phrases such as pine
cones, ice cream cones, and rods and cones. Collocate ambiguity can be resolved by
reference to immediate local context. Lesk (1987) used several machine-readable
dictionaries (e.g., Webster's 7th, Collins, OED) and looked for word overlaps with the
dictionary entries within a 10 word window of the target. Lesk (1987) reported a 50-70%
success rate in selecting the correct word sense using this technique.
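A simplified version of Lesk's gloss-overlap procedure can be sketched as below. The glosses and sentence are invented stand-ins for entries from a real machine-readable dictionary.

```python
def lesk(word, context, glosses, window=10):
    """Simplified Lesk: choose the sense whose gloss shares the most words
    with the words surrounding the target occurrence."""
    i = context.index(word)
    nearby = set(context[max(0, i - window): i + window + 1]) - {word}
    return max(glosses, key=lambda s: len(set(glosses[s].split()) & nearby))

# Invented mini-dictionary for two senses of "cone".
glosses = {
    "pine_tree": "evergreen tree with needles and woody seed cones",
    "ice_cream": "frozen sweet dessert served in a wafer cone",
}
context = "she bought a sweet frozen dessert in a crisp cone".split()
print(lesk("cone", context, glosses))
```

The local words “sweet”, “frozen” and “dessert” overlap with the ice-cream gloss but not the botanical one, so the ice-cream sense is selected.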
Homographic ambiguity presents few problems to readers, and this type of ambiguity is
closest to the notion of lexical ambiguity discussed in Section 2.2.2. For example,
consider the senses of homonyms such as “rowing” as in:
1. I saw a couple rowing outside the pub
2. I saw a couple rowing the boat
Polysemous ambiguity is problematic (e.g. Kilgarriff, 1997) as terms such as Business,
and Point may have many discernible senses. Exactly how many depends on the
classification system used. Usually these are based on machine-readable dictionaries and
thesauri, since these are accessible, and generally independent of theoretical bias.
Common works are:
• LDOCE: Longman's Dictionary of Contemporary English
• Webster's 7th
• Roget's Thesaurus
• Princeton's WordNet
Unfortunately, these sources are not guaranteed to contain the same words, or to
categorise their senses consistently with each other. The reasons for this are partly
functional, and partly economic. Kilgarriff (1997) points out that dictionary publishers
emphasise the number of entries in their dictionaries as a marketing ploy. Consequently
there is pressure to augment the number of sense distinctions identified. Conversely, a
dictionary aimed at language learners requires fewer, simpler entries than one aimed at
proficient native speakers.
WordNet (Miller et al. 1990) is often used in NLP systems since it is both large, and
easily available for research purposes. WordNet is a hierarchical lexical database that is
fully indexed. However, WordNet’s quality is “variable”, and its hierarchy is uneven
(Hearst and Schütze 1996). In fact, it is not clear that WordNet is the best classification
system. Voorhees (1994) indicated that she had difficulty selecting the correct word sense
from WordNet. This may account for the performance degradation she observed in her
TREC experiments.
Roget's thesaurus (Appendix VII) is also useful, since it has an implicit structured
hierarchy that is quite evenly balanced. Although smaller than WordNet, it still has 350
pages of text with 1024 entries, which are known as “heads”.
Although this number has varied over different editions (e.g. the 1987 edition has 990
heads), Roget's thesaurus has been refined over more than 200 years. It is also available
electronically: Project Gutenberg distributes the 1911 edition, since this is out of copyright.
Yarowsky (1992) used Roget's Thesaurus (1977 edition) in a lexical disambiguator based
on a statistical model of the Roget categories. He achieved an accuracy of up to 92% on
12 polysemous words examined in a 50 word window, which demonstrates Roget’s
potential utility for the word sense disambiguation problem.
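The flavour of such a category-based statistical disambiguator can be sketched as a naive Bayes scorer over Roget-style categories. The categories, counts and add-one smoothing below are illustrative assumptions, not Yarowsky's exact model, which also weights words by salience.

```python
import math

# Illustrative word-given-category counts, standing in for statistics
# gathered over wide contexts of Roget category members.
category_word_freq = {
    "ANIMAL":  {"fur": 30, "claws": 20, "bank": 2, "river": 5},
    "FINANCE": {"money": 40, "bank": 25, "interest": 20, "river": 1},
}

def category_score(category, context_words):
    """Sum of smoothed log P(word | category) over the context window."""
    counts = category_word_freq[category]
    total = sum(counts.values())
    vocab = {w for c in category_word_freq.values() for w in c}
    score = 0.0
    for w in context_words:
        # add-one smoothing so unseen words do not zero the product
        score += math.log((counts.get(w, 0) + 1) / (total + len(vocab)))
    return score

# Disambiguate an occurrence of "bank" by its surrounding words.
context = ["money", "interest", "bank"]
best = max(category_word_freq, key=lambda c: category_score(c, context))
print(best)
```

The category whose members best predict the context words wins, which here selects the financial reading of “bank”.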
2.4.3 Semantic Similarity
Deciding how close words are in meaning is known as the semantic similarity problem.
This is critical when considering whether to link them into lexical chains. There have
been two main approaches to the semantic similarity problem: the first is based on
information content, and the second on calculating distance in a semantic hierarchy.
A psychological experiment by Miller and Charles (1991) has provided an evaluation
metric for word similarity studies. They presented subjects with a list of thirty words pairs
and asked for them to be rated for “similarity in meaning” on a scale from zero to four. A
rating of zero implied the words were completely dissimilar and four that the words were
perfect synonyms.
Resnik (1995) replicated this task (on twenty-eight word pairs⁴) finding a correlation of
r=0.9011 between the similarity judgements his experiment found, and those found by
Miller and Charles (1991). This is considered a reasonable upper bound on what to expect
from a computational procedure (Resnik 1995, Jiang and Conrath 1997, McHale 1998).
⁴ Two word pairs include the word “Woodland”, which was not then in WordNet.
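The correlations reported in this literature are Pearson coefficients over paired similarity scores. A minimal sketch with invented ratings for five hypothetical word pairs (not the Miller and Charles data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between paired scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented ratings on a 0-4 scale: human judgements versus a
# computational similarity measure for the same five word pairs.
human   = [3.9, 3.0, 2.4, 1.1, 0.3]
machine = [3.7, 3.2, 2.0, 1.4, 0.5]
print(round(pearson(human, machine), 3))   # high positive correlation
```

A computational measure tracking human judgement closely yields r near 1; the human replication figure of r=0.9011 bounds what such measures can be expected to achieve.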
Distance Based Approaches
Distance based approaches to computing semantic similarity consider the information
source as a semantic net, and count the number of links (or edges) between two concepts
as a measure of semantic distance. This idea is based on a model of human memory
introduced by Collins and Quillian (1969). They hypothesised that people store concepts
within a hierarchical structure and tested this using reaction time experiments.
Rada and Bicknell (1989) used conceptual distance in the MeSH (Medical Subject
Headings) thesaurus to rank MeSH encoded queries against documents from the
MEDLINE bibliographic database. They found significant correlation between the human
rankings, and those generated by their edge-counting algorithm.
If semantic similarity is based on counting the shortest path between two concepts (Rada
and Bicknell 1989), the taxonomy needs to have edges of equal length and value. If not,
suitable weights need to be applied. For example, relative conceptual density in the
WordNet hierarchy has been used in a word sense disambiguation task (Agirre and Rigau
1996). Resnik (1995) replicated the Miller and Charles (1991) experiment using simple
WordNet edge counting (and other techniques), and found poor correlation (0.66, see
Table 2-4 below). Note however that McHale (1998) found far better correlation using
edge counting from Roget’s Thesaurus. Thus it appears that Roget’s thesaurus is at least
as well suited to the semantic similarity task as WordNet.
Information Based Approaches
The information content approach to semantic similarity considers the extent to which
two nodes in a hierarchical concept space share information. The information content of
each node is considered as:

IC(c) = −log P(c)

where P(c) is the probability of encountering concept c.
The similarity between two concepts is then the information content of the node that is
the lowest upper bound amongst those that subsume both concepts. Formulae are given
in Jiang and Conrath (1997).
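Resnik's formulation can be sketched over a toy taxonomy; the concepts and probabilities below are invented for illustration, whereas a real system would estimate P(c) from corpus frequencies.

```python
import math

# Tiny hypothetical taxonomy: P(c) for each concept (invented values);
# ancestors subsume descendants, so P grows towards the root.
prob = {
    "entity": 1.0,
    "vehicle": 0.10,
    "car": 0.04,
    "bicycle": 0.03,
    "animal": 0.20,
    "dog": 0.05,
}
parents = {
    "vehicle": "entity", "car": "vehicle", "bicycle": "vehicle",
    "animal": "entity", "dog": "animal",
}

def ancestors(c):
    out = {c}
    while c in parents:
        c = parents[c]
        out.add(c)
    return out

def ic(c):
    """Information content IC(c) = -log P(c)."""
    return -math.log(prob[c])

def resnik_similarity(a, b):
    """IC of the most informative concept subsuming both (Resnik 1995)."""
    common = ancestors(a) & ancestors(b)
    return max(ic(c) for c in common)

print(round(resnik_similarity("car", "bicycle"), 3))   # IC("vehicle")
print(round(resnik_similarity("car", "dog"), 3))       # IC("entity"), i.e. 0
```

Concepts whose lowest common subsumer is specific (hence improbable) score highly, while pairs related only through the root score zero.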
Resnik (1995) defined concept frequency as the sum of word frequencies in that concept.
This however takes no account of word sense ambiguity. Richardson and Smeaton (1995)
corrected for this by dividing word frequency by the number of classes in which it is
found (i.e. its degree of polysemy and homonymy). However this estimation may be
unsound, since the distribution of word senses in the Senseval sense tagged corpus is not
linear (see Chapter 6).
Jiang and Conrath (1997) have proposed a combined approach to semantic similarity that
adds information content to edge counting. They estimate concept frequency using noun
frequency from SemCor – a WordNet sense tagged corpus (Miller et al. 1993). They note
that SemCor only includes half the words in WordNet, so it is unlikely that a word’s sense
frequencies will model its general usage.
Table 2-4 below (reproduced from McHale 1998 page 2) summarises results on semantic
similarity that have replicated Miller and Charles (1991).
Table 2-4: Correlation of similarity measurements

Similarity Method                      Correlation
Human judgements (replication)
WordNet: Information Content
WordNet: Edge Counting
WordNet: Jiang and Conrath
Roget: Information Content
Roget: Edge Counting
Roget: Intervening Words
Roget: Jiang and Conrath
These results should be interpreted cautiously since the number of word pairs used by
Miller and Charles (1991) is very small in comparison to the number of words in English
(roughly 100,000⁵). Consequently these data are subject to fluctuations in response to
minor adjustments to the thesaural hierarchy. For example, Jiang and Conrath (1997)
increased the degree of correlation to r=0.8654 by removing a single questionable
classification of the word “furnace” in WordNet.
⁵ The Concise Oxford Dictionary (9th edition) contains 140,000 definitions including collocations; 100,000 words is
consequently a conservative estimate for the number of individual words.
Edge counting does seem to provide a good measure of semantic similarity when applied
to MeSH (Rada and Bicknell 1989) and Roget (McHale 1998), whilst information content
performs better in WordNet (Jiang and Conrath 1997). Consequently, conceptual distance
as measured by edge counting could be considered a viable technique for the formation of
lexical chains.
2.4.4 Sense Tagged Corpora
A problem with the various approaches to word sense disambiguation is that the results
are rarely comparable, since evaluation uses different data sets. These will also have been
tagged with word senses to different degrees of accuracy. One option here is to use a
sense tagged corpus as a comparative benchmark.
A sense tagged corpus is a collection of texts where word senses have been marked by
human assessors.⁶ There are two common variations on sense tagged corpora. In the first
the preferred sense of every word in a complete text is indicated. The second type of
corpus is made up of isolated individual sentences (or paragraphs) in which only the
preferred sense of some particular word is indicated. This is known as the “lexical
samples” technique.
The full-text technique has the advantage that the task is more ecologically valid (see
Section 1.6). It represents an accurate picture of how most readers will encounter
ambiguous words. It does have the disadvantage that it is far more demanding since each
individual word must be looked up in a dictionary (and the entry compared to the word
sense in use). This makes it somewhat unlikely that individual words will occur
frequently enough to statistically represent actual usage.
The variety of words in this method also makes it difficult for human sense taggers to
become familiar with the individual dictionary entries. This means that sense tagged full
texts are more error prone than in the lexical sampling technique (Kilgarriff, and
Rosenzweig 2000).
⁶ Typically two assessors will mark up the samples independently. Where they disagree, a third arbitrates.
The lexical samples technique has the advantage that a corpus may contain a
representative sample of word meanings and usage frequencies. Its principal disadvantage
is that it is not a wholly natural task. Nonetheless, human assessors rarely have difficulty
identifying the intended sense, so the advantages of statistical validity and ease of
creation outweigh the disadvantages.
Few corpora are available for either sense-tagging task, since the activity of producing
such a corpus is both intellectually demanding, and time consuming. One of the best
available full text resources is known as SemCor (semantic concordance), which is
available as part of the WordNet distribution (Fellbaum et al. 1998). Senseval is a recent
corpus that has been created for the evaluation of sense tagging of lexical samples
(Kilgarriff, and Rosenzweig 2000).
SemCor is a set of two hundred articles from the Brown corpus (Francis and Kucera
1979) that have been tagged with WordNet senses (Fellbaum et al. 1998). SemCor suffers
from two faults as a sense tagged corpus. Firstly, the sense tagging is not sufficiently
accurate, and secondly, it is not a random sample.
Ng and Lee (1996) report manually retagging SemCor. They found an agreement of
approximately 67% between the two tagged versions, which is disappointingly low.
Leacock (1998 personal communication) reports that SemCor was tagged by different
individual graduate students and not subsequently independently retagged. In fact, the
quality control technique was to randomly sample every tenth word, and check the
existing classification. This method has the advantage of speed, but the disadvantage that
whoever checks the existing classification is firstly primed that the existing classification
is reasonable, and secondly does not study the dictionary entry to consider whether other
sense categories might be more appropriate.
Leacock also notes that SemCor is a linear, not random, extract from the Brown corpus.
Funding constraints had prevented completely tagging the corpus as was originally intended.
The Senseval corpus (Kilgarriff 1998) was a deliberate attempt to overcome the problems
with SemCor. Senseval used sense tags derived from an SGML-encoded machine-readable
dictionary called Hector. This was derived from an internal research project at
the Oxford University Press. The entries in Hector are extremely detailed, and include:
1. Surface Forms (Including Collocations)
2. Text Definitions
3. Part Of Speech
4. Examples Of Usage
5. Idiomatic Phrases
Since the dictionary entries are detailed, they identify nuances of meaning as polysemous
senses. That is, entries are defined to a high level of granularity, distinguishing senses
that would not often be considered distinct. This meant that it was difficult to identify in
which precise sense and sub-sense a word was used.
The Senseval corpus of lexical samples was manually sense tagged by two professional
lexicographers. In case of disagreement, a third arbitrated. This led to an overall
inter-tagger agreement of 90%, at the finest level of granularity, and more than 99% at the
coarse level.
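Inter-tagger agreement figures such as these can be computed by comparing sense labels pairwise. The sketch below is illustrative only: the sense labels and the convention that a "coarse" sense is the part of the label before the dot are invented, not Hector's actual scheme.

```python
def agreement(tags_a, tags_b):
    """Fraction of items on which two taggers assigned the same label."""
    assert len(tags_a) == len(tags_b)
    matches = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return matches / len(tags_a)

def coarse(tag):
    """Coarse-grained sense: the label before the first dot (invented scheme)."""
    return tag.split(".")[0]

tagger1 = ["bank.1a", "bank.2", "bank.1b"]
tagger2 = ["bank.1a", "bank.2", "bank.1a"]

fine = agreement(tagger1, tagger2)                      # fine-grained: 2 of 3 agree
coarse_agr = agreement([coarse(t) for t in tagger1],
                       [coarse(t) for t in tagger2])    # coarse-grained: all agree
```

At the fine level the taggers disagree on one sub-sense; collapsing to coarse senses removes the disagreement, which is why coarse-grained agreement figures are always at least as high as fine-grained ones.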
2.4.5 Lexical Chains
Lexical chains (Halliday and Hasan 1976, 1989, Morris and Hirst 1991) have been
applied to several different areas of computer based language processing. An example and
definition of lexical chains were given in Section 1.6. This section reports work on lexical
chains, paying particular attention to whether lexical cohesion based approaches perform
better than alternative (term based) methods.
The first published implementation of a lexical chaining program was for Japanese
(Okumura and Honda 1994). They used lexical chaining for word sense disambiguation
in a speech recognition task, and for text segmentation. Okumura and Honda (1994)
reported an accuracy of 66% on the word sense disambiguation task. They showed that
the lexical chaining process implicitly provides word sense disambiguation, since it
associates words in a text using relationships derived from a thesaurus. If a word is
ambiguous, it will have more than one entry in the thesaurus. If only one of these senses
plays a part in an association with another word, then the first word is disambiguated with
respect to that association, and is assumed to be used in that sense in the text.
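The implicit disambiguation described above can be sketched as follows. The toy thesaurus maps each word to the set of thesaural categories in which it appears; the words and category names are invented for illustration, not taken from any real thesaurus.

```python
# Invented mini-thesaurus: word -> set of thesaural categories (senses).
THESAURUS = {
    "bank":  {"finance", "river"},
    "money": {"finance"},
    "loan":  {"finance"},
}

def disambiguate(word, neighbour):
    """Return the single sense of `word` supported by an association with
    `neighbour`, or None if no unambiguous association exists."""
    shared = THESAURUS.get(word, set()) & THESAURUS.get(neighbour, set())
    if len(shared) == 1:
        return shared.pop()
    return None

sense = disambiguate("bank", "money")   # only the financial sense is shared
```

"bank" is ambiguous between two categories, but since only its financial sense participates in an association with "money", that sense is selected for the chain.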
StOnge (1995), and StOnge and Hirst (1998) described a lexical chainer for English using
WordNet as the lexical database. Linking relations between nouns were determined by
finding relations from WordNet. These include membership of the same WordNet set of
synonyms (“Synset”), hyponymy (IS-A relations), meronymy (part-whole relations), and
other relations simply determined through WordNet’s semantic net.
The chaining algorithm improved on the salience-based stack approach used by Okumura
and Honda (1994) by using recency to re-order the stack.
It should be noted that StOnge and Hirst (1998)’s chainer deals with nouns, or words that
may be morphologically reduced to nouns. This is because there is no clear relationship
between the noun, verb, adjective, and other hierarchies in WordNet.
StOnge and Hirst (1998)’s chainer was applied to the detection of malapropisms in text.
These words are correctly spelled, but inappropriate in their context. For example in the
following sentences:
1. The bees were attracted to the flour.
2. The bees were attracted to the flower.
It is clear that the author probably intended (2). However, no current spelling checker
could determine this. StOnge’s (1995) thesis was that words that could not be chained to
others could be inappropriate in their context. Since these words did not form chains, he
named them “atomic chains”. Unfortunately, StOnge detected approximately ten times
more atomic chains than were malapropisms, thus rendering the technique impractical for
malapropism detection.
StOnge (1995) attributes his negative results to inaccuracy in the chaining process. In
particular, he identifies two major problems, under and over-chaining. In under-chaining,
words that should be joined to an existing chain are omitted, whilst in over-chaining,
spurious associations between words are identified that cause them to be incorrectly
joined to an existing chain.
Hyponym: One of a group of terms whose meanings are included in the meaning of a more general term, eg spaniel
and puppy in the meaning of dog.(Chambers Dictionary)
Meronym: A word whose relation to another in meaning is that of part to whole, eg whisker in relation to cat.
(Chambers Dictionary)
Chapter 2
Literature Review
StOnge (1995) states that under-chaining may have four causes:
1. An inadequacy of WordNet's set of relations. For instance, child care and school
cannot be related using WordNet's relations.
2. A lack of connections in WordNet. For example, WordNet's relation set does
include an appropriate relation (substance meronym/holonym) that could link
beef stew and beef, but no such link exists in WordNet's graph.
3. A lack of consistency in the semantic proximity expressed by WordNet's links.
For example, in WordNet's graph, the shortest path between stew and steak has
6 links while the shortest path between Australian and millionaire has 4 links.
4. A poor algorithm for chaining.
StOnge (1995) also states that over-chaining might be caused whenever two words are
very close to each other in WordNet's graph while being distant semantically. This lack of
consistency in the semantic proximity expressed by WordNet's links often results in the
merging of two chains.
Stairmand (1996) also used WordNet to create a lexical chainer. His approach is based on
spreading activation rather than the linear approach initially suggested by Morris and
Hirst (1991). Like StOnge, Stairmand’s chainer checks nouns only with their near
neighbours for relationships that may be found in WordNet. However, rather than trying
to chain all the words in a text Stairmand aims to only identify chains that could reflect
the structure of a text as proposed by Morris and Hirst (1991). This makes Stairmand’s
work applicable to the text segmentation problem.
Stairmand (1996) also reports experiments on the text segmentation task. That is, the
division of a text into paragraphs. He compared his technique against that of Hearst’s
(1994) TextTiling algorithm and found TextTiling had superior performance. Stairmand
(1996) does not explain this. Similarly, Hearst reports that an earlier version of TextTiling
used a thesaurus but found better performance without one.
Stairmand (1996) and Stairmand and Black (1996) report experiments on Information
Retrieval using WordNet derived lexical chains. They indexed 90,000 news articles using
their chainer, and compared retrieval performance to SMART (Salton and Buckley 1990)
on twelve very simple queries based on topics used in TREC (Harman 1993). The
evaluation results were positive, although Stairmand and Black (1996) note that the
approach would not be suitable for general information retrieval, since only terms that
appear in WordNet can form lexical chains, and hence be indexed. Consequently, whilst
this study may be applicable to local document similarity assessment (Section 2.2.4) it is
not proposed as a general solution for Information Retrieval.
Barzilay and Elhadad (1997) have also implemented a lexical chaining algorithm that is
used for text summarisation. They suggest that good text summaries may be produced by
extracting sentences from texts that contain the “strongest” chains, where chain strength is
a function of chain length and homogeneity. They tag the text initially, and then segment
it using the TextTiling algorithm (Hearst 1994). Lexical chains are then derived using a
WordNet based algorithm similar to StOnge and Hirst’s (1998). Barzilay and Elhadad
(1997) however claim superior performance by adopting a lazy disambiguation strategy.
This involves maintaining all possible word senses until there is clear evidence which
word sense is preferred. No data are given that compare this approach with other text
summarisation techniques (e.g. Alterman 1991).
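Chain strength as a function of length and homogeneity can be sketched as below. The exact scoring formula is an assumption for illustration (one formulation associated with this line of work scores a chain as length times homogeneity, where homogeneity penalises chains made of many distinct words); definitions vary between papers.

```python
def chain_strength(chain):
    """Score a lexical chain as length x homogeneity, where homogeneity
    rewards repeated occurrences of the same members. An illustrative
    formulation; published definitions differ in detail."""
    length = len(chain)                 # total word occurrences in the chain
    distinct = len(set(chain))          # distinct member words
    homogeneity = 1.0 - distinct / length
    return length * homogeneity

strong = chain_strength(["car", "car", "wheel", "car", "wheel"])  # repetition scores well
weak = chain_strength(["car", "wheel", "engine"])                 # no repetition scores 0
```

Under this scoring, sentences containing members of the strongest chains would be the ones extracted into the summary.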
Kominek and Kazman (1997) report work on Jabber, an experimental system that allows
users to retrieve records of videoconferences based upon their (transcribed) verbal
contents. Jabber is able to summarise a set of related words, giving a name to each topic
using lexical chains. Users can then use this name to query or browse the stored records.
Jabber, or its subsystem Conceptfinder, uses nouns from WordNet to form clusters of
concepts that are then merged according to the “lowest common hypernym”, in an
approach reminiscent of Stairmand’s (1996). Kominek and Kazman (1997) report that
their Conceptfinder system is able to make distinctions among different senses of the
same words, and is able to summarise a set of related words. Kominek and Kazman
(1997) claim that early results are encouraging, but do not report any formal evaluation.
Green’s work (1996, 1997, 1999) used lexical chains to generate hypertext links between
the different paragraphs contained within newspaper articles and a variant of this method
to generate links between different articles. Similarity between the paragraphs was based
on their semantic content as derived from their lexical chains. The program to create these
was based on that of StOnge (1995).
Green's (1997) method for calculating the similarity of paragraphs inside one article
(“within article similarity”) is based on the relative presence of lexical chains in the
different paragraphs (where the chain has components in more than one paragraph).
Green (1997) also describes a technique for calculating the similarity of paragraphs in
separate articles (“between article similarity”). This is based on the relatedness of their
lexical chains as measured by the relative presence of their component WordNet synsets.
That is, he identifies the synsets inside the paragraphs and then compares these using the
Dice coefficient, which is a commonly used IR similarity measure (e.g. Van Rijsbergen 1979).
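The Dice coefficient over sets of synsets can be sketched as follows; the synset identifiers are invented for illustration.

```python
def dice(set_a, set_b):
    """Dice coefficient between two sets: 2|A intersect B| / (|A| + |B|)."""
    if not set_a and not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

# Paragraphs represented by the sets of synset ids their lexical chains contain.
para1 = {"synset:bee", "synset:flower", "synset:nectar"}
para2 = {"synset:flower", "synset:garden"}

similarity = dice(para1, para2)   # 2*1 / (3+2) = 0.4
```

The measure ranges from 0 (no shared synsets) to 1 (identical synset sets), making paragraph pairs directly comparable.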
Green (1999) further considered comparing two documents based on their lexical chains.
Green (1999) highlights that due to the hierarchical nature of WordNet, it is common to
find documents that contain a large number of related words. His solution to this is to
restrict lexical chains to those containing identical words, words in the same WordNet
synset, or words in adjacent synsets. Next, he represents each document using two
vectors, each containing an element for each of WordNet’s 60,577 noun synsets. The first
vector contains the weight of each synset when it appears directly in the document, and
the second vector contains the weight of that synset when it is one link away.
Green (1999) experimentally evaluated the quality of the hypertext links generated.
Disappointingly, the results from his lexical chain based procedure were not significantly
better than those derived by a standard IR, term based, approach. He attributes this failure
to the limitation faced by any lexical chaining program that a word not in the thesaurus
can not contribute to the representation of its meaning, in addition to problems with word
sense ambiguity. Green (1999) reports that there is no way to tell whether the lexical
chainer has disambiguated a word correctly, and has no data on the average number of
incorrect disambiguations. The effect of word sense ambiguity on lexical chaining thus
remains an outstanding research issue.
In summary then, lexical chains have found application in areas as diverse as concept
identification in multimedia (Kominek and Kazman 1997), information retrieval and text
segmentation (Stairmand 1996), text summarisation (Barzilay and Elhadad 1997),
malapropism detection (StOnge 1995), and word sense disambiguation (Okumura and
Honda 1994).
The approaches to lexical chaining described above, for which there are computer
implementations using non-domain specific thesauri, have used nouns only. In English,
this is due to the difficulty of associating nouns and verbs in the WordNet hierarchy.
Modest success is generally reported although, where there has been formal comparison
with an alternative approach to the problem (Stairmand 1996 for IR; Green 1997 for
hypertext linking), the improvement is not statistically significant. Explanations offered
for poor performance
include deficiencies in WordNet’s vocabulary and organisation (StOnge 1995, Kominek
and Kazman 1997, Green 1997), inadequacies in the chaining algorithm (StOnge 1995),
and word sense ambiguity (Green 1997). Stairmand (1996), and StOnge (1995) both
suggest using Roget’s thesaurus as a method for circumventing the inadequacies of
WordNet for lexical chaining.
2.4.6 Summary
Lexical chaining is a method that has found application to a variety of problems in NLP.
Results have been generally disappointing though, possibly as a consequence of using
nouns only, or possibly due to word sense ambiguity. Researchers who have used
WordNet have suggested Roget’s thesaurus as an alternative knowledge source. This has
been shown in initial experiments to give sensible results on semantic similarity.
2.5 Conclusions
This chapter has briefly overviewed work in IR, CBR, and NLP with the objective of
justifying the text similarity method developed in this thesis. We found that IR methods
could be divided into manual and automatic techniques, and that automatic techniques
could be divided into statistical or knowledge based approaches. Knowledge based
approaches have not been shown to be practical for large document collections, although
they may give impressive results in research studies. We also found that similarity
methods using thesauri were well known in IR (e.g. Rada and Bicknell 1989) in specific
subject areas. Particularly, query document similarity ranking was effective when used
following an initial query to analyse the documents returned in a procedure known as
local document analysis (Xu and Croft 1996). Since the local document set is small, an
expensive procedure based on NLP could be applicable.
T-CBR deals with case bases of texts that are small in comparison to IR document
collections, so again, the additional processing incurred by an NLP based similarity
measure would be acceptable if it provided improved similarity matching. T-CBR was
also shown to face similar problems to IR with respect to ambiguity.
A brief overview of NLP was also presented in Section 2.3. This focused on semantic
similarity where a simple edge counting technique using Roget’s thesaurus was shown to
be a possible alternative to information content approaches using WordNet. Although
WordNet has formed the basis of much work on lexical chains, several problems with it
were identified. Principal amongst these was the lack of a relationship between the noun
and verb hierarchies. All work in lexical chains to date has consequently been based on
nouns only.
There are consequently four research issues from the literature that motivate this research.
Firstly, whilst Morris and Hirst (1991) based their description of lexical chains on Roget's
thesaurus they did not provide a computer implementation due to a lack of a
machine-readable version. Thus, the relationships they identified in the thesaurus were
only verified by hand. A Roget based implementation would be a useful tool for assessing
their value, as an automatic system would be blind to the meaning of a text in a way that
is difficult or impossible for human readers. A Roget based lexical chainer will be
described in Chapter 3.
Secondly, Morris and Hirst (1991 p41) speculate that chain forming parameter settings
will vary according to an author’s style. This suspicion has not previously been
investigated, and remains an outstanding research issue. It will be addressed in Chapter 4.
Thirdly, both Stairmand (1996), and StOnge (1995) separately observed that the
performance of their chainers could be improved by using Roget. This is because some
Roget relationships are simple to compute, but not possible with WordNet. For example,
the words “blind” and “rainbow” have an intuitive association concerned with sight and
visual phenomena that is reflected in their membership of the same group of Roget categories.
Loukachevitch and Dobrov (2000) have recently reported an all-words lexical chaining system for Russian that uses a
hand built thesaurus for the Socio-political domain. The thesaurus excludes ambiguous relations. The system has
been applied to text summarisation, categorisation, and conceptual indexing.
Finally, we note that Roget's thesaurus has a simpler, more balanced structure than
WordNet. If the main categories are not divided, the tree structure is only four levels deep.
Thus the whole tree structure, plus the pointers to related categories that are a feature of
Roget may be held in a computer’s main memory. This makes it suitable for an efficient
implementation using all of the Morris and Hirst linking relationships.
With Roget, the same procedures can be used to identify relationships between any two
words in the thesaurus. StOnge (1995 p47) comments that some deficiencies in his
program may have arisen due to “a lack of consistency in the semantic proximity
expressed by WordNet’s links”.
Roget's thesaurus naturally has some disadvantages with respect to WordNet. Roget’s
thesaurus gives no guidance as to word sense frequency. That is, whilst dictionaries order
their entries according to how frequently they are generally found, Roget gives equal
precedence to all senses of a word. This exacerbates problems of word sense ambiguity
(Chapter 5) and is an unresolved issue. By contrast, Princeton WordNet (version 1.5 and
up) lists word senses retrieved in order of their frequency of occurrence. Whilst the
derivation of this frequency information is suspect, it is preferable to the situation with
Roget's thesaurus. Consequently, it is not known whether the word sense ambiguity
problems noted by Stairmand (1996), StOnge (1995), and Green (1999) will be
intractable in a Roget based lexical chainer, due to lack of frequency information, or
ameliorated due to the different thesaurus structure. This issue of ambiguity is addressed
in Chapter 5, whilst the evaluation issue is addressed in Chapter 6.
We now go on to describe a lexical chaining program based on Roget’s thesaurus which
uses all parts of speech. This is used to compare the similarity of texts.
Chapter 3. Hesperus: A System for Comparing the
Similarity of Texts Using Lexical Chains
3.1. Introduction
We have hypothesised that the conceptual contents of texts may be used for similarity
judgements, and that these contents may be characterised with reference to an external
knowledge source.
This chapter presents supporting evidence for that hypothesis. This covers the
implementation of the lexical chainer and its various components leading to the derivation
of algorithms to compare the conceptual similarity of texts.
The relationship of this chapter to the thesis argument is shown in fig 3-1.
Literature Review.
Hesperus: A system for comparing the similarity of texts using
lexical chains.
The General Nature of Lexical Links.
Word Sense Disambiguation and Hesperus.
Evaluating Hesperus
Figure 3-1: Thesis chapter structure
The purpose of the description of the system and its algorithms is threefold. Firstly, this
study is experimental in nature, and so the system constitutes an essential element of its
methodology. If the system design were conceptually flawed, work that is based upon it
in subsequent chapters would be built on weak foundations. A second reason for
describing the system is that this work needs to be understood in relation to previous work
on lexical chaining. This makes improvements explicit so that the techniques may be
understood or used by subsequent researchers. Finally, the description is essential if this
work is to be replicated.
An objective of this work is the development of matching techniques for documents based
on their concepts as represented by thesaural categories, rather than their constituent
words. This requires:1. the selection of a structure to represent a document’s contents,
2. a method to derive this structure from texts,
3. a technique to compare these structures.
These requirements are interdependent, since overly complex representations of different
texts may not be directly comparable. Inspiration here comes from the area of Case Based
Reasoning (CBR, e.g. Aamodt and Plaza 1994) that attempts to resolve queries based on
previous problem solutions (Chapter 2). As a field, CBR is involved in the use of generic
similarity matching techniques applicable to areas as diverse as printer fault diagnosis,
building design, and personal income tax planning.
In CBR, a query and examples are usually represented as attribute value pairs. Thus to
apply CBR to document comparison, both the text acting as a query, and the documents to
be compared against need to be represented equivalently. If this representation is based on
simple terms (i.e. words), the problem becomes hugely complex, since there are about
100,000 words in the English language. This would also be a fragile approach, since
semantically equivalent words would not count as equal. However, if documents may be
represented as sets of Roget categories, the problem becomes tractable. The purpose of
this chapter is to describe a system that performs this transformation.
We firstly describe the implementation of a lexical chainer based on Roget's thesaurus
(Appendix VII). As described in Chapter 2, the identification of lexical chains in a text
supports several applications in Natural Language Processing. In this work, they are a
prerequisite for the derivation of the Generic Document Profile (GDP) that is used to
compute similarity between texts. Consequently, we describe how this GDP may be
derived from the lexical chains. As the word senses used in a lexical chain are difficult to
understand outside of the surrounding text, we present a method that allows their
visualisation in context. Finally, the derivation of the “Electronic Roget” is presented.
Although Project Gutenberg (1999) has made the machine-readable text of Roget's
thesaurus available, it lacks an index. Consequently, the derivation of the index is
described. This description may be useful to researchers in languages other than English,
who are looking for methods to convert their thesauri into efficient supporting tools for
lexical chaining.
3.2 Hesperus: A System for comparing Text Similarity using Lexical Chains
Hesperus is a system designed to compare how similar texts are by measuring their
conceptual contents as determined by their thesaurally defined lexical chains. This
process is described in some detail in this section. In outline, however, it is as follows:
Texts are firstly processed individually to determine their lexical chains, and subsequently
their generic document profiles. These are stored in a database of cases, known in CBR
as a “case base”. The profiles can then be clustered for similarity to the exemplar
texts using the nearest neighbour algorithm, as is common in CBR.
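The profile storage and nearest-neighbour step can be sketched as below. This is a simplified illustration, not the system's actual implementation: it assumes a Generic Document Profile is a sparse mapping from Roget category to strength, and uses cosine similarity as a stand-in similarity measure; the category names and case base entries are invented.

```python
import math

def cosine(profile_a, profile_b):
    """Cosine similarity between two sparse {category: strength} profiles."""
    dot = sum(w * profile_b.get(c, 0.0) for c, w in profile_a.items())
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_neighbour(query, case_base):
    """Return the stored case whose profile best matches the query profile."""
    return max(case_base, key=lambda case: cosine(query, case_base[case]))

# Invented case base of document profiles.
case_base = {
    "doc_finance": {"money": 3.0, "bank": 2.0},
    "doc_nature":  {"river": 2.5, "flower": 1.5},
}

best = nearest_neighbour({"money": 1.0, "bank": 1.0}, case_base)
```

Because profiles are sparse mappings over at most 1000 Roget categories rather than vectors over ~100,000 word types, the comparison stays cheap even for long texts.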
The architecture of the system is shown in figure 3-2 overleaf using a level 1 data flow
diagram and standard SSADM notation (e.g. Weaver 1993).
This chapter proceeds with a discussion of the Roget based chainer, and the algorithm
used. Subsequently, we look at the derivation of the Generic Document Profile.
Figure 3-2:Hesperus System Architecture
3.3 A program to analyse lexical chains in a text using Roget's Thesaurus
This work is based on Morris and Hirst's (1991) hypothesis that lexical chains may be
automatically identified in a text using Roget's thesaurus. Although we have seen in
Section 2.4 that lexical chainers have been written using WordNet (Fellbaum 1998) there
have been no published computer implementations based on Roget.
This section describes a lexical chainer based on Roget’s thesaurus. The techniques used
are general, and depend on the organisation of the thesaurus as a structured resource.
Thus, they should be applicable to languages for which WordNets have not been written,
but for which texts equivalent to Roget are available.
This lexical chainer depends on the availability of an “Electronic Roget” - a
machine-readable version of Roget's thesaurus together with an indexing program. This
makes it possible for a program to identify the Roget categories of which a word (or
words) is a member.
Two versions of Roget's thesaurus have been used during the course of this study. Firstly,
the 1911 edition was used, as this is available electronically over the Internet (Project
Gutenberg 1911) and is out of copyright. This version however has several problems
before it can be used to find inter word thesaural relationships. It contains many obsolete,
literary, and foreign language terms that need to be filtered out, but most importantly, it
lacks an index. The editor of the 1962 Roget (Dutch 1962) points out that the index was
carefully created so as not to contain all possible terms. Therefore, an automatic approach
is going to lead to a higher degree of lexical and term ambiguity, since it does include all
possible terms, including those whose relationship is too tenuous for a human editor to include.
The second version of Roget’s thesaurus used was that of 1987, “The Original Roget's
Thesaurus of English Words and Phrases”. This is structurally similar to the 1911
version, but includes a considerably enlarged and modernised vocabulary. Again, a
machine-readable index was not available.
The structural difference between the 1987 and the 1911 versions of Roget is due to the
increased size of the vocabulary, which consequently increases the size of the thesaural
entries. Whilst these conform in structure to Roget’s 1000 categories (see Section 1.6),
these categories are further subdivided into approximately six further subcategories,
giving 6400 subcategories altogether. For the remainder of this chapter this subdivision
will be ignored. However, the 1987 edition of Roget does allow the possibility of working
at a finer level of granularity than that offered by the 1911 version. That is, we could
consider the thesaurus as having approximately 6400 smaller categories, rather than 1000
larger ones.
The 1962 version is available as paper only.
Copyright © 1987 by Longman Group UK Ltd. We are grateful to Longman’s for permission to use this work for
academic purposes.
The issue of word sense ambiguity inherent in an automatically created index is
connected to an observation in Morris and Hirst (1991) that of the visible connections
between words in a newspaper article, only 80% (approximately) could be detected with
Roget's thesaurus. They recognised the remaining relationships as between proper nouns,
anaphora, and knowledge of the environment, such as city districts. These are associations
familiar to Morris and Hirst (1991), and the article’s author, but not included in Roget in
the appropriate sense. It is quite plausible that some of these terms, especially proper
nouns, will be present in Roget in other senses. These erroneous senses will then be used,
leading to misclassification, and subsequent identification of spurious lexical links and chains.
Lexical chaining is then an approximate, error-prone procedure. The objective for this
work is to produce an implementation that is accurate enough to produce a usable Generic
Document Profile. That is, sufficiently accurate as to offer an improvement over
term-only similarity algorithms (Chapter 6), whilst being suitable for interactive use.
The next section proceeds as follows: firstly, we consider the creation of an Electronic
Roget; we then proceed to discuss the chaining algorithm itself.
The Creation of a Machine Readable Thesaurus
This thesis has used Roget’s thesaurus as a knowledge base for the identification of
lexical chains in a text. This was done using machine-readable versions of the thesaurus
without having access to an index. The index is essential in identifying how a word is
classified in the thesaurus. Without an index, it is not possible to find to which entries a
word belongs, or where it fits in the thesaural hierarchy. Consequently, it was necessary
to create a machine-readable index. The procedure that does this is described in Algorithm 3-1.
Algorithm 3-1, which creates a machine-readable index, may be applied to any similar
resource. It would be possible for example to apply Algorithm 3-1 to a non-English
thesaurus (whose text had been acquired say, by scanning), and subsequently use that to
derive lexical chains in texts in that language.
It follows too that if the thesauri have equivalent structures the procedures for text similarity matching (or other
operations possible with lexical chains) could be applied cross-lingually, to texts written in different languages.
Algorithm 3-1 particularly applies to the 1911 edition of Roget’s thesaurus. Project
Gutenberg (1999) have made the 1911 Roget publicly available, since its copyright has
expired. That edition of Roget is available as a single text (i.e. it is a book) with section
headings divided into the standard categories described in the preface to any Roget
(Dutch 1962). Each entry is indicated by a headword that describes the idea(s) of that
entry, which is independent of any part of speech. Each entry is numbered, and a feature
of Roget is that the reader is referred to related entries.
In the 1911 Roget the related entries are given by number at the end of each entry. Later
editions (e.g. Roget 1962) indicate pertinent categories next to the most relevant word.
Algorithm 3-1 does not exploit that enhancement.
The challenge is to be able to identify the thesaural categories a word (or words) belongs
to starting from the book form. Algorithm 3-1 divides that book into separate entries in
the thesaurus, simplifies them, and then creates an index from them using an information
retrieval program.
The purpose and role of minor procedures are described in appropriate comments.
This work uses a public domain program, FFW, though many other programs are available on the internet - as is FFW.
Algorithm 3-1: Creation of the E-Roget
PROCEDURE Create-E-Roget
    LET file-name := NULL
    LET RogetPtrs := NULL
    LET Roget_files := NULL
    FOREACH Roget-entry R1:
        //Remove obsolete, Latin, & foreign words
        Combine-Collocations(R1)  //merge into single hyphenated terms
        file-name := Compose(head_title, entry_number, part_of_speech)
        WRITE R1 AS file-name
        COLLECT pointers-to-related-entries INTO RogetPtrs
            //array indexed by entry number
        COLLECT file-name INTO Roget_files
    WRITE RogetPtrs AS Roget.ptrs
    //using an Information Retrieval (IR) program
    INDEX Roget_files INTO-FILE Roget.Index
When reloaded, the memory-based array RogetPtrs allows thesaural relations between
any two word categories to be determined. The search component of the IR program may
now be encapsulated as a separate Function Categorise.
Chapter 3
Function Categorise (Word1 .. Wordn)
  LET Set1 := RETRIEVE File-Names FROM Roget.Index CONTAINING Word1
  ...
  LET Setn := RETRIEVE File-Names FROM Roget.Index CONTAINING Wordn
  RETURN (Set1 ∩ ... ∩ Setn)
  // the set of thesaural entries containing Word1 .. Wordn
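The Categorise function above can be sketched in Python as a set intersection over an inverted index. The index contents below are illustrative stand-ins, not the actual E-Roget entry names or numbers.

```python
# Toy inverted index standing in for the IR program's Roget.Index.
# Entry file names (headword, entry number, part of speech) are illustrative.
roget_index = {
    "train": {"learn.537.vb", "transference.270.n", "continuity.69.n"},
    "rails": {"transference.270.n"},
    "practice": {"learn.537.vb"},
}

def categorise(*words):
    """Return the set of thesaural entries containing every given word."""
    sets = [roget_index.get(w, set()) for w in words]
    if not sets:
        return set()
    result = sets[0].copy()
    for s in sets[1:]:
        result &= s   # Set1 ∩ .. ∩ Setn
    return result
```

For a single word this returns all its candidate entries; for several words it returns only the entries they share.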
Efficiency Considerations
The Electronic Roget uses two techniques that support efficient use by the lexical chainer. Firstly, thesaural entries are encoded as integers; secondly, the internal pointer structure of the Roget is held in memory. Let us look at these techniques in turn.
Integer Encoding
In order to carry out lexical chaining, the function Categorise (above) is used. This must return the three types of information that describe a word’s thesaural category:
1) The thesaural category (e.g. Entry 1 .. Entry 1000)
2) The thesaural sub-division: that is, Noun, Verb, Adjective, Adverb & Phrase
3) An additional category, if relevant (e.g. Entry 16a, 16b, or 16c). These refer to class divisions and additions created after the original classification.
Since these numbers are small, they may be combined into one integer (two bytes). This
both makes maximum use of available memory, and permits several lexical comparison
operations to be carried out simultaneously.
For example, in the various lexical linking relations described in Morris and Hirst (1991),
the simplest comparison is to determine whether two words are members of the same
thesaural entry. This requires testing the equivalence of both item (1) and (2) above. This
may be done in one operation. Other operations are similarly reduced to integer arithmetic. Of course, if specific information is required (such as whether an entry refers to a “Noun” subcategory), this may be done using bit-mask operations. Since these are low-level operations, they are also typically faster than the string-matching operations that would be required by other encoding schemes. This technique also supports the second efficiency technique, the Roget pointer structure.
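The integer encoding described above can be sketched as follows. The field widths (10 bits for the entry number, 3 for the part-of-speech subdivision, 2 for the additional sub-entry) are assumptions consistent with the two-byte limit, not the thesis’s actual layout.

```python
# Pack a thesaural reference into one 16-bit integer:
# [entry number: 10 bits][subdivision: 3 bits][additional a/b/c: 2 bits]
NOUN, VERB, ADJ, ADV, PHRASE = range(5)

def pack(entry, pos, extra=0):
    """Combine entry number, subdivision, and sub-entry into one integer."""
    return (entry << 5) | (pos << 2) | extra

def same_entry(a, b):
    # CAT-style test: same entry number AND same subdivision, one comparison
    return (a >> 2) == (b >> 2)

def pos_of(code):
    # bit mask extracts the part-of-speech subdivision
    return (code >> 2) & 0b111
```

The same-entry test above compares both items (1) and (2) in a single integer comparison, as the text describes.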
The Roget Pointer Structure
Lexical chaining depends upon the detection of relationships between words. These
relationships may be membership of the same category in the thesaurus, or more complex
links, which may be determined by following the internal references in the thesaurus.
Consequently, the efficient operation of lexical chaining depends upon the rapid
identification of relations between thesaural categories.
Roget’s thesaurus is distinctive in that the categories are ordered in a tree like structure
(outlined in Appendix VII), where entries contain pointers to related entries. Since this
information is static, it may be determined when the electronic index is created.
Since the Roget pointer table is held in memory, all the thesaural-linking operations are
carried out without disk access. This is important, as accessing a computer disk is one
hundred times slower than accessing its memory. This is a consequence of a disk being a
mechanical device. Additionally, as word category information is held as integers, lexical
chaining using Roget will be more efficient.
In summary, it is a pre-requisite of any lexical chaining procedure that we can determine
the thesaural categories to which a word belongs. This section has described a procedure
to accomplish this using routinely available Information Retrieval programs supported by
basic text processing.
3.3 Chaining Algorithm
The algorithm described here is based on that given for WordNet in StOnge (1995), who
in turn followed Okumura and Honda (1994), and Morris and Hirst (1991). That is, a
linear pass is made through the text, and where words can be associated using
relationships derived by reference to an external thesaurus, a “link” is stored. If one of the
members of the link was previously linked elsewhere in the text, the two links form a
“chain”, to which further links may be added.
Several relations between word pairs can be used to decide if they are members of the
same lexical chain. These were suggested and described fully by Morris and Hirst (1991).
Only a subset of the Morris and Hirst (1991) relations was found to be useful in Hesperus, since some are excessively prone to problems of word sense ambiguity.
The most important relation is word repetition, known as the ID or identical word relation. Simply, if two words are the same, they may be linked with a high degree of certainty. Although there is a mean of four thesaural entries per term, the discourse topic
acts to constrain sense usage. This means that a word used in one sense is frequently used
in that sense throughout the same text. For example, “bond” has one common sense in
financial journals, a second usage in chemistry papers, and a further widely accepted
sense in sociology books. This is an instance of “one sense per discourse” noted by
Krovetz and Croft (1992).
Next, we consider whether two words are members of the same thesaural category (CAT
relation). Again, due to one sense per discourse, this is mostly successful. It appears, however, to be more error prone than the ID method. The degree of error is related to the type of text being analysed, with technical texts appearing to have fewer errors than fictional texts, as the latter contain a greater proportion of polysemous words. Precise error rates have not been calculated, as this would require a text corpus tagged with Roget categories (see Section 2.4.4).
Since Roget’s thesaurus contains groups of categories, we also consider whether word
pairs are members of neighbouring and related thesaural categories. If so, and the two
categories are members of the same thesaural group, the words may be linked with a
GROUP relation.
Roget categories often refer to other categories. Words may therefore be related where a
word's entry refers to an entry that contains the second word. This relationship is
abbreviated as ONE, since there is one level of indirection from one thesaural category to
the second.
All lexical chains are stored in a common data structure called a ChainStore. Unlike StOnge’s Chainstack, which ordered chains by recency of occurrence, or Okumura and Honda (1994), where chains were ordered by their length, chains in the ChainStore are ordered by their value, as calculated using the procedure Potential-Link-Value (see pp. 56-57). (Courier font is used to differentiate link names from descriptive text.) This allows the concepts in a text to be considered according to their known relative strengths as these emerge during the linear text “context” determination process.
The ChainStore is linked to the application of a variable-width window within which lexical links are considered. Furthermore, the type of link governs the region of text for which links are considered. Identical word links are then considered within a region of fifty non-stopwords. Other link types have this value reduced in proportion to their relative weights.
The variable width window is motivated by ambiguity considerations. The progressively
weaker semantic relations consider increasing numbers of words, as shown in Table 3-1.
Table 3-1: Thesaural Relations vs Mean Words
- Mean number of word repetitions in the thesaurus
- Mean size of thesaural category
- Mean number of categories in a thesaural group
- Mean number of categories in a ONE relation
- Mean number of categories in a TWO relation
A full discussion of this issue needs to be informed by word sense frequency data (Chapter 5). To clarify Table 3-1 we will assume that all word senses occur equally frequently.
Table 3-1 shows that the GROUP relationship will consider possible links between two words from a mean of four categories from Roget’s 1000. (Some groups contain two and others as many as ten categories; four is an estimated mean.) The ONE relationship will consider six categories, whilst the TWO relationship considers two levels of indirection, or thirty-six categories.
As there are approximately 1000 categories in Roget, it is clear that there is a significant probability that a TWO relationship will be found between unrelated words as this often
includes so many categories. This risk naturally increases with the number of words that are considered. Indeed, there is a strong possibility that unrelated words that should form new
lexical chains could be erroneously linked to words in existing chains by the TWO
relation. As such, it appears to offer no positive contribution to representing a text for text
similarity assessment, and its use was not pursued.
The algorithm used to create chains is as follows:
Algorithm 3-2: Create Lexical Chains
PROCEDURE Create-Lexical-Chains
  LET NewLink := GetLink()             // get a new link from the input
  LET LinkTypes := [ID, CAT, GRP, ONE] // types of links in preferred order
  FOREACH LinkType IN LinkTypes
    FOREACH chain IN ChainStore
      FOREACH link IN chain
        IF CanLink(link, NewLink, LinkType)
          Push(NewLink, chain)
          IF Ambiguous(NewLink)        // has multiple word senses
            SORT ChainStore
          RETURN                       // done
        IF Potential-Link-Value(link, NewLink) < 1
          BREAK                        // leave two loops to try next link type
  LET NewChain := NULL                 // NewLink can't be linked to an existing
  Push(NewLink, NewChain)              // chain, so form a new chain
  Push(NewChain, ChainStore)
  SORT ChainStore
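A much-simplified sketch of this chaining loop is given below. Only the ID (identical word) relation is implemented, and chains are re-ordered by length as a stand-in for the value ordering described above; the real chainer consults Roget for the CAT, GRP and ONE relations.

```python
# Simplified chainer: ID relation only; length ordering stands in for value
# ordering of the ChainStore.
LINK_TYPES = ["ID", "CAT", "GRP", "ONE"]   # preferred order

def can_link(word_a, word_b, link_type):
    if link_type == "ID":
        return word_a == word_b
    return False   # CAT/GRP/ONE would consult Roget's categories here

def add_to_chains(new_word, chain_store):
    """Attach new_word to the first chain it links to, else start a new chain."""
    for link_type in LINK_TYPES:
        for chain in chain_store:
            for word in chain:
                if can_link(word, new_word, link_type):
                    chain.append(new_word)
                    chain_store.sort(key=len, reverse=True)
                    return chain_store
    chain_store.append([new_word])   # no link found: start a new chain
    return chain_store
```

Feeding the words "train", "rails", "train" through this sketch yields one two-word chain and one singleton, mirroring how repetition links words into chains.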
The function Potential-Link-Value restricts the potential chaining region
according to the link type being considered. Table 3-1 has shown that the different
thesaural linking relations proposed by Morris and Hirst (1991) cover increasing numbers
of words in the order CAT < GRP < ONE. Thus, as weaker relations are used the risk of a
spurious link being found increases.
Function Potential-Link-Value reduces the effect of erroneous links by restricting
the number of words being considered. This is done in two ways. Firstly, link types have
different weights, which are given in table 3-2 below. Secondly, the distance between two
links is used to determine the value of a link.
The values of the link weights were determined empirically. That is, different values were
used that reflected the links’ relative importance, and which gave reasonable results for a
sample text (Section 3.4). There is no benchmark data against which alternate link values
could be independently assessed. This issue is addressed however in Chapter 6, which
develops baseline test data for the complete text similarity task.
Function Potential-Link-Value (NewLink, OldLink)
  RETURN LinkValue / (WordNumber(NewLink) - WordNumber(OldLink))
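A sketch of this function is given below. The link weights are illustrative stand-ins: the empirically determined values of Table 3-2 are not reproduced here.

```python
# Link weights are illustrative assumptions, not the thesis's actual values.
LINK_WEIGHTS = {"ID": 1.0, "CAT": 0.5, "GRP": 0.25, "ONE": 0.125}

def potential_link_value(link_type, new_word_number, old_word_number):
    """Weight of the link type divided by the distance between the two words."""
    distance = max(abs(new_word_number - old_word_number), 1)
    return LINK_WEIGHTS[link_type] / distance
```

Weaker link types and greater distances both reduce the value, so weak long-range links fall below the chaining threshold first.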
Table 3-2: Value of the different lexical links.
Link Type
3.4 An Example
Now we briefly consider an example of the creation of lexical chains using Hesperus. We will use the quotation from Einstein considered by StOnge (1995), and introduced in Section 1.6 as an example to illustrate the notion of a lexical chain.
Recall that StOnge (1995) manually identified three lexical chains in this text, which are
indicated by text subscripts and listed below.
We suppose a very long train1 travelling2 along the rails1 with the constant velocity2
v and in the direction2 indicated in Figure 1. People travelling2 in this train1 will
with advantage use the train1 as a rigid reference-body3; they regard all events in
reference3 to the train1. Then every event which takes place along the line1 also
takes place at a particular point1 of the train1. Also, the definition of simultaneity
can be given relative to the train1 in exactly the same way as with respect to the
embankment1. (Einstein)
1. {train, rails, train, train, train, line, point, train, train, embankment}
2. {travelling, velocity, direction, travelling}
3. {reference-body, reference}
Hesperus finds the chains shown below in the text. Table 3-4 illustrates their embedding.
(This notation is described in Section 3.8.)
Table 3-3: Quotation from Einstein 1939 (cited by StOnge 1995)
We suppose4 a very long6 train0 travelling3 along the rails with a constant velocity v and
in the direction1 indicated in figure7. People travelling in this direction will with
advantage5 use the train as a rigid reference2-body; they regard all events in
reference-to the train. Then every event which takes place along the line also takes place at a
particular point10 of the train. Also, the definition8 of simultaneity9 can be given
relative-to the train in exactly the same way as with respect to embankment
The chains are given below, numbered in importance from zero:
0. train, rails, train, train, line, train, train, embankment
1. direction, people, direction
2. reference, regard, relative-to, respect
3. travelling, velocity, travelling, rigid
4. suppose, reference-to, place, place
5. advantage, events, event
6. long, constant
7. figure, body
The remaining chains contain single words only, or are “atomic” in Hirst and StOnge’s (1998) terminology.
The most important chain is given below. The word and its number in the text are shown,
followed by the word's link relationship in the chain. This is followed by the number of
the word preceding it in the chain that it is linked to. The thesaural sense(s) possible for
the relationships between those words are also given. Thus, in the following, word 6
(Train) at the head of the chain is not linked to anything (↓), whilst word 10 (rails) is
linked to it by CAT. That is, they are members of the same thesaural category.
Table 3-4: An example lexical chain embedded in a text
Word Number
Link Type
Linked to
Thesaural Category
This lexical chain illustrates the reduction in possible word senses for “train” from
twenty-three to one. The possible senses are shown in table 3-5 below.
Table 3-5: Senses of “TRAIN”
Senses of “TRAIN” from Roget’s Thesaurus (1997)
The headword of the entry in the thesaurus is given, followed by the main category number (there are about 1000 categories in Roget’s thesaurus). This is followed by the sub-group (of which there are 6400) and grammatical part of speech.
We also see that “direction” is mistakenly linked to “people”, since both are members of
the thesaural category “government”. Nevertheless, the lexical chains produced by
Hesperus overlap well with those manually identified by StOnge (1995) above.
StOnge’s Chainer Output
StOnge (1995) analysed the quotation from Einstein at the start of this section and his
program found the following lexical chains:
The headword is the title of the entry. No definition is given of what the category title means other than the words it contains. Thus, to clarify the “learn” sense of train we need to find “train” in the context of related words in that category: “train, practice, exercise, be wont”.
[001] simultaneity(3), reference(1), advantage(1), train(1), train(1), train(1), direction(0), velocity(0), train(0)
[002] travelling(1), travelling(0)
[003] line(2), rails(0)
[004] given(3), constant(0)
[005] body(1), people(1), figure(0)
[006] point(2), particular(2), regard(1)
[007] place(2), place(2), event(2), events(1)
[008] definition(3)
[009] embankment (3)
As with Hesperus, the initial number (in square brackets) shows the chain creation order, and words appear in reverse order of insertion. StOnge (1995) notes that rails is read but is not associated with train, the distance between these two words being too large.
Further problems are identified in StOnge (1995). Nevertheless, this example shows that a
lexical chainer based on Roget's thesaurus can produce results equivalent in quality to
those produced using WordNet.
3.5 The Generic Document Profile
The purpose of the Generic Document Profile is to represent any text in such a way that
the similarity in meaning between two texts may be compared. The Generic Document
Profile is simply a set of semantic (Roget) categories with associated weights. These
weights are based on chain length and strength attached to the thesaural categories. This
profile can be matched against that derived from another text in a Case Based Reasoning
approach using a Nearest Neighbor algorithm (Aamodt and Plaza 1994).
This representation is known as the “Generic Document Profile”, since it is not word
specific, and is derived from the whole text. Now all that remains is to describe how a text
may be analysed so that values in each particular category can be determined.
(US spelling is used, since this is the common name for this class of algorithm.)
Creating the Generic Document Profile
The Generic Document Profile is created from the lexical chains identified in a text. The
strength of every link is determined as described in the function Potential-Link-Value (Section 3.3). This value is then summed into the appropriate profile category.
This gives the required attribute-value representation. Thus, the strength σ_c of any concept c, where C(n) gives the strength of concept c in link n, and where a text contains N lexical chains, is:
Equation 3-1: The strength of a concept
σ_c = Σ_{n=1..N} C(n)
Table 3-6 below shows an example Generic Document Profile. It is derived from the
example quotation from Einstein used in Section 3.4. For each thesaural category, the raw
score is converted into a percentage of the total score. Using these percentages normalises
document length and is an intrinsic part of creating a GDP.
Table 3-6: An Example Generic Document Profile
Roget Class
Raw Score
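The profile construction just described can be sketched as follows; the category names and values used are illustrative, not those of Table 3-6.

```python
def build_gdp(links):
    """Build a Generic Document Profile from (roget_category, link_value) pairs.

    Raw scores are summed per category, then converted to percentages of the
    total, which normalises for document length.
    """
    profile = {}
    for category, value in links:
        profile[category] = profile.get(category, 0.0) + value
    total = sum(profile.values())
    return {c: 100.0 * v / total for c, v in profile.items()}
```

Because every profile sums to 100, profiles from texts of very different lengths remain directly comparable.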
Note that the GDP approach to text described here assumes that meaning is compositional: that there is one meaning to a text that is summarised in the GDP. This may be true of single paragraphs, but is rarely true of whole texts. These contain themes or arguments, depending on the type of text, that are contained in the component sections, which, linked together, form a coherent document. GDPs could be used to represent themes in whole texts by first partitioning documents into sub-structures such as chapters or sections. The approach described is, however, suitable for determining the similarity of texts.
3.6 Using the Generic Document Profile to Determine the Similarity of Texts
This section is the technical zenith of the approach. So far, we have described how to
determine the lexical chains in a text. Next we described how to derive the Generic
Document Profile, which is an attribute value vector whose attributes are categories from
Roget’s thesaurus, and whose values are derived from the lexical chains. Now we will
describe how texts may be compared to give a similarity score using a Nearest Neighbor algorithm.
A CBR Nearest Neighbor algorithm takes two case descriptions, one input and the other retrieved from the database of cases, and returns their similarity as a value between 0 and 1.0. Kolodner (1993) gives the following definition of a Nearest Neighbor algorithm:
Equation 3-2: Simple Nearest Neighbor Algorithm
MatchScore = Σ_{i=1..n} w_i × sim(f_i^I, f_i^R) / Σ_{i=1..n} w_i
where w_i is the importance weight of a feature or slot, sim is the similarity function, and f_i^I, f_i^R are the values for feature i in the input and retrieved cases respectively.
Kolodner’s (1993) algorithm needs to be modified slightly for use in Hesperus, since not
all Roget senses will occur in the input text I, or the retrieved case R. If the sets of Roget
categories in the input and retrieved cases are C_I and C_R respectively, then only those categories found in both (that is, C_I ∩ C_R) are considered in the match.
Equation 3-3: Hesperus Nearest Neighbor formulation
MatchScore(I, R) = Σ_{i ∈ C_I ∩ C_R} w_i × sim(f_i^I, f_i^R) / Σ_{i ∈ C_I ∩ C_R} w_i
The influence of any particular Roget category is governed by its weight w_i. The weight
used is the value of that feature in the Generic Document Profile of the input case. Since
we have no external metric to evaluate differential feature weighting, Hesperus weights
all Roget categories equally. This issue is discussed further in Chapter 7.
In common with many CBR packages such as ART*IM (Inference 1994), the feature similarity function sim is defined as the proportional range of the feature value between the two cases. If v_i^I, v_i^R are the values of feature i in the input and retrieved cases respectively, then:
Equation 3-4: Hesperus Feature Similarity function
sim(f_i^I, f_i^R) = min(v_i^I, v_i^R) / max(v_i^I, v_i^R)
This function has a range between 0.0 and 1.0.
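Equations 3-3 and 3-4 can be sketched together. Since Hesperus weights all Roget categories equally, the weights cancel and the match score reduces to the mean of the feature similarities over the shared categories.

```python
def match_score(gdp_input, gdp_retrieved):
    """Nearest-neighbour match over shared Roget categories, equal weights."""
    shared = set(gdp_input) & set(gdp_retrieved)   # C_I ∩ C_R
    if not shared:
        return 0.0
    sims = [min(gdp_input[c], gdp_retrieved[c]) /
            max(gdp_input[c], gdp_retrieved[c])    # Equation 3-4
            for c in shared]
    return sum(sims) / len(sims)                   # equal weights cancel
```

Two identical profiles score 1.0; profiles with no Roget category in common score 0.0.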
In the case of several texts, the Generic Document Profile is determined for all the texts
and these profiles are then stored in the database of cases, or case base. Similarities may
then be calculated for any particular input text against those stored. Thus, the GDP, which is derived with Roget’s thesaurus, may be used to determine the similarity of two or more texts.
3.7 Adherence to Zipf’s Law
Since the GDP is derived from text, a necessary requirement for a text representation is
that it conforms to Zipf’s law (Chapter 2). If the representation displayed another
frequency distribution, it would clearly be misrepresenting the text since the text’s
unprocessed contents do conform to Zipf’s Law.
Graph 3-1 below shows the values of the thesaural categories vs their rank plotted using a
log/log scale. These values represent the frequency of the categories.
The data in Table 3-6 may be fitted to a straight line. Since these data were derived from a one-paragraph example text, conformance is inevitably imperfect. (Longer texts are used in the experiments reported in Chapter 6; their GDPs are given in Appendix II and display better adherence to Zipf’s law.) Nonetheless, basic correspondence with Zipf’s law has been shown. This demonstrates that the GDP derivation is representing the text’s contents.
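The straight-line fit can be sketched as an ordinary least-squares fit in log-log space; Zipf-like data gives a slope near minus one. This pure-Python version is illustrative of the check, not the actual analysis performed.

```python
import math

def zipf_slope(scores):
    """Least-squares slope of log(score) against log(rank); scores must be > 0."""
    ranked = sorted(scores, reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(s) for s in ranked]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

Scores that fall off exactly as 1/rank give a slope of exactly minus one.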
Many phenomena in natural language exhibit Zipf’s law (Section 2.2.7). Consequently,
conformance to Zipf’s Law is necessary but not sufficient proof that the lexical chaining
and GDP algorithms provide plausible text representations. This does however give
sufficient confidence in the methodology to proceed with testing the research hypotheses
given in Chapter 1.
Graph 3-1: Zipf Law: GDP Profile Values Vs Rank
3.8 Visualisation of Results
It was difficult to evaluate the relative success of initial implementations of the lexical
chaining algorithm since data was hidden in large volumes of output. It was consequently
hard to recognise what associations had been found. Morris and Hirst (1991) used a complicated system of subscripts and superscripts that encoded the link type and associations; however, this required close attention to follow. The approach used in Table 3-4 improved on that situation, since the link type and associations are specified in normal text; however, it was still unsatisfactory since the lexical chain was viewed in isolation from its surrounding context. This made it hard to interpret whether the word sense selected was appropriate.
Since lexical chains essentially decompose a text into different “threads” of meaning, an
approach was required that allows each thread to be visualised individually, with the
option of returning to the overall view. This is equivalent to seeing any text as hypertext
made up of lexical chains.
HTML (Berners-Lee et al. 1994) is a highly practical hypertext medium, since it is
supported by public domain viewers such as Netscape and Internet Explorer.
A module was consequently designed that produces an HTML version of the initial text.
This shows the lexical chains identified, using colour to indicate chain membership. The
first element of the chain contains a hyperlink to that chain only.
The link type is encoded in the character styles used to print the other members of the
chain. Thus, a chain and its various links can be seen distributed in the surface text. An
example of this mark-up is given in Table 3-3 earlier, whilst Table 3-7 shows how the different link types are encoded as bold, underlined, italic or plain text faces. Thus, for the most dominant chain, we may see:
Table 3-7: Link Type indications
Each link in the chain also contains links to the Roget’s thesaurus entry selected for that
word. This allows for rapid, qualitative checking of the entry’s suitability to represent that
word sense. Since Roget’s thesaurus does not give definitions of entry meanings, the
suitability of an entry has to be decided based on whether the other words in the entry would make suitable synonyms for that usage.
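The HTML mark-up described above can be sketched as below. The colours, the mapping from link types to character styles, and the anchor naming scheme are illustrative assumptions, not the module’s actual output format.

```python
# Colours, style tags and the anchor scheme are illustrative assumptions.
CHAIN_COLOURS = ["red", "blue", "green", "purple"]
LINK_STYLE = {"ID": "b", "CAT": "u", "GRP": "i", "ONE": "span"}

def mark_up(word, chain_no, link_type, first=False):
    """Wrap a chain member in HTML indicating its chain and link type."""
    colour = CHAIN_COLOURS[chain_no % len(CHAIN_COLOURS)]
    if first:
        # the first element of the chain carries a hyperlink to that chain
        return f'<a href="#chain{chain_no}"><font color="{colour}">{word}</font></a>'
    tag = LINK_STYLE[link_type]   # link type encoded in the character style
    return f'<{tag}><font color="{colour}">{word}</font></{tag}>'
```

Colour shows chain membership at a glance, while the tag chosen for each word records how it was linked into the chain.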
Note that this qualitative evaluation is only suitable for monitoring coarse system
performance. Due to the often-noted sparseness of language phenomena, modifications to improve performance with less frequent words may negatively affect more frequent
words, and consequently degrade system performance. Consequently, overall evaluation
on a corpus is preferable (Chapter 6).
Several examples of Hesperus’ demonstration output are included in Appendix II.
The algorithm to make the lexical chains in a text visible is given in Appendix III.
3.9 Conclusion
We have described a system known as “Hesperus” that computes the similarity of pairs of
texts. The program identifies the lexical chains in a text using Roget’s thesaurus as a
knowledge source. This is used to create an attribute value vector of thesaural categories
that we have called the Generic Document Profile. This has been shown to conform to Zipf’s
law. Using this profile, the similarity between two texts based on their semantic content
can be calculated. We claim that this improves on much work on text similarity
assessment, since that is largely based on term repetition. This claim is experimentally
investigated in Chapter 6.
Two innovations have been introduced with respect to previous work on lexical chains.
Firstly, Roget's thesaurus has been shown to be a useful alternative to WordNet.
Secondly, the notion of a value ordered chain store was introduced. This allows a text’s
concepts to be considered according to their prevalence, rather than following the linear
text analysis.
Subsequent work is focused on improving the accuracy of the Generic Document Profile.
This is problematic, as there are no text corpora for which Generic Document Profiles or
lexical chains have been defined. This means that implementation decisions have been
described in this chapter which were justified solely on the basis of informal experiments.
What is required is a representative benchmark test that is independent of Hesperus.
Performance modifications can then be evaluated against that standard. Chapter 6
develops such a standard.
One cause of inaccuracy in Hesperus is the word sense ambiguity problem. This is
addressed by incorporating appropriate disambiguation techniques which will be
described in Chapter 5.
Chapter 6 evaluates the GDP approach by comparing it to human judgements of the
similarity of randomly selected texts. That chapter also evaluates the claim that the GDP
method is superior to an approach based on term repetition. We also return in Chapter 6 to
the possibility of using a finer level of granularity. This was mentioned in Section 3.3,
and involves using an expanded GDP containing approximately 6400 thesaural
categories. This is technically trivial, but such a modification would only be justified if it
can be shown that it improves text similarity matching performance compared to human judgements.
The applicability of the technique to different text complexities is addressed in the
following chapter.
Chapter 4. The General Nature of Lexical Links
4.1 Introduction
The Generic Document Profile (GDP) is designed to facilitate the comparison of texts and
measure how similar they are in content. It is derived from the lexical chains identified in
a text. The GDP is an attribute-value vector whose attributes are categories from Roget’s
Thesaurus, and whose values are the cumulative weights of the links in the lexical chains
(Chapter 3).
Texts may differ in several ways including style, length, genre, and complexity. These
dimensions necessarily interact, so for example many conference papers tend to be about
five pages in length due to space restrictions. Authors compensate for this by adopting a more terse style. Journal papers are usually longer, and explain findings in more detail,
whilst successful textbook authors strive for explanations of the highest clarity.
For a procedure to represent the content of any text, it must not be sensitive to any of the
factors of style, genre, length, and complexity. If it were sensitive, then it would be a
measure of that aspect of the text. The objective of this chapter is to demonstrate that the
GDP is a general method.
The issue of document length is addressed mechanically in Hesperus using simple
mathematical methods. For example, the attribute-values in the document profile are
converted to percentages of the total value of the GDP. This normalises the value that a
particular attribute may take to between zero and one.
The issue of text style and genre is somewhat more difficult to address. There are
different types of texts, and texts that are about the same subject may be in different
genres (Karlgren and Cutting 1994).
The GDP calculation is based upon the weight attached to each link in every lexical
chain. This is determined by the strength of the link type divided by the distance between
the linked words (Chapter 3). If different genres contain higher proportions of the
different links, or if the distance between thesaurally related words varies, this would alter
the distribution of attributes and values. This could arise if the threads of related words in
the simple texts are shorter, hence making the text easier to read, or, alternatively, denser text could have longer inter-word link distances. Indeed, Morris and Hirst (1991, p. 41)
speculate that lexical chain forming parameter settings will vary according to an author’s
style. If this were to be the case, the lexical chaining approach would not be a general
tool, but would instead be some measure of document complexity. Lexical chaining could
only be made suitable as an enabling measure to determine text similarity if a text’s genre
were first classified. This classification would then need to be included in the GDP calculation.
The objective of this chapter is then to demonstrate that lexical links have the same
characteristics in texts of different genre and complexity. That provides the basis for
considering the GDP as a general method. If it were not the case, we would need genre
identification and normalisation methods to be used before the GDP method could be
applied. This would reduce the attraction of the method for interactive applications.
In order to show that the GDP is a general method we need to show that:
1. The proportion of links does not depend on the type of text.
2. Links have similar distributions in different text types.
3. This distribution is independent of the link type.
This is done by selecting a set of texts of varying complexity. The lexical chaining
component of Hesperus is then applied to these and analysed to test the assertions above.
At the same time, basic data about lexical links will be collected. Although lexical chains
have been used in a variety of applications (see Chapter 2) these were small-scale. Thus,
basic data about large volumes of lexical chains have not been reported. These will be
used subsequently to tune the performance of the algorithm.
This chapter is arranged as follows. Firstly, a collection (or 'mini-corpus') of appropriate texts is selected. These varied from children's books such as “Alice in Wonderland” to more challenging works such as Kant's “Critique of Pure Reason”. Next, we demonstrate
that these are of different reading complexities using the well-known Flesch-Kincaid
readability metric. Basic data about the types of lexical links found in the texts is then
reported. The principal finding is that identical words, and words in the same
thesaural category make up approximately 80% of the relations discovered. Next, we
move on to look at the distance between the words where a thesaural link can be found.
This uncovers a significantly similar distribution pattern for all the link types and in all
the texts. This data is very similar to that reported by Beeferman, Berger and Lafferty
(1997) using purely statistical analyses. This distribution is subsequently shown to
conform to Zipf’s (1949) law. The implications of these findings are then discussed and
conclusions drawn.
4.2 Selection of the Experimental Texts
The lexical chaining approach to text analysis is highly attractive, since it is both robust,
and deals with whole texts. It is, though, a heuristic approach. For example, we do not
know whether it is an independent measure, or a reflection of a text’s genre.
To answer this question, we decided to analyse a set of longer texts. Since most work in
lexical chaining has considered shorter texts, results, though interesting, may not be
general. Consequently, a mixed range of texts of differing complexity were chosen.
There were several constraints on the selection of texts for the experiments. They had to:
1. be analysable within the constraints of the current implementation;
2. be available electronically;
3. be several thousand words in length;
4. have demonstrably different complexities.
A range of texts of differing complexity was selected from those available on the Internet,
or CD-ROM. These varied from children's books such as “Alice in Wonderland” to more
challenging works such as Kant's “Critique of Pure Reason”. The texts chosen are listed
in table 4-1 below.
Table 4-1: Texts Selected

Alice’s Adventures In Wonderland (Lewis Carroll)
Through The Looking Glass (Lewis Carroll)
Pride And Prejudice (Jane Austen)
Moby Dick (Herman Melville)
Lectures on the Industrial Revolution in England (Arnold Toynbee)
The Critique Of Pure Reason (Immanuel Kant, translated by J. M. D. Meiklejohn)
4.3 Reading Complexity of the Texts
The books used in these experiments were selected as representing a range of literary
complexity. Books by Lewis Carroll are commonly read to junior school children, Austen
and Melville are high school texts, whilst Kant and Toynbee are not usually encountered
until University. Thus, we can expect intuitively that University level texts are harder to
read than those aimed at school children. Nonetheless, some independent confirmation of
their reading ease is desirable.
Readability is often measured by teachers to determine the suitability of books for pupils
of different reading abilities. Readability formulae (e.g. Harrison 1980) aim to predict the
level of a text’s reading difficulty by calculating statistics, such as sentence length and
mean syllables per word, from the text. They do not consider content, so need to be
applied with caution.
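Such a formula can be sketched directly. The coefficients below are those of the standard Flesch-Kincaid grade level formula; the syllable counter is a crude vowel-group heuristic for illustration, not the counter used by any particular word processor:

```python
import re

def count_syllables(word):
    # Crude heuristic: count vowel groups, discount a silent final 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text):
    # Standard formula: 0.39 * (words/sentences)
    #                 + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Applied to a short simple sentence and a long polysyllabic one, the sketch reproduces the expected ordering of grade levels.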
Harrison (1980) describes ten readability measures, including the Flesch formula, and the
Gunning FOG formula. Harrison (1980) reports a study by Lunzer and Gardner that
shows that seven of the readability formulae are approximately correlated with pooled
teachers’ assessments of text reading levels.
Karlgren and Cutting (1994) showed that texts may be simply classified into fifteen
different genres. They used the statistical technique of discriminant analysis on twenty
parameters. These included sentence length, proportion of pronouns, average characters
per word, and number of relative pronouns. They applied this method to classify the five
hundred texts from the Brown corpus, which have been manually classified as belonging to
different genres. Karlgren and Cutting comment that readability measures work well to
discriminate text types since they include the features that proved most salient in their
experiments: sentence length, word length, and characters per word.
The Flesch-Kincaid grade level measure computes readability based on the average
number of syllables per word and the average number of words per sentence. It is a
widely used metric, included in both Microsoft Word and Corel’s WordPerfect word
processors, so it also has the advantage of convenience. The Flesch-Kincaid grade level
was consequently calculated for the initial 1000 lines of the books in
table 4-1. The 1000 line limit was chosen since this represents a reasonable subset of the
book that is sufficient to capture its style, assuming that this is approximately uniform
throughout the text. The results are shown in table 4-2 below.
Table 4-2 shows that the books represent a range of reading complexity. They also
demonstrate the internal consistency of the measure as two books of a similar style by the
same author (“Looking Glass” and “Alice in Wonderland”) have similar Grade levels.
We now move on to analyse the data produced from the lexical chains identified in the texts.
Table 4-2: Reading Complexity of the Texts
Book Title
Alice's Adventures In Wonderland
Through The Looking Glass
Pride And Prejudice
Moby Dick
Lectures on The Industrial Revolution in England
The Critique Of Pure Reason
4.4 Determination of the Lexical Cohesive Relationships
We used an algorithm based on those of Morris and Hirst (1991), and StOnge (1995) to
identify the lexical cohesive relationships in the texts. This is described in Chapter 2. Four
relations were examined:
1. The links between identical words (hence ID)
2. Links between words that are not identical, but are members of the same
Roget category (hence CAT)
3. Links between words that are members of the same group of categories in
Roget, but not in the same category (hence GRP)
4. Links through one level of internal thesaural pointers (hence ONE)
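The first three relations can be illustrated with a toy classifier. This is a hypothetical sketch: the miniature word index and its category and group numbers are invented for illustration, and the ONE relation, which requires the thesaurus's internal pointers, is omitted:

```python
# Invented miniature stand-in for Roget's index:
# word -> set of category ids, and category id -> group id.
CATEGORIES = {
    "car":      {272},
    "vehicle":  {272},
    "carriage": {272, 274},
    "travel":   {266},
}
GROUPS = {272: 27, 274: 27, 266: 27}

def link_type(w1, w2):
    """Return the strongest lexical linking relation between two words."""
    if w1 == w2:
        return "ID"                        # identical words
    c1, c2 = CATEGORIES.get(w1, set()), CATEGORIES.get(w2, set())
    if not c1 or not c2:
        return "NONE"                      # word not in the thesaurus
    if c1 & c2:
        return "CAT"                       # shared Roget category
    if {GROUPS[c] for c in c1} & {GROUPS[c] for c in c2}:
        return "GRP"                       # same group, different category
    return "NONE"                          # ONE-level pointers not modelled
```

The function tests the relations in order of strength, mirroring the ordering of the four relations above.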
4.5 Analysis 1: Link Distribution between Documents.
This first analysis presents unprocessed sums of the link types. That is, all the lexical
chains found in the documents were examined, and simple sums made of the types of
lexical linking relationships found. This is shown in Graph 4-1 below.
Our initial hypothesis was that there would be more “weaker” linking relationships (such as
GRP or ONE), since these can connect to a greater number of words than the identical
word or same category relations. However, this was not the case.
Simple word identity (ID) is the most common lexical linking relationship found.
Following that, we find Roget category entry (CAT), then Roget group membership
(GRP). The ONE relationship is least frequent. Words not contained in the thesaurus are
shown as NONE.
All the books in the experimental corpus show approximately the same total link
distribution. Since the books represent increasingly complex texts, we have shown that
the proportion of links of different types found in a text is broadly independent of the
complexity of that text. We have also reason to question the value of the more complex
thesaural relationships. The ONE level of indirection relation is sufficiently rare that one
may question whether it is worth calculating. Indeed, its cost of calculation as measured
in terms of program run time, far outweighs its potential benefits. Consequently, it is not
considered after this chapter.
We may speculate as to why the ONE level of indirection is found so rarely. A possible
explanation could be that Roget is a tool to aid writers. As such, ONE pointers indicate
nuances of word senses in related categories. Possibly writers choose related words to
create a more cohesive document. If that is the case, this might be recognised through the
other relationships of CAT, GROUP, and ID, since lexical chaining (as currently
formulated) gives no insight into the writing process.
Graph 4-1: Link Type Vs Book Title
Looking at the proportions of link types in Graph 4-1, it is clear that an algorithm that
only uses identical words could give useful performance: it will usually discover the
majority of relationships in a text, and will do this accurately. Although one may expect
better performance as more relationships are added, word sense ambiguity comes into
effect, and this will cause inappropriate linking. Indeed, Sanderson (1994) has concluded
from Information Retrieval experiments that word sense disambiguation is likely to
degrade performance unless it is more than 90% accurate. The problem of word sense
ambiguity may explain why Hearst (1994) reports that her text segmentation algorithm
‘TextTiling’ performed better when it was not aided by thesaural relationships.
Stairmand (1996) compared a text segmentation algorithm that used lexical chains
(derived from WordNet –see Chapter 2) to TextTiling. He found that TextTiling gave
superior performance. Although Stairmand (1996) offers no explanation for this, it seems
likely that TextTiling could perform better with less –but perfect– data than Stairmand’s
approach with more –but less accurate– data.
Since the identical word relationship is so important, it is almost certainly an error to
eliminate words not found in the thesaurus during the pre-processing stage. The rationale
for this is that such words cannot form chains with other words. However, they will form
chaining relationships with themselves, and this may form a significant aspect of a text.
How this error may be corrected remains, however, a research problem.
4.6 Analysis 2: Link Distributions Change across Different Document Types
Now we need to consider whether link distributions change across the different document
types. This is done by calculating the distances between each pair of words for which
there is a lexical linking relationship in each of the six experimental texts shown in Table
4-1. Comparative analysis between the texts is only possible if we compensate for their
different lengths. This is done by normalising the number of lexical links that share a
range of interword distances as a percentage of the total.
We do not know whether the distribution of different links varies in the same way for
each link type. Thus, the percentage calculation was divided according to the type of link
considered. We can then plot the percentages of each link type against the distance
between the words in that link.
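The normalisation step above can be sketched as follows, assuming the links for one text are held as (distance, link type) pairs:

```python
from collections import Counter

def distance_distribution(links, link_kind):
    """Percentage of links of one kind at each inter-word distance.

    links: iterable of (inter-word distance, link type) pairs for a text.
    Normalising by the total count for that link type lets texts of
    different lengths be compared directly.
    """
    distances = [d for d, kind in links if kind == link_kind]
    counts = Counter(distances)
    total = len(distances)
    return {d: 100.0 * n / total for d, n in counts.items()}
```

Plotting the returned percentages against distance, per link type, reproduces the form of Graphs 4-2 to 4-4.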
Graph 4-2: Identical Links (%) Vs Inter-word Distance
Graph 4-3: Percentage of Category Links Vs Inter-word Distance
Graph 4-4: Percentage of Group Links Vs Inter-word Distance
Graphs 4-2, 4-3, and 4-4 show the results of this analysis by link type (the ONE link type
is excluded, as it does not occur frequently enough to generate consistent data). As can be
seen, the
the percentage distributions are almost identical for all the texts, and for all the linkage
types. This means that the type of text does not affect link creation in lexical chains. It
also follows that the distance between words in a text is independent of the thesaural
relationships sought between the words.
It can also be seen that Morris and Hirst had little justification for applying special status
to the identical word relation, as it follows a similar distribution to the other thesaural
links (Section 3.3).
4.7 Related Work
A mathematical model showing an exponentially decaying relationship between co-occurring
words in English has been described by Beeferman et al. (1997). Their work is
empirical and based upon a statistical analysis of “trigger pairs” of words appearing in
five million words of the WSJ corpus (a collection of articles from the Wall St. Journal).
Beeferman et al. (1997) divide trigger pairs into self, and non-self triggers. Self triggers
are identical words that are repeated in a text. Non self triggers are non-identical words
for which a statistical pattern of co-occurrence can be identified. Graph 4.5 below
(reproduced from Beeferman et al. 1997) shows the observed distance distributions.
Graph 4-5 : Non-Self Triggers (Beeferman et al. 1997)
Graph 4-6 : Self Triggers (Beeferman et al.1997)
It appears that Beeferman et al.’s (1997) self triggers are the same as the identical word
relationship described by Morris and Hirst (1991). Therefore, they have a similar
distribution pattern. Of course, we have shown above that this relationship is found in
several texts of differing complexities.
Beeferman et al.’s (1997) non-self triggers are more interesting. The notion is more
powerful than thesaural linking, since it captures associations that could not have been
previously stored in a database. Examples of such relationships are highlighted in italics
in table 4.3 below.
Table 4-3: Samples of Trigger Pairs (Beeferman et al.1997)
Non-self triggers display the same distribution characteristics as the CAT, GROUP, and
ONE lexical linking relationships; however, they were derived in a completely different
way. Beeferman et al.’s (1997) work is based on statistical analysis of large corpora,
whereas lexical linking uses a thesaurus to predict relationships without prior analysis.
Beeferman et al.’s (1997) data support the hypothesis that the distance between related
words in texts is independent of text genre. As Beeferman et al. (1997) used completely
different methods, it is unlikely that Graphs 4-4 to 4-7 are artefacts of the algorithm used.
Consequently, we can conclude that inter-word relationships are independent of text style.
4.8 Conformance to Zipf’s Law
The data in Graphs 4-5 to 4-7, and Beeferman et al.’s (1997) data, display the characteristic
power curve of Zipf’s law that is often found in natural languages (Chapter 2).
Graph 4-7 demonstrates Zipf’s law in the experimental data of analysis two. Specifically,
data for Moby Dick has been extracted from Graph 4-4, and re-plotted using double
logarithmic scales. This shows the distance between words in the same thesaural category
Vs ranked number. A straight line can be observed within the 95% confidence limits
drawn by the statistical package SPSS.
The data in graphs 4-4 to 4-7 display the same power curve. We can conclude that this
conforms to Zipf’s law, as did the Generic Document Profile (Section 3.7).
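Such conformance can be checked without a statistical package: in the spirit of the double-logarithmic re-plot, a least-squares line fitted to log frequency against log rank should have a slope near -1 for Zipfian data. A minimal sketch:

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Frequencies are ranked in decreasing order; a slope close to -1
    is the classic Zipf signature seen as a straight line on
    double-logarithmic axes.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On data generated by an exact 1/rank law the fitted slope is exactly -1; real link-distance data will scatter around that line within confidence limits.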
Graph 4-7 : Moby Dick. Number of Same Category words Vs Rank
4.9 Conclusion
Lexical cohesion is a property of the words in a text. Relationships that link words have
been termed lexical links. Links may be composed into chains, and such lexical chains
have great potential utility in text processing tasks, such as information retrieval, text
similarity detection, or text summarisation.
A major concern is that the types of lexical chains found in a text may depend on the
style of that text. If this had been true, we would not have been able to base a measure of
text similarity directly on lexical chains: it would have needed to be mediated by a
determination of text genre.
This concern has been rejected experimentally by analysing several book-length texts.
These were selected to be no more recent than Roget’s 1911 thesaurus. This maximised
the applicability of the lexical chaining algorithm. In addition to intuition, the books were
shown to be of different reading difficulty by comparing them using the Flesch-Kincaid
grade level readability measure.
An analysis of the frequency distribution of the lexical links found in the mini-corpus
showed strikingly similar results for all the link types. This supports the hypothesis that text analysis
measures based upon lexical cohesive links will be applicable to different styles of texts.
Thus, the text similarity technique discussed in Chapter 3 is capable in principle of
determining the similarity of texts about the same subject, but written in different styles.
There are several implications that follow from these findings. Firstly, for the existing
work on lexical chains reported in Chapter 2 it follows that no special attention should be
paid to the texts used in the lexical chaining experiments. If lexical links are independent
of text genre, then the results of Stairmand (1996), StOnge (1995), Green (1997), and
Barzilay and Elhadad (1998) would also be replicated if texts of different genres had been
chosen. Secondly, regarding future work, no particular control need be applied to ensure
the uniformity of text genres. This is particularly important for the experiments reported
in Chapter 6, where texts are randomly selected from the Internet for comparative
experiments on human similarity judgements.
We now proceed to Chapter 5, which considers the issue of word sense ambiguity in
relation to Hesperus.
Chapter 5. Word Sense Disambiguation and Hesperus
5.1. Introduction
This chapter addresses the issue of word sense disambiguation (WSD) in relation to the
derivation of a text’s Generic Document Profile. Word sense ambiguity arises since words
may be used in more than one sense, and so, for example, approximately 33% of the
unique words in Roget’s are found in more than one thesaural category, and some words
are found in many categories (Appendix VII). This has considerable implications for a text’s
Generic Document Profile (GDP). It will be recalled from Section 3-6 that this involves
analysing an input text word by word, identifying the words’ thesaural categories and
linking them to chains of words related in meaning. Should an inappropriate word sense
be selected, two related problems arise: firstly it strengthens the wrong chain, and
secondly it correspondingly weakens the correct chain. This affects the overall accuracy
of the performance of the Generic Profile, and subsequently weakens the text similarity
match. The question then is whether we should try to choose a more valid word sense from
amongst the possible candidates, or accept the inaccuracies of the lexical chaining
approach. That is, since word sense disambiguation is an unsolved problem, it is quite
feasible that an attempt to solve it within Hesperus will introduce greater imprecision to
the text similarity match than that caused by the problem itself. This chapter addresses
that question.
Figure 5-1: Structure of the Thesis (Literature Review; Hesperus: A system for comparing
the similarity of texts using lexical chains; The General Nature of Lexical Links; Word
Sense Disambiguation and Hesperus; Evaluating Hesperus)
This chapter proposes improvements to the text similarity process described in Chapter 3
by including a dedicated sense disambiguation phase. This is independently assessed, and
subsequently developed as an additional module for Hesperus. The improved system is
shown graphically in Figure 5-2, and is subsequently evaluated in Chapter 6.
Okumura and Honda (1994) showed that the lexical chaining process implicitly provides
word sense disambiguation (Section 2.4.5). We may contrast this to explicit word sense
disambiguation, where we attempt to determine word senses prior to, and independent of,
lexical chaining. The objective here would be to avoid spurious word associations and
lexical links, by ensuring that words are only used in their intended sense.
Given the natural tendency of lexical chaining to disambiguate word senses, there is a
question as to whether explicit disambiguation should be attempted. As pointed out in
Section 2.2, Sanderson (1996) has argued that WSD will negatively affect Information
Retrieval (IR) performance unless it is more accurate than 90% (Chapter 2). Given that
Kilgarriff (1998) has only demonstrated human sense tagger agreement of 91% over fine
sense distinctions (agreement is higher where only coarse, homographic, distinctions are
considered; Ng et al. 1999), it seems questionable that any algorithmic system can
approach the indicated level of performance for IR. Text similarity matching is not,
however, IR. Consequently the effect of WSD
needs to be specifically addressed.
The objective of this chapter is to examine whether an explicit WSD system compatible
with Hesperus can be produced. If not capable of disambiguation accuracy greater than
90%, it is certainly desirable that its performance is comparable to the current state of the
art. The impact of explicit WSD performance on a text similarity task can then be
addressed (in Section 6.3.4).
This chapter is arranged as follows: firstly, the problem of evaluating word sense
disambiguation in Hesperus is discussed. Secondly, plausible options for the “local
disambiguator algorithm” are described. Next, their evaluation is given within the context
of “Senseval”, an international word disambiguation competition. Finally, a local sense
disambiguation module for Hesperus is described, and conclusions drawn.
5.2. The Problem of evaluating the effects of Word Sense Disambiguation
in Hesperus
Explicit word sense disambiguation was proposed in Section 3.9 as a pre-processing
phase that would feed sense disambiguated words as input to Hesperus. This would
improve the quality of the lexical chains produced from a text, by eliminating spurious
associations due to words being linked in other than their intended senses. The principal
problem with this approach is that word sense disambiguation is an unsolved problem
(Section 2.4.2), and the evaluation of possible approaches requires a sense tagged corpus
(Section 2.4.4). Senseval was selected for this purpose, as it was both timely, and purpose
designed for comparative evaluation.
Senseval (Section 2.4.3) was set up as a competition to allow the evaluation of the relative
performance of different word sense disambiguation approaches. Potential participants
firstly received (May 1998) a set of “dry run” data that gave the format of the
competition, and relevant sections of the Hector dictionary on which the Senseval
competition was based (Kilgarriff and Palmer 2000). A set of training data was
circulated in June 1998 tagged with senses for the forty-one words that were to be the
focus of the competition. Data for evaluation were circulated at the end of July 1998, and
two weeks were allowed for participating systems to tag this data and return results.
Participation in Senseval allowed a detailed evaluation of the several word sense
disambiguation techniques compatible with Hesperus. Consequently, a separate prototype
module was developed and entered in the Senseval competition as “the Sunderland
University Similarity System” or SUSS.
5.3. The motivation for Hesperus participating in Senseval as SUSS
SUSS's principal objective in Senseval was to evaluate different disambiguation
techniques suitable for use within Hesperus. These could then improve the performance
of a future version of Hesperus as a local disambiguator (Section 5-5).
A derived objective was to maximise the number of successful disambiguations. This was
both a requirement for success in the competition, and an objective for Hesperus, where
incorrect disambiguation is a possible source of inaccuracy.
SUSS extensively exploited the Hector machine readable dictionary entries. There were
two reasons for this: firstly, Hector dictionary entries are extremely rich, and allowed us
to consider disambiguation techniques that would not have been possible using Roget
alone; secondly, Hector sense definitions were much finer grained than those used in
Roget. A system that used Roget would consequently have been at a considerable
disadvantage since it would not have been able to propose exact Hector senses in the
competition. The use of Hector also allowed us to envisage the effect of better informed
WSD on future extensions to Hesperus.
Now we will look at the Hesperus paradigm, and the strategy used to develop SUSS.
The SUSS development strategy.
SUSS was developed using an iterative development strategy designed to maximise
performance as measured by the total number of successful disambiguations. The strategy
was as follows:
1. A basic system was implemented that processed the training data.
2. A statistics module was implemented that displayed disambiguation
effectiveness by word, word sense, and percentage precision.
3. As different disambiguation techniques were developed, effectiveness was
measured on the whole corpus.
Techniques that improved performance were further developed. Those that degraded
performance were dropped. Since the competition was time limited, there was no time to
pursue interesting, but unsuccessful approaches.
5.4. SUSS: The Sunderland University Senseval System
SUSS was a multi-pass implementation that reduced the number of candidate word senses
by repeated filtering. Following an initialisation phase, different filters are applied to
select a preferred sense tag, or eliminate inappropriate ones.
The order of filter application is important. Word and sense specific techniques are
applied first; more general techniques are used if these fail. Specific techniques are not
likely to affect anything other than their prospective targets, whereas general methods
introduce probable misinterpretation over the entire corpus. For example, a collocate such
as “brass band” uniquely identifies that sense of “band”, with no impact on other word
senses. Other techniques required careful assessment to ensure that their overall effect
was positive. This was part of a structured development strategy.
A data flow diagram representing the overall system using standard SSADM notation
(e.g. Weaver 1993) is given in fig 5.2 overleaf, and descriptions of the filters are given in
Section 5.3.
SUSS Initialisation Phase
SUSS used a preparation phase that included dictionary processing and other preparations
that would otherwise be repeated for each lexical sample to be processed. The Hector
dictionary was loaded into memory using a public domain program that parses SGML
instances. This made the definition available as an array of homographs that is further
divided into an array of finer sense distinctions. Each of these contained fields, such as
the word sense definition, part of speech information, plus examples of usage.
The usage examples were used in the “example comparison filter” and the “semantic
relations filter” techniques (described below). They were reduced to narrow windows W
words wide centred on the word to be disambiguated, from which stopwords (Salton and
McGill 1983) had been eliminated. This facilitated comparison with identically
structured text windows produced from the test data. The main SUSS algorithm is as
follows.
Figure 5-2 : SUSS System design
Algorithm 5-1: SUSS Algorithm Processing Phase

FOREACH sample:
    Filter possible entries as collocates.
    DONE IF there is only one candidate sense.
    Filter remaining senses for information extraction patterns.
    DONE IF there is only one candidate sense.
    Filter remaining senses for idiomatic phrases.
    DONE IF there is only one candidate sense.
    Eliminate stopwords from sample.
    Produce window W words wide centred on target word.
    FOREACH example in the Hector dictionary entry:
        Match the sample window against the example window.
    Select the sense that has the highest example matching score.
    IF no unique match found, return the most frequent sense
        (or first remaining dictionary entry).
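The multi-pass structure of Algorithm 5-1 can be sketched as a generic filter cascade. This is a hypothetical sketch rather than the SUSS source: the filter callables and the sense-frequency table are assumptions standing in for the concrete filters described below:

```python
def disambiguate(sample, senses, filters, sense_frequency):
    """Apply each filter in turn; stop as soon as one sense survives.

    filters: callables mapping (sample, senses) -> surviving senses,
    ordered from most specific to most general.
    sense_frequency: fall-back ranking when no unique match is found.
    """
    for f in filters:
        survivors = f(sample, senses)
        if len(survivors) == 1:
            return survivors[0]        # DONE: only one candidate sense
        if survivors:
            senses = survivors         # carry the reduced set forward
    # No filter produced a unique match: take the most frequent sense.
    return max(senses, key=sense_frequency.get)
```

Applying word- and sense-specific filters first, as the text argues, keeps a general fall-back from misinterpreting samples that a specific cue would have settled.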
We now go on to describe the specific techniques tested.
Collocation Filter
Collocations are short, set expressions which have undergone a process of lexicalisation.
For example, consider the collocation ‘brass band’. This expression, without context, is
understood to refer to a collection of musicians, playing together on a range of brass
instruments, rather than a band made of brass to be worn on the wrist. The Hector
dictionary encodes such collocate expressions as distinct senses of the word.
Given the set nature of collocations, we decided to look for these senses early in the
disambiguation process since this would be a simple method of identifying or eliminating
them from consideration.
The calculation of sense occurrence statistics was designed to counter a perceived deficiency in Hector, where the
ordering of senses did not appear to match that of sense frequency in the corpus.
The collocation identification module, therefore, worked as a filter using simple string
matching. If a word occurrence passing through the module corresponded to one of the
collocational senses defined in the dictionary, or could be morphologically reduced to
such an entry, it would be tagged as having that sense. If none of these senses were
applicable, however, all senses taking a collocational form were filtered out.
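A minimal sketch of such a filter, assuming each candidate sense carries either a collocational form or None (the sense tags are invented for illustration):

```python
def collocation_filter(sample, senses):
    """senses: list of (sense tag, collocational form or None).

    If a collocational form appears verbatim in the sample, that sense
    is selected outright; otherwise every sense taking a collocational
    form is filtered out of further consideration.
    """
    for tag, colloc in senses:
        if colloc and colloc in sample:
            return [(tag, colloc)]
    return [(tag, c) for tag, c in senses if c is None]
```

So “brass band” in the sample uniquely selects that sense of “band”, while its absence eliminates the collocational sense entirely.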
Information Extraction Pattern Filter
The Information Extraction filter refers exclusively to enhancements to the Hector
dictionary entries specifically to support word sense disambiguation. The Hector
dictionary is primarily intended for human readers. Many entries contain a clues field in a
restricted language that indicates typical usage. Examples include phrases such as “learn
at mother's knee, learn at father's knee, and variants”, or “usu. on or after”. Such phrases
have long been proposed as an important element of language understanding (Becker
1975). These phrases were manually converted into string matching patterns and
successfully used to identify individual senses.
For example, “shake” contains the following:
<idi>shake in one's shoes, shake in one's boots</idi>
<clues>v/= prep/in pron-poss prep-obj/(shoes,boots,seat)</clues>
This can be used to convert the idiom field (using PERL patterns) as follows:
<idi>shake in \w* (shoes|boots|seat)</idi>
This may now be used to match against any of the idiomatic expressions “shake in her
boots”, “your boots”, etc., as morphological variants would previously have been reduced
to base forms.
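The converted pattern can be applied directly with a regular-expression engine. A sketch in Python rather than PERL, using the pattern from the example above (the possessive pronoun collapses to a single-word wildcard because morphological variants have already been reduced to base forms):

```python
import re

# Pattern derived from the Hector idiom field, as in the example above.
idiom_re = re.compile(r"shake in \w+ (?:shoes|boots|seat)")

def matches_idiom(text):
    # True if the idiomatic expression occurs anywhere in the text.
    return bool(idiom_re.search(text))
```

A single compiled pattern thus covers “shake in her boots”, “shake in your shoes”, and the other variants.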
We call a related method “phrasal patterns”. A phrasal pattern is a non-idiomatic multiple
word expression that strongly indicates use of a word in a particular sense. For example,
“shaken up” seems to occur only in past passive forms. Adding appropriate phrasal
patterns to a dictionary sense was found to increase disambiguation performance for that
sense. The majority of phrasal patterns were manually derived from the Hector dictionary
entries. Others were identified by observing usage patterns in the dictionary examples, or
the training data.
Collocation and other phrasal methods are important since they are tightly focused on one
word, and on one sense that word may be used in. They do not affect other word senses,
and cannot influence the interpretation of other words.
Idiomatic Filter
Idiomatic forms identify some word senses. Unlike collocations, however, idiomatic
expressions are not constant in their precise wording. This made it necessary to search for
content words in a given order, rather than looking for a fixed string. An idiom was
considered present in the text if a subset of the content words were found exceeding a
certain (heuristically determined) threshold value. For example, the meaning of “too many
cooks” is clear, without giving the precise idiom.
Dictionary entries that contained idiomatic forms were processed as follows: Firstly, two
word idioms were checked for specifically. If the idiom was longer, stopwords were
removed from the idiomatic form listed, and remaining content words compared in order
with words occurring in the text. If 60% of the content words were found in the region of
the target word, the idiomatic filter succeeded, and senses containing that idiom were
selected. Otherwise, senses containing that idiomatic form were excluded from further
consideration.
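The idiomatic filter just described can be sketched as below. The 60% threshold and the in-order content-word matching follow the text; the stopword list and function names are our own illustrative choices.

```python
# Function words stripped from the listed idiomatic form before matching.
# This small list is illustrative only.
STOPWORDS = {"in", "at", "the", "a", "of", "to"}

def idiom_present(idiom: str, context_words: list, threshold: float = 0.6) -> bool:
    """Return True if enough of the idiom's content words occur,
    in order, in the words surrounding the target word."""
    content = [w for w in idiom.lower().split() if w not in STOPWORDS]
    if not content:
        return False
    pos = 0
    found = 0
    for w in content:
        try:
            pos = context_words.index(w, pos)  # must appear in order
            found += 1
        except ValueError:
            pass
    return found / len(content) >= threshold

# "too many cooks" is recognisable without the full idiom
# ("... spoil the broth"):
ctx = "there were too many cooks in the kitchen".split()
print(idiom_present("too many cooks spoil the broth", ctx))
```

Three of the five content words are found in order, which meets the 60% threshold, so the filter fires even though the idiom is incomplete.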
Example Comparison Filter.
The Example Comparison Filter tries to match the examples given in the dictionary
against the word to be disambiguated, looking at the local usage context. It assigns a score
for each sense based on identical words occurring in the text and dictionary examples and
their relative positions. We take a window of words surrounding the target word, with a
specified width and specified position of the target, in the text and in a similar window
from each dictionary example.
For each example in each sense, all the words occurring in each window are compared
and, where identical words are found, a score, S, is assigned, where

S = Σ w∈W dS(w) · dE(w)

Equation 5-1: Example Comparison Score
and w is a word in window W, and dS and dE are functions of the distance of the word
from the target word in the sample and example windows respectively, such that greater
distances result in lower scores. The size of the window was determined empirically.
Window sizes of 24, 14, and 10 words were tried. Larger window sizes increased the
probability of spurious associations, and a window size of ten words (five words before
and five words after the target word) was selected as optimal.
When all the example scores have been calculated for each word sense, the sense with the
highest example score is chosen as the correct sense of that occurrence.
In cases where this does not produce a result, the most frequently occurring sense (or first
dictionary sense) that has not been previously eliminated is chosen.
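Equation 5-1 might be sketched as follows. The text does not specify the distance functions dS and dE, so reciprocal-distance weights are assumed here purely for illustration, and the function names are our own.

```python
def window(words, target_index, size=10):
    """Five words either side of the target (window size ten), mapped to
    their distance from the target. A repeated word keeps only its last
    position -- a simplification."""
    half = size // 2
    return {w: abs(i - target_index)
            for i, w in enumerate(words)
            if i != target_index and abs(i - target_index) <= half}

def example_score(sample_words, sample_target, example_words, example_target):
    """Sum dS(w) * dE(w) over identical words found in both windows.
    Reciprocal-distance weights make greater distances score lower."""
    ws = window(sample_words, sample_target)
    we = window(example_words, example_target)
    return sum((1.0 / ws[w]) * (1.0 / we[w]) for w in ws if w in we)

src = "the engine began to shake violently on the road".split()
ex = "the old car would shake on rough ground".split()
print(example_score(src, 4, ex, 4) > 0)  # shared nearby words score
```

Words such as “the” and “on” appear near the target in both windows, so the example scores above zero, whereas an unrelated example scores zero.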
Other Techniques Evaluated.
One of the objectives of SUSS was to evaluate different disambiguation techniques.
Below we describe two methods that were evaluated, but not used in the final system,
since they led to decreased overall performance.
Part of Speech Filter
Wilks and Stevenson (1996) have claimed that much of sense tagging may be reduced to
part-of-speech tagging. Consequently, we used the Brill (1992) Tagger on the subset of
the training data set that required part-of-speech discrimination. This should have
improved disambiguation performance by filtering out possible senses not appropriate to
the assigned part of speech. However, due to perceived tagging inaccuracy, it was just as
likely to eliminate the correct word sense. Consequently, it did not make a positive
contribution (Klinke 1998).
Another routine that used the part-of-speech tags attempted to filter out the senses of
words marked as noun modifiers by the dictionary grammar labels where the following
word was not marked as a noun by the tagger. This routine also checked words that
contained an 'after' specification in the grammar tag and eliminated these senses where
the occurrence did not follow the word given. However, it gave no overall benefit to the
results either. One possible cause of this is in occurrences where there are two modifiers
joined by a conjunction so that the first is, legitimately, not followed immediately by a
noun.
Semantic Relations Filter
The Semantic Relations Filter is an extension of the example comparison filter that uses
overlapping categories and groups in Roget’s thesaurus, rather than identical word
matching. This technique was prompted by in-sentence thesaural based disambiguation
described in Wilks, Slator and Guthrie (1995), and attributed to Masterman in the late
1960’s. It involves looking for strong, same category, or identical word relationships in
the window surrounding the current word. Where these are found alternative senses are
eliminated. This may lead to partial disambiguation and should allow us to recognise that
“accident” is used in the same sense in “car accident”, and “motor-bike accident”, since
both are means of transport.
Appropriate scores are allocated for each category in Roget that the test sentence window
has in common with the dictionary example window. As in the example comparison, the
sense that contains the highest scoring example is selected as the best.
Disappointingly, this technique finds many spurious relations where words in the local
context are interpreted ambiguously. This led to an overall performance degradation over
the test set, and so the technique was not part of the final SUSS algorithm.
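The category-overlap scoring behind the Semantic Relations Filter can be sketched as below. The toy category assignments are illustrative only; they are not Roget's actual category numbering.

```python
# Illustrative thesaural category assignments (not Roget's real entries).
ROGET = {
    "car":        {"vehicle"},
    "motor-bike": {"vehicle"},
    "accident":   {"chance", "disaster"},
}

def category_overlap_score(window_a, window_b):
    """Count thesaural categories shared between two context windows,
    rather than requiring identical words."""
    cats_a = set().union(*(ROGET.get(w, set()) for w in window_a))
    cats_b = set().union(*(ROGET.get(w, set()) for w in window_b))
    return len(cats_a & cats_b)

# "car accident" and "motor-bike accident" share the vehicle category,
# supporting the same reading of "accident" in both phrases:
print(category_overlap_score(["car"], ["motor-bike"]))
```

The same mechanism, applied to ambiguous context words, also explains the spurious relations noted above: every extra candidate category of a context word adds potential overlaps.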
Results Of The SUSS Evaluation.
Senseval was organised as a competition in order to allow the comparative evaluation of
different word sense disambiguation techniques. More importantly however, the
availability of the Senseval sense tagged (lexical samples) corpus permitted the
comparative assessment of the individual techniques possible within the Hesperus
framework.
The results given below consequently include both intra-system assessments of the
various SUSS methods, followed by relative assessments of performance in comparison
to other WSD systems that competed in Senseval. First the performance of the various
techniques are given on the Senseval dry run data, then the performance of SUSS relative
to other word sense disambiguation systems is given.
Dry Run Results
These results demonstrate the performance of SUSS techniques on the sense tagged
corpus distributed for system training. Following the development strategy outlined
earlier, several variations of the WSD techniques have been evaluated across the
entire Senseval corpus. This was especially important, as techniques that improved
performance on disambiguating senses of one word could cause an overall reduction in
performance across the corpus if they have negative effects on other words.
In table 5.1 (below) results are given for several combinations of techniques across the
Senseval training corpus of five thousand sense tagged samples. These are divided into
rows that correspond to the twenty-eight words whose senses have been tagged3.
The column headings in Table 5.1 are given the mnemonics, trlogall, origdictlog, noidlog,
th-log, synbrilllog, nostatslog, and lastlog. These are described below.
trlogall: This version of the algorithm is that reported in Ellman, Klinke, and Tait (1998) at the
Senseval workshop.
origdictlog: This variation of the algorithm tested the contribution of sense specific information
extraction patterns added to the dictionary. This was done by using the original,
unaugmented Hector dictionary.
noidlog: This variation of the algorithm tested the contribution of the example comparison filter.
This was done by disabling the filter so that sentence fragments are not matched against
those from the Hector dictionary definitions.
th-log: This series of tests used the semantic relations filter. That is, example matching was tried
using the semantic relations filter.
synbrilllog: In this series of tests entries were pruned from the dictionary using the part of speech
filter if they did not correspond to parts of speech used in the competition.
3 That is, there were approximately two hundred samples of each word embedded in different sentences, where the
numbered sense of the test word has been manually identified and marked-up.
nostatslog: This test series did not calculate word sense frequency statistics from the training
data. Consequently, the sense ordering used was that found in the dictionary. Since
dictionary entries are generally ordered by frequency of sense occurrence, calculating
statistics should have made no difference, and this was observed in a majority of cases.
However, statistical calculation made a clear contribution for several words (e.g. shake,
bitter, and sanction).
Chapter 5
Word Sense Disambiguation
Table 5-1: Disambiguation Success (% accuracy) vs. Word by method
[The table lists, for each of the twenty-eight tagged words (with sample count n), the
accuracy of the trlogall, origdictlog, noidlog, th-log, synbrilllog, and nostatslog
methods; the table body is not reproduced here.]
These results may be seen graphically in Graph 5.2 (percent score by sample; Accuracy %
plotted against word number).
Graph 5.2: Comparative Disambiguation Performance
Results of Senseval
The results of the Senseval competition are described in Kilgarriff and Rosenzweig
(2000). This gives multiple system analyses broken down by recall, precision, and
granularity. Recall is the number of correct disambiguations divided by the size of the
sample set; precision is the number of correct disambiguations divided by the number
attempted. Granularity is concerned with whether polysemous senses are
viewed as distinct, or whether only homographic distinctions are important, polysemous
senses being considered equivalent.
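The recall and precision measures just defined can be stated directly; the figures below are invented purely to illustrate the arithmetic, not taken from the Senseval results.

```python
def recall(correct: int, sample_size: int) -> float:
    """Correct disambiguations divided by the size of the sample set."""
    return correct / sample_size

def precision(correct: int, attempted: int) -> float:
    """Correct disambiguations divided by the number attempted."""
    return correct / attempted

# e.g. 60 correct answers over a 100-item sample, of which 80 were attempted:
print(recall(60, 100))    # 0.6
print(precision(60, 80))  # 0.75
```

Note that a system which abstains on hard cases can raise precision at the expense of recall, which is why Senseval reports both.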
Our concern here is not to discuss the comparative performance of SUSS, but to
demonstrate that the simple techniques used, which are suitable for inclusion in Hesperus,
are at least at the level of the current international standard.
This is shown graphically in figures 5-3, 5-4, and 5-5. Fig 5-3 shows the distribution of word
senses of “shake” and its collocations. No sense contributes more than 25% of the
usage found in the corpus. Consequently, the systems faced a substantial challenge, since
no single default sense could score well.
The recall performance of all the systems for the word shake is given in fig 5-4. SUSS is
ranked 9th out of 35 competing systems on this assessment. The competitors from 1st to 8th
are machine learning based systems that extensively exploit the training data.
System recall over the whole corpus (at fine granularity) is shown in fig 5-5. On this
measure SUSS was ranked 14th out of 39. This is comparable to the current level of
performance for dictionary based systems.
Figure 5-3: Distribution of the senses of “Shake”
Using Roget’s Thesaurus to determine the similarity of texts
Jeremy Ellman
Figure 5-4: System Comparison: “Shake”
Figure 5-5: Overall System Precision
5.5. Local Disambiguator
The experience from the SUSS prototype was incorporated into Hesperus as a local
disambiguator module. This would be an add-on utility designed to improve the
performance of the lexical chaining program. The local disambiguator takes raw text as
input, and passes “links” to the lexical chaining algorithm. Each link encapsulates no
more than one token from the input text, where a token contains one word or one
collocation. Each link contains information about the thesaural categories of which the
word may be a member, in addition to bookkeeping information. This includes the
sentence and paragraph number in which the word was found, and its surface form.
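The link record just described might look as follows; the field names are our own, and the Roget category number used in the example is illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Link:
    """One token (a word or a two-word collocation) from the input text,
    with its candidate thesaural categories and bookkeeping information."""
    surface_form: str      # the word or collocation as it appeared
    categories: List[int]  # candidate Roget category numbers
    sentence: int          # sentence number in which it was found
    paragraph: int         # paragraph number in which it was found

link = Link("bitter pill", [391], sentence=3, paragraph=1)
print(link.surface_form, link.categories)
```

Each link carries at most one token, so the chaining algorithm downstream never has to re-tokenise the text.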
Text input and pre-processing
The local disambiguator carries out various pre-processing activities preceding its role in
word sense disambiguation. These include looking up candidate words in Roget’s
thesaurus, collocation identification, analysis of morphological roots, and stopword
elimination. These will be described in turn, followed by the algorithm that specifies
their order of application.
These pre-processing activities interact, and are subject to empirical limits imposed by the
presence (or absence) of word forms in Roget. Program efficiency is also a consideration,
since looking words up in Roget requires disk access and is consequently slow.
Collocations are frequent word combinations that may modify or completely alter the
meaning of the component words. Collocations are less subject to ambiguity than single
words (Section 5.4), and Roget includes many examples. These include “bitter struggle”
where bitter modifies the meaning of struggle, and “bitter pill” which is usually used
metaphorically as one concept (bitter does not modify “pill”). It is consequently
advantageous to check for two word collocations in the thesaurus, since these may more
accurately identify a text’s concepts.
Verbs are often collocated with prepositions as phrasal verbs to indicate their sense more
clearly. However, the preposition may not be adjacent to the verb, and a full syntactic
analysis of the sentence would be required to recognise all verb collocations present in
Roget. For example, (1) and (3) below demonstrate the greater ambiguity of prepositions
not directly adjacent to the verb.
1. The accident seized the engine up.
2. The accident seized the engine.
3. The thief seized the car up the road.
4. The accident seized up the engine.
Note that recent editions of Roget’s thesaurus (e.g. 1987) contain many, but not all
inflected collocations. For example “seizing up” may be found in the index, whilst
“seized up” is not.
The cost of looking for multi-word collocations and the increased risk of ambiguous
interpretations outweighs the benefits of their use for Hesperus. Firstly, they are found far
less frequently than two word collocations (Appendix VII), and secondly longer
collocations are often stored in Roget in a generalised form (e.g. “afraid of one's own
shadow”) that could not be recognised without sophisticated and unreliable linguistic
processing. Collocations are consequently limited to adjacent word pairs.
A word’s morphological root is its simplest form without possible inflections. For
example, the root of “giving” is “give”. Roget contains both inflected and uninflected
word forms. If a word is not present in Roget, its morphological root is determined using
the algorithm described in Winograd (1972). More powerful programs are readily
available (such as PC-KIMMO; Antworth 1993), but their sophistication incurs
increased cost, whilst Winograd's (1972) algorithm may be implemented as a short
routine.
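A lookup-driven stemming step of this kind can be sketched as below. This is not Winograd's (1972) algorithm itself (which handles more cases, such as consonant doubling); it only illustrates the strategy of stripping a suffix and checking candidate roots against the word forms present in Roget.

```python
def morphological_root(word: str, lexicon: set) -> str:
    """Return the first candidate root found in the lexicon (e.g. the word
    forms present in Roget); if none is found, return the word itself."""
    candidates = [word]
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            candidates += [stem, stem + "e"]  # try restoring a final "e"
    for c in candidates:
        if c in lexicon:
            return c
    return word

# The root of "giving" is "give":
print(morphological_root("giving", {"give", "boot"}))
```

Because the candidates are checked against the lexicon, “giving” correctly yields “give” rather than the bare stem “giv”.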
Stopwords are common function words (such as prepositions and conjunctions) that have
been found to add little to the content of a document for Information Retrieval purposes
(Salton and McGill 1983). Words may be quickly identified as stopwords from a list held
in memory, assuming that they are not part of collocations. The stopword list used was
that defined by the SMART Information Retrieval system (Buckley 1985), as this is
widely used.
The algorithm that generates links from ASCII text is given below. This is used to
populate the local disambiguator, which will be discussed next.
Algorithm 5-2: Generate Links.
WHILE input-file NOT end-of-file
    READ next word AS W1
    UNLESS end-of-sentence
        READ next word AS W2
        //simple collocation ?
        IF Categorise collocation W1-W2 succeeds
            RETURN Link(W1-W2)
        Find Morphological-root(W1) AS R1
        //inflected collocation ?
        IF Categorise collocation R1-W2 succeeds
            RETURN Link(R1-W2)
    UNLESS Stopword(W1)
        //simple word ?
        IF Categorise W1 succeeds
            RETURN Link(W1)
        //morphological root ?
        ELSE Categorise R1 AND RETURN Link(R1)
The procedure Categorise returns the thesaural categories of a word (or collocation).
The procedure Morphological-root returns the word without any inflections it may
have; if none can be found, the word itself is returned.
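Algorithm 5-2 might be rendered in Python roughly as follows. The `categorise`, `root`, and `is_stopword` procedures are supplied as functions (here backed by toy tables), links are simplified to `(form, categories)` pairs, and sentence-boundary handling is omitted; all of these are our own simplifications.

```python
def generate_links(words, categorise, root, is_stopword):
    """Scan the word stream, preferring two-word collocations (plain, then
    with the first word reduced to its root) over single words."""
    links = []
    i = 0
    while i < len(words):
        w1 = words[i]
        if i + 1 < len(words):
            w2 = words[i + 1]
            # simple collocation?
            cats = categorise(f"{w1} {w2}")
            if cats:
                links.append((f"{w1} {w2}", cats))
                i += 2
                continue
            # inflected collocation?
            r1 = root(w1)
            cats = categorise(f"{r1} {w2}")
            if cats:
                links.append((f"{r1} {w2}", cats))
                i += 2
                continue
        if not is_stopword(w1):
            # simple word, else its morphological root
            cats = categorise(w1) or categorise(root(w1))
            if cats:
                links.append((w1, cats))
        i += 1
    return links

# Toy category table standing in for Roget lookup:
lex = {"bitter pill": [391], "struggle": [716]}
links = generate_links("a bitter pill struggle".split(),
                       lambda t: lex.get(t, []),
                       lambda w: w,
                       lambda w: w == "a")
print(links)
```

Note how “bitter pill” is consumed as a single collocation link, so “pill” never generates a link of its own.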
Explicit Disambiguation
The local disambiguator uses a circular data structure as a source of links for the lexical
chaining algorithm. This is known as the “data-ring”. The data-ring is populated with
links derived from plain text by calling algorithm 5-2 above. These may then be
disambiguated (to reduce the number of Roget categories that they refer to) before use in
lexical chaining.
The Local Disambiguator is designed specifically to counter the one pass, “greedy”
(Barzilay and Elhadad 1997) nature of the chaining algorithm. That is, a pure one-pass
algorithm could link one word (A) to another previously seen word using the weakest link
type. Alternative senses of that word would then be eliminated, since it had been assigned
to a chain. However it is often the case that a much better link could be formed with a
word yet unseen. This “sliding window” model of word sense disambiguation has been
described by Schütze (1992).
The algorithm is based on the SUSS Semantic Relations Filter (Section 5-4).
Algorithm 5-3: Local Word Disambiguation.
LET the word to be disambiguated be W
RETURN IF W has one thesaural category
    //Already disambiguated
IF an earlier word in the DATA-RING is identical
    link their categories AND RETURN
Initialise vector V with thesaural-categories(W)
FOR each other word X in the DATA-RING
    LET S := thesaural-categories(W) ∩ thesaural-categories(X)
    FOR each category E in S
        increment V(E)
IF all elements of V are 0, return all senses
    //No disambiguation
IF one sense scores higher than all others, return that
    //Full disambiguation
IF several senses share a value, remove the remainder, and
return those senses
    //Partial disambiguation
Note that there are three possible outcomes to algorithm 5-3:
1. No Disambiguation Evidence
2. Partial Disambiguation
3. Full Disambiguation
In (1), no thesaural relationships are found, so no disambiguation may be done. In (2),
several word senses may form equally strong connections to neighbouring words. Weaker
word senses may be eliminated, which both simplifies the lexical chaining process, and
eliminates their spurious use in later processing. In (3), the number of candidate solutions
is reduced to one. This is optimal, providing the remaining sense is correct.
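The voting scheme of algorithm 5-3 and its three outcomes can be sketched as follows; the category names are illustrative, and the data-ring is simplified to a list of category sets for the neighbouring words.

```python
def disambiguate(word_cats: set, ring_cats: list) -> set:
    """Vote for each candidate category of W according to how many other
    words in the data-ring share it, then keep the best-supported ones."""
    votes = {c: 0 for c in word_cats}
    for other in ring_cats:
        for c in word_cats & other:
            votes[c] += 1
    best = max(votes.values(), default=0)
    if best == 0:
        return set(word_cats)  # no disambiguation evidence
    # full disambiguation if one survivor; partial if several tie
    return {c for c, v in votes.items() if v == best}

# "accident" keeps only the sense supported by its neighbours:
print(disambiguate({"chance", "disaster"}, [{"disaster", "vehicle"}]))
```

When no neighbour shares any category, every candidate survives (outcome 1); a tie among the top-scoring categories yields partial disambiguation (outcome 2); a unique winner yields full disambiguation (outcome 3).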
5.6. Conclusion
This chapter has investigated the effects of word sense disambiguation on the
performance of Hesperus. In particular, techniques of local word sense disambiguation
were investigated in more detail. This was due to the availability of a large-scale sense
tagged corpus within the context of the Senseval word sense disambiguation evaluation.
From the above it is clear that the performance of SUSS in Senseval was comparable with
other systems that did not include a machine learning element. This gives confidence that
we can generalise the performance of Hesperus’ explicit word sense disambiguation to
any similar technique that could be applied instead of it.
An essential finding was that the composition of the Senseval corpus and relative
frequency of sense occurrences significantly affected the performance of the various
disambiguation techniques. For example, the intended sense of “wooden” (where it is not
a component of the collocation “wooden spoon”) means “made of wood” 345 times, and
describes “lacking liveliness, grace, or spirit” on 5 occasions. Thus, by selecting the first
sense as default, 98.5% accuracy can be achieved.
Simple word disambiguation techniques that conform to the Hesperus paradigm have
been shown to be as successful as most alternative approaches that do not use machine
learning. Approaches that do use machine learning do, however, have superior
performance (even where explicit training data is not available).
Not all the approaches to word sense disambiguation implemented in SUSS could be
readily transported into Hesperus, as they relied upon the availability of detailed
dictionary data. Exploiting such data remains a possibility for further work.
Several methods were, however, used to develop the local disambiguator. Roget includes
a considerable number of collocations and phrases. It also includes many entries for
verb-particle combinations separate from the verb entry alone. The lesson from SUSS was
that such multi-word expressions have far lower ambiguity than their components when
seen in isolation.
A local disambiguator was consequently developed to exploit such expressions. This
included identifying the morphological roots of unrecognised words, and considering
whether they may be associated with a verb particle. The size of the input data-ring was
also restricted to seven words in accordance with the optimum found in Senseval. These
enhancements are included in the system used in the next chapter, which evaluated
Hesperus against human performance.
Chapter 6. Evaluating Hesperus
6.1 Introduction
This chapter investigates experimentally how well the Generic Document Profile (GDP)
performs at text similarity matching. Since similarity assessment is a matter of human
opinion, the principal method used in this chapter is to collect people’s judgements
experimentally, and compare these to Hesperus. These judgements also provide a baseline
measure against which the performance of possible enhancements to Hesperus may be
compared.
Two modifications to Hesperus are considered that may improve its efficacy. These are
the effects of explicit word sense disambiguation (Chapter 5), and altering the granularity
of the similarity matching (Section 3.3). Further modifications may be considered in the
future once the baseline measure is defined.
Section 3.5 describes the procedure for producing a similarity match score between two
or more texts. In this chapter we call the text that we wish to match against the “source
text”, and other texts “example texts”. If we calculate the GDPs of both the source and
example texts, we can calculate a similarity score for all the examples that may then be
ranked in order of their similarity to the source text. However, we do not know how
accurate this similarity ranking is.
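The ranking step just described can be sketched as below. The GDP comparison procedure itself is defined in Section 3.5; cosine similarity over concept/strength vectors is assumed here purely as a stand-in comparison measure, and the concept names are invented for illustration.

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse concept/strength vectors."""
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_examples(source_gdp: dict, example_gdps: dict) -> list:
    """Return example-text ids ordered by similarity to the source text."""
    return sorted(example_gdps,
                  key=lambda k: cosine(source_gdp, example_gdps[k]),
                  reverse=True)

source = {"travel": 3.0, "vehicle": 2.0}
examples = {"cars": {"vehicle": 2.5, "travel": 1.0},
            "cookery": {"food": 4.0}}
print(rank_examples(source, examples))
```

The output ordering, not the raw scores, is what the experiments below compare against human judgements.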
Authors write for human readers. Consequently, people have to be the baseline arbiters of
how similar texts are, and need to provide judgements of text similarity. These may then
be used to evaluate the performance of the GDP procedure as that can be measured
against their assessment. This chapter describes experiments to firstly generate a set of
human judgements, and secondly to contrast the GDP against those judgements.
People’s opinions will naturally vary according to circumstance. People who are well
informed about a specific subject may be able to identify differences that are not apparent
to those who have had little exposure to it. For example literary scholars dispute the
authorship of some of the works of Shakespeare based on differences in style that they
can identify.
Chapter 6
Evaluating Hesperus
The GDP is intended to be a general procedure. Consequently, it is important not to focus
on a specialist subject domain, as this would not fairly compare it against non-specialist
human judgements. Several areas should also be covered to ensure that the assessment is
not biased in favour of one particular subject. Therefore the texts should be general, and
their readers, the experimental subjects, should be non-specialists.
As the experimental subjects are to be non-specialists who will be asked to compare
random texts, we have to consider their motivation. Even the most well intentioned
subject will not be able to concentrate on differentiating between random texts if they
cannot understand the task in principle. In other words, an ecologically valid design is most
important (see Section 1.6). That is, the experimental task needs to be similar to an
activity that the subjects routinely carry out, in an environment they are familiar with, and
using tools or equipment with which they are familiar.
Internet searching is now a routine part of student life. Given an assignment (or personal
interest), students are accustomed to sifting through the web pages identified by search
engines for those that may be useful. Recognising unrelated items is a component part
of this task, which will be routinely carried out using a web browser such as Netscape or
Explorer. Basing the experiment on a purpose built web site will capitalise on this
activity, and satisfy the requirement for ecological validity.
There are a significant number of issues that need to be addressed to ensure that the
results are unbiased and hence of value. These start with three fundamental problems:
1. What constitutes a source text?
2. What texts are we trying to assess its similarity to?
3. What is the problem context?
We have claimed previously (Chapter 4) that the lexical chains found in texts of varying
lengths and styles are remarkably similar in distribution. Thus, our technique could be
applied in theory to any texts. This in turn gives rise to further concerns:
a) Are the source and example texts selected randomly?
b) Can the test data be made available electronically?
c) Is reliable human experimental verification possible?
We also need to compare the performance of Hesperus against a more standard method of
assessing text similarity to compare its relative effectiveness.
Of these concerns, (c) is most problematic. If one person provides a similarity assessment,
the results will reflect their opinion only, and that may be based on an idiosyncratic
interpretation of the text. This objection has to be countered by reference to statistics.
These may be collected by asking a number of experimental subjects to assess similarity
of several texts. This allows us to produce a group similarity assessment that is
independent of any one person's particular background and bias.
Text similarity assessment is nonetheless a difficult task. Unless the texts are short,
people find it difficult1 to read and compare several texts on the same subject. This sets a
practical upper bound on text length of between one and two pages.
A lower bound is imposed by the lexical chaining process. Since lexical chains are
derived from relationships between sets of words, there need to be sufficient words in the
set for several chains to emerge. This is unlikely to happen in one sentence, as word
repetition within one sentence violates English rules of style. Consequently, paragraph
length texts are a suitable minimum length for experimental purposes.
A great deal of text is electronically available now either via the Internet, or via
CD-ROM. The Internet contains huge quantities of texts (and graphics, sounds, etc.) on all
subjects, whereas CD-ROMs tend to be subject specific.
Microsoft’s “Encarta2” is an example of one such widely available CD-ROM. Encarta 97
is an electronic multimedia encyclopædia that contains 31,108 texts based on the
29-volume Funk and Wagnall’s New Encyclopaedia. The texts are aimed at a general
audience (Nadeau 1994) and are approximately one or two pages in length. They cover
well-defined subjects, and the majority of the encyclopædia contributors and consultants
are university professors.
1 As reported by several experimental subjects.
2 “Copyright,” Microsoft® Encarta® 97 Encyclopedia. © 1993-1996 Microsoft Corporation. All rights reserved.
Encarta has several characteristics that make it an appropriate source of “source texts”.
Firstly, its articles have wide coverage, and offer an excellent choice of experimental
topics. That is, it is not necessary for the experimenter to choose topics, since a random
procedure has a good probability of identifying them. It is not a certainty however, since
whilst Encarta, like all encyclopaedias, may aspire to cover all possible subjects, in reality
it does not. Most non-domain specific topics can be found by simply using the various
searching tools. Secondly, Encarta has been subject to a single standard of editorial
control so no source text will be too long, or difficult for the experimental subjects.
Thirdly, Encarta is widely available so the experiment may be replicated.
There are four mechanisms for finding an article in Encarta. These are known collectively
as the “PinPointer”. The four techniques are as follows. Firstly, the 31,108 topics in the
encyclopaedia may be displayed in a pop-up window. Articles may be selected by
scrolling through this window and clicking on an article title, or an article title may be
typed into a text box, which scrolls the topics list to a matching entry. Secondly,
the list of articles may be restricted to those that belong to one of nine major categories such
as Sport, History (European History), Life Science, and so on, with up to fifteen
sub-categories. This may then be scrolled through as in the first method. The third search
mechanism uses an interface that supports a text-input mode as opposed to the previous
menu choice methods. The text mode locates articles by keyword, or sub-string contained
in that article. The fourth and final mechanism supports an advanced feature that allows
Boolean queries using AND, OR, (), NOT, NEAR. These Boolean queries may also be
combined with sub-string search.
Now the question of source texts has been addressed, we return to the issue of example
texts. As there are multiple pages on almost every subject, the Internet is an ideal source
of example texts. It is also highly heterogeneous. Thus Internet derived example data will
(probably) use the same terms in different senses. Consequently, a procedure that
recognises conceptual similarity should be able to distinguish clearly between texts that
are on the same subject, as opposed to those that are on different subjects, but use
identical terms.
Single subject area test collections are commonly used in Information Retrieval for
evaluation purposes (see Baeza-Yates and Ribeiro-Neto 1999 Chapter 3). These are
readily available, but have two disadvantages. Firstly, terms tend to be used in one sense
(Krovetz and Croft 1992). An example would be use of the term “cancer” in the medical
literature, as opposed to texts on astrology. Thus, data derived from a single subject
collection would be an inferior source for text similarity experimentation as it would be
harder for the program to distinguish than Internet data that covers several domains. That
is, the single subject collection would tend to disambiguate the words it contains, making
its automatic differentiation more onerous.
The second disadvantage would be the difficulty that non-specialists would have in
discriminating between domain specific texts.
If the Internet is the source of example texts, the subjects (i.e. topics) used in the
experiment should correspond to Internet queries. Since we have decided that Microsoft's
Encarta is a good source of example texts to match against, these queries should also
correspond to index entries in Encarta.
Thus, random Internet queries, that may be found in Encarta’s index form our source
texts, whilst the Internet pages make up the example texts.
This chapter is organised as follows: firstly we consider the questions, or hypotheses,
that the experiments in this chapter are designed to answer. Next, we describe how the
materials for the experiments were selected to be both unbiased, and ecologically valid.
The experiments with human subjects are described in Section 6.3. These provide a
similarity ordered baseline set of texts. This similarity ordering is then compared topic by
topic to that given by Hesperus, whilst a simple information retrieval program acts as a
control.
Section 6.3 also evaluates the effects of explicit word sense disambiguation using the
same test data. This involved activating the local disambiguator (Section 3.3) so that
explicit disambiguation is done, and then assessing changes to performance.
Finally the overall results are discussed and conclusions are drawn.
6.2 Hypotheses
There are a number of hypotheses to be tested. These are given below in order of their
relative importance with respect to this study, whose primary objective is to decide whether
lexical chains derived using Roget’s thesaurus may be used to determine the similarity of
texts.
The hypotheses are:
1. Lexical chain based similarity matching is able to produce a ranking
between a source text and several examples equivalent to that produced by
human subjects.
2. Lexical chain based similarity matching is not identical to a purely term
based approach.
3. The performance of Hesperus on text similarity matching will improve if
there is an explicit word sense disambiguation phase.
4. The performance of Hesperus will alter if the granularity level of GDP
matching is refined.
5. Articles written in an encyclopaedia style will be preferred by the subjects
over Internet web pages.
The first of these hypotheses is most important. It raises the question as to whether a
text’s lexical chains determined by Hesperus express its meaning sufficiently well to be a
useful tool for similarity matching. If this were the case, then it would be possible to place
texts in order, or rank, of similarity and use this in applications such as information
retrieval where text similarity matching is important. Note that we do not expect the
procedure to give the same numeric values as that determined by the human subjects, as
these values would be an artefact of the experimental procedure.
The second hypothesis follows from the first. We have indicated (Chapter 3) that a
majority of the strength of lexical chains derives from term repetition. It is possible that
Hypothesis One will be true, but the similarity matching effect is due solely to identical
terms in the source and example texts. Consequently, we need to enquire whether term
repetition alone will give equivalent or better results than Hesperus.
The third hypothesis tests whether word sense disambiguation (WSD) at current accuracy
levels (Chapter 5) can improve the performance of Hesperus on text similarity matching.
Using Roget’s Thesaurus to determine the similarity of texts
Jeremy Ellman
Sanderson’s (1996) conclusion was that WSD needed to have accuracy greater than 90%
to improve performance of an information retrieval system. By contrast, SUSS achieved a
mean WSD precision of 67% (Chapter 5). Thus, it would seem that Hypothesis Three is
unlikely if Sanderson’s (1996) information retrieval task is equivalent to the text
similarity assessment proposed here. However, Sanderson (1996) used a corpus that
contained brief texts about finance only (the Reuters 22713 collection, Lewis 1991),
whilst we will use the random materials developed in looking at the earlier questions.
Thus, we have greater scope for word sense disambiguation at the homographic level, as
single subject corpora are known to favour particular word senses. For example, “bond3”
in financial documents refers to a particular type of certificate, whilst in a random corpus
it could equally refer to family relations, or adhesive properties.
The fourth hypothesis tests the effect of altering the match granularity. Reduced
granularity could improve the performance of Hesperus through better precision since
there are more, smaller, categories. Alternatively, it could reduce the GDP similarity score
since the smaller categories are less likely to match than larger ones. Consequently, this
could impair performance.
The fifth hypothesis is relatively minor. It concerns the fact that topic definitions found in
encyclopaedias are not routinely found on the Internet. Thus, the subjects may select
encyclopaedia texts as more similar to the Encarta text because of stylistic considerations.
For example, encyclopaedia texts may be preferred as they have been subject to
professional editing, whilst Internet texts will have been produced to varying quality
standards. Readers are sensitive to stylistic variations. For example, Karlgren and Cutting
(1994) have shown that text genre may be differentiated on stylistic grounds, whilst
Karlgren (1999) has suggested that stylistics may be used to augment information
retrieval performance. Thus, stylistics may be used by human readers, although it is not
used in lexical chain formation (Chapter 4), which is an essential step in determining a
text’s GDP (Chapter 3). This hypothesis would be proven if the Infopedia texts were
consistently identified as more similar to the source text by the subjects than by Hesperus.
We now go on to look at the derivation of an appropriate random set of test materials, and
experimental comparison of performance with Hesperus.
This example is due to Ken Litkowski, writing on the “Corpora” mailing list. Of course, “bond” is ambiguous at the
polysemic level, since it may refer to a variety of financial instruments, e.g. T-Bills, war bonds, long bonds etc.
6.3 Text Similarity Experiments
The objective of the text similarity experiments is to evaluate the comparative
performance of the Hesperus derived GDP on a realistic task.
This section describes three experiments. Firstly, the experiment with human subjects is
intended to produce a set of texts whose similarity has been ordered by people. This will
serve as a baseline assessment measure for Hesperus. The second experiment applies
Hesperus to these texts to produce a second set of similarity assessments. A third
experiment applies an information retrieval program to the retrieved data to test the
hypothesis that the GDP performance is principally influenced by term repetition.
Since our thesis is that the GDP is applicable to any text type and any subject, the
selection of topics, and of texts on those topics, is critical4.
In the following section, we first discuss random query selection and then the acquisition
of example documents on these topics. Next, the methodology for each of the three
techniques of text similarity calculation is described. That is, firstly we discuss an
experiment in which people assess the similarity of the examples to a control text. Then
we discuss the generation of GDPs for this example set, and finally discuss an alternate
approach that uses a simple information retrieval application. The results are then
presented by topic.
6.3.1 Material Selection
Query Selection
Several Internet search engines publish (anonymously) current queries they are executing.
These would constitute random topics to be used in our experiments. For example,
Metaspy5 publishes ten current queries from MetaCrawler, a “meta” search engine
(Selberg and Etzioni 1997) that searches several other search engines in parallel. Metaspy
automatically refreshes the display every fifteen seconds with ten further queries.
Judicious selection of texts would give better results, but obtaining texts and topics at random provides far greater
confidence in the results obtained.
Selberg and Etzioni (1995) note that there is no particular pattern to these queries.
Nonetheless, approximately 1% corresponds to general topics that intuitively could be
found in the index of an encyclopaedia such as Encarta. The remaining queries are mostly
highly specific (e.g. “SNE 99x ROM”), proper names (“toys r us stores”), non-English,
misspellings, or phrased in the syntax of web based information retrieval (“+nursery
+plants +florida”). Thus, one potential topic appears about every 2-3 minutes (some
queries are repeated, since there are not always ten new queries in fifteen seconds).
Twenty queries were recorded in a session conducted in May 1998. Of these, eight were
found in Encarta. These were:
1. Socialism
2. Ballot
3. Copyright
4. AI
5. Rosetta
6. Breakdance
7. Welfare
8. Fishing
Corpus Gathering
Having selected the queries, we needed to retrieve the Internet pages that users would find
in response. This was done using a purpose built application “Hesperus-Web”.
Hesperus-Web accepts single queries as identified above, and posts them to MetaCrawler
(Selberg and Etzioni 1995, 1997). This in turn queries AltaVista, Infoseek, WebCrawler,
Excite, and Yahoo! MetaCrawler then collates the links found into a common list of the
thirty or so most highly ranked. Hesperus-Web then retrieves these pages and stores them
on the local disk, renaming them where necessary.
Corpus Selection
There is still a problem once the pages have been retrieved, as MetaCrawler is
programmed to try to return thirty pages in response to any query. Pilot studies have
shown that five example texts are better for the experiment due to limitations on attention
span. Consequently, these must be selected from the thirty returned.
To avoid implicitly accepting the rank ordering imposed by MetaCrawler, an algorithmic
procedure was defined that could eliminate possible texts from the result set. This reduced
the possibility of experimenter bias in selecting example texts.
Algorithm 6-1: Reducing the number of example texts to five.
    Initialise rejection criterion R to first in Table 6-1
    While there are still rejection criteria
        If number of texts N is equal to five, EXIT
        Eliminate texts that satisfy R
        Increment R to next in Table 6-1
    End While
    If there are more than five texts
        Select five examples at random
Algorithm 6-1 was applied manually. The rejection criteria used are shown in Table 6-1
below. The objective of these was to provide example texts suitable both for experiments
with human subjects, and suitable for Hesperus.
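The elimination procedure of Algorithm 6-1 can be sketched in Python as follows. This is an illustration only: the algorithm was applied manually, and the predicates standing in for the rows of Table 6-1 are hypothetical.

```python
import random

def reduce_to_five(texts, rejection_criteria, rng=random):
    """Apply each rejection criterion in turn (the rows of Table 6-1)
    until only five candidate texts remain."""
    for rejects in rejection_criteria:
        if len(texts) == 5:
            break
        texts = [t for t in texts if not rejects(t)]
    if len(texts) > 5:
        # All criteria applied but more than five texts survive:
        # choose five at random, as in the final step of Algorithm 6-1.
        texts = rng.sample(texts, 5)
    return texts
```

Applying the criteria in a fixed order, and falling back to random sampling, is what removes the experimenter's discretion from the selection.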
For the subjects the texts need to be short enough so that they may be read in one or two
minutes. The experiment was planned to last one hour to ensure the subjects’ attention
was maintained and for pragmatic reasons. Two minutes per text ensures that a sufficient
number may be read to produce data for several queries. Thus, they needed to be neither
too short nor too long. They also needed to be amenable to subsequent analysis with
Hesperus.
Table 6-1 below shows the items that were consequently eliminated from consideration:
Table 6-1: Rejection Criteria for example texts
Rejection Criteria
1 Empty Files
2 Not Found/Illegal Access/Redirected Links
3 Pages longer than three screenfuls
4 Pages shorter than one screenful
5 Pages containing Images only
6 E-mail “Threads”
7 Tables of Contents (otherwise containing little text)
8 Adverts
9 Book Reviews
10 Adult Material
In addition, an example was included from an electronic encyclopaedia, “Infopedia 97”6.
This allows us to perform a limited test of the minor hypothesis that topic definitions
found in encyclopaedia type summaries are not routinely found on the Internet.
6.3.2 Experiment with Human Subjects
The experiment was constructed as a dedicated Internet WWW site viewed with a Web
browser such as Microsoft Internet Explorer or Netscape. This ensured a high level of
ecological validity, since the texts were being seen in the medium and by the means for
which they were designed. The web pages used in the experiment were purposely
modified so that links other than those in the experiment could not be followed.
Approval for the experiment was sought from, and granted by the University of
Sunderland ethics committee.
The subjects were twenty-five undergraduate and MSc conversion students from the
School of Computing and Information Systems at the University of Sunderland. They
participated as class groups with the support of the course tutors. Participation was
voluntary, and the subjects were not paid.
The Hutchinson New Century Encyclopaedia. Copyright 1996 Softkey Multimedia Inc.
The students were given a brief explanation that the experiment was about text similarity,
and were informed of the address of the experimental web site. They were shown the
questions and had the five point Likert scale explained to them. This explanation was
available on help pages on the web site, and is reproduced in Appendix IV.
No detailed explanation was attempted of the concept of similarity other than “means the
same thing”. It was felt that a technical explanation would have detracted from the
subjects’ naïve views, and changed the nature of the task to that of evaluating similarity
based on the external definition. Support for this view comes from Hampton (1998),
writing in the psychological literature, who reports a categorisation experiment by
Hampton and Dubois. They found little or no evidence that the fuzziness of
categorisation was reduced by providing a clear discourse context.
Subjects were provided with either an elaborate scenario before participating in a
categorisation experiment or no background scenario at all. Levels of disagreement, and
inconsistency in the experimental results were unaffected by whether subjects had
received the detailed explanation.
The experimental design was fully randomised. Its major component was a JavaScript
program that guided interaction through the web site. This consisted of six component
experiments that covered the selected topics. These experiments were presented in a
random order. The components of each experiment were also presented in random order.
This ensured that no part of the experiment was unduly influenced by fatigue, since some
subjects would have seen first those parts that others saw last.
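The fully randomised presentation logic can be sketched as below. This is a Python illustration of the design (the original implementation was a JavaScript program), with hypothetical topic and example names.

```python
import random

def presentation_order(topics, examples, seed=None):
    """Shuffle the component experiments (topics), and independently
    shuffle the example texts within each topic, so that no part of
    the experiment is systematically affected by fatigue."""
    rng = random.Random(seed)
    order = list(topics)
    rng.shuffle(order)
    # rng.sample(xs, len(xs)) returns a shuffled copy of each example list.
    return [(topic, rng.sample(examples[topic], len(examples[topic])))
            for topic in order]
```

Because both levels are shuffled per subject, some subjects see first the parts that others see last.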
Subject reports in initial experiments (Ellman 1998) indicated considerable difficulty
when faced with one question (whether one example text is more similar than another to a
source text). Consequently, an experiment was designed in which subjects answered
several questions, only one of which was important to this study: how similar one text
was to another7.
After reading the initial instructions, subjects were shown the Encarta entry as a Source
text. This was shown in a Frame based layout with the text in the left of the screen (the
I would like to thank Dr. Sharon Macdonald for the suggestion to ask multiple questions.
right being initially blank). It was explained that each topic would be labelled as a
“Query”, that the source text would be labelled “0/6”, and that the comparison texts
would be numbered from one.
This explanation was acceptable to the subjects, and did not require further elaboration.
Four general statements about the text were shown in the lower part of the frame (see
Figure 6-1 below). The purpose of these statements was to ensure that subjects read the source text
in sufficient detail to make subsequent decisions about similarity of meaning.
The Source Text is a good explanation of the subject
I am very familiar with the topic of the Query
I would like to find the Source Text if I looked for this Query
The Source Text is a good definition of the subject
Statements for the Initial Topic Page
Figure 6-1: Source text Topic Screen
Subjects were required to indicate whether they agreed or disagreed with the statements
using a five point Likert scale. When the responses had been completed and the
“SUBMIT”8 button clicked, the example texts were shown in random order in the right
pane (see Figure 6-2 below). Four further statements were made in the lower pane.
These statements also required agreement to be indicated on a five point Likert scale.
Once all the replies had been completed, and the “SUBMIT” button clicked, a further
example was shown. When all the examples for one topic had been seen, the experiment
proceeded to the next topic, until all the topics had been covered.
The Example means the same as the Source Text
The Example is relevant to the Query
The Example is more specific than the Query
The Example is a good definition of the subject
Statements for the Example Comparison Pages
Figure 6-2: Source and Example Text Comparison
The “SUBMIT” button was positioned at the far right of the screen. This ensured that subjects saw all points of the
scale before confirming their choice.
The replies made by the subjects were collected from the Web site log, and then analysed.
The question to which we would like the answer is whether the source and example texts
are similar in meaning. This needs to be phrased as a statement to which subjects can express
agreement or disagreement using the Likert scale. The statement form “the example text
means the same as the source text” was consequently taken as the principal human
similarity measure. This is known as statement S1. Their answers to statements S2-S4
were recorded, but the analysis is beyond the scope of this thesis.
The frequency distribution of the replies to statement S1 is shown in graphs 6-1 to
6-6 below. This shows the variance in the answers, allowing us to see how well the
subjects agreed with each other, and to determine whether their responses were random
or systematic.
It can be seen from graphs 6-1 to 6-6 that the subjects’ responses to S1 approximately
conformed to a normal distribution. This means that we can use the mean of S1 as an
accurate summary measure of the subjects’ judgement.
6.3.3 Hesperus Comparison Experiment
The purpose of comparing Hesperus with user judgements was to assess its usefulness as
a text similarity tool. This was discussed as Hypothesis 1 (Section 6.2). The similarity
ordered texts also provided a baseline measure against which modifications to Hesperus
can be evaluated. Two modifications to Hesperus were considered: using explicit
disambiguation, and using a fine-grained similarity match. These were Hypotheses 3 and 4
(Section 6.2).
Hesperus was run under four settings, determining the GDP for both the source texts, and
the example texts (Chapter 3). This produces a similarity measure for each text derived
from its GDP (Section 3.6) that ranged from zero to one. The settings were:
1. Standard grain.
2. Standard grain, with explicit word sense disambiguation.
3. Fine grained.
4. Fine grained, with explicit word sense disambiguation.
These settings have the following meanings: “standard grain” uses the ~1000 Roget
categories as described in Chapter 3. Explicit word sense disambiguation refers to the
augmentation of Hesperus with the local disambiguator described in Chapter 5.
Finally, “fine grain” refers to the use of the smaller 6400 subcategories possible
with the 1987 Roget (Section 3.3).
The rank order produced by Hesperus was compared to that derived manually. It is shown
below in the corresponding tables for each topic.
6.3.4 Information Retrieval Experiment
The purpose of the information retrieval experiment was to test Hypothesis 2: that lexical
chain based similarity matching is different to that produced by term based approaches.
SWISH (Simple Web Indexing System for Humans9) is an example of a classic
information retrieval program. It is based on the tf*idf10 heuristic, which indicates the
relative importance of a term in a document collection (Salton and McGill 1983).
SWISH was selected for the information retrieval experiment as it is both simple to use,
and is adapted for www pages. Consequently, it can both analyse unprocessed html, and
differentially weight keyterms that occur in document titles. Thus, if titles are good
indicators of content, SWISH would be at an advantage with respect to Hesperus, which
processes text only.
Note that this experiment is not an exact comparison with the human one, or with that
using Hesperus, as both those experiments compare the similarities of two texts. As stated,
this experiment compares the match between a query and several texts. Nonetheless, it
gives an indicative comparison with a term based approach at little cost.
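The tf*idf heuristic can be sketched as follows. This is a simplified illustration of the weighting only; SWISH's actual implementation, including its differential treatment of HTML title terms, differs.

```python
import math
from collections import Counter

def tfidf_scores(query_terms, documents):
    """Score each document against the query terms, where a term's
    weight is its frequency in the document (tf) multiplied by the
    log of the inverse of its document frequency (idf)."""
    n_docs = len(documents)
    doc_terms = [Counter(doc.lower().split()) for doc in documents]
    scores = []
    for terms in doc_terms:
        score = 0.0
        for term in query_terms:
            # Document frequency: number of documents containing the term.
            df = sum(1 for d in doc_terms if term in d)
            if df:
                score += terms[term] * math.log(n_docs / df)
        scores.append(score)
    return scores
```

A term that occurs often in one document but rarely across the collection thus receives a high weight, which is the sense in which tf*idf indicates relative importance.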
The page set retrieved for each query was indexed separately using SWISH. The raw
query (i.e. “Copyright”, “AI”, “Rosetta Stone” “Socialism”, “Ballot”, and “Breakdance”)
was then posed against this index. The relevance score returned for each example in the
test set was recorded. These are shown in tables 6-2, 6-5, 6-7, 6-9, 6-11, and 6-13 below.
9 SWISH is a freely available basic IR program that understands HTML format, and increases the rank of terms found
in HTML heading tags. URL:
10 The tf*idf heuristic states that the frequency of a term in a text, multiplied by the inverse of its frequency of
occurrence in the collection, indicates its importance.
As in the Hesperus experiment, the rankings obtained from SWISH were compared to
those derived from the human assessors. The results are shown under the topic headings
below.
6.3.5 Results Introduction
This section presents the results of the experiments by topic. This requires the
comparison of three different types of data. Hesperus gives a similarity score from zero to
one, where one represents a perfect match. SWISH gives a relevance score from zero to one
thousand, where one thousand means highly relevant, whilst the human experiment gives
Likert data that range from 1 to 5.
For convenience, a percentage score was calculated from each of the three data sets. The
simple scaling techniques used are described below. Caution is required in comparing the
three percentage scores however as they are derived by different means, and are most
likely from different frequency distributions. For example, a human experimental score of
50% means “neither agree nor disagree”, whilst for Hesperus this represents quite strong
similarity.
For the human experiments, the subject judgements are converted into percentages by
firstly coding the responses to statements S1 to S4 from 1 (disagree) to 5 (agree). The
arithmetic means of these ratings were then taken, giving a group measure of agreement
to each statement. These data are presented graphically below under each query heading
as graphs 6-1 to 6-6.
Hesperus and SWISH scores were converted to percentages using simple scaling. For
Hesperus, raw similarity measures range from zero to one, whereas SWISH scores range
from zero to one thousand. Conversion to percentages was by simple multiplication by
one hundred for Hesperus, or division by ten for SWISH.
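These conversions can be sketched as follows. The Likert rescaling shown is one reading of the coding described above, chosen so that the midpoint code 3 maps to 50%.

```python
def likert_to_percent(responses):
    """Mean of Likert codes 1 (disagree) .. 5 (agree), rescaled so
    that 1 -> 0%, 3 ('neither agree nor disagree') -> 50%, 5 -> 100%."""
    mean = sum(responses) / len(responses)
    return (mean - 1) / 4 * 100

def hesperus_to_percent(similarity):
    """Hesperus GDP similarity ranges from zero to one."""
    return similarity * 100

def swish_to_percent(relevance):
    """SWISH relevance ranges from zero to one thousand."""
    return relevance / 10
```

The three percentages are thus on a common 0-100 scale, although, as noted above, they are not drawn from the same frequency distribution.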
As the numeric data from the three experiments were not derived from the same
frequency distribution, they cannot be compared directly. However, the rank (or order)
that they give to the example texts can be compared. That is, we can compare the most to
least similar example ordering derived from the experiment, with that given by Hesperus.
Spearman’s rank correlation statistic is useful here (henceforth Spearman). It is a
nonparametric statistic that makes no assumptions about the underlying frequency distribution of
the data it is used on. It is used for determining the correlation and hence statistical
significance of ordinal data (e.g. see Kinnear and Gray 1997). That is, data ordered into
ranks or assigned to ordered categories. It is applied below to each of the queries to
determine how well the experimental techniques concur.
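Spearman can be computed directly from two score lists by correlating their ranks. A minimal sketch (averaging ranks over ties, as statistical packages conventionally do):

```python
def ranks(values):
    """Rank values from 1 (smallest), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of tied rank positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(xs, ys):
    """Pearson correlation of the ranks of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because only the ranks enter the calculation, any monotonic rescaling of either score list, such as the percentage conversions above, leaves rho unchanged.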
Kilgarriff and Rose (1998) have noted a disadvantage of Spearman for corpus studies.
That is, Spearman does not take account of the magnitude of differences between
classifications leading to different rankings, only that they differ in order. This is useful
for comparing scales that cannot be compared directly, but does not highlight when large
differences on one scale are matched by small alterations in the other ranking scale. Such
differences will be indicated in the description of the data, the format of which will now
be described.
Results Presentation Format
The evaluation of Hesperus included six experimental topics, and three experiments, with
Hesperus operated under four different conditions. The success of the trials depended on
the topic considered. Consequently, the data generated are presented in a common format,
ordered by topic, which will now be described.
Firstly, there is a brief description of the topic that indicates the position of the source text
in Encarta. Any oddities of the data found for the example texts on the Internet are also
noted.
Next, a histogram of the experimental subjects’ responses to question S1 (“The Example
means the same as the Source Text”) is presented. This shows the percentage of subjects
who agreed and disagreed with this question as applied to each of the six example texts.
The histogram gives a visual representation of how well the subjects agreed with each
other. As with any experiment with a group of human subjects, we would expect these
results to be approximately normally distributed around some mean. This mean is used as
the assessment of similarity against which Hesperus is evaluated.
The mean assessment of similarity is then presented in tabular form. This table includes
the percentage scores from the Hesperus comparison and Information Retrieval
experiments. This is followed by a table that presents the results of the Spearman rank
correlation. Brief comments are included that highlight features of the analysis, whilst
overall conclusions and discussion are given in Section 6.4.
The data from the six topics covered in the experiments are presented below in the order
“Copyright”, “AI”, “Rosetta Stone” “Socialism”, “Ballot”, and “Breakdance”. This
sequence is derived from the statistical significance of the results, with the most
significant presented first.
Experimental Results Data
Copyright
“Copyright” is one of the major articles in Encarta. It was found in the Encarta
“PinPointer”, which lists the principal encyclopaedia article titles. “Copyright” describes
the general body of legal rights related to the protection of creative works.
Graph 6-1 below shows the association between the experimental question S1 (“The
Example means the same as the Source Text”), and the texts shown to the subjects. As
mentioned above, it is a stacked histogram which shows the proportion
of subjects who agreed with question S1, and how strongly.
The data show an approximately normal distribution, which supports the validity of the
experiment. The article “copytoc” is an exception, which shows a bimodal distribution.
The explanation for this may be in the content of the article, which is the table of contents
of a web site covering copyright. “Copytoc” could be considered to be a set of bullet
points about copyright, in which case it would be similar to the Encarta entry, or as a site
index list, in which case it would not be. As can be seen in graph 6-1, the twenty-three
subjects are divided on this point.
Graph 6-1 below shows that there were differences between the ratings of documents.
“Info”, “copyright”, and “copytoc” shared higher ratings, whilst “lawnet” had the lowest
rating. The arithmetic mean of the subject scores is shown in table 6-2 below.
Graph 6-1: Copyright S1 ratings (stacked histogram; Copyright: Document vs. Rating vs. Responses %)
The differences identified by the subjects are mirrored by their rating by Hesperus, and to
some extent SWISH. This is shown in table 6-2 which gives the normalised percentage
similarity scores. SPSS (Kinnear and Gray 1997) calculates their numeric order, or rank,
to give the Spearman rank correlation in table 6-4. This has been calculated manually in
this first example only to clarify what data are compared in the Spearman rank correlation
(table 6-3).
Table 6-2: Copyright: Experimental Comparative Results. Percentage similarity as
measured using the specified techniques: Fine Grained; Fine Grained, Explicit WSD;
Standard Grain; Standard Grain, Explicit WSD.
Table 6-3: Copyright: Numeric Rank of similarity scores. Rank of similarity scores under
the same settings: Fine Grained; Fine Grained, Explicit WSD; Standard Grain; Standard
Grain, Explicit WSD.
Table 6-4: Copyright: Hesperus and SWISH Spearman Rank Correlation. Spearman’s rho
and one-tailed significance for each setting: Fine Grained; Fine Grained, Explicit WSD;
Standard Grain; Standard Grain, Explicit WSD.
** Correlation is significant at the .01 level (1-tailed).
* Correlation is significant at the .05 level (1-tailed).
Table 6-4 shows that Hesperus achieved a significant Spearman rank correlation on all
measures, with three out of four highly significant (p <.01).
AI
“AI” is a main subject title in Encarta. It contains the single statement “See also Artificial
Intelligence”. The Artificial Intelligence article, which describes the discipline that aims to
create artefacts that mimic human thought or intellectual performance, was used as the
source text.
Graph 6-2: AI: S1 ratings (stacked histogram; AI: Document vs. Rating vs. Responses %)
Table 6-5: AI: Experimental Comparative Results. Percentage similarity as measured
using the specified techniques: Fine Grained; Fine Grained, Explicit WSD; Standard
Grain; Standard Grain, Explicit WSD.
Table 6-6 below shows that Hesperus fine-grained found a significant correlation (p < 0.05)
in rank ordering both with and without explicit WSD. Explicit WSD had a negative effect
on both levels of granularity.
Table 6-6: AI: Hesperus and SWISH Spearman Rank Correlation. Spearman’s rho and
one-tailed significance for each setting: Fine Grained; Fine Grained, Explicit WSD;
Standard Grain; Standard Grain, Explicit WSD.
* Correlation is significant at the .05 level (1-tailed).
Rosetta Stone
“Rosetta Stone” was found in the Encarta “PinPointer”. It describes the well known
Rosetta Stone held in the British Museum. The Rosetta Stone bears inscriptions in three
languages, two of which were known (demotic and Greek), which led to the
decipherment of the third (hieroglyphic).
The Rosetta Stone is found in one entry in Roget’s thesaurus (Intellect: The exercise of the
mind: Means of communicating ideas: Writing lettering.) Consequently, it is not subject
to an ambiguous interpretation by Hesperus.
These example texts are identified by their web page titles, or suitable defaults. The text
called “Info” is the extract from the “Infopedia” electronic encyclopaedia, included to test
whether Internet retrieved data is comparable to that from a published encyclopaedia.
Graph 6-3: Rosetta S1 ratings (stacked histogram; Rosetta: Document vs. Rating vs. Responses %)
Graph 6-3 above shows that the subjects found clear differences as to whether the
example texts meant the same as the source text. “Info” and “index” were most similar in
meaning to the source text from Encarta, whilst “location” and “rosetta” were largely
judged dissimilar. This is shown numerically in table 6-7 below, alongside the scores
generated from Hesperus and SWISH.
Table 6-7: Rosetta: Experimental Comparative Results. Percentage similarity as measured
using the specified techniques: Fine Grained; Fine Grained, Explicit WSD; Standard
Grain; Standard Grain, Explicit WSD.
The Spearman rank correlation for the ordered percentages was calculated, and is given in
Table 6-8 below. This shows a significant (p < 0.05) correlation between Hesperus and
the subjects when standard grain matching is used, and the explicit disambiguation is not
used.
Table 6-8: Rosetta: Hesperus and SWISH Spearman Rank Correlation. Spearman’s rho
and one-tailed significance for each setting: Fine Grained; Fine Grained, Explicit WSD;
Standard Grain; Standard Grain, Explicit WSD.
* Correlation is significant at the .05 level (1-tailed).
Note that all the four Hesperus methods capture the strong difference identified by the
subjects between “Index”, and “Info”, that were considered similar, as compared to
“location”, and “rosetta” that were not.
Socialism
“Socialism” is one of the major articles found in the Encarta PinPointer. It describes the
doctrine of state ownership and control of the fundamental means of production. The
article also refers to famous socialists, and their contribution to the movement.
There are four entries for “socialism” in Roget, plus a further three entries for “socialist”,
and two entries for “socialistic”. Consequently, there is a real possibility that Hesperus
may perform poorly due to word sense ambiguity on the fine grain match.
Graph 6-4: Socialism: S1 ratings (stacked histogram; Socialism: Document vs. Rating vs. Responses %)
Table 6-9: Socialism: Experimental Comparative Results
Percentage similarity as measured using specified techniques
Columns: SWISH; Hesperus fine grained and standard grain, each with and without explicit WSD.
Table 6-10: Socialism: Hesperus and SWISH Spearman Rank Correlation.
Spearman’s rho and one-tailed significance between the subjects’ rankings, SWISH, and the four Hesperus measures (fine grained and standard grain, each with and without explicit WSD).
Table 6-10 shows no significant rank correlation at the 0.05 level between Hesperus’
score on any measure and that given by the subjects. However, Hesperus rated the text
subjects found most similar (“Socialism”) above the least similar in all cases.
“Ballot” is one of the major topics in Encarta, as it is found in the Encyclopaedia index.
The article describes both the sheet of paper used to cast votes in an electoral system, and
the general method and development of secret voting. This corresponds to a type-token
distinction between a description of the voting process, and casting a ballot in one
particular election.
Several of the web pages retrieved on the “ballot” topic were on-line electoral or voting
forms. These pages were “Marking”, “03mba” and “index”. Graph 6-5 shows that these
were not considered strongly similar in meaning (S1) to the source text.
The Infopedia article gave a brief general description of the voting process. The subjects
considered it similar to the source text.
Graph 6-5: Ballot: S1 ratings (Document vs Rating vs Responses (%); axes: Subject Rating, Example Text Title)
Table 6-11: Ballot: Experimental Comparative Results
Percentage similarity as measured using specified techniques
Columns: SWISH; Hesperus fine grained and standard grain, each with and without explicit WSD.
Table 6-12: Ballot: Hesperus and SWISH Spearman Rank Correlation.
Spearman’s rho and one-tailed significance between the subjects’ rankings, SWISH, and the four Hesperus measures (fine grained and standard grain, each with and without explicit WSD).
Table 6-12 shows no significant rank correlation between Hesperus and the experimental
data at the 0.05 level. However, note that the two texts identified by subjects as most
similar (“Info” and “STV5GIFs”) were preferred over the two lowest ranked items.
Breakdancing is a form of street dancing characterised by disjointed robotic movements
or acrobatic spins (Infopedia 96). “Breakdance” does not appear as an article title in
Encarta. The phrase “break AND dancing” was found by the Encarta PinPointer, so the
topic did satisfy the criteria for inclusion given in algorithm 6-1 above.
Breakdance is included in the article “Rock Music and Its Dances”. Consequently, the
source text for “Breakdance” is not focused on break dancing.
“Breakdance” has two senses in Roget: firstly it occurs as “Space: Motion: leap (noun)”,
and secondly as “Emotion, religion and morality: Personal emotion: Amusement dance
(verb)”. Consequently, its interpretation is ambiguous.
Graph 6-6: Breakdance: S1 ratings (Document vs Rating vs Responses (%); axes: Subject Rating, Example Text Title)
Graph 6-6 shows subjects found little similarity between the source and example texts.
This is reflected in Table 6-13, which shows that neither SWISH nor Hesperus performed well
at this level of similarity.
Table 6-13: Breakdance: Experimental Comparative Results
Percentage similarity as measured using specified techniques
Columns: SWISH; Hesperus fine grained and standard grain, each with and without explicit WSD.
Table 6-14: Breakdance: Hesperus and SWISH Spearman Rank Correlation.
Spearman’s rho and one-tailed significance between the subjects’ rankings (S1), SWISH, and the four Hesperus measures (fine grained and standard grain, each with and without explicit WSD).
Table 6-14 shows no significant correlation with the experimental data at the 0.05 level.
In particular, SWISH did not register any text as relevant, as the term “Breakdance” was
not used in any of the documents retrieved. This indicates that the example pages
retrieved had been classified by their authors under this heading with an Internet
catalogue such as YAHOO, or indexed using a synonym (such as “Hip-Hop”) before
being retrieved via MetaCrawler.
Hesperus also had difficulty with some of the texts, such as the Infopedia extract “Info”,
as this was only one line long. Consequently, no lexical chain could be created from it.
Experimental Results Summary
Table 6-15 summarises the statistical probabilities shown in tables 6-4, 6-6, 6-8, 6-10,
6-12, and 6-14 above. These indicate whether the similarity rankings could be rank
correlated by chance alone. Since there are twenty-four Hesperus trials here, we may
expect one result where p < 0.05 by chance. However, there are eight values where p < 0.05.
Furthermore, these eight values include two results where p < 0.01, and one in which
p < 0.005. This alone would not be expected to occur by chance more than once in two
hundred and fifty trials. Consequently, we may conclude that Hesperus, under some
conditions, is capable of producing human-like text similarity assessments. We now go on
to discuss these findings.
Table 6-15 Hesperus: Table of significance of experimental results
Probability of experimental outcome using specified techniques
Columns: SWISH; Hesperus fine grained and standard grain, each with and without explicit WSD.
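The chance argument above can be checked with a simple binomial calculation. The sketch below assumes, purely for illustration, that the twenty-four trials are independent (in practice the four Hesperus measures on one topic share data, so this is only an approximation):

```python
from math import comb

def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more
    spuriously 'significant' results in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Expected number of spurious p < 0.05 results in 24 trials:
expected = 24 * 0.05  # about 1.2

# Probability of eight or more p < 0.05 results arising by chance alone:
p_eight = prob_at_least(8, 24, 0.05)
```

Under the independence assumption, eight or more results at p < 0.05 would be vanishingly unlikely by chance, which is the intuition behind the conclusion drawn above.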
6.4 Discussion and Conclusion
This chapter has assessed the effectiveness of the Generic Document Profile as a means of
comparing the similarity of different texts. This was done by creating a random, but
realistic, set of experimental texts, and having human subjects rank these in order of
similarity to texts on the same subject from the Encarta electronic Encyclopaedia. These
similarity judgements could then be used to examine the performance of Hesperus under
various operating conditions.
A number of experimental hypotheses were formulated in Section 6-2. We will now go on
to review these, and discuss their likelihood considering the experimental findings.
Hypothesis 1 proposed that “Lexical Chain based Similarity matching is able to produce
a ranking between an example text and several examples equivalent to those produced by
human subjects”
The results have shown that GDP values ranked in order have produced a highly significant
statistical correlation for one topic, Copyright, and significant correlations in two further
topics, AI and Rosetta. Consequently, the hypothesis is proven, with the caveat that
although lexical chain based similarity matching is able to produce such a ranking, it is not
guaranteed to do so, and did not in the remaining three topics.
This caveat is in part a consequence of the untuned nature of the algorithm (Section 3.6).
As discussed in Section 3.3 the weights of lexical links were determined empirically,
without reference to an external standard, since there is no document set readily available
that people have ranked in terms of similarity.
Hypothesis 2 proposed that “Lexical Chain based Similarity matching is not identical to a
purely term based approach”.
Hypothesis 2 was tested using an IR program, SWISH, to rank the experimental texts for
the presence of the key topic term. The rankings found using SWISH were marginally
correlated with those derived from the human subjects on three topics, and poorly
correlated on two others. This differs considerably from the subjects’ ratings, and from
those of Hesperus. The marginal correlation indicates that human similarity judgements
are based on rather more than term repetition.
Hypothesis 3 proposed that “The performance of Hesperus on text similarity matching
will improve if there is an explicit word sense disambiguation phase”.
This hypothesis was tested by generating GDPs, at both levels of granularity, both with
and without explicit word sense disambiguation. In some cases, disambiguation improved
performance slightly, however, in others it decreased it. Of the three cases where
similarity ranking achieved statistical significance, all had better results in the case
without explicit disambiguation. This would seem to provide support for Sanderson’s
(1996) thesis that disambiguation accuracy needs to be high to uniformly improve
performance.
Hypothesis 4 proposed that “The performance of Hesperus will alter if the granularity
level of GDP matching is refined”.
This hypothesis was tested by generating GDPs for all the experimental texts at two
levels of granularity. The one thousand topic heads in Roget were considered the normal
level of granularity, so that each GDP could contain up to 1000 categories. The
fine-grained level of granularity used the further subdivisions possible with the 1987 Roget,
which allowed each GDP to have up to 6400 categories.
The normal level of granularity led to higher similarity scores but reduced the accuracy
of the matches in comparison to the human judgements. As a result, the best rank
correlations were achieved with fine-grained matching, even though the match scores
were lower. Consequently, it would seem that granularity altered Hesperus’ performance.
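This trade-off can be illustrated with a small sketch. Here a GDP is modelled, purely for illustration, as a mapping from a Roget category identifier to a chain strength; the category identifiers, the cosine comparison, and the coarsening rule are assumptions made for the example, not Hesperus’ actual data structures or matching algorithm.

```python
from math import sqrt

# Illustrative profile type: category id -> chain strength.
GDP = dict[str, float]

def similarity(a: GDP, b: GDP) -> float:
    """Cosine similarity between two profiles."""
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def coarsen(fine: GDP) -> GDP:
    """Collapse fine-grained categories (e.g. '361.2') onto their
    topic heads (e.g. '361'), summing the strengths."""
    coarse: GDP = {}
    for cat, strength in fine.items():
        head = cat.split(".")[0]
        coarse[head] = coarse.get(head, 0.0) + strength
    return coarse
```

Coarsening merges neighbouring distinctions, so coarse profiles overlap more and produce higher match scores, while discarding exactly the information that separates near-miss texts, which is consistent with the finding above.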
Hypothesis 5 proposed that “Articles written in an encyclopaedia style will be preferred
by the subjects over Internet web pages”.
This hypothesis was tested by including an extract from a second encyclopaedia,
Infopedia, in the experimental texts. These extracts were consistently rated as similar to
the Encarta text. In all the six topics, the Infopedia extract was rated in the first three out
of six. It was considered most similar in two topics, and second most similar in a further
Hesperus usually matched the high rating of the Infopedia articles, ranking them in the
first three in most cases. It would appear then that the content of the articles caused their
high rating, rather than stylistic considerations.
Of the five hypotheses described above, the first was the most important. This supports the
central aspect of this thesis: that Roget’s thesaurus may be used to determine the
similarity of texts. The remaining four hypotheses, especially hypotheses two and five,
provide supporting evidence that the results were not due to an artefact of the
experimental design. Hypotheses three and four provide evidence to determine the best
operating parameters for Hesperus. These appear to be fine-grained, without explicit word
sense disambiguation.
Two observations may also be made not directly connected to the hypotheses above.
These have general implications. Firstly, it was clear from Section 3.3 that a theoretical
minimum of two words is required to make a lexical chain. In practice, however, several
sentences may be needed before any links may be identified using Roget’s thesaurus.
Thus, in the Breakdance experiment, a one sentence text was not comparable to the
source text, as it did not form any chains. By contrast, Information Retrieval approaches are
capable of working with single word queries and texts. This makes SWISH-like programs
more widely applicable, if less accurate, than the GDP text similarity methods evaluated here.
The second observation relates to the relationship between a query, and the content found
in formal indices such as encyclopaedias, compared to informal knowledge sources such
as the Internet. As noted in the Breakdance experiment, the majority of the Internet
derived texts do not describe the activity of breakdancing, but related experiences whilst
breakdancing. Thus, the fundamental relationship between topic and text is different
between the two information sources.
A similar observation may also be made regarding “Fishing”, which was one of the
rejected experimental topics. There, Encarta describes the process of fishing, whilst
Internet sites that mention fishing describe the quality of the fishing.
These observations do not invalidate the experimental results described in this chapter.
They do however indicate some limitations on the wider application of encyclopaedias as
tools to improve Internet searching.
We now go on to present overall conclusions from the study and discuss further work.
Chapter 7. Conclusions and Further Work
Figure 7-1: Books on Chains in Hereford Cathedral Library
7.1 Introduction
This study has explored whether Roget’s Thesaurus may be used as a knowledge source
to extract a representation of a text’s meaning from its content. This representation could
then be used to automatically identify whether, and by how much two texts are similar.
The method was based on thesaurally derived lexical chains: sequences of not
necessarily contiguous words that support a text as a coherent structure.
The structure of text is an emergent property of the nature of discourse. Text and writing
are evolutionary descendants of human verbal interaction. This is an intrinsically serial
process. Sounds follow each other to make up words and conversations. This may be
compared to visual perception, where an entire scene is seen in parallel.
An explicit structure is needed to manage the process of conversational interaction
(Schegloff, and Sacks 1973). This can be approximated as a text grammar (van Dijk
1977) for building applications (e.g. Tait and Ellman 1999) although the structure is an
emergent property of the process of communication, rather than rule following.
Any process that flattens the structure of text is carrying out an information reducing
transformation. The structure of discourse itself contains useful information such as
announcing topics before discussing them, outlining arguments, re-expressing key points
and forming conclusions.
The reason for losing structure from text was to derive a document representation that
could be used to compare one text to another irrespective of the precise words they
contain. This document representation was based on the cohesive properties of text as
described by Halliday and Hasan (1976), and proposed for a computer implementation by
Morris and Hirst (1991).
Related implementations (StOnge 1995, Stairmand 1996, Okumura and Honda 1994,
Richardson and Smeaton 1995, Kominek and Kazman 1997, and Barzilay and Elhadad
1997) that derive the lexical chains in a text were discussed in Chapter 2. These studies
used nouns only to derive texts’ lexical chains. This was due to the lack of a simple
relationship between parts of speech in the source of knowledge used for lexical chaining.
This was principally WordNet (Fellbaum 1998).
This study was based on Roget’s thesaurus. One motivation for this was the clear
relationship between semantically related words that are different parts of speech. A
second reason was that its semantic structure contains far fewer categories than WordNet.
This simplifies the document representation used for comparing texts. The derivation of
this representation was described in Chapter 3.
Chapter 4 raised and addressed the issue that text genre would have a profound effect on
the GDP. Several book length texts of classical literature were analysed. These were
shown to have different reading complexities, as found by readability statistics. The
lexical chains in these texts were determined, and the distribution of their lexical links
plotted. Furthermore, they were found to be equivalent across the different text
complexities, and to conform to a distribution frequency common in language phenomena
that was first observed by Zipf (1949). Chapter 4 concluded that lexical chains could be
used to determine the similarities of texts from different genres, as the analysis was genre
independent.
The issue of word sense disambiguation (WSD) was raised in Chapter 5. Several simple
WSD techniques suitable for use in Hesperus were presented as a system called SUSS.
These were evaluated in the context of the Senseval competition. Senseval provided a
large, manually sense tagged corpus which provided a gold-standard against which word
sense disambiguation methods could be compared both against human performance, and
against each other.
Chapter 6 described the experimental verification of Hesperus by comparing its
performance to that of human subjects. The experiment entailed human subjects ranking a
random set of texts in order of their similarity to texts of a known standard. Hesperus was
evaluated by comparing its ordering of the texts’ similarity and comparing it to those
given by humans.
The experiments were statistically significant in two of the six topics used, and gave
indications of positive performance in a further two. This gives a good indication that
Hesperus was able to determine the similarity of texts using Roget’s thesaurus. Hesperus
did not produce significant results in the remaining two cases. We hypothesised in
Section 6.4 that this may have been because their topic classification was not derived
algorithmically from their contents, but by human editors.
7.2 Conclusions about the research hypotheses
Four research hypotheses were raised in Chapter 1. Here we look at each in turn and
assess whether it has been proved.
Whether a text similarity measure may usefully be constructed from a text’s
lexical chains as identified using Roget’s Thesaurus.
This hypothesis was demonstrated successfully in Chapter 6. Further improvements could
be made by tuning the link weights.
Whether the text similarity measure defined provides a better approximation
of human judgements than statistical methods used in Information Retrieval.
Within the scope of the limited experiments in this study, IR judgements were inferior,
where they did not have access to term/document distribution frequency over the whole
corpus. This was shown again in Chapter 6.
Whether the text representation considered is suitable for the analysis of texts
of different lengths and complexities.
Chapter 4 showed that texts of different complexities contain the same underlying
distribution of lexical links. Since the text representation is built upon this, there is reason
to believe that the text representation may be applied to further text types.
Chapter 3 also showed that the GDP conforms to Zipf’s empirical laws.
Whether the measure may be improved by including word sense
disambiguation at current levels of accuracy.
This was demonstrated to be experimentally false in Chapter 6.
7.3 Contributions
This study has examined whether Roget’s thesaurus may be used to determine the
similarity of texts. The specific approach has involved the generation of text
representations based on Roget categories that may then be compared.
The study has made a number of contributions:
1. We have shown that a lexical chaining program can be implemented based upon
Roget’s thesaurus as suggested by Morris and Hirst (1991). Unlike other lexical
chaining programs (StOnge 1995, Stairmand 1996, Okumura and Honda 1994)
this lexical chaining system has used all parts of speech.
2. We have examined the lexical cohesive relationships proposed by Morris and
Hirst (1991), based on the work of Halliday and Hasan (1973, 1984). It was
shown that only three of Morris and Hirst’s (1991) thesaural relationships are
practically useful. Some relationships hardly occur in normal text, and others are
found too frequently to be useful.
3. We have shown that Zipf’s law applies to the frequency distribution of lexical
coherence relations when their frequency is compared to the distance between
related words.
4. A novel document representation has been proposed based on categories in
Roget’s thesaurus. This method has been used in a new technique to assess the
similarity of whole texts. Furthermore, it has been experimentally compared to
human judgements and shown to be statistically significant in several, although
not all, instances.
5. The effects of word sense disambiguation and granularity have been studied in
relation to the text similarity task. Support has been presented for Sanderson’s
(1996) view in information retrieval that word sense disambiguation needs to be
very accurate to contribute positively. This supports Voorhees’ (1994) findings,
where WordNet was used to disambiguate TREC queries, but with no overall
improvement.
7.4 Future Work
There is a wealth of exciting work that naturally follows from this study. This may be
broadly characterised as developing and extending the basis of the work on lexical chains,
and further developing the approach to text similarity matching. Here we will look at
these in turn.
Word Sense Ambiguity
In this section, we consider some implications of word sense ambiguity and their effect on
Roget based lexical chaining.
Word Sense Frequency Information
Selecting an inappropriate word sense is a cause of inaccuracy in the creation of lexical
chains. This was explored in Chapter 5, where an explicit word sense disambiguation
(WSD) phase was investigated. Chapter 5’s principal conclusion was that the contribution
that different WSD methods make should be considered carefully, as they may
underperform the selection of the most common word sense where this is known.
Dictionaries order the entries for different word senses according to their commonality.
Consequently, a baseline method of determining a word sense is to select the first
dictionary entry. Wilks (1999) has stated that selecting the first LDOCE sense results in
62% correct sense assignment, whilst Ng and Zelle (1997) report Miller et al., as finding
58.2% using WordNet.
Roget’s thesaurus gives no clue as to which entry corresponds to the most frequent word
sense. The lexical chaining process must determine an appropriate word sense from
context. This sometimes means that an association will be found between two words
based on the selection of two inappropriate, rare, word senses.
WordNet’s sense frequency information could be used to alleviate this problem. This
presupposes that WordNet’s sense inventory could be mapped onto that of Roget. This
should be possible since WordNet’s 70,000 synsets contain narrow sense distinctions,
whereas Roget’s 1000 categories are broad.
A sense equivalence algorithm would need to calculate the best overlap between
WordNet’s synsets for a word, and its Roget categories. Once this had been determined,
the Roget categories could be ordered according to the order of their related synsets. This
would give an ordering to Roget categories where currently there is none.
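One possible shape for such a sense equivalence algorithm is sketched below. Both inventories here are toy data, and the overlap rule is an assumption made for the example; as the surrounding discussion notes, the real mapping problem is considerably harder.

```python
def order_categories(
    synsets: list[set[str]],          # WordNet synsets, most frequent first
    categories: dict[str, set[str]],  # Roget category -> words it contains
) -> list[str]:
    """Order a word's Roget categories by the rank of the first
    (most frequent) WordNet synset each one overlaps, breaking
    ties by overlap size.  Categories with no overlap sort last."""
    def best_rank(cat: str) -> tuple[int, int]:
        words = categories[cat]
        for rank, syn in enumerate(synsets):
            overlap = len(words & syn)
            if overlap:
                return (rank, -overlap)
        return (len(synsets), 0)
    return sorted(categories, key=best_rank)
```

Applied to real data, the resulting order would approximate a frequency ordering for Roget categories, which, as noted above, does not currently exist.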
Mapping between sense inventories has been explored by several workers. Yarowsky
(1992) suggested a method that would equate WordNet synsets with categories in Roget’s
thesaurus. Recently Litkowski (1999) has proposed using the sense equivalences that
were manually determined between Hector (Chapter 5) and WordNet for the Senseval
competition as a baseline measure to evaluate sense equivalence algorithms. His
suggested approach exploits the semantics of dictionary organisation, rather than simple
term mapping.
However determined, it is likely that the sense ordering procedure would be approximate,
since there is no exact mapping between any pair of word sense ordering schemes. It also
depends on the accuracy of the word sense frequency information in WordNet (see
Chapter 5). Whether it would improve Hesperus’ disambiguation accuracy would need to
be determined experimentally.
Word Sense Disambiguation, Chaining and Length
It is an experimental observation that the word sense ambiguity problem decreases in
importance as documents increase in length. This happens because the same concept will
occur sufficiently frequently (and with different synonyms) in a coherent and cohesive
text that the correct interpretation will eventually be chosen.
Accuracy Metrics for Word Sense Disambiguation
There are reasons to question Sanderson’s (1996) results. His technique of term
combination reduces the number of distinct terms in his document collection. This is a
critical element in many IR systems that use the tf*idf (term frequency × inverse
document frequency) heuristic. In principle, disambiguation will lead to an increased
number of distinct word senses that should permit better discrimination on term
frequency, and so improve performance.
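For reference, the tf*idf heuristic under discussion can be sketched minimally as follows; production IR systems add length normalisation and various smoothing variants on top of this basic form.

```python
from math import log

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """Basic tf*idf: term frequency in the document, weighted by the
    log inverse of the term's document frequency in the collection."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * log(len(corpus) / df)
```

A term confined to one document scores higher than one spread across the collection, which is why reducing the number of distinct terms (as Sanderson’s term combination does) disturbs the discrimination the heuristic relies on.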
Chains Theory
In this section, we will consider future work that relates to some theoretical options
related to lexical chaining theory. Any improvements in lexical chaining should result in
overall improvements in text similarity assessment.
The Interaction of Lexical Chaining and Part of Speech
The process of lexical chaining is relatively new, and has largely conformed to the
possibilities allowed by the subset of WordNet that concerns nouns only, usually called
NounNet. These studies (Stairmand 1996, Green 1999, StOnge 1995) have frequently
found inferior performance to statistical approaches. The degree to which this degraded
performance has been due to the information lost is not clear.
The Roget based chainer may use all parts of speech. If a restricted subset were created
out of nouns only, its performance could be compared to that of a lexical chainer that uses
all parts of speech.
The text similarity problem addressed here would provide an ideal benchmark. An “all
words” chainer could be compared on the baseline similarity task to that of a chainer that
used nouns only. Unchanged or improved performance by the nouns only chainer would
support Stairmand’s (1996) hypothesis that nouns contain the substance of a text.
The implication of successful performance would indicate that WordNet based chainers
could be improved if the noun and verb hierarchies in WordNet could be integrated.
Unambiguous Chaining
Word Sense Disambiguation has been studied extensively in this thesis since it has a
pronounced effect on the accuracy of lexical chaining. Since the WSD problem is well
known, it is unlikely to be adequately solved in the near future. An alternative approach to
the application of lexical chains would use a system that included only unambiguous words.
Sixty percent of the words in Roget occur in one entry only, and are hence unambiguous.
Since they have entries in the thesaurus, the algorithms described in this study would
continue to be applicable. A future line of investigation would be to pre-process the
thesaurus so that it contained only monosemous words. Text similarity performance could
then be assessed. The question to be answered is whether text similarity performance
would suffer more because of lost vocabulary, or benefit from the increased accuracy of
the lexical links.
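The proposed pre-processing step can be sketched directly; the index structure (word to list of thesaural entries) is an assumption made for illustration.

```python
def monosemous_only(index: dict[str, list[str]]) -> dict[str, str]:
    """Pre-process a thesaurus index, keeping only monosemous words:
    any word appearing under more than one entry is dropped, so every
    surviving word maps unambiguously to its single entry."""
    return {word: entries[0]
            for word, entries in index.items()
            if len(entries) == 1}
```

Since, as noted above, some sixty percent of the words in Roget occur in one entry only, such a filtered thesaurus would retain substantial coverage while making every lexical link unambiguous.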
Other Chains
Anaphora add greatly to the meaning and focus of a text by allowing key terms to be
substituted for simpler expressions. These anaphora do not contribute to the lexical chains
identified by Hesperus. Consequently, they are excluded from its Generic Document
Profile. Techniques have been described by Kennedy and Boguraev (1996) that identify
the referents of anaphora. If these could be adapted by Hesperus, they would potentially
increase its accuracy. Of course, there is a risk that, as with word sense disambiguation,
inaccurate recognition of anaphora would decrease its performance.
Chain Applications
In this section, we consider possible applications that may be derived from this study and
its results.
Use of Roget in other Lexical Chaining Activities
Lexical chains have been used in several problem areas such as summarisation (Barzilay
and Elhadad 1997), IR (Stairmand 1996), malapropism detection (StOnge 1995), and
hypertext linking (Green 1997, 1999).
Hesperus’ performance was crudely compared to that of StOnge’s chainer (Chapter 3).
However, both Stairmand (1996) and StOnge (1995) speculate that their work could have
been improved if WordNet had access to the relationships encapsulated in Roget.
A natural development of this study would be to apply Roget to other areas suitable for
the application of lexical chaining. This would have the practical objective of improving
on the known performance on the published systems above that obtained using WordNet.
Information Metric
Chapter 4 of this study reported data that showed that the distance between thesaurally
related words follows a common frequency distribution across text types. If this data can
be shown to hold in further research, it would be an interesting property of coherent texts.
This relationship has been proposed by Ellman and Tait (1997) as the basis for an
“Information Metric”. That is, a measure that would allow the automatic differentiation
between coherent text, and incoherent text possibly designed to confuse an Internet search
engine (known colloquially as “spam”). Whilst Ellman and Tait’s (1997) measure was a
crude mean value, the concept could be further developed based on the observed versus
expected frequency data. As such, it would be appropriate for analysis with the chi-squared statistic. This would be an interesting application of lexical chains.
This section contains some suggestions to improve the accuracy of the similarity matching process.
Broader Authentication on Genre and Text Length
Chapter 4 investigated the claim that lexical cohesive relations were impervious to effects
of document length and style. This work was based on several book length texts. Since
this was shown to be true for those texts, it is possible that text similarity performance is
independent of genre and document length. Broader authentication of this issue is needed.
Unknown words
The problem of words that are not in the thesaurus affects all lexical chainers. Unknown
words cannot form part of lexical chains and are excluded from any document
representation. This work has shown that 60-80% of the chain links identified were due to
term repetition. If a repeated term is not in the thesaurus, the chainer cannot currently
recognise that it is a distinctive aspect of the document. If such terms could be incorporated
into the chaining approach, this would improve Hesperus’ general applicability.
Solutions to the unknown words problem have been proposed by Green (1999). These
include the application of Dumais (1995) techniques for latent semantic indexing
(Chapter 2) for unknown function words, and a system for dealing with proper names.
The identification of named entities is a recognised task within MUC (Chinor 1998).
Named entities could be stored in a separate, thesaural pseudo category. This would allow
items such as company or product names to contribute to the document representation.
This would exploit the value and specificity of identical word links, without increasing
the ambiguity within the GDP.
Weighting and Corpora
Text similarity matching is carried out in Hesperus as though all thesaural categories are
equally important. However, many categories, like the terms they contain, are more
indicative of content than others.
Further normalisation would be possible for example by eliminating attributes whose
value fell below certain percentages. Recalculating percentage weights would then have
the general effect of focusing the document representation to favour those concepts that
are already strongly represented.
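The normalisation step described above can be sketched as follows; the 5% cut-off is an arbitrary illustrative value, not a tuned parameter.

```python
def focus_profile(gdp: dict[str, float], cutoff: float = 5.0) -> dict[str, float]:
    """Drop attributes whose percentage strength falls below the
    cut-off, then rescale the survivors to sum to 100 again."""
    kept = {cat: v for cat, v in gdp.items() if v >= cutoff}
    total = sum(kept.values())
    return {cat: 100.0 * v / total for cat, v in kept.items()} if total else {}
```

Recalculating the percentages after the cut transfers the discarded weight onto the remaining categories, focusing the representation on concepts that are already strongly represented.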
The presence of rare terms in a document has been reliably shown to improve
performance of information retrieval systems (Salton and McGill 1988, Baeza-Yates and
Ribiero-Netto 1999) using the tf*idf heuristic. This differentially weights those terms that
are present in a text, but that are infrequent in a collection of documents. If the relative
frequency of thesaural categories could be determined, this could be used to weight
categories as more or less indicative of a text’s content. Such frequency information could
come for example from the British National Corpus, although other corpora would be
needed for specialist domains.
Representing GDPs
The representation of GDPs is a considerable problem. This study has drawn GDPs as
tables, although graphical techniques are widely used for exploring the characteristics of
data. If a graphical representation for GDPs were found, people would be able to see at a
glance approximately what a text was about. This could be a useful tool for information
browsing. Furthermore, two or more GDPs could be compared and their similarity
determined visually. Users would be able to differentiate conceptual clusters in the
representation that correspond to themes in a document, and to identify their equivalents
in similar texts. Users could then alter a desired concept strength interactively, weighting
it differentially and so influencing the system’s similarity assessment.
Chapter 7
7.5 Summary
It was recognised at the start of this study that the techniques being investigated would
not be completely accurate. It was, however, anticipated that they would be robust. The
essential issue was whether inaccurate use of the cohesive nature of text could simulate
human performance on a natural language problem such as deciding how alike two
texts are.
Evidence has been presented to show that even inexact use of thesaural relationships in
analysing natural language can support a simple task such as determining text
similarity. The challenge now is to increase the accuracy of the process by better
exploiting the words that texts contain.
References
Aamodt A. and Plaza E. (1994) “Case Based Reasoning: Foundational Issues,
Methodological Variations and System Approaches” AI Communications Vol. 7, No. 1,
March 1994.
Agirre, E. and G. Rigau (1996) “Word Sense Disambiguation Using Conceptual Density”
In Proceedings of the 16th International Conference on Computational Linguistics
(Coling `96), Copenhagen, Denmark, 1996.
Allan J., Callan J, Sanderson M, Xu J, and Wegmann S. (1998) “INQUERY and
TREC-7” Proc. TREC 7. NIST
Allen, J. F. “Natural Language Understanding”, 2nd edition (1995) The
Benjamin/Cummings Publishing Company, Menlo Park, California, ISBN 0-8053-0330-8.
Alterman R. (1991) “Understanding and Summarization” Artificial Intelligence Review
Vol. 5., pp239-254.
Antworth, E. L. (1993) “Glossing text with the PC-KIMMO morphological parser”,
Computers and the Humanities 26:475-484.
Ashley, K. D. and Rissland, E. L. (1988) “A Case-Based Approach to Modelling Legal
Expertise.” IEEE Expert, 1988, Vol. 3, No. 3, pp. 70-77.
Austin, J.L. (1962) “How To Do Things With Words” Oxford University Press, Oxford.
Azzam S., Humphreys K., and Gaizauskas R. (1999) “Using Coreference Chains for Text
Summarization” in Proc. ACL'99 Workshop on “Coreference and Its Applications”
Baeza-Yates R. and Ribeiro-Neto B. (1999) “Modern Information Retrieval” Addison-Wesley, Harlow UK. ISBN 0-201-39829-X
Bartell, B. D., Cottrel, G. W., and Belew R. K. (1998) “Optimizing Similarity using
Multi-Query Relevance Feedback”. Journal of the American Society for Information
Science Vol. 49(8) pp742-761
Bartell, B. D., Cottrel, G. W., and Belew R. K. , (1995) “Representing Documents using
an Explicit Model of their Similarities” Journal of the American Society for
Information Science Vol. 46(4) pp245-271
Barzilay, R. and M. Elhadad. (1997). “Using Lexical Chains for Text Summarization.” In
Proceedings of the Workshop on Intelligent Scalable Text Summarization at the
ACL/EACL Conference, 10–17. Madrid, Spain.
Becker J. D. (1975) “The Phrasal Lexicon”. In Proceedings of the Conference on
Theoretical Issues in Natural Language Processing, Cambridge, MA, pp. 70-77
Beeferman D. Berger A., and Lafferty J. (1997) “A model of lexical attraction and
repulsion. In Proceedings of the ACL-EACL '97 Joint Conference, Madrid, Spain
Belkin, N. and Croft, W. B. (1987) “Retrieval Techniques.” Annual Review of
Information Sciences and Techniques (ARIST). vol. 22, 1987. 109-145.
Berners-Lee, T., Cailliau, R. Luotonen, A, Nielsen H. F, and Secret A, (1994) “The
World Wide Web”, CACM Vol. 37, No. 8, August 1994.
Bertino E. Catania B. and Ferrari E. (1999) “Multimedia IR: Models and Languages” in
Baeza-Yates and Ribeiro-Neto (1999)
Black W. J. Rinaldi F. Mowatt D. (1998) “FACILE: Description of the NE System Used
for MUC-7” in proc. MUC-7
Blair, D. C. and Maron, M. E. (1985). “An evaluation of retrieval effectiveness for a full-text document retrieval system.” Communications of the ACM, 28(3):289-299.
Boguraev B. and Pustejovsky J. (eds.) (1996) “Corpus Processing for Lexical
Acquisition”, MIT Press.
Boyd R., Driscoll J., Syu M. (1992) “Incorporating Semantics Within a Connectionist
Model and a Vector Processing Model” in Proc. TREC 2.
Brill, E. (1992). “A simple rule-based part-of-speech tagger.” Proceeding of the Third
Conference on Applied Natural Language Processing. Trento, Italy.
Brin S. and Page L. (1998). “The Anatomy of a Large-Scale Hypertextual Web Search
Engine.” Proceedings of the Seventh World Wide Web Conference (WWW7),
Brisbane, also in a special issue of the journal Computer Networks and ISDN Systems,
Volume 30, issues 1-7.
Buckley C. (1985) “Implementation of the SMART Information Retrieval System”
Cornell University Technical Report TR 85-686.
Buckley C. Salton G. Allan J. Singhal A. (1995) “Automatic Query Expansion Using
SMART: TREC-3” in Proc. TREC 3, NIST.
Burke R. Hammond K. Kulyukin V. Lytinen S. Tomuro N. and Schoenberg S. (1997)
“Question Answering from Frequently Asked Question Files.” AI Magazine, pp. 57-66, 1997.
Callan J.P and Croft W. B. (1993) “An Approach to Incorporating CBR Concepts in IR
Systems.” In “Case Based Reasoning and Information Retrieval : Exploring the
Opportunities for Technology Sharing” Papers from the Spring Symposium, AAAI
Press Technical Report SS-93-07
Charniak, E. and Wilks, Y. (1976) (eds.) “Computational Semantics”, North-Holland,
Amsterdam, Netherlands.
Chen H. (1994) “Collaborative Systems: Solving the Vocabulary Problem” IEEE
Computer May 1994
Chinchor N. (1998) “MUC-7 Named Entity Task Definition (version 3.5)” Proc. MUC-7.
COD9 “The Concise Oxford Dictionary” 9th Edition on CD-ROM. Oxford University
Press 1997
Collins, A.M., & Quillian, M.R. (1969). “Retrieval time from semantic memory.” Journal
of Verbal Learning and Verbal Behavior, 8, 240-247.
Crimmins F. Smeaton A. F. Dkaki T. and Mothe J. (1999) “TétraFusion: Information
Discovery on the Internet” IEEE Intelligent Systems July/August 1999.
Croft W. B. (1995a) “What Do People Want from Information Retrieval?” D-Lib
Magazine, November 1995.
Croft W. B. (1995b) “Effective Text Retrieval Based on Combining Evidence from the
Corpus and Users” IEEE Expert Vol. 10, No. 6, December 1995.
Dale R., Oberlander J, Milosavljevic M. and Knott A: (1998) “Integrating Natural
Language Generation and Hypertext to Produce Dynamic Documents”. Interacting
with Computers 11(2): 109-135 (1998).
Deerwester, S. Dumais, S. T. Furnas, G. W. Landauer, T. K. and Harshman, R. (1990)
“Indexing by Latent Semantic Analysis.” Journal of the American Society for
Information Science, 41,6, (1990), 391-407
Dutch R. A. (1962) “Preface to Roget’s Thesaurus” in Roget’s Thesaurus, Longmans
Draper S. W. and Dunlop, M.D. (1997) “New IR - New Evaluation: The impact of
interactive multimedia on information retrieval and its evaluation”, The New Review
of Hypermedia and Multimedia, vol 3, pp 107-122,
Dumais, S. T. (1995), “Using LSI for information filtering: TREC-3 experiments.” In: D.
Harman (Ed.), The Third Text REtrieval Conference (TREC3) National Institute of
Standards and Technology Special Publication
Dumais, S. T., Letsche, T. A. Littman, M. L. and Landauer, T. K. (1997) “Automatic
cross-language retrieval using Latent Semantic Indexing.” In AAAI Spring
Symposium on Cross-Language Text and Speech Retrieval, March (1997).
Ellman J. (1983) “An Indirect Approach To Types Of Speech Acts.” Proc IJCAI (1983)
Ellman J. (1997) “Using Information Density to Navigate the Web” UK ISSN 0963-3308
IEE Colloquium on Intelligent World Wide Web Agents. March 1997
Ellman J. (1998) “Using the Generic Document Profile to Cluster Similar texts” in Proc.
Computational Linguistics UK (CLUK 97) Jan. 1998 University of Sunderland
Ellman J. and Tait J. (1996) “INTERNET Challenges for Information Retrieval”, proc
BCS IRSG Conference March 1996
Ellman J. and Tait J. (2000a) “Roget's thesaurus: An additional Knowledge Source for
Textual CBR?”, in "Research and Development in Intelligent Systems XVI: Proc 19th
SGES Intl Conf. on Knowledge Based and Applied Artificial Intelligence" Bramer M.,
Macintosh A., and Coenen F. (eds) ISBN 1-85233-231-X, pp. 204-217, 2000.
Ellman J. and Tait J. (2000b) “On the Generality of Thesaurally derived Lexical Links” in
Actes de 5es Journées Internationales d'Analyse Statistique des Données Textuelles
March 2000 (JADT 2000) pp147-154 Ecole Polytechnique Fédérale de Lausanne.
Ellman J., Klincke I., and Tait J. (1998) “SUSS: The Sunderland University Similarity
System: Beneath the Glass Ceiling” in Proc SENSEVAL workshop University of
Brighton 1998.
Ellman J., Klincke I., and Tait J. (2000) “Word Sense Disambiguation by Information
Filtering and Extraction” in “Computers and the Humanities” vol. 34, number 1-2,
2000, Special Issue on “Senseval: Evaluating Word Sense Disambiguation Programs”
Guest Editors Adam Kilgarriff and Martha Palmer
Fellbaum, C. (1998), ed. “WordNet: An Electronic Lexical Database”. MIT Press,
Cambridge, MA.
Frakes W.B. and Baeza-Yates R (eds.) (1992) “Information Retrieval: Data Structures
and Algorithms” Prentice-Hall, ISBN 0-13-463837-9
Francis W. N. and Kucera H. (1979) “Brown Corpus Manual”.
Furnas, G.W. Landauer, T.K. Gomez, L.M. Dumais, S. T. (1987) “The vocabulary
problem in human-system communication.” Communications of the Association for
Computing Machinery, 30 (11), Nov 1987, pp. 964-971.
Gaizauskas R. Wakao T. Humphreys K. Cunningham H. and Wilks Y. (1995)
“Description of the LaSIE System as used for MUC-6”, In Proc. of the Sixth Message
Understanding Conference (MUC-6), Morgan Kaufmann, pp. 207-220,
Gauch S. and Wang J. (1997) “A Corpus Analysis Approach for Automatic Query
Expansion” in Proceedings of ACM CIKM '97.
Gonzalo J. Verdejo F. Chugur I. and Cigarrán J. (1998) “Indexing with WordNet synsets
can improve text retrieval” Proc. SIGIR (1998). Also cmp-lg/9808002.
Green S. (1996) “Using Lexical Chains to build Hypertext links in Newspaper Articles”
in proc AAAI Symposium
Green S. (1997) “Automatically generating Hypertext by Computing Semantic
Similarity” University of Toronto PhD Thesis. Computing Systems Research Group
Technical Report 366
Green S. (1999) “Building Hypertext Links by Computing Semantic Similarity” IEEE
Transactions on Knowledge and Data Engineering Vol. 11, no 8. September/October
Grefenstette G. (1994) “Explorations in Automatic Thesaurus Discovery” Kluwer
Academic Publishers, Boston ISBN 0-7923-9468-2
Grosz, B.J. Spärck Jones, K. and Webber, B.L. (eds.) (1986) “Readings In Natural
Language Processing.” Los Altos, CA,: Morgan Kaufmann. ISBN 0934613117.
Hahn U. and Chater N. (1998) “Similarity and rules: distinct? exhaustive? empirically
distinguishable?” Cognition vol. 65 pp 197-230.
Halliday, M. A. K. and Hasan, R. (1989), “Language, context, and text”. Oxford
University Press, Oxford, UK.
Halliday, M. A. K. and Hasan, R.: (1976), “Cohesion in English”, Longman, London.
Hammond K. R. (1998). “Ecological Validity: Then and Now” The Brunswik Society:
Web Essays #2.
Hampton J. A. (1998) “Similarity-based categorization and fuzziness of natural
categories” Cognition 65 (1998) pp137-165
Harrison C. (1980) “Readability in the Classroom” Cambridge University Press UK.
ISBN 0 521 22713 7
Hearst M. A., (1994) “Multi-Paragraph Segmentation of Expository Text.” Proceedings
of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces,
NM, June, 1994.
Hearst, M. and Schütze, H. (1996) “Customizing a Lexicon to Better Suit a
Computational Task”, in Boguraev and Pustejovsky (1996)
Hirst G. and St Onge D. (1998) “Lexical Chains as representations of context for the
detection and correction of malapropisms” in Fellbaum (1998)
Humphreys K. Gaizauskas R. Azzam S. Huyck C. Mitchell B. Cunningham H. Wilks Y.
(1998) “University of Sheffield: Description of the LaSIE-II System as Used for
MUC-7” in Proc. MUC-7
accessed 25/10/1999
Ide N., and Véronis J. (1998) “Introduction to the Special Issue on Word Sense
Disambiguation: The State of the Art” Computational Linguistics Vol. 24, No. 1, pp. 1-41.
Inference (1994) “ART*Enterprise: ARTScript Programming Guide. Chapter 19: Case-Based Reasoning”. Inference Corp.
Infopedia 96. CD-ROM Softkey Multimedia Inc.
Jiang, J. J. and Conrath D. W. (1997) “Semantic Similarity Based on Corpus Statistics
and Lexical Taxonomy”, in Proceedings of ROCLING X (1997) International
Conference on Research in Computational Linguistics, Taiwan, (1997).
Karlgren J. (1999) “Stylistic Experiments” in Strzalkowski (1999)
Karlgren, J. and Cutting, D. (1994) “Recognizing Text Genres with Simple Metrics
Using Discriminant Analysis” in proc. COLING (1994).
Kennedy C, and Boguraev B. (1996) "Anaphora for everyone: pronominal anaphora
resolution without a parser". Proceedings of the 16th International Conference on
Computational Linguistics (COLING'96), pp 113-118 Copenhagen, Denmark.
Kilgarriff, A. (1997) “I don't believe in word senses” Computers and the Humanities 31
(2), pp 91-113.
Kilgarriff, A. (1998) “Gold Standard Datasets for Evaluating Word Sense Disambiguation
Programs” Computer Speech and Language 12 (3) (also University of Brighton
Technical Report ITRI-98-08)
Kilgarriff, A. and Rose, T. (1998) “Measures for corpus similarity and homogeneity”
in Proc. 3rd Conf. on Empirical Methods in Natural Language Processing, Granada,
Spain, pp 46-52. (Also University of Brighton Technical Report.)
Kilgarriff, A. and Rosenzweig J. (2000) “English SENSEVAL: Report and Results”. To
appear in Proc. LREC, Athens, May-June 2000.
Kilgarriff A. and Palmer M. (2000) “Introduction to the Special Issue on SENSEVAL”
Computers and the Humanities Vol. 34 (1/2):1-13, April 2000.
Kinnear P. R. and Gray C. D. (1997) “SPSS for Windows made simple” Psychology
Press, Hove East Sussex ISBN 0-86377-827-5
Klincke I. (1998) “Word Sense Disambiguation” MSc Thesis. School of Computing and
Information Systems, University of Sunderland
Kolodner J. (1993) “Case-Based Reasoning” Academic Press/Morgan Kaufmann.
Kominek J. and Kazman R. (1997) “Accessing Multimedia through Concept Clustering”
Proc. CHI.
Krovetz R and Croft W. B. (1992) “Lexical Ambiguity and Information Retrieval”, ACM
Transactions on Information Systems, Vol. 10(2), pp. 115-141.
Kunze M. and Hübner A. (1998) “CBR on Semi-structured Documents: The
ExperienceBook and the FAllQ Project” in proc. 6th German Workshop on Case-Based Reasoning.
Lee L. (1997) “Similarity-Based Approaches to Natural Language Processing” PhD
Thesis, Harvard University.
Lenz M. (1998) “Textual CBR and Information Retrieval - A Comparison.” In proc. 6th
German Workshop On Case-Based Reasoning. Berlin, March 6-8, (1998)
Lenz M. Bartsch-Spörl B. Burkhard H-D. Wess S. (Eds.) (1998): “Case-Based Reasoning
Technology: From Foundations to Applications. “Lecture Notes in Artificial
Intelligence 1400, Springer Verlag, (1998) ISBN 3-540-64572-1
Lenz M. Hübner A. Kunze M. (1998) “Textual CBR” in Lenz, Bartsch-Spörl, Burkhard
and Wess (1998).
Lesk M. (1986) “Automatic Sense Disambiguation using Machine Readable Dictionaries:
How to tell a Pine Cone from an Ice Cream Cone” Proc. ACM SIGDOC, Toronto.
Lewis D. D. (1991) “Representation and Learning in Information Retrieval” PhD thesis
University of Massachusetts TR91-93
Lewis D. D. and Spärck Jones K. (1996) “Natural Language Processing for Information
Retrieval” CACM Vol 39 no. 1
Li W. (2000) “Zipf's law”.
Liddy E. D. (1998) “Enhanced Text Retrieval Using Natural Language Processing” ASIS
Bulletin June (1998)
Litkowski K. C. (1999) “Towards a Meaning-Full Comparison of Lexical Resources” in
Proceeding of the Association for Computational Linguistics Special Interest Group on
the Lexicon, June 21-22, College Park, MD (SIGLEX-99)
Loukachevitch N. V. and Dobrov B.V. 2000 “Thesaurus as a Tool for Automatic
Detection of Lexical Cohesion in Texts” in JADT2000 5es Journées Internationales
d’Analyse Statistique des Données Textuelles. École Polytechnique Fédérales de
Lausanne. Switzerland
Makuta M. Cohen R. and Donaldson T. “An integrated approach to detecting incoherence
in texts with emphasis on the role of evaluating lexical cohesion” to appear
Mann, W. C. and Thompson, S. A. (1988). “Rhetorical structure theory: A theory of text
organization.” Text , 8(3), 243-281.
Mauldin M. (1991) “Conceptual Information Retrieval: A Case Study in Adaptive Partial
Parsing” Kluwer Academic Publishers, Dordrecht The Netherlands
Mauldin M. L. (1991) “Retrieval Performance in FERRET: A Conceptual Information
Retrieval System” Proc 14th SIGIR (1991)
Mauldin M. L. (1995) “Measuring the Web with Lycos” Proc 3rd Int. World Wide Web
Conference, April 1995.
Mauldin M. L. (1997) “Lycos: Design choices in an Internet search service” IEEE
Intelligent Systems Jan-Feb (1997), p. 8-11
Mc Hale, M. L. (1998) “A Comparison of WordNet and Roget's Taxonomy for
Measuring Semantic Similarity.” 14 Sep 1998.
Miller G. Beckwith R. Fellbaum C. Gross D. and Miller K. (1990) “Introduction to
WordNet: An on-line lexical database” J. Lexicography 3(4) pp235-244
Miller, G. and Charles W. G. (1991) “Contextual Correlates of Semantic Similarity”,
Language and Cognitive Processes, Vol. 6, No. 1, 1-28.
Miller, G., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. (1990). “Five papers on
WordNet.” Technical Report CSL-43, Cognitive Science Laboratory, Princeton
University.
Morris, J. and Hirst, G. (1991). “Lexical Cohesion computed by thesaural relations as an
indicator of the structure of text.” Computational Linguistics, 17(1), pp21-48.
Nadeau, M. (1994) “An Improved Encarta” BYTE, January (1994)
Ng H. T. and Lee H. B. (1996) “Integrating multiple knowledge sources to disambiguate
word senses: An exemplar-based approach.” in Proc. 34th ACL pp40-47
Ng H. T. and Zelle J. (1997) “Corpus Based Approaches to Semantic Interpretation in
NLP” AI Magazine Vol 18, 4. Winter 1997.
Ng H. T., Lim C. Y., and Foo S. K. (1999) “A Case Study on Inter-Annotator Agreement
for Word Sense Disambiguation” in Proceeding of the Association for Computational
Linguistics Special Interest Group on the Lexicon, June 21-22, College Park, MD
Nowell T. L., France R. K., Hix D. Heath L. S, and Fox E. A. (1996) “Visualizing Search
Results: Some Alternatives To Query-Document Similarity” Proc. SIGIR 1996.
Okumura M. and Honda T. (1994) “Word Sense Disambiguation and text segmentation
based on lexical cohesion” Proc COLING 1994 vol 2 pp 755-761
Project Gutenberg (1999) “Official and Original Project Gutenberg Web Site and Home
Page”.
Qiu Y. and Frei H.P. (1995) “Concept Based Query Expansion” in Proc. ACM SIGIR
1995 pp160-169
Quillian M.R. (1968) “Semantic Memory” in Minsky, M. “Semantic Information
Processing” MIT Press, Cambridge Mass.
Rada R. and Bicknell E. (1989) “Ranking Documents with a Thesaurus” Journal of the
American Society for Information Science 40(5) pp304-310
Radev D. R. (1997) “Natural Language Processing FAQ”.
Resnik, P. (1995) “Using Information Content to Evaluate Semantic Similarity in a
Taxonomy”, Proceedings of the 14th International Joint Conference on Artificial
Intelligence, Vol. 1, 448-453, Montreal, August 1995.
Richardson, R. and Smeaton A.F. (1995) “Using WordNet in a Knowledge-Based
Approach to Information Retrieval.” Working Paper CA-0395, School of Computer
Applications, Dublin City University, Ireland.
Rissland E. L, and Daniels J. L. (1996) “The Synergistic Application of CBR to IR”
Artificial Intelligence Review vol 10, pp 441-475
Robertson S. E., Walker S., and Beaulieu M. (1998) “Okapi at TREC-7: automatic ad hoc,
filtering, VLC and interactive track.” Proc. TREC-7, NIST.
Robertson S.E. (1994) “Computer Retrieval: As Seen Through The Pages Of Journal Of
Documentation” in B.C. Vickery (ed.), Fifty Years of Information Progress.
London: Aslib, 1994, pp 119-146.
Salton G. and Buckley C. (1990) “Improving Retrieval Performance by Relevance
Feedback” Journal of the American Society for Information Science 41(4) pp288-297
Salton, G. and McGill, M. (1983) “Introduction to Modern Information Retrieval” McGraw-Hill, New York.
Sanderson M. (1994) “Word Sense Disambiguation and Information Retrieval” in Proc.
ACM SIGIR (1994) pp 142-151
Sanderson M. (1996) “Word Sense Disambiguation and Information Retrieval” PhD
Thesis, University of Glasgow
Schank, R. C., and Abelson. R. P. (1977). “Scripts, Plans, Goals, and Understanding.”
Hillsdale, NJ: Lawrence Erlbaum Associates.
Schegloff E. and Sacks H. (1973) “Opening Up Closings.” Semiotica 8 pp289-327.
Schütze H. (1992) “Dimensions of Meaning”, Proceedings of Supercomputing, pp 787-796, Minneapolis MN.
Searle, J. (1969) “Speech Acts: An Essay in the Philosophy of Language”, Cambridge,
Eng.: Cambridge University Press.
Selberg E. and Etzioni O. (1995) “Multi-Service Search and Comparison Using the
MetaCrawler” Proc WWW4
Selberg, E. and Etzioni, O. (1997) “The MetaCrawler Architecture for Resource
Aggregation on the Web” IEEE Expert, January / February 1997, Volume 12 No. 1,
pp. 8-14.
Shimazu H. Kitano H, and Shibata G (1993) “Retrieving Cases from Relational
Databases: Another Stride to Corporate Wide Case Based Systems” Proc IJCAI 1993
Sloman S. A., and Rips L. J. (1998) “Similarity as an explanatory construct” Cognition
vol. 65 pp87-101
Smeaton A. (1999) “Using NLP or NLP Resources for Information Retrieval Tasks” in
Strzalkowski (1999)
Spärck Jones K. (1986) “Synonymy and Semantic Classification” Edinburgh University
Press, Edinburgh UK
Spärck Jones K. (1999) “What is the role of NLP in Text Retrieval” in Strzalkowski (1999)
Spärck Jones K. and Willett P. (1997) “Readings in Information Retrieval” Morgan
Kaufmann ISBN 1-55860-454-5
Srinivasan, P. (1992) “Thesaurus Construction.” In Frakes and Baeza-Yates (1992)
Stairmand M. (1996) “A Computational Analysis of Lexical Cohesion with Applications
in Information Retrieval” PhD Thesis. UMIST Computational Linguistics Laboratory
Stairmand M. and Black W. J. (1996) “Conceptual and Contextual Indexing using
WordNet-derived Lexical Chains” in proc BCS IRSG
St Onge, D. (1995). “Detecting and Correcting Malapropisms with Lexical Chains.” MSc
Thesis, University of Toronto. Department of Computer Science Technical Report
CSRI-319, March 1995.
Strzalkowski T. (1999) “Natural Language Informational Retrieval” Kluwer Academic,
Dordrecht NL. ISBN 0-7923-5685-3
Sussna, M. (1993). “Word Sense Disambiguation for Free-text Indexing Using a Massive
Semantic Network.” Proceedings of the Second International Conference on
Information and Knowledge Management, pp 67-74.
Tait J. and Ellman J. (1999) “MABLe: a multilingual authoring tool for business letters”.
proc. ASLIB 21st Conf. on Translating and the Computer. Nov. 1999.
Van Dijk T. A. (1977) “Text and context. Explorations in the semantics and pragmatics of
discourse” London: Longman,
van Rijsbergen C. J. (1979) “Information Retrieval” London: Butterworths.
Voorhees E. (1994), “Query Expansion Using Lexical-Semantic Relations”, Proceedings
SIGIR `94, pp61-69,
Voorhees E. and Harman D. (1998) “Overview of the Eighth Text REtrieval Conference
(TREC-8)” in Proc. TREC-8, NIST.
Watson I., Watson H (1998) “Case-based content navigation” in Knowledge-Based
Systems, (1998), Vol.11, No.5-6, pp.345-353
Watson I. (1997) “Applying Case-Based Reasoning: Techniques for Enterprise Systems”
Morgan Kaufmann; ISBN: 1558604626
Watson I. and Marir F. (1994) “Case Based Reasoning: A review”. The Knowledge
Engineering Review Vol 9, 4 (1994) pp 327-354
Weaver P. L. (1993) “Practical SSADM Version 4” Pitman. London.
Wilbur, W. J. and Coffee, L. (1994) "The effectiveness of document neighbouring in
search enhancement" Information Processing and Management 30(2) pp253-266
Wilks Y. (1999) “COM 334: Language Engineering Notes”.
Wilks, Y, Slator B. and Guthrie L. (1995) “Electric Words: dictionaries, computers and
meanings” MIT Press.
Wilks, Y. and Stevenson, M. (1996) “The grammar of sense: Is word-sense tagging much
more than part-of-speech tagging?” Technical Report CS-96-05, University of
Sheffield. Also proc. SIGLEX (1997).
Winograd T. (1972) “Understanding Natural Language” (191 pp.) New York: Academic Press.
Xu J. and Croft B. W. (1996) “Query Expansion using Local and Global Document
Analysis” in Proc. ACM SIGIR 1996
Yang Y. and Pederson J. (1999) “Intelligent Information Retrieval” IEEE Intelligent
Systems July/August 1999.
Yarowsky D. (1992) “Word Sense Disambiguation Using Statistical Models of Roget's
Categories Trained on Large Corpora.” Proc COLING 1992 pp 454-460
Zipf G.K. (1949) “Human Behavior and the Principle of Least Effort” Addison-Wesley
Inc (republished by Hafner Publishing Co. New York (1972))
Zobel J. (1998) “How reliable are large-scale information retrieval experiments?”
Proceedings of the Twenty-First International ACM-SIGIR Conference on Research
and Development in Information Retrieval, Melbourne, Australia, August 1998.
Zobel J. and Moffat A. (1998) “Exploring the similarity space”, SIGIR forum 32(1):
pp18-34, spring 1998
Armstrong R. Freitag D. Joachims T. Mitchell T. (1995) “WebWatcher: A Learning
Apprentice for the World Wide Web” AAAI Spring Symposium on Information
Gathering In Heterogeneous, Distributed Environments. March 1995
Brüninghaus Stefanie and Ashley Kevin D. (1998) “Evaluation of Textual CBR
Approaches.” In: Proceedings of the AAAI-98 Workshop on Textual Case-Based
Reasoning (AAAI Technical Report WS-98-12). Pages 30-34. Madison, WI.
Catarci T, Chang SK, Liu W, Santucci G (1998) “A light-weight Web-at-a-Glance
system for intelligent information retrieval” Knowledge-Based Systems, Vol.11,
No.2, pp.115-124.
Crestani F. and van Rijsbergen C. J. (1998) “A study of probability kinematics in
information retrieval” ACM Transactions on Information Systems, Vol.16.
Davies J. Week R. Revett M. and McGrath A (1996) “Using Clustering in a WWW
Information Agent” Proc. BCS IRSG Conference March 1996
Di Battista G. Eades P. Tamassia R. and Tollis I. (1994) “Algorithms for Drawing
Graphs: an Annotated Bibliography” Computational Geometry: Theory and
Applications, 4(5), 235-282.
Eichmann D. (1994) “Ethical Web Agents” Proc WWW 94
Etzioni O. and Weld D. (1994) “A Softbot-Based Interface to the Internet” CACM, July 1994.
Hunt J. (1997) “Case Based Diagnosis and Repair of Software Faults” Expert Systems:
The International Journal of Knowledge Engineering and Neural Networks, (1997),
vol. 14, no. 1, pp. 15-23(9)
Knott A and Sanders T, “The Classification of Coherence Relations and their Linguistic
Markers: An Exploration of Two Languages.” Journal of Pragmatics vol 30 (1998).
Knott A. (1996) “A Data-Driven Methodology for Motivating a Set of Coherence
Relations” PhD Thesis Dept of AI, University of Edinburgh
Koster, M. (1994), World Wide Web Wanderers, Spiders and Robots, mak/doc/robots/robots.html
Lalmas M. and Bruza P.D. (1998) “The use of logic in information retrieval modelling”
Knowledge Engineering Review, Vol.13, No.3, pp.263-295.
Liebermann H. (1995) “Letizia: An Agent That Assists Web Browsing” Proc WWW4
Magennis M. (1995) “Expert rule-based query expansion” Proc BCS IRSG (1995)
Manber U. and Wu S. (1993) “GLIMPSE: A Tool to Search Through Entire File
Systems” University of Arizona Dept of Computer Science Technical Report TR 93-34.
Mladenic D. (1999) “Text-Learning and Related Intelligent Agents: A Survey” IEEE
Intelligent Systems July/August (1999).
Obraczka K., Danzig P. B., and Li S-H. (1993) “Internet Resource Discovery Services”
IEEE Computer September 1993.
Onyshkevych B. (1993) “Template design for Information Extraction” in Proceeding of
the Fifth Message Understanding Conference (MUC-5)
Paice, C. D. (1991) “A Thesaural model of Information Retrieval” Information
Processing and Management Vol. 27, 5, pp443-447.
Pasi G, Pereira R.A.M. (1999) “A decision making approach to relevance feedback in
information retrieval: A model based on soft consensus dynamics” International
Journal of Intelligent Systems, Vol.14, No.1, pp.105-122.
Resnick P and Varian H. (1997) “Recommender Systems” Communications of the
Association for Computing Machinery, 40 (3), Mar 1997, pp. 56-58
Riloff, E. and Lehnert, W. (1994) “Information extraction as a basis for high-precision
text classification” ACM Transactions on Information Systems Vol.12, No. 3 (July
1994), pp. 296-333
Schwartz M. (1993) “Internet Resource Discovery at the University of Colorado” IEEE
Computer September 1993
StatSoft, Inc. (1999). Electronic Statistics Textbook. Tulsa, OK: StatSoft.
Tait J. Sanderson H, Ellman J. Martinez A.M. Hellwig P, Tsagheas P, (1997) “Practical
Considerations in Building a Multi-Lingual Authoring System for Business Letters”
Proc. ACL'97/EACL'97 workshop on Commercial Applications of NLP
Willett, P. (1988) “Recent trends in hierarchical document clustering: a critical review”.
Information Processing and Management 24:577-97
Appendix I.
Experimental Examples on Rosetta
This appendix contains the texts for the text similarity experiment on the Rosetta Stone.
The articles are named as they appear in Chapter 6. The source text from Encarta is
known as “msRosetta”. The extract from Infopaedia is called “Info_Rosetta”. The other
texts have names derived automatically from their URLs.
The pages are shown as reduced to text only as this was the form used in the experiments.
Rosetta Stone, black basalt slab bearing an inscription that was the key to the deciphering
of Egyptian hieroglyphics and thus to the foundation of modern Egyptology. Found by
French troops in 1799 near the town of Rosetta in Lower Egypt, it is now in the British
Museum, London. The stone was inscribed in 196 BC with a decree praising the Egyptian
king Ptolemy V. Because the inscription appears in three scripts, hieroglyphic, demotic,
and Greek, scholars were able to decipher the hieroglyphic and demotic versions by
comparing them with the Greek version. The deciphering was chiefly the work of the
British physicist Thomas Young and the French Egyptologist Jean François Champollion.
The Rosetta Stone
Photo of the Rosetta Stone from British Museum (117k)
The Rosetta Stone led to the modern understanding of hieroglyphs. Made in Egypt around 200BC, it is a
stone tablet engraved with writing which celebrates the crowning of King Ptolemy V. It is a solid piece of
black Basalt and is 1m high by 70cm wide by 30cm deep. Quite heavy.
The interesting thing about the Rosetta Stone is that the writing is repeated three times in different
Hieroglyphic (top of stone)- used by ancient Egyptians
Demotic (centre of stone)- used by Arabs including modern Egyptians
Greek (base of stone)- used by, erm, Greeks, and other eastern Europeans
Simplified map of the world showing Egypt (10k)
Map of northern Egypt showing Rashid, the discovery location (44k)
The stone was re-discovered in 1799AD at Rosetta near Rashid, about 200km north of Cairo on the
Mediterranean coast. At that time, the meaning of hieroglyphs had been forgotten. Nobody could translate
any of the hieroglyphs found whilst raiding/exploring ancient Egyptian archeology.
However, the Rosetta Stone changed all that. Because people of the 19th century could understand the
Demotic and Greek parts of the engraving, a chap called Jean-Francois Champollion worked out which
words were represented by which hieroglyphs in 1821AD.
The Rosetta Stone now rests in the British Museum in London.
Here is an extract from the writing on the Rosetta Stone:
EUCHARISTO, the son of King Ptolemy and Queen Arsinoe, the Gods Philopatores, has been a benefactor
both to the temples and to those who dwell in them, as well as those who are his subjects, being a god
sprung from a god and goddess (like Horus the son of lsis and Osiris, who avenged his father Osiris) and
being benevolently disposed towards the gods, has dedicated to the temples revenues in money and corn
and has undertaken much outlay to bring Egypt into prosperity, and to establish the temples, and has been
generous with all his own means; and of the revenues and taxes levied in Egypt some he has wholly
remitted and others has lightened, in order that the people and the others might be in prosperity during his
reign: and whereas he has remitted the debts to the crown being many in number which they in Egypt and in
the rest of the kingdom owed: and whereas those who were in prison and those who were under accusation
for a long time, he has freed of the charges against them; and whereas he has directed that the gods shall
continue to enjoy the revenues of the temples and the yearly allowances given to them, both of corn and
money, likewise also the revenues assigned to the gods from vine land and from gardens and other
properties which belonged to the gods in his father's time...
Photo of the Place des Ecritures, Figeac, France (81k)
There is now a museum dedicated to the translator Jean-Francois Champollion, located in Champollion's
home town of Figeac near Cahors in southern France.
An Egyptian tablet in the Louvre Museum, Paris, France (26k)
Hieroglyphic name ring for King Ptolemy V
Champollion was born in December 1790 and could speak Greek, Latin, Hebrew, Arabic, Chaldean and
Syrian by the age of 14. By 19, Champollion was a History lecturer at Grenoble University. He had to make
his translations from a copy of the Rosetta Stone, since the stone itself had been stolen/seized by the English
during the Napoleonic war. Champollion visited Egypt only once, to put his new understanding of
hieroglyphs to the test. He returned to France to found the Egyptology Museum at the Louvre in Paris
(where you can still see many tablets and statues today). Champollion died in 1832 aged only 42.
Much of Champollion's work was based on that of Englishman Thomas Young who had already deciphered
names of people and places. Words like these, called Proper Nouns, are bordered by hieroglyphic name
rings, similar in shape to modern army name tags.
[ Cimmerii Index | [email protected] | Andrew Oakley ]
[ British Museum | Cambridge University Egyptology Dept. ]
[ Rosetta Stone- goth music band | Dark Horizons RS music reviews | Download RealAudio ]
Rosetta Stone
Slab of basalt with inscriptions from 197 BC, found near the town of Rosetta, Egypt,
1799. Giving the same text in three versions - Greek, hieroglyphic, and demotic script - it
became the key to deciphering other Egyptian inscriptions.
Discovered during the French Revolutionary Wars by one of Napoleon's officers in the
town now called Rashid, in the Nile delta, the Rosetta Stone was captured by the British
1801, and placed in the British Museum 1802. Demotic is a cursive script (for quick
writing) derived from Egyptian hieratic, which in turn is a more easily written form of
The Rosetta Stone
Download at full size (127K)
© The Trustees of the British Museum
The Rosetta Stone was the key that unlocked the mysteries of Egyptian hieroglyphics. Napoleon's troops
discovered it in 1799 near the seaside town of Rosetta in lower Egypt, and it eventually made its way into
the British Museum in London where it resides today. It is a slab of black basalt dating from 196 BC,
inscribed by the ancient Egyptians with a royal decree praising their king Ptolemy V. The inscription is
written on the stone three times, once in hieroglyphic, once in demotic, and once in Greek. Thomas Young,
a British physicist, and Jean Francois Champollion, a French Egyptologist, collaborated to decipher the
hieroglyphic and demotic texts by comparing them with the known Greek text. From this meager starting
point a generation of Egyptologists eventually managed to read most everything that remains of the
Egyptians' ancient writings.
When I started my company in 1976 I was a new Ph.D. in programming languages and thought I'd write
compilers for a living, so I took the stone's name because of its famous association with language
translation. I did implement a few programming languages but turned out to spend most of my time doing
interactive graphical applications, and most recently computer games.
Rosetta home page
Copyright © 1998 by Rosetta, Inc. All rights reserved.
“Rosetta” is a registered trademark of Rosetta, Inc.
Send comments or questions about this site to the webmaster.
Rosetta Technologies, Inc. - The Rosetta Stone
In 1799, soldiers in Napoleon's invading army
unearthed a flat basalt stone that had lain
concealed for nearly two thousand years near the
town of Rosetta, Egypt. The Rosetta Stone bore
inscriptions in three languages, and after years of
intense research, it gave linguists the key that
unlocked the secret of Egyptian hieroglyphics and
vastly expanded our knowledge of ancient Egyptian
history and culture.
In today's highly heterogeneous computing
environments, users who need to share data often
feel as if they might as well be dealing with
hieroglyphics: until they encounter Rosetta
Technologies. Rosetta unlocks the secrets of
diverse and incompatible computer documents and
breaks the barriers to communicating electronic
product data.
Rosetta Technologies Locations
The greatest power of computers is not as
numbercrunching machines but as platforms for
communication. Just as the Rosetta Stone broke the
barriers that kept Egyptian hieroglyphics a secret
for thousands of years, Rosetta Technologies breaks
the barriers that have made it difficult to exploit
the computer as a true communications vehicle. The
Rosetta Stone empowered us to communicate with the
past. Rosetta Technologies offers communication
capabilities that will carry us into the future.
Contact Rosetta Technologies today to see how to
take full advantage of your electronic product
In North America
Rosetta Technologies, Inc.
15220 NW Greenbrier Pkwy.,
Suite 300
Beaverton, Oregon 97006
1 503 690-2500
Sales (US): 1 800 445-0300
Telefax: 1 503 531-0401
Email: [email protected]
In Europe
Rosetta Technologies
35, cours Michelet
92060 Paris La Défense
Telephone: +33 01 47 73 15 60
Telefax: +33 01 47 73 15 58
Email: [email protected]
About Rosetta |
Products |
Services |
Technical Support |
Evaluation Software |
Site Index
Rosetta Stone Language Software - Multilingual Books - Rosetta Stone Language Software
Multilingual Books and Tapes
Rosetta Stone CD-ROM Courses
Ordering Information
Back to Software Page
Rosetta Stone CD-ROM Language Courses
The most extensive CD-ROM courses available! Equivalent to 2 years of college course study.
Available for Arabic, Chinese (Mandarin), English, French, German, Italian,
Latin, Japanese, Portuguese, Russian, Thai, and Vietnamese.
Seven language PowerPac sampler available for only $69
Full courses only $385
About Rosetta Stone Screen Shots Availability PowerPac
What is The Rosetta Stone CD-ROM language course?
The Rosetta Stone CD-ROM series is the premier choice for students seeking to master a foreign
language on their computer. Intended for all serious students, these courses are equivalent to 2 years of
college course study. In fact, these
packed CD-ROM courses for Windows and Macintosh are the most intensive computer courses you can
buy. Rosetta Stone teaches with an
extensive series of drills that associate text, spoken words, and pictures. Rosetta Stone teaches you not just
tourist phrases, but also lots of
vocabulary and grammar in a clear and simple manner. This is the only course you will need to learn your
choice of Arabic, Chinese (Mandarin),
English, French, German, Italian, Latin, Japanese, Portuguese, Russian, Thai, and Vietnamese. We love
Rosetta Stone courses and we sell them
below retail price at only $385.00. Get yours today.
The Rosetta Stone intensive courses assume no prior knowledge of a foreign language and will take the
student through the equivalent of
two years college study. Unlike learning in language classrooms, Rosetta Stone is designed to be
entertaining and self-paced. You can learn a foreign
language when you want, for as long or short a time as you wish. Each Rosetta Stone language CD-ROM
contains over 8,000 real life color
images, plus thousands of words and phrases spoken by native speakers. The emphasis is on both spoken
and written language, meaning that
Rosetta Stone teaches both the spoken and written foreign language. The interactive design is not a mere
“static” course that plays like a
cassette; constant student response is necessary and this holds the student's attention and improves retention
of new material. Built-in speech
recognition aids the student in pronouncing the words as they should be spoken. Dual versions allow for the
course to be used on either
Macintosh or Windows computers.
Who should buy The Rosetta Stone?
The Rosetta Stone intensive course is best suited for people who want a professional-quality language
course to learn more than just the
basics of a language. People who choose Rosetta Stone are getting the best and most extensive computer
course available today. But don't take
our word for it, check out these comments from major news sources:
“The premier foreign language CD-ROM”
The American School Board Journal, Sept. 1996
“Excellent teaching methods...beautiful photographs...a valuable educational tool and fun to use. Four stars.”
Joanna Pearlstein - MacWorld magazine
“It is rare to find a superbly designed instructional software program as this one. One of the top 10 CDs!”
CD-ROMs Rated - McGraw-Hill
“Top 100 CD-ROMs!” - PC magazine
“50 Best CD-ROMs!” - MacUser
Other outstanding features:
1 or 2 CD-ROMs with 92 lessons including 8 review lessons
Illustrated User's Guide
Language Book with curriculum text
Handbook for teachers
Icon-driven operation for ease of use
Instant scoring of exercises and tests
Timed modes for increased challenge
Wide scope of languages - intensive programs for Chinese, Russian, and Dutch are
Available for Windows 95, 3.x and Mac
Pricing and Availability
All Rosetta Stone intensive courses are equivalent to 2 years of college courses!
AR-525 Arabic Rosetta Stone Level 1 $385.00
CH-525 Mandarin Chinese Rosetta Stone Level 1 $385.00
DU-525 Dutch Rosetta Stone Level 1 $385.00
EN-525 English Rosetta Stone Level 1 $385.00
FR-525 French Rosetta Stone Level 1 $385.00
IT-525 Italian Rosetta Stone Level 1 $385.00
JA-525 Japanese Rosetta Stone Level 1 $385.00
LA-525 Latin Rosetta Stone Level 1 $385.00
PR-525 Portuguese Rosetta Stone Level 1 $385.00
RS-525 Russian Rosetta Stone Level 1 $385.00
SP-525 Spanish Rosetta Stone Level 1 $385.00
TH-525 Thai Rosetta Stone Level 1 $385.00
VI-525 Vietnamese Rosetta Stone Level 1 $385.00
Interested in a special version that includes samples of many different languages?
Check out the Rosetta Stone PowerPac edition for only $69.00!
Minimum system requirements:
Mac: 256 colors, CD-ROM drive, 4 MB RAM, microphone for voice
Win 3.x: 486, 4 MB RAM, 4MB hard drive space, CD-ROM drive, 256 colors,
Sound Blaster or 100% compatible, microphone for voice recording
Win 95: 486DX, 8 MB RAM, CD-ROM drive, 256 colors, Sound Blaster or 100%
compatible, microphone for voice recording
Ordering Information
Back to Top
Multilingual Books and Tapes
1205 E. Pike, Seattle, WA 98122
1-206-328-7922 - Fax: 328-7445
E-mail: [email protected]
© Copyright 1998 The Internet Language Company
Appendix II.
Experimental Texts
Experimental Texts and their Generic Document Profiles
This appendix shows the experimental source texts from Microsoft’s Encarta®. These
have been marked up by Hesperus to show the component lexical chains1. Each text is
followed by its Generic Document Profile, as determined using the fine-grain, no word
sense disambiguation options (Chapter 6).
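Since a Generic Document Profile is an attribute-value vector of Roget categories and their chain strengths, two profiles can be compared with a standard vector measure. The following is a minimal sketch only, not the Hesperus implementation: the profiles, category names, and strength values are invented for illustration, and cosine similarity stands in for the matching functions described in Chapter 6.

```python
import math

def cosine_similarity(gdp_a, gdp_b):
    """Compare two Generic Document Profiles, each a mapping from
    Roget category name to chain strength, by the cosine of the
    angle between their attribute-value vectors."""
    shared = set(gdp_a) & set(gdp_b)
    dot = sum(gdp_a[c] * gdp_b[c] for c in shared)
    norm_a = math.sqrt(sum(v * v for v in gdp_a.values()))
    norm_b = math.sqrt(sum(v * v for v in gdp_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical profiles (Roget category -> strength), for illustration only.
rosetta_encarta = {"Interpretation": 0.9, "Writing": 0.7, "Oldness": 0.5}
rosetta_info    = {"Interpretation": 0.8, "Writing": 0.6, "Sale": 0.2}
copyright_text  = {"Permission": 0.9, "Legality": 0.8, "Writing": 0.3}

# The two Rosetta texts share most of their strong categories, so they
# should score closer to each other than either does to the copyright text.
print(cosine_similarity(rosetta_encarta, rosetta_info))
print(cosine_similarity(rosetta_encarta, copyright_text))
```

On these invented profiles the two Rosetta texts score well above the Rosetta/copyright pair, which is the qualitative behaviour the experiments in Chapter 6 compare against human judgements.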
Rosetta Stone
rosetta-stone0 Stone, black2 basalt slab bearing an inscription that was the key to the
deciphering of Egyptian hieroglyphics and thus to the foundation of modern Egyptology.
Found by French troops in near the town of Rosetta in Lower Egypt, it is now in the
British Museum, London. The stone was inscribed in BC with a decree praising the
Egyptian king Ptolemy V. Because the inscription appears in three scripts, hieroglyphic,
demotic, and greek1 , scholars were able to decipher the hieroglyphic and demotic
versions by comparing them with the Greek version. The deciphering was chiefly the
work of the British physicist Thomas Young and the French Egyptologist Jean Fran ois
Table II-1: Generic Document Profile for “Rosetta Stone” from MS Encarta
Rosetta Stone GDP
Roget Categories
It is an artifact of the algorithm that non-ASCII and numeric text is omitted.
copyright1 , body5 of legal rights that protect creative0 works from being reproduced,
performed, displayed2 , or disseminated by others without permission10 . The owner of
copyright has the exclusive right to reproduce a protected work; to prepare14 other works
based on the protected work; to sell4 , rent, or lend copies of the protected work to the
public; to perform protected works in public; and to display copyrighted works publicly.
These basic exclusive rights of copyright owners are subject to exceptions depending on
the type of work and the type of use made by others. The term7 work used in copyright
law refers to any original creation of authorship fixed in a tangible13 medium. Thus,
works that can be protected by copyright include literary pieces, musical compositions,
dramatic selections, dances, photographs3 , drawings, paintings, sculpture, diagrams,
advertisements, maps, motion-pictures pictures, radio and television programs, sound
recordings11 , and computer12 software programs. copyright does not protect an idea or
concept; it only protects the way in which an author has expressed an idea or concept. If,
for example, a scientist publishes an article explaining a new process for making a
medicine, the copyright prevents others from copying the article, but it does not prevent
anyone from using the process described to prepare the medicine. In order to protect the
process, the scientist must obtain a patent. History of copyright The first real copyright
law, enacted in by the British Parliament, was the Statute of Anne. This law forbade the
unauthorized printing6 , reprinting, or importing of books for a limited number of years.
In the United States, the founding fathers recognized the need to encourage creativity by
protecting authors. They placed in the constitution of the United States a provision giving
Congress the power “to promote the progress of science and useful arts, by securing for
limited times to authors and inventors the exclusive right to their respective writings and
discoveries” ( art. I, Sect. ). This provision gave the federal government the power to
enact copyright and patent statutes. In , Congress passed the first U.S. copyright law.
Since then, the copyright statutes have been expanded and changed by Congress many
times. A major revision of U.S. law was made in the copyright Act, which remained the
basic framework for protection until January , , when the copyright Act of went into
effect. The act, which is the legal basis for copyright protection today, made substantial
and important changes in U.S. law. copyright in the United States The copyright Act
established a single system of federal statutory protection for all eligible works, both
published and unpublished. For works created after January , , copyright becomes the
property of the author the moment the work is created and lasts for the author's life plus
years. When a work is created by an employee in the normal course of a job, however,
the copyright becomes the property of the employer and lasts for years from publication
or years from creation, whichever is shorter. For works created before , the old act
provided that the copyright endured for years from the date the copyright was secured
and might be extended for another years, for a maximum term of protection of years. The
new act extended the renewal term for copyrights existing on January , , so that
copyright protection would last for years. However, for works produced in the United
States prior to , the owner must have filed a renewal application to obtain the benefit of
the renewal period. works that first obtained statutory copyright protection in or later
automatically receive the benefit of the renewal period. notice Although copyright
becomes effective on creation of a work, for works publicly distributed before march , ,
the copyright is potentially invalidated unless a prescribed copyright notice is placed on
all publicly distributed copies. For works published on or after march , , the use of a
copyright notice is optional, though recommended. This notice consists either of the
word copyright, the abbreviation Copr., or the symbol accompanied by the name of the
owner and the year of first publication (for example, John Doe ). In most printed books
the copyright notice appears on the reverse side of the title-page page. The use of the
notice is the responsibility of the copyright owner and does not require advance
permission from, or registration with, the copyright Office. A similar notice bearing the
symbol (for example, Doe record company) may be used to protect sound recordings
such as phonograph records and tapes. To enforce a copyright, the U.S. author or owner
must have applied to register with the copyright Office in Washington, D.C. To register,
the copyright owner must fill out the application, pay a fee, and send two complete
copies of the work, if published, which will be placed in the library of Congress. The
sooner the claim to copyright is registered, the more remedies the author may have in
any litigation to enforce the copyright. licensing copyright can be sold or licensed to
others. licenses of copyrights are normally granted in written contracts24 agreed-to to by
all parties involved. For example, an author of a novel can license one publisher to print
the work in hardbound copies, another publisher to produce paperback copies, and a
motion-picture company to make a movie based on the novel. A sale or license of
copyright made on or after January , , can be terminated by the author (or by the
author's family) years after the sale or license. The purpose of allowing such a
termination is to permit an author to obtain more financial reward if the work remains
commercially valuable over a long period of time. For the sale or license made before ,
the author has a similar right of termination years from the date the copyright was
originally secured or beginning on January , , whichever is later. The law sets up
conditions for reproduction of copies by libraries and archives and for transmission of
audiovisual and other programs and forbids unauthorized duplication of sound recordings.
It provides for royalty payments on recorded music, on public performance of sound
recordings by coin- operated phonographs, and on transmission of some television
programs. A radio station that broadcasts a recording of copyrighted music is
“performing” the work publicly and for profit and must be licensed to do so. In ,
however, the Supreme Court of the United states ruled that noncommercial use of
videocassette recorders does not violate copyright law. Infringement Copyright
infringement is any violation of the exclusive rights mentioned above-for example,
making an unauthorized copy of a copyrighted book. Infringement does not necessarily23
require word22 -for- word reproduction; “substantial similarity” to the copyright-
protected content of a work is sufficient. Generally, copyright infringements are dealt
with in civil lawsuits in federal court. If infringement is proved, the copyright owner has
several remedies available. The court may order an injunction against future
infringement; the destruction of infringing copies; reimbursement for any financial loss
incurred by the copyright owner; transfer of profits made from the sale of infringing
copies; and payment of fixed damages (usually between $ and $ , ) for each work
infringed, as well as court21 costs and attorney's fees. In copyright cases, a criminal
penalty of imprisonment and/or a fine can be imposed for knowingly infringing the
copyright for profit20 . fair Use An exception to the rule of copyright infringement is the
concept known as fair use, which permits the reproduction of copyrighted material for
purposes such as criticism9 , comment, teaching and research. In deciding whether a use
falls within the fair use exceptions, several factors are considered, including the purpose
of the use and the effect of the use on the value of the original-work work. Examples of
fair use include the quotation of excerpts from a book8 , poem, or play in a critical review
for purposes of illustration or comment; quotation of passages in a scholarly or technical
book to illustrate or clarify the author's observations; use in a parody of some of the
work being parodied; summary of a speech or article, with quotations, in a news report;
and reproduction by a teacher or student of a portion of a work to illustrate a lesson.
Because works created by the U.S. government cannot be copyrighted, material from the
many publications put-out out by the U.S. Government Printing Office may be
reproduced without fear of infringement. Advances in technology Technological
development has produced and will continue to produce new and different ways to store
information in smaller19 and smaller spaces, retrievable by electronic methods. Congress,
in passing the copyright Act, recognized that it could not foresee all the new methods of
fixing or storing information. Accordingly, it broadly defined the category of
copyrightable material to include all “ original works of authorship fixed in any tangible
medium of expression, now known or later developed, from which they can be perceived,
reproduced, or otherwise communicated, either directly or with the aid of a machine or
device.” Thus, an author who types a story on a computer, which stores it on a tape or
disc in computer18 memory, has “fixed” the work in a “ copy” sufficient for copyright
protection. International copyright Almost every nation has some form of copyright
protection for authors and artists. Most do not require marking published copies with a
formal copyright notice or registering the claim with the copyright Office, though use of
appropriate copyright notices is recommended to maximize international protection. The
United states is a member of the Universal copyright Convention (UCC), an
international treaty organization in effect since , designed to eliminate17 discrimination
against foreigners in copyright protection. More than nations belong to the UCC. Every
member nation must give foreign works that meet UCC requirements the same copyright
protection as that nation gives to domestic works and authors. An American who wishes
to secure copyright protection in the United states and in UCC member nations at the
same time can do so by marking all published copies with a copyright notice that
satisfies the provisions of both the UCC treaty and domestic U.S. law. This notice
includes the symbol , the name of the copyright owner, and the year of first publication.
Although no such thing as an “international copyright” exists, it is easy for an author to
obtain copyright protection in many nations. Several other international conventions also
provide copyright protection. As of march , , the United states became a member of the
Berne convention, which protects any works first published in a member nation, without
formalities such as a copyright notice. The Buenos Aires convention, a multilateral16
treaty of North and South American nations including the United states, requires a
statement such as “All Rights Reserved” to be printed in the copyright notice. In
February the United states and China signed an agreement to prevent companies in China
from illegally manufacturing items, such as compact discs and computer15 software, in
violation of American copyrights. The United states estimates that this piracy caused
American businesses to lose $ billion a year. To stop copyright violations, China agreed
to establish task forces and increase the power of customs officials.
“ copyright,” Microsoft(R) Encarta(R) Encyclopedia. (c) - Microsoft Corporation. All
reserved reserved.
Table II-2 Generic Document Profile for “Copyright” from MS Encarta
Copyright GDP
Roget Category
socialism0 socialism, economic and social1 doctrine2 , political movement inspired by this
doctrine, and system or order established when this doctrine is organized in a society.
The socialist doctrine demands state-ownership ownership and control of the
fundamental5 means of production and distribution of wealth, to be achieved by
reconstruction of the existing capitalist or other political system of a country through
peaceful, democratic, and parliamentary means. The doctrine specifically advocates
nationalization of natural-resources4 resources, basic industries3 , banking and credit
facilities, and public utilities. It places special emphasis on the nationalization of
monopolized branches of industry and trade, viewing monopolies as inimical to the
public welfare. It also advocates state-ownership ownership of corporations in which the
ownership function has passed from stockholders to managerial personnel. Smaller and
less vital enterprises would be left under private ownership, and privately held
cooperatives would be encouraged. These are the tenets of the socialist party of the U.S.,
the labour party of Great Britain, and labor or social democratic parties of various other
countries. Therefore they constitute the centrist position held by most socialists. Some
political movements calling themselves socialist, however, insist on the complete
abolition of the capitalist system and of private profit, and at the other extreme are
socialist programs having objectives entailing even fewer changes in the social order
than those outlined above. The ultimate goal of all socialists, however, is a classless
cooperative commonwealth in every nation of the world. Comparison with communism
The terms socialism and communism were once used interchangeably. Today, however,
communism designates those theories and movements that, in accordance with one view
of the teachings of Karl Marx and Friedrich Engels, advocate the abolition of capitalism
and all private profit, by means of violent revolution if necessary. Marx organized the
international Workingmen's Association, or First international; when this congress met
at Geneva in , it was the first international forum for the promulgation of communist
doctrine. This doctrine was later explained by Lenin, who defined a socialist society as
one in which the workers, free from capitalist exploitation, receive the full product of
their labor. Most socialists deny the claim of communists to have achieved socialism in
the USSR, which they regarded as an authoritarian tyranny. But after World War II, many
communist-led political parties in the Soviet sphere of influence still used the
designation socialist in their names. In East Germany (now part of the united federal
republic of Germany), for example, the name adopted by the merged communist and
social democratic parties was the socialist Unity party. The modern socialist
movement, as distinguished from communism, had its origin largely in the revisionist
movement of the late th century. The worsening condition of the proletariat, or workers,
and the class war predicted by Marx for Western Europe had not come about. Many
socialist thinkers began to doubt the indispensability of revolution and to revise other
basic tenets of Marxism. Led by the German writer Eduard Bernstein, they declared that
socialism could best be attained by reformist, parliamentary, and evolutionary methods,
including the support of the bourgeoisie. Moderate socialism Such a view was held by
the founders of the fabian society, organized in by British social reformers Sidney and
Beatrice Webb and their associates. The fabians in turn6 helped to form the British
independent labour party in ; it became affiliated with the newly organized labour party
in . In the U.S. a socialist Labor party was founded in . This party, small as it was,
became fragmented in the s. In a moderate faction of the party under Morris Hillquit
joined with the social democratic party of Eugene V. Debs and the christian socialists
of George D. Herron to form the socialist party. The moderate, or revisionist, type of
socialism found its clearest expression in the organization in Paris in of the Second
international. This body differed7 from the First international in that it was merely a
coordinator of the activities of its affiliated political parties and trade unions. The
Second international also diverged in ideology; a majority of its members, led by Eduard
Bernstein, were revisionists. The left-wing-wing minority was led by Lenin and the
German revolutionist Rosa Luxemburg; a third element, Marxist but opposed to Lenin,
was led by the German theorist Karl Kautsky. The Second international declared its
opposition to the preparations for war being made by most European governments. rise of
the left-wing Wing When World War I began in , modern European socialist leaders
supported their respective governments. Leaders of the socialist party in the U.S. and of
the labour party of Great Britain did not. Spokespersons for the left-wing wing, led by
Lenin, labeled the war an imperialist struggle and urged the workers of the world to
convert the war into a proletarian revolution or to turn the imperialist war into a class
war. This ideological conflict resulted in the collapse of the Second international.
Revived after World War I, it was never again important. Despite the decline of the
Second international, socialist parties made substantial gains during the years following
World War I and during World War II. In Great Britain, the labour party under Ramsay
MacDonald was in power for ten months in and again from to , but it lacked
parliamentary majorities and accomplished little. In Australia the Labor party held office
from to , from to , and from to . The labour government of New Zealand, elected in ,
remained in power until . In Scandinavia, candidates of the social democratic parties of
Denmark, Norway, and Sweden were elected to high positions early in the s; these
parties subsequently became dominant in Scandinavia. socialism Versus fascism During
the s and ' s socialist and communist parties were in continuous conflict. One point of
contention was the question of support for the USSR. socialists castigated communists as
agents of the Soviet union and traitors to their own countries. Also during the ' s and ' s,
Fascist regimes in Germany and Italy caused both socialists and communists to develop
new tactics. Attempts were made in several countries to form a united front of all
working-class organizations opposed to fascism, but the movement had limited success,
even in France and Spain, where it did well in the elections. Failure of the communists
and socialists of Germany to unite is regarded as one cause of the success of the national
socialists. The fragile alliance that was achieved between socialists and communists in
some countries during this “ popular-front Front” period was destroyed in by the
conclusion of a nonaggression pact between Germany and the USSR. socialists
condemned this act as a demonstration of the community of interest between two
totalitarian governments. In august , Germany invaded Poland, precipitating World War
II, and socialists in the allied countries immediately expressed full support for their
governments. After World War II An upsurge occurred in support of socialist parties
after the war, chiefly8 in Western Europe. The greatest advance was scored in great
Britain in ; the victorious labour party had in its campaign advocated the socialization of
the British economy. In ensuing years individual socialists won victories and in some
instances formed governments in France, Italy, Belgium, the Netherlands, Norway,
Sweden, and numerous other European countries. The socialist international, similar to
the Second international, was organized in in Frankfurt, West Germany (now part of the
united federal republic of Germany). In Asia, socialism made progress in India, Burma
(now known as Myanmar), and Japan; the Asian socialist Conference was formed as the
Eastern equivalent of the socialist international. The Soviet satellites, the “ people's
democracies” of Eastern Europe, including Poland, Czechoslovakia (now the Czech
republic and Slovakia), Hungary, Bulgaria, and Romania, came under the control of
Communist- socialist parties, but these were dominated in all cases by communists.
China established a communist government, as did Albania and, later, Cuba. Emerging
nations of Africa, Asia, and Latin America frequently adopted social systems that were
largely socialist in orientation. In many instances, these nations took over properties held
by foreign owners. The influence of the socialist party of the U.S., led from to by
Norman Thomas, gradually declined, although much of its economic program became
Using Roget’s Thesaurus to determine the similarity of texts
Jeremy Ellman
law under the New Deal of president Franklin D. Roosevelt. The period following World
War II was also marked by intensification of the conflict between socialists and
communists. socialists approved such measures, initiated in the U.S. and supported by
the governments of Western Europe, as the European recovery Program and the North
Atlantic Treaty organization, declaring that the former would stem the tide of
totalitarian communism by raising living standards and that the latter would achieve the
same end by strengthening Western Europe militarily. communists denounced these
measures as imperialist preparations for war against the USSR. socialist political parties
have suffered occasional setbacks in elections in those countries in which they form half
of the two- party-system system, as in New Zealand in (they had been in power from to
and from to ) and in great Britain in (after five years in power). Nonetheless, extensive
and fundamental parts of the socialist program are permanent features of contemporary
economic and social life.
Contributed by: Robert E. Burke Norman Thomas
“ socialism,” Microsoft(R) Encarta(R) Encyclopedia. (c) - Microsoft corporation. All
rights reserved.
Table II-3 Generic Document Profile for “Socialism” from MS Encarta
Socialism GDP
Roget Categories
ballot in modern usage, a sheet of paper used in voting, usually in an electoral-system
system that allows the voter to make choices secretly. The term5 may also designate the
method and act of voting secretly by means of a mechanical device. Used in elections in
all democratic3 countries, the ballot method protects voters from coercion and reprisal in
the exercise of their vote. Wherever the practice of deciding questions by free vote has
prevailed, some form of secret voting has always been found6 necessary. History of
In ancient Greece, the dicasts ( members of high courts) voted secretly with balls, stones,
or marked shells. Legislation was enacted in Rome in BC establishing a system of secret
voting. Long before the passage of this law, however, questions sometimes were decided
in Rome in public meetings1 by means of the ballot. Colored balls were used as ballots
during the middle-ages Ages. This form has survived to modern-times times, particularly
in clubs or associations in which voting decides the question of admitting or rejecting
proposed new members. Each voter receives two balls, one white, indicating2
acceptance, and the other black, indicating rejection; they are then deposited secretly in
appropriate receptacles so as to indicate a favorable or unfavorable decision. In some
organizations, candidates for admission are rejected if any black balls are found among
the white balls. In modern-times times, the most common form of ballot has been the
written8 or printed ticket Although the ballot had been used previously by the British
parliament to conceal the voting record of its members, in the house of Lords rejected a
proposal7 of the house of Commons providing for secret voting on matters before
parliament. The French Chamber of deputies voted by ballot from to . With the
development of democracy the practice of voting secretly in legislative assemblies
responsible to the people was generally abandoned. Toward the end of the th century,
demands were made in Great Britain that elections to parliament be conducted by secretballot ballot, but the first proposal of this kind was not introduced into parliament until .
The proposal was rejected, but subsequently advocates of Chartism incorporated the
demand in their petitions to parliament. Despite repeated attempts by proponents of the
legislation to secure its enactment, parliament took no effective action until . In that year
the ballot Act was approved providing for secret voting at all parliamentary elections,
except parliamentary elections held at universities, and at all municipal elections.
similar legislation had been previously adopted in France ( ) and Italy ( ). balloting in the
Following the American Revolution, the secret-ballot ballot, used universally during the
period of British colonial rule, was adopted in most of the newly established states.
development of the political-party party system resulted in various abuses of the ballot
system in many states during the first half of the th century, when the law permitted the
printing and distribution of ballots to the voters both by candidates and by political
organizations. This system, which led to confusion and fraud at the polls, produced
widespread public sentiment for ballot reform. In the Massachusetts state legislature
initiated remedial action, adopting legislation that provided for the so- called Australian
ballot in state elections. The principal features of this method, first used in Australia in
and subsequently adopted by every state in the union, are the preparation, printing, and
distribution of the ballot by government4 agencies; the use of a blanket ballot listing the
names and party designations of all candidates for all offices to be filled; and secret
voting under government supervision. Formerly, most of the U.S. used the partycolumn type of blanket ballot, in which the names of candidates are arranged in columns
allocated to their respective political parties. By , however, most states had adopted the
office- column type of listing, in which the names are arranged under the office sought,
either alphabetically or by party, with the party label appearing after the name in either
case. When the party- column ballot is used, the party emblem is often added to the
party column and the party circle. In some places a party emblem is used on the officecolumn ballot, as in New York state. The purpose of the emblem and the party circle is
to make it easier for loyal but ill-informed party voters to vote a straight party ticket In
addition, some states, counties, and cities provide ballots with extra space for write-in
votes for candidates not listed. The preferential ballot, now rarely used, allows voters to
indicate with numerals the order of their preference among the candidates for the same
office. The long ballot, on which candidates for administrative as well as for legislative
office were listed, has gradually been replaced, through the efforts of such reformers as
president Woodrow Wilson, by the short ballot listing names of legislative candidates
only, administrative offices often being filled largely by appointment. Various methods
have been devised for the nomination of candidates to ensure that only the names of
authorized office seekers appear on the ballot. Many states and localities require a
candidate to file a petition before the name can appear on the ballot. The petition must
contain a certain minimum number of signatures of registered voters from a certain
minimum number of counties in the state, or districts in the locality. The validity of the
signatures may then be challenged by other candidates, with final adjudication of
disputes made by the appropriate board of elections, or in some cases by the courts. To
facilitate voting and to reduce the possibility of fraud, a mechanical device operated
either manually or electrically began to be adopted in various parts of the U.S. after ,
when New York state first authorized such use. The list of candidates is arranged on the
face of a voting machine according to the office- column model, either horizontally or
vertically. The voter indicates a preference by placing a pointer next to the name of the
candidate of his or her choice. Space is also provided for write-in votes. Each voting
machine is equipped with curtains, which the voter closes to form a complete, private
polling-booth booth. When the voter has finished voting, he or she pulls a special lever
that opens the curtains, returns the pointers to their original positions, and starts the
mechanical counting devices that record and add up the votes. The use of voting
machines in U.S. elections depends on state legislation. Despite the fact that the
Australian ballot system, or a modification of it, is used throughout the United states,
fraudulent voting, although greatly reduced, still occurs in some communities. This is
accomplished chiefly by “repeating,” an unlawful practice whereby citizens register and
vote at more than one polling place, and by “stuffing,” or putting extra votes into the
ballot-box box. All such frauds are generally accomplished with the connivance of
dishonest election officials, but may be counteracted in some cases by calling for a
recount after votes have been tallied. In some states, where voting machines are used
exclusively, it is claimed that virtually no fraudulence occurs, although efforts are
sometimes made to damage voting machines so as to reduce the number of votes given
to a favored candidate.
“ ballot,” Microsoft(R) Encarta(R) Encyclopedia. (c) - Microsoft Corporation. All rights reserved.
Table II-4: Generic Document Profile for “Ballot” from MS Encarta
Ballot GDP
Roget Categories
artificial-intelligence6 Intelligence or ai, a term that in its broadest1 sense would indicate
the ability of an artefact to perform3 the same kinds of functions that characterize human
thought. The possibility of developing some such artefact has intrigued human beings
since ancient times. With the growth of modern science, the search-for5 for ai has taken
two major directions: psychological and physiological research into the nature of human
thought, and the technological development of increasingly sophisticated computing
systems4 . In the latter sense, the term ai has been applied to computer systems and
programs capable of performing tasks more complex than straightforward programming,
although still far from the realm of actual thought. The most important fields of research
in this area are information processing, pattern recognition, game playing, and applied
fields such as medical diagnosis. current research in information processing deals with
programs that enable a computer to understand written0 or spoken information and to
produce summaries, answer2 specific questions, or redistribute information to users
interested in specific areas of this information. essential to such programs is the ability
of the system to generate grammatically correct sentences and to establish links between
words and ideas. research has shown that whereas the logic of language structure its
syntax submits to programming, the problem of meaning, or semantics, lies far deeper, in
the direction of true ai. In medicine, programs have been developed that analyse the
disease7 symptoms, medical history, and laboratory-test test results of a patient, and then
suggest a diagnosis to the physician. The diagnostic program is an example of a so-called
expert-system system programs designed to perform tasks in specialized areas as a
human would. Expert systems take computers a step beyond straightforward
programming, being based on a technique called rule-based inference, in which
preestablished rule systems are used to process the data. Despite their sophistication,
expert systems still do not approach the complexity of true intelligent thought. Many
scientists remain doubtful that true ai can ever be developed. The operation of the human
mind is still little understood, and computer design may remain essentially incapable of
analogously duplicating those unknown, complex processes. Various routes are being
used in the effort to reach the goal of true ai. One approach is to apply the concept of
parallel processing interlinked and concurrent computer operations. Another is to create
networks of experimental computer chips, called silicon neurons, that mimic data194
processing-processing functions of brain cells. Using analogue technology, the transistors
in these chips emulate nerve-cell membranes in order to operate at the speed of neurons.
Table II-5 Generic Document Profile for “AI” from MS Encarta
Thesaural Category
Rock Music and Its Dances In the s the quietly sensuous movements of the Latin dances
became the provocative hip rolls0 of the singer Elvis Presley, whose first major1 record
was released in . Also in the mid- s, rock and roll became a national phenomenon when
Bill Haley and His Comets were featured in the film rock Around the Clock, and the
television show “American Bandstand” began its broadcasts of dancing teenagers.
American society underwent fundamental upheavals during this period and the following
decade with the civil rights movement, protests against the war in Vietnam, and such
events as the famous music festival at Woodstock, New York, in . In the rock musician
Chubby Checker ushered in the twist2 , performed with gyrating hips and torso and a body
attitude that seemed to express “doing your own thing”. The dances of the s such as the
fish, the hitchhiker, the frug, and the jerk were free and individualistic. People danced en
masse, both sexes with long hair, all dancing by themselves and inventing as they went
along. Several contradictory trends appeared in the s and s. Couple dancing, enhanced by
the individuality of the s, returned in the s with the hustle and other elaborately
choreographed dances performed to disco music, a simple form of rock with strong
dance rhythms. Alongside the disco movement, which dominated the s and s, the more
outrageous punk rock movement brought in its wake slam dancing, which involved
leaping, jumping, and sometimes physical attack, and in the mid- s the acrobatic solo-
dance dance form known as break dancing. The late s and s have seen the development of
rave culture in which people dance very energetically to electronically-based music with
a beat beat.
Table II-6 Generic Document Profile for “Breakdance” from MS Encarta
Breakdance GDP
Roget Categories
Appendix III.
Lexical Chain Visibility Algorithm
The following was designed to improve the visibility of lexical chains, by showing them
embedded in the text from which they are derived. This was described in Section 3-8.
Algorithm III-1: An algorithm to make the lexical chains in a text visible.
Step 1.0: Output the Chains
1.1 For ALL the Chains in the ChainStore
    Derive a unique name “ChainFile” from the text and chain number
    Store “ChainFile” into the Chain
    Write the Chain into the file “ChainFile”
Step 2.0: Serialise Links
2.1 Initialise ALLChains to be NULL
2.2 For ALL the Chains in the ChainStore
    Append the Chain onto ALLChains  // this makes one long flat chain
2.3 SORT the links in ALLChains by link.WordNumber
Step 3.0: Output HTML Text
3.1 Re-open the INPUT file
3.2 Open and initialise the HTML file
3.3 Let WordNumber = 0
3.4 For ALL WORDs in the INPUT
    Increment WordNumber
    If WordNumber is in ALLChains
        Print the WORD to the HTML file using MARKUP
    Else copy the WORD from INPUT to HTML
3.5 Close the HTML file
Step 4.0: Procedure MARKUP (Link)
4.1 If the link type is 0
    Start an Anchor
    Output the word
    Link to “ChainFile”
    Output the Chain Number as a superscript
4.2 Else output the word  // character face derived from the link type,
                          // colour derived from the Chain number
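Algorithm III-1 can be sketched in Python as follows. The Chain and Link structures, the file naming, and the exact markup conventions (an anchor plus superscript chain number for chain heads, emphasis for other chain members) are illustrative assumptions here, not the Hesperus implementation itself.

```python
import html
from dataclasses import dataclass, field

@dataclass
class Link:
    word_number: int   # position of the word in the input text (1-based)
    link_type: int     # 0 = chain head; other values = other link types

@dataclass
class Chain:
    number: int
    links: list = field(default_factory=list)

def render_chains(words, chains):
    """Return the input text as HTML with chained words marked up.

    Words that head a chain (link type 0) become anchors carrying the
    chain number as a superscript; other chained words are emphasised.
    """
    # Step 2: flatten all chains into one lookup keyed by word position
    by_position = {}
    for chain in chains:
        for link in chain.links:
            by_position[link.word_number] = (chain, link)

    # Step 3: walk the input, marking up words that occur in a chain
    out = []
    for word_number, word in enumerate(words, start=1):
        hit = by_position.get(word_number)
        if hit is None:
            out.append(html.escape(word))
        else:
            chain, link = hit
            if link.link_type == 0:  # Step 4.1: chain head
                out.append('<a href="chain%d.html">%s</a><sup>%d</sup>'
                           % (chain.number, html.escape(word), chain.number))
            else:                    # Step 4.2: chain member
                out.append('<em>%s</em>' % html.escape(word))
    return ' '.join(out)

words = "the ballot protects voters from coercion".split()
chains = [Chain(1, [Link(2, 0), Link(4, 1)])]
print(render_chains(words, chains))
```

Running the example marks “ballot” as the head of chain 1 and “voters” as a member of the same chain, leaving the other words untouched.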
Appendix IV.
Basic Statistics of the Experimental Texts
Rosetta Stone
index locatio stone
rosetta rosettastone
Input Words:
Not in Thesaurus:
Table IV-1
Socialis Info_Socialis kimsk_ Welcome NOMARX Welcome
Input Words:
Not in Thesaurus:
Artificial Intelligence
Table IV-2
Aidef Welcome Welcome web-
Input Words:
130 1837
Not in Thesaurus:
Table IV-3
highline dance Welcome2 index frenz-e
Input Words:
Disambiguation Attempts:
Not in Thesaurus:
Disambiguations Done:
Appendix V.
Experimental Subjects
This appendix gives the explanatory “help” information given to the experimental subjects.
Text Similarity Experiment Instructions
The Text Similarity Experiment is made up of a frame split
into a heading and three further panes. The heading pane
contains a query (“Ballot”) that has been used on the
Internet. The left text pane below contains a sample text on
the subject of the query, found in Microsoft's Encarta.
The bottom pane contains a number of statements. There
are five radio buttons next to each question. If you agree
with the statement, select the leftmost button. If you
completely disagree, select the rightmost button. Select the
middle button if the statement is neither completely true
nor completely false.
Once the initial questions have been asked, we move on to
the text comparison part of the experiment. Some six
(random) texts retrieved from the Internet are shown in the
right-hand pane. Different questions are shown in the bottom
pane. Answer these questions as before.
If you are using a smaller screen, you may not see
everything at once; scroll across after you have answered the questions.
The experiment is completely anonymous. The results are
identified by IP number and time of submission. The results
are most useful if you complete all the experiment.
Appendix VI.
Roget’s Thesaurus – A Brief Overview
Introduction and purpose
The purpose of this appendix is to describe Roget's thesaurus briefly for those who are not
familiar with it. This will be done by giving an overview of the structure and organisation
of the thesaurus, and by reporting some basic data about it.
What is Roget's thesaurus?
Roget's thesaurus is a tool designed to help writers find words. That is, a writer might
have in mind a word to express a certain idea, but suspect that there exists a more
precise word or phrase that better captures the nuance of meaning he or she wishes
to articulate.
Roget's thesaurus is made up of an index and the thesaurus entries. To find alternatives to
the word he or she has in mind, the writer looks it up in the index, which refers to one or
more numbered entries in the thesaurus. To find the precise word needed, the user then
reads the thesaurus entries to which the index refers.
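The two-stage lookup can be sketched as follows, using a toy fragment of index and entries; the data here are illustrative placeholders, not the real thesaurus contents.

```python
# Toy model of Roget's two-stage lookup: index -> entry numbers -> entries.
index = {
    "existence": [1, 3],
    "being": [1],
}
entries = {
    1: ("existence (noun)", ["existence", "being", "entity", "reality"]),
    3: ("substantiality (noun)", ["substantiality", "substance", "body"]),
}

def look_up(word):
    """Return the (heading, word list) pairs the index points to for a word."""
    return [entries[n] for n in index.get(word, [])]

for heading, words in look_up("existence"):
    print(heading, "->", ", ".join(words))
```

The user's task then corresponds to scanning the returned entries and choosing the word that best fits the intended nuance.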
Each entry in Roget's thesaurus is made up of words and phrases that are related in
meaning to each other. The title, or heading, of the entry gives a clue as to what the words
and phrases have in common, but no precise definition is given. Furthermore, there is no
fixed relationship between the words in an entry: some are synonyms, others antonyms,
meronyms, or unspecified types of association.
A writer selects a better word for his or her context based on an understanding of the
words' meanings. Roget's thesaurus does not attempt to give definitions of word meaning
as a dictionary does. Dutch’s (1962) introduction to Roget's thesaurus gives more
information on the background to the thesaurus, and advice on how it should be used.
Organisation of the Thesaurus.
Roget's thesaurus is notable for the organisation of its entries, in addition to their contents.
These have been arranged into a semantic hierarchy: a nested structure of headings
and sub-headings up to five levels of subdivision deep. The major headings are
shown in Figure VI-1 below.
Figure VI-1:Major Headings in Roget's Thesaurus
These category headings are further subdivided, as shown in Figures VI-2 to VI-7 below.
A boxed minus preceding a heading indicates that it is expanded by the indented
subheadings following it. A boxed inverted cross indicates that a heading or subheading
may be expanded further. An empty grey box indicates that a heading is fully expanded,
and refers to an actual entry in the thesaurus.
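This nested organisation can be modelled as a tree whose leaves are the thesaural entries. A minimal sketch follows; the subdivisions and entry numbers below the class level are placeholders, not the real 1987 contents.

```python
# Illustrative fragment of the hierarchy: classes contain nested
# subdivisions; leaves map an entry title to its (placeholder) number.
hierarchy = {
    "Abstract Relations": {
        "Existence": {
            "existence (noun)": 1,
            "existence (verb)": 2,
        },
    },
    "Space": {},
}

def entry_titles(tree):
    """Yield the leaf entry titles of the nested hierarchy."""
    for key, value in tree.items():
        if isinstance(value, dict):
            yield from entry_titles(value)
        else:
            yield key

print(sorted(entry_titles(hierarchy)))
```

Walking the tree in this way recovers exactly the entry titles that carry a part of speech as a suffix, as described for the “Existence” category below.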
Figure VI-2 Sub divisions of “Abstract Relations”
Figure VI-4 Sub divisions of “Matter”
Figure VI-3 Sub divisions of “Space”
Figure VI-5 Sub divisions of “Emotion”
Figure VI-6 Sub divisions of “Volition”
Figure VI-7: Sub divisions relating to "Existence"
Figure VI-7 shows the category “Existence” fully expanded. Each of the terms it
contains that has a part of speech as a suffix is the heading, or title, of an entry in the
thesaurus. Figure VI-8 shows the actual entry for existence (noun).
Figure VI-8 : An extract from Roget's thesaurus
Space precludes fully expanding the thesaural hierarchy, as the 1987 edition contains
6400 entries. Earlier editions of the thesaurus, which used approximately one thousand
entries, commonly included a table showing the ordering and arrangement of all the
categories (e.g. see Dutch 1962). There the “Existence” main category of Figure VI-7
would contain only four thesaural entries (Noun, Verb, Adjective, and Adverb), as
opposed to a sub-category and then the entry titles seen here. Thus, the 1987 version of
the thesaurus adds a further level to the hierarchy, but is otherwise virtually identical in
structure. Note, though, that the entries remain approximately the same size, since the
vocabulary has been expanded. We now go on to look at some data about this vocabulary.
Basic Data about Roget’s thesaurus
This section gives some basic data about Roget’s thesaurus. Its purpose is to support a
number of design decisions made in Hesperus. The data reported include the size of the
thesaurus in terms of the total number of words it contains, and the number of unique
terms, where a term may be a word or a collocation.
The data were collected using simple word-frequency information gathered over the
whole 1987 thesaurus, supplemented by some purpose-written Perl programs.
Words in the 1987 edition of Roget’s Thesaurus
Words and phrases (including duplicates):
Unique Words and phrases (eliminating duplicates)
Thus, it appears that each word in the thesaurus appears approximately 2.9 times. Since
each occurrence is in a different category, this implies that each word has approximately
2.9 senses. This figure is misleading, however, since the distribution of word senses is
highly skewed. This is shown in Graph VI-1 below.
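The per-word sense counts underlying these figures can be computed directly from entry membership, since a word's number of senses is simply the number of distinct entries it appears in. A minimal sketch, using invented toy entries rather than the real thesaurus data:

```python
from collections import Counter

# Toy entry -> word-list mapping; the real 1987 thesaurus has 6400 entries.
entries = {
    1: ["existence", "being", "cut"],
    2: ["nonexistence", "cut"],
    3: ["sharpness", "cut"],
}

# A word's sense count is the number of distinct entries it appears in.
senses = Counter()
for word_list in entries.values():
    for w in set(word_list):
        senses[w] += 1

monosemous = sum(1 for n in senses.values() if n == 1)
print(senses["cut"])  # "cut" has three senses in this toy data
print(monosemous, len(senses))
```

Applied to the full thesaurus, the same counting yields the skewed distribution reported above: most words are monosemous, while a few, such as “cut”, occur in many entries.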
Graph VI-1: Distribution of Polysemic Words in Roget’s Thesaurus (word senses vs. number of words)
Graph VI-1 shows that of the 98,357 unique words and phrases in Roget’s thesaurus,
60,208 (i.e. 61.2%) have only one meaning, since they are found in only one entry in the
thesaurus. The remaining words have a variable number of meanings; the word found
most frequently is “cut”, which appears in 73 thesaural entries (excluding its collocates).
Collocations are sequences of two or more words that are found together and have
acquired a lexical identity separate from their component words. Roget’s thesaurus is rich
in collocations of various lengths.
Graph VI-2 below shows the lengths of the collocations found in Roget’s thesaurus and
their frequencies of occurrence. From this we can see that only 8,598 collocations are
longer than two words, which is 8.74% of the unique words and phrases. Their
comparative rarity supports the implementation decision (Section 5.5) to limit the search
for collocations to word pairs only.
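Restricting the collocation search to word pairs means a single token of lookahead suffices when scanning a text. A minimal sketch of such a scan; the collocation set here is invented, not drawn from the thesaurus:

```python
# Toy collocation set; in Hesperus the pairs come from the 1987 thesaurus.
collocations = {("secret", "ballot"), ("voting", "machine")}

def tokenise_with_collocations(tokens):
    """Greedily merge adjacent word pairs that form a known collocation."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            out.append(" ".join(pair))  # treat the pair as one lexical unit
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(tokenise_with_collocations("a secret ballot uses a voting machine".split()))
```

With longer collocations permitted, the scanner would need arbitrary lookahead and backtracking; the 8.74% figure suggests the extra machinery would rarely pay off.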
Graph VI-2: Frequency of Collocations in Roget’s Thesaurus (collocation length vs. frequency of occurrence)
This appendix has briefly described the hierarchical structure of Roget’s thesaurus, and
shown an example entry. Basic statistical data about the thesaurus have also been
presented.
Appendix VII. Papers published related to this thesis.
In accordance with the PhD regulations of the University of Sunderland, this appendix
includes refereed papers published as part of the work of this thesis. These were:
1. “On the Generality of Thesaurally Derived Lexical Links”, Jeremy Ellman and John
   Tait, in Actes des 5es Journées Internationales d'Analyse Statistique des Données
   Textuelles (JADT 2000), March 2000, pp. 147-154, Ecole Polytechnique Fédérale de
   Lausanne.
2. “Word Sense Disambiguation by Information Filtering and Extraction”, Jeremy
   Ellman, Ian Klincke and John Tait, in Computers and the Humanities, vol. 34, no. 1-2,
   2000, Special Issue on “Senseval: Evaluating Word Sense Disambiguation Programs”,
   guest editors Adam Kilgarriff and Martha Palmer.
3. “Roget's Thesaurus: An Additional Knowledge Source for Textual CBR?”, Jeremy
   Ellman and John Tait, in Research and Development in Intelligent Systems XVI: Proc.
   19th SGES Intl. Conf. on Knowledge Based and Applied Artificial Intelligence,
   Bramer M., Macintosh A., and Coenen F. (eds), ISBN 1-85233-231-X, pp. 204-217.
4. “SUSS: The Sunderland University Similarity System: ‘Beneath the Glass Ceiling’”,
   Jeremy Ellman, Ian Klincke and John Tait, in Proc. SENSEVAL Workshop,
   University of Brighton, 1998.
5. “Using the Generic Document Profile to Cluster Similar Texts”, Jeremy Ellman, in
   Proc. Computational Linguistics UK (CLUK 97), Jan. 1998, University of Sunderland.
6. “Using Information Density to Navigate the Web”, Jeremy Ellman and John Tait,
   IEE Colloquium on Intelligent World Wide Web Agents, UK ISSN 0963-3308, March.