Proceedings of the Eighth Global WordNet Conference

Editors:
Verginica Barbu Mititelu, Corina Forăscu, Christiane Fellbaum, Piek Vossen
Bucharest, Romania, January 27-30, 2016
ISBN 978-973-0-20728-6
Preface
This eighth meeting of the international Wordnet community coincides with the 15th
anniversary of the Global WordNet Association and the 30th anniversary of the Princeton
WordNet. We are delighted to welcome old and new colleagues from many countries and four
continents who construct wordnets, ontologies and related tools, as well as colleagues who
apply such resources in a wide range of Natural Language Applications or pursue research in
lexical semantics.
The number of wordnets has risen to over 150 and includes – besides all the major world
languages – many less-studied languages such as Albanian and Nepali. Wordnets have
become a principal tool in computational linguistics and NLP, and wordnet, SemCor and
synset have entered the language as common nouns. Coming together and sharing some of the
results of our work is an important part of the larger collaborative effort to better understand
both universal and particular properties of human languages.
Many people have donated their time and effort to make this meeting possible: the review
committee, the local organizers and their helpers (Eric Curea, Maria Mitrofan, Elena Irimia),
our sponsors (PIM, QATAR Airways, Oxford University Press), EasyChair and our host, the
Romanian Academy. Above all, thanks go to you, the contributors, for traveling to Bucharest
to present your work, listen and discuss.
Bucharest, January 2016
Christiane Fellbaum, Corina Forăscu, Verginica Mititelu, Piek Vossen
Table of contents
Preface
i
Program and Organising Committees
vi
The Awful German Language: How to cope with the Semantics of Nominal
Compounds in GermaNet and in Natural Language Processing
Erhard Hinrichs
1
Adverbs in Sanskrit Wordnet
Tanuja Ajotikar and Malhar Kulkarni
2
Word Sense Disambiguation in Monolingual Dictionaries for Building Russian
WordNet
Daniil Alexeyevsky and Anastasiya V. Temchenko
10
Playing Alias - efficiency for wordnet(s)
Sven Aller, Heili Orav, Kadri Vare and Sirli Zupping
16
Detecting Most Frequent Sense using Word Embeddings and BabelNet
Harpreet Singh Arora, Sudha Bhingardive and Pushpak Bhattacharyya
22
Problems and Procedures to Make Wordnet Data (Retro)Fit for a Multilingual
Dictionary
Martin Benjamin
27
Ancient Greek WordNet Meets the Dynamic Lexicon: the Example of the Fragments of
the Greek Historians
Monica Berti, Yuri Bizzoni, Federico Boschetti, Gregory R. Crane, Riccardo Del
Gratta and Tariq Yousef
34
IndoWordNet::Similarity- Computing Semantic Similarity and Relatedness using
IndoWordNet
Sudha Bhingardive, Hanumant Redkar, Prateek Sappadla, Dhirendra Singh and
Pushpak Bhattacharyya
39
Multilingual Sense Intersection in a Parallel Corpus with Diverse Language Families
Giulia Bonansinga and Francis Bond
44
CILI: the Collaborative Interlingual Index
Francis Bond, Piek Vossen, John McCrae and Christiane Fellbaum
50
YARN: Spinning-in-Progress
Pavel Braslavski, Dmitry Ustalov, Mikhail Mukhin and Yuri Kiselev
58
Word Substitution in Short Answer Extraction: A WordNet-based Approach
Qingqing Cai, James Gung, Maochen Guan, Gerald Kurlandski and Adam Pease
66
An overview of Portuguese WordNets
Valeria de Paiva, Livy Real, Hugo Gonçalo Oliveira, Alexandre Rademaker,
Cláudia Freitas and Alberto Simões
74
Towards a WordNet based Classification of Actors in Folktales
Thierry Declerck, Tyler Klement and Antonia Kostova
82
Extraction and description of multi-word lexical units in plWordNet 3.0
Agnieszka Dziob and Michał Wendelberger
87
Establishing Morpho-semantic Relations in FarsNet (a focus on derived nouns)
Nasim Fakoornia and Negar Davari Ardakani
92
Using WordNet to Build Lexical Sets for Italian Verbs
Anna Feltracco, Lorenzo Gatti, Elisabetta Jezek, Bernardo Magnini and Simone
Magnolini
100
A Taxonomic Classification of WordNet Polysemy Types
Abed Alhakim Freihat, Fausto Giunchiglia and Biswanath Dutta
105
Some strategies for the improvement of a Spanish WordNet
Matias Herrera, Javier Gonzalez, Luis Chiruzzo and Dina Wonsever
114
An Analysis of WordNet’s Coverage of Gender Identity Using Twitter and The
National Transgender Discrimination Survey
Amanda Hicks, Michael Rutherford, Christiane Fellbaum and Jiang Bian
122
Where Bears Have the Eyes of Currant: Towards a Mansi WordNet
Csilla Horváth, Ágoston Nagy, Norbert Szilágyi and Veronika Vincze
130
WNSpell: a WordNet-Based Spell Corrector
Bill Huang
135
Sophisticated Lexical Databases - Simplified Usage: Mobile Applications and Browser
Plugins For Wordnets
Diptesh Kanojia, Raj Dabre and Pushpak Bhattacharyya
143
A picture is worth a thousand words: Using OpenClipArt library for enriching
IndoWordNet
Diptesh Kanojia, Shehzaad Dhuliawala and Pushpak Bhattacharyya
149
Using Wordnet to Improve Reordering in Hierarchical Phrase-Based Statistical
Machine Translation
Arefeh Kazemi, Antonio Toral and Andy Way
154
Eliminating Fuzzy Duplicates in Crowdsourced Lexical Resources
Yuri Kiselev, Dmitry Ustalov and Sergey Porshnev
161
Automatic Prediction of Morphosemantic Relations
Svetla Koeva, Svetlozara Leseva, Ivelina Stoyanova, Tsvetana Dimitrova and Maria
Todorova
168
Tuning Hierarchies in Princeton WordNet
Ahti Lohk, Christiane Fellbaum and Leo Vohandu
177
Experiences of Lexicographers and Computer Scientists in Validating Estonian
Wordnet with Test Patterns
Ahti Lohk, Heili Orav, Kadri Vare and Leo Vohandu
184
African WordNet: A Viable Tool for Sense Discrimination in the Indigenous African
Languages of South Africa
Stanley Madonsela, Mampaka Lydia Mojapelo, Rose Masubelele and James Mafela
192
An empirically grounded expansion of the supersense inventory
Hector Martinez Alonso, Anders Johannsen, Sanni Nimb, Sussi Olsen and Bolette
Pedersen
199
Adverbs in plWordNet: Theory and Implementation
Marek Maziarz, Stan Szpakowicz and Michal Kalinski
209
A Language-independent Model for Introducing a New Semantic Relation Between
Adjectives and Nouns in a WordNet
Miljana Mladenović, Jelena Mitrović and Cvetana Krstev
218
Identifying and Exploiting Definitions in Wordnet Bahasa
David Moeljadi and Francis Bond
226
Semantics of body parts in African WordNet: a case of Northern Sotho
Mampaka Lydia Mojapelo
233
WME: Sense, Polarity and Affinity based Concept Resource for Medical Events
Anupam Mondal, Dipankar Das, Erik Cambria and Sivaji Bandyopadhyay
242
Mapping and Generating Classifiers using an Open Chinese Ontology
Luis Morgado Da Costa, Francis Bond and Helena Gao
247
IndoWordNet Conversion to Web Ontology Language (OWL)
Apurva Nagvenkar, Jyoti Pawar and Pushpak Bhattacharyya
255
A Two-Phase Approach for Building Vietnamese WordNet
Thai Phuong Nguyen, Van-Lam Pham, Hoang-An Nguyen, Huy-Hien Vu, Ngoc-Anh
Tran and Thi-Thu-Ha Truong
259
Extending the WN-Toolkit: dealing with polysemous words in the dictionary-based
strategy
Antoni Oliver
265
A language-independent LESK based approach to Word Sense Disambiguation
Tommaso Petrolito
273
plWordNet in Word Sense Disambiguation task
Maciej Piasecki, Paweł Kędzia and Marlena Orlińska
280
plWordNet 3.0 -- Almost There
Maciej Piasecki, Stan Szpakowicz, Marek Maziarz and Ewa Rudnicka
290
Open Dutch WordNet
Marten Postma, Emiel van Miltenburg, Roxane Segers, Anneleen Schoen and Piek
Vossen
300
Verifying Integrity Constraints of a RDF-based WordNet
Alexandre Rademaker and Fabricio Chalub
309
DEBVisDic: Instant Wordnet Building
Adam Rambousek and Ales Horak
317
Samāsa-Kartā: An Online Tool for Producing Compound Words using IndoWordNet
Hanumant Redkar, Nilesh Joshi, Sandhya Singh, Irawati Kulkarni, Malhar Kulkarni
and Pushpak Bhattacharyya
322
Arabic WordNet: New Content and New Applications
Yasser Regragui, Lahsen Abouenour, Fettoum Krieche, Karim Bouzoubaa and Paolo Rosso
330
Hydra for Web: A Browser for Easy Access to Wordnets
Borislav Rizov and Tsvetana Dimitrova
339
Towards a methodology for filtering out gaps and mismatches across wordnets: the
case of plWordNet and Princeton WordNet
Ewa Rudnicka, Wojciech Witkowski and Łukasz Grabowski
344
Folktale similarity based on ontological abstraction
Marijn Schraagen
352
The Predicate Matrix and the Event and Implied Situation Ontology: Making More of
Events
Roxane Segers, Egoitz Laparra, Marco Rospocher, Piek Vossen, German Rigau and
Filip Ilievski
360
Semi-Automatic Mapping of WordNet to Basic Formal Ontology
Selja Seppälä, Amanda Hicks and Alan Ruttenberg
369
Augmenting FarsNet with New Relations and Structures for verbs
Mehrnoush Shamsfard and Yasaman Ghazanfari
377
High, Medium or Low? Detecting Intensity Variation Among polar synonyms in
WordNet
Raksha Sharma and Pushpak Bhattacharyya
384
The Role of the WordNet Relations in the Knowledge-based Word Sense
Disambiguation Task
Kiril Simov, Alexander Popov and Petya Osenova
391
Detection of Compound Nouns and Light Verb Constructions using IndoWordNet
Dhirendra Singh, Sudha Bhingardive and Pushpak Bhattacharyya
399
Mapping it differently: A solution to the linking challenges
Meghna Singh, Rajita Shukla, Jaya Saraswati, Laxmi Kashyap, Diptesh Kanojia and
Pushpak Bhattacharyya
406
WordNet-based similarity metrics for adjectives
Emiel van Miltenburg
414
Toward a truly multilingual Global Wordnet Grid
Piek Vossen, Francis Bond and John McCrae
419
This Table is Different: A WordNet-Based Approach to Identifying References to
Document Entities
Shomir Wilson, Alan Black and Jon Oberlander
427
WordNet and beyond: the case of lexical access
Michael Zock and Didier Schwab
436
Author index
445
Program Committee
Eneko Agirre, University of the Basque Country, Spain
Eduard Barbu, Translated.net, Italy
Francis Bond, Nanyang Technological University, Singapore
Sonja Bosch, University of South Africa, South Africa
Alexandru Ceaușu, Euroscript Luxembourg S.à r.l., Luxembourg
Dan Cristea, Alexandru Ioan Cuza University of Iași, Romania
Agata Cybulska, VU University Amsterdam / Oracle Corporation, the Netherlands
Tsvetana Dimitrova, Institute for Bulgarian Language, Bulgaria
Marieke van Erp, VU University Amsterdam, the Netherlands
Christiane Fellbaum, Princeton University, USA
Darja Fiser, University of Ljubljana, Slovenia
Antske Fokkens, VU University Amsterdam, the Netherlands
Corina Forăscu, Alexandru Ioan Cuza University of Iași & RACAI, Romania
Ales Horak, Masaryk University, Czech Republic
Florentina Hristea, University of Bucharest, Romania
Shu-Kai Hsieh, Graduate Institute of Linguistics at National Taiwan University, Taiwan
Radu Ion, Microsoft, Ireland
Hitoshi Isahara, Toyohashi University of Technology, Japan
Ruben Izquierdo Bevia, VU University Amsterdam, the Netherlands
Kaarel Kaljurand, Nuance Communications, Austria
Kyoko Kanzaki, Toyohashi University of Technology, Japan
Svetla Koeva, Institute for Bulgarian Language, Bulgaria
Cvetana Krstev, University of Belgrade, Serbia
Margit Langemets, Institute of the Estonian Language, Estonia
Bernardo Magnini, Fondazione Bruno Kessler, Italy
Verginica Mititelu, RACAI, Romania
Sanni Nimb, Society for Danish Language and Literature, Denmark
Kemal Oflazer, Carnegie Mellon University, Qatar
Heili Orav, University of Tartu, Estonia
Karel Pala, Masaryk University, Czech Republic
Adam Pease, IPsoft, USA
Bolette Pedersen, University of Copenhagen, Denmark
Ted Pedersen, University of Minnesota, USA
Maciej Piasecki, Wroclaw University of Technology, Poland
Alexandre Rademaker, IBM Research, FGV/EMAp Brazil
German Rigau, University of the Basque Country, Spain
Horacio Rodriguez, Universitat Politecnica de Catalunya, Spain
Shikhar Kr. Sarma, Gauhati University, India
Roxane Segers, VU University Amsterdam, the Netherlands
Virach Sornlertlamvanich, SIIT, Thammasat University, Thailand
Dan Ștefănescu, Vantage Labs, USA
Dan Tufiș, RACAI, Romania
Gloria Vasquez, Lleida University, Spain
Zygmunt Vetulani, Adam Mickiewicz University, Poland
Piek Vossen, VU University Amsterdam, the Netherlands
Additional Reviewers
Anna Feltracco
Filip Ilievski
Vojtech Kovar
Egoitz Laparra
Simone Magnolini
Zuzana Neverilova
Adam Rambousek
Conference Chairs
Christiane Fellbaum, Princeton University, USA
Piek Vossen, VU University Amsterdam, the Netherlands
Organising Committee
Verginica Mititelu, RACAI, Romania
Corina Forăscu, Alexandru Ioan Cuza University of Iași & RACAI, Romania
The Awful German Language: How to cope with the Semantics of
Nominal Compounds in GermaNet and in
Natural Language Processing
Erhard Hinrichs
University of Tübingen
Tübingen, Germany
[email protected]
Abstract
The title for my presentation borrows from Mark Twain's well-known 1880 essay "The Awful German
Language", where Twain cites pervasive nominal compounding in German as one of the pieces of evidence for the "awfulness" of the language. Two much cited examples of noun compounds that are included in the Duden dictionary of German are Kraftfahrzeughaftpflichtversicherung (‘motor car liability insurance’) and Donaudampfschifffahrtsgesellschaft (‘Danube steamboat shipping company’). Any dictionary of German, including the German word net GermaNet, has to offer an account of such compound
words. Currently, GermaNet contains more than 55,000 nominal compounds. As the coverage of nouns in
GermaNet is extended, new noun entries are almost always compounds.
In this talk I will present an account of how to model nominal compounds in GermaNet, with particular focus on the semantic relations that hold between the constituents of a compound, e.g., the WHOLE-PART relation in the case of Roboterarm ('robot arm') or the LOCATION relation in the case of Berghütte ('mountain hut'). This account, developed jointly with Reinhild Barkey, Corina Dima, Verena Henrich, Christina Hoppermann, and Heike Telljohann, borrows heavily from previous research on semantic relations in theoretical linguistics, psycholinguistics, and computational linguistics.
The second part of the talk will focus on using the semantic modelling of nominal compounds in a word
net for the automatic classification of semantic relations for (novel) compound words. Here, I will present
the results of recent collaborative work with Corina Dima and Daniil Sorokin, using machine learning
techniques such as support vector machines as well as deep neural network classifiers and a variety of
publicly available word-embeddings, which have been developed in the framework of distributional semantics.
Adverbs in the Sanskrit Wordnet
Tanuja P. Ajotikar
Dept. of South Asian Studies, Harvard University, Cambridge, MA
The Sanskrit Library
[email protected]

Malhar Kulkarni
Indian Institute of Technology Mumbai, Powai, Mumbai, India
[email protected]
Abstract
The wordnet contains part-of-speech categories such as noun, verb, adjective and adverb. In Sanskrit, there is no formal distinction among nouns, adjectives and adverbs. This poses the question: is an adverb a separate category in Sanskrit? If not, then how do we accommodate it in a lexical resource? To investigate the issue, we attempt to study the complex nature of adverbs in Sanskrit and the policies adopted by Sanskrit lexicographers that would guide us in storing them in the Sanskrit wordnet.
1 Introduction
An adverb is an open-class lexical category that modifies the meaning of verbs, adjectives (including numbers) and other adverbs, but not nouns.1 It can also modify a phrase or a clause. The category of adverb indicates (a) manner, (b) time, (c) place and (d) cause, and answers the questions how, where, when and how much.
Fellbaum (1998, p. 61) describes adverbs as a heterogeneous group that includes not only adverbs derived from adjectives but also phrases used adverbially. Some of these phrases are included in WordNet; they are mainly frozen phrases that are used widely.
In this paper we discuss those adverbs which modify verbs, and how modern Sanskrit lexicography deals with them. Kulkarni et al. (2011) briefly discussed the issues regarding adverbs in the Sanskrit wordnet. Here we focus primarily on how modern Sanskrit lexicographers have dealt with such adverbs; the study of their methodology can guide us in forming a policy for representing adverbs in the Sanskrit wordnet.
1 http://www.odlt.org
2 Adverbs in Sanskrit
The Sanskrit grammatical tradition does not divide words into many categories. It divides words into two divisions: words that take nominal affixes and words that take verbal affixes. The words in the second division are verbs. Those in the first division are nouns, adjectives, adverbs, particles, etc., i.e., non-verbs. This is because, unlike languages such as English, Sanskrit does not have distinct forms for each part of speech. One cannot categorize a word merely by looking at its form. This is why there is no formal category for adjective or adverb in traditional Sanskrit grammar. There is no equivalent term in Sanskrit for adjective or adverb in the modern sense (see Joshi (1967), Gombrich (1979)). Sanskrit can nevertheless be analyzed in terms of word classes other than noun and verb. Bhat (1991) observes that adjectives in Sanskrit form a sub-group of nouns. Likewise, adverbs, except indeclinables, form a subgroup of nouns. Attempts were first made in the 19th century to describe Sanskrit using various word classes. Monier-Williams (1846), Wilson (1841), Speijer (1886), Whitney (1879) and Macdonell (1927) discuss adverbs in Sanskrit.2 A summary of the description of adverbs given by these scholars is as follows:
• The non-derived words listed by traditional grammar and termed ‘indeclinable’ are used as adverbs, e.g., uccaiḥ ‘high,’ nīcaiḥ ‘below,’ ārāt ‘distant,’ etc.
• Compounds, like avyayībhāva, are used as adverbs, e.g., yathāśakti ‘according to power or ability.’3 Some of the bahuvrīhi compounds are also used as adverbs, e.g., keśākeśi ‘hair to hair’ (i.e., head to head).4
• Words formed by adding certain affixes, such as śas, dhā, etc., are used as adverbs. The affix śas is added after a nominal base or a number word in the sense of vīpsā ‘repetition.’ Words like śataśaḥ ‘hundred times’ are formed by adding this affix. The affix dhā is added after a number word in the sense of vidhā ‘division or part.’ Words like dvidhā ‘twofold’ or tridhā ‘threefold’ are formed by adding this affix. Words formed by adding certain affixes after a nominal base are considered indeclinable by the traditional grammarians.
• The accusative, instrumental, ablative and locative cases of a noun or an adjective are used as adverbs, e.g., mandam ‘slowly,’ vegena ‘hastily,’ javāt ‘speedily,’ sannidhau ‘near.’

2 We refer to these works because Macdonell, Wilson and Monier-Williams compiled bilingual dictionaries. We refer to their works to study how far they follow their description in their dictionaries.
3 In the sentence yathāśakti dātavyam ‘you may give according to your ability,’ the compound yathāśakti modifies the action. Hence, it is an adverb.
This summary shows that we can classify adverbs in Sanskrit into three main groups: words that are unanalyzable into parts, such as a base and an affix; words that are formed by secondary derivation, such as by adding an affix or forming a compound; and words that have an adverbial sense but belong to a class of words which are not adverbs, for example, the accusative or instrumental case of a noun or adjective. A morphological analysis of these words would categorize them under nouns, because they are formed by adding the same affixes that are added after a noun, even though their function differs. In other words, qualifying a verb or an adjective in Sanskrit does not require a distinct morphological form. The difficulty in dealing with adverbs in Sanskrit arises only if we have a form-based idea of word classes: one cannot judge a word's category simply by looking at its form. The adverb is a functional category in Sanskrit, not a formal one. Hence, adverbs pose a problem in Sanskrit lexicography because they are functional and lack a distinguishing form.
4 In the sentence te keśākeśi yuddhyante ‘they battled hair to hair,’ the compound keśākeśi also modifies the action, so it is an adverb.
2.1 The importance of part-of-speech categories in lexical entries
The nature of adverbs in Sanskrit is complex, so the exact relationship between a part-of-speech category and a dictionary is a matter of discussion. Lexemes do not occur in isolation; they form part of a phrase or sentence, and they determine the syntactic structure of sentences. Each and every lexeme plays a certain role in a sentence. The morphological and syntactic behavior of a lexeme determines its class. This class is designated as a part-of-speech category; it is also called a word class, lexical class or lexical category. Noun, verb, adjective and adverb are major word classes. Thus, a lexicon, which is an inventory of lexemes, records these major word classes to denote the morphological and syntactic behavior of the lexemes listed in it. The morphological and syntactic behavior of a language decides what kind of information a lexicon should contain.
In Sanskrit, where there is no formal distinction between adverb and noun (with the exception of indeclinables), the following question arises: should an adverb be a separate category in a Sanskrit lexicon? To answer this question, it is instructive to study the policy adopted in the available lexical resources of Sanskrit, which range from 1819 C.E. to 1981 C.E. The examples below were given by Gombrich (1979):
• atra ‘here’
• ciram ‘for a long time’
• javena ‘speedily’
• tūṣṇīm ‘silently’
• vividhaprakāram ‘variedly’
• śīghram ‘quickly’
Gombrich observes that the first, second and fourth examples are found in the traditional grammar; the rest are not recognized there as adverbs. His article is important because he thoroughly discusses the position of traditional Sanskrit grammarians on adverbs and gives a historical account of the concept of adverb. He points out that words that function as adverbs are not grammatically analyzed; instead, they are simply listed by traditional grammarians. There is no process of deriving adverbs from adjectives. Hence, ciram, cirāt, cirasya ‘for a long time,’5 which might be derived from the same word, are listed separately; their status is independent. This forms the basis for entering these words in a lexicon as separate lexemes.
2.2 Adverbs in the list above and the treatment they receive in dictionaries
We consulted eighteen dictionaries of Sanskrit to study the treatment given to the above-mentioned adverbs. Two of these eighteen dictionaries are monolingual and the rest are bilingual. Among the bilingual dictionaries, two (Goldstücker (1856) and Ghatge (1981)) are not complete. These eighteen dictionaries are listed chronologically below:
• Radhakantadeva, (Monolingual), 1819–1858.
• Wilson H. H., Sanskrit–English, 1832.
• Yates W., Sanskrit–English, 1846.
• Bopp F., Sanskrit–French, 1847.
• Böhtlingk, O. and Roth R., Sanskrit–German, 1855–1875.
• Goldstücker T., Sanskrit–English, 1856.
• Benfey, T., Sanskrit–English, 1866.
• Burnouf É., Sanskrit–French, 1866.
• Böhtlingk, O., Sanskrit–German, 1879–1889.
• Monier-Williams M., Sanskrit–English, 1872.
• Bhattacharya T., (Monolingual), 1873.
• Cappeller, C., Sanskrit–German, 1887.
• Apte V. S., Sanskrit–English, 1890.
• Cappeller, C., Sanskrit–English, 1891.
• Macdonell A. A., Sanskrit–English, 1893.
• Monier-Williams M., Leumann, and Cappeller, Sanskrit–English, 1899.
• Stchoupak, N., Nitti, L. and Renou L., Sanskrit–French, 1932.
• Ghatge, A. M., Sanskrit–English (encyclopedic dictionary on historical principles), 1981.

5 These forms resemble the accusative singular, ablative singular and genitive singular, respectively, of a nominal base which ends in short a.
Let us analyze how the above-listed adverbs are
treated in these Sanskrit dictionaries.
2.2.1 atra
Atra, which means ‘here,’ is an indeclinable according to the traditional Sanskrit grammarians, whereas its treatment in dictionaries varies. It is derived from the pronoun etad ‘this’ by adding the affix tral. It is termed indeclinable by the rule taddhitaścāsarvavibhaktiḥ A.1.1.38.6 There are more such words formed by adding the affix tral, such as tatra ‘there,’ kutra ‘where,’ etc. We will discuss only atra in detail in this paper.
Derivation of atra:
etad + tral
a + tra (etad is replaced by a)
atra
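The replace-the-base, add-the-affix pattern shown in the derivation can be sketched as a toy lookup. This is only an illustration, not part of the paper or of any wordnet tool; the replacements for tad and kim are assumed from standard Pāṇinian grammar (the text itself gives only etad → a).

```python
# Toy sketch: forming tral-affixed adverbs such as atra, tatra and kutra.
# Only the etad -> a replacement is given in the text; tad -> ta and
# kim -> ku are assumed from standard Paninian grammar.
REPLACEMENTS = {
    "etad": "a",   # etad -> a  + tra = atra  'here'
    "tad": "ta",   # tad  -> ta + tra = tatra 'there'  (assumed)
    "kim": "ku",   # kim  -> ku + tra = kutra 'where'  (assumed)
}

def add_tral(base: str) -> str:
    """Replace the pronominal base, then append the affix tral (surfacing as -tra)."""
    return REPLACEMENTS[base] + "tra"

print(add_tral("etad"))  # atra
```

The point of the sketch is that the surface adverb is not a declension of the base: the base is wholly replaced before the affix is added, which is why traditional grammar treats the result as an underived indeclinable.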
All the lexicographers treat it as an adverb except Monier-Williams (1872), Monier-Williams, Leumann, and Cappeller (1899), Apte (1890) and Goldstücker (1856). These lexicographers consider it indeclinable, as do Radhakantadeva (1819–1858) and Bhattacharya (1873–1884). Cappeller (1887) does not assign any category to it, but describes it morphologically. We can observe that the lexicographers who use the term indeclinable as a part-of-speech category follow traditional grammar. The other lexicographers, though aware of this analysis, do not follow the traditional grammar.
2.2.2 tūṣṇīm
The traditional Sanskrit grammarians list words which are non-derivable; that list receives the status of indeclinable. The word under discussion is a member of this list. Tūṣṇīm, which means ‘silently,’ is categorized as an indeclinable. Radhakantadeva (1819–1858), Wilson (1832), Monier-Williams (1872), Monier-Williams, Leumann, and Cappeller (1899), Bhattacharya (1873–1884) and Apte (1890) follow the tradition and indicate its category as indeclinable. The rest of the lexicographers assign it to the category of adverb. Here also we can observe that Radhakantadeva (1819–1858) and Bhattacharya (1873–1884) are consistent in following the traditional grammar. Those lexicographers who label it an adverb are also consistent in analyzing the indeclinables listed by the traditional grammarians as adverbs.

6 This is a rule in Pāṇini's Aṣṭādhyāyī. It assigns the term avyaya ‘indeclinable’ to those words which end in the affixes termed taddhita and are not used in all cases.
2.2.3 ciram
Ciram means ‘for a long time.’ It can be analyzed as the accusative case of cira. The traditional grammarians of Sanskrit treat it as an indeclinable, as they include it in the list of non-derivable words; they do not analyze it as a nominal form. Lexicographers, however, vary in their analysis. Macdonell (1893), Yates (1846), Bopp (1847), Cappeller (1887) and Cappeller (1891) assign an adverb category to it. Wilson (1832), Monier-Williams (1872) and Monier-Williams, Leumann, and Cappeller (1899) treat it as an indeclinable. Apte (1890), Böhtlingk and Roth (1855–1875), Benfey (1866) and Burnouf (1866) describe its adverbial role, but do not assign an adverb category to it.
Macdonell (1893), Böhtlingk (1879–1889), Monier-Williams (1872), Monier-Williams, Leumann, and Cappeller (1899), Benfey (1866) and Burnouf (1866) list it under cira. Thus, they assume that all forms of cira are derivable: forms such as ciram (formally identical to the accusative singular of a nominal base which ends in short a); cireṇa (formally identical to the instrumental singular); cirāya (formally identical to the dative singular); cirāt (formally identical to the ablative singular); and cirasya (formally identical to the genitive singular). These forms are given separately by Radhakantadeva (1819–1858) and Bhattacharya (1873–1884), who treat them as indeclinable. This evidence is sufficient to say that, according to them, ciram, cireṇa, cirāya, cirāt and cirasya are different words, not declensions of cira, which is contrary to the western lexicographers' treatment. Thus, western lexicographers do not follow the traditional grammar in this case, while Radhakantadeva (1819–1858) and Bhattacharya (1873–1884) follow the tradition and maintain the forms' independent status.
2.2.4 javena
This is the instrumental singular of java ‘speed.’ None of the lexica records this form as an adverb, but its ablative form is assigned an adverb category by Cappeller (1887). Böhtlingk (1879–1889) notes its ablative form, and gives its meaning as eiligst (‘in great haste’), alsbald (‘soon’). Stchoupak, Nitti, and Renou (1932) note its accusative and ablative forms and give its meaning as rapidement, vivement (‘quickly, sharply’). They do not assign any category to it, but the meanings given certainly reflect its adverbial use. The instrumental case of java ‘speed’ does not occur in dictionaries and hence is not recognized as an adverb. Accordingly, words like raṃhasā, vegena and vegāt ‘speedily’ should be recognized as adverbs, since they are instrumental and ablative singular forms of raṃhas and vega ‘speed’ respectively. However, these also do not occur in dictionaries.
2.2.5 vividhaprakāram
The word vividhaprakāram ‘variedly’ is not found in any of the dictionaries. It is the accusative singular form of vividhaprakāra, which is a karmadhāraya (endocentric) compound.
2.2.6 śīghram
The word śīghram ‘quickly’ is the nominative and accusative singular form of śīghra ‘quick’; in the present context it is the accusative singular form. All the lexicographers consider it an adverb, except for Monier-Williams (1872), Monier-Williams, Leumann, and Cappeller (1899) and Apte (1890), who consider it an indeclinable. Stchoupak, Nitti, and Renou (1932) consider śīghra neither an indeclinable nor an adverb but an adjective. Burnouf (1866) mentions its gender and accusative form, but does not assign any category. Yates (1846) mentions its neuter gender by giving the nominative form, and assigns an adverb category to it. All of these lexicographers have analyzed it as derived from śīghra, which is an adjective. Monier-Williams (1872) and Monier-Williams, Leumann, and Cappeller (1899) do not use the adjective category; instead, they use the abbreviation mfn (masculine, feminine and neuter) to show that the word is used in all genders. Wilson (1832) and Cappeller (1887) record śīghra as a neuter word; thus, they consider it a noun. Radhakantadeva (1819–1858) and Bhattacharya (1873–1884) list śīghra and indicate its gender as neuter; they then mention its adjectival use through the term tadvati tri (i.e., having that (speed)). It can be inferred that they consider śīghra a noun, since they note its gender but do not mention its adverbial use. All of the lexicographers, except for Radhakantadeva (1819–1858) and Bhattacharya (1873–1884), take into consideration the adverbial śīghram, but do not consider it an independent lexeme.
2.2.7 yathāśakti
The word yathāśakti ‘according to one’s power or ability’ is an avyayībhāva compound. Radhakantadeva (1819–1858), Bhattacharya (1873–1884), Monier-Williams (1872), Monier-Williams, Leumann, and Cappeller (1899) and Apte (1890) give its category as indeclinable, following the traditional analysis. Benfey (1866), Bopp (1847) and Macdonell (1893) do not list this word, even though other avyayībhāva compounds are assigned to the adverb category.
3
Observations on the basis of the
previous section
This investigation gives rise to certain observations. We may say that tūsn
. ı̄m, atra and yathāśakti
are formal adverbs.
Ciram can be derived from cira, but its other
forms like ciren.a, cirāya, cirāt, cirasya are also
used as adverbs. So whether to analyze it
formally or functionally is a matter of debate.
Radhakantadeva (1819–1858) and Bhattacharya
(1873–1884) treat all these forms as synonyms on
the basis of the Amarakośa (a 6th century A.D.
Sanskrit thesaurus), and do not mention them under one lexeme, i.e., cira. Hence, we may say that
it is also a formal adverb on the basis of the monolingual dictionaries.
Śı̄ghram is also treated as a form of śı̄ghra,
which is an adjective according to western lexicographers. Hence, we may say that it is an adverbial, not an adverb, whereas Radhakantadeva
(1819–1858) and Bhattacharya (1873–1884) treat
it as a noun. They also take into consideration its
use as an adjective. If we follow modern western
lexicographers, then śı̄ghram is an adverbial. If we
follow monolingual dictionaries, then it is neither
an adverb nor an adverbial. In this way, it is difficult to decide the exact criterion by which to label
its category.
Javena is an adverbial. None of the lexica assign it to the category of adverb. Cappeller (1887),
it should be noted, cites its adverbial use in the
ablative case. Interestingly, Bhattacharya (1873–
1884) cites an example under java where it occurs
in the instrumental case, but he is silent about its
part-of-speech category. The one example given
by Gombrich that is not found in any of these dictionaries is vividhaprakāram.
Table 1: The number of completed synsets for each part-of-speech category in Sanskrit wordnet

Nouns        27563
Verbs         1247
Adjectives    4031
Adverbs        264
Total        33117
On the basis of this investigation, we may say
that there is no single policy adopted by modern
Sanskrit lexicographers to record adverbs. Even
after this investigation, doubts regarding the category of certain forms remain.
4 Adverbs in Sanskrit wordnet
These lexica are in print form and were written purely from the point of view of human use. Hence, a single entry contains a lot of information, and multiple functions of a word can be listed under one entry. But when a lexical resource is built for machines, this strategy cannot be adopted: the multiple functions of a word are stored separately. In other words, there is more than one entry for the same word, based on its meanings and functions, so that the information is explicit for a machine.
The Sanskrit wordnet is being developed by following the expansion approach, and its source is
the Hindi wordnet. It is a well known fact that
Sanskrit is a morphologically rich language. So a
proper policy should be adopted for part-of-speech
categories that take into account their nature. A
long and rich tradition of Sanskrit grammar guides
us in this regard. Following the tradition, we accept the verbal roots given in the list of verbal roots
known as the dhātupāt.ha after removing their metalinguistic features. For nouns, we enter the nominative singular form, and we enter the base forms
of adjectives.
Given the discussion above, should the Sanskrit wordnet have a separate category called ‘indeclinable’ which links to the relevant synsets in the
Hindi wordnet, or should it just retain the category of adverb? A wordnet recognizes a separate
category for function words even though none are
actually included in it. Indeclinables in Sanskrit
consist of function words as well as content words.
Hence it is difficult to adopt the category ‘indeclinable’ in the Sanskrit wordnet, which may harm the
basic principle of a wordnet. To avoid this, we
retain the adverb category. Thus, we follow western lexicographers who assign the adverb category
to those words which are indeclinables and which
can be termed formal adverbs. These words appear without any change in the Sanskrit wordnet,
e.g., atra ‘here,’ iha ‘here,’ etc. They appear in the
same synset (id 2647).7 The compound yathāśakti
is also entered without any change.8
The issue of adverbials remains to be solved.
How do we store the oblique cases of nouns or
adjectives that are used as adverbs? If they are
stored in their base forms, their role as an adverb
is restricted. Not all of the forms are used as adverbs. The Sanskrit wordnet resolves this issue by
storing the declined forms. For example, śı̄ghram,
śı̄ghren.a, javena, javāt appear in one synset (id
1922).9 At the same time, there is a separate entry
(id 5118) for śı̄ghra.10 In this way, we may say
that the Sanskrit wordnet stores adverbials. We
do not claim that this phenomenon is recognized
for the first time in the history of Sanskrit lexicography. It is implicit by its representation in
the dictionaries. We make it explicit for computational processing so that it will be helpful for an
automatic parser of Sanskrit. Such a parser would
benefit from a lexical resource that contains both
adverbs and adverbials.
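The storage policy just described can be pictured with a small sketch (Python; the dict layout, field names and glosses are hypothetical illustrations, while the synset ids 1922 and 5118 and the member forms come from the discussion above):

```python
# Minimal illustration of storing adverbials as declined forms.
# Synset ids 1922 and 5118 are from the text; the layout and glosses
# are hypothetical and only illustrate the policy.

synsets = {
    1922: {  # adverbial synset: declined forms are stored as-is
        "pos": "adverb",
        "lemmas": ["śīghram", "śīghreṇa", "javena", "javāt"],
        "gloss": "quickly, with speed (illustrative gloss)",
    },
    5118: {  # separate entry for the base form
        "pos": "adjective",
        "lemmas": ["śīghra"],
        "gloss": "quick, speedy (illustrative gloss)",
    },
}

def lookup(form):
    """Return ids of synsets whose lemma list contains the surface form."""
    return [sid for sid, s in synsets.items() if form in s["lemmas"]]

print(lookup("śīghram"))  # the declined form is an entry in its own right
print(lookup("śīghra"))   # the base form lives in a separate synset
```

Keeping the declined forms as first-class entries is what makes the adverbials directly retrievable by a parser, instead of being hidden under a base-form entry.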
5 Adverbs in the Hindi and Sanskrit wordnets
The discussion in the previous sections focuses on
adverbs as a part-of-speech category. In this section, we address two issues regarding the linking
of synsets of adverbs.
1. It is difficult to link a synset in the source language if it uses an adverb to express what the target language conveys by using preverbs, which are bound morphemes.
2. According to the policy of the expansion approach, we cannot link a synset whose part-of-speech category in the source language differs from that in the target language. For example, if
7 The source synset in Hindi is yahām. isa jagaha itah. ita iha ihām. ihavām. ı̄m. ghe ı̄hām. yahām.
8 The source synset in Hindi is id 9882 yathāśakti, yathāsambhava, bhaarasaka, yathāsādhya, ks.amatānusāra, yathāks.ama ‘according to one’s power or ability.’
9 The linked Hindi synset contains more than 30 words such as jhat.pat., cat.pat., etc.
10 The linked Hindi synset is tı̄vra, druta, teja, etc.
the source language uses a noun or an adjective,
and the target language uses an adverb to convey
the same lexical concept, then we cannot link these
synsets.
These are cases of language divergence that become apparent when Sanskrit is analyzed in comparison to other languages. Let us take an example for each of the two above-mentioned issues.
5.1 Adverbs in Hindi and preverbs in Sanskrit
Hindi Synset id 10819
Gloss: laut.akara phira apane sthāna para ‘Returning to his own place again.’
Example: Mohana kala hi videśa se vāpasa āyā
‘Mohana came back yesterday from abroad.’
Synset: vāpasa vāpisa ‘back’
Sanskrit uses the preverb and verb combination
to convey the meaning ‘back.’ It does not use
an independent word. The preverb prati is used
with verbs of motion. We cannot store preverbs
separately in synsets because they are bound morphemes. So the synset in the Hindi wordnet is not
linkable to the Sanskrit wordnet. This aspect of
preverbs that conveys adverbial sense becomes apparent when Sanskrit is analyzed in the context of
another language, i.e., Hindi.
5.2 Cross part-of-speech category
Hindi Synset id 11374
Gloss: ām.khom. ke sāmanevālā ‘the one who is in front of the eyes.’
Example: śiks.aka ne chātrom. ko pratyaks.a ghat.anā para ādhārita nibam.dha likhane ko kahā. ‘The teacher asked students to write an essay based on an actual incident.’
Synset: pratyaks.a sāks.āt anvaks.a aparoks.a samaks.a nayanagochara ‘evident.’
The Sanskrit word pratyaks.a, which is an
avyayı̄bhāva compound, is not an adjective in the
sense of ‘evident’ but an adverb. When this word
was borrowed in Hindi, its category changed. So
the synset in Hindi is not linkable to the Sanskrit
wordnet under the adjective category. Cross part-of-speech category linkage would be a solution to this problem.
6 Adverbs and their relations
There are two kinds of relations, ‘derived from’ and ‘modifies verb,’ for adverbs in the Hindi wordnet, and so also in the Sanskrit wordnet. Both of
these relations cross the part-of-speech category.
The first relation is between a noun and an adverb or between an adjective and an adverb, and
the second relation is between a verb and an adverb. The adverbials, such as vegena, are easy to
link by this relation. In this case, vega ‘speed’ is a
noun which is linkable to vegena with the relation
‘derived from.’ The non-derived adverbs such as
uccaih. ‘high,’ nı̄caih. ‘below,’ and śanaih. ‘slowly’
cannot be linked with any other noun or adjective
because they are frozen forms. These non-derived
adverbs may not present a complex situation, as
there is only one form. The complexities arise
with words like cira ‘for a long time.’ If adverbs
such as ciram, cirasya, etc. are considered as derived from cira, then there should be a separate
synset in the adjective category. It is hard to form
such a separate synset because cira is not used as an adjective. If these adverbs are considered non-derived, then they cannot be linked to any other
synset with the relation ‘derived from.’
The compound yathāśakti, for example, is derived from yathā and śakti. Should it be linked to
both of these words? Currently, it is linked only to
śakti. Thus, it is a matter of concern whether compounds should be linked to one or more of their
components. In this way, there is a need for more
analysis regarding the relations of adverbs in Sanskrit.
7 Conclusion
From the above discussion, it is clear that adverbs
in Sanskrit are formal as well as functional, and
that they have not received any uniform treatment
in the hands of lexicographers. Formal adverbs
are easy to store under the adverb category in the
Sanskrit wordnet. The challenge lies with the nominal forms that are used adverbially. It is the Sanskrit wordnet’s contribution to lexicalize these adverbials, especially the declined forms of nouns and adjectives. The real challenge is to collect all of the possible cases. Currently, the Sanskrit wordnet stores
those cases that are available in the lexical sources
it uses.
The case of adverbs in Sanskrit reveals the complexity of their nature. Clearly, a lexicon developed for machine use will need to adopt strategies suitable for its system.
8 Acknowledgement
We thank Mr. Peter Voorlas for his valuable help
in editing this paper.
References
Apte, Vaman Shivram. (1890). The Practical
Sanskrit-English Dictionary. Delhi: Motilal Banarasidas.
Benfey, Theodore (1866). A Sanskrit-English Dictionary. London: Longmans, Green, and Co.
Bhat, D. N. S. (1991). An Introduction to Indian Grammars: Part Three: Adjectives. A report submitted to the University Grants Commission.
Bhattacharya, Taranatha Tarkavacaspati (1873–1884). Vācaspatya Br.hatsam.skr.tābhidhāna. Calcutta: Kavya Prakash Press.
Böhtlingk, Otto von (1879–1889). Sanskrit-Wörterbuch in kürzerer Fassung. St. Petersburg: Kaiserliche Akademie der Wissenschaften.
Böhtlingk, Otto von and Rudolph von Roth (1855–1875). Sanskrit-Wörterbuch. St. Petersburg: Buchdruckerei der Kaiserlichen Akademie der Wissenschaften.
Bopp, Francisco (1847). Glossarium Sanscritum.
omnes radices et vocabula usitatissima explicantur et cum vocabulis graecis, latinis, germanicis, lithuanicis, slavicis, celticis comparantur. Berlin: Dümmler.
Burnouf, Émile (1866). Dictionnaire classique
Sanscrit-Francais. oú sont coordonnés, revisés et complétés les travaux de Wilson,
Bopp, Westergaard, Johnson etc. et contenant
le devanagari, sa transcription européene,
l’interprétation, les racines et de nombreux rapprochements philologiques, publié sous les auspices de M. Rouland, ministre de l’instruction
publique. Paris: Adrien-Maisonneuve.
Cappeller, Carl (1887). Sanskrit-Wörterbuch.
nach den Petersburger Wörterbüchern bearbeitet. Strassburg: Karl J. Trübner.
– (1891). A Sanskrit-English Dictionary. Based
upon the St. Petersburg Lexicons. Strassburg:
Karl J. Trübner.
Fellbaum, Christiane, ed. (1998). WordNet: An Electronic Lexical Database. Cambridge: MIT Press.
Ghatge, A. M., ed. (1981). An encyclopedic dictionary of Sanskrit on historical principles. Vol. 2.
Poona: Deccan College Post Graduate and Research Institute.
Goldstücker, Theodor (1856). A Dictionary in
Sanskrit and English. Extended and improved
from the second edition of the dictionary of
Professor H. H. Wilson, with his sanction and
concurrence, together with a supplement, grammatical appendices and an index serving as an
English-Sanskrit vocabulary. Berlin: A. Asher
and Co.
Gombrich, Richard (1979). “‘He cooks softly’:
Adverbs in Sanskrit. In Honor of Thomas Burrow”. In: Bulletin of the School of Oriental and
African Studies, University of London 42 no. 2,
pp. 244–256.
Joshi, Shivaram Dattatreya (1967). “Adjectives
and Substantives as a Single Class in the ‘Parts
of Speech’”. In: Journal of University of Poona
Humanities Section, pp. 19–30.
Kulkarni, Malhar et al. (2011). “Adverbs in Sanskrit Wordnet”. In: Icon 2011. URL: http://
www . cfilt . iitb . ac . in / wordnet /
webhwn / IndoWordnetPapers / 02 _
iwn_Adverbs%20in%20SWN.pdf.
Macdonell, Arthur Anthony (1893). A Sanskrit-English Dictionary. Being a practical handbook with transliteration, accentuation, and etymological analysis throughout. London: Longmans, Green, and Co.
– (1927). Sanskrit grammar for students. 3rd ed.
Oxford: Oxford University Press.
Monier-Williams, Monier (1846). An elementary
grammar of Sanskrit language. partly in roman
character. London: W. H. Allen.
– (1872). A Sanskrit-English Dictionary. Etymologically and philologically arranged with special reference to Greek, Latin, Gothic, German, Anglo-Saxon, and other cognate Indo-European languages. London: The Clarendon
Press.
Monier-Williams, Monier, Ernst Leumann, and
Carl Cappeller (1899). A Sanskrit-English Dictionary. Etymologically and philologically arranged with special reference to cognate Indo-European languages. New edition, greatly enlarged and improved. Oxford: The Clarendon
Press.
Radhakantadeva (1819–1858). Śabdakalpadruma.
1st ed. Varanasi: Chaukhamba Sanskrit Series.
Speijer, J. S. (1886). Sanskrit Syntax. With an introduction by H. Kern. 1st ed. New Delhi: Motilal
Banarasidas Publishers.
Stchoupak, Nadine, Luigia Nitti, and Louis Renou (1932). Dictionnaire Sanskrit-Français.
Paris: Librairie d’Amérique et d’Orient, Adrien
Maisonneuve.
Whitney, W. D. (1879). Sanskrit Grammar. 1st ed.
Cambridge: Harvard University Press.
Wilson, Horace Hayman (1832). A dictionary in
Sanscrit and English. translated, amended, and
enlarged from an original compilation, prepared by learned natives for the College of
Fort William. Calcutta: Printed at the Education
press.
– (1841). An introduction to the grammar of the
Sanskrit language. for the use of early students.
London: J. Mandon and co.
Yates, William (1846). A dictionary in Sanscrit
and English. designed for the use of private students and of Indian colleges and schools. Calcutta: Baptist Mission Press.
WSD in monolingual dictionaries for Russian WordNet
Daniil Alexeyevsky
Higher School of Economics, National Research University, Moscow, Russia
[email protected]
Anastasiya V. Temchenko
Moscow, Russia
[email protected]
Abstract
Russian is currently poorly supported with WordNet-like resources. One of the new efforts to build a Russian WordNet involves mining monolingual dictionaries. While most steps of the building process are straightforward, word sense disambiguation (WSD) is a source of problems: because of the limited word context, a specific WSD mechanism is required for each kind of relation mined. This paper describes the WSD method used for mining hypernym relations. The first part of the paper explains the main reasons for choosing monolingual dictionaries as the primary source of information for a Russian WordNet and states some problems faced during information extraction. The second part defines the algorithm used to extract hyponym-hypernym pairs. The third part describes the algorithm used for WSD.
1 Introduction
After the development of Princeton WordNet (Fellbaum, 2012), two main approaches were widely exploited to create a WordNet for any given language: the dictionary-based approach (Brazilian Portuguese WordNet, Dias-da-Silva et al., 2002) and the translation-based approach (see, for example, Turkish WordNet, Bilgin et al., 2004). The latter assumes that there is a correlation between the synset and hyponym hierarchies in different languages, even in languages that come from distant families. Bilgin et al. employ bilingual dictionaries for building the Turkish WordNet using existing WordNets.
Multilingual resources represent the next stage in WordNet history. EuroWordNet, described by Vossen (1998), was built for the Dutch, Italian, Spanish, German, French, Czech, Estonian and English languages. Tufis et al. (2004) explain the methods used to create BalkaNet for the Bulgarian, Greek, Romanian, Serbian and Turkish languages. These projects developed monolingual WordNets for a group of languages and aligned them to the structure of Princeton WordNet by means of the Inter-Lingual Index.
Several attempts were made to create a Russian WordNet. Azarova et al. (2002) attempted to create a Russian WordNet from scratch using the merge approach: first the authors created the core of the Base Concepts by combining the most frequent Russian words with the so-called “core of the national mental lexicon” extracted from the Russian Word Association Thesaurus, and then proceeded with linking the structure of RussNet to EuroWordNet. The result, according to the project’s site (http://project.phil.spbgu.ru/RussNet/, last updated June 14, 2005), contains more than 5 500 synsets, which are not published for general use. The group of Balkova et al. (2004) started a large project based on bilingual and monolingual dictionaries and manual lexicographic work. As of 2004, the project was reported to have nearly 145 000 synsets (Balkova et al., 2004), but no website is available (Loukachevitch and Dobrov, 2014). Gelfenbeyn et al. (2003) used direct machine translation without any manual interference or proofreading to create a resource for a Russian WordNet (available for download at http://www.wordnet.ru). The project RuThes by Loukachevitch and Dobrov (2014), which differs in structure from the canonical Princeton WordNet, is a linguistically motivated ontology and contains 158 000 words and 53 500 concepts at the moment of writing. The YARN (Yet Another RussNet) project, described by Ustalov (2014), is based on a crowdsourcing approach to creating a WordNet-like machine-readable open online thesaurus and contains at the time of writing more than 46 500
synsets and more than 119 500 words, but lacks
any type of relation between synsets.
This paper describes one step of a semi-automated effort towards building a Russian WordNet. The work is based on the hypothesis that existing monolingual dictionaries are the most reliable resource for creating the core of a Russian WordNet. Due to the absence of open machine-readable dictionaries (MRDs) for Russian, the work involves shallow sectioning of a non-machine-readable dictionary (non-MRD). This paper focuses on the automatic extraction of hypernyms from a Russian dictionary over a limited number of article types. Experts then evaluate the results manually.
1.1 Parsing the Dictionary
As far as our knowledge extends, there is no Russian monolingual dictionary that was designed and structured according to machine-readable dictionary (MRD) principles and is also available for public use. There exist two Russian government standards that specify the structure of machine-readable thesauri (Standard, 2008), but they are not widely obeyed.
Some printed monolingual dictionaries are available in the form of scanned and proofread texts or online resources. For example, http://dic.academic.ru/ offers online access to 5 monolingual Russian dictionaries and more than 100 theme-specific encyclopedias. Each dictionary article is presented as one unparsed text entry. The resource http://www.lingvoda.ru/dictionaries/, supported by ABBYY, publishes user-created dictionaries in Dictionary Specification Language (DSL) format. The purpose of DSL is to describe how an article is displayed: DSL operates in terms of italic, sub-article and reference-to-article, and contains no instrument to specify the type of a relation. This seems to be the closest to an MRD among the available resources. Fully automated information extraction is out of the question in this case. When using a non-MRD we have faced a number of problems that should be addressed before any further processing can be started:
1. Words and word senses at the article head are not marked by unique numeric identifiers.
2. Words used in article definitions are not disambiguated, so creating a link from a word in a definition to the article defining that word sense is not a trivial task.
3. Many contractions and special symbols are used.
4. Circular references exist; this is expected for synonyms and the base lexicon, but uncalled for in sister terms, hypernyms, and pairs of articles with more complex relations.
5. The lexicon used in definitions is nearly equal to or larger than the lexicon of the dictionary.
In general, ordinary monolingual dictionaries, compiled by lexicographers, were not intended for later automated parsing and analysis. As stated in Ide and Véronis (1994), when converting typeset dictionaries to a more suitable format, researchers are forced to deal with:
1. Difficulties in converting from the original format, which often requires the development of a complex dedicated grammar, as previously shown by Neff and Boguraev (1989).
2. Inconsistencies and variations in definition format and meta-text.
3. Partiality of information, since some critical information in definitions is considered common knowledge and is omitted.
Research by Ide and Véronis (1994) gives us hope that monolingual dictionaries are the best source of lexical information for a WordNet. First they show that a single dictionary may lack a significant amount of relevant hypernym links (around 50-70%). Next they collect hypernym links from a merged set of dictionaries, and in the resulting set only 5% of hypernym links are missing or inconsistent compared with an expert-created ontology.
Their work is partly based on work by Hearst (1998), who introduced patterns for parsing definitions in traditional monolingual dictionaries. One notable work on word sense disambiguation using text definitions from articles was performed by Lesk (1986). The approach is based on intersecting the set of words in a word's context with the sets of words in the different definitions of the word being disambiguated. The approach was further extended by Navigli (2009) to use corpus bootstrapping to compensate for the restricted context in dictionary articles.
In this paper we propose yet another extension of Lesk’s algorithm based on semantic similarity databases.
2 Building the Russian WordNet
The specific aim of this work is to create a bulk of noun synsets and hypernym relations between them for further manual filtering and editing. To simplify the task we assume that every word sense defined in the dictionary represents a unique synset. Furthermore, we only consider one kind of word definition: definitions that start with a nominative-case noun phrase, e.g. rus. ВЕНТИЛЯ́ЦИЯ: Процесс воздухообмена в лёгких, eng. ‘VENTILATION: Process of gas exchange in lungs.’ We adhere to the hypothesis that in this kind of definition the top noun in the NP is the hypernym. In order to build a relation between a word sense and its hypernym we need to decide which sense of the hypernym word is used in the definition. This step is the focus of this work.
2.1 The Dictionary
The work is based on the Big Russian Explanatory Dictionary (BRED) by Kuznetsov S.A. (2008). The dictionary has a rich structure and includes morphological, word-derivation, grammatical, phonetic and etymological information, a three-level sense hierarchy, usage examples, and quotes from classical literature and proverbs. The electronic version of the dictionary was produced by OCR and proofreading with very high quality (less than 1 error in 1 000 words overall). The version also has sectioning markup of lower quality, with a false-positive rate in the range of 1-10 per 1 000 tag uses for the section tags of our interest.
We developed a specific preprocessor for the dictionary that extracts a word, its definition and its usage examples (if any) from each article. We call every such triplet a word sense and give it a unique numeric ID. An article can have a reference to a derived word or a synonym instead of a text definition. The type of the reference is not annotated in the dictionary. We preserve such references in a special slot of the word sense. The preprocessor produces a CSV table with senses.
2.2 Hypernym candidates
Given a word sense W we produce a list of all candidate hypernym senses. Ideally, under our assumption, the first nominative-case noun in W’s definition is a hypernym. However, due to variance in article definition styles and the imperfect morphological disambiguation used, some words before the actual hypernym are erroneously considered candidate hypernyms. To mitigate this we consider each of the first three nominative nouns candidate hypernyms. For each such noun we add each of its senses as candidate hypernym senses. If sense W is defined by reference rather than by a textual definition, we add both every sense of the referenced word and each of its candidate hypernym senses to the list of candidate hypernym senses of W.
2.3 Disambiguation pipeline
We have developed a pipeline for massively testing different disambiguation setups. The pipeline is preceded by obtaining common data: word lemmas, morphological information and word frequency. For the pipeline we broke the task of disambiguation down into steps, and for each step we implemented several alternatives:
1. Represent a candidate hyponym-hypernym sense pair as the Cartesian product of the list of words in the hyponym sense and the list of words in the hypernym sense, repeats retained.
2. Calculate a numerical word-similarity metric. This is the point we strive to improve. As baselines we used a random number, the inverse dictionary definition number, and the classic Lesk algorithm. We also introduce several new metrics described below.
3. Apply a compensation function for word frequency. We assume that the coincidence of frequent words in two definitions gives us much less information about their relatedness than the coincidence of infrequent words. We try the following compensation functions: no compensation, division by the logarithm of word frequency, and division by word frequency.
4. Apply a non-parametric normalization function to the similarity measure. Some of the metrics produce values with very large variance, which leads to situations where one matching pair of words outweighs a lot of outright mismatching pairs. To mitigate this we applied the following functions to reduce variance: linear (no normalization), logarithm, Gaussian, and logistic curve.
5. Apply an adjustment function to prioritize the first noun in each definition. While extracting candidate hypernyms the algorithm retained up to three candidate nouns in each article. Our hypothesis states that the first one is most likely the hypernym. We apply a penalty to the metric depending on the candidate hypernym's position within the hyponym definition. We tested the following penalties: no penalty, division by word number, and division by the exponent of the word number.
6. Aggregate the weights of individual pairs of words. We test two aggregation functions: average weight and sum of the best N weights. In the latter case we repeat the sequence of weights if there were fewer than N pairs. We tested the following values of N: 2, 4, 8, 16, 32.
Finally, the algorithm returns the candidate hypernym with the highest score.
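The steps above can be sketched end to end as follows (a minimal illustration under simplifying assumptions: toy senses and frequencies, exact-match overlap as the step-2 metric, log-frequency compensation from step 3, no normalization, an exponential position penalty from step 5, and best-N aggregation from step 6; all names and data are hypothetical):

```python
import math
from itertools import product

# Toy frequency table; in the real pipeline frequencies, lemmas and
# morphology come from the common preprocessing stage.
FREQ = {"process": 900, "gas": 40, "exchange": 55, "air": 120,
        "movement": 80, "money": 200, "goods": 70}

def word_sim(w1, w2):
    """Step 2 baseline: 1 if the words coincide (Lesk-style), else 0."""
    return 1.0 if w1 == w2 else 0.0

def compensate(score, w1, w2):
    """Step 3: damp matches between frequent words by log frequency."""
    f = max(FREQ.get(w1, 1), FREQ.get(w2, 1))
    return score / math.log(f + math.e)

def score_pair(hyponym_def, hypernym_def, position, n_best=4):
    """Steps 1 and 4-6: Cartesian product, position penalty, best-N sum."""
    weights = sorted(
        (compensate(word_sim(a, b), a, b)
         for a, b in product(hyponym_def, hypernym_def)),  # step 1
        reverse=True)
    while len(weights) < n_best:      # step 6: repeat if fewer than N pairs
        weights = weights + weights
    total = sum(weights[:n_best])     # step 6: sum of best N weights
    return total / math.exp(position)  # step 5: penalize late candidates

def best_hypernym(hyponym_def, candidates):
    """Return the candidate sense id with the highest score."""
    return max(candidates,
               key=lambda sid: score_pair(hyponym_def, *candidates[sid]))

# Candidate hypernym senses of "process" for a "ventilation"-like definition:
# each entry is (definition words, candidate position in the hyponym definition).
candidates = {
    "process#1": (["movement", "of", "air", "gas"], 0),
    "process#2": (["exchange", "of", "money", "goods"], 0),
}
print(best_hypernym(["process", "gas", "exchange", "in", "lungs"], candidates))
```

Each pipeline variant in the experiments corresponds to swapping one of these functions: a different metric at step 2, a different compensation, normalization, penalty, or aggregation at steps 3-6.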
2.4 Testing setup
For testing the algorithms we selected words in several domains for manual markup. We defined a domain as a connected component in the graph of word senses and hypernyms produced by one of the algorithms. Each annotator was given the task of disambiguating every sense of every word in such a domain. Given a triplet, an annotator assigns either no hypernym or one hypernym; in exceptional cases assigning two hypernyms to a sense is allowed.
One domain with 175 senses defining 90 nouns and noun phrases was given to two annotators to estimate inter-annotator agreement. Both annotators assigned 145 hypernyms within the set. Of those only 93 matched, resulting in 64% inter-annotator agreement.
The 93 identically assigned hyponym-hypernym pairs were used as the core dataset for testing results. An additional 300 word senses were marked up to verify the results on a larger dataset. The algorithms described were tested on both of the datasets.
2.5 Our Approach to Disambiguation
In this section we describe various alternatives to the metric function in step 2 of the pipeline.
One known problem with the Lesk algorithm is that it uses only word co-occurrence when calculating the overlap rate (Basile et al., 2014) and does not extract information from synonyms or inflected words. In our test it worked surprisingly well on the dictionary corpus, finding twice as many correct hypernym senses as the random baseline. We strive to improve that result for dictionary definition texts.
Russian has rich word derivation through variation of word suffixes. The first obvious enhancement of the Lesk algorithm to account for this is to assign similarity scores to words based on the length of their common prefix. In the results we refer to this metric as advanced Lesk.
Another approach to enhancing the Lesk algorithm is to detect cases where two different words are semantically related. To this end we picked a database of word associations, Serelex (Panchenko et al., 2013). It assigns a score on a 0-to-infinity scale to a pair of noun lemmas, roughly describing their semantic similarity. As a possible way to score words that are not nouns in Serelex, we truncate a few characters off the ends of both words and search for the best pair matching the prefixes in Serelex (see prefix serelex in Table 1).
We tested several hypotheses on how these two metrics can be used to improve the resulting performance. The tests were: to use only Lesk; to use only Serelex; to use Serelex where possible and fall back to advanced Lesk where no answer was available; and to sum the results of Serelex and Lesk. Since Serelex has a specific distribution of scores, we adjusted the advanced Lesk score to produce a similar distribution.
For each estimator we performed a full search through the available variations of steps 3-6 of the pipeline, selected the best on the core set, and estimated it again on the larger dataset.
Test results are given in Table 1:
Algorithm                            CoreSet   LargeSet
random                               30.8%     23.9%
first sense                          38.7%     37.7%
naive Lesk                           51.6%     41.3%
serelex                              49.5%     38.0%
advanced Lesk                        53.8%     33.3%
serelex with adjusted Lesk fallback  52.7%     36.3%
serelex + adjusted Lesk              52.7%     38.3%
prefix serelex                       38.0%     53.8%
Table 1. Precision of different WSD algorithms.
3 Discussion
The low resulting quality of disambiguation seems to be the result of several factors: the overall difficulty of the task (inter-annotator agreement is 64%), the quality of the input dictionary, and the quality of the similarity database used. We also seem to have missed some important linguistic or systemic features of the text. Notably, the algorithms presented are still generically applicable and do not use hypernym-specific information.
Despite the low precision in determining the exact hypernyms, the pipeline produces thematically related chains of words. Examples of
works for building WordNet relations from raw
dictionary data in Russian language3.
We described new algorithm for hypernym
disambiguation which performs somewhat better
than baseline in cases where annotators agree.
The possibility for better disambiguation of specific relation types within dictionaries to be still
open.
The resulting network, though noisy, is very
suitable for rapid manual filtering.
chains, extracted by prefix Serelex algorithm are
given below with English translation and comparison to Princeton WordNet (here “>>” symbolises IS_A relation):
 rus. спираль >> кривая >> линия
eng.‘spiral >> curve >> line’ compared
to PWN spiral >> curve, curved shape
>> line >> shape >> attribute >> abstraction >> entity
 rus. передняя >> комната >>
помещение eng. ‘anteroom >> room >>
premises’ compared to PWN
room >> room >> area >> structure
>> artifact >> whole >> object >>
physical entity >> entity
 rus. рост >> высота >> расстояние
eng. ‘stature, height >> height >> distance’ compared to PWN stature, height
>> bodily property >> property >> attribute >> abstraction >> entity
Dictionary parsing quality appears to be crucial for the current work, and the dictionary we
selected provides us with a huge set of difficulties: abbreviations; alternating language in sense
definitions; not all head words are lemmas (e.g.
plural for nouns that have singular); poor quality
of sectioning in OCR. Sectioning within BRED
presents a large problem due to underspecified
vaguely nested nature of sections. Properly digitized openly published Russian dictionary is really wished for.
Another problem with the dictionary is presence of nearly-identical definitions for the same
term. Due to restricted context in dictionary in
some cases it is difficult even for a human annotator to guess correctly whether a given pair of
definitions describes the same concepts or two
very distinct ones. This is especially true with
abstract terms like time (rus.: время), but physical entities like field (rus.: поле) also present
such troubles.
One further step to building the Russian
WordNet is to differentiate hypernyms from synonyms and co-hyponyms. Currently we hope to
achieve this through classification of definitions
and developing morphosyntactic templates to
match different relation types within them. This
is out of the scope of the current article though.
References
Azarova, I., Mitrofanova, O., Sinopalnikova, A., Yavorskaya, M., and Oparin, I. 2002. Russnet:
Building a lexical database for the russian
language. In Proceedings of Workshop on Wordnet Structures and Standardisation and How this affect Wordnet Applications and Evaluation. Las
Palmas: 60-64.
Basile, P., Caputo, A., and Semeraro, G. 2014. An
Enhanced Lesk Word Sense Disambiguation
Algorithm through a Distributional Semantic
Model. In Proceedings of COLING: 1591-1600.
Bilgin, O., Çetinoğlu, Ö., and Oflazer, K. 2004. Building a wordnet for Turkish. Romanian Journal of Information Science and Technology, 7(1-2): 163-172.
Balkova, V., Sukhonogov, A., and Yablonsky, S.
2004. Russian wordnet. From UML-notation to
Internet/Intranet Database Implementation. In
Proceedings of the Second Global Wordnet Conference.
Dias-da-Silva, B. C., de Oliveira, M. F., and de Moraes, H. R. 2002. Groundwork for the development of the Brazilian Portuguese Wordnet. In Advances in natural language processing: 189-196.
Fellbaum, C. 2012. WordNet. The Encyclopedia of
Applied Linguistics.
Gelfenbeyn, I., Goncharuk, A., Lehelt, V., Lipatov, A.
and Shilo, V. 2003. Automatic translation of
WordNet semantic network to Russian language. In Proceedings of International Conference
on Computational Linguistics and Intellectual
Technologies Dialog-2003.
Hearst, M. A. 1998. Automated discovery of
WordNet relations. WordNet: an electronic lexical database: 131-153.
Ide, N., Véronis, J. 1994. Machine Readable Dictionaries: What have we learned, where do we go. In Proceedings of the International Workshop on the Future of Lexical Research, Beijing, China: 137-146.

3 Available at http://bitbucket.org/dendik/yarn-pipeline
Ide, N., Véronis, J. 1993. Refining taxonomies extracted from machine-readable dictionaries. In
Hockey, S., Ide, N. Research in Humanities Computing 2, Oxford University Press.
Kuznetsov S.A. Кузнецов, С. А. 2008. Новейший
большой толковый словарь русского языка.
СПб.: РИПОЛ-Норинт.
Lesk, M. 1986. Automatic sense disambiguation
using machine readable dictionaries: how to
tell a pine cone from an ice cream cone. In
Proceedings of the 5th annual international conference on Systems documentation: 24-26
Loukachevitch, N., Dobrov, B. 2014. RuThes linguistic ontology vs. Russian Wordnets. GWC 2014: Proceedings of the 7th Global Wordnet Conference: 154-162.
Navigli, R. 2009. Using cycles and quasi-cycles to disambiguate dictionary glosses. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: 594-602.
Neff, M. S., and Boguraev, B. K. 1989. Dictionaries,
dictionary grammars and dictionary entry
parsing. In Proceedings of the 27th annual meeting on Association for Computational Linguistics:
91-101.
Panchenko, A., Romanov, P., Morozova, O., Naets,
H., Philippovich, A., Romanov, A., and Fairon, C.
2013. Serelex: Search and visualization of semantically related words. In Advances in Information Retrieval: 837-840.
GOST. 2008. Standard 7.0.47-2008, Format for representation on machine-readable media of information retrieval languages vocabularies and terminological data.
Tufis, D., Cristea, D., Stamou, S. 2004. BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. In: D. Tufiş (ed.): Special Issue on BalkaNet. Romanian Journal on Science and Technology of Information.
Ustalov, D. 2014. Enhancing Russian Wordnets
Using the Force of the Crowd. In Analysis of
Images, Social Networks and Texts. Third International Conference, AIST 2014. Springer International Publishing: 257-264.
Vossen, P. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Dordrecht: Kluwer Academic Publishers.
Playing Alias - efficiency for wordnet(s)
Sven Aller
University of Tartu
[email protected]
Heili Orav
University of Tartu
[email protected]
Kadri Vare
University of Tartu
[email protected]
Sirli Zupping
University of Tartu
[email protected]
Abstract

This paper describes an electronic variant of the popular word game Alias, in which people have to guess words from their associations via synonyms, opposites, hyperonyms, etc. The lexical data come from the Estonian Wordnet. The computer game Alias, which draws its information from the Estonian Wordnet, is useful for at least two reasons: it creates an opportunity to learn language through play, and it helps to evaluate and improve the quality of the Estonian Wordnet.

1 Introduction

WordNet1 is one of the best-known lexico-semantic resources, used not only as a thesaurus of linguistic knowledge but also in language technology applications. Tony Veale has said that "WordNet … has found myriad applications in the field of natural language processing2" (e.g. word sense disambiguation, ontologies, and wordnets for opinion mining or sentiment analysis).
The Estonian Wordnet (EstWN)3 has grown quite large, and our team works consistently on improving its quality. Since it is fairly complicated to revise concepts and their semantic relations manually (even one by one), automatic or semi-automatic ways of checking for and discovering errors are preferred. To check the consistency of EstWN, different test patterns (Lohk 2015) as well as word frequency lists and corpora have been used. One of the possibilities is to use gamification in language learning, namely a word explanation game called Alias. The Estonian computer game Alias4 uses the nouns, verbs, adjectives and adverbs present in EstWN5. In this paper we describe, firstly, how Alias is compiled and, secondly, how it helps to improve the quality of EstWN. Although the data are quite useful and interesting for language learning, that is not the primary focus of this paper.

2 Estonian Wordnet

When setting up the Estonian Wordnet we followed the principles of the Princeton WordNet and EuroWordNet6. EstWN was built as part of the EWN project (EuroWordNet-2, from the beginning of January 1998) and thus used the extension method as a starting point: Base Concepts from English were translated into Estonian as the first basis for a monolingual extension. The extensions have been compiled manually from Estonian monolingual dictionaries and other monolingual resources (such as frequency lists from the Corpus of Written Estonian7).
EstWN includes nouns, verbs, adjectives and adverbs, as well as a set of multiword units. The database currently (September 2015; version 72) contains approximately 75 000 concepts (covering more than 95 000 words) connected by approximately 210 000 semantic relations, and work is still in progress.

1 http://wordnet.princeton.edu
2 http://www.odcsss.ie/node/39
3 http://www.cl.ut.ee/ressursid/teksaurus/
4 http://keeleressursid.ee/alias/
5 http://www.cl.ut.ee/ressursid/teksaurus/
6 http://www.illc.uva.nl/EuroWordNet/
7 http://www.cl.ut.ee/korpused/
8 http://wordrobe.housing.rug.nl/Wordrobe

3 Design of the computer game Alias

Based on the Princeton WordNet, a game for word sense labeling has been created (Venhuizen et al. 2013)8. Since obtaining gold-standard data for
word sense disambiguation is costly, they use gamification to collect semantically annotated data. Another game that uses the Princeton WordNet is the online questions game Piclick9. It is an implementation of the twenty-questions game, in which one person thinks of a concept while the other asks a series of yes/no questions and tries to guess what the partner is thinking of (Rzeniewicz and Szymanski, 2013).
One computer game that uses concepts and the relations between them is the word explanation game Alias, where the goal is to explain words to one's partner using different hints. These hints are typically definitions, synonyms, antonyms, hyperonyms, hyponyms, etc., most of which are present in a wordnet, making it a suitable knowledge base for the Alias game engine.
Alias as a computer game is designed to be used by non-experts and non-linguists, and for players to play for fun. One of the main crowdsourcing platforms is Amazon's Mechanical Turk, where workers get paid; in the Alias game, by contrast, it is assumed that contributors are rewarded with entertainment, and players are challenged to win more points than the computer.
The computer chooses a random word and shows different hints that are supposed to help the player guess the right word. For each word, up to 12 randomly chosen hints are given, presented to the player in sequence. If the player has not guessed the word by the last hint, the point goes to the computer.
Alias is written in PHP and is web-based. To fit the game's architecture, the EstWN database is somewhat modified: Alias uses only those synsets that have at least three hints to show (synonyms or other semantic relations), which in turn ensures that the player always gets at least three hints.
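This selection criterion might be sketched as follows; the data layout is a hypothetical simplification, not the actual EstWN format:

```python
# Keep only synsets that can supply at least three hints
# (synonyms beyond the target word, plus other semantic relations).
# The dictionary layout below is a hypothetical simplification of EstWN.

def playable_synsets(synsets):
    """synsets: dict mapping synset id -> {'members': [...], 'relations': [...]}"""
    playable = {}
    for sid, data in synsets.items():
        # One member is the word being guessed; the rest can serve as hints.
        synonym_hints = max(len(data["members"]) - 1, 0)
        hint_count = synonym_hints + len(data["relations"])
        if hint_count >= 3:
            playable[sid] = data
    return playable

# Illustrative toy data: only the second synset has three hints available.
example = {
    "kaabu_1": {"members": ["kaabu"],
                "relations": [("has_hyponym", "viltkaabu")]},
    "laps_1": {"members": ["laps", "põnn"],
               "relations": [("has_hyperonym", "inimene"),
                             ("near_synonym", "võsuke")]},
}
```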
3.1 Different levels of Alias

EstWN contains words with very different usage frequencies, and it can be quite complicated to guess words that are rarely used (mostly adverbs, e.g. criss-cross) or domain-specific (e.g. grammatical categories in linguistics, such as the ablative case). For this reason, the words for the Alias game are selected by comparison against the word frequency lists from the Corpus of Written Estonian10, and only those words from the synsets that belong to the frequency list are selected for playing. Table 1 shows the number of words per word class at the different levels of the Alias game. Words are selected as follows: words from EstWN that are also in the list of most frequent words are taken, which means that conjunctions and pronouns are left out, since they do not exist in EstWN. Also, only one member of a synset is taken from the frequent-words list; for example, if both synset members are in the frequency list ('kid' and 'child'), only the first is chosen.

Table 1. Numbers of words of different levels in the Alias game

            Beginner          Intermediate      Expert
            (from 1000        (from 5000        (from 10000
            frequent words)   frequent words)   frequent words)
Nouns       333               1654              2863
Verbs       161               583               883
Adjectives  56                315               528
Adverbs     99                251               384
All         649               2803              4658

Based on this information there are three different levels: the beginner level contains 649 words (selected from the 1000 most frequent), the intermediate level contains 2803 words (selected from the 5000 most frequent) and the expert level 4658 words (selected from the 10 000 most frequent). Homonyms are connected; the word bank, for example, displays hints from the meanings of both the institution and the natural object.

9 https://kask.eti.pg.gda.pl/pinqee/game
10 http://www.cl.ut.ee/ressursid/sagedused/ (only in Estonian)

3.2 Questions for Alias

There are 55 different types of semantic relations present in the Alias game (as in EstWN). In addition, definitions and example sentences are used. Every type of semantic relation is tied to a certain sentence template, which is presented to the player. The sentences should be simple enough that an average user can understand the questions that present the different semantic relations.
Here are some of the sentence templates that Alias uses for questions:
• antonym – It's the opposite of ___ (for example "It's the opposite of a man")
• fuzzynym – It's somehow related to ___ (for example "It's somehow related to the word elegance")
As in the original board game Alias, the computer game asks for words in dictionary form – nouns in the nominative and verbs in the infinitive.
Estonian is rich in compound words, and in EstWN many hyponyms contain their hyperonym as the second part of the compound word.
1. For example: one type of kaabu 'hat' is vilt+kaabu 'trilby hat'
If a compound word contains the word that is currently being guessed, the matching stem is removed from the hint (see example 2). The same rule applies in the original board game. Since Estonian has a rich system of cases, persons and inflection, it is not trivial to find words with a matching stem. A morphological analyzer11 is used to compare the lemmas in the hint to the lemma of the asked word; if they match, the matching stem is replaced with a gap.
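A rough sketch of this gap-replacement rule (the toy lemma table stands in for the morphological analyzer, and the compound-tail check is a simplification of real Estonian compound analysis):

```python
# Replace hint words that share a lemma with the asked word by a gap,
# mirroring the board-game rule. A real implementation would call a
# morphological analyzer; the toy lemma table here is illustrative only.

TOY_LEMMAS = {
    "õunapuu": "õunapuu", "puu": "puu", "back": "back", "bring": "bring",
}

def lemmatize(token: str) -> str:
    return TOY_LEMMAS.get(token.lower(), token.lower())

def gap_hint(hint: str, answer: str, gap: str = "______") -> str:
    answer_lemma = lemmatize(answer)
    out = []
    for token in hint.split():
        if lemmatize(token) == answer_lemma:
            out.append(gap)                         # whole word matches
        elif token.lower().endswith(answer.lower()):
            out.append(token[:-len(answer)] + gap)  # compound tail matches
        else:
            out.append(token)
    return " ".join(out)
```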
2. For example:
Question:
See on teatud liiki õunapuu.
'This is a type of apple tree.'
is replaced with
See on teatud liiki õuna______
'This is a type of apple______'
Answer: puu ('tree')
Question:
You can use this word like that:
Bring back my pony to me
is replaced with
Bring ____ my pony to me
Answer: back

4 Some statistics from the play log

Since December 2014, Alias has been played 664 times. During these games 2571 words have been asked, which means that on average 3.87 words are guessed per game. As Table 2 shows, the percentage of correctly guessed words differs greatly across the different semantic relations, definitions and examples used.
All the semantic relations present in EstWN are also used in Alias. Of course, some relations in EstWN are not very frequent – role_instrument or has_mero_member, for example – which means that they are also asked less frequently during the game. Table 2 shows that the top-guessed relation is role_instrument, but since it occurred only 5 times, it is not statistically as important as, for example, definitions or the antonym relation. Group relations (group_role, group_xpos, group_holo, group_involved, group_derive) are merged in the table because they share the same sentence template for hints. These sentence templates will be changed in the next version of the game.

5 Discussion

As a psycholinguist, George Miller was interested in how human semantic memory is organized (Miller 1998) and in which types of relations are most typical between words and concepts.
In addition to (psycho)linguistic tests, some conclusions can be drawn from the log files of the Alias game. The results give us feedback on which relations are clear and which are too fuzzy, too general or simply too strange. For example: migration involved_location residence, abode. Piek Vossen's (2002) test for the location_involved relation is:
(A/an) X is the place where the Y happens.
So it is obvious that the relation between migration and residence needs to be corrected in EstWN.
As can be seen from Table 2, there is a slight difference between guessing hints containing hyperonyms (7.2%) and hyponyms (9.1%); the latter show slightly better results. Hyperonyms may be too general and may have multiple hyponyms, for example 'to run – to move', whereas giving a hyponym as a hint, for example 'to run – to sprint', captures the meaning of the word more precisely.
Since fuzzynym hints do not appear to be very useful to players (only 7.1%), we can assume that the connections and associations presented by fuzzynyms are too vague. Some fuzzynyms can be assigned to a more specific semantic relation, for example 'doctor' and 'stethoscope' or 'postman' and 'postbag', which denote something that belongs to a certain profession. But, as we could see from the play logs, many fuzzynyms are completely unrelated, for example 'presentation' and 'evolution', or 'painting' and 'education'.

11 http://www.filosoft.ee/html_morf_et/
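Per-relation figures of the kind reported in Table 2 can be recomputed from the play logs with a simple aggregation; the (relation, guessed) event format below is a hypothetical simplification of the actual log files:

```python
# Aggregate play-log events into per-relation occurrence counts and
# success percentages, as in Table 2. Each event is a (relation, guessed)
# pair; this format is a hypothetical simplification of the real logs.
from collections import defaultdict

def relation_stats(events):
    occurrences = defaultdict(int)
    right = defaultdict(int)
    for relation, guessed in events:
        occurrences[relation] += 1
        if guessed:
            right[relation] += 1
    # relation -> (occurrence, right cases, right %, wrong cases)
    return {rel: (n, right[rel], 100.0 * right[rel] / n, n - right[rel])
            for rel, n in occurrences.items()}

log = [("antonym", True), ("antonym", False), ("fuzzynym", False)]
```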
From the player's perspective, definitions (21.3%) and examples (18.2%) are among the most successful hints for guessing the right word. In many cases we can see from the logs that various hints with semantic relations do not help the player, but a definition or explanation – even as the first hint – is very informative. This means that, as a concept-based database, EstWN needs clear definitions and good examples that open up the meanings of concepts.
The meaning of a word is guessed quite well when hints present synonyms (here 'variants', 14.5% right answers), antonyms (33.7%), near antonyms (9.0%) or near synonyms (9.4%). It is intuitively simpler to guess, for example, the word 'kiss' from its synonym 'buss' than from its hyperonym 'touch', or the verb 'to buy' from its antonym 'to sell' than from its hyperonym 'to acquire'.
Hints that contain functional relations (i.e. role, meronymy) are usually very clear to a player; of course, these point to concrete objects. The role relation can connect both nouns to nouns and nouns to verbs. For example, the verb 'to run' has been guessed from its role_agent 'runner' but not from its hyperonym 'to move'.
The logs from the beginner and even the intermediate level can point to problems in the core vocabulary. For example, for the question 'this is a near synonym for the word swamp bridge', the correct answer should be 'road'. Of course this near-synonym link is not correct and should be revised in EstWN.
In many respects the game shows how free and arbitrary the associations between words/concepts are in human minds. For example, illegible (sloppy, quickly written) handwriting can remind us of doctors' handwriting. Still, if considered carefully and thoroughly, it is possible to find a certain system, similar to the model of the human mental lexicon that George Miller started to create. In "On wordnets and relations" (Piasecki et al. 2013) it is mentioned that forming a synset (in the wordnet sense) is quite a difficult task that has largely been left to the intuition of the people who build wordnets. The game gives us a chance to check how similar the compilers' intuitions are to the players' intuitions.
6 Conclusion
The play logs contain valuable information for a lexicographer, and using it to improve EstWN is quite a new approach. EstWN has benefited from the Alias game in many ways. Firstly, it was possible to identify completely false synsets and/or unsuitable semantic relations. Secondly, it was possible to correct some of the semantic relations. Thirdly, some of the definitions were improved and made more precise. The correction work has become more systematic as more log files have become available. In addition to revising and correcting synsets and their relations, it was interesting to observe which hints were more informative to players than others; this gives us good feedback on whether a semantic relation is too general, too narrow or simply too vague.
No less important is the value to the Alias game itself and its working principles. By studying the logs more thoroughly it is possible to improve the quality of Alias, for example how concepts are chosen and how hints are sorted, selected, formed and presented. The game is adaptable to any language that has its own wordnet.
Researchers of the Polish Wordnet (Maziarz et al. 2013) have said that "Synonymy is intended as the cornerstone of a wordnet, hypernymy – its backbone, meronymy – its essential glue". After analyzing the log files of the Alias game, we can say that traditional definitions and antonyms are clearest to players with no linguistic background.
References
Fellbaum, Ch., 2010. WordNet, in: Poli, R., Healy,
M., Kameas, A. (Eds.), Theory and Applications of
Ontology: Computer Applications. Springer Netherlands, pp. 231–243.
Lohk, A. 2015. A System of Test Patterns to Check and Validate the Semantic Hierarchies of Wordnet-type Dictionaries. Thesis on Informatics and System Engineering C105. Press of Tallinn University of Technology.
Maziarz, M., Piasecki, M.; Szpakowicz, S. 2013. The
chicken-and-egg problem in wordnet design: synonymy, synsets and constitutive relations. – Language Resources and Evaluation 47 (3), 769–796.
Miller, G. A. 1998. Nouns in WordNet. – WordNet.
An Electronic Lexical Database. Ed. Christiane
Fellbaum. Cambridge, MA: The MIT Press, 23–46.
Piasecki, M.; Szpakowicz, S.; Fellbaum, Ch.; Pedersen, B.S. 2013. On wordnets and relations. – Language Resources and Evaluation 47 (3), 757–767.
Rzeniewicz, J.; Szymański, J. 2013. Bringing Common Sense To Wordnet With A Word Game. Computational Collective Intelligence. Technologies and Applications (2013): 296-305. Web. 14 Sept. 2015.
Venhuizen, N.J., Basile, V., Evang, K. and Bos, J. 2013. Gamification for Word Sense Labeling. In Katrin Erk and Alexander Koller (eds.), Proceedings of the 10th International Conference on Computational Semantics (IWCS'13) – Short Papers, 397-403, Potsdam, Germany.
Vossen, P. 2002. EuroWordNet General Document. Version 3. Final. July 1, 2002. http://www.vossen.info/docs/2002/EWNGeneral.pdf (27.10.2015).
Table 2. Results of playing by different relations

Relation               Occurrence  Right cases  Right (%)  Wrong cases
role_instrument                 5            3      60.0%            2
role_agent                     17            7      41.2%           10
antonym                        86           29      33.7%           57
causes                         18            6      33.3%           12
has_holo_madeof                23            6      26.1%           17
DEFINITION                   1390          296      21.3%         1094
is_caused_by                   31            6      19.4%           25
EXAMPLE                      1136          207      18.2%          929
group_role                     41            6      14.6%           35
VARIANTS                     1597          232      14.5%         1365
has_mero_member                 7            1      14.3%            6
has_mero_madeof                 7            1      14.3%            6
has_meronym                    26            3      11.5%           23
has_mero_part                  36            4      11.1%           32
has_holo_member                18            2      11.1%           16
group_involved                 42            4       9.5%           38
near_synonym                  577           54       9.4%          523
has_hyponym                  2123          194       9.1%         1929
near_antonym                  200           18       9.0%          182
group_holo                     60            5       8.3%           55
has_mero_location              12            1       8.3%           11
role_location                  13            1       7.7%           12
has_hyperonym                 994           72       7.2%          922
has_xpos_hyponym              152           11       7.2%          141
fuzzynym                      622           44       7.1%          578
group_xpos                    313           19       6.1%          294
state_of                       84            4       4.8%           80
be_in_state                    45            1       2.2%           44
is_subevent_of                  4            0       0.0%            4
has_mero_portion                2            0       0.0%            2
has_holo_portion                2            0       0.0%            2
role_target_direction           1            0       0.0%            1
has_subevent                    1            0       0.0%            1
role_manner                     1            0       0.0%            1
has_holo_location               0            0       0.0%            0
belongs_to_class                0            0       0.0%            0
group_derive                    0            0       0.0%            0
role_source_direction           0            0       0.0%            0
has_instance                    0            0       0.0%            0
role_direction                  0            0       0.0%            0
Detecting Most Frequent Sense using Word Embeddings and
BabelNet
Harpreet Singh Arora
Computer Science and
Engineering, Academy of
Technology, Hooghly, India
Sudha Bhingardive
Department of Computer
Science and Engineering,
IIT Bombay, India
Pushpak Bhattacharyya
Department of Computer
Science and Engineering,
IIT Bombay, India
[email protected]
[email protected]
[email protected]
Abstract

Since the inception of the SENSEVAL evaluation exercises there has been a great deal of research into Word Sense Disambiguation (WSD). Over the years, various supervised, unsupervised and knowledge-based WSD systems have been proposed. Beating the first-sense heuristic is a challenging task for these systems. In this paper, we present our work on Most Frequent Sense (MFS) detection using word embeddings and BabelNet features. The semantic features from BabelNet, viz. synsets, gloss, relations, etc., are used for generating sense embeddings. We compare the word embedding of a word with its sense embeddings and take the sense with the highest similarity as the MFS. The MFS is detected for six languages, viz. English, Spanish, Russian, German, French and Italian. However, the approach can be applied to any language for which word embeddings are available.

1 Introduction

Word Sense Disambiguation (WSD) refers to the task of computationally identifying the sense of a word in a given context. It is one of the oldest and toughest problems in the area of Natural Language Processing (NLP). WSD is considered to be an AI-complete problem (Navigli et al., 2009), i.e., one of the hardest problems in the field of Artificial Intelligence. Various approaches to word sense disambiguation have been explored in recent years. The two most widely used families of approaches are disambiguation with annotated training data, called supervised WSD, and disambiguation without annotated training data, called unsupervised WSD.
The MFS is considered a very powerful heuristic for word sense disambiguation; even sophisticated methods find it difficult to outperform this baseline. The MFS baseline is created with the help of a sense-annotated corpus from which the frequencies of individual senses are learnt. Only 5 of the 26 WSD systems submitted to SENSEVAL-3 were able to beat this baseline. The success of the MFS baseline is mainly due to the frequency distribution of senses, the sense rank versus frequency graph being a Zipfian curve. Unsupervised approaches find it very difficult to beat the MFS baseline, while supervised approaches generally perform better than it.
In this paper we extend the work of Bhingardive et al. (2015), who used word embeddings along with features from WordNet to detect the MFS. We use word embeddings and features from BabelNet instead. Our approach works for all part-of-speech (POS) categories and is currently implemented for six languages, viz. English, Spanish, Russian, German, French and Italian. It can easily be extended to other languages if word embeddings for the specific language are available.
The paper is organized as follows: Section 2 covers related work; Section 3 describes BabelNet; our approach is given in Section 4; experiments are presented in Section 5, followed by the conclusion.
2
Related Work
McCarthy et al. (2007) proposed an unsupervised approach for finding the predominant sense using an automatic thesaurus. They used WordNet similarity for identifying the predominant sense. This approach outperforms the SemCor baseline for words with a SemCor frequency below five.
Bhingardive et al. (2015) compared the word embedding of a word with all of its sense embeddings to obtain the predominant sense with the highest similarity. They created sense embeddings using various features of WordNet.
Preiss et al. (2009) refine the most frequent sense baseline for word sense disambiguation using a number of novel word sense disambiguation techniques.
3 BabelNet

BabelNet (Navigli et al., 2012) is a multilingual encyclopedic dictionary, with lexicographic and encyclopedic coverage of terms, and a semantic network. It connects concepts and named entities in a very large network of semantic relations, made up of more than 13 million entries, called Babel synsets. Each Babel synset represents a given meaning and contains all the synonyms which express that meaning in a range of different languages.
BabelNet v3.0 covers 271 languages and is obtained from the automatic integration of:
• WordNet1 - a popular computational lexicon of English.
• Open Multilingual WordNet2 - a collection of WordNets available in different languages.
• Wikipedia3 - the largest collaborative multilingual Web encyclopedia.
• OmegaWiki4 - a large collaborative multilingual dictionary.
• Wiktionary5 - a collaborative project to produce a free-content multilingual dictionary.
• Wikidata6 - a free knowledge base that can be read and edited by humans and machines alike.
BabelNet provides APIs for Java, Python, PHP, Javascript, Ruby and SPARQL.

4 Our Approach

We propose an approach for detecting the MFS which is an extension of the work done by Bhingardive et al. (2015). Our approach follows an iterative procedure to detect the MFS of any word given its POS and language. It works for six different languages, viz. English, Spanish, Russian, German, French and Italian. We used BabelNet as a lexical resource, as it contains additional information compared to WordNet. The approach uses the pre-trained Google word embeddings7 for English; for all other languages, Polyglot8 word embeddings are used.
The steps followed by our approach, shown in Figure 1, are as follows:
1. The system takes a word, POS and language code as input.
2. For every sense of the word, features such as synset members, gloss, hypernym, etc. are extracted from BabelNet.
3. Sense embeddings (sense vectors) are calculated from this feature set.
4. Cosine similarity is computed between the word vector (word embedding) of the input word and each of its sense vectors.
5. The sense vector with the maximum cosine similarity to the input word vector is treated as the MFS for that word.

Figure 1. Steps followed by our approach

4.1 Calculating Sense Vectors

4.1.1 Creation of BOW

Bag of Words (BOW): A bag of words for each sense of a word is created by extracting context words from each individual feature from BabelNet. The BOWs obtained for the features are: BOWS for synset members (S), BOWG for content words in the gloss (G), BOWHS for synset members of the hypernym synset (HS), and BOWHG for content words in the gloss of the hypernym synset (HG).

1 http://wordnet.princeton.edu/
2 http://compling.hss.ntu.edu.sg/omw/
3 http://www.wikipedia.org/
4 http://www.omegawiki.org/
5 http://www.wiktionary.org/
6 https://www.wikidata.org/
7 https://code.google.com/p/word2vec/
8 http://polyglot.readthedocs.org/en/latest/Embeddings.html
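The BOW construction and the sense-vector averaging it feeds can be sketched as follows; the stop-word set and the tiny 3-dimensional embedding table are illustrative stand-ins for BabelNet output and real word embeddings:

```python
# Build a bag of words for a sense from its feature words and create the
# sense embedding as the average of the member word embeddings. The toy
# 3-d embedding table stands in for real (e.g. 300-d) word vectors.
import numpy as np

EMB = {  # toy word embeddings, for illustration only
    "bat": np.array([0.9, 0.1, 0.0]),
    "ball": np.array([0.8, 0.2, 0.1]),
    "game": np.array([0.7, 0.3, 0.2]),
    "insect": np.array([0.0, 0.9, 0.4]),
}
STOP = {"a", "an", "and", "the", "is", "of", "which"}

def bow(feature_words):
    """Keep content words for which an embedding is available."""
    return [w for w in feature_words if w not in STOP and w in EMB]

def sense_vector(feature_words):
    """Sense embedding = average of the word embeddings in the BOW."""
    words = bow(feature_words)
    return np.mean([EMB[w] for w in words], axis=0)
```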
After removing stop words and words for which
word embeddings are not available, we get the
updated BOWG1 as,
Word Embeddings: Word embedding or word
vector is a low dimensional real valued vector
which captures semantic and syntactic features of
a word.
BOWG1 = {bat ball game played two teams}
Now, the cosine similarity of each word in
BOWG1 with other words in BOWG1 is computed
to get the most relevant words which can
represent the sense S1. For instance, for a word
game, the average cosine similarity was found to
be 0.38 which falls in the selected threshold.
Hence, the word game is not filtered from the
BOWG1. Table 1 shows how the word game is
selected based on the average cosine similarity
score.
Word
Gloss
Cosine
Members
Similarity
game
played
0.50
game
ball
0.49
game
bat
0.30
game
two
0.17
game
teams
0.44
Sense Embeddings: Sense embedding or sense
vector is similar to word embedding which is also
a low dimensional real valued vector. It is created
by taking average of word embeddings of each
word in the BOW.
4.1.2 Filtering BOW
Filtering of BOWs are done to reduce the noise.
The following procedure is used to filter
BOWs:
1. Words for which word embeddings are not
available are excluded from BOW.
2. From this BOW, the most relevant words
are picked using following steps:
a. Select a word from BOW
b. The cosine similarity of that word
with each of the remaining words
in the BOW is computed.
c. If the average cosine similarity lies
between the threshold values 0.35
and 0.4, then we keep the word in
the BOW else it is discarded. It is
found that values above 0.4 were
discarding many useful words
while the values below 0.35 were
accepting
irrelevant
words
resulting in increasing the noise.
Hence, the threshold range of 0.35
- 0.4 was chosen by performing
several experiments.
Table 1: Cosine similarity scores of a word game
Average Cosine Score (game) =
(0.51 + 0.49 + 0.30 + 0.17 + 0.44)/5 = 0.38
Similar process is carried out for each word of
BOW.
4.2
Detecting MFS
In our approach we detect the MFS in an iterative fashion. In each iteration we check which combination of BOWs (BOWS, BOWG, BOWHS, and BOWHG) is sufficient to detect the MFS. This can be observed in figure 2.

Figure 2: Iterative process of detecting MFS (S, then S+G, then S+G+HS, then S+G+HS+HG; after each of the first three steps, the MFS is printed if it can be determined, otherwise the next step is tried; after the last step the MFS is printed)

For example, consider the input Word: cricket, POS: NOUN, Language code: EN. Let BOWG1 be the BOW of the gloss feature for the sport sense (S1) of the word cricket:

BOWG1 = {Cricket is a bat and ball game played between two teams of 11 players each on a field at the center of which is a rectangular 22-yard long pitch}
In figure 2, we can see how BOWs are used to create sense vectors in an iterative fashion to get the MFS. If synset members (S) are sufficient to
get the MFS then our algorithm prints the MFS
and stops, otherwise other BOWs of various
features like gloss (G), synset members of the
hypernym synsets (HS), content words in the gloss
of the hypernym synsets (HG) are used iteratively
to get the MFS. The algorithm is as follows:
1. For each sense i of a word:
   a. VEC(i) = Create_sense_vector(BOWSi)
   b. VEC(W) = word vector of the input word
   c. SCORE(i) = cosine_similarity(VEC(i), VEC(W))
2. Arrange the senses in descending order of similarity score.
3. If (SCORE(0) - SCORE(1)) > threshold:
   a. MFS = Sense(SCORE(0))
   b. Print MFS
   c. End
4. Else:
   a. Run Steps 1 to 3 for (BOWSi + BOWGi)
5. If (SCORE(0) - SCORE(1)) > threshold:
   a. Print MFS
   b. End
6. Else:
   a. Run Steps 1 to 3 for (BOWSi + BOWGi + BOWHSi)
7. If (SCORE(0) - SCORE(1)) > threshold:
   a. Print MFS
   b. End
8. Else:
   a. Run Steps 1 to 3 for (BOWSi + BOWGi + BOWHSi + BOWHGi)
   b. Print MFS
   c. End
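The iterative loop above can be condensed into a short sketch (an illustrative reimplementation, not the authors' code; the per-feature sense vectors and the toy input data are assumed precomputed and invented for the example):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def detect_mfs(word_vec, senses, threshold=0.02):
    """senses maps a sense label to its per-feature sense vectors in the
    order S, G, HS, HG; iteration k scores the mean of the first k vectors."""
    n_features = 4
    for k in range(1, n_features + 1):
        ranked = sorted(
            ((label, cosine(word_vec, np.mean(vecs[:k], axis=0)))
             for label, vecs in senses.items()),
            key=lambda x: x[1], reverse=True)
        # Stop once the top two senses are separated by more than the
        # threshold, or when all feature combinations are exhausted.
        if len(ranked) == 1 or ranked[0][1] - ranked[1][1] > threshold or k == n_features:
            return ranked[0][0]

# Toy 2-d example: S alone leaves the senses too close; adding G resolves it.
word_vec = np.array([1.0, 0.0])
senses = {
    "sport":  [np.array([1.0, 1.00]), np.array([1.0, 0.0])],  # vectors from BOWS, BOWG
    "insect": [np.array([1.0, 1.01]), np.array([0.0, 1.0])],
}
print(detect_mfs(word_vec, senses))  # sport
```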
There is a net speed-up in the procedure, as the computation time is significantly reduced compared to Bhingardive et al. (2015). As we use an iterative procedure for detecting the MFS, our approach in most cases gives a better result than Bhingardive et al. (2015), which we have verified manually.
5 Experiment and Results
We used Google's pre-trained word vectors as word embeddings for English; for all other languages, Polyglot's word embeddings are used. Due to the lack of gold data, we could not compare our results with the MFS results obtained from BabelNet. Even when considering Princeton WordNet as gold data, we cannot equate our results with it, because the detected senses might be semantically similar without being syntactically identical.
6 Conclusion
We have proposed an approach for detecting the most frequent sense of a word using BabelNet as a lexical resource. BabelNet is preferred as a resource since it incorporates data not only from Princeton WordNet but also from other sources; hence the volume of ambiguity is reduced by a significant proportion. Our approach follows an iterative procedure until a suitable context is found to detect the MFS of a word. It currently works for English, Russian, Italian, French, German, and Spanish, and it can be easily ported to further languages. An API has been developed for detecting the MFS using BabelNet, which can be made publicly available in future.
Where:
• VEC(i) denotes the sense vector of sense i of the input word.
• SCORE(i) is the cosine similarity between the word vector VEC(W) and the sense vector VEC(i).
• Sense(SCORE(i)) is the sense corresponding to SCORE(i).
• Ambiguity is resolved by comparing the scores of the most similar sense and the second most similar sense obtained after Step 2. Step 3 checks whether the difference between their scores is above the threshold of 0.02. (This threshold was chosen after conducting various experiments with other threshold values; the average difference between the two most similar senses was found to be 0.02.)
References
Hiram Calvo and Alexander Gelbukh. 2014. Finding the Most Frequent Sense of a Word by the Length of Its Definition. Human-Inspired Computing and its Applications. Springer International Publishing.
Diana McCarthy, Rob Koeling, Julie Weeds and John Carroll. 2007. Unsupervised Acquisition of Predominant Word Senses. Computational Linguistics, 33(4):553-590.
George A. Miller. 1995. WordNet: A Lexical Database
for English. Communications of the ACM Vol. 38,
No. 11: 39-41.
Judita Preiss. 2009. Refining the most frequent sense baseline. Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics.
Roberto Navigli. 2009. Word sense disambiguation: A
survey. ACM Computing Surveys (CSUR) 41.2:
10.
Roberto Navigli and Simone Paolo Ponzetto. 2012.
BabelNet: The automatic construction, evaluation
and application of a wide-coverage multilingual
semantic network. Artificial Intelligence 193
(2012): 217-250
Sudha Bhingardive, Dhirendra Singh, Rudramurthy V,
Hanumant Redkar, and Pushpak Bhattacharyya.
2015. Unsupervised Most Frequent Sense Detection
using Word Embeddings. North American Chapter
of the Association for Computational Linguistics
(NAACL), Denver, Colorado.
Problems and Procedures to Make Wordnet Data (Retro)Fit for a
Multilingual Dictionary
Martin Benjamin
École Polytechnique Fédérale de Lausanne
Lausanne, Switzerland
[email protected]
Abstract

The data compiled through many Wordnet projects can be a rich source of seed information for a multilingual dictionary. However, the original Princeton WordNet was not intended as a dictionary per se, and spawning other languages from it introduces inherent ambiguity that confounds precise inter-lingual linking. This paper discusses a new presentation of existing Wordnet data that displays joints (distance between predicted links) and substitution (degree of equivalence between confirmed pairs) as a two-tiered horizontal ontology. Improvements to make Wordnet data function as lexicography include term-specific English definitions where the topical synset glosses are inadequate, validation of mappings between each member of an English synset and each member of the synsets from other languages, removal of erroneous translation terms, creation of own-language definitions for the many languages where those are absent, and validation of predicted links between non-English pairs. The paper describes the current state and future directions of a system to crowdsource human review and expansion of Wordnet data, using gamification to build consensus-validated, dictionary-caliber data for languages now in the Global WordNet as well as new languages that do not have formal Wordnet projects of their own.

1. Introduction

When viewed from the perspective of creating a concept-based multilingual dictionary, the Global WordNet (GWN) is filled with both treasure and risk. The Kamusi Project has imported the freely available data from the Open Multilingual Wordnet (OMW) as seed for further dictionary development. In doing so, we have encountered issues with current Wordnet implementations1 that we hope to contribute toward resolving. Section 2 describes the work we have done to make existing OMW data available in a format that might add value for the public over previous distributions. Section 3 discusses problems encountered with using Wordnet data as the basis for detailed lexicography. Section 4 details the systems we are implementing to (1) offer improved data for current Wordnets and to (2) use as a basis for building parallel data for many more languages.

1 This paper uses "Wordnet" as a collective noun to signify the web of projects that adopt the synset and ontological approach, and that largely adhere to the same concept set, while also referring to individual Wordnets that exist for specific languages.

2. Converting synsets to concept-specific lemmas

In structuring a multilingual dictionary, Kamusi has determined that each concept/spelling pair within a language should be a distinct node; "light" (not heavy) is different from "light" (not dark) is different from "light" (not serious). This arrangement is compatible overall with the Princeton WordNet (PWN), which separates each sense it has identified for a given English spelling. However, PWN clusters other terms with the same general meaning in the same "synset", such as {cloth, fabric, material, textile}, so part of the conversion of PWN to the Kamusi structure is to make each member a separate node, each linked as a synonym to all others, while retaining for each the Wordnet working definition.

Wordnets for different languages are matched to PWN by synset (Bond and Foster 2013). PWN's own search engine shows the terms in the OMW that correspond to a synset, marked by language, with no further navigation possible between languages (see figure 1). The OMW search interface better shows the different synsets that are linked to the English concept (see figure 2), and also allows users to seek synsets in a second language that match through English to a search term in a first. For Kamusi, by contrast, the matrix of relationships between the individual terms within Wordnet synsets is the multilingual problematic. With English concepts and translation equivalents granted a debatable
assumption of validity, Kamusi has now linked
the individual terms in the synsets in each
language independently, with the matches
inferred through English shown as second degree.
In the example of “light” (not dark) in figure 3, the
concept as defined in English links to two
different nodes in Catalan, “brillant” and
“illuminós”, and two nodes in Spanish, “claro”
and “luminoso”. These particular senses of
“claro” and “luminoso” in turn link individually
to “brillant” and “illuminós”, and all five of the
preceding terms have independently negotiable
relationships with Japanese “明るい” and “明ら
か”, Croatian “svjetleći” and “svijetao”, and
onward through the languages available in OMW.
When new terms are matched to the concept in
Kamusi for non-Wordnet languages, for example
a Quechua equivalent matched to Spanish, links
are formed, with degree of separation indicated, to
all of the existing terms within the multilingual
relation set.
The data from OMW includes 117,659 synsets
from PWN, matched to varying amounts among
26 languages and two variants (for Chinese and
Norwegian), resulting in approximately 1.2
million individual nodes. Some large relation sets
include 150 or more terms as equivalents among
languages, which can produce upwards of 11,000
individual links; while server resources have not
been expended to tally the total links in the data,
at least ten million term pairs have been mapped.
Figure 1: Terms linked to English synset (PWN method)
Figure 2: Multilingual synsets linked to English synset (OMW method)
Figure 3: Each term linked to concept and each other, with joints (distance) and substitutes (type of equivalence) tracked (Kamusi method)
3. Problems with Wordnet as lexicography

Having thus worked at length with the data in OMW, we have encountered a number of limitations that bear mentioning and further work.

It is important to acknowledge that Wordnet was never intended to be a definitive dictionary, for English or any other language. The intent of the word list was to provide data for non-linguistic research, initially in psychology (Miller et al 1990, Miller and Fellbaum 2007). It is thus not a criticism to state that it does not fulfill a role it was not designed for. However, in the absence of a better large and well organized set of freely available terms and definitions, it has taken on the de facto role of a universal lexicon, linked not only across languages but also across numerous projects related to computational linguistics. We suggest that Wordnet can be retrofitted for incorporation within a more lexicographically oriented resource, without losing its strong bonds across languages and projects.

The first problem is that many of the English definitions in the PWN data are inadequate, some to the point of error. Many of the definitions were written by the founder of the project, who was not a lexicographer and was faced with the immense task of producing good-enough ways of understanding tens of thousands of terms. The data is thus peppered with definitions such as "elevator car: where passengers ride up and down"; the sense is clear to a knowledgeable speaker, but would not suffice for a credible dictionary. Sometimes the definition is a problem for one member of a synset, either because the terms do not have identical meanings (e.g., verb "eat, feed: take in food; used of animals only" is valid for "feed" but not for "eat") or because that term forms the nub of the explanation used to define the group (e.g., verb "visit, call, call in: pay a brief visit" functions for "call" and "call in", but
is a tautology for “visit”). Some definitions are
simply wrong; a law practice, as a lexicalizable
multiword expression, is not “the practice of law”,
but a business through which lawyers conduct
their profession.
The consequence of a wrong definition is that the errors propagate through reproductions, projects, and languages. Fixing mistakes is thus an opaque journey through long-completed Wordnet projects that are unlikely to
be reopened, in languages that can only be
corrected by their speaker communities if they are
alerted to the issues and provided with the tools to
make the necessary changes. All three languages
that attempt an equivalent for "law practice"
completely miss the true English sense (perhaps
the other 25 groups were too stymied by the
tautology to attempt a translation), so
Finnish, Thai, and Spanish parties must somehow
be alerted that the PWN definition has been
modified, and given the platform to review and
revise the term in their language. Further, the
original PWN definition must be maintained with
an indication that it had been deprecated, so
projects like BabelNet2 and VisuWords3 that link
to or build upon it (Navigli and Ponzetto 2010)
can see the adjustments flagged, and update
themselves accordingly. Unfortunately, numerous
websites have replicated the existing PWN data in
apparently static form (e.g., vocabulary.com4), so
the current data will live in many places forever.
The second problem is that many errors exist in
the equivalents that other languages map to
English. For example, the French word “lumière”,
always a noun, translates to a few senses of
English “light”, mostly in regard to things that
shine and figuratively in respect to illuminating
knowledge. As rendered in the WOLF French
Wordnet, however, “lumière” is mapped to 45
senses of “light”, as a noun, verb, or adjective,
with meanings such as “insubstantial”, “less than
the full amount”, and “alight from (a horse)”. Of
similar concern, “light” as visible radiation is
mapped to 24 different terms in Polish, and the
synset with “illuminate” is given 20 equivalents in
both Indonesian and Malaysian. While most
languages have a lively list of expressions for
some common concepts such as “goodbye”, large
sets of synonyms for most concepts indicate an
overly broad brush in the Wordnet compilation. In
the Polish example, the purported synonyms
include a range of things related to brightness,
such as “zaćmienie”, which is an eclipse. As with
poor English definitions, poor translations and
clustering are unlikely to be fixed because their
compilation projects have expired with no system
in place for updating data.
These issues point to a third problem, a
conceptual limitation that our concept-specific
rearrangement of the data described above in
section 2 seeks to address. A strength of Wordnet,
and indeed its main organizing principle, is the
highly detailed ontologies through which
concepts are related (Vossen et al 1998, Vossen
1998), such as hyponymy (this is a type of that)
and meronymy (this is a part of that), e.g. a ship is
a type of vessel and a deck is a part of a ship
(Fellbaum 1998). These precise vertical
ontologies are not matched, however, with a
method for understanding horizontal distinctions
within a synset (Derwojedowa et al 2008). Every
term within a synset is defined as “this” same
thing, e.g. E={approximate, estimate, gauge,
guess, judge}, “judge tentatively or form an
estimate of (quantities or time),” is all one notion.5
Moreover, every term in every synset linked from
every other language in GWN is bequeathed with
the same meaning, in this example including 6
terms in Croatian, 11 in Japanese including
orthographic variations, 20 in Arabic, 22 in
Indonesian, and 24 in Malaysian; any term in {‫ﺛ ّﻤﻦ‬
, ‫ ﻋ ﻠﻰ ﺣ ﻜ ﻢ‬, ‫ ﻗ ﺎرب‬, ‫ ﺛﻤ ﻦ‬, ‫ر أﯾ ﺎ ﻛﺎن‬, ‫ ﻗﻀ ﺎﺋ ﯿ ﺎ ﺣ ﻜ ﻢ‬,
‫ ﻗﯿّﻢ‬, ‫ ﻗﺪ ر‬, ‫ ﺗﺒ ﺄ ر‬, ‫ ﻓ ﺼ ﻞ‬, ‫ ﺧﻤﻦ‬, َ‫ َﺧ ﱠﻤﻦ‬, ‫ ﺣﺰر‬, ‫ ﻗﻮ م‬,
‫ ا ﺳ ﺘ ﻨ ﺘﺞ‬, ‫ ﻗﺎس‬, ‫ ﻣ ﺎ ﺷﻰ ءﺳ ﻌ ﺔ ﻋ ﯿﻦ‬, ‫ ظﻦ‬, ‫ ﻗ ّﺪر‬, ‫}ﺣ ﺎﻛ ﻢ‬
is equivalent to any term in {見立てる , 見積る ,
予算+する , 目算 , 積もる , 目算+する , 見積も
る , 予算 , 積る , 推算 , 推算+する}. Where the
English synset elides the large difference between
guessing and gauging, the multilingual composite
compounds the weakness of the assumption of
strict equivalence. The Arabic terms do not all
share a meaning with each other, nor are all the
Japanese terms internal synonyms, leaving no
way to determine whether ‫ ا ﺳ ﺘ ﻨ ﺘﺞ‬is a viable
translation for 積もる.6 Any term produced by a
2 http://babelnet.org/synset?word=bn:00050277n&details=1&orig=law%20practice&lang=EN
3 http://visuwords.com/law%20practice
4 http://www.vocabulary.com/dictionary/law%20practice
5 http://wordnet-rdf.princeton.edu/wn31/200674352-v
6 To evaluate these two blindly-chosen terms, bilingual informants translated both synsets, yielding information similar to what the processes in section 4 are designed to elicit. The Arabic term is substantially more definitive ("concluded") than the Japanese ("pile up like discussions during an absence"). {1. ثمّن, evaluated; 2. على حكم, judged; 3. قارب, compared; 4. ثمن, price; 5. رأيا كان, had an idea about; 6. قضائيا حكم, verdict; 7. قيّم, evaluated; 8. قدر, considered; 9. تبأر, focused; 10. فصل, separated; 11. خمن,
contributor in one language has a 1/E chance of
being a direct translation of one of the English
synset members, so any two cross-language terms
in GWN have a 1/E² chance of corresponding via the English intermediary with each other; in the example, E=5, any thoughtfully-produced term has a 20% chance of matching a specific term pertaining
to assessing amounts, and any two non-English
terms have a 4% chance of having been selected
as best equivalents of the same English term.
Linking the terms computationally is a prodigious
shortcut to find likely pairs, but it is not
lexicography.
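A quick numeric check of the argument above, with E = 5 as in the {approximate, estimate, gauge, guess, judge} example:

```python
# E = size of the English synset; a contributed term has a 1/E chance of
# translating one specific English member, and two non-English terms have
# a 1/E**2 chance of having been matched to the same English term.
E = 5
p_direct = 1 / E       # 0.2  -> 20%
p_cross = 1 / E ** 2   # 0.04 -> 4%
print(p_direct, p_cross)  # 0.2 0.04
```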
If, however, we see the synset as a grouping of
things that share a topical relationship rather than
a strict meaning, we can resolve the problem by
adding levels of detail similar to the vertical
Wordnet ontologies. Kamusi splits the topical
lumping of synonymy into what can be seen
as a two-tier horizontal ontology, joints and
substitutes, that extends the conceptualization of a
multilingual lexicon from a grid (Fellbaum and
Vossen 2007) to a matrix.
1. “Joints” is the relationship that shows that
terms have been linked transitively as synonyms
(synset members) or translations. Joints are
evaluated numerically by the degree of separation
between links that have, in principle, some
element of human confirmation.7 A first
generation joint indicates that two terms have
been manually paired, a second generation joint
links though one pivot term, third generation has
two intermediary terms, etc. With data from
GWN, the presumption of manual linking is
cloudy; all members of an English synset have
been manually linked to each other, all members
of internal synsets for most other languages have
been manually linked unless the Wordnet was
assembled computationally, and most other-language synsets have been manually linked to the
English synset, but that does not mean that
‫ ا ﺳ ﺘ ﻨ ﺘﺞ‬or 積もるhave been manually linked to
“guess” or “gauge”. In the current import, joints
within a language are all shown as first generation
(to be re-filtered as “synonyms” in due course),
and joints between each term in an English synset
and each member of a linked synset are also
shown as first generation, i.e., ‫ ا ﺳ ﺘ ﻨ ﺘﺞ‬is said to
be a first generation joint with both guess and
gauge, as is 積もる, with the Arabic and Japanese
terms therefore set as second generation. A future
method to validate joints is described below in
section 4.8.
2. “Substitutes” speaks to the degree of
equivalence between terms. Whether in-language
synonyms or cross-language translations, terms
are either “parallel” or “similar”, with the
additional possibility that a translation is an
“explanatory phrase” invented in one language to
fill a lexical gap for a concept that is indigenous
to another (Benjamin 2014b). Pending
programming will provide fields on Kamusi
similar to those for definitions. These fields
provide space for the differences between
“similar” substitutes to be elaborated, such as the
distinction between “arm” in English that is the
body part from the shoulder to the wrist versus
“mkono” in Swahili that extends from the
shoulder to the fingertips. Substitution
relationships can in principle be followed across
joint relationships, so that the degree of
equivalence can be tracked along with the degree
of separation, a task for future coding. For the data
imported from OMW, all substitution relations
have been set initially to “parallel”, putting aside
judgments about equivalence for a more distant
future.
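The joint generations described above can be sketched as shortest-path distances over a graph of manually confirmed pairings (a hedged illustration, not Kamusi's implementation; the transliterated term names are invented for the example):

```python
from collections import deque

def joint_generation(links, a, b):
    """Breadth-first search for the shortest chain of confirmed pairings:
    1 = first generation (manually paired), 2 = second generation (one
    pivot term), and so on."""
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        term, dist = frontier.popleft()
        if term == b:
            return dist
        for nxt in links.get(term, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # no chain of confirmed links

# The Arabic and Japanese terms are each paired with the English synset
# members, so they sit two joints apart via the English pivot.
links = {
    "istantaja": ["guess", "gauge"],  # transliterated for the sketch
    "tsumoru": ["guess", "gauge"],
    "guess": ["istantaja", "tsumoru", "gauge"],
    "gauge": ["istantaja", "tsumoru", "guess"],
}
print(joint_generation(links, "istantaja", "tsumoru"))  # 2 (second generation)
```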
A fourth limitation with using Wordnet as a
dictionary end-product is that it is incomplete in
some essential ways. Wordnet cannot be faulted
for not including every sense of every English
term, much less every term from other languages,
as that was never its mission. However, terms or
senses that are not in Wordnet, such as “light” as
a traffic signal, or “lightsaber”, should be included
– or at least includable – in a dictionary that
guessed; 12. َ‫ َﺧ ﱠﻤﻦ‬, quantified; 13. ‫ﺣﺰر‬, guessed; 14. ‫ﻗﻮ م‬,
measured; 15. ‫ ا ﺳ ﺘ ﻨ ﺘﺞ‬, concluded; 16. ‫ ﻗﺎس‬, measured; 17.
‫ ﻣ ﺎ ﺷﻰ ءﺳ ﻌ ﺔ ﻋ ﯿﻦ‬, set capacity of; 18. ‫ظﻦ‬, doubted; 19. ‫ﻗ ّﺪر‬,
evaluated; 20. ‫ﺣ ﺎﻛ ﻢ‬, put to trial};{1. 見立てる to judge or
diagnose [kanji for see and stand up] (make a visual
estimation such as a physical exam, or take measurements
for clothing); 2. 見積る, 3. 見積もる to estimate [kanji for
see and stack] (predict price and time for a job); 4. 予算+す
る, 5. 予算 to estimate or budget [kanji for calculate and
beforehand] (calculate anticipated expenses); 6. 目算, 7. 目
算+する to estimate [kanji for calculate and look] (an
inexact number such as ml in a cup or remaining moves in
Go); 8. 積もる, 9. 積る to estimate [kanji for stack]
(uncountable things such as snow or emotions); 10. 推算,
11. 推算+する estimation [kanji for calculate and guess]
(less-knowable or unknowable things such as a coin flip, the
size of a crowd, or evaluation of a crime scene)}.
7
This assumption does not necessarily hold, as some
Wordnets are built using automatic generation techniques
(Atserias et al 1997, de Melo and Weikum 2008, Oliver
2014). The tendency for error in computationally-derived datasets is amply displayed in the WOLF French Wordnet (Wordnet Libre du Français) (Sagot and Fišer 2012, http://alpage.inria.fr/~sagot/wolf-en.html).
aspires toward a thorough representation of a
language. If a concept is missing in PWN,
moreover, it stands little chance of appearing in
other language Wordnets, and conversely there is
no chance for a concept indigenous to another
language to join the global Wordnet concept set.
Within the scope of the Wordnet vision,
relationships that have not been found by Wordnet
editors cannot be forged by readers, such as
proposing that “boat” and “ship” be joined in a
synset. Further, the lack of own-language
definitions in most languages leaves the
impression that the meaning of each term can be
encapsulated in the English definition of the
corresponding synset, to the extent that the
attributed definition for “zaćmienie” is, exactly
and erroneously, “electromagnetic radiation that
can produce a visual sensation”. Finally, and
again because it is out of scope, Wordnet does not
include a great deal of information that is relevant
for dictionary or data purposes, such as word
forms (Spanish “invitado” does not indicate an
association with “invitada”, “invitados”, and
“invitadas”).
A final limitation with Wordnet is that projects
for many languages have licenses that restrict the
use of the data, if the data can be located at all. For
example, the Romanian Wordnet is distributed
with a “no derivatives” license. This means that
the data cannot be imported into the multilingual
structure described above, because linking
Romanian to Slovenian would be a derivative
product. Nor could the data be expanded, with
Romanian definitions or with information such as
the female form “invitátă” corresponding to the
given masculine “invitát”. Furthermore, the
Romanian data has a “no redistribution”
restriction, so its use in a project that makes its
data shareable or downloadable seems proscribed.
GermaNet is even more restrictive, only allowing
the data to be used for internal research within an
institution. The openness or lack thereof of Wordnets is indicated at http://globalwordnet.org/wordnets-in-the-world.
Bringing restricted Wordnets into a dictionary
project does not offer new technical challenges,
but is only possible if the creators choose to
amend their licenses.
4. Tools and techniques for adding and improving Wordnet data

Wordnet's popularity stems in part from its openness to the mash-ups others create from the core PWN data. In that spirit, Kamusi has developed tools that will transform the open Wordnet data into data that is appropriate for dictionaries and additional technological applications, using automated procedures as a starting point for human lexicographic review (Pianta, Bentivogli, and Girardi 2002). At the same time, these tools are designed to keep the data in synch with existing Wordnet instances, in such a way that transformations generated by Kamusi can be reincorporated in PWN or other language projects when and if their maintainers desire.

The primary new tools developed by Kamusi that can transform Wordnet data are a set of crowd-sourcing applications that include games embedded within Facebook and (still in alpha development) on mobile devices (Benjamin 2014a, Benjamin 2015). These games ask players to answer targeted questions about their language, for which they receive various rewards when their answers adhere to the consensus. The games build data progressively, such that a definition that has been approved for English can be shown to people producing equivalents or definitions for other languages.

These systems can transform Wordnet seed data into dictionary data, in several ways:

1. Each English definition will be reviewed as it pertains to the individual members of a synset, and improved when the participants find it appropriate. Players are shown the existing Wordnet "working definition", and given the opportunity to either suggest their own definition, vote for the Wordnet definition, or vote for a contribution from another player. Once a definition passes the consensus threshold, it is published to Kamusi and used for subsequent game modes. If the Wordnet definition has been replaced, it is shown on Kamusi as deprecated.

2. Definitions in their own languages for terms from other Wordnets will be generated using the same procedure. This feature will be introduced after players have had the chance to validate existing translations against a critical mass of finalized English definitions, e.g. a new English definition for "law practice" will first be given to Spanish speakers to verify or replace the currently matched Spanish term, and only afterwards will the approved Spanish term be advanced to the definition game.

3. Existing translations of PWN will be validated term by term. For example, Polish players will assuredly approve "światło" for the sense of visible light, but reject "zaćmienie". This mode has not been developed at time of writing, the
need having become evident only through examination of the data imported in mid-2015, but quick completion is anticipated. Terms that are
evicted from a defined synset, like “zaćmienie”,
will be moved through a sequence of games to
produce definitions, translations, and sense
matches.
4. Concepts from PWN that are not already
matched in other Wordnet languages will be
elicited. For example, the Arabic WordNet has
only 10,000 synsets, so more than 100,000
concepts remain untouched. In the game, players
are shown a defined English term and asked to
provide an equivalent term in their language.
Terms that pass the consensus threshold are added
to Kamusi, while non-winning terms are passed to
another mode to see whether they are synonyms
for the concept.
5. Languages that do not have existing Wordnet
projects will be opened to their speakers, using the
improved English definition set and the game
modes described above. Because the elicitation
list used in the games is inherently linked to
Wordnet, Wordnets for these other languages will
be created as a default outcome. This opens GWN
to languages that do not have formal organizations
to take on the trouble of creating a Wordnet
project, including building tools from scratch (e.g.
Wijesiri et al 2014), but do have passionate
speakers who will contribute through crowd
methods.
6. Languages that have existing but restricted
Wordnet projects, like German, will be opened for
their speakers to start from scratch. This is a
phenomenal waste of time and energy, if one can
speak frankly in an academic paper, but, barring
changes in license restrictions, may be the fastest
way to acquire reliable data that can be used in an
open resource.
7. One already-developed game calls on players
to judge whether usages gleaned from Twitter or
more formal corpora (currently configured for
Wikipedia and the Helsinki Corpus of Swahili,
but the technique can be applied more widely) are
good examples to illustrate a particular sense.
Most Wordnets lack usage examples, so this game
can fill that gap for many languages. Future game
modes will elicit additional lexical and
ontological information, some of which falls
within the scope of what is sought within
Wordnets.
8. A future game mode, which will be activated
after languages have sufficient numbers of
defined entries, will ask users to confirm joints
established through English for their language
pairs. For example, “światło” and “lumière” will
be shown with their respective own-language
definitions, and a registered Polish/ French
speaker will vote whether the two concepts match.
This game can only be played after sufficient data
for the concerned languages has been gathered in
the English-confirmation mode described above
in paragraph 4.3. The result will be validated
aligned Wordnets for numerous language pairs.
9. Work on other tracks within Kamusi will
introduce many terms and senses that are not part
of PWN or other Wordnets. These concepts will
be made available to language teams, and some
could form part of an extended multilingual
Wordnet desiderata.
5. Conclusions
This paper has discussed two difficulties with
using Global Wordnet as the source for a formal
multilingual dictionary. First, Wordnet does not
do things it was not intended to do, but that are
needed for lexicography, such as differentiation of
terms grouped topically in synsets and matching
those concept distinctions across languages.
Second, some of the things it does do bear
improvement, either in quantity (completion of
the full PWN set of synsets in other languages,
production of own-language definitions), quality,
or access. Fortunately, the open approach with
which Wordnet was designed makes it possible to
retrofit the data with English definitions that may
be more sensible than those initially drafted, and
with revised equivalents in other languages when
necessary, without severing the bonds that have
already been built across languages and projects.
The broad inter-lingual predictions made possible
by GWN have been refined by charting the joints
between members of a topical group, and will
further show the degree to which confirmed pairs
can substitute for each other. The work will not be
easy, involving recruiting many crowd members
from many languages, as well as oversight from
authoritative arbiters. However, many of the tools
have already been developed, and are being rolled
out gradually as Kamusi musters the resources to
foster speaker communities and manage the
incoming data flow. As time goes on, the data
produced by various Wordnet projects will lie at
the core of a more comprehensive multilingual
dictionary, and the data from the dictionary
project will be available for the further refinement
of existing and future Wordnets.
References
George A. Miller and Christiane Fellbaum. 2007.
WordNet then and now. In Language Resources
and Evaluation, 41:209–214.
Jordi Atserias, Salvador Climent, Xavier Farreres,
German Rigau, and Horacio Rodriguez. 1997.
Combining multiple methods for the automatic
construction of multi-lingual WordNets. In Recent
Advances in Natural Language Processing II.
Selected papers from RANLP, volume 97, pages
327–338.
Roberto Navigli and Simone Paolo Ponzetto. 2010.
BabelNet: building a very large multilingual semantic network. In Proceedings of the 48th Annual
Meeting of the Association for Computational
Linguistics, ACL ’10, pages 216–225, Stroudsburg,
PA, USA. Association for Computational
Linguistics. ACM ID: 1858704.
Martin Benjamin. 2014a. Collaboration in the
Production of a Massively Multilingual Lexicon. In
LREC 2014 Conference Proceedings. Reykjavik
(Iceland).
Antoni Oliver. 2014. WN-Toolkit: Automatic
generation of WordNets following the expand
model. In Proceedings of the 7th International
Global WordNet Conference, Tartu (Estonia),
pages 7-15.
Martin Benjamin. 2014b. Elephant Beer and Shinto
Gates: Managing Similar Concepts in a
Multilingual Database. In Proceedings of the 7th
International Global WordNet Conference. Tartu
(Estonia).
Emanuele Pianta, Luisa Bentivogli, and Christian
Girardi. 2002. MultiWordNet: Developing an
aligned multilingual database. In Proceedings of
the 1st Int’l Conference on Global WordNet,
Mysore (India), 293-302.
Martin Benjamin. 2015. Crowdsourcing Microdata for
Cost-Effective and Reliable Lexicography. In
AsiaLex 2015 Conference Proceedings, Hong
Kong.
Benoît Sagot and Darja Fišer. 2012. Automatic
Extension of WOLF. In Proceedings of the 6th
International Global Wordnet Conference, Matsue,
(Japan).
Francis Bond and Ryan Foster. 2013. Linking and
extending an open multilingual wordnet. In 51st
Annual Meeting of the Association for
Computational Linguistics: ACL-2013, pages
1352–1362.
Piek Vossen, Laura Bloksma, Horacio Rodriguez,
Salvador Climent, Nicoletta Calzolari, Adriana
Roventini, Francesca Bertagna, Antonietta Alonge,
and Wim Peters. 1998. The EuroWordNet Base
Concepts and Top Ontology. EuroWordNet Deliverable D017D034D036.
Magdalena Derwojedowa, Maciej Piasecki, Stanisław
Szpakowicz, Magdalena Zawisłavska, and Bartosz
Broda. 2008. Words, Concepts and Relations in the
Construction of Polish WordNet. In Proceedings of
the Global WordNet Conference 2008, Szeged
(Hungary), pages 167–68.
Piek Vossen. 1998. Introduction to EuroWordNet.
Computers and the Humanities, 32(2-3):73–89.
Christiane Fellbaum, editor. 1998. WordNet: An
Electronic Lexical Database. Cambridge, MA:
MIT Press.
Indeewari Wijesiri et al. 2014. Building a WordNet for
Sinhala. In Proceedings of the 7th International
Global WordNet Conference. Tartu (Estonia).
Christiane Fellbaum and Piek Vossen. 2007.
Connecting the Universal to the Specific: Towards
the Global Grid. In Proceedings of the First
International Workshop on Intercultural Communication, Kyoto (Japan).
Gerard de Melo and Gerhard Weikum. 2008. On the
utility of automatically generated wordnets. In
Proceedings of 4th Global WordNet Conference,
GWC 2008, Szeged, Hungary. University of
Szeged. Pages 147–161.
George A. Miller, Richard Beckwith, Christiane
Fellbaum, Derek Gross, and Katherine J. Miller.
1990. Introduction to wordnet: an on-line lexical
database. International Journal of Lexicography,
3(4):235–244.
Ancient Greek WordNet meets the Dynamic Lexicon: the example of the
fragments of the Greek Historians
Yuri Bizzoni
Monica Berti
Federico Boschetti
Gregory R. Crane
Riccardo Del Gratta
Tariq Yousef
Institute of Computer Science, University of Leipzig, Leipzig, Germany
{name.surname}@uni-leipzig.de
CNR-ILC “A. Zampolli”, Via Moruzzi 1, Pisa, Italy
{name.surname}@ilc.cnr.it
Abstract
The Ancient Greek WordNet (AGWN) and the Dynamic Lexicon (DL) are multilingual resources to study the lexicon of Ancient Greek texts and their translations. Both AGWN and DL are works in progress that need accuracy improvement and manual validation. After a detailed description of the current state of each work, this paper illustrates a methodology to cross AGWN and DL data, in order to mutually score the items of each resource according to the evidence provided by the other resource. The training data is based on the corpus of the Digital Fragmenta Historicorum Graecorum (DFHG), which includes ancient Greek texts with Latin translations.

1 Introduction
The Ancient Greek WordNet (AGWN) and the Dynamic Lexicon (DL), which will be illustrated in detail in the next sections (see Sections 2 and 4), are complementary resources to study the Ancient Greek lexicon. AGWN is based on the paradigmatic axis provided by bilingual dictionaries, while DL is based on the syntagmatic axis provided by historical and literary texts aligned to their scholarly translations. Both of them have been created automatically and they need to be corrected and extended. In this specific case the data is taken from the Digital Fragmenta Historicorum Graecorum (DFHG), which is a corpus of quotations and text reuses of ancient Greek lost historians and their Latin translations provided by the editor Karl Müller (Berti et al., 2014-2015; Yousef, 2015).[1] This corpus is part of LOFTS (Leipzig Open Fragmentary Texts Series) at the Humboldt Chair of Digital Humanities at the University of Leipzig. We have been using this collection because it is big enough to include many different sources preserving information about Greek historians. Instead of working with extant authors, the DFHG allows us to focus on specific topics related to ancient Greek lost historiography and on the language of text reuse within this domain. The working hypothesis is that the evidence provided by Dynamic Lexicon Greek-Latin pairs is relevant to score the Greek word - conceptual node (synset) associations in the Ancient Greek WordNet and, on the other hand, that the evidence provided by AGWN Greek word - Latin translations is relevant to score the DL Greek-Latin pairs.

2 Ancient Greek WordNet
The creation of the Ancient Greek WordNet has been outlined in (Bizzoni et al., 2014). It is based on digitized Greek-English bilingual dictionaries (in particular the Liddell-Scott-Jones and the Middle Liddell provided by the Perseus Project[2]): first, Greek-English pairs (Greek words and English translations) are extracted from the dictionaries; then, the English word is projected onto the Princeton WordNet (PWN) (Fellbaum, 1998). If the English word is in PWN, then its synsets are assigned to the Greek word; the same goes for its lexical relations with other lemmas. Thus AGWN is created by “bootstrapping” data from different datasets. As a bootstrapped process, its result is quite inaccurate. For example, induced polysemy (from English) maps the Greek verb ἔχω -échō- over 170 English words (including “cut”, “make”, “brake”...). On the contrary, when the English word is not in PWN, the Greek word of the pair is excluded from AGWN, thus strongly reducing the coverage of AGWN for the entire Greek lexicon to ca. 30%.

[1] http://opengreekandlatin.github.io/dfhg-dev/
[2] http://www.perseus.tufts.edu
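The bootstrapping procedure described in Section 2 can be sketched as a projection of dictionary pairs onto PWN synsets. The dictionary sample and the toy PWN mapping below are invented for illustration; the real pipeline uses the full LSJ/Middle Liddell dictionaries and the full Princeton WordNet.

```python
# Minimal sketch of the AGWN bootstrapping step: each Greek lemma inherits
# every synset of its English dictionary translations. All data here is a
# toy stand-in for the real dictionaries and PWN.

# (greek_lemma, english_translation) pairs from the bilingual dictionary
dictionary_pairs = [
    ("echo", "have"),
    ("echo", "hold"),
    ("legein", "say"),
    ("legein", "speak"),
    ("foo", "notaword"),   # English side absent from PWN
]

# toy stand-in for PWN: English lemma -> synset ids
pwn = {
    "have": ["have.v.01", "own.v.01"],
    "hold": ["hold.v.01", "have.v.01"],
    "say": ["state.v.01"],
    "speak": ["talk.v.01", "speak.v.02"],
}

def bootstrap(pairs, wordnet):
    agwn = {}
    for greek, english in pairs:
        # a Greek word whose English translation is not in PWN is dropped,
        # which is what reduces AGWN's coverage of the Greek lexicon
        for synset in wordnet.get(english, []):
            agwn.setdefault(greek, set()).add(synset)
    return agwn

agwn = bootstrap(dictionary_pairs, pwn)
print(sorted(agwn["echo"]))  # ['have.v.01', 'hold.v.01', 'own.v.01']
```

The same mechanism produces the induced polysemy noted above: every sense of every English translation is inherited, whether or not it fits the Greek word.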
Currently, AGWN is linked not only to PWN, but also to other WordNets, in particular to the Latin WordNet (LWN) (Minozzi, 2009) and to the Italian WordNet (IWN) (Roventini et al., 2003). The way these WordNets are interconnected follows the guidelines illustrated in (Vossen, 1998; Rodríguez et al., 2008), by using English as the bridge language. As a consequence, Greek and Latin and/or Greek and Italian are linked through the common sense(s) in English.

3 The conceptual structure of Ancient Greek WordNet
Sharing a unique conceptual network among different languages is a good solution when the civilizations expressed by those languages are very similar, due to the effects of globalization. In this case, only a few conceptual nodes must be inserted when a concept is lexicalized in the source language but not in the target language, and few nodes must be deactivated when a concept is lexicalized only in the target language, but not in the source language.
On the contrary, when the civilizations expressed by the source and the target languages are highly dissimilar, the conceptual network needs to be heavily restructured.
As illustrated in the introduction, the conceptual network of AGWN is originally based on PWN, but the glosses of the synsets and the semantic relations can be modified through a web interface.[3]

4 Dynamic Lexicon
The Dynamic Lexicon is a growing multilingual resource constituted by bilingual dictionaries (Greek/English, Latin/English, Greek/Latin), which have been created through the direct automated alignment of original texts with their translations or through a triangulation with a bridge language.
The first version of the DL[4] is a National Endowment for the Humanities (NEH)[5] co-funded project developed at Tufts University (Medford, MA) by the Perseus Project, whereas the second version is under development at the University of Leipzig by the Open Philology Project.[6]

5 Bilingual Dictionary Extraction
This section investigates a simple and effective method for automatic extraction of a bilingual lexicon (Ancient Greek/Latin) from the available aligned bilingual texts (Greek/English and Latin/English) in the Perseus Digital Library, using English as a bridge language.
The data comes from the corpus of the DFHG and consists of 163 parallel documents aligned at word level (104 Ancient Greek/English files and 59 Latin/English). The Greek-English dataset consists of approximately 210K sentence pairs with 4.32M Greek words, whereas the Latin-English dataset consists of approximately 123K sentence pairs with 2.33M Latin words. The parallel texts are aligned on a sentence level using Moore's Bilingual Sentence Aligner (Moore, 2002), which aligns the sentences with very high precision (one-to-one alignment).[7] Then the GIZA++ toolkit[8] is used to align the sentence pairs at the level of individual words. Table 1 gives statistics about the DFHG parallel corpus, while Figure 1 displays the workflow used. Note that the number of words in Table 1 is the total number of words in the documents, whereas the aligned pairs are the number of aligned words in the documents. Some words are not aligned at all, therefore the number of aligned words is smaller than the total number of words.

                 Ancient Greek    Latin
Files                  104          59
Sentences             210K         132K
Words                4.32M        2.33M
Aligned words        3.34M        1.71M
Distinct words        872K         575K

Table 1: Size of the corpora.

Figure 1: Explanation of the method.

5.1 Preprocessing
The data sets provided by the workflow in Figure 1 are available in XML format. Each document is identified (through an id) in the Perseus Digital Library and consists of sentences in the original language (Ancient Greek or Latin) and their translation in English, as reported in Figure 2 (A). Each Latin or Greek word is aligned to one word in the English text (one-to-one alignment), but in some cases a word in the original language can be aligned to many words (one-to-many / many-to-one) or not aligned at all, cf. Figure 2 (B).

Figure 2: The aligned sentences in XML format.

Lemmatization of English translations produces better results, because it reduces the number of translation candidates, as this example shows: the Greek word λέγειν -légein- is translated with (“say”, “speak”, “tell”, “speaking”, “said”, “saying”, “mention”, “says”, “spoke”). Many of the translation candidates share the same lemma (say for “said”, “saying”, “says”; speak for “speaking”, “spoke”). Before the lemmatization there were 9 translation candidates and after the lemmatization there are only four candidates, with correspondingly changed frequencies. Table 2 shows how the lemmatization process recalculates the frequencies and percentages of each single translation.

Lemma     Freq.    %       Word       Freq.    %
say        719    46.8     say         551    36
                           said         89     6
                           saying       54     3.5
                           says         25     1.5
speak      621    40.6     speak       492    32
                           speaking    110     7
                           spoke        19     1.2
tell       149     9.7     tell        149     9.7
mention     45     2.9     mention      45     2.9

Table 2: Lemmas and words: frequencies and percentages.

5.2 Triangulation
Triangulation is based on the assumption that two expressions are likely to be translations if they are translations of the same word in a third language. We use triangulation to extract the Greek-Latin pairs via English. In order to do that, we query our datasets to get the Greek and Latin words that share the same English translation, along with their frequencies; see Figure 3.
The English word ship is associated to the Greek word ναῦς -naûs- (54.8%), to ναός -naós- (21.5%), and so on; the same English word ship is associated to the Latin word navis (65.3%), to no (23.8%), and so on.
The pairs extracted via triangulation are the correct association {ναῦς, navis} and the wrong associations {ναῦς, no} (ship - to swim), {ναός, navis} (temple - ship), {ναός, no} (temple - to swim). These pairs do not have the same level of relatedness, therefore we have to filter the results to keep only strongly related pairs, as exposed in Section 5.3.

Figure 3: An example of triangulation.

5.3 Translation-Pairs filtering
The translation pairs are not completely correct, because there are still some translation errors. In order to eliminate incorrect pairs, we use a similarity metric to measure the similarity or relatedness of every Greek-Latin pair. The Jaccard coefficient (Jaccard, 1901) measures the similarity between finite sample sets (in our case two sets), and is defined as the size of the intersection divided by the size of the union of the sample sets:

J = |A ∩ B| / |A ∪ B|    (1)

A and B in equation 1 are two vectors of translation probabilities (Greek-English, Latin-English). For example, the relatedness[9] between the Greek word πόλις and the Latin word civitas is reported in Figure 4.

Figure 4: Use of the Jaccard algorithm for aligning πόλις to civitas.

We have to determine a threshold to classify the translation pairs as accepted or not accepted. A high threshold yields a high-accuracy lexicon with fewer entries, whereas a low threshold produces more translation pairs with lower accuracy. The accuracy of the method depends on two factors:
- The size of the aligned parallel corpora plays an important role in improving the accuracy of the produced lexicon: bigger corpora produce a better translation probability distribution and more translation candidates, which yield a more accurate lexicon. In addition, bigger corpora cover more words.
- The quality of the aligner used to align the parallel corpora: manually aligned corpora yield more accurate results, whereas automatic alignment tools produce some noisy translations; in our case GIZA++ has been used to align the parallel corpora.

[3] http://www.languagelibrary.eu/new_ewnui
[4] http://nlp.perseus.tufts.edu/lexicon
[5] http://www.neh.gov/about
[6] http://www.dh.uni-leipzig.de
[7] Sentences have been segmented using punctuation marks excluding commas.
[8] GIZA++ is an extension of the program GIZA, which was developed by the Statistical Machine Translation team at the Center for Language and Speech Processing at Johns Hopkins University.
[9] In the calculation we use the fact that city and state are shared English translations between πόλις -pólis- and civitas.
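The triangulation and Jaccard filtering of Sections 5.2-5.3 can be sketched end to end. The translation tables below are toy data, the threshold is a hypothetical placeholder, and sets of translations stand in for the probability-weighted vectors used in the paper.

```python
# Sketch of triangulation (pair words sharing an English translation)
# followed by Jaccard filtering over English translation sets.
# Toy data; the real pipeline uses GIZA++ alignment frequencies.

greek_to_english = {
    "naus": {"ship", "vessel"},
    "naos": {"temple", "ship"},      # noisy alignment: temple word hits 'ship'
    "polis": {"city", "state", "town"},
}
latin_to_english = {
    "navis": {"ship", "vessel"},
    "no": {"swim", "ship"},          # noisy alignment: 'to swim' hits 'ship'
    "civitas": {"city", "state"},
}

def triangulate(grc, lat):
    """Pair Greek and Latin words that share at least one English translation."""
    return {(g, l) for g, gs in grc.items() for l, ls in lat.items() if gs & ls}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, as in equation (1)."""
    return len(a & b) / len(a | b)

THRESHOLD = 0.5  # hypothetical cut-off between accepted and rejected pairs

candidates = triangulate(greek_to_english, latin_to_english)
accepted = {(g, l) for g, l in candidates
            if jaccard(greek_to_english[g], latin_to_english[l]) >= THRESHOLD}
print(sorted(accepted))  # [('naus', 'navis'), ('polis', 'civitas')]
```

With this toy data the correct pairs {ναῦς, navis} and {πόλις, civitas} survive the filter, while the noisy pairs produced by the ambiguous bridge word ship are discarded, mirroring the behaviour described in the text.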
6 Evaluating and extending the AGWN through evidence provided by the Dynamic Lexicon and vice versa
Students and scholars who evaluate and extend the AGWN synset items need to compare online dictionaries and other lexical resources. The DL can provide evidence for this purpose, especially to discover relevant missing correspondences. An example should clarify.
In AGWN we can find the association minister (eng) / minister (lat) / διάκτορος -diáktoros- (grc), but not minister (eng) / minister (lat) / διάκονος -diákonos- (grc), which is instead provided by the DL. If we consult the bilingual dictionary Liddell-Scott-Jones, we find out that διάκτορος is “taken as minister, = διάκονος”. The automatic parser used to bootstrap AGWN from bilingual dictionaries has not processed this information, so the DL provides a hint for the integration of this missing item into the correct synset of AGWN.
Complementarily, the DL is missing the triplet minister (eng) / minister (lat) / διάκτορος (grc), which would be a relevant translation, even if not attested in the aligned bilingual texts of the training corpus. Moreover, AGWN can be used to add scoring criteria to the DL system, by tuning the results with a further piece of evidence, which reinforces the Jaccard score.
For example, the score of the correct association {ναῦς, navis}, discussed in Section 5.2, is reinforced, due to its presence in AGWN, whereas the scores of the wrong associations {ναῦς, no}, {ναός, navis} and {ναός, no} are weakened, due to their absence in AGWN.

7 Future work
The next step is the creation of a gold standard both for AGWN and for DL, in order to quantify the gain in terms of precision and recall that we can obtain by crossing AGWN and DL data.

8 Conclusion
In conclusion, we think that the paradigmatic approach, by extraction of bilingual pairs from dictionaries, and the syntagmatic approach, by extraction of bilingual pairs from aligned texts, are complementary for the study of Ancient Greek semantics, and that they can be integrated in order to mutually improve the performance of both.

References
Monica Berti, Bridget Almas, David Dubin, Greta Franzini, Simona Stoyanova, and Gregory Crane. 2014-2015. The Linked Fragment: TEI and the Encoding of Text Reuses of Lost Authors. Journal of the Text Encoding Initiative, 8.
Yuri Bizzoni, Federico Boschetti, Harry Diakoff, Riccardo Del Gratta, Monica Monachini, and Gregory Crane. 2014. The Making of Ancient Greek WordNet. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May.
Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, Cambridge, MA, USA.
Paul Jaccard. 1901. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547-579.
Stefano Minozzi. 2009. The Latin WordNet Project. In Peter Anreiter and Manfred Kienpointner, editors, Latin Linguistics Today. Akten des 15. Internationalen Kolloquiums zur Lateinischen Linguistik, volume 137 of Innsbrucker Beiträge zur Sprachwissenschaft, pages 707-716.
Robert C. Moore. 2002. Fast and accurate sentence alignment of bilingual corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, AMTA '02, pages 135-144, London, UK. Springer-Verlag.
Horacio Rodríguez, David Farwell, Javi Farreres, Manuel Bertran, M. Antonia Martí, William Black, Sabri Elkateb, James Kirk, Piek Vossen, and Christiane Fellbaum. 2008. Arabic WordNet: Current State and Future Extensions. In Proceedings of the Fourth International Global WordNet Conference (GWC 2008), pages 387-406, January.
Adriana Roventini, Antonietta Alonge, Francesca Bertagna, Nicoletta Calzolari, Christian Girardi, Bernardo Magnini, Rita Marinelli, and Antonio Zampolli. 2003. ItalWordNet: building a large semantic database for the automatic treatment of Italian. Computational Linguistics in Pisa, Special Issue, XVIII-XIX, Pisa-Roma, IEPI, 2:745-791.
Piek Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Norwell, MA, USA.
Tariq Yousef. 2015. Word Alignment and Named-Entity Recognition applied to Greek Text Reuse. Master's thesis, Alexander von Humboldt Lehrstuhl für Digital Humanities, Universität Leipzig.
IndoWordNet::Similarity
Computing Semantic Similarity and Relatedness using IndoWordNet
Sudha Bhingardive, Hanumant Redkar, Prateek Sappadla, Dhirendra Singh,
Pushpak Bhattacharyya
Center for Indian Language Technologies,
Indian Institute of Technology Bombay, India
{bhingardivesudha, hanumantredkar, prateek2693, dhiru.research,
pushpakbh}@gmail.com
Abstract
Semantic similarity and relatedness measures play an important role in natural language processing applications. In this paper, we present the IndoWordNet::Similarity tool and interface, designed for computing the semantic similarity and relatedness between two words in IndoWordNet. A Java-based tool and a web interface have been developed to compute this semantic similarity and relatedness, and a Java API has been developed for the same purpose. The tool, web interface and API are made available for research purposes.

1 Introduction
Semantic similarity is a concept whereby a set of words is assigned a metric based on the likeness of their semantic content. It is easy for humans, with their cognitive abilities, to judge the semantic similarity between two given words or concepts. For example, a human can quite easily say that the words apple and mango are more similar than the words apple and car. There is some understanding of how humans are able to perform this task of assigning similarities. However, measuring similarity computationally is a challenging task that has attracted a considerable amount of research interest over the years. Another term very closely related to similarity is semantic relatedness. For example, money and bank would seem to be more closely related than money and cash. In the past, various measures of similarity and relatedness have been proposed. These measures are developed based on the lexical structure of the WordNet, statistical information derived from corpora, or a combination of both. They are now widely used in various natural language processing applications such as Word Sense Disambiguation, Information Retrieval, Information Extraction, Question Answering, etc.
We have developed the IndoWordNet::Similarity tool, interface and API for computing semantic similarity or relatedness for the Indian languages using IndoWordNet.
The paper is organized as follows. Section 2 describes IndoWordNet. Semantic similarity and relatedness measures are discussed in Section 3. Section 4 details IndoWordNet::Similarity. Related work is presented in Section 5. Section 6 concludes the paper and points to future work.

2 IndoWordNet
WordNet[1] is a lexical resource composed of synsets and semantic relations. A synset is a set of synonyms representing a distinct concept. Synsets are linked by basic semantic relations like hypernymy, hyponymy, meronymy, holonymy, troponymy, etc. and lexical relations like antonymy, gradation, etc. IndoWordNet (Bhattacharyya, 2010) is the multilingual WordNet for Indian languages. It includes eighteen Indian languages, viz. Assamese, Bengali, Bodo, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Malayalam, Manipuri, Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil, Telugu and Urdu. Initially, the Hindi WordNet[2] was created manually, taking reference from the Princeton WordNet. Similarly, other Indian language WordNets were created from the Hindi WordNet using the expansion approach and following the three principles of synset creation. In this paper, we present the IndoWordNet::Similarity tool, interface and API, which help in computing similarity and relatedness of words / concepts in Indian language WordNets.

[1] http://wordnet.princeton.edu/
[2] http://www.cfilt.iitb.ac.in/indowordnet/
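The expansion approach described above can be sketched as languages sharing one synset backbone: synsets are created once, and each language attaches its own lemmas to the shared synset ids. The synset ids, glosses and lemmas below are invented for illustration; real IndoWordNet data is far richer.

```python
# Toy sketch of the expansion model: one shared conceptual backbone,
# per-language lexicalizations keyed by synset id. All data is invented.

synsets = {
    "SYN000": {"gloss": "fruit", "relations": {}},
    "SYN001": {"gloss": "a sweet edible fruit", "relations": {"hypernym": "SYN000"}},
}

# language -> synset id -> lemmas; Hindi is the pivot language
lexicalizations = {
    "hindi": {"SYN001": ["aam"], "SYN000": ["phal"]},
}

def expand(language, mapping):
    """The 'expansion' step: attach another language's lemmas to shared synsets."""
    lexicalizations[language] = mapping

expand("marathi", {"SYN001": ["amba"]})

def translations(lemma, src, tgt):
    """Find lemma's synsets in src, return the tgt lemmas of those synsets."""
    out = []
    for sid, lemmas in lexicalizations[src].items():
        if lemma in lemmas:
            out += lexicalizations[tgt].get(sid, [])
    return out

print(translations("aam", "hindi", "marathi"))  # ['amba']
```

Because all languages index into the same synset ids, cross-lingual lookups (and the similarity measures of the next section) come for free once a language's lemmas are attached.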
3 Overview of Semantic Similarity and Relatedness Measures
Over the years, various semantic similarity and relatedness measures have been proposed. These measures are classified based on path length, information content and gloss overlap. Some of them are described below.

3.1 Path Length Based Measures
These measures are based on the length of the path linking two synsets and the position of the synsets in the WordNet taxonomy.

3.1.1 Shortest Path Length Measure
This is the most intuitive way of measuring the similarity between two synsets. It calculates the semantic similarity between a pair of synsets depending on the number of links existing between them in the WordNet taxonomy. The shorter the path between them, the more related they are. The inverse relation between the length of the path and similarity can be characterized as follows:

sim_path(S1, S2) = 2D − length(S1, S2)

where S1 and S2 are synsets, length(S1, S2) is the length of the shortest path between them, and D is the maximum depth of the taxonomy.

3.1.2 Leacock and Chodorow's Measure
This measure, proposed by Leacock and Chodorow (1998), computes the length of the shortest path between two synsets and scales it by the depth D of the IS-A hierarchy:

sim_LCh(S1, S2) = −log( length(S1, S2) / 2D )

where S1 and S2 are the synsets and D represents the maximum depth of the taxonomy.

3.1.3 Wu and Palmer's Measure
This measure, proposed by Wu and Palmer (1994), calculates the similarity by considering the depths of the two synsets, along with the depth of their lowest common subsumer (LCS). The formula is given as:

sim_WuP(S1, S2) = 2 · depth(LCS(S1, S2)) / ( depth(S1) + depth(S2) )

where S1 and S2 are the synsets and LCS(S1, S2) represents the lowest common subsumer of S1 and S2.

3.2 Information Content Based Measures
These measures are based on the information content of the synsets. The information content of a synset measures the specificity or generality of that synset, i.e. how specific to a topic the synset is.

3.2.1 Resnik's Measure
Resnik (1995) defines the semantic similarity of two synsets as the amount of information they share in common:

sim_Res(S1, S2) = IC(LCS(S1, S2))

This measure depends completely upon the information content of the lowest common subsumer of the two synsets whose relatedness we wish to measure.

3.2.2 Jiang and Conrath's Measure
A measure introduced by Jiang and Conrath (1997) addresses the limitations of the Resnik measure. It incorporates the information content of the two synsets, along with that of their lowest common subsumer:

dist_JC(S1, S2) = IC(S1) + IC(S2) − 2 · IC(LCS(S1, S2))

where IC determines the information content of a synset and LCS determines the lowest common subsuming concept of two given concepts.

Figure 1: IndoWordNet::Similarity Tool
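The path-based measures of Section 3.1 can be sketched over a toy IS-A taxonomy. The taxonomy, the resulting depths and the path-length convention below are invented for illustration; real implementations such as WordNet::Similarity operate on full WordNets, and the information-content measures of Section 3.2 follow the same pattern with corpus-derived IC values in place of depths.

```python
# Toy taxonomy and path-based similarity measures (illustrative only).
import math

parent = {  # child -> hypernym; 'entity' is the root
    "entity": None, "object": "entity", "fruit": "object",
    "apple": "fruit", "mango": "fruit", "vehicle": "object", "car": "vehicle",
}

def depth(s):
    """Depth of synset s, counting the root as depth 1."""
    d = 1
    while parent[s]:
        s, d = parent[s], d + 1
    return d

def ancestors(s):
    out = []
    while s:
        out.append(s)
        s = parent[s]
    return out

def lcs(s1, s2):
    """Lowest common subsumer: first ancestor of s2 also above s1."""
    a1 = ancestors(s1)
    return next(a for a in ancestors(s2) if a in a1)

def path_len(s1, s2):
    """Number of IS-A links between the two synsets, via their LCS."""
    c = lcs(s1, s2)
    return (depth(s1) - depth(c)) + (depth(s2) - depth(c))

D = max(depth(s) for s in parent)  # maximum depth of the taxonomy

def sim_lch(s1, s2):
    """Leacock-Chodorow: -log(len / 2D), with len counted in nodes."""
    return -math.log((path_len(s1, s2) + 1) / (2 * D))

def sim_wup(s1, s2):
    """Wu-Palmer: 2 * depth(LCS) / (depth(S1) + depth(S2))."""
    return 2 * depth(lcs(s1, s2)) / (depth(s1) + depth(s2))

# apple/mango share the close subsumer 'fruit'; apple/car only 'object'
print(sim_wup("apple", "mango") > sim_wup("apple", "car"))  # True
```

The example reproduces the intuition from the introduction: apple and mango come out more similar than apple and car under both measures.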
3.3

Gloss Overlap Measures

Lesk (1986) defines the relatedness in terms of
dictionary definition overlap of given synsets. Further, the extended Lesk measure (Banerjee and
Pedersen, 2003) computes the relatedness between
two synsets by considering their own gloss as well
as by considering the gloss of their related synsets.
4
Depending on the user query the similarity is
calculated and displayed in an output window.
4.1.2
IndoWordNet::Similarity
We have developed IndoWordNet::Similarity tool,
web based interface and API to measure the semantic similarity and relatedness for a pair of
words / synsets in the IndoWordNet.
4.1
4.1.1 User Interface Layout
The main window of the tool is as shown in Figure
1. In order to use this tool, user needs to provide
the following inputs:
 User can enter the pair of words for which
similarity to be computed.
 User can specify the part-of-speech and the
sense number for the given two words for calculating the similarity. If user doesn’t provide
Features

This is system independent
standalone Java Application.

Option such as part-of-speech and sense-id
are optional.

If user doesn’t provide part-of-speech and
sense-id option, then similarity is calculated
for all possible pair of senses of the given
words.

If the virtual root node option is enabled then
one hypothetical root is created which connects all roots of the taxonomy. This allows
similarity values to be calculated between any
pair of nouns or verbs.
IndoWordNet::Similarity Tool
The IndoWordNet::Similarity3 tool is implemented
using Java. The user interface layout and its features are given below.
3
these details then the tool computes the similarity between all possible pair of senses of
the two input words over all parts-of-speech.
Drop-box is provided for selecting the type of
similarity measure.
Check-box is provided for virtual root option.
portable
4.2 IndoWordNet::Similarity API
IndoWordNet::Similarity Application Programming Interface (API) has been developed using Java
which provides functions to compute the semantic
similarity and relatedness using various measures.
API provides three types of functions for each
measure.
1. A function which takes only two words as
http://www.cfilt.iitb.ac.in/iwnsimilarity
41
parameters and returns the similarity score
between all possible senses of the two words.
2. A function which takes two words along with
part-of-speech, sense-id and returns the
similarity score between the particular senses
as specified by the user.
3. A function which takes only two words as
parameters and returns the maximum similarity
between two words among all possible sense
pairs. Some of the API functions are
mentioned below:
API Function
Computes
public SimilarityValue[]
getPathSimilarity( String word1,
String pos1, int sid1, String word2,
String pos2, int sid2, boolean
use_virtual_root)
public SimilarityValue[]
getPathSimilarity(String word1,String
word2,boolean use_virtual_root)
public SimilarityValue
getMaxPathSimilarity(String word1,
String word2, boolean
use_virtual_root)
Path
Similarity
5
WordNet::Similarity4 (Pedersen et. al. 2004) is
freely available software for measuring the semantic similarity and relatedness for English WordNet.
This application uses an open source Perl module
for measuring the semantic distance between
words. It provides various semantic similarity and
relatedness measures using WordNets. Given two
synsets, it returns numeric score showing their degree of similarity or relatedness according to the
various measures that all rely on WordNet in different ways. It also provides support for estimating
the information content values from untagged corpora, including plain text, the Penn Treebank, or
the British National Corpus5.
WS4J6 (WordNet Similarity for Java) provides a pure Java API for several published semantic similarity and relatedness algorithms. WordNet
Similarity is also integrated in NLTK tool7. However, the need to make entirely different application for IndoWordNet lies in its multilingual nature
which supports 19 Indian language WordNets.
Hence, we developed the IndoWordNet::Similarity
tool, web interface and API for calculating the similarity and relatedness.
Path
Similarity
Maximum
Path
Similarity
Table 1. Important functions of IndoWordNet::Similarity API
4.3
Related Work
6
Conclusion
We have developed the IndoWordNet::Similarity
tool, web interface for computing the semantic
similarity and relatedness measures for the IndoWordNet. Also, a java API has also been developed for accessing the similarity measures. The
tool and the API can be used in various NLP areas
such as Word Sense Disambiguation, Information
Retrieval, Information Extraction, Question Answering, etc. In future, the other measures of computing similarity and relatedness shall be integrated
in our tools and utilities.
IndoWordNet::Similarity Web Interface
IndoWordNet::Similarity Web Interface has been
developed using Php and MySql which provides a
simple interface to compute the semantic similarity
and relatedness using various measures. Figure 2
shows the IndoWordNet::Similarity web interface.
References
Satanjeev Banerjee and Ted Pedersen. 2003. Extended
gloss overlaps as a measure of semantic relatedness.
In Proceedings of the Eighteenth International Joint
Conference on Artificial Intelligence, pages 805–810,
Acapulco, August.
4 http://wn-similarity.sourceforge.net/
5 http://corpus.byu.edu/bnc/
6 https://code.google.com/p/ws4j/
7 http://www.nltk.org/howto/wordnet.html

Figure 2. IndoWordNet::Similarity Web Interface
Pushpak Bhattacharyya. 2010. IndoWordNet. In Proceedings of the Language Resources and Evaluation Conference (LREC 2010), Malta.
Jay Jiang and David Conrath. 1997. Semantic similarity
based on corpus statistics and lexical taxonomy. In
Proceedings on International Conference on Research
in Computational Linguistics, pages 19–33, Taiwan.
Claudia Leacock and Martin Chodorow. 1998. Combining local context and WordNet similarity for word
sense identification. In C. Fellbaum, editor, WordNet: An electronic lexical database, pages 265–283.
MIT Press.
Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of SIGDOC '86.
Ted Pedersen, Siddharth Patwardhan, and Jason
Michelizzi. 2004. Wordnet::Similarity - Measuring
the relatedness of concepts. In Proceedings of
AAAI04, Intelligent Systems Demonstration, San Jose, CA, July 2004.
Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference
on Artificial Intelligence, pages 448–453, Montreal,
August.
Zhibiao Wu and Martha Palmer. 1994. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the ACL, Las Cruces, New Mexico.
Multilingual Sense Intersection in a Parallel Corpus
with Diverse Language Families
Giulia Bonansinga
Filologia, Letteratura e Linguistica,
Università di Pisa, Italy
[email protected]
Francis Bond
Linguistics and Multilingual Studies,
Nanyang Technological University
[email protected]
Abstract

Supervised methods for Word Sense Disambiguation (WSD) benefit from high-quality sense-annotated resources, which are lacking for many languages less common than English. There are, however, several multilingual parallel corpora that can be inexpensively annotated with senses through cross-lingual methods. We test the effectiveness of such an approach by attempting to disambiguate English texts through their translations in Italian, Romanian and Japanese. Specifically, we try to find the appropriate word senses for the English words by comparing all the word senses associated with their translations. The main advantage of this approach is that it can be applied to any parallel corpus, as long as large, high-quality inter-linked sense inventories exist for all the languages considered.

1 Introduction

Cross-lingual Word Sense Disambiguation (CL-WSD) is an approach to Word Sense Disambiguation (WSD) that exploits the similarities and the differences across languages to disambiguate text in an automatic fashion. Using existing multilingual parallel corpora for this purpose is a natural choice, as shown by a long series of works in the literature; see for instance Brown et al. (1991), Gale et al. (1992), Ide et al. (2002), Ng et al. (2003), Chan and Ng (2005) and, more recently, Khapra et al. (2011).

As Diab and Resnik (2002) showed, the translation correspondences in a parallel corpus provide valuable semantic information that can be exploited to perform WSD. For instance, Tufiş et al. (2004) used parallel corpora to validate the interlingual alignments in different WordNets (WNs). Specifically, they looked at the sense intersection between the lexical items found in all the reciprocal translations of a parallel corpus.

Gliozzo et al. (2005) showed how CL-WSD can help to sense-annotate a bilingual corpus by looking at the semantic differences in a language pair. Bentivogli and Pianta (2005), on the other hand, focused on how meaning is largely preserved despite those differences, which allows us to transfer the semantic annotation of a text in a certain language to its translation in another language. The sense projection procedure that they used is simple yet powerful, but it can only be applied to corpora in which at least one parallel text is annotated with senses. Nevertheless, any way to produce sense-annotated data is of great benefit to WSD, given the difficulty of coming across such data. This knowledge acquisition bottleneck is still a challenge to address for most languages.

Given the task of annotating an ambiguous word in a multilingual parallel corpus, valuable information can be derived through the comparison of the sets of senses of each of the word's translations. If fewer senses (or only one, in the optimal case) are retained across languages, then the cross-lingual information has helped reduce (or resolve) the ambiguity.

In previous work (Bond and Bonansinga, 2015) we used sense intersection (SI) to annotate a trilingual parallel corpus in English, Italian and Romanian built upon SemCor (SC) (Landes et al., 1998). We summarize the data used and our findings in Section 2. In Section 3 we continue investigating in the same strand by introducing a further language, Japanese, to disambiguate English text. In Section 4 we show how an annotation task can benefit from coarser sense distinctions. In Section 5 we examine thoroughly how, and how much, each additional language helps the automatic sense annotation. We conclude in Section 6 and suggest some future work.
2 Multilingual Sense Intersection

In Bond and Bonansinga (2015) we explored the cross-lingual approaches pioneered by Gliozzo et al. (2005) and Bentivogli and Pianta (2005) to annotate the SC corpus (Landes et al., 1998) and two corpora built upon it from its Italian and Romanian translations. This parallel corpus, though rather small (see Subsection 2.1), is ideal for the task, as it is sense-annotated in all its translations, thus making the evaluation of alternative sense annotation methods straightforward. We briefly present the data used back then and introduce the last component of the corpus, the Japanese SemCor (Bond et al., 2012), which is included in the analysis presented in this paper.

2.1 Data

Developed at Princeton University, SC is a subset of the Brown Corpus of Standard American English (Kučera and Francis, 1967) enriched with sense annotations referring to the WN sense inventory (see Section 2.2).

Bentivogli and Pianta (2005) manually translated 116 SC texts and automatically aligned them to their English counterparts. The sense annotations of the English words were then automatically transferred following the word alignment, leading to the creation of a sense-annotated English-Italian corpus, MultiSemCor (MSC).

With the purpose of providing a Romanian version of SC, Lupu et al. (2005) developed the Romanian SemCor (RSC) (Lupu et al., 2005; Ion, 2007), which shares 50 texts with MSC. Unfortunately, RSC is not word-aligned to any other component of the parallel corpus, which is a requirement for performing sense mapping with any of the mentioned procedures. On the other hand, the sentence alignment is available and we are only interested in content words, so we attempted a word alignment based upon the information already available. First, we aligned all reciprocal translations in the same sentence pair that carry an identical sense annotation. Then, we aligned the remaining content words, if any, using heuristics that exploit PoS information and path similarity in the WN ontology. Finally, we manually checked a sample of the alignments found this way and observed a precision of 97%; of course, errors can only be introduced in the second step, when heuristics are used to align the remaining unaligned content words.

Bond et al. (2012) built a Japanese SemCor (JSC) matching the texts covered in MSC, after porting the sense annotations to WN 3.0 using the mappings provided by Daudé et al. (2003). The sense annotation was carried out through sense projection by exploiting the word alignment, similarly to what Bentivogli and Pianta (2005) did for Italian. 58,265 annotations were automatically transferred to Japanese content words. JSC follows the Kyoto Annotation Format (KAF) (Bosma et al., 2009) and is released under the same license as SC.1

Table 1 summarizes the basic statistics of each corpus. For English and Italian we also specify the number of target words after the migration to WordNet 3.0 (WN 3.0). Table 2 shows more clearly, in terms of number of sentences, the alignments available for each language pair.

       Texts   Tokens    Target words   After mapping
EN     116     258,499   119,802        118,750
IT     116     268,905    92,420         92,022
RO      82     175,603    48,634              =
JP     116     150,555   119,802              =

Table 1: Statistics for each component of the multilingual parallel corpus built from SemCor.
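The two-step alignment described above for RSC can be sketched as follows. The data layout (word, PoS, sense) and the PoS-equality fallback are simplifications of the actual heuristics, which also use path similarity in the WN ontology:

```python
def align_content_words(src, tgt):
    """Align two parallel sentences given as lists of (word, pos, sense) tuples.
    Step 1 links words carrying an identical sense annotation; step 2 falls
    back to a PoS-match heuristic for the remaining content words."""
    links, used_tgt = [], set()
    # Step 1: words with identical sense annotations are reciprocal translations.
    for i, (_, _, sense) in enumerate(src):
        for j, (_, _, sense2) in enumerate(tgt):
            if j not in used_tgt and sense is not None and sense == sense2:
                links.append((i, j))
                used_tgt.add(j)
                break
    aligned_src = {i for i, _ in links}
    # Step 2: heuristic fallback for the words left unaligned.
    for i, (_, pos, _) in enumerate(src):
        if i in aligned_src:
            continue
        for j, (_, pos2, _) in enumerate(tgt):
            if j not in used_tgt and pos == pos2:
                links.append((i, j))
                used_tgt.add(j)
                break
    return links
```

Only links produced by the second, heuristic step can introduce errors, which is consistent with the 97% precision observed on the manually checked sample.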
2.2 Sense Inventories

When MSC was released, MultiWordNet2 (MWN) (Pianta et al., 2002), a multilingual WordNet aligned to Princeton WN 1.6, was used. As described in Bond and Bonansinga (2015), we ported all sense annotations in MSC to WN 3.0, so as to make possible a comparison between the different components of the parallel corpus.

1 Both the Japanese WordNet and the Japanese SemCor are available at: http://compling.hss.ntu.edu.sg/wnja/index.en.html
2 http://multiwordnet.fbk.eu/

Language pair   Aligned sentences
EN-IT           12,842
EN-RO            4,974
EN-JP           12,781
IT-RO            4,974
IT-JP           12,781
RO-JP            4,913

Table 2: Number of aligned sentences for each language pair.
To this aim, we used automatically inferred mappings (Daudé et al., 2000; Daudé et al., 2001). However, the changes that occurred between WN versions 1.6 and 3.0 led to the loss of 4,631 sense annotations (1,204 types, half of which are adjective satellites).

The Romanian WordNet (RW), created within the BalkaNet project (Stamou et al., 2002) and since then grown considerably as an independent effort (Barbu Mititelu et al., 2014), includes synsets mapped to WN 3.0 with a precision of 95% (Tufiş et al., 2013).

The Japanese WN (JWN) (Isahara et al., 2008; Bond et al., 2009a; Bond et al., 2009b), originally developed by the National Institute of Information and Communications Technology (NICT) and first released in 2009, is a large-scale semantic dictionary of Japanese available under the WordNet license.
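The annotation porting across WN versions described above amounts to applying a synset mapping and dropping annotations whose senses have no image in the new version. A minimal sketch (the ID format and data layout are hypothetical):

```python
def port_annotations(annotations, mapping):
    """Port (word, synset_id) annotations to a new WordNet version.
    Senses absent from the mapping are lost, as happened with the 4,631
    annotations dropped in the 1.6 -> 3.0 migration."""
    ported, lost = [], []
    for word, old_id in annotations:
        if old_id in mapping:
            ported.append((word, mapping[old_id]))
        else:
            lost.append((word, old_id))
    return ported, lost
```

In practice the Daudé et al. mappings are probabilistic and many-to-many, so a real port also has to choose among candidate target synsets rather than do a plain dictionary lookup.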
In Table 3 we give basic coverage statistics for the WNs of our target languages. The Open Multilingual WordNet (OMW)3 is an open-source multilingual database that connects all open WNs linked to the English WN, including Italian (Pianta et al., 2002) among the 28 languages supported (Bond and Paik, 2012; Bond and Foster, 2013). A convenient interface to OMW is provided by the Python module NLTK4 (Bird et al., 2009).

3 http://compling.hss.ntu.edu.sg/omw/summx.html
4 http://www.nltk.org

            Synsets   Senses
English     117,659   206,978
Italian      34,728    69,824
Romanian     59,348    85,238
Japanese     57,184   158,069

Table 3: Coverage of the WNs used.

2.3 Findings

For the sake of completeness: in previous work we also performed sense projection on the Italian and Romanian corpora using English as a pivot, scoring a precision of over 90% in both cases. As for SI, we report the previous precision and coverage scores obtained through trilingual SI in Table 4, along with the Most Frequent Sense (MFS) baseline, which assigns each word its most frequent sense. In this step, sense frequency statistics (SFS) are therefore necessary, but unfortunately there are very few sense-annotated corpora from which such statistics can be derived. In the case of SC the issue is even more crucial, because the WN SFS are computed on SC itself. So, whenever the first sense of a lemma follows a ranking order, we are using biased statistics.

Generally speaking, the coverage scores were quite good and higher with the MFS baseline. As for precision, the gap between SI and the baseline is smaller, probably due to the bias just mentioned. On the other hand, in languages other than English the contribution of SFS is not as decisive, and SI performs better than the baseline, particularly so in the case of Italian.

3 Multilingual Sense Intersection with languages from different families

The theoretical justification behind Multilingual Sense Intersection (SI) is that an ambiguous word will often be translated into different words in another language. As a consequence, knowledge of all the senses associated with its translations can help detect the sense actually intended in the original text. More commonly, such a comparison will help reduce the ambiguity, but it will not identify one single, shared sense. On the other hand, a text whose ambiguity was progressively reduced through automatic methods can be completely disambiguated by a human annotator at a lower cost. Moreover, the more languages are available for comparison in the parallel corpus, the more likely it is that SI actually manages to discern the correct sense in context.

Differently from our previous work, where we disambiguated all texts aligned with at least one other language, in the following section we show results computed over 49 texts. These are the subset of the corpus shared across all four components and for which we have alignments. The result is an even smaller corpus, but it can show more effectively the contribution of up to three additional languages.

Given an ambiguous word, all its translations provide their "sets of senses", as retrieved from the shared sense inventory. Then, intersection is performed over every non-empty set; it succeeds when the final overlap contains only one sense, meaning that the target word has been disambiguated. Otherwise, the overlap is further intersected with the top most frequent senses available for the target lemma, and we note whether the sense selected was the most frequent one.
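The intersection-with-fallback procedure can be sketched as follows. The sense IDs are hypothetical, and `freq_ranking` stands for the sense frequency statistics of the target lemma, from most to least frequent:

```python
def sense_intersection(translation_senses, freq_ranking):
    """Intersect the candidate sense sets contributed by each translation.
    Returns (sense, how): 'intersection' if the overlap is a single sense,
    'frequency' if the SFS fallback was needed, 'unassigned' otherwise."""
    overlap = None
    for senses in translation_senses:
        if not senses:  # languages contributing no candidates are skipped
            continue
        overlap = set(senses) if overlap is None else overlap & senses
    if overlap and len(overlap) == 1:
        return overlap.pop(), "intersection"
    # Fallback: take the most frequent sense among the surviving candidates.
    candidates = overlap if overlap else set(freq_ranking)
    for sense in freq_ranking:
        if sense in candidates:
            return sense, "frequency"
    return None, "unassigned"
```

Tracking how each word was resolved makes it easy to measure both the coverage of pure intersection and how often the (biased) frequency statistics had to step in.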
As before, we resort to sense frequency statistics (SFS) whenever the target word is not yet disambiguated after SI. These frequencies were calculated over all texts in the corpus except the one being annotated.

4 Introducing coarse-grained senses

Sense inventories are a crucial part of this approach. Not only are sufficient coverage and alignment to the Princeton WN necessary; when it comes to deciding how to delimit close, very specific senses, a trade-off between the detail of the sense description and its actual usability in real contexts is desirable.

The fine granularity of WN senses can occasionally, depending on the application, be more of a practical disadvantage than a quality. In this analysis, for instance, error analysis suggested that the senses found through SI were often very close, but they may be discarded as wrong outputs just because one language has a WN that is more developed and granular than another. We should also bear in mind that the correct senses against which we evaluate were picked by trained human annotators in the first place, and human annotators tend to describe a word as precisely as possible.

Conscious of this limit, Navigli (2006) devised an automatic methodology for finding a reasonable sense clustering for the senses in WN 2.1. Sense clustering can be of great help when minor sense distinctions can be ignored, allowing a coarse-grained evaluation. He found 29,974 main clusters, some of which were manually validated by an expert lexicographer for the Semeval all-words task.

We mapped the senses in these clusters to WN 3.0, losing 101 of them in the process (typically one-element clusters). When evaluating the results of SI, we performed a coarse-grained evaluation; in particular, whenever the sense found by SI was not correct, we checked whether it was part of a sense cluster and whether the correct sense was in it. If so, we considered the output of the algorithm correct. Table 4 displays the difference in performance when coarse-grained evaluation is employed.

                     English              Italian              Romanian
Method               Precision  Coverage  Precision  Coverage  Precision  Coverage
MFS (baseline)       0.761      0.998     0.599      0.999     0.531      1
3-way Intersection   0.750      0.850     0.653      0.687     0.590      1
Coarse-grained MFS   0.849      0.998     0.761      0.999     0.794      1
Coarse-grained SI    0.778      0.778     0.915      0.915     0.661      1

Table 4: Comparison of the results scored with SI and the MFS baseline.

                     English
Method               Precision  Coverage
Coarse-grained MFS   0.851      0.998
Coarse-grained 4-SI  0.854      0.788

Table 5: Coarse-grained evaluation of the results scored with 4-way SI and the MFS baseline, computed over the shared subset (49 texts).
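The coarse-grained scoring rule of Section 4, which counts a prediction as correct when it falls in the same sense cluster as the gold sense, can be sketched as follows (the clusters shown are hypothetical stand-ins for the Navigli clusters):

```python
def coarse_correct(predicted, gold, clusters):
    """A prediction is coarse-grained correct if it matches the gold sense
    exactly or shares a sense cluster with it."""
    if predicted == gold:
        return True
    return any(predicted in cluster and gold in cluster for cluster in clusters)

# Hypothetical clusters: two senses of 'call' are merged, 'bank#1' stands alone.
CLUSTERS = [{"call#1", "call#3"}, {"bank#1"}]
```

Fine-grained scoring is the special case where every cluster is a singleton, which is why the MFS baseline improves as well under this evaluation.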
5 Evaluation
In Table 4 we show the improvement in precision obtained thanks to coarse-grained evaluation with respect to the results in (Bond and Bonansinga, 2015). English and Italian show a significant improvement of 0.10 and 0.11 respectively. In the case of Romanian, the improvement is not as big, but still meaningful (0.07). Of course, coarse-grained evaluation causes the MFS baseline to improve as well. In the case of English - which, again, is the component most subject to the bias introduced by SFS - the difference decreases a little in perspective, but MFS still beats SI. The case of Italian is unique, in that in both cases, with fine- and coarse-grained senses, SI obtains better precision scores. For Romanian, on the other hand, SI performs better until coarse-grained evaluation is employed, where the improvement achieved by MFS is striking.

In Table 5 we show our latest attempt to disambiguate English text by using the semantic information of its aligned translations in a parallel corpus. The languages that contribute to the disambiguation process are Italian, Romanian and Japanese, and together they manage to beat MFS if coarse-grained senses are considered.
6 Conclusions

For future work, it is important to analyze the progressive improvement that we can achieve by taking into account semantic information from one language at a time, so as to verify whether it is true that very diverse languages contribute the most to the disambiguation process.

As for the sense inventories, it would be interesting to compare different lexical resources for Italian, namely MWN and ItalWordNet (ITW) (Roventini et al., 2002). ITW began as the EuroWordNet Italian database and, while compatible to a certain extent with EuroWordNet, is released in XML format. ITW includes about 47,000 lemmas, 50,000 synsets and 130,000 semantic relations and is currently maintained by the Institute for Computational Linguistics (ILC) of the National Research Council (CNR). An updated version is freely available online.5

Finally, we could easily address, at least for English, the lack of unbiased sense frequency statistics by computing them over the WordNet Gloss Corpus, in which glosses are sense-annotated.6 This corpus alone would provide sense frequencies for 157,300 lemma-PoS pairs.

5 http://datahub.io/dataset/iwn
6 http://wordnet.princeton.edu/glosstag.shtml

Acknowledgments

This research was supported in part by the Erasmus Mundus Action 2 program MULTI of the European Union (2010-5094-7) and the MOE Tier 2 grant That's what you meant: a Rich Representation for Manipulation of Meaning (MOE ARC41/13).

References

Verginica Barbu Mititelu, Stefan Daniel Dumitrescu, and Dan Tufiş. 2014. News about the Romanian Wordnet. In Proceedings of the Seventh Global Wordnet Conference, pages 268–275.

Luisa Bentivogli and Emanuele Pianta. 2005. Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus. Natural Language Engineering, 11(03):247, September.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc., 1st edition.

Francis Bond and Giulia Bonansinga. 2015. Exploring cross-lingual sense mapping in a multilingual parallel corpus. In Second Italian Conference on Computational Linguistics, CLiC-it 2015. To appear.

Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1352–1362. Association for Computational Linguistics.

Francis Bond and Kyonghee Paik. 2012. A survey of WordNets and their licenses. In GWC 2012, pages 64–71.

Francis Bond, Hitoshi Isahara, Sanae Fujita, Kiyotaka Uchimoto, Takayuki Kuribayashi, and Kyoko Kanzaki. 2009a. Enhancing the Japanese WordNet. In Proceedings of the 7th Workshop on Asian Language Resources, pages 1–8. Association for Computational Linguistics.

Francis Bond, Hitoshi Isahara, Kiyotaka Uchimoto, Takayuki Kuribayashi, and Kyoko Kanzaki. 2009b. Extending the Japanese WordNet. In 15th Annual Meeting of the Association for Natural Language Processing.

Francis Bond, Timothy Baldwin, Richard Fothergill, and Kiyotaka Uchimoto. 2012. Japanese SemCor: A sense-tagged corpus of Japanese. In Proceedings of the 6th Global WordNet Conference (GWC 2012), pages 56–63.

Wauter Bosma, Piek Vossen, Aitor Soroa, German Rigau, Maurizio Tesconi, Andrea Marchetti, Monica Monachini, and Carlo Aliprandi. 2009. KAF: a generic semantic annotation format. In Proceedings of the 5th International Conference on Generative Approaches to the Lexicon GL 2009, Pisa, Italy.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1991. Word-sense disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting of the ACL, Morristown, NJ.

Yee Seng Chan and Hwee Tou Ng. 2005. Scaling up word sense disambiguation via parallel texts. In AAAI, volume 5, pages 1037–1042.

Jordi Daudé, Lluís Padró, and German Rigau. 2000. Mapping wordnets using structural information. In 38th Annual Meeting of the Association for Computational Linguistics (ACL'2000), Hong Kong.

Jordi Daudé, Lluís Padró, and German Rigau. 2001. A complete WN1.5 to WN1.6 mapping. In Proceedings of the NAACL Workshop "WordNet and Other Lexical Resources: Applications, Extensions and Customizations", Pittsburgh, PA.

Jordi Daudé, Lluís Padró, and German Rigau. 2003. Validation and tuning of wordnet mapping techniques. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP'03).

Mona Diab and Philip Resnik. 2002. An unsupervised method for word sense tagging using parallel corpora. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 255–262, Stroudsburg, PA, USA. Association for Computational Linguistics.

William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. Using bilingual materials to develop word sense disambiguation methods.

Alfio Massimiliano Gliozzo, Marcello Ranieri, and Carlo Strapparava. 2005. Crossing parallel corpora and multilingual lexical databases for WSD. In Computational Linguistics and Intelligent Text Processing, pages 242–245. Springer.

Nancy Ide, Tomaz Erjavec, and Dan Tufis. 2002. Sense discrimination with parallel corpora. In Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions - Volume 8, pages 61–66. Association for Computational Linguistics.

Radu Ion. 2007. Metode de dezambiguizare semantică automată. Aplicații pentru limbile engleză și română ("Word Sense Disambiguation methods applied to English and Romanian"). Ph.D. thesis, Research Institute for Artificial Intelligence (RACAI), Romanian Academy, Bucharest.

Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto, Masao Utiyama, and Kyoko Kanzaki. 2008. Development of the Japanese WordNet.

Mitesh M. Khapra, Salil Joshi, Arindam Chatterjee, and Pushpak Bhattacharyya. 2011. Together we can: Bilingual bootstrapping for WSD. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 561–569, Stroudsburg, PA, USA. Association for Computational Linguistics.

Henry Kučera and W. Nelson Francis. 1967. Computational Analysis of Present-Day American English.

Shari Landes, Claudia Leacock, and Randee I. Tengi. 1998. Building semantic concordances. In Christiane Fellbaum, editor, WordNet: An Electronic Lexical Database, pages 199–216. MIT Press, Cambridge, MA.

Monica Lupu, Diana Trandabat, and Maria Husarciuc. 2005. A Romanian SemCor aligned to the English and Italian MultiSemCor. In 1st ROMANCE FrameNet Workshop at EUROLAN, pages 20–27.

Roberto Navigli. 2006. Meaningful clustering of senses helps boost word sense disambiguation performance. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 105–112. Association for Computational Linguistics.

Hwee Tou Ng, Bin Wang, and Yee Seng Chan. 2003. Exploiting parallel texts for word sense disambiguation: An empirical study. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 455–462. Association for Computational Linguistics.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: Developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, pages 293–302, Mysore, India.

Adriana Roventini, Antonietta Alonge, Francesca Bertagna, Nicoletta Calzolari, Rita Marinelli, Bernardo Magnini, Manuela Speranza, and Antonio Zampolli. 2002. ItalWordNet: a large semantic database for the automatic treatment of the Italian language. In First International WordNet Conference.

Sofia Stamou, Kemal Oflazer, Karel Pala, Dimitris Christoudoulakis, Dan Cristea, Dan Tufis, Svetla Koeva, George Totkov, Dominique Dutoit, and Maria Grigoriadou. 2002. Balkanet: A multilingual semantic network for the Balkan languages. In Proceedings of the International Wordnet Conference, Mysore, India, pages 21–25.

Dan Tufiş, Radu Ion, and Nancy Ide. 2004. Word sense disambiguation as a wordnets validation method in Balkanet. In Proceedings of the 4th LREC Conference, pages 741–744.

Dan Tufiş, Verginica Barbu Mititelu, Dan Ştefănescu, and Radu Ion. 2013. The Romanian wordnet in a nutshell. Language Resources and Evaluation, 47(4):1305–1314, December.
CILI: the Collaborative Interlingual Index
Francis Bond,♠ Piek Vossen,♢ John P. McCrae♣ and Christiane Fellbaum♡
♠ Nanyang Technological University, Singapore
♢ VU University Amsterdam, The Netherlands
♣ Insight Centre for Data Analytics, NUI Galway, Galway, Ireland
♡ Princeton University, U.S.A.
<[email protected],[email protected],[email protected],[email protected]>
Abstract
take this approach. A few wordnets are based
on the merge approach, where independent language specific structures are built first and then
some synsets linked to the PWN. In OMW, only
five projects take this approach: Chinese (Taiwan),
Danish, Dutch, Polish and Swedish (Huang et al.,
2010; Pedersen et al., 2009; Postma et al., 2016;
Piasecki et al., 2009; Borin et al., 2013).
To investigate meaning across languages, we
need to link synsets cross-lingually. It is easy to
link expand-style wordnets: they all link to PWN
and it can be used as a pivot to link them together.
This is one of the attractions of using the expand
approach, you immediately gain multilingual links.
The disadvantage is that concepts not in PWN (either because they are not lexicalized in English or
just because they have not been covered yet) cannot be expressed. Because of this, many expandstyle wordnets also define some new, languagespecific synsets, typically a few tens or hundreds
(Arabic, Chinese, Italian, Japanese, Catalan, Spanish, Galician, Finnish, Malay/Indonesian, Bulgarian, Greek, Romanian, Serbian and Turkish all do
so)(Pianta et al., 2002; Tufiş et al., 2004; Elkateb
and Fellbaum, 2006; Gonzalez-Agirre et al., 2012;
Wang and Bond, 2013; Bond et al., 2014; Seah and
Bond, 2014; Postma et al., 2016).
It is harder to link merge-style wordnets. The
projects need to somehow identify links to PWN,
and as a result, only a small subset of the language
specific synsets are linked to PWN. Examining the
unlinked synsets, this seems to be principally due
to the lack of resources to link them than semantic
incompatibility. For example, Danish and Polish
(Pedersen et al., 2009; Piasecki et al., 2009) have
many synsets which can be linked but are not currently.
Currently, when projects create their own
synsets, there is no coordination between these
projects. This means that similar or even identical concepts may be introduced in multiple places.
This paper introduces the motivation for
and design of the Collaborative InterLingual Index (CILI). It is designed to make
possible coordination between multiple
loosely coupled wordnet projects. The
structure of the CILI is based on the Interlingual index first proposed in the EuroWordNet project with several pragmatic
extensions: an explicit open license, definitions in English and links to wordnets in
the Global Wordnet Grid.
1
Introduction
Within 10 years of the release of Wordnet (Miller,
1990) researchers had started to extend it to other
languages (Vossen, 1998). Currently, the Open
Multilingual Wordnet (OMW: Bond and Paik,
2012; da Costa and Bond, 2015) has brought together wordnets for 33 languages that have released
open data,1 and automatically produced data for
150. There are even more wordnets than this: some
large projects have released non-open data, notably
German (Kunze and Lemnitzer, 2002) and Korean
(Yoon et al., 2009) and many projects have yet to
release any. This activity shows that the structure
of wordnets is applicable to many languages.
All the wordnets are based on the basic structure of the Princeton wordnet (PWN: Fellbaum,
1998): synonyms grouped together into synsets
and linked to each other by semantic relations.
The majority of wordnets have been based on
the expand approach, that is adding lemmas in
new languages to existing PWN synsets (Vossen,
1998, p83), boot-strapping from the structure of
English. 28 out of 33 of the wordnets in OMW
1
We use the definition from the Open Knowledge Foundation: http://opendefinition.org/: ``anyone is free
to use, reuse, and redistribute it --- subject only, at most, to
the requirement to attribute and/or share-alike''.
50
For example, most South East Asian languages distinguish between cooked and uncooked rice: these concepts have been added independently to the Korean and Japanese wordnets. Typically, clusters of projects have tried to coordinate, such as EuroWordNet, the Multilingual Central Repository for Basque, Catalan, Galician and Spanish (Gonzalez-Agirre et al., 2012), the MultiWordNet for Italian and Hebrew (Pianta et al., 2002), Balkanet (Tufiş et al., 2004), the Wordnet Bahasa for Malay and Indonesian (Bond et al., 2014) and the IndoWordnet project (Bhattacharyya, 2010).

Clearly, there is a need for a single shared repository of concepts. In this paper, we propose to build one: the Collaborative InterLingual Index (CILI). We base the index on the technical foundations laid down in EuroWordNet: a single list that is the union of all the synsets in all the wordnets (Peters et al., 1998; Vossen et al., 1999). To this we add ideas from the best practice of the Semantic Web: a shared, easily accessible resource with a well-defined license; from open-source software: build a community of users who will co-develop the resource; and from experience in many multilingual lexical projects: accept the de facto use of English as a common language of communication.

In the following sections we discuss the motivation further (§ 2), then describe in detail the structure of the CILI (§ 3), list some open issues (§ 4) and finally conclude.

2 Motivation

Wordnets have been built with different methods and from different starting points: expand or merge, manually or semi-automatically, based on pre-existing monolingual resources or using available bilingual resources to translate English synsets into words in the target language. Furthermore, it is up to the wordnet builders to make decisions about which words are synonyms, what the semantic relations between the synsets are and how to interpret each semantic relation. We can observe very large synsets in one wordnet being linked through PWN to small synsets in another language. Different granularities of synsets bring into question the notion of the same concept existing across these wordnets. PWN uses 44 semantic relations (if separated by part of speech), but EuroWordNet defined 71 relations that partially overlap. Even if two wordnets use the same relation name, there is no guarantee that it is interpreted in the same way. In fact, different wordnet editors and algorithms may interpret relations differently. Even the symbols used for parts of speech differ across projects (adverb is 'r' in PWN but 'b' in some projects). Finally, one can observe large differences in coverage of the vocabulary and in the degree of polysemy. Vocabularies and concepts differ in size but also in terms of genre, pragmatics, the inclusion of multiword expressions as ``phrase sets'' (Bentivogli and Pianta, 2003) and specific domains and areas. Choices for distinguishing senses lead to fine-grained or coarse-grained polysemy, where the latter may lead to multiple hypernyms that can be modeled as complex types (Pustejovsky, 1995). Finally, the glosses for synsets play an underestimated role in addition to the synsets and the relations, but no formal structuring is defined for these glosses. As a result, glosses are not sufficiently descriptive to precisely identify the meaning of a concept. Such differences across wordnets make it difficult to establish the proper relations to the ILI and thus to compare and exploit wordnets across languages. Further, if a synset is not realized in a language it is not clear whether the concept is not lexicalized in that language, or whether it is merely not realized yet (the compilers may just not have got round to it).

To solve these problems, we need not just to define an interlingual index, but also shared guidelines for relations, how to write definitions, standard data formats and so forth.

3 The Collaborative Interlingual Index
In this section we describe the core properties of CILI. To coordinate an index among all the different wordnet projects, we propose that it should, ideally, have the following properties (building on 1--5 from Fellbaum and Vossen, 2008):

1. The InterLingual Index (ILI) should be a flat list of concepts.

2. The semantic and lexical relations should mean the same things for all languages.

3. Concepts should be constructed for salient and frequent lexicalized concepts in all languages.

4. Concepts linked to multiword units (MWUs) in wordnets should be included.

5. A formal ontology could be linked to, but kept separate from, the wordnets.
6. The license must allow redistribution of the index.

7. ILI IDs should be persistent: we never delete, only deprecate or supersede; we should not change the meaning of a concept.

8. Each new ILI concept should have a definition in English, as this is the only way we can coordinate across languages. The definition should be unique (which is not currently true), and preferably parse and sense-tag information should also be included. Definition changes will be moderated.

9. Each new ILI concept should link to a synset in an existing project that is part of the GWG with one of a set of known relations (hypernymy, meronymy, antonymy, …).

10. This synset should link to another synset in an existing project that is part of the GWG and links to an ILI concept. ⇒ each concept is linked to another concept through at least one wordnet in the grid.

11. Any project adding new synsets should first check that they do not already exist in the CILI.
• New concepts are added by virtue of their existing in a wordnet
• If something fulfills the criteria, it is proposed
• If there are no objections after three months, it is added

Property 6, an open license, is a necessary condition for groups to be able to use the ILI within their own projects. To be maximally compatible, the license should place as few restrictions as possible, ideally requiring only that the source of the resource be mentioned: it should be either the wordnet license itself, Creative Commons Attribution (CC BY) or the MIT license. We chose CC BY, as this license is well written and documented and is widely used.

Property 7, persistent identifiers, is an important criterion for stability. If the ILI changed its IDs, projects without the resources to maintain compatibility would fall behind. If a project changes its hierarchy, it will need to add new nodes and delink the old ones. To keep backwards compatibility, even if a concept is deemed problematic, it will remain in the CILI, marked as deprecated, preferably with a link to the concept that supersedes it.

Property 8, that all synsets should have a definition in English, recognizes that, in practice, the only language shared by all groups is English. Here we are inspired by experience with the CICC project, a multilingual machine translation project linking Thai, Chinese, Japanese, Malay and Indonesian (but not English) (CICC, 1994). No members spoke all five languages, but someone in each group spoke English, so all dictionary entries also had an English translation or definition. Having a universally understood definition is a prerequisite for avoiding the redundant creation of new senses. This places a burden on non-English speakers, which we will try to lighten by giving clear guidelines for writing definitions (see Section 3.3). Note that while the definition must be in English, the concept is not necessarily lexicalized in English, in contrast to Princeton WordNet.

Properties 9 and 10 make sure that all new concepts link to something: there should be no orphaned concepts. Exactly which links are acceptable is still a matter of research.²

² Many wordnets, including PWN, currently contain some orphans (e.g. uphill𝑟∶1); these would not be added to the ILI unless they are linked to something.

The final point (11) is about coordination. Practically, it will not be possible to have a single moderator who can check new synsets in every language. We therefore propose that the burden of checking for duplication with existing synsets be placed on the project wanting to add new synsets. As new concepts should be linked to existing concepts through relational links in a wordnet, and definitions in English will exist for all entries, checking for a compatible entry in the ILI should not be too burdensome. Project members with wordnets in the shared multilingual index would gain write privileges to the ILI; of course, anyone should be able to read it. We will build automated tools that warn if definitions are too similar (for details see Vossen et al., 2016).

For the ILI to be successful there will be an initial cost to combine all existing non-English synsets, adding English definitions for all and merging duplicates. It would also require buy-in from all participating projects, but fortunately most non-English wordnets contain few synsets that do not correspond to an English synset, so this first step should not be too burdensome. For wordnets
built with the merge approach there will be many more new synsets; these should be checked carefully and validated against corpora before being included in the ILI. We will support this with workshops at relevant conferences (such as the 16th Global Wordnet Conference).

In the long run, we hope that external resources will link to the ILI's persistent IDs (things like SUMO, TempoWordnet (Dias et al., 2014) and the many sentiment wordnets (Baccianella et al., 2010; Cruz et al., 2014)).

3.1 Format

The ILI will be represented as RDF. Our reference implementation will be in Turtle (Terse RDF Triple Language: W3C, 2012), a compact format for RDF graphs. It includes its own metadata, based on the Dublin Core, shown in Figure 1. As far as possible, triples are defined using existing schemas (referenced in the preamble). The individual entries are designed to be extremely simple. Unlike synsets in individual wordnets, ILI concepts do not have explicit parts of speech. No further semantics is imposed within the ILI.

Each concept in the ILI has the following simple structure:

• A unique ID: i1, i2, i3, …
• A type: Concept or Instance
• A gloss in English: skos:definition
• A link to the synset that first motivated the ILI concept: dc:source
• Links to all current wordnets in the GWG that use this concept: owl:sameAs
• Optionally a deprecate/supersede link

We give an example in Figure 2, which also shows the relevant prefixes.

Information about provenance (who added the entry, when it was made and so forth) is left to the version control system, for which we have chosen git (http://git-scm.com/). When commits are made, the project will be added as the author, so a record is kept of who is responsible for which change without making it visible in the ILI.

Note that the concept is defined not just by the written definition but by the links to the wordnets and the lemmas in those wordnets: the definition is a crucial tool for coordinating across languages, but is not meant to be the sole determiner of the ILI concept's meaning. The ILI concepts will always be linked to the global wordnet grid (Fellbaum and Vossen, 2007; Vossen et al., 2016).

Labels for the concepts can be produced automatically, as it is probable that different languages will want different labels. The easiest approach would be to take the most frequent lemma in the language of choice, backing off to the most frequent lemma in the language that introduced the concept (which can be obtained from the dc:source).

3.2 The WordNet Schema

To ensure that wordnets can be submitted in a form that is compatible with the ILI, we have developed two specific schemas: an XML schema based on the Lexical Markup Framework (Vossen et al., 2013, LMF) and a second in JSON-LD (Sporny et al., 2014) using the Lexicon Model for Ontologies (McCrae et al., 2012, lemon). These models are structured as follows:

LexicalResource The root element of the resource is the lexical resource.

Lexicon Each wordnet has a lexicon for each resource, which has a name, an ID and a language. The language is given as a BCP 47 tag.

Lexical Entry Each 'word' is termed a lexical entry; it has exactly one lemma, at least one sense and any number of syntactic behaviors.

Lemma The lemma has a written form and a part of speech, which may be one of noun, verb, adjective, adverb, phrase, sentence or unknown.

Sense The sense has any number of sense relations and a synset.

Synset The synset has an optional definition and any number of sense relations.

Definition The definition is given in the language of the wordnet it came from as well as the ILI definition (in English). A definition may also have a statement that gives an example.

Synset/Sense Relation A relation from a given list of relations such as synonym, hypernym, antonym. This list defines the relations used
<> a voaf:Vocabulary ;
vann:preferredNamespacePrefix "ili" ;
vann:preferredNamespaceUri "http://globalwordnet.org/ili" ;
dc:title "Global Wordnet ILI"@en ;
dc:description "The shared Inter-Lingual Index for the global wordnets.
It consists of a list of concepts or instances with definitions,
and their links to open wordnets."@en ;
dc:issued "2015-07-30"^^xsd:date ;
dc:modified "2015-07-30"^^xsd:date ;
owl:versionInfo "0.1.1"@en ;
dc:rights "Copyright Global Wordnet Association" ;
cc:license <http://creativecommons.org/licenses/by/4.0> ;
cc:attributionName "Global Wordnet Association";
cc:attributionURL <http://globalwordnet.org>;
dc:contributor <http://www3.ntu.edu.sg/home/fcbond/>, <http://john.mccr.ae> ,
<http://vossen.info/> ;
dc:publisher <http://globalwordnet.org> .
Figure 1: ILI metadata
@prefix pwn30: <http://wordnet-rdf.princeton.edu/wn30/> .
@prefix jwn12: <http://compling.hss.ntu.edu.sg/omw/wns/jpn/> .
@prefix ili: <http://globalwordnet.org/ili/> .
@base <http://globalwordnet.org/ili/ili#>.
<i71370> a <Concept> ;
dc:source
pwn30:06639428-n ;
skos:definition "any of the machine-readable lexical databases
modeled after the Princeton WordNet"@en ;
owl:sameAs
jwn12:jpn-06639428-n ;
owl:sameAs
pwn30:06639428-n .
Figure 2: Example ILI entry for the concept of a wordnet
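The concept structure is simple enough to validate with a few lines of code. The following sketch (ours, not part of any released CILI tooling; the field names and rules are illustrative) checks an entry shaped like the one in Figure 2 against the structure listed in § 3.1, together with the minimum definition length proposed in § 3.3:

```python
# Minimal sketch of an ILI concept record, mirroring Figure 2.
# Field names and validation rules are illustrative, not official tooling.

ILI_TYPES = {"Concept", "Instance"}

def validate_ili_entry(entry):
    """Return a list of problems with an ILI entry (empty list = valid)."""
    problems = []
    # A unique ID of the form i1, i2, i3, ...
    ident = entry.get("id", "")
    if not (ident.startswith("i") and ident[1:].isdigit()):
        problems.append("id must look like i1, i2, ...")
    # A type: Concept or Instance
    if entry.get("type") not in ILI_TYPES:
        problems.append("type must be Concept or Instance")
    # A gloss in English (skos:definition), at least 20 chars or 5 words
    gloss = entry.get("definition", "")
    if len(gloss) < 20 and len(gloss.split()) < 5:
        problems.append("definition too short")
    # A link to the synset that first motivated the concept (dc:source)
    if not entry.get("source"):
        problems.append("missing dc:source synset")
    return problems

entry = {
    "id": "i71370",
    "type": "Concept",
    "definition": ("any of the machine-readable lexical databases "
                   "modeled after the Princeton WordNet"),
    "source": "pwn30:06639428-n",
    "same_as": ["jwn12:jpn-06639428-n", "pwn30:06639428-n"],
}
print(validate_ili_entry(entry))  # → []
```

A duplicate-checking tool of the kind proposed in § 3 could build on such per-entry checks before comparing definitions across entries.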
by the Global Wordnet Grid, and all the relations are documented on the Global Wordnet Association website.

Syntactic Behavior A syntactic behavior (verb frame) gives the subcategorization frame in plain text, such as ``Sam and Sue %s the movie''.

Meta Dublin Core properties may be added to lexicons, lexical entries, senses and synsets.

Either format can be used to describe a wordnet and it is simple to convert between the two. An example of the LMF form is given in Figure 3 and of the WN-JSON form in Figure 4.

3.3 Guidelines for Definitions

In any given wordnet, the definition is only one of the things that helps to convey the meaning of a word; it is accompanied by semantic relations, part-of-speech information, examples and so forth. The ILI is situated in the global wordnet grid, so this information should also be available. However, the definition is the only thing guaranteed to be in the ILI, and the accompanying information may only be from a wordnet whose language is not comprehensible to another user. Moreover, as these definitions are given in natural language, it is important to ensure that they are as unambiguous as possible and can clearly identify the concepts without the additional mechanisms of semantic relations. For these reasons strong guidelines for definitions are of primary importance.

There are already good general guidelines for writing dictionary definitions (Landau, 1989, Chapter Four). Almost all of these apply to wordnets in general, and the CILI in particular, with the exception that brevity is less important in an electronic resource.

There are some extra constraints for the CILI. First, definitions should be unique and there should be enough information to minimally distinguish one concept from all others. This was not the case in the wordnets: PWN has over 1,629 synsets with non-unique definitions, and there are similar numbers in other wordnets (1,362 in Japanese, 418 in Indonesian, 211 in Greek, 104 in Albanian and so on). For example, it would not be sufficient to describe paella𝑛∶1 as ``a Spanish dish'' as
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE LexicalResource SYSTEM "http://globalwordnet.github.io/schemas/WN-LMF.dtd">
<LexicalResource>
<Lexicon label="Princeton WordNet" language="en">
<LexicalEntry id="w1">
<Lemma writtenForm="wordnet" partOfSpeech="n"/>
<Sense id="106652077-n-1" synset="106652077-n"/>
</LexicalEntry>
<Synset id="106652077-n" ili="s35545">
<Definition
gloss="any of the..."
iliDef="any of the..."/>
<SynsetRelation relType="hypernym" target="106651393-n"/>
</Synset>
<Meta publisher="Princeton University"
rights="http://wordnet.princeton.edu/wordnet/license/"/>
</Lexicon>
</LexicalResource>
Figure 3: Example of WordNet entry in WN-LMF
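As a rough illustration of consuming the WN-LMF form, the entry in Figure 3 can be read with a standard XML parser; the snippet below (our sketch, not an official tool of the schema) extracts the synset-to-ILI mapping and the hypernym links:

```python
# Reading a WN-LMF fragment (as in Figure 3) with the standard library.
# This is a consumer-side sketch, not part of the schema itself.
import xml.etree.ElementTree as ET

WN_LMF = """<LexicalResource>
  <Lexicon label="Princeton WordNet" language="en">
    <LexicalEntry id="w1">
      <Lemma writtenForm="wordnet" partOfSpeech="n"/>
      <Sense id="106652077-n-1" synset="106652077-n"/>
    </LexicalEntry>
    <Synset id="106652077-n" ili="s35545">
      <Definition gloss="any of the..." iliDef="any of the..."/>
      <SynsetRelation relType="hypernym" target="106651393-n"/>
    </Synset>
  </Lexicon>
</LexicalResource>"""

root = ET.fromstring(WN_LMF)
# Map each synset to its ILI concept ...
ili_map = {s.get("id"): s.get("ili") for s in root.iter("Synset")}
# ... and collect its hypernym targets.
hypernyms = {s.get("id"): [r.get("target")
                           for r in s.iter("SynsetRelation")
                           if r.get("relType") == "hypernym"]
             for s in root.iter("Synset")}
print(ili_map)     # → {'106652077-n': 's35545'}
print(hypernyms)   # → {'106652077-n': ['106651393-n']}
```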
{
"@context": [ "http://globalwordnet.github.io/schemas/wn-json-context.json",
{ "@language": "en" } ],
"@id": "pwn30",
"label": "Princeton WordNet",
"language": "en",
"publisher": "Princeton University",
"rights": "wordnetlicense:",
"entry": [{
"@id" : "w1",
"lemma": { "writtenForm": "wordnet" },
"partOfSpeech": "wn:noun",
"sense": [{
"@id": "106652077-n-1",
"synset": {
"@id": "106652077-n",
"ili": "s35545",
"definition": {
"gloss": "any of the..." ,
"iliDef": "any of the..."
},
"hypernym": ["106651393-n"]
}
}]
}]
}
Figure 4: Example of an entry in WN-JSON
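The WN-JSON form can likewise be read with an ordinary JSON parser, treating it as plain JSON and ignoring the JSON-LD @context machinery. This sketch (ours; the @context, rights and licensing fields from Figure 4 are omitted for brevity) pairs each synset with its ILI concept:

```python
# Consuming a WN-JSON fragment (cf. Figure 4) with the json module.
# A full JSON-LD processor would expand the @context; for simple
# extraction, walking the plain JSON tree is enough.
import json

WN_JSON = """{
  "@id": "pwn30",
  "label": "Princeton WordNet",
  "language": "en",
  "entry": [{
    "@id": "w1",
    "lemma": {"writtenForm": "wordnet"},
    "partOfSpeech": "wn:noun",
    "sense": [{
      "@id": "106652077-n-1",
      "synset": {
        "@id": "106652077-n",
        "ili": "s35545",
        "definition": {"gloss": "any of the...", "iliDef": "any of the..."},
        "hypernym": ["106651393-n"]
      }
    }]
  }]
}"""

wn = json.loads(WN_JSON)
# Collect (synset id, ILI id) pairs across all entries and senses.
pairs = [(s["synset"]["@id"], s["synset"]["ili"])
         for e in wn["entry"] for s in e["sense"]]
print(pairs)  # → [('106652077-n', 's35545')]
```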
this is not sufficiently distinctive. For the wordnets, the combination of definition and lemmas is normally enough to distinguish a word, but for the ILI, if necessary, one of the English lemmas must be included in the definition (for example, including the species name in the definition). This conflicts somewhat with best practice for individual wordnets, where in general we want to avoid redundancy: if the synset is linked through domain-category to e.g. mathematics, we would normally not start the definition with ``(mathematics)''. A case in point is the definitions for PWN30:13223710-n ground fir, princess pine, tree clubmoss, Lycopodium obscurum and PWN30:13223588-n ground cedar, staghorn moss, Lycopodium complanatum, which are both defined as ``a variety of club moss''. In this case, amending the definitions to ``a variety of club moss (Lycopodium obscurum)'' and ``a variety of club moss (Lycopodium complanatum)'' makes them unique (at the cost of some redundancy). We propose using some of the wide array of brackets available to mark the redundant information in the ILI definition: ``⟪plant⟫ a variety of club moss [Lycopodium complanatum]''. Doing this reduces the number of non-unique definitions by over 50%. The ILI definitions are thus produced automatically from PWN 3.0, without always being identical to them.

We also place some limitations on the format.
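The bracket convention for redundant information lends itself to simple post-processing. For instance, a consumer wanting the plain display form of a definition could strip the ⟪domain⟫ and [disambiguator] markup; the function below is an illustrative sketch, not part of the proposal itself:

```python
# Sketch of handling the proposed bracket convention: domain markers in
# double angle brackets and redundant disambiguators in square brackets
# are part of the unique ILI definition but can be dropped for display.
import re

def display_form(ili_definition):
    """Strip ⟪domain⟫ and [disambiguator] markup from an ILI definition."""
    text = re.sub(r"⟪[^⟫]*⟫", "", ili_definition)
    text = re.sub(r"\[[^\]]*\]", "", text)
    # Collapse the whitespace left behind by the removed spans.
    return " ".join(text.split())

unique = "⟪plant⟫ a variety of club moss [Lycopodium complanatum]"
print(display_form(unique))  # → 'a variety of club moss'
```

Conversely, a uniqueness checker could compare definitions with the markup retained, since that is what makes them distinct.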
The definition should consist of one or more short utterances, separated by semicolons. Semicolons should not be used within an utterance; use a comma or colon instead. Definitions will be split on semicolons before being parsed, so it is important to be consistent here. We also do not allow the use of ASCII double quotes, instead preferring Unicode left and right (double) quotes, to aid parsing.

In general, we need to be very conservative in changing the definitions of concepts in the ILI. When first written, we should try not to make the definition too restricted: for example, for angel, backer, instead of ``invests in a theatrical production'', prefer ``someone who invests in something, typically a theatrical production''. This makes it easier to avoid having to create multiple very similar synsets.

Definitions should use standard patterns, especially for the first utterance in a definition. Ideally, the definition should consist of a genus (the hypernym, not necessarily the immediate hypernym) and differentiae; e.g., for wordnet: ``any of the machine-readable lexical databases (genus) modeled after the Princeton WordNet (differentiae)''.

Adjectives and adverbs are exceptions, in that they are often defined using prepositional phrases. Finally, we make a simple requirement that definitions have a minimum length of 20 characters or 5 words.

In future work we will produce a tool to parse the definition and automatically identify the hypernym (Nichols et al., 2005), sense-tag the definition (Moldovan and Novischi, 2004) and report on this to the definition writer, as well as compare the definition to definitions of similar concepts. This can help identify infelicitous definitions.

4 Open Issues

There are a few cases where it was hard to decide whether a concept should be represented in the InterLingual Index.

One example is named entities. Roughly 6.6% of the entries in PWN are linked by the instance relation (including the names of people, places, planets, gods and many more). Named entities are much more numerous than words, and these concepts and their relations are better captured by other kinds of resources. However, some named entities can be considered part of the lexicon as well as names for objects, for example Glaswegian𝑎∶1 ``of or relating to or characteristic of Glasgow or its inhabitants'', which is also used in the definition of other concepts. Thus, we retain a small number of named entities, especially geographic terms, but further discussion is required to refine an exact policy.

It could also be argued that some of the derived forms (for example quickly𝑟∶1 from quick𝑎∶1) are unnecessary: as the meaning change is generative, there is no point in having two concepts. These kinds of changes can be applied later by means of superseding the concepts; for the moment we apply the distinctions made by Princeton WordNet.

5 Conclusions

We have introduced and motivated the Collaborative InterLingual Index (CILI). Its simple design allows us to link wordnets with a minimum of extra work. Once concepts are added to the CILI, they will get a persistent ID and thereafter should not be deleted or change in meaning. We propose that the task of checking the validity of new concepts be taken up by the individual wordnet projects, with only a light layer of moderation.

References

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010). Valletta, Malta.

Luisa Bentivogli and Emanuele Pianta. 2003. Beyond lexical units: Enriching wordnets with phrasets. In Proceedings of the Research Note Sessions of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 03), pages 67--70. Budapest.

Pushpak Bhattacharyya. 2010. Indowordnet. In Proceedings of the 7th Language Resources and Evaluation Conference (LREC 2010). La Valletta. URL http://www.cfilt.iitb.ac.in/indowordnet/.

Francis Bond, Lian Tze Lim, Enya Kong Tan, and Hammam Riza. 2014. The combined wordnet Bahasa. Nusa: Linguistic studies of languages in and around Indonesia, 57:83--100.

Francis Bond and Kyonghee Paik. 2012. A survey of wordnets and their licenses. In Proceedings of the 6th Global WordNet Conference (GWC 2012). Matsue. 64--71.

Lars Borin, Markus Forsberg, and Lennart Lönngren. 2013. Saldo: a touch of yin to WordNet's yang. Language Resources and Evaluation, 47(4):1191--1211. URL dx.doi.org/10.1007/s10579-013-9233-4.

CICC. 1994. Research on Malaysian dictionary. Technical Report 6---CICC---MT54, Center of the International Cooperation for Computerization, Tokyo.
Fermín L. Cruz, José A. Troyano, Beatriz Pontes, and F. Javier Ortega. 2014. Building layered, multilingual sentiment lexicons at synset and lemma levels. Expert Systems with Applications.

Luís Morgado da Costa and Francis Bond. 2015. OMWEdit -- the integrated open multilingual wordnet editing system. In ACL-2015 System Demonstrations.

Gaël Harry Dias, Mohammed Hasanuzzaman, Stéphane Ferrari, and Yann Mathet. 2014. Tempowordnet for sentence time tagging. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, WWW Companion '14, pages 833--838. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland. URL http://dx.doi.org/10.1145/2567948.2579042.

Sabri Elkateb, William Black, Horacio Rodríguez, Musa Alkhalifa, Piek Vossen, Adam Pease, and Christiane Fellbaum. 2006. Building a wordnet for Arabic. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006).

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Christiane Fellbaum and Piek Vossen. 2007. Connecting the universal to the specific: Towards the global grid. In First International Workshop on Intercultural Collaboration (IWIC-2007), pages 2--16. Kyoto.

Christiane Fellbaum and Piek Vossen. 2008. Challenges for a global wordnet. In J. Webster, Nancy Ide, and A. Chengyu Fang, editors, Online Proceedings of the First International Workshop on Global Interoperability for Language Resources (ICGL 2008), pages 75--82. City University of Hong Kong.

Aitor Gonzalez-Agirre, Egoitz Laparra, and German Rigau. 2012. Multilingual central repository version 3.0: upgrading a very large lexical knowledge base. In Proceedings of the 6th Global WordNet Conference (GWC 2012). Matsue.

Chu-Ren Huang, Shu-Kai Hsieh, Jia-Fei Hong, Yun-Zhu Chen, I-Li Su, Yong-Xiang Chen, and Sheng-Wei Huang. 2010. Chinese wordnet: Design and implementation of a cross-lingual knowledge processing infrastructure. Journal of Chinese Information Processing, 24(2):14--23. (in Chinese).

C. Kunze and L. Lemnitzer. 2002. Germanet --- representation, visualization, application. In LREC, pages 1485--1491.

Sidney I. Landau. 1989. Dictionaries: The Art and Craft of Lexicography. Cambridge University Press, Cambridge, UK.

John McCrae, Philipp Cimiano, and Elena Montiel-Ponsoda. 2012. Integrating wordnet and wiktionary with lemon. In Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann, editors, Linked Data in Linguistics. Springer.

George A. Miller. 1990. WordNet: An online lexical database. International Journal of Lexicography, 3(4). (Special Issue).

Dan Moldovan and Adrian Novischi. 2004. Word sense disambiguation of WordNet glosses. Computer Speech and Language, 18:301--317.

Eric Nichols, Francis Bond, and Daniel Flickinger. 2005. Robust ontology acquisition from machine-readable dictionaries. In Proceedings of the International Joint Conference on Artificial Intelligence IJCAI-2005, pages 1111--1116. Edinburgh.

Bolette Sandford Pedersen, Sanni Nimb, Jørg Asmussen, Nicolai Hartvig Sørensen, Lars Trap-Jensen, and Henrik Lorentzen. 2009. DanNet --- the challenge of compiling a wordnet for Danish by reusing a monolingual dictionary. Language Resources and Evaluation, 43(3):269--299.

Wim Peters, Piek Vossen, Pedro Díez-Orzas, and Geert Adriaens. 1998. Cross-linguistic alignment of wordnets with an inter-lingual-index. In Vossen (1998), pages 149--251.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. Multiwordnet: Developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, pages 293--302. Mysore, India.

Maciej Piasecki, Stan Szpakowicz, and Bartosz Broda. 2009. A Wordnet from the Ground Up. Wroclaw University of Technology Press. URL http://www.plwordnet.pwr.wroc.pl/main/content/files/publications/A_Wordnet_from_the_Ground_Up.pdf. (ISBN 978-83-7493-476-3).

Marten Postma, Emiel van Miltenburg, Roxane Segers, Anneleen Schoen, and Piek Vossen. 2016. Open Dutch wordnet. In Proceedings of the 8th Global Wordnet Conference (GWC 2016). (this volume).

James Pustejovsky. 1995. The Generative Lexicon. MIT Press, Cambridge, MA.

Yu Jie Seah and Francis Bond. 2014. Annotation of pronouns in a multilingual corpus of Mandarin Chinese, English and Japanese. In 10th Joint ACL-ISO Workshop on Interoperable Semantic Annotation. Reykjavik.

Manu Sporny, Dave Longley, Gregg Kellogg, Markus Lanthaler, and Niklas Lindström. 2014. JSON-LD 1.0: A JSON-based serialization for linked data. W3C recommendation, The World Wide Web Consortium.

Dan Tufiş, Dan Cristea, and Sofia Stamou. 2004. BalkaNet: Aims, methods, results and perspectives. A general overview. Romanian Journal of Information Science and Technology, 7(1--2):9--34.

Piek Vossen, editor. 1998. EuroWordNet. Kluwer.

Piek Vossen, Francis Bond, and John McCrae. 2016. Toward a truly multilingual global wordnet grid. In Proceedings of the 8th Global Wordnet Conference (GWC 2016). (this volume).

Piek Vossen, Wim Peters, and Julio Gonzalo. 1999. Towards a universal index of meaning. In Proceedings of ACL-99 Workshop, Siglex-99, Standardizing Lexical Resources, pages 81--90. Maryland.

Piek Vossen, Claudia Soria, and Monica Monachini. 2013. LMF -- lexical markup framework. In Gil Francopoulo, editor, LMF -- Lexical Markup Framework, chapter 4. ISTE Ltd + John Wiley & Sons, Inc.

W3C. 2012. Turtle --- terse RDF triple language. http://www.w3.org/TR/2012/WD-turtle-20120710/.

Shan Wang and Francis Bond. 2013. Building the Chinese Open Wordnet (COW): Starting from core synsets. In Proceedings of the 11th Workshop on Asian Language Resources, a Workshop at IJCNLP-2013, pages 10--18. Nagoya.

Aesun Yoon, Soonhee Hwang, Eunroung Lee, and Hyuk-Chul Kwon. 2009. Construction of Korean wordnet KorLex 1.5. Journal of KIISE: Software and Applications, 36(1):92--108.
YARN: Spinning-in-Progress
Pavel Braslavski
Ural Federal University
Yekaterinburg, Russia
[email protected]
Dmitry Ustalov
Ural Federal University
Yekaterinburg, Russia
[email protected]
Mikhail Mukhin
Ural Federal University
Yekaterinburg, Russia
[email protected]
Yuri Kiselev
Yandex
Yekaterinburg, Russia
[email protected]
Abstract
YARN (Yet Another RussNet), a project started in 2013, aims at creating a large open WordNet-like thesaurus for Russian by means of crowdsourcing. The first stage of the project was to create noun synsets. Currently, the resource comprises 100K+ word entries and 46K+ synsets. More than 200 people have taken part in assembling synsets throughout the project. The paper describes the linguistic, technical, and organizational principles of the project, as well as the evaluation results, lessons learned, and the future plans.

1 Introduction

The Global WordNet Association website lists 76 wordnets for 70 different languages¹, including multilingual resources. Although the table mentions as many as three wordnets for Russian, unfortunately no open Russian thesaurus of acceptable quality and size is available yet.

The Yet Another RussNet (YARN) project² started in 2013. It aims at creating a comprehensive and open thesaurus for Russian. From the linguistic point of view, the proposed thesaurus has a rather traditional structure: it consists of synsets (groups of near-synonyms corresponding to a concept), and synsets are linked to each other, primarily via hierarchical hyponymic/hypernymic relations. YARN intends to cover Russian nouns, verbs and adjectives. Following the divide-and-conquer approach, we treat synset assembly and relationship establishment separately.

The main difference between YARN and previous projects is that YARN is based on crowdsourcing. We hope that the crowdsourcing approach will make it possible to create a resource of satisfactory quality and size in the foreseeable future and with limited financial resources. Our optimism is based both on international practice and on recent examples of successful Russian NLP projects fueled by volunteers. Another important distinction is that the editors do not build the thesaurus from scratch; instead, they use “raw data” as the input. These “raw data” stem from pre-processed dictionaries, Wiktionary, Wikipedia, and text corpora. More than 200 people have taken part in synset assembly in the course of the project. Currently, the resource comprises 100K+ word entries and 46K+ synsets that are available under a CC BY-SA license.

The paper describes the main linguistic and organizational principles of YARN, the tools developed, and the results of the current content evaluation. We also point to some pitfalls of the chosen crowdsourcing methodology and discuss how we could address them in the future.

2 Related Work

In this section, we briefly survey projects aimed at creating WordNet-like semantic resources for Russian, describe peculiarities of other thesauri for Slavic languages, and systematize different crowdsourcing approaches to building lexicographic resources.

¹ http://globalwordnet.org/wordnets-in-the-world/
² http://russianword.net/en/, not to be confused with a Hadoop subsystem.
2.1 Russian Thesauri

The RussNet project³ was launched in 1999 at Saint Petersburg University (Azarova et al., 2002). According to the RussNet developers, the resource currently contains about 40K word entries, 30K synsets, and 45K semantic relations. However, this data is not encoded in a uniform format and cannot be published or used in an NLP application in its current form.

RuThes is probably the most successful WordNet-like resource for Russian (Loukachevitch, 2011). It has been under development since 2002 and now contains 158K lexical units constituting 55K concepts. RuThes is a proprietary resource; however, a subset of it was published recently⁴. The main hurdles to a wider use of the resource are its restrictive license and the fact that the data in XML format can be obtained by request only.

Another resource, RussianWordNet, was the result of a fully automatic translation of the Princeton WordNet (PWN) into Russian undertaken in 2003; it is freely available⁵ under the PWN license. The approach, based on bilingual dictionaries, parallel corpora, and dictionaries of synonyms, resulted in the translation of about 45% of the PWN entries. The thesaurus contains 18K nouns, 6K adjectives, 5.5K verbs, and 1.8K adverbs; no systematic quality assessment of the obtained data was performed (Gelfenbeyn et al., 2003).

Another attempt to translate the PWN into Russian, in this case in a semi-automatic fashion, is the Russian Wordnet project (Balkova et al., 2004), started in 2003; its deliverables are not available to the general public.

Russian Wiktionary⁶ can be seen as an ersatz of a proper thesaurus, since along with definitions it contains—though marginally—semantic relations. The Wikokit project⁷ allows handling Wiktionary data as a relational database (Krizhanovsky and Smirnov, 2013). Russian Wiktionary contains about 190K word entries and 70K synonym relations as of September 2015.

The Universal Networking Language⁸ project is dedicated to the development of a computer language that replicates the functions of natural languages. The Russian version of its semantic network—the Universal Dictionary of Concepts—contains approximately 62K universal words (UWs) and 90K links between them and is available⁹ under a CC BY-SA license.

One of the recent trends is the creation of semantic resources in a fully automatic manner, where collaboratively created resources like Wikipedia and Wiktionary are used as the input. A striking example of this approach is BabelNet, a very large automatically generated multilingual thesaurus (Navigli and Ponzetto, 2012); the Russian part of BabelNet consists of 2.37M lemmas, 1.35M synsets, and 3.7M word senses¹⁰. The data is accessible through an API under a CC BY-NC-SA 3.0 license. No evaluation of the Russian data has been performed yet.

As can be seen from this survey, no open human-crafted wordnet for Russian is available so far. Automatically created resources are freely available and potentially have very good coverage, but their quality is disputed.
2.2 Thesauri of Other Slavic Languages

Slavic languages are highly inflectional and have a rich derivation system. A survey of the wordnets for Czech (Pala and Smrž, 2004), Polish (Maziarz et al., 2014) and Ukrainian (Anisimov et al., 2013) shows that in each case special attention is paid to dealing with these morphological characteristics. For instance, plWordNet features a versatile system of relations with dozens of subtypes of relations between synsets and lexical units, many of which reflect derivational relations.

2.3 Crowdsourcing Language Resources

Crowdsourcing, a human-computer technique for collaborative problem solving by online communities, has gained high popularity since its inception in the mid-2000s (Kittur et al., 2013). Creation and expansion of linguistic resources using crowdsourcing has become a trend in recent years, as shown by Gurevych and Kim (2013).

Despite the ongoing discussions about the types, merits and limitations of crowdsourcing (Wang et al., 2013), we consider the following genres of crowdsourcing: wisdom of the crowds (WotC), mechanized labor (MLab), and games with a purpose (GWAPs).

³ http://project.phil.spbu.ru/RussNet/
⁴ http://labinform.ru/pub/ruthes/
⁵ http://wordnet.ru/
⁶ http://ru.wiktionary.org/
⁷ https://github.com/componavt/wikokit
⁸ http://www.undl.org/
⁹ https://github.com/dikonov/Universal-Dictionary-of-Concepts
¹⁰ http://babelnet.org/stats
In the WotC genre, the resource is constructed explicitly by a crowd of volunteers who collaborate in an online editing environment. Their participation is mostly altruistic, and a participant's reward is self-satisfaction or self-promotion of some kind. Successful examples of this genre are Wikipedia and Wiktionary. The primary issues of such resources are vandalism and "edit wars", which are usually resolved by edit patrolling and edit protection.

In the MLab genre, the resource is created implicitly by workers who submit answers to simple tasks provided by the requester. This genre has proven effective in many practical applications. For instance, Lin and Davis (2010) extracted ontological structure from social tagging systems and engaged workers in its evaluation. Rumshisky (2011) used crowdsourcing to create an empirically derived sense inventory and proposed an approach for automated assessment of the obtained data. Biemann (2013) described how workers can contribute to thesaurus creation by solving simple lexical substitution tasks. Most of these studies have been conducted on commodity platforms like Amazon Mechanical Turk¹¹ (MTurk) and CrowdFlower¹². Unfortunately, MTurk can hardly be used for tasks requiring knowledge of Russian because (1) there are virtually no workers from Russia present on the platform (Pavlick et al., 2014), and (2) a requester must have a U.S. billing address to submit tasks¹³. Having no access to the global online labor marketplaces is a serious obstacle to paying the workers, due to the requirements of the local legislation of Russia. However, projects like OpenCorpora are trying to work around this problem by developing custom crowdsourcing platforms and effectively appealing to altruism instead of monetary reward (Bocharov et al., 2013). Since such altruistic mechanized labor does not imply monetary reward, it is not prone to spam, in which an unfair worker persistently submits random answers instead of sensible ones.

In the GWAPs genre, the crowdsourcing process is embedded into a multi-player game, in which the players have to accomplish various goals by creating new data items to win the game. Although such games are attractive and entertaining, game development is an expensive and complex activity that may be feasible only for large-scale annotation projects. Examples here are Phrase Detectives¹⁴ and JeuxDeMots¹⁵.
3 YARN Essentials

YARN is conceptually similar to Princeton WordNet (Fellbaum, 1998) and its followers: it consists of synsets—groups of quasi-synonyms corresponding to a concept. Concepts are linked to each other, primarily via hierarchical hyponymic/hypernymic relationships.

3.1 YARN Structure

Each single-word entry in YARN is characterized by its grammatical features (part of speech and inflection type) according to Zaliznyak's dictionary (1977). Synsets may include single-word entries {суффикс (suffix)}, multi-word expressions {подводная лодка (submarine)}, and abbreviations {ПО (программное обеспечение, software)}. Synsets may contain a definition (a gloss in PWN terms). Additionally, definitions can be attached to individual words in a synset—these definitions are inherited from the dictionary data and specify a word meaning, but cannot serve as a good definition for the whole synset. "Empty synsets" (i.e. synsets containing no words) that correspond to non-lexicalized concepts are legitimate and help to create a more harmonious hierarchy of synsets.

Each word in a synset can be accompanied by one or more usage examples. Words within synsets can carry labels from five categories: emotional, stylistic, chronological, domain/territorial, and semantic (28 labels in total). This list is the result of a systematization of the large and diverse Wiktionary label set. One of the synset words can be marked as the head word: its sense is stylistically neutral, and it encompasses the meaning of the whole synset, e.g. {армия (army), войска (troops), вооружённые силы (armed forces)}. Each synset may belong to a domain, e.g. {кино (movie), кинофильм (motion picture), фильм (film)} → "Arts", {думать (to think), размышлять (to ponder)} → "Intellect".
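The synset model described in this subsection can be summarized in a small data sketch. The following Python fragment is purely illustrative (YARN itself is a Ruby on Rails application); all class and field names are our own, not the project's.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WordEntry:
    """A word inside a synset, with its own dictionary-level data."""
    lemma: str                        # single word, MWE, or abbreviation
    definition: Optional[str] = None  # word-level definition from a dictionary
    examples: list[str] = field(default_factory=list)
    labels: list[str] = field(default_factory=list)  # e.g. stylistic labels
    is_head: bool = False             # stylistically neutral head word

@dataclass
class Synset:
    """A group of near-synonyms corresponding to one concept."""
    words: list[WordEntry]            # may be empty: non-lexicalized concept
    definition: Optional[str] = None  # synset-level gloss
    domain: Optional[str] = None      # e.g. "Arts", "Intellect"

# The {армия, войска, вооружённые силы} example from the text:
army = Synset(words=[WordEntry("армия", is_head=True),
                     WordEntry("войска"),
                     WordEntry("вооружённые силы")])
assert sum(w.is_head for w in army.words) == 1  # exactly one head word
```

An "empty synset" from the text is simply `Synset(words=[])` in this sketch.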
The vertical, hypo-/hypernymic relations between synsets are decisive for the hierarchical macrostructure of the thesaurus. The root of the YARN hierarchy is {предмет (entity), объект (object), вещь (thing)}; the second level is represented by {физическое явление (physical phenomenon)}, {отвлечённое понятие, абстрактное понятие, абстракция (an abstraction)}, {совокупность, набор (set), группа (group)}, and {воображаемое, представляемое (imaginary)}. We elaborated the 4–5 top levels for each part of speech.

The vertical links in YARN are also formed by meronymy (part–whole) relations: ноздря (nostril)—нос (nose)—лицо (face)—голова (head). The antonymy relationship connects specific words in the context of the corresponding synsets. For example, the verb прибыть (to arrive) is the antonym of the verb отбыть (to depart), but not of направиться (to head somewhere) or the other words in that synset.

In the future, YARN will reflect cross-POS relations between derivates: {двигаться (to move), движение (movement)}, {лес (forest), лесной (forest, adj.)}. This will be significant for word pairs with a minimal difference in senses.

3.2 Raw Data

As the "raw data" for thesaurus construction we employed existing resources such as Wiktionary (which constituted the core of the input data), Wikipedia (redirects), the aforementioned result of the automatic translation of the PWN, the Universal Dictionary of Concepts, and the data from two public-domain dictionaries. We also use the data from the Russian National Corpus (RNC) implicitly: the corpus statistics influence the queue of words presented to the editors. Wikipedia and the RNC were also used to compile the list of multi-word expressions to be included in the resource.

3.3 User Interface

Our initial approach to synset building is based on the WotC genre and inspired by the highly successful examples of Wikipedia and Wiktionary: our editors assemble synsets using word lists and definitions from dictionaries as the "raw data". Technically, virtually anybody can edit the YARN data—one only needs to log in with a social network account. However, the task design implies minimal lexicographic skills and is more complicated than an average task offered, for instance, to MTurk workers. Our target editors are college or university students, preferably from linguistics departments, who are native Russian speakers. It is desirable that the students receive instructions from a university teacher and can seek their advice in complex cases. YARN differentiates two levels of contributors: line editors and moderators. Moderators are authorized to approve thesaurus elements, thus protecting them from being modified by line editors.

The current synset editing interface can be accessed online¹⁶; its main window is presented in Figure 1. The "raw data" are placed on the left-hand side of the interface: definitions of the initial word and its usage examples, and possible synonyms for each of the meanings, with definitions and examples for each of the synonyms. The right-hand part represents the resulting synsets, including words, definitions, and examples. In principle, an editor can assemble a "minimal" synset from the dictionary "raw data" with just a few mouse clicks, without any typing.

Figure 1: YARN synset assembly interface (the interface captions are translated into English for the convenience of the readers; originally, all interface elements are in Russian).

Figure 2: XML representation of the synset {суп, бульон, похлёбка (soup)}.

¹¹ https://www.mturk.com/mturk/welcome
¹² http://crowdflower.com/
¹³ https://requester.mturk.com/help/faq#can_international_requesters_use_mturk
¹⁴ https://anawiki.essex.ac.uk/phrasedetectives/
¹⁵ http://www.jeuxdemots.org/
¹⁶ http://russianword.net/editor

Synset assembly begins with a word, or "synset starter". The editor selects an item from the list of words ranked by decreasing frequency; the already processed words are shaded. The editor can go through the words one after another or choose an arbitrary word using the search box. The top-left pane displays definitions of the initial word and usage examples, if any. The possible synonyms of the initial word are listed on the bottom-left pane; they in turn contain their definitions and examples. The top-right pane displays a list of synsets containing the initial word. The editor can copy definitions and usage examples of the initial word
from the top-left pane of the interface to the current synset by clicking the mouse. From the synonyms pane one can transfer words along with their definitions and examples. The editor can also add a new word to the list of synonyms; it will appear with dictionary definitions and examples if present in the parsed data. If the editor is not satisfied with the collected definitions, they can create a new one, either from scratch or based on one of the existing descriptions. Using search in the Russian National Corpus¹⁷ and OpenCorpora¹⁸, the editor can add usage examples. Additionally, a word or a definition within a synset can be flagged as "main" and be provided with labels. All synset edits are tracked and stored in the database along with timestamps and the editor ID.

As a pilot study showed, editors spent about two minutes on average to compile a non-trivial synset, i.e. one containing more than a single word. The top contributors demonstrated a learning effect: the average time per synset tended to decrease as the editor proceeded through the tasks; see Braslavski et al. (2014) for details.

Our next goal is to lower the threshold of participation in the data annotation and thus to increase the number of participants. To this end, we are developing a mobile application in the MLab genre aimed at gathering "raw synsets": users are presented with a series of sentences with highlighted words and lists of possible contextual substitutes. This approach is similar to the experiment described by Biemann (2013).
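Such a substitution task has a very simple shape. The sketch below is our own illustration of the idea behind the planned mobile application; the function and field names are hypothetical.

```python
# A lexical substitution micro-task: a sentence with one highlighted word
# and a list of candidate substitutes; the user ticks the candidates that
# preserve the sentence's meaning, yielding a "raw synset" seed.

def make_task(sentence: str, target: str, candidates: list[str]) -> dict:
    """Build one substitution task shown to a user."""
    assert target in sentence.split(), "target must occur in the sentence"
    return {
        "sentence": sentence,
        "target": target,
        "candidates": candidates,
        "answers": [],  # indices of the candidates accepted by the user
    }

def accepted_substitutes(task: dict) -> list[str]:
    """The accepted candidates, which seed a 'raw synset'."""
    return [task["candidates"][i] for i in task["answers"]]

task = make_task("Bob walks to the store", "walks",
                 ["ambles", "eats", "strolls"])
task["answers"] = [0, 2]  # the user accepts "ambles" and "strolls"
print(accepted_substitutes(task))  # ['ambles', 'strolls']
```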
3.4 Implementation Details

The YARN data are stored in a centralized database that can be accessed through a web interface. In addition, distributed teams can work directly with the database through an API. The database is periodically exported to XML format. Since the original dictionaries and thesauri came in different formats, we decided to develop a custom XML schema for data export¹⁹. We believe that the XML format provides sufficient flexibility and preserves the connection to the internal data representation. The developed format is modular, as different types of objects (lexical units, synsets, and relationships) are described separately. The proposed format is somewhat similar to the Lexical Markup Framework (LMF)²⁰ approach, although the YARN format does not refer to the latter directly. All editing actions (in fact, aggregated "action chunks") are stored in the database; the YARN format stores the revision history analogously to the OpenStreetMap XML format²¹. A synset structure is illustrated in Figure 2.

The YARN software is implemented using the Ruby on Rails framework. All data are stored in a PostgreSQL database. The user interface is implemented as a browser JavaScript application, which interacts with the back-end via a JSON API. User authentication is performed through OAuth endpoints provided by Facebook, VK, and GitHub. The entire source code of the project is available in a GitHub repository²².

3.5 Current State and Problems

The current version of YARN (September 2015) contains 44K synsets that consist of 48K words and 5.4K multi-word expressions; 838 words carry labels, and 2.6K words are provided with at least one usage example (4.2K examples in total). The resource contains 2.5K synset-level and 8.3K word-level definitions. The synset size distribution is presented in Figure 3.

¹⁷ http://ruscorpora.ru/en/
¹⁸ http://opencorpora.org/
¹⁹ https://github.com/russianwordnet/yarn-formats/
²⁰ http://www.lexicalmarkupframework.org/
²¹ http://www.openstreetmap.org/
²² https://github.com/russianwordnet
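As an aside on the XML export mentioned in Section 3.4: the actual schema lives in the yarn-formats repository, so the fragment below is only a rough, hypothetical sketch of serializing a synset to XML. The element and attribute names are our assumptions, not the real YARN format.

```python
import xml.etree.ElementTree as ET

def synset_to_xml(synset_id: str, words: list[str], definition: str) -> str:
    """Serialize one synset to an XML fragment (hypothetical schema)."""
    synset = ET.Element("synset", id=synset_id)
    ET.SubElement(synset, "definition").text = definition
    for w in words:
        ET.SubElement(synset, "word").text = w
    return ET.tostring(synset, encoding="unicode")

# The soup synset from Figure 2, with an illustrative gloss:
xml = synset_to_xml("s1", ["суп", "бульон", "похлёбка"], "a liquid dish")
```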
Figure 3: Synset distribution by size (number of synsets by number of words in a synset, from 1 to 13+).

Figure 4: Distribution of users by edit count (number of users by number of edits, bucketed from (0, 10] to (1K, +Inf)).
More than 200 people have taken part in editing YARN over the course of the project; the distribution of users by activity is shown in Figure 4. Whereas we consider the early experiment with a controlled crowd to be successful, we found three significant problems recurring over time: organization issues, synset duplication, and hyponymy/synonymy confusion.

Organization Issues. The number of synsets was growing rapidly, and the moderators were not able to assess all the incoming edits. To work around this problem, we are experimenting with MLab workflows.

Synset Duplication. Participants do not consult other people's work, which results in the creation of duplicate synsets like {авто (auto), автомобиль (automobile), машина (car)} and {машина (car), тачка (ride)}.

Hyponymy Confusion. In some cases the participants mix up hyponymy and synonymy, which results in strange synsets like {мультфильм (cartoon), мультик (cartoon), аниме (anime)}.

4 Evaluation

We compared YARN with the other Russian thesauri (Kiselev et al., 2015) described in Section 2.1 (Table 1). Besides YARN, the only resource available for use is RuThes-lite, the commercial use of which requires licensing. It should be noted that although the lexicon of YARN comprises 100K+ words, only half of them are included in synsets; thus, we report the latter number. The number of concepts indicates that crowdsourcing is a promising approach to thesaurus creation for the Russian language. Interestingly, YARN contains more concepts than RussNet, a project started in 1999. However, when comparing YARN and RuThes-lite, one may notice that they have an approximately equal number of concepts, yet the number of words in the latter is twice that in YARN. This suggests that expert-built thesauri include richer lexis than can be covered by non-expert users. Hence, the quality of YARN synsets requires more thorough evaluation.

4.1 Synset Quality

Since YARN is created using crowdsourcing, it seems reasonable to apply this technique for evaluation purposes, too. In our experiments we used an open-source engine for MLab workflows (Ustalov, 2015). In order to estimate the quality of the current YARN synsets, we retrieved the 200 most frequently edited synsets. We asked four experts to assess the quality of each synset by rating it on the following scale: Excellent—the synset completely represents a concept; Satisfactory—the synset is related to the concept, but some words are missing or odd words are present; Bad—the synset is either ambiguous or does not represent any sensible concept.

We aggregated the 800 obtained answers using the majority voting strategy, where ties are resolved by choosing the worse of the two answers; e.g., given the same number of votes for Excellent and Bad, the latter is selected. This resulted in 103 synsets of Excellent, 70 of Satisfactory, and 27 of Bad quality. The results are shown in Table 2. Values in column MV are the numbers of synsets per grade; values in the last three columns are the numbers of synsets grouped by answer diversity: all answers are the same in column 1, two different answers are present in column 2, and the expert opinions are divided three ways in column 3.
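The aggregation rule just described (majority vote with ties broken toward the worse grade) can be sketched in a few lines. This is our illustration of the strategy, not the code of the crowdsourcing engine:

```python
from collections import Counter

# Grades ordered from best to worst; a higher index means a worse grade.
GRADES = ["Excellent", "Satisfactory", "Bad"]

def aggregate(answers: list[str]) -> str:
    """Majority vote over expert answers; ties go to the worse grade."""
    counts = Counter(answers)
    top = max(counts.values())
    tied = [g for g in GRADES if counts[g] == top]
    # Among the most-voted grades, pick the worst one (largest index).
    return max(tied, key=GRADES.index)

print(aggregate(["Excellent", "Excellent", "Bad", "Satisfactory"]))  # Excellent
print(aggregate(["Excellent", "Excellent", "Bad", "Bad"]))           # Bad
```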
We also computed the alpha annotator reliability coefficient for ordinal values to estimate the inter-rater reliability (Krippendorff, 2013). Krippendorff's alpha is α = 0.202 due to the skewness of the answer distribution: more than half of the answers (434) are Excellent, while the numbers of Satisfactory and Bad answers are 253 and 113, respectively. Given these results, we treat the top 200 YARN synsets as sufficiently good. These evaluation results define the upper bound for the average quality of the resource in its current state. Ustalov (2014) showed that revision count is a good proxy for quality in the Russian Wiktionary, which is created in a similar fashion.

Table 1: Russian thesauri comparison.

                  # of concepts  # of relations  # of words  Availability  Commercial Usage
RussNet           5.5K           8K              15K         No            No
Russian Wordnet   157K           —               124K        No            No
RuThes            55K            210K            158K        No            No
RuThes-lite       26K            108K            115K        Yes           No
YARN              44K            0               48.6K       Yes           Yes

Table 2: YARN synset quality.

               MV   1    2    3
Excellent      103  37   62   21
Satisfactory   70   3    43   11
Bad            27   0    12   11
Total          200  40   117  43

4.2 Duplicate Synsets

Sometimes users create new synsets without investigating the synsets already present in YARN. The main problem with this is the presence of multiple entries for the same concept in the resource. Detecting such concepts requires special effort, because they are described not by identical synsets but by similar ones.

Hence, we had to develop a method for automatically retrieving duplicate synsets. It is based on the heuristic that any two synonyms uniquely define a concept. This is not always true, but it lets us discover duplicate synsets with very good recall. To estimate it, we compared the senses of 200+ random synset pairs having two or more words in common. It turned out that in more than 85% of the cases these pairs described the same sense.

However, we found out that non-linguists do not recognize subtle nuances of meaning that are noticeable to experts, so the non-linguists cannot significantly improve the quality of duplicate extraction. Thus, this method—considering any synsets having two or more words in common as duplicates—allows us to detect and merge identical concepts with a quality comparable to what can be achieved by volunteers.

5 Conclusion

The deliverables of YARN are available under the CC BY-SA 3.0 license on the project website²³ in XML, CSV, and RDF formats. We have the following plans for future work:

• Creating verb and adjective synsets.

• Establishing hierarchical links between synsets through validation of the relationships imported from Wiktionary and other resources.

• Developing automatic methods for generating hypotheses based on Wikipedia and large text corpora.

• Developing automatic methods for preparing "raw data", as well as for post-processing the annotation results produced by the crowd.

• Widening the audience of the project's participants through mobile applications and simpler tasks.

• Developing crowd management methods, such as automatic methods for evaluating workers, task difficulty, and annotation results, a system of incentives, etc.

Acknowledgments

This work is supported by the Russian Foundation for the Humanities, project no. 13-04-12020 "New Open Electronic Thesaurus for Russian". We are grateful to Yulia Badryzlova for proofreading the text. We would also like to thank the three anonymous reviewers, who offered very helpful suggestions.

²³ http://russianword.net/data
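As a closing illustration of the duplicate-detection heuristic from Section 4.2 (synsets sharing two or more words are treated as describing the same concept), here is a minimal sketch with hypothetical names; it is not the project's implementation:

```python
from itertools import combinations

def duplicate_candidates(synsets: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return pairs of synset ids sharing at least two words."""
    pairs = []
    for (id1, words1), (id2, words2) in combinations(synsets.items(), 2):
        if len(words1 & words2) >= 2:  # two shared synonyms define a concept
            pairs.append((id1, id2))
    return pairs

# The duplicated "car" synsets from Section 3.5:
synsets = {
    "s1": {"авто", "автомобиль", "машина"},
    "s2": {"машина", "тачка", "автомобиль"},
    "s3": {"думать", "размышлять"},
}
print(duplicate_candidates(synsets))  # [('s1', 's2')]
```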
References
Klaus Krippendorff. 2013. Content Analysis: An Introduction to Its Methodology. SAGE, Thousand
Oaks, CA, USA, 3rd edition.
Anatoly Anisimov, Oleksandr Marchenko, Andrey
Nikonenko, et al. 2013. Ukrainian WordNet: Creation and Filling. In Flexible Query Answering Systems, volume 8132 of Lecture Notes in Computer
Science, pages 649–660. Springer Berlin Heidelberg.
Andrew A. Krizhanovsky and Alexander V. Smirnov.
2013. An approach to automated construction of
a general-purpose lexical ontology based on wiktionary. Journal of Computer and Systems Sciences
International, 52(2):215–225.
Irina Azarova, Olga Mitrofanova, Anna Sinopalnikova,
Maria Yavorskaya, and Ilya Oparin. 2002. RussNet:
Building a Lexical Database for the Russian Language. In Proc. of Workshop on WordNet Structures
and Standardisation, and How These Affect WordNet Applications and Evaluation, pages 60–64, Gran
Canaria, Spain.
Huairen Lin and Joseph Davis. 2010. Computational
and Crowdsourcing Methods for Extracting Ontological Structure from Folksonomy. In The Semantic Web: Research and Applications, volume 6089 of
Lecture Notes in Computer Science, pages 472–477.
Springer Berlin Heidelberg.
Natalia Loukachevitch. 2011. Thesauri in information
retrieval tasks. Moscow University Press, Moscow,
Russia.
Valentina Balkova, Andrey Sukhonogov, and Sergey Yablonsky. 2004. Russian WordNet. In Proceedings of the Second International WordNet Conference—GWC 2004, pages 31–38, Brno, Czech Republic. Masaryk University.
Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, and
Stan Szpakowicz. 2014. plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources. In
Proceedings of the Seventh Global Wordnet Conference, pages 304–312, Tartu, Estonia.
Chris Biemann. 2013. Creating a system for lexical substitutions from scratch using crowdsourcing.
Language Resources and Evaluation, 47(1):97–122.
Roberto Navigli and Simone Paolo Ponzetto. 2012.
BabelNet: The automatic construction, evaluation
and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–
250.
Victor Bocharov, Svetlana Alexeeva, Dmitry Granovsky, et al. 2013. Crowdsourcing morphological annotation. In Computational Linguistics and
Intellectual Technologies: papers from the Annual
conference “Dialogue”, volume 12(19), pages 109–
124, Moscow, Russia. RGGU.
Karel Pala and Pavel Smrž. 2004. Building Czech
Wordnet. Romanian Journal of Information Science
and Technology, 7(1–2):79–88.
Pavel Braslavski, Dmitry Ustalov, and Mikhail
Mukhin. 2014. A Spinning Wheel for YARN: User
Interface for a Crowdsourced Thesaurus. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for
Computational Linguistics, pages 101–104, Gothenburg, Sweden. Association for Computational Linguistics.
Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev,
and Chris Callison-Burch. 2014. The Language Demographics of Amazon Mechanical Turk. Transactions of the Association for Computational Linguistics, 2:79–92.
Anna Rumshisky. 2011. Crowdsourcing Word Sense
Definition. In Proceedings of the 5th Linguistic
Annotation Workshop, LAW V ’11, pages 74–81,
Stroudsburg, PA, USA. Association for Computational Linguistics.
Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.
Ilya Gelfenbeyn, Artem Goncharuk, Vlad Lekhelt,
et al. 2003. Automatic translation of WordNet semantic network to Russian language. In Proceedings of Dialog-2003.
Dmitry Ustalov. 2014. Words Worth Attention: Predicting Words of the Week on the Russian Wiktionary. In Knowledge Engineering and the Semantic Web, volume 468 of Communications in
Computer and Information Science, pages 196–207.
Springer International Publishing.
Iryna Gurevych and Jungi Kim, editors. 2013. The
People’s Web Meets NLP. Springer Berlin Heidelberg.
Dmitry Ustalov. 2015. A Crowdsourcing Engine for
Mechanized Labor. Proceedings of the Institute for
System Programming, 27(3):351–364.
Yuri Kiselev, Sergey V. Porshnev, and Mikhail Mukhin.
2015. Current Status of Russian Electronic Thesauri: Quality, Completeness and Availability. Programmnaya Ingeneria, (6):34–40.
Aobo Wang, Cong Duy Vu Hoang, and Min-Yen Kan.
2013. Perspectives on crowdsourcing annotations
for natural language processing. Language Resources and Evaluation, 47(1):9–31.
Aniket Kittur, Jeffrey V. Nickerson, Michael Bernstein,
et al. 2013. The Future of Crowd Work. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, CSCW ’13, pages 1301–
1318, New York, NY, USA. ACM.
Andrey Zaliznyak. 1977. Grammatical dictionary of
Russian. Russky yazyk, Moscow, USSR.
Word Substitution in Short Answer Extraction: A WordNet-based Approach

Qingqing Cai, James Gung, Maochen Guan, Gerald Kurlandski, Adam Pease
IPsoft, New York, NY, USA
[qingqing.cai|james.gung|maochen.guan|gerald.kurlandski|adam.pease]@ipsoft.com
Abstract

We describe the implementation of a short answer extraction system. It consists of a simple sentence selection front-end and a two-phase approach to answer extraction from a sentence. In the first phase, sentence classification is performed with a classifier trained with the passive-aggressive algorithm, utilizing the UIUC dataset and taxonomy and a feature set including word vectors. This phase outperforms the current best published results on that dataset. In the second phase, a sieve algorithm consisting of a series of increasingly general extraction rules is applied, using WordNet to find word types aligned with the UIUC classifications determined in the first phase. Some very preliminary performance metrics are presented.

1 Introduction

Short answer extraction refers to a set of information retrieval techniques that retrieve a short answer to a question from a sentence. For example, if we have the following question and answer sentence

(1) Q: Who was the first president of the United States?
    A: George Washington was the first president of the United States.

we want to extract just the phrase "George Washington". But what if we have a mismatch in language between question and answer? What is an appropriate measure for word similarity or substitution in question answering? If we have the question–answer pair

(2) "Bob walks to the store."
(3) "Who ambles to the store?"

we probably want to answer "Bob", because "walk" and "amble" are similar and not inconsistent. In isolation, a human would likely judge "walk" and "amble" to be similar, and by many WordNet-based similarity measures they would be judged similar, since "walk" is found in WordNet synsets 201904930, 201912893, 201959776 and 201882170, and "amble" is 201918183, which is a direct hyponym of 201904930.

We can use Resnik's method (Resnik, 1995) to compute similarity. In particular, we can use Ted Pedersen et al.'s implementation (Pedersen et al., 2004), which gives the result walk#n#4 amble#n#1 9.97400037941652. Word2Vec (Mikolov et al., 2013a), using their 300-dimensional vectors trained on Google News, also gives a relatively high similarity score for the two words:

> model.similarity('walk', 'amble')
0.525

2 Is Similarity the Right Measure?

But what if we have

(4) "Bob has an apple."
(5) "Who has a pear?"

We find that this pair is even more similar than "walk" and "amble":

> model.similarity('apple', 'pear')
0.645

and from Resnik's algorithm:

Concept #1: apple
Concept #2: pear
apple pear
apple#n#1 pear#n#1 10.15

and yet clearly (4) is not a valid answer to (5). One possibility is that synset subsumption as a measure of word substitution (Kremer et al., 2014; Biemann, 2013)¹ ² may be the appropriate metric, rather than word similarity.

¹ https://dkpro-similarity-asl.googlecode.com/files/TWSI2.zip
² http://www.anc.org/MASC/coinco.tgz
3
In the last decade, many systems have been
proposed for question classification (Li and Roth,
2006; Huang et al., 2008; Silva et al., 2011).
Li and Roth (Li and Roth, 2002) introduced a
two-layered taxonomy of questions along with a
dataset of 6000 questions divided into a training
set of 5000 and test set of 500. This dataset
(henceforth referred to as the UIUC dataset) has
since become a standard benchmark for question
classification systems.
There have been a number of advances in word
representation research. Turian et al. (Turian et al.,
2010) demonstrated the usefulness of a number of
different methods for representing words, including word embeddings and Brown clusters (Brown
et al., 1992), within supervised NLP application
such as named entity recognition and shallow
parsing. Since then, largely due to advances in
neural language models for learning word embeddings, such as W ORD 2V EC (Mikolov et al.,
2013b), word vectors have become essential features in a number of NLP applications.
In this paper, we describe a new model for question classification that takes advantage of recent
work in word embedding models, beating the previous state-of-the-art by a significant margin.
3 Question Answering

Our approach starts with the user's question and the sentence that is most likely to contain the answer, which is selected with the BM25 algorithm (Jones et al., 2000). Then we identify the incoming question as a particular question type according to the UIUC taxonomy3. To this taxonomy we have added the yes/no question type. Then we pass the sentence and the question to a class written specifically to handle that particular UIUC question type. Generally, all the base question types behave differently from one another. Within a base question type, subtypes may be handled generically or with code specially targeted for that subtype. In this paper, we first discuss the approach to question classification, and then answer extraction, with a focus on the question subtypes that are amenable to a WordNet-based approach.
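The sentence-selection step can be sketched as follows; this is a toy BM25 implementation under our own naming, not the system's code, and the defaults k1 = 1.5 and b = 0.75 are common choices rather than values reported in the paper:

```python
import math
from collections import Counter

def bm25_rank(question, sentences, k1=1.5, b=0.75):
    """Rank candidate answer sentences against a question with BM25,
    treating each sentence as a document."""
    docs = [s.lower().split() for s in sentences]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    q_terms = question.lower().split()
    # document frequency of each question term over the sentences
    df = {t: sum(1 for d in docs if t in d) for t in set(q_terms)}
    scored = []
    for sent, d in zip(sentences, docs):
        tf = Counter(d)
        score = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scored.append((score, sent))
    scored.sort(key=lambda p: -p[0])
    return [sent for _, sent in scored]
```

A production system would of course rank over an index rather than an in-memory list.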
4 Question Classification
This section presents a question classifier with several novel semantic and syntactic features based on the extraction of question foci. We use several sources of semantic information to represent features for each question focus. Our model uses a simple margin-based online algorithm. We achieve state-of-the-art performance on both fine-grained and coarse-grained question classification. As the focus of this paper is on WordNet, we leave many details to a future paper and primarily report the features used, the learning algorithm and results, without further justification.
4.1.1 Question Focus Extraction
Question foci (also known as headwords) have
been shown to be an important source of information for question analysis. Therefore, their
accurate identification is a crucial component of
question classifiers. Unlike past approaches using
phrase-structure parses, we use rules based on a
dependency parse to extract each focus.
We first extract the question word (how, what, when, where, which, who, whom, whose, or why) or imperative (name, tell, say, or give). This is done by naively choosing the first question word in the sentence, or the first imperative word if no question word is found. This approach works well in practice, though a more advanced method may be beneficial in domains more general than the TREC (Voorhees, 1999) questions of the UIUC dataset.
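The naive question-word selection just described can be sketched as follows (an illustrative helper under our own naming, not the authors' code):

```python
QUESTION_WORDS = {"how", "what", "when", "where", "which",
                  "who", "whom", "whose", "why"}
IMPERATIVES = {"name", "tell", "say", "give"}

def extract_question_word(tokens):
    """Naively pick the first question word in the sentence, or the
    first imperative word if no question word is found."""
    lowered = [t.lower() for t in tokens]
    for tok in lowered:
        if tok in QUESTION_WORDS:
            return tok
    for tok in lowered:
        if tok in IMPERATIVES:
            return tok
    return None
```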
We then define specific rules for each type of question word. For example, what/which questions are treated differently than how questions. In how questions, we identify words like much and many as question foci, while treating the heads of these words (e.g. feet or people) as a separate type known as QUANTITY (as opposed to FOCUS). Furthermore, when the focus of a how question
1 Introduction
Question analysis is a crucial step in many successful question answering systems. Determining
the expected answer type for a question can significantly constrain the search space of potential answers. For example, if the expected answer type
is country, a system can rule out all documents
or sentences not containing mentions of countries.
Furthermore, accurately choosing the expected answer type is extremely important for systems that
use type-specific strategies for answer selection. A
system might, for example, have a specific unit for
handling definition questions or reason questions.
3 http://cogcomp.cs.illinois.edu/Data/QA/QC/definition.html; http://cogcomp.cs.illinois.edu/Data/QA/QC/
is itself the head (e.g. how much did it cost? or how long did he swim?), we again differentiate the type, using a MUCH type and a SPAN type that includes words like long and short.
A head chunk such as type of car contains two words, type and car, which both provide potentially useful information about the question type. We refer to words such as type, kind, and brand as specifiers. We extract the argument of a specifier (car) as well as the specifier itself (type) as question foci.
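A minimal sketch of this specifier rule, assuming the head chunk is already tokenized (hypothetical helper names, not the authors' code):

```python
SPECIFIERS = {"type", "kind", "brand"}

def specifier_foci(head_chunk):
    """From a head chunk like ['type', 'of', 'car'], extract both the
    specifier ('type') and its argument ('car') as question foci."""
    foci = []
    for i, tok in enumerate(head_chunk):
        if tok in SPECIFIERS:
            foci.append(tok)
            # the argument is the noun governed by the "of" that follows
            if i + 2 < len(head_chunk) and head_chunk[i + 1] == "of":
                foci.append(head_chunk[i + 2])
    return foci
```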
In addition to head words of the question word, we also extract question foci linked to the root of the question when the root verb is an entailment word such as is, called, named, or known. Thus, for questions like What is the name of the tallest mountain in the world?, we extract name and mountain as question foci. This can result in many question foci, as in a sentence like What relative of the racoon is sometimes known as the cat-bear?
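This root-entailment rule can be sketched over dependency triples as follows; the edge labels (nsubj, nmod) and the example parse are illustrative, not the output of the authors' parser:

```python
ENTAILMENT_VERBS = {"is", "called", "named", "known"}

def root_linked_foci(dep_edges, root):
    """dep_edges are (head, relation, dependent) triples from a
    dependency parse. When the root verb is an entailment word, its
    nominal subject and any noun that subject governs through a
    nominal modifier become question foci."""
    if root.lower() not in ENTAILMENT_VERBS:
        return []
    foci = [dep for head, rel, dep in dep_edges
            if head == root and rel == "nsubj"]
    for focus in list(foci):
        foci += [dep for head, rel, dep in dep_edges
                 if head == focus and rel == "nmod"]
    return foci

# toy parse of "What is the name of the tallest mountain in the world?"
edges = [("is", "nsubj", "name"),
         ("is", "attr", "What"),
         ("name", "nmod", "mountain")]
```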
Feature Set                       Coarse   Fine
All                               96.2     92.0
-clusters                         96       90.2
-vectors                          95.4     90
-clusters, vectors                95.2     89.8
-lists                            94       88
-clusters, vectors, lists         92.8     86.2
-definition disambiguation        94.8     91
-quantity focus differentiation   96       90.2

Table 2: Feature ablation study: accuracies on coarse and fine-grained labels after removing specific features from the full feature set.
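The ablation setup behind Table 2 can be sketched as follows, with `train_fn` and `eval_fn` as stand-ins for the actual training and scoring routines:

```python
def ablation(train, test, feature_groups, train_fn, eval_fn):
    """Re-train and re-score a classifier with each feature group
    removed in turn, mirroring the setup behind Table 2."""
    full = set(feature_groups)
    results = {"All": eval_fn(train_fn(train, full), test)}
    for group in feature_groups:
        model = train_fn(train, full - {group})
        results["-" + group] = eval_fn(model, test)
    return results
```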
System              Fine   Coarse
Li and Roth 2002    84.2   91.0
Huang et al. 2008   89.2   93.4
Silva et al. 2011   90.8   95.0
Our System          92.0   96.2

Table 3: System comparison of accuracies for fine (50-class) and coarse (6-class) question labels.
4.1.2 Learning Algorithm

We apply an in-house implementation of the multi-class Passive-Aggressive algorithm (Crammer et al., 2006) to learn our model's parameters. Specifically, we use PA-I, with

    τ_t = min(C, l_t / ||x_t||²)

for t = 1, 2, ..., where C is the aggressiveness parameter, l_t is the loss, and ||x_t||² is the squared norm of the feature vector for training example t. The Passive-Aggressive algorithm's name refers to its behavior: when the loss is 0, the parameters are unchanged, but when the loss is positive, the algorithm aggressively forces the loss back to zero, regardless of step size. τ (a Lagrange multiplier) is used to control the step size. When C is increased, the algorithm makes a more aggressive update.

4.3 Discussion

Our model significantly outperforms all previous results for question classification on the UIUC dataset (Table 3). Furthermore, we accomplished this without significant manual feature engineering or rule-writing, using a simple online-learning algorithm to determine the appropriate weights.
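The PA-I update of Section 4.1.2 can be sketched for the multi-class case as follows; this is a toy rendition of the formula τ_t = min(C, l_t/||x_t||²), not the authors' in-house implementation:

```python
def pa1_update(weights, x, y_true, y_pred, loss, C=1.0):
    """One PA-I step: if the loss is positive, shift weight mass from
    the predicted class toward the true class with step size
    tau = min(C, loss / ||x||^2). `weights` maps each class to a
    feature-weight dict and `x` is a sparse feature dict."""
    if loss <= 0.0:               # passive: margin already satisfied
        return weights
    sq_norm = sum(v * v for v in x.values())
    if sq_norm == 0.0:
        return weights
    tau = min(C, loss / sq_norm)  # aggressiveness capped by C
    for feat, v in x.items():
        weights[y_true][feat] = weights[y_true].get(feat, 0.0) + tau * v
        weights[y_pred][feat] = weights[y_pred].get(feat, 0.0) - tau * v
    return weights
```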
4.2 Experiments

We replicate the evaluation framework used in (Li and Roth, 2006; Huang et al., 2008; Silva et al., 2011). We use the full, unaltered 5500-question training set from UIUC for training, and evaluate on the 500-question test set.

To demonstrate the impact of our model's novel features, we performed a feature ablation test (Table 2) in which we removed groups of features from the full feature set.

5 Answer Extraction

In this section we discuss techniques for short answer extraction once questions have been classified into a particular UIUC type. We employ a "sieve" approach, as in (Lee et al., 2011), which has seen some success in tasks like coreference resolution and is driving a bit of a renaissance of rule-based, as opposed to machine-learning, approaches in NLP. We provide in this paper one example of how, instead of taking an either/or approach, both methods can be combined into a high-performance system. We focus below on the sieves that are specific to question types where we have been able to profitably employ WordNet for finding the right short answer. Preliminary results employing this approach have been positive.

We have two strategies that are used across the base question types: employing semantic role labels and recognizing appositives.
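The sieve control flow described in Section 5 can be sketched as follows, with two toy sieves standing in for the real ones:

```python
def run_sieves(question, sentence, sieves):
    """Apply answer sieves in order of precision; return the first
    short answer produced, or None if every sieve abstains."""
    for sieve in sieves:
        answer = sieve(question, sentence)
        if answer is not None:
            return answer
    return None

# toy sieves: a (stubbed) appositive sieve backed off to a
# capitalized-word sieve
def appositive_sieve(question, sentence):
    return None  # a real sieve would inspect the dependency parse

def capitalized_sieve(question, sentence):
    for tok in sentence.split()[1:]:  # skip the sentence-initial token
        if tok[0].isupper():
            return tok
    return None
```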
Feature Type           guitar                                    Cup
Lemma                  guitar                                    cup
Shape                  x+                                        Xx+
Authority List         instrument                                sport
Word Vector*           vocals, guitars, bass, harmonica, drums   champions, championship, tournament
Brown Cluster Prefix   0010, 001010, 0010101100, ...             0111, 011101, 0111011000, ...

Table 1: Features used for head words. Each dimension of the corresponding word vector was used as a real-valued feature. *Nearest neighbors of the corresponding word vector are shown.
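Assembling the Table 1 feature types for one head word might look like the following sketch; the resource arguments are toy stand-ins for the real authority lists, Brown cluster paths and embeddings:

```python
def headword_features(word, authority_lists, brown_paths, vectors):
    """Assemble the Table 1 feature types for a single head word."""
    feats = {"lemma=" + word.lower(): 1.0}
    # shape: capitalized words map to Xx+, lower-case words to x+
    feats["shape=" + ("Xx+" if word[0].isupper() else "x+")] = 1.0
    for name, members in authority_lists.items():
        if word.lower() in members:
            feats["list=" + name] = 1.0
    # Brown cluster bit-string prefixes of increasing length
    path = brown_paths.get(word.lower(), "")
    for k in (4, 6, 10):
        if len(path) >= k:
            feats["brown=" + path[:k]] = 1.0
    # every dimension of the word vector is a real-valued feature
    for i, v in enumerate(vectors.get(word.lower(), [])):
        feats["dim%d" % i] = v
    return feats
```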
5.1 Corpus

Our current testing corpus consists of three parts. The first is an open source Q&A test set developed at Carnegie Mellon University (Smith et al., 2008)4, consisting of roughly 1000 question-answer pairs on Wikipedia articles. The second is a proprietary Q&A test set developed at IPsoft, a growing set of question-answer pairs currently numbering roughly 2000, conducted on short sections of Wikipedia articles. The third test set is TREC-8 (Voorhees, 1999).
5.2 Semantic Role Labels

We employ the semantic role labeling of ClearNLP (Choi, 2012)5. While the labels are consistent with PropBank (Palmer et al., 2005), ClearNLP fixes the definition of several of the labels (A2-A5) that are left undefined in PropBank. A0 is the "Agent" relation, which is often the subject of the sentence. A1 is the "Patient" or object of the sentence. The remainder can be found in (Choi, 2012).

Let's look at an example and list the steps followed in the code to analyse the question and answer.

(6) Q: What did Lincoln love?
    A: As a boy, Abraham Lincoln loved books.

We have the following dependency graphs among the tokens in each sentence:

(7) [dependency graph of the question: love is the root, with aux did, nsubj Lincoln and dobj What]
(8) [dependency graph of the answer: loved is the root, with nsubj Lincoln, dobj books and a prep as attachment for boy]

and part of speech labels:

(9) What/WP did/VBD Lincoln/NNP love/VB ?
(10) As/IN a/DT boy/NN , Abraham/NNP Lincoln/NNP loved/VBD books/NNS .

and semantic role labels:

(11) [question: Lincoln is the A0 and What the A1 of love]
(12) [answer: Abraham Lincoln is the A0, books the A1 and "As a boy" the ARGM-PRD of loved]

1. We collect basic information from the question and answer sentence:
   (a) find the question word, e.g. "what", "when", "where", etc. In Example 6 it is "what-1";
   (b) locate the verb node nearest to the question word. In Example 6 it is "love-4";
   (c) find the semantic relations in the question. We find an Agent/A0 relationship

4 download from http://www.cs.cmu.edu/~ark/QA-data/
5 http://www.clearnlp.com
between Lincoln-3 and the verb love-4, and a Patient/A1 relationship between the question word What-1 and the verb love-4 (see Examples 11 and 12);
   (d) find the semantic relations in the answer sentence. We find an Agent/A0 relationship between Lincoln-6 and the verb loved-7, an ARGM-PRD relationship between As-1 and the verb loved-7, and a Patient/A1 relationship between books-8 and the verb loved-7 (see Examples 11 and 12);
   (e) perform a graph structure match between the question and answer graphs formed by the set of their semantic role labels. Find the parent graph node in the answer that matches as many nodes in the question as possible. In our example, loved-7 is the best match (see Examples 11 and 12).

2. Collect and score candidate answer nodes. Score each semantic child of the best parent found in the previous step, based on part of speech, named entity, dependency relations from Stanford's CoreNLP (Manning et al., 2014), and semantic role label information. We initialize each child to a value of 1.0 and then penalize it by 0.01 for the presence of any of a set of possible undesirable features, as follows:
   - The candidate's semantic role label starts with "ARGM", meaning that its semantic role is something other than A0-A5 (see Examples 11 and 12). This is only applied in cases where the question type has been identified as "Human" or "Entity".
   - The candidate node's dependency label is "prep*", indicating a prepositional relationship. This is only applied in cases where the question type has been identified as "Human" or "Entity".
   - The candidate node has the same form (word spelling) as a word in the question, or is a WordNet hyponym of one.
   - The candidate node has the same root (lemma) as a word in the question, or is a WordNet hyponym of one.
   - The candidate node is lower case. This is only applied in cases where the question type has been identified as "Human" or "Entity".
   - The candidate node has a child with a different semantic role label than in the question.
   - The candidate node is an adverb or a Wh-quantifier, as marked by its part of speech label.

3. Pick the dependency node with the highest confidence score as the answer node. In our example we have As-1 = 0.97, Lincoln-6 = 0.96 and books-8 = 0.99.

Note that the step of scoring the answer nodes enumerates a small feature set with hand-set coefficients. We expect in a future phase to enumerate a much larger set of features, and then set the coefficients based on machine learning over our corpus of question-answer pairs. One simple experiment to show the value of semantic role labeling was conducted on a portion of our testing corpus. Using semantic role labels we achieved a total of 638 correct answers out of 1460 questions (the total number in the IPsoft internal Q&A test set at the time of the test), for a correctness score of 43.7%. Without semantic role labels the result was 462 out of 1460, or 31.6%.

5.3 Appositives

The appositive is a grammatical construction in which one phrase elaborates or restricts another. For example,

(13) My cousin, Bob, is a great guy.

"Bob" further restricts the identity of "My cousin".

(14) [dependency parse of (13): guy is the root, with nsubj cousin, cop is, det a and amod great; Bob is an appos of cousin, which carries poss My]

We use the appositive grammatical relation to identify the answers to "What" questions.
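Reading an appositive answer off the dependency parse can be sketched as follows (the edge triples mirror Example 14; the helper names are ours, not the system's):

```python
def appositive_answer(dep_edges, focus):
    """Return the appositive attached to the question focus, if the
    parse of the answer sentence contains one. dep_edges are
    (head, relation, dependent) triples."""
    for head, rel, dep in dep_edges:
        if head == focus and rel == "appos":
            return dep
    return None

# toy parse of "My cousin, Bob, is a great guy."
edges = [("guy", "nsubj", "cousin"),
         ("guy", "cop", "is"),
         ("cousin", "appos", "Bob")]
```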
5.4 Entity Question Type

Short answer extraction for the Entity question type has some specialized rules for some subtypes, and some rules which are applied generally to all
the other subtypes. We are also exploring using WordNet (Fellbaum, 1998) synsets to get word lists that are members of each Entity subtype (see Table 4). This appears to have a significant effect, since 10 questions are answerable with this approach while addressing just two of the 22 Entity subtypes. More work is needed to get comprehensive statistics.

5.4.1 Entity.animal Subtype

1. First try to find an appositive relationship. If there is one, use it as the answer. For Example 14, if we ask "Who is a great guy?" we have a simple answer with "Bob" as the appositive. If that fails:
2. try the approach described above in subsection 5.2 and keep the candidate with the highest confidence score.

5.4.2 Entity.creative Subtype

1. First try to find an appositive relationship. If there is one, use it as the answer. If that fails:
2. try the approach described above in subsection 5.2 and keep the candidate with the highest confidence score. If that fails:
3. find the first capitalized sequence of words and return it.

5.4.3 All Other Entity Subtypes

1. First try to find an appositive relationship. If there is one, use it as the answer. If that fails:
2. try the approach described above in subsection 5.2 and keep the candidate with the highest confidence score.

5.5 Example

Take for example the following:

(15) Q: What athletic game did dentist William Beers write a standard book of rules for?
     A: In 1860, Beers began to codify the first written rules of the modern game of lacrosse. Short Answer: Lacrosse.

(16) Q: What shrubs can be planted that will be safe from deer?
     A: Three old-time charmers make the list of shrubs unpalatable to deer: lilac, potentilla, and spiraea. Short Answer: Lilac, potentilla, and spiraea.

Knowing from WordNet that 112310349:{lilac} and 112659356:{spiraea, spirea} (although not potentilla) are hyponyms of shrub makes it easy to find the right dependency parse subtree for the short answer. Similarly, knowing that 100455599:{game} is a hypernym of 100477392:{lacrosse} makes finding the right answer in the sentence easy.

6 UIUC Question Types and Synsets

Table 4 lists all the types and subtypes in the UIUC taxonomy and the WordNet (Fellbaum, 1998) synset numbers that correspond to semantic types for the UIUC types. These are used to get all words that are in the given synsets, as well as all words in the synsets that are more specific in the WordNet hyponym hierarchy than those listed. Note that below we prepend to the synset numbers a number for their part of speech. In the current scheme all are nouns, so the first number is always a "1". We only elaborate subtypes of Entity, Human, and Location, as the other categories do not use WordNet for matching.

7 Conclusion

Using a WordNet-based word replacement method appears to be better for question answering than using word similarity metrics. In preliminary tests, 10 questions in a portion of our corpora are answerable just by addressing two of the 22 Entity subtypes with WordNet-based matching. While more experimentation is needed, the results are intuitive and promising. The current approach should be validated and compared against other approaches on current data sets such as (Peñas et al., 2015).
Class           Definition                            Synsets
ABBREVIATION    abbreviation
ENTITY          entities
  animal        animals                               100015388
  body          organs of body                        105297523
  color         colors                                104956594
  creative      inventions, books and other           102870092, 103217458, 103129123
                creative pieces
  currency      currency names                        113385913, 113604718
  dis.med.      diseases and medicine                 114034177, 114778436
  event         events                                100029378
  food          food                                  100021265
  instrument    musical instrument                    103800933
  lang          languages                             106282651
  letter        letters like a-z
  other         other entities
  plant         plants                                100017222
  product       products                              100021939
  religion      religions                             108081668, 105946687
  sport         sports                                100433216, 100523513, 103414162
  substance     elements and substances               100020090
  symbol        symbols and signs
  technique     techniques and methods
  term          equivalent terms
  vehicle       vehicles                              103100490
  word          words with a special property
DESCRIPTION     description and abstract concepts
HUMAN           human beings
  group         a group or organization of persons    107950920
  ind           an individual                         102472293
  title         title of a person
  description   description of a person
LOCATION        locations
  city          cities                                108226335, 108524735
  country       countries                             108168978
  mountain      mountains                             109359803, 109403734
  other         other locations                       108630039
  state         states                                108654360
NUMERIC         numeric values

Table 4: UIUC class to WordNet synset mappings
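Expanding the Table 4 synsets into word lists via the hyponym hierarchy can be sketched as follows, over a toy two-synset fragment rather than the real Princeton WordNet:

```python
from collections import deque

# toy fragment of the noun hierarchy: synset id -> (words, hyponym ids)
TOY_WORDNET = {
    "100455599": ({"game"}, {"100477392"}),
    "100477392": ({"lacrosse"}, set()),
}

def words_under(synset_ids, wordnet):
    """Collect the words of the given synsets and of every synset
    below them in the hyponym hierarchy, as described for Table 4."""
    words, seen = set(), set()
    queue = deque(synset_ids)
    while queue:
        sid = queue.popleft()
        if sid in seen or sid not in wordnet:
            continue
        seen.add(sid)
        members, hyponyms = wordnet[sid]
        words |= members
        queue.extend(hyponyms)
    return words
```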
References

Chris Biemann. 2013. Creating a system for lexical substitutions from scratch using crowdsourcing. Language Resources and Evaluation, 47(1):97–122.

Peter Brown, Peter de Souza, Robert Mercer, Vincent Della Pietra, and Jenifer Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Jinho D. Choi. 2012. Optimization of Natural Language Processing Components for Robustness and Scalability. Ph.D. thesis, University of Colorado at Boulder, Boulder, CO, USA. AAI3549172.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551–585.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Zhiheng Huang, Marcus Thint, and Zengchang Qin. 2008. Question classification using head words and their hypernyms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 927–936. Association for Computational Linguistics.

K. Sparck Jones, S. Walker, and S. E. Robertson. 2000. A probabilistic model of information retrieval: development and comparative experiments: Part 1. Information Processing & Management, 36(6):779–808.

Gerhard Kremer, Katrin Erk, Sebastian Padó, and Stefan Thater. 2014. What Substitutes Tell Us – Analysis of an "All-Words" Lexical Substitution Corpus. In Proceedings of EACL, Gothenburg, Sweden.

Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford's Multi-pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34, Stroudsburg, PA, USA. Association for Computational Linguistics.

Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, pages 1–7. Association for Computational Linguistics.

Xin Li and Dan Roth. 2006. Learning question classifiers: the role of semantic information. Natural Language Engineering, 12(03):229–249.

Chris Manning, John Bauer, Mihai Surdeanu, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.

Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity: Measuring the Relatedness of Concepts. In Demonstration Papers at HLT-NAACL 2004, pages 38–41, Stroudsburg, PA, USA. Association for Computational Linguistics.

Anselmo Peñas, Christina Unger, Georgios Paliouras, and Ioannis A. Kakadiaris. 2015. Overview of the CLEF question answering track 2015. In Experimental IR Meets Multilinguality, Multimodality, and Interaction – 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8–11, 2015, Proceedings, pages 539–544.

Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pages 448–453. Morgan Kaufmann.

Joao Silva, Luísa Coheur, Ana Cristina Mendes, and Andreas Wichert. 2011. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35(2):137–154.

Noah A. Smith, Michael Heilman, and Rebecca Hwa. 2008. Question Generation as a Competitive Undergraduate Course Project. In NSF Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, VA, September.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.

Ellen M. Voorhees. 1999. Overview of the TREC-8 Question Answering Track. In Proceedings of the 8th Text REtrieval Conference (TREC-8), pages 115–123.
An overview of Portuguese WordNets

Valeria de Paiva, Nuance Communications, USA
Livy Real, IBM Research Brazil
Hugo Gonçalo Oliveira, CISUC, DEI, Univ. Coimbra, Portugal
Alexandre Rademaker, IBM Research and FGV/EMAp, Brazil
Claudia Freitas, PUC-Rio, Brazil
Alberto Simões, Universidade do Minho, Portugal
Abstract

Semantic relations between words are key to building systems that aim to understand and manipulate language. For English, the "de facto" standard for representing this kind of knowledge is Princeton's WordNet. Here, we describe the wordnet-like resources currently available for Portuguese: their origins, methods of creation, sizes, and usage restrictions. We start tackling the problem of comparing them, but only in quantitative terms. Finally, we sketch ideas for potential collaboration between some of the projects.

1 Introduction

Semantic relations are a key aspect when developing computer programs capable of handling language – they establish (labeled) associations between words and can be integrated into lexical-semantic knowledge bases. Available since the beginning of the 1990s, Princeton's WordNet (Fellbaum, 1998), henceforth PWN, is a paradigmatic lexical resource. Originally created for English, its model is now a "de facto" standard, due to its wide use in applications and its adaptation to different languages.

For Portuguese, the first resource of this kind, WordNet.PT (Marrafa, 2001), was announced in 2001 but, unlike PWN, was never free to use. This meant that, in practice, there was still no open Portuguese wordnet. In parallel, a few alternatives to the wordnet model arose, some of which were compared in (Santos et al., 2010). But while those alternatives proved useful for some tasks, they were not enough to enable all of the standard uses of a wordnet in Natural Language Processing (NLP), including similarity computation or word sense disambiguation. As the need for a Portuguese wordnet was keenly felt, in the early 2010s several projects sprung up aiming to develop free Portuguese wordnets. We describe some of those wordnets, indicating where they were created, their construction process, their availability and, when possible, their size.

We recall the wordnet model, its adaptation to other languages, and how these adaptations may be expanded through content alignment. Then, we describe the Portuguese wordnets we are aware of and alternative lexical-semantic resources, and go on to focus on the open wordnets. After that, we briefly compare them along a set of relevant features for processing Portuguese. Finally, we suggest work leveraging what is already planned for these wordnets, as well as some ideas for collaboration. Knowing where we are in terms of our wordnets is an essential first step in establishing lexical resources, which are vital to the computational processing of the Portuguese language.1

1 This paper is a shorter English version of our previous article, in Portuguese (Gonçalo Oliveira et al., 2015).

2 WordNet and Alternatives

Lexical knowledge bases are organized repositories of lexical items, usually including information about the possible meanings of words, relations
between them, definitions, and phrases that exemplify their use. The Princeton WordNet model, with English as its target language, is probably the most popular representative of this type of lexical knowledge base. Its flexibility has led not only to its growing use by the NLP community, but also to the adaptation of the model to other languages.

PWN was created manually in the early 1990s and has been updated several times since then. Initially based on psycholinguistic principles, it combines traditional lexicographic information, similar to that in a dictionary, with an organization suited to computational use, which facilitates its application as a basis for lexical-semantic knowledge. Like a thesaurus, PWN is organized in groups of synonymous lexical items, called synsets, which can be seen as the possible lexicalizations for the concepts in the language. Besides synonymy, inherent to synsets, PWN covers other types of semantic relation between synsets, for example hypernymy – a concept is a generalization of another – or meronymy – a concept is a part of another. In addition, each synset has a part-of-speech (noun, verb, adjective or adverb); a gloss, similar to a definition in a dictionary; and it may also have phrases that illustrate its use. The inclusion of a lexical item in a synset indicates a sense of that item.

Both its free availability and the flexibility of its model were crucial to the success and widespread use of PWN. This made it possible to integrate PWN into a large number of NLP and knowledge management projects, making it virtually the standard model of a lexical resource for several languages. The popularity of the PWN knowledge base model led to the creation of the Global WordNet Association (GWA), a non-commercial organization that provides a platform for discussing, sharing and linking the wordnets of the world.

2.1 Multilingual Wordnets

Many people have studied the possibility of aligning, as far as possible, wordnets of different languages, given their similarities. Hence the unveiling of multilingual wordnets such as EuroWordNet (Vossen, 1997) or MultiWordNet (Pianta et al., 2002), which nonetheless follow very different approaches. In EuroWordNet, wordnets are created independently for each language, and only after that are they aligned, relying on similarities or, indirectly, using Princeton WordNet as a pivot, through the so-called Inter-Language Index (ILI). In MultiWordNet, the first step was to translate, as much as possible, one wordnet, usually Princeton's, into the other languages. Among the multilingual wordnets aligned with PWN there are, for instance, BalkaNet (Stamou et al., 2002), dedicated to the languages of the Balkans, and the Multilingual Central Repository (Gonzalez-Agirre et al., 2012) (henceforth, MCR), dedicated to the languages of Spain.

Open Multilingual WordNet (Bond and Foster, 2013), henceforth OMWN, is an initiative to facilitate access to different wordnets, for different languages. To this end, wordnets created independently were normalized using PWN, and then connected to each other and accessed through a common interface. Another initiative that should be mentioned is the Universal WordNet (de Melo and Weikum, 2009) (henceforth, UWN), a multilingual lexical knowledge base automatically built from PWN and the alignment of multilingual versions of Wikipedia.

There are also several projects on the alignment of PWN with other lexical resources or knowledge bases. These include, for instance, YAGO (Suchanek et al., 2007), UBY (Gurevych et al., 2012), BabelNet (Navigli and Ponzetto, 2012), SUMO (Pease and Fellbaum, 2010) and DOLCE (Gangemi et al., 2010).

2.2 Closed Portuguese WordNets

There is no doubt that the open-source character of PWN was key to its wide acceptance. Still, not all resources that followed in the footsteps of PWN have chosen to make their results freely available. We describe three projects that resulted in Portuguese wordnets that are not free to use.

WordNet.PT (Marrafa, 2001), henceforth WN.PT, was the first Portuguese wordnet, in development since 1998. Its construction is essentially manual and it follows the EuroWordNet (Vossen, 1997) model, which means WN.PT is created from scratch for Portuguese. WN.PT 1.6, released in 2006, covers a wide range of semantic relations, including: hypernymy, whole/part, equivalence, opposition, categorization, instrument-for, or place-of. More recently, WN.PT was expanded to Global WordNet.PT (Marrafa et al., 2011), which contains 10,000 concepts, including nouns, verbs and
simpler. Those include OpenThesaurus.PT, typically used to suggest synonyms in word processors; PAPEL (Gonçalo Oliveira et al., 2008), a
lexical-semantic network, automatically extracted
from a Portuguese dictionary, with words connected through a wide range of semantic relationships; the Port4Nooj lexical resources (Barreiro, 2010), which include a set of definitions
and semantic relations between words; and the Dicionário Aberto (Simões et al., 2012), an open
electronic dictionary which includes also several
explicit relationships between words.
adjectives, their lexicalizations in different variants of Portuguese and their glosses, in a network
of more than 40,000 relation instances. An approach to expand the WN.PT semi-automatically
with relations extracted from a corpus (Amaro,
2014) was recently presented, which shows that
the project is still active.
WordNet.BR (henceforth, WN.BR) aimed to be
a wordnet for Brazilian Portuguese. In its first
development phase (Dias-da-Silva et al., 2002),
a team of linguists analyzed five Portuguese dictionaries and two corpora to collect information
on synonymy and antonymy. This resulted in the
manual creation of synsets and antonymy relations between them, and writing some glosses and
example sentences. In a second phase (Dias-daSilva, 2006), its synsets were manually aligned
with PWN, in a similar process to that followed
in the EuroWordNet project, using bilingual dictionaries. After this alignment, the semantic relations between synsets with equivalents in Portuguese and English were inherited. It is assumed
that the full version of WN.BR covers relations of
hyperonymy, part-of, cause and implication (entailment). However, this version is not available
online. One can view and download the results
of phase one, available under the name of Electronic Thesaurus of Portuguese (TeP) (Maziero et
al., 2008). TeP includes more than 44,000 lexical
items, organized into 19,888 synsets, which in turn
are connected through 4,276 antonymy relations.
MultiWordNet.PT, commonly referred to as MWN.PT, is the Portuguese section of the MultiWordNet project (Pianta et al., 2002), and can be purchased through the European Language Resources Association catalog. MWN.PT includes 17,200 manually validated synsets, which correspond to approximately 21,000 senses and 16,000 lemmas, covering both the European and Brazilian variants of Portuguese. As a resource established under the MultiWordNet project, its synsets are derived from the translation of their PWN equivalents. Transitively, this resource is thus also aligned with the MultiWordNets of Italian, Spanish, Hebrew, Romanian and Latin.

The manual creation of a wordnet is a complex task, which requires much effort and time. When it was not possible to use an open Portuguese wordnet, researchers working on the processing of Portuguese felt the need to develop free alternatives which, in most cases, were also created by automatic means.

3 Open Portuguese Wordnets

Open wordnets for Portuguese finally appeared in the early 2010s. They were created by automatic or semi-automatic means and all assume that lexical-semantic resources must be open-source to be really useful to the community. We present four wordnets that fall in this category.

3.1 Onto.PT

The Onto.PT (Gonçalo Oliveira and Gomes, 2014) project began in 2008. To create a new wordnet in a completely automatic fashion, Onto.PT used several lexical resources available at the time, with special focus on those of the project PAPEL (Gonçalo Oliveira et al., 2008), including grammars to extract relations from dictionaries. Other exploited resources include Wiktionary.PT, Dicionário Aberto (Simões et al., 2012), TeP (Maziero et al., 2008), OpenThesaurus.PT and, more recently, OpenWN-PT (de Paiva et al., 2012).

The creation of Onto.PT follows the ECO approach (Gonçalo Oliveira and Gomes, 2014), tailored for this project, but flexible enough to integrate words and relations extracted from different sources. ECO differs from other approaches because it tries to learn the whole structure of a wordnet, including the contents and boundaries of synsets, as well as the synsets involved in semantic relations. Hence, despite automatically exploiting handcrafted resources, the authors refer to ECO as a “fully automatic” approach. It consists of three main phases: (i) extraction of relations between words; (ii) synset discovery from the clusters of the extracted synonymy network (an initial set of synsets, such as those of TeP, may be used as a starting point); (iii) mapping the word arguments of the remaining relations to the discovered synsets. In Onto.PT 0.6 (Gonçalo Oliveira et al., 2014), dictionary definitions were also automatically assigned to synsets.
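The three ECO phases described above can be sketched as follows. This is a toy illustration under simplifying assumptions, not the actual Onto.PT code: plain connected components stand in for ECO's more elaborate clustering of the synonymy network, and the input pairs are invented examples.

```python
# Minimal sketch of the three ECO phases (NOT the actual Onto.PT code).
from collections import defaultdict

def discover_synsets(syn_pairs):
    """Phase (ii): cluster the synonymy network; here, connected
    components stand in for ECO's actual clustering step."""
    graph = defaultdict(set)
    for a, b in syn_pairs:
        graph[a].add(b)
        graph[b].add(a)
    graph = dict(graph)
    seen, synsets = set(), []
    for word in graph:
        if word in seen:
            continue
        stack, comp = [word], set()
        while stack:
            w = stack.pop()
            if w in comp:
                continue
            comp.add(w)
            stack.extend(graph[w] - comp)
        seen |= comp
        synsets.append(frozenset(comp))
    return synsets

def attach_relations(word_relations, synsets):
    """Phase (iii): lift word-word relations to synset-synset relations."""
    index = {w: s for s in synsets for w in s}
    lifted = set()
    for a, rel, b in word_relations:
        if a in index and b in index:
            lifted.add((index[a], rel, index[b]))
    return lifted

# Phase (i): toy relations "extracted" from dictionaries
synonymy = [("carro", "automóvel"), ("automóvel", "viatura"),
            ("veículo", "meio de transporte")]
hypernymy = [("veículo", "hypernym-of", "carro")]

synsets = discover_synsets(synonymy)
print(sorted(sorted(s) for s in synsets))
print(attach_relations(hypernymy, synsets))
```

With the toy input, the two clusters become candidate synsets and the word-level hypernymy pair is lifted to a relation between them.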
Onto.PT is different from the typical wordnet,
not only for its creation process, but also because
it includes a wide range of semantic relations that
are not in PWN. Those relations are the same as
the ones in PAPEL, extracted from dictionaries,
and include causation, purpose, location or manner, among others.
On the one hand, ECO allows for the creation of
a large knowledge base with little effort – Onto.PT
0.6 covers ≈169,000 distinct lexical items, organized in ≈117,000 synsets, which in turn are related through ≈174,000 relation instances. On the
other hand, there are reliability consequences. For
example, in Onto.PT 0.35, 74% of synsets were
correct, in 18% there was no agreement between
two judges, and the remaining had at least one
incorrect word. The quality of relationships also
varies dramatically depending on the type. Considering relations involving incorrect synsets as also wrong, hypernymy instances were only 65% correct, while a sample of other relation types ranged between 78% and 82%. These evaluation efforts
are described in (Gonçalo Oliveira and Gomes,
2014). Nevertheless, Onto.PT was used, for instance, in the expansion of synonyms for information retrieval (Rodrigues et al., 2012) or for creating lists of causal verbs (Drury et al., 2014).
Due to its design, Onto.PT is a dynamic resource and, from release to release, may have
significant changes in the number and size of its
synsets. Thus, it is not planned to be aligned with
PWN. Onto.PT is freely available in RDF/OWL2 ,
following an existing PWN model (van Assem et
al., 2006), expanded to cover all its relation types.
3.2 OpenWordNet-PT

OpenWordNet-PT (de Paiva et al., 2012), abbreviated to OpenWN-PT, is a wordnet originally developed as a syntactic projection of the Universal WordNet (UWN). Its long-term goal is to serve as the main lexicon for an NLP system focused on logical reasoning, based on knowledge representation and using an ontology such as SUMO. The process of creating OpenWN-PT uses machine learning techniques to build relations between graphs representing lexical information from multiple language versions of Wikipedia entries and open electronic dictionaries. OpenWN-PT has constantly been improved through linguistically motivated additions, either manually or from evidence in large corpora. This is also the case for the lexicon of nominalizations, NomLex-PT, tightly integrated with OpenWN-PT (Freitas et al., 2014).

OpenWN-PT employs three strategies in its lexical enrichment process: (i) translation; (ii) corpus extraction; (iii) dictionaries. Regarding translation, glossaries and lists produced for other languages, such as English, French and Spanish, are automatically translated and manually revised. The addition of data from corpora contributes words or phrases in common use, which may be specific to Portuguese or absent from other wordnets. The first corpus experiment in OpenWN-PT was the integration of NomLex-PT. The use of a corpus, while useful for capturing conceptualizations specific to the language, brings additional challenges for the alignment, since it is expected that there will be expressions for which there is no synset in the English wordnet. As for the information in dictionaries, it was used indirectly, through PAPEL (Gonçalo Oliveira et al., 2008).

Like Onto.PT, OpenWN-PT is available in RDF/OWL (Real et al., 2015), following and expanding, when necessary, the mapping proposed by van Assem et al. (2006). Both the OpenWN-PT data and the schema of the RDF model are freely available for download. The philosophy of OpenWN-PT is to keep a close connection with PWN, while trying to fix the biggest mistakes created by the automated methods, using linguistic skills and tools. A consequence of this close connection is the ability to minimize the impact of lexicographic decisions on splitting or grouping the senses in a synset. While such decisions are, to a great extent, arbitrary, the criterion of following the multilingual alignment serves as a pragmatic guiding solution.

OpenWN-PT was chosen by the developers of Freeling (Padró and Stanilovsky, 2012), OMW (Bond and Foster, 2013), BabelNet and Google Translate as the representative Portuguese wordnet in those projects, due to its comprehensive coverage of the language and its accuracy. OpenWN-PT currently has 43,925 synsets, of which 32,696 correspond to nouns, 4,675 to verbs, 5,575 to adjectives and 979 to adverbs. Besides being available for download, the data can be retrieved via a SPARQL endpoint3 and can be consulted and compared with other wordnets both through the OMW interface and its own interface4.

2 http://ontopt.dei.uc.pt
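A query against the OpenWN-PT SPARQL endpoint might look like the sketch below. The namespace and predicate names follow the W3C WordNet RDF working draft (van Assem et al., 2006); whether the live endpoint uses exactly this vocabulary is an assumption, so this is illustrative only (the query string is built but not sent).

```python
# Sketch of a SPARQL query for the OpenWN-PT endpoint. The wn20schema
# vocabulary below follows the W3C WordNet RDF draft; the exact
# vocabulary served by the endpoint is an assumption.
ENDPOINT = "http://logics.emap.fgv.br:10035/repositories/wn30"

def lemma_query(lemma, limit=10):
    """Build a query for synsets containing a given lexical form."""
    return f"""PREFIX wn20schema: <http://www.w3.org/2006/03/wn/wn20/schema/>
SELECT ?synset ?form WHERE {{
    ?synset wn20schema:containsWordSense ?sense .
    ?sense  wn20schema:word ?word .
    ?word   wn20schema:lexicalForm ?form .
    FILTER (str(?form) = "{lemma}")
}} LIMIT {limit}"""

query = lemma_query("cachorro")
print(query)
```

The string can then be posted to the endpoint with any SPARQL client.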
3.3 Ufes WordNet

The Ufes WordNet (Gomes et al., 2013) (UfesWN.BR) aims at building a Brazilian Portuguese database with a structure similar to PWN, based on automatic translation. For this, a tool based on the Google Translate API was developed to translate the contents of PWN. UfesWN.BR covers 34,979 words, grouped in 48,981 synsets, connected by 238,413 relations. However, only 31.6% of the English synsets were translated, and these translations are not very reliable. In the scope of this project, the glosses of PWN were also translated. They could be useful for other projects, depending on their quality and on the ease of alignment, which has not been investigated.

3.4 PULO

PULO (Simões and Guinovart, 2014), short for Portuguese Unified Lexical Ontology, intends to incorporate resources from open, publicly available wordnets into a free Portuguese wordnet, fully aligned and included in the MCR project (Gonzalez-Agirre et al., 2012), which already includes wordnets for Spanish, Catalan, Basque and Galician, in addition to PWN.

The beginning of this project, in late 2014, involved some experiments on the translation and alignment between the English, Spanish and Galician wordnets. Beyond those, this process used probabilistic translation dictionaries (Simões and Almeida, 2003), a dynamic Portuguese-Galician translation dictionary (Guinovart and Simões, 2013), and the official Orthographic Vocabulary of the Portuguese Language. This resulted in ≈50,000 word senses, but only ≈17,000 were actually added to PULO, due to the statistical nature of the approach and the established cutoff line. The score obtained for each sense was stored in the database and may serve as a measure of the relevance or quality of each sense.

Currently, as for the other wordnets in the MCR, the ontological structure of PULO is the same as that of PWN. Despite this similarity, the internal structure of the database allows each individual wordnet to be easily extended with new concepts. PULO is available for download and currently has 25,711 senses, corresponding to 17,854 synsets. In a second stage of the process, a machine translation of glosses was produced using the MyMemory API5. Through the same interface, it is possible to consult the other languages of the MCR, as well as to browse through the base ontology.

4 Comparing Open WordNets

Table 1 summarises the main properties of the Portuguese wordnets. The most common alternative for the creation of a wordnet for Portuguese is based on translation: manual (MWN.PT), automatic (UfesWN.BR), based on a syntactic projection (OpenWN-PT), or on triangulation between resources (PULO). Within these four approaches, PULO stands out for using as a “pivot” not only the English wordnet, but also the wordnets for Spanish and Galician. Unlike all the others, the structure of Onto.PT is learned fully automatically, based on the extraction of relationships from other textual resources or wordnets, and on the discovery of clusters of synonyms, used as synsets. Among the advantages of a completely manual approach is the creation of a resource with an accuracy of virtually 100%. On the other hand, with an automatic approach, a larger resource can be created in a shorter time, avoiding tedious, time-consuming and error-prone work. A semi-automatic method, where expediency is reined in by accuracy, would seem the best approach.

Name         Synset creation      Relation creation  Update       Usage
WN.PT        manual               manual             manual       closed
WN.BR        manual               transitivity       manual?      free synsets
MWN.PT       manual? translation  transitivity       ?            paid license
Onto.PT      RE, clustering       RE, clustering     automatic    free
OpenWN-PT    UWN projection       transitivity       semi-autom.  free
UfesWN.BR    machine translation  transitivity       ?            free
PULO         triangulation        transitivity       semi-autom.  free

Table 1: Properties of Portuguese wordnets. A ‘?’ is shown for fields we could not fill.
We also made a superficial comparison of their latest versions, which should not be seen as more than a purely quantitative tabling. We consider neither the consistency nor the usefulness of the contents of the various Portuguese wordnets.

On the number of covered lexical items, Onto.PT stands out, including more than three times as many lexical items as the second largest wordnet, OpenWN-PT. This confirms that a fully automatic construction approach leads to a larger resource. Equally important for the size of Onto.PT is the number (currently six) and the type of resources used, including: resources that cover different variants of Portuguese, which can lead to minor spelling variations; and dictionaries, which already have a wide coverage of the language. Either manually or automatically, it is common to exploit dictionaries in the construction of a wordnet. Still, their automatic exploitation results in many different words and meanings that exist and are valid, but a large slice of which is of little use in colloquial Portuguese.

On the number of word senses, synsets and relation instances, Onto.PT also stands out from the rest. But it should be noted that there is an intrinsic trade-off between the size of a wordnet and the accuracy and usefulness of the resource under scrutiny. One of the difficulties in developing a wordnet is precisely to decide, on the one hand, whether two words are to be regarded as synonymous, and thus placed within the same synset, and, on the other hand, which words should be in different synsets. These are typical lexicography challenges to which there is probably no final, unique answer. But there seems to be a consensus that a very large number of synsets is a sign of “noise” in the process of grouping words and/or in the sense discrimination process. Correctness/accuracy is undoubtedly one of the bottlenecks of building wordnets. If, on the one hand, size and coverage allow for a quantitative comparison, which is relatively simple, the same cannot be said about quality assessment. PWN, built manually, may even reflect questionable decisions, but does not contain “errors” as such, as we are using it as a baseline for comparison. As for the wordnets built automatically, or semi-automatically, for languages other than English, quality assessment will always be an issue, since there is no gold reference available – this is precisely what they want to become. From this perspective, resources that rely on human labor have an advantage, although we do not know exactly how this advantage can and should be measured. An alignment with PWN may be important for obtaining additional knowledge, mostly from other resources aligned with it. In addition to relation inheritance, an alignment allows access to the knowledge in other extensions of PWN, such as WordNet-Domains, SentiWordNet or TempoWordNet. On the other hand, a blind alignment does not consider that different languages represent different socio-cultural realities, do not cover the same part of the lexicon and, even where they seem to be common, several concepts are lexicalized differently (Hirst, 2004).

Both WN.PT and Onto.PT cover a wide range of relation types, some not typically present in wordnets. We recall that, for Onto.PT, their extraction was possible due to the regularities in dictionary definitions.

3 http://logics.emap.fgv.br:10035/repositories/wn30
4 http://wnpt.brlcloud.com/wn/
5 http://mymemory.translated.net/

5 Building on Open WordNets

We presented and compared the various wordnets that currently exist for Portuguese. Among them, four are freely available; until recently, one synset base (TeP) was also freely available; one (MWN.PT) may be purchased; and another can be explored online (WN.PT). The creation of these wordnets followed different approaches, from completely manual labour, through translation-based approaches with more or less manual labour, to an approach in which the whole structure is populated automatically. We hope to have shown that, currently, it makes no sense to regret that there is no Portuguese wordnet. In fact, the use of a wordnet in a project targeting Portuguese is becoming less a problem of finding a workaround solution, and increasingly one of choosing the most suitable among the available alternatives. This selection should consider, among other things, the need to align with other wordnets, the error tolerance, the coverage needs – both with regard to the lexical items and to the relationships between them – and even the available budget. Since each wordnet has distinct characteristics, one should not discard the use of more than one wordnet in the same project.

It is sensible to ask whether all these alternatives make sense, or whether it would be preferable to focus on a single effort to build a single Portuguese wordnet, trying to harness the strong points of each of the projects described. The authors of this article, responsible for Onto.PT, OpenWN-PT and PULO, believe that there are advantages both in converging into a single wordnet and in keeping separate projects. Thus, in the short term, the development of each wordnet will remain a responsibility of its original team, but there will be closer monitoring of each other’s work. The idea is that each project may reuse what is done by the others, thereby minimizing duplicate work, but without losing sight of specific goals.

In the near future, Onto.PT will become a fuzzy wordnet, based on the redundancy across several Portuguese computational lexical resources, including the other open wordnets, whose further updates will be welcomed by this new initiative. Following ECO, confidence degrees will be assigned to each decision taken, including the membership of words in synsets (first experiments in Santos and Gonçalo Oliveira (2015)) and the attachment of relations to synsets. This will enable users to choose between a larger but less reliable resource and a smaller one with fewer issues.

Similarly to Onto.PT, the other open wordnets will devise the integration of each other’s contents, or replicate their enrichment approaches.
6 Conclusions

We presented a collection of Portuguese wordnets. While none feels as mature as Princeton WordNet, some have already been used in applications. Joint efforts, as we have started doing and hope to do more of, seem the only way of making progress on this hard problem. Clearly, the envisaged applications will lead to slightly different strong points in our resources, but the common denominator remains to provide a wordnet that is open, has large coverage, and is as reliable as possible.

References

Valeria de Paiva, Alexandre Rademaker, and Gerard de Melo. 2012. OpenWordNet-PT: An Open Brazilian WordNet for Reasoning. In Proceedings of 24th International Conference on Computational Linguistics, COLING (Demo Paper).

Bento C. Dias-da-Silva, Mirna F. de Oliveira, and Helio R. de Moraes. 2002. Groundwork for the Development of the Brazilian Portuguese Wordnet. In Advances in Natural Language Processing (PorTAL 2002), LNAI, pages 189–196, Faro, Portugal. Springer.

Bento C. Dias-da-Silva. 2006. Wordnet.Br: An exercise of human language technology research. In Proceedings of 3rd International WordNet Conference (GWC), GWC 2006, pages 301–303, South Jeju Island, Korea, January.

Brett Drury, Paula C.F. Cardoso, Janie M. Thomas, and Alneu de Andrade Lopes. 2014. Lexical resources for the identification of causative relations in Portuguese texts. In Proceedings of the 1st Workshop on Tools and Resources for Automatically Processing Portuguese and Spanish (ToRPorEsp), pages 56–63, São Carlos, SP, Brasil, October. BDBComp.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.

Cláudia Freitas, Valeria de Paiva, Alexandre Rademaker, Gerard de Melo, Livy Real, and Anne de Araujo Correia da Silva. 2014. Extending a lexicon of Portuguese nominalizations with data from corpora. In Proceedings of Computational Processing of the Portuguese Language - 11th International Conference (PROPOR 2014), São Carlos, Brazil, October. Springer.

Aldo Gangemi, Nicola Guarino, Claudio Masolo, and Alessandro Oltramari. 2010. Interfacing WordNet with DOLCE: towards OntoWordNet. In Ontology and the Lexicon: A Natural Language Processing Perspective, Studies in Natural Language Processing, chapter 3, pages 36–52. Cambridge University Press.

Marcelo Machado Gomes, Walber Beltrame, and Davidson Cury. 2013. Automatic construction of Brazilian Portuguese WordNet. In Proceedings of X National Meeting on Artificial and Computational Intelligence, ENIAC 2013.
Raquel Amaro. 2014. Extracting semantic relations from
portuguese corpora using lexical-syntactic patterns. In
Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC’14, Reykjavik,
Iceland, May. ELRA.
Hugo Gonçalo Oliveira and Paulo Gomes. 2014. ECO and
Onto.PT: A flexible approach for creating a Portuguese
wordnet automatically. Language Resources and Evaluation, 48(2):373–393.
Anabela Barreiro. 2010. Port4NooJ: an open source,
ontology-driven portuguese linguistic system with applications in machine translation. In Proceedings of the 2008
International NooJ Conference (NooJ’08), Budapest,
Hungary. Newcastle-upon-Tyne: Cambridge Scholars
Publishing.
Hugo Gonçalo Oliveira, Diana Santos, Paulo Gomes, and
Nuno Seco. 2008. PAPEL: A dictionary-based lexical
ontology for Portuguese. In Proceedings of Computational Processing of the Portuguese Language - 8th International Conference (PROPOR 2008), volume 5190 of
LNCS/LNAI, pages 31–40, Aveiro, Portugal, September.
Springer.
Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In Proceedings of
51st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1352–1362,
Sofia, Bulgaria, August. ACL Press.
Hugo Gonçalo Oliveira, Inês Coelho, and Paulo Gomes.
2014. Exploiting Portuguese lexical knowledge bases for
answering open domain cloze questions automatically. In
Proceedings of the 9th Language Resources and Evaluation Conference, LREC 2014, Reykjavik, Iceland, May.
ELRA.
Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009), pages 513–522, New York, NY, USA. ACM.
Livy Real, Fabricio Chalub, Valeria de Paiva, Claudia Freitas, and Alexandre Rademaker. 2015. Seeing is correcting: curating lexical resources using social interfaces. In
Proceedings of 53rd Annual Meeting of the ACL and 7th
International Joint Conference on NLP of Asian Federation of NLP - 4th Workshop on Linked Data in Linguistics:
Resources and Applications, Beijing, China, jul.
Aitor Gonzalez-Agirre, Egoitz Laparra, and German Rigau.
2012. Multilingual central repository version 3.0. In Proceedings of the 8th International Conference on Language
Resources and Evaluation, pages 2525–2529. ELRA,
LREC’12.
Hugo Gonçalo Oliveira, Valeria de Paiva, Cláudia Freitas,
Alexandre Rademaker, Livy Real, and Alberto Simões.
2015. As wordnets do português. In Alberto Simões,
Anabela Barreiro, Diana Santos, Rui Sousa-Silva, and
Stella E. O. Tagnin, editors, Linguı́stica, Informática e
Tradução: Mundos que se Cruzam, volume 7(1) of OSLa:
Oslo Studies in Language, pages 397–424. University of
Oslo.
Ricardo Rodrigues, Hugo Gonçalo Oliveira, and Paulo
Gomes. 2012. Uma abordagem ao Págico baseada no
processamento e análise de sintagmas dos tópicos. Linguamática, 4(1):31–39, April.
Fábio Santos and Hugo Gonçalo Oliveira. 2015. Descoberta
de synsets difusos com base na redundância em vários dicionários. Linguamática, page accepted for publication,
December.
Xavier Gómez Guinovart and Alberto Simões. 2013. Retreading Dictionaries for the 21st Century. In José Paulo
Leal, Ricardo Rocha, and Alberto Simões, editors, 2nd
Symposium on Languages, Applications and Technologies, volume 29 of OpenAccess Series in Informatics (OASIcs), pages 115–126. Schloss Dagstuhl–Leibniz-Zentrum
fuer Informatik.
Diana Santos, Anabela Barreiro, Cláudia Freitas, Hugo
Gonçalo Oliveira, José Carlos Medeiros, Luı́s Costa,
Paulo Gomes, and Rosário Silva. 2010. Relações
semânticas em português: comparando o TeP, o
MWN.PT, o Port4NooJ e o PAPEL. In Textos seleccionados. XXV Encontro Nacional da Associação Portuguesa
de Linguı́stica, APL 2009, pages 681–700. APL.
Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann,
Michael Matuschek, Christian M. Meyer, and Christian
Wirth. 2012. UBY - a large-scale unified lexical-semantic
resource. In Proceedings of 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2012, pages 580–590, Avignon, France.
ACL Press.
Alberto Simões and Xavier Gómez Guinovart. 2014. Bootstrapping a Portuguese wordnet from Galician, Spanish
and English wordnets. In Advances in Speech and Language Technologies for Iberian Languages, Proceedings
of 2nd International Conference, IberSPEECH 2014, Las
Palmas de Gran Canaria, Spain, volume 8854 of LNCS,
pages 239–248. Springer.
Graeme Hirst. 2004. Ontology and the lexicon. In Steffen
Staab and Rudi Studer, editors, Handbook on Ontologies,
International Handbooks on Information Systems, pages
209–230. Springer.
Alberto Simões, Álvaro Iriarte Sanromán, and José João
Almeida. 2012. Dicionário-Aberto: A source of resources for the Portuguese language processing. In
Proceedings of Computational Processing of the Portuguese Language, 10th International Conference (PROPOR 2012), Coimbra Portugal, volume 7243 of LNCS,
pages 121–127. Springer, April.
Palmira Marrafa, Raquel Amaro, and Sara Mendes. 2011.
WordNet.PT Global – extending WordNet.PT to Portuguese varieties. In Proceedings of 1st Workshop on Algorithms and Resources for Modelling of Dialects and
Language Varieties, pages 70–74, Edinburgh, Scotland.
ACL Press.
Palmira Marrafa. 2001. WordNet do Português: uma base de
dados de conhecimento linguı́stico. Instituto Camões.
Alberto M. Simões and J. João Almeida. 2003. NATools –
a statistical word aligner workbench. Procesamiento del
Lenguaje Natural, 31:217–224, September.
Erick G. Maziero, Thiago A. S. Pardo, Ariani Di Felippo, and
Bento C. Dias-da-Silva. 2008. A Base de Dados Lexical
e a Interface Web do TeP 2.0 - Thesaurus Eletrônico para
o Português do Brasil. In VI Workshop em Tecnologia da
Informação e Linguagem Humana, TIL, pages 390–392.
Sofia Stamou, Kemal Oflazer, Karel Pala, Dimitris Christoudoulakis, Dan Cristea, Dan Tufis, Svetla Koeva,
George Totkov, Dominique Dutoit, and Maria Grigoriadou. 2002. BalkaNet: A multilingual semantic network
for the balkan languages. In Proceedings of 1st Global
WordNet Conference, GWC’02.
Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network.
Artificial Intelligence, 193:217–250.
Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum.
2007. YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World
Wide Web, WWW 2007, pages 697–706, Alberta, Canada.
ACM Press.
Lluı́s Padró and Evgeny Stanilovsky. 2012. Freeling 3.0:
Towards wider multilinguality. In Proceedings of the
Language Resources and Evaluation Conference (LREC
2012), Istanbul, Turkey, May. ELRA.
Mark van Assem, Aldo Gangemi, and Guus Schreiber. 2006.
RDF/OWL representation of WordNet. W3c working
draft, World Wide Web Consortium, June.
Adam Pease and Christiane Fellbaum. 2010. Formal ontology as interlingua: the SUMO and WordNet linking
project and global WordNet linking project. In Ontology
and the Lexicon: A Natural Language Processing Perspective, Studies in Natural Language Processing, chapter 2, pages 25–35. Cambridge University Press.
Piek Vossen. 1997. EuroWordNet: a multilingual database
for information retrieval. In Proceedings of DELOS workshop on Cross-Language Information Retrieval, Zurich.
Emanuele Pianta, Luisa Bentivogli, and Christian Girardi.
2002. MultiWordNet: developing an aligned multilingual
database. In Proceedings of 1st International Conference
on Global WordNet, GWC 2002.
Towards a WordNet based Classification of Actors in Folktales
Thierry Declerck
DFKI GmbH
Saarbrücken, Germany &
Austrian Centre for
Digital Humanities (ACDH)
Vienna, Austria
Tyler Klement
Saarland University
Saarbrücken, Germany
Antonia Kostova
Saarland University
Saarbrücken, Germany
[email protected]
[email protected]
[email protected]
Abstract

In the context of a student software project we are investigating the use of WordNet for improving the automatic detection and classification of actors (or characters) mentioned in folktales. Our starting point is the book “Classification of International Folktales”, out of which we extract text segments that name the different actors involved in tales, taking advantage of patterns used by its author, Hans-Jörg Uther. We apply to those text segments functions implemented in the NLTK interface to WordNet in order to obtain lexical semantic information that enriches the original naming of characters proposed in the “Classification of International Folktales” and supports their translation into other languages.

1 Introduction

This short paper reports on the current state of a student software project aiming at supporting the automatized classification of folktales along the lines of the classification proposed by Hans-Jörg Uther (2004). This classification scheme is considered a central resource for the analysis work of folklorists. It builds on former work by Antti Aarne (1961) and Stith Thompson (1977). In the following, we use the acronym ATU to refer to (Uther, 2004), ATU standing for Aarne-Thompson-Uther.

In the current work we focus on the detection of common superclasses for the names of the main actors (or characters) mentioned in the various types of folktales listed by Uther (2004). In doing this we are able to propose more generic classes of characters and an extended vocabulary, and so to link to other classification systems, like the Motif-Index of Folk-Literature proposed by Stith Thompson1. In general, we are aiming at a WordNet2 based generation of lexical semantic relations for building a terminology network of actors/characters mentioned in folktales. Our work is anchored in the field of Digital Humanities (DH), where there is an increased interest in applying methods from Natural Language Processing (NLP) and Semantic Web (SW) technologies to literary work.

In the following sections we first present the data we are dealing with and the transformations we applied to it in order to use the NLTK interface to WordNet3. We then describe the NLTK functions we use and how we benefit from them to build a more generic vocabulary and to extend the basic terminology for classifying actors/characters in folktales. Related work on this topic is presented in Declerck (2012), which focuses more on the use of Wiktionary for translation and on the formal representation of the terminology used in ATU.

2 The Data Source
We are taking the ATU classification scheme
as our starting point. Just below we display the
initial part of a type of folktale, which in ATU is
marked using an integer, possibly followed by a
letter. In this example we deal with type 2, which
is included in the list of types “Wild Animal”
(from type 1 to type 99), and more specifically
within the list “The Clever Fox (Other Animal)”
(from type 1 to type 69)4 .
1
See the online version of the index: http://www.
ruthenia.ru/folklore/thompson/index.htm.
2
See (Fellbaum, 1998) and (Miller, 1995).
3
NLTK is described in (Bird et al., 2009), with an updated
online version: http://www.nltk.org/book/. At
http://www.nltk.org/howto/wordnet.html the
WordNet interface is described in detail.
4
See also https://en.wikipedia.org/wiki/
Aarne-Thompson_classification_systems,
2 The Tail-Fisher. A bear (wolf) meets
a fox who has caught a big load
of fish. He asks him where he
caught them, and the fox replies
that he was fishing with his tail
through a hole in the ice. He
advises the bear to do likewise
and the bear does. When the bear
tries to pull his tail out of
the ice (because men or dogs are
attacking him), it is frozen in
place. He runs away but leaves
his tail behind [K1021]. Cf.
Type 1891.
Combinations:
This type is usually
combined with episodes of one or
more other types, esp. 1, 3, 4,
5, 8, 15, 41, 158, and 1910.
In this example, we can see the number of the type (“2”), its label (“The Tail-Fisher”) and a text summarizing the typical motifs of this type of folktale. At the end of this “script”, a link to a corresponding Thompson Motif-Index entry is provided (“[K1021]”). Finally, the types with which the current type is usually combined are indicated.

For us, a very interesting pattern in the description part of the type entry is “A bear (wolf)”. With this pattern (and also with more complex ones), the author specifies variants of the actors/characters that can play a role within a folktale type. We found this pattern interesting because our assumption is that, in most cases, only semantically related actors/characters are mentioned in this text construct. And those pairs of variants give us a promising basis for trying to generate more generic terms from WordNet for classifying actors in folktales, and so to support the linking of ATU to other classification schemes.

Our work consisted first in extracting from ATU the relevant text segments corresponding to such patterns, and then in querying WordNet in order to see if the characters named in such text segments share relevant lexical semantic properties.

2.1

We converted the ATU entries into a simpler text format, using for this a Python script. For type 6, just to present another example of an ATU type, we now have the following text format:

6˜Animal Captor Persuaded to
Talk.˜ A fox (jackal, wolf)
catches a chicken (crow, bird,
hyena, sheep, etc. ) and is
about to eat it. The weak animal
asks a question and the fox
answers. Thus he releases the
prey and it escapes. ˜K561.1

With this new format, where the sign “˜” is used as the separator, it is very easy to write code that is specialized for dealing with parts of the ATU entries. For our work, we concentrate only on the third field of the “˜”-separated input file. This way we avoid the “noise” that could be generated by the use of parentheses in the second field (the label of the type), like:

Torn-off Tails (previously The
Buried Tail).

which is used in the label of type 2A.
2.2
Pattern Extraction
On the basis of a manual analysis of the ATU entries, regular expressions for detecting the formulation of variants of actors/characters have been
formulated and implemented in Python. Below we
show some examples of extracted text segments,
on the basis of the Python script:
• A master (supervisor)
• an ox is so big that it takes a bird a whole day
(week, year)
• A sow (hare)
• A giant has sixty daughters (sons)
• a brook (sea)
Pre-Processing the ATU Catalogue
• A man puts a pot with hot milk (chocolate)
In order to be able to apply functions of the
WordNet interface of NLTK to the ATU classification scheme, we first had to transform the
original document into a punctuation separated
• A man who has recently been married meets
a friend (neighbor, stranger)
• A wolf (bee, wasp, fly)
with more details given in the French or German corresponding pages.
• A suitor (suitors)
83
• a flea (fly, mouse)
the function “path similarity” gives ’0.2’, while
for “fox.n.01” and “jackal.n.01” it gives ’0,33’.
We might have ’0,33’ as a threshold for accepting
the selected hypernym as a relevant generalization
of the words used in the patterns of ATU we are
investigating. Or allowing also lower similarity
measures, but filtering out the selected hypernym
on the basis of the length of the path leading from
it to the root node. The LCH “canine.n.02’ has
a much longer path to “entity” as does the LCH
“person.n.01”. Our first experiments seem to indicate that the longer the path of the hypernym to the
root node, the more informative is the generalization proposed by querying WordNet for the least
common hypernym.
Additionally to those two functions of the
NLTK interface to WordNet, we make use of the
possibility to extract from WordNet all the hyponyms of the involved synsets. This can offer an extended word base for searching in folktale texts for relevant actors/characters. While
this assumption seems reasonable in certain cases,
like for example for the synset “overlord.n.01”
for which we can retrieve hyponyms like “feudal lord”, “seigneur’ and “seignior”, it is not
clear if it is beneficial to retrieve all the scientific names listed as hyponyms of the synset
“fox.n.01”, like “Urocyon cinereoargenteus” or
“Vulpes fulva”. But in any case, the terminology
basis of the words used in ATU can this way be
extended.
Last but not least, we take advantage of the multilingual coverage of WordNet, using for this another function implemented in NLTK. As an example, for the following pairs mentioned in ATU,
we get from WordNet the French equivalents:
• a series of animals (hen, rooster, duck, goose,
fox, pig)
• a person (animal)
• An ant (sparrow, hare)
As the reader can see, each text segment starts
with an indefinite Nominal Phrase (NP) and ends
with a closing parenthesis. This pattern is consistently used in ATU, and corresponds to our intuition that a referent in discourse is mostly introduced by an indefinite NP. For the first step of
our investigation of the use of WordNet for generating more generic terms for the mentioned actors, we decided to concentrate on the simple sequence “A/An Noun (Noun)”, like for example “A
fox (wolf)”.
2.2.1
Accessing WordNet with the NLTK
Interface
NLTK provides for a rich set of functions for accessing WordNet. The first function we applied
was the one searching for the least common hypernym for the two words used in the pattern “A/An
Noun (Noun)”. Some few results on such a search
for all the synsets of the considered noun-pairs are
displayed below for the purpose of exemplification, where we indicate the least common hypernym with the abbreviation LCH:
• Synset(man.n.01) & Synset(fox.n.05) =>
LCH(Synset(person.n.01))
• Synset(fox.n.01) & Synset(jackal.n.01) =>
LCH(Synset(canine.n.02))
• Synset(fox.n.01) & Synset(cat.n.01) =>
LCH(Synset(carnivore.n.01))
• Synset(fox.n.01) & Synset(wolf.n.01) =>
[’renard’] & [’loup’, ’louve’]
• Synset(raven.n.01) & Synset(crow.n.01) =>
LCH(Synset(corvine bird.n.01))
• Synset(dragon.n.02) & Synset(monster.n.04)
=> [’dragon’] & [’démon’, ’monstre’,
’diable’, ’Diable’]
It is for sure interesting to see that depending on the word they are associated with, synsets
of “fox”, for example, can be related to a different hypernym. In the case of “fox.n.05” and
“man.n.01” sharing the hypernym “person.n.01”,
we have to check if this case should be filtered out,
since the hypernym is too generic. We tested for
this the NLTK function “path similarity”, which
computes a measure on the basis of the respective length of the path needed for each synset to
the shared LCH. For “man.n.01” and “fox.n.05”
• Synset(enchantress.n.02) &
Synset(sorceress.n.01) => [’sorcière’] &
[’enchanteur’, ’ensorceleur’, ’sorcière’]
As part of future work, we are considering those
multilingual equivalents provided by WordNet as
a starting point for providing for a multilingual extension of the ATU classification.
3 An Ontology for ATU
In order to store all the results of the work described above, including the multilingual correspondences of the English terminology used in ATU, we decided to go for the creation of an ontology of ATU, a step which also aims at supporting the linking of this classification scheme to other approaches in the field. The ontology was generated automatically from the transformed ATU input data described in Section 2.1, and encoded in the OWL and RDF(S) representation languages5. ATU not being a hierarchical classification, we decided to have only one class in the ontology, and to encode each type of ATU as an instance of this class. As a result, we have 2221 instances. The main class is displayed just below, using the Turtle syntax6 for its representation:

:ATU
    rdf:type owl:Class ;
    rdfs:comment "\"Ontology Version of ATU\""@en ;
    rdfs:label "\"The Types of International Folktales Aarne-Thompson-Uther\""@en ;
    rdfs:subClassOf owl:Thing ;
    .

An instance of this class, for example for type 101, has the following syntax:

<http://www.semanticweb.org/tonka/ontologies/2015/5/tmi-atu-ontology#101>
    rdf:type :ATU ;
    linkToTMI <http://www.semanticweb.org/tonka/ontologies/2015/5/tmi-atu-ontology#K231.1.3> ;
    rdfs:comment "\"Type 101 of ATU\""@en ;
    rdfs:isDefinedBy "The Old Dog as Rescuer of the Child (Sheep). A farmer plans to kill his faithful old dog because it cannot work anymore. The wolf makes a plan to save the dog: The latter is to rescue the farmer's child from the wolf. The plan succeeds and the dog's life is spared. The wolf in return wants to steal the farmer's sheep. The dog refuses to help and loses the wolf's friendship."@en ;
    rdfs:label "\"The Old Dog as Rescuer of the Child (Sheep)\""@en ;
    .

The reader can see in this extensive example that each instance of the ATU class is named in the first line of the code by a Unique Resource Identifier (URI). The property "rdf:type" indicates that the object named by the URI is an instance of the class "ATU". The last element of the code, introduced by "rdfs:label", stores the original label in English ("en"). We will use this property "rdfs:label" to encode the multilingual correspondences. We encode the original description of the type as the value of the property "rdfs:isDefinedBy". The property "linkToTMI" is the way we go for linking ATU types to motifs listed in the Motif-Index of Folk-Literature (which we abbreviate as TMI). This linking is still in a preliminary stage, since we first have to finalize the corresponding TMI ontology, and also check the validity of the linking to TMI we extracted from the ATU book. This kind of linking is the one we will use for interconnecting all types of classification schemes used for folktales (and maybe also for other literary genres). We will add a property for including relevant hypernyms (and possibly hyponyms) extracted from WordNet to the current labels, contributing this way to the semantic enrichment of the original classification.

5 See http://www.w3.org/2001/sw/wiki/OWL and http://www.w3.org/TR/rdf-schema/.
6 See http://www.w3.org/TR/turtle/ for more details.
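The automatic instance generation can be sketched with plain string formatting. This is a minimal sketch: the base URI and property names follow the example above, while the helper function and its parameters are our assumptions, not the authors' actual generation script:

```python
# Emit one pre-processed ATU type as a Turtle instance of the :ATU
# class, mirroring the example for type 101 above.
BASE = ("http://www.semanticweb.org/tonka/ontologies/2015/5/"
        "tmi-atu-ontology#")

def atu_instance(number, label, description, motifs):
    """Return a Turtle fragment for one ATU type (hypothetical helper)."""
    lines = [f"<{BASE}{number}>", "    rdf:type :ATU ;"]
    for m in motifs:  # zero or more TMI motif links
        lines.append(f"    linkToTMI <{BASE}{m}> ;")
    lines.append(f'    rdfs:comment "Type {number} of ATU"@en ;')
    lines.append(f'    rdfs:isDefinedBy "{description}"@en ;')
    lines.append(f'    rdfs:label "{label}"@en ;')
    lines.append("    .")
    return "\n".join(lines)

ttl = atu_instance("101",
                   "The Old Dog as Rescuer of the Child (Sheep)",
                   "A farmer plans to kill his faithful old dog ...",
                   ["K231.1.3"])
```

A production version would of course escape quotes inside descriptions, or use an RDF library such as rdflib instead of string concatenation.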
4 Conclusion and Future Work
We presented work done in the context of a running student software project that consists in accessing WordNet to provide lexical semantic information for enriching an existing classification scheme of folktales with additional terms, gained from the extraction of relevant hypernyms (and to a certain extent hyponyms) of words naming characters that play a central role in folktales. The aim is to generate a WordNet-based network of terms for the folktale domain.

As future work, an investigation will be performed in order to determine the optimal length of the path between a Least Common Hypernym (LCH) and the root node of WordNet as the filtering process for excluding irrelevant and noise-introducing LCHs. We will also perform an evaluation of the extracted LCHs against a manually annotated set of ATU entries. And we will compare the French equivalents of the synsets proposed by WordNet with the French terms used in the French Wikipedia page for the AT catalogue. Additionally, we plan to compare our WordNet-based approach as the basis for the linking between ATU and TMI to the machine learning approach to such a linking described in (Declerck et al., 2013).
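The planned path-length filter can be sketched as follows. The depth values would come from WordNet (e.g. `Synset.min_depth()` in NLTK), while the threshold and the illustrative numbers below are assumptions to be tuned by the planned evaluation:

```python
def filter_lchs(candidates, threshold=4):
    """Keep the names of LCH candidates whose depth (path length from
    the WordNet root node) reaches the threshold; shallow, overly
    generic hypernyms such as 'person.n.01' are discarded."""
    return [name for name, depth in candidates if depth >= threshold]

# Illustrative (name, depth) pairs; in NLTK the depth of a synset can
# be obtained with wn.synset(name).min_depth().
candidates = [("person.n.01", 3), ("canine.n.02", 12)]
print(filter_lchs(candidates))  # ['canine.n.02']
```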
Acknowledgments
We would like to thank Alyssa Price for providing the manual analysis of the patterns occurring in ATU. Our gratitude goes also to the two anonymous reviewers for their very helpful comments on the previous version of this short paper.
References
Antti Aarne. 1961. The Types of the Folktale: A Classification and Bibliography. The Finnish Academy of Science and Letters, Helsinki.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Sebastopol, CA.

Thierry Declerck, Karlheinz Mörth, and Piroska Lendvai. 2012. Accessing and Standardizing Wiktionary Lexical Entries for the Translation of Labels in Cultural Heritage Taxonomies. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). Istanbul, Turkey.

Thierry Declerck, Karlheinz Mörth, and Piroska Lendvai. 2013. Linking Motif Sequences to Tale Types by Machine Learning. Proceedings of the 2013 Workshop on Computational Models of Narrative, 166-182. Dagstuhl, Germany.

Christiane Fellbaum (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

George A. Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11: 39-41.

Stith Thompson. 1977. The Folktale. University of California Press, Berkeley.

Hans J. Uther. 2004. The Types of International Folktales: A Classification and Bibliography. Based on the System of Antti Aarne and Stith Thompson. The Finnish Academy of Science and Letters, Helsinki.
Extraction and description of multi-word lexical units in plWordNet 3.0
Agnieszka Dziob
Wrocław University of Technology
Wrocław, Poland
[email protected]
Michał Wendelberger
Wrocław University of Technology
Wrocław, Poland

Abstract

In this paper, we present methods of extraction of multi-word lexical units (MWLUs) from large text corpora and their description in plWordNet 3.0. MWLUs are filtered from collocations of the structural type Noun+Adjective (NA).

1 Introduction

Our focus in this paper is multi-word lexical units (henceforth, MWLUs), derived from collocations (automatically extracted from corpora). As in the case of many linguistic terms, there is no agreement among scholars on their common defining criteria. Two main approaches are distinguished. The first one treats as collocations all expressions that tend to co-occur in the immediate syntactic neighbourhood (Firth 1957). This approach is followed by the constructors of corpora (cf. Przepiórkowski 2012). The second approach puts the emphasis on the linguistic properties of collocations, such as non-compositionality and the impossibility of modification and substitution (Evert 2004). In this approach the term collocation is close to the term multi-word expression (henceforth, MWE), used in computational linguistics for a linkage of words with an established meaning, analysed as a whole (Sag et al. 2002), and to our understanding of the term MWLU. In the present paper we define MWLU by reference to the lexical unit (henceforth, LU), a central element of a wordnet (Fellbaum 1998), a whole attributed with meaning and morphosyntactic properties (Derwojedowa et al. 2008). Thus, an MWLU will be an LU consisting of more than one word and constituting a semantic and morpho-syntactic whole. This is close in spirit to the proposal of Maziarz et al. (2015), saying that an MWLU is "built from more than one word, associated with a definite meaning somehow stored in one's mental lexicon and immediately retrieved from memory as a whole" (Maziarz et al. 2015). Such a definition forces one to perceive MWLUs as having a defined structure and semantics, which makes the combination "behave like the single individual" (Calzolari et al. 2002).
2 Data preparation
In the work on extracting MWEs, two corpora were used: the IPI PAN Corpus1 and the plWordNet corpus of the Wrocław University of Technology (Piasecki et al. 2014). The extraction was carried out using the MWeXtractor set of tools, developed for the purposes of the CLARIN2 project. MWeXtractor is a package of tools created for the construction of the MWLU network in plWordNet and for the syntactic description of MWLUs. It is part of a bigger infrastructure aimed at work with text corpora. The package user has access to a data cloud, where they can store their own corpora (or use the existing corpora available under open licences). The MWeXtractor tool package itself is available under an open licence. Sketch Engine is a tool for work with corpora which allows for the extraction of collocations on the basis of their grammatical relations (Kilgarriff et al. 2004). In many respects Sketch Engine and MWeXtractor do not differ from each other. For the purposes of the development of
1 http://korpus.pl/
2 http://clarin-pl.eu/
the MWeXtractor package, new statistical measures were implemented, which are described in this section. Those measures, which are compilations or modifications of known measures, improved the extraction results described in the following sections.

In the first phase, the authors defined the initial data (sets of corpora, the tagset, and WCCL operators describing relations within a collocation (Radziszewski et al. 2011)). In addition, the order of candidates for MWLU status can be changed, and the continuity of the elements of a collocation does not have to be preserved. The next stage was a dispersion of collocations, through which candidates whose syntactic traits were regarded as interesting are promoted. In the MWeXtractor package, apart from the measures available in the subject literature, the measures designed for the purposes of the present work and presented in Sections 2.1, 2.2 and 2.3 were also implemented.

2.1 W Specific Exponential Correlation

The function W Specific Exponential Correlation is a compilation of a few other associative measures, among others the Specific Exponential Correlation described above. It is represented by the following formula:

y = p(x, y) * log2( p(x, y)^e / (p(x) * p(y)) )

For it, the described generalization uses the formula:

y = p(x1, x2, ..., xn) * log2( p(x1, x2, ..., xn)^e / (p(x1) * p(x2) * ... * p(xn)) )

2.2 W Term Frequency Order

The function W Term Frequency Order additionally includes the frequency of appearance of the candidate, which many association measures assessed as good also rely on:

y = f(t) * WOrder(t)

2.3 W Order

W Order is a function based on the assumption that the more peculiar the word orders in which a given combination appears, the more interesting, and the more certain, the collocation. The function disregards the interpretation of the particular word orders, examining only their number and their frequency distribution relative to the overall frequency of the collocation for the given candidate:

y = prod_{i=1..n} ( 1 + f(S(t)_i) / (max(f(S(t))) + 1) )

2.4 Results

The final data are two types of files: files with k-best lists of candidates for MWEs, and files with evaluations of these lists. The number of generated ranking files is equal to ((A + V + C) * R * F), where A, V and C indicate, respectively, the number of exploited association functions, vector association measures and classifiers, while R and F are, respectively, the number of rounds and folds of cross-validation. Additionally, for every ranking file, Q evaluation files are generated, where Q is the number of exploited evaluation functions for the k-best lists.

The final list of extracted collocations also contained collocations that are already lexical units in plWordNet. The last filtering consisted in removing proper names, definite descriptions and these LUs.

Table 1 presents the 20 best extracted collocations (from the k-best list). The list contains lemma strings from the corpus, annotated with part of speech:

N:link A:zewnętrzny ('external link')
N:raz A:pierwszy ('first time')
N:wojna A:światowy ('world war')
N:to A:sam ('the same')
N:samorząd A:terytorialny ('local government')
N:piłka A:nożny ('football')
N:porządek A:dzienny ('agenda')
N:papier A:wartościowy ('security')
N:sprawa A:wewnętrzny ('internal affairs')
N:igrzyska A:olimpijski ('Olympic Games')
N:strona A:drugi ('other side')
N:podatek A:dochodowy ('income tax')
N:minister A:właściwy ('minister responsible')
N:finanse A:publiczny ('public finance')
N:rada A:nadzorczy ('supervisory board')
N:opieka A:zdrowotny ('health care')
N:rok A:ubiegły ('last year')
N:ciąg A:daleki ('string far')
N:działalność A:gospodarczy ('business activity')
N:projekt A:rządowy ('government project')

Table 1: Best extracted collocations

3 Syntactically non-compositional MWEs

gra losowa ('game of chance')
energetyka odnawialna ('renewable energy industry')
klęska żywiołowa ('natural disaster')
kodeks celny ('customs code')
linie papilarne ('fingerprints')
medycyna weterynaryjna ('veterinary medicine')
obszar wiejski ('rural area')
oficer prasowy ('press officer')
pole golfowe ('golf course')
pojemność skokowa ('engine displacement')

Table 2: Syntactically non-compositional MWLUs
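The association measures introduced in Sections 2.1 and 2.2 can be sketched as follows. Since the formulas in the source are only partly legible, this is our reconstruction, not the authors' reference implementation; the probabilities would be estimated from corpus frequencies, and the exponent e is a parameter of the measure:

```python
from math import log2, prod

def w_specific_exponential_correlation(pxy, px, py, e=2.0):
    """Pairwise form: y = p(x,y) * log2(p(x,y)**e / (p(x) * p(y)))."""
    return pxy * log2(pxy ** e / (px * py))

def w_sec_nary(pjoint, marginals, e=2.0):
    """Generalization: y = p(x1..xn) * log2(p(x1..xn)**e / prod p(xi))."""
    return pjoint * log2(pjoint ** e / prod(marginals))

def w_term_frequency_order(freq, worder):
    """W Term Frequency Order: y = f(t) * WOrder(t)."""
    return freq * worder
```

For n = 2 the generalization coincides with the pairwise form, which is a simple sanity check on the reconstruction.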
Automatic evaluation was the first phase of verification of the extracted collocations. We verified syntactic non-compositionality for NA-type collocations (a noun and a postposed adjective), for which we defined syntactic idiosyncrasies attesting the stability of the combination (in such a form) in the corpus. Based on a statistical analysis, we argue that the syntactic non-compositionality of MWLUs must show the following features:
1. established word order,
2. separability.
What we understand by established word order is the ratio of neutral word order (adjective in postposition) occurrences in the corpus to the alternative word order (adjective in preposition). We took established word order as the main criterion, and if its share was lower than 87.09%, the algorithm suggested abandoning the further procedure (Maziarz et al. 2015). In the case of reaching more than 87.09% occurrence, the algorithm tested separability, defined as the ratio of occurrences in either word order with the elements divided by at least one other text word to the sum of occurrences in both word orders with no text word between the elements of the collocation.
Finally, by using this method we extracted 607 collocations – potential MWLUs. From this list, we rejected several proper names and incomplete phrases. The rest of the collocations were automatically accepted.
Table 2 shows chosen syntactically non-compositional MWLUs.
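Under our reading of the two criteria, the first-stage check can be sketched as follows; the count variables are hypothetical (in practice they would come from corpus queries), and the interpretation of the ratios is our assumption:

```python
def na_order_share(na, an):
    """Share of the neutral order (adjective postposed) in the corpus."""
    return na / (na + an)

def separability(na_sep, an_sep, na_adj, an_adj):
    """Separated occurrences (>= 1 intervening text word, either order)
    relative to adjacent occurrences (no intervening word)."""
    return (na_sep + an_sep) / (na_adj + an_adj)

def first_stage(na, an, threshold=0.8709):
    """Proceed to the separability test only above 87.09% NA order."""
    return na_order_share(na, an) > threshold

print(first_stage(95, 5))   # True
print(first_stage(80, 20))  # False
```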
4 Verification of extracted collocations
At this stage, we gave linguists the list of extracted collocations for verification. At the preliminary stage of verification, linguists removed (i) combinations which were proper names (and had not been eliminated during the automatic verification), (ii) combinations with incomplete phrases, and (iii) peculiar metaphorical uses (rare in accessible sources). Next, linguists assessed the remaining combinations in accordance with the following criteria:
1. a word cannot appear outside the given collocation (imprisoned meaning),
2. terminology,
3. paraphraseability,
4. free word order (in the case of the NA type) (Maziarz et al. 2015a).
By the phrase "a word cannot appear outside the given collocation" we understand a word for which a given collocation is specific, i.e. the word does not appear in any other collocation in Polish, or it does not appear in predicative position. An example of such a collocation is linia naboczna ('lateral line').
As "terms", we recognised those collocations which are precisely and explicitly specified in one or more sources (Polański et al. 1999). In the case of the mathematical and natural sciences, technical sciences, law, econometrics or linguistics, one source, e.g. a (specialist) encyclopaedia, a specialist dictionary or a specialist lexicon, was enough for positive verification of the collocation. In the case of other disciplines (especially the social sciences or humanities), two sources of the types listed above were needed for positive verification. Universal encyclopedias and normative legal texts (acts, regulations) were treated as sufficient sources for confirming the term status of the selected units (Maziarz et al. 2015a). We also took into account other sources (e.g. scientific texts, institutional regulations) whose status is confirmed by some organization (e.g. a scientific unit or association). In such cases, occurrence in two sources was essential for positive verification of the candidate.
"Paraphraseability" means the possibility of occurrence of a collocation in transformations in which the collocation becomes separated, or one of its elements is replaced by another word or phrase, without a change in meaning. At this stage the following transformations were allowed:
1. a subordinate clause instead of an adjective or a participle: niebieska teczka = teczka, która jest niebieska ('blue file = file which is blue');
2. a noun or a prepositional phrase instead of an adjective (with the force of semantic transposition): tekst prawny = tekst prawa ('legal text = text of law'), drewniana podłoga = podłoga z drewna ('wooden floor = floor made of wood');
3. a synonym or a dictionary definition in the place of any element of a collocation: gra zespołowa = zabawa towarzyska, która ma określone zasady, może wymagać rekwizytów3 ('team game = social game which has particular rules and can require props').
In the case of the NA type, an additional criterion, i.e. word order, was taken into account. On the basis of corpus data, linguists judged whether it was possible to change the word order in a collocation without changes in its meaning. In addition, we decided that for the change in word order to be unacceptable, the ratio of NA word order to AN word order has to be greater than 100:1 (Maziarz et al. 2015a).
5 Applications

MWLUs are collected in the MWE dictionary, in which the following description of candidates is applied:
1. the MWE's syntactic scheme,
2. the MWE's part of speech,
3. the MWE's base form,
4. the MWE's syntactic head,
5. the base form of each MWE component,
6. the part of speech of each MWE component.
At present, the dictionary contains 45 thousand MWLUs, mainly nouns and bigrams. MWLUs are grouped together according to syntactic schemes described in the WCCL formalism (Radziszewski et al. 2011a). The dictionary is systematically enlarged.

Acknowledgements

Work supported by the Polish Ministry of Education and Science, Project CLARIN-PL, the European Innovative Economy Programme project POIG.01.01.02-14-013/09, and by the EU's 7FP under grant agreement No. 316097 [ENGINE].

3 Source: plWordNet (http://plwordnet.pwr.wroc.pl/wordnet/)

References

Nicoletta Calzolari, Charles Fillmore, Ralph Grishman, Nancy Ide, Alessandro Lenci, Catherine MacLeod, and Antonio Zampolli. 2002. Towards best practice for multiword expressions in computational lexicons. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC-2002). Las Palmas, Canary Islands, Spain.

Magdalena Derwojedowa, Stanisław Szpakowicz, Magdalena Zawisławska, and Maciej Piasecki. 2008. Lexical units as the centrepiece of a wordnet. In Proceedings of Intelligent Information Systems, Zakopane, Poland. Institute of Computer Science PAS.

Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. University of Stuttgart.

Christiane Fellbaum (ed.). 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

John Firth. 1957. The synopsis of linguistic theory 1930-1955. In Studies of Linguistic Analysis. The Philological Society, Oxford.

Adam Kilgarriff, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. The Sketch Engine. In Proceedings of the 11th EURALEX International Congress. France.
Marek Maziarz, Stan Szpakowicz, Maciej Piasecki.
2015. A Procedural Definition of Multi-word
Lexical Units. Proceedings of the International
Conference on Recent Advances in Natural
Language Processing, Hissar, Bulgaria.
Marek Maziarz, Stanisław Szpakowicz, Maciej Piasecki, and Agnieszka Dziob. 2015a. Jednostki wielowyrazowe. Procedura sprawdzania leksykalności połączeń wyrazowych ['Multi-word units. A procedure for testing the lexicality of collocations']. Technical Report PRE-11, Faculty of Computer Science and Management, Wroclaw University of Technology.
Maciej Piasecki, Marek Maziarz, Stanisław
Szpakowicz, and Ewa Rudnicka. 2014. PlWordNet
as the Cornerstone of a Toolkit of Lexico-semantic
Resources. Proc. 7th International Global Wordnet
Conference, Tartu, Estonia, 25-29 January.
Krzysztof Polański (ed.). 1999. Encyklopedia
językoznawstwa ogólnego. [‘Encyclopedia of
general linguistics’], Ossoliński National Institute,
Wroclaw.
Adam Przepiórkowski. 2004. The IPI PAN Corpus
- preliminary version. Institute of Computer
Sciences, PAS, Warsaw.
Adam Przepiórkowski, Mirosław Bańko, Rafał L.
Górski, Barbara Lewandowska-Tomaszczyk (ed.).
2012. National Corpus of Polish. Polish Scientific
Publishers PWN, Warsaw.
Adam Radziszewski, Adam Wardyński, and Tomasz Śniatowski. 2011. WCCL: A Morpho-syntactic Feature Toolkit. In Text, Speech and Dialogue, Volume 6836 of Lecture Notes in Computer Science. Springer.

Adam Radziszewski, Michał Marcińczuk, and Adam Wardyński. 2011a. Specyfikacja języka WCCL ['Specification of the WCCL language']. Faculty of Computer Science and Management, Wroclaw University of Technology. Source: http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki/Specyfikacja.
John Sinclair. 1991. Corpus, Concordance,
Collocation. Oxford University Press, Oxford.
Ivan Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the 3rd International Conference on Computational Linguistics and Intelligent Text Processing. Mexico City.
Establishing Morpho-semantic Relations in FarsNet (a focus on
derived nouns)
Nasim Fakoornia
Shahid Beheshti University
Tehran, Iran
Negar Davari Ardakani
Shahid Beheshti University
Tehran, Iran
[email protected]
[email protected]
[email protected]
2756 derived nouns), 5691 verbs, 6560
adjectives and 2014 adverbs. Besides semantic
relations (synonymy, hypernymy, hyponymy,
meronymy and antonymy) and morphological
relations
(derivation),
some
additional
conceptual relations such as domain and related
to, have been devised in FarsNet. At present
(2015), it consists of more than 36000 entries,
organized in almost 2000 synsets. The present
study which is aimed at formulating morphosemantic relations of FarsNet’s derived nouns
provides the wordnet with the basic required
information for automation of the relations.
Abstract
This paper aims at a morpho-semantic analysis
of 2461 Persian derived nouns, documented in
FarsNet addressing computational codification
via formulating specific morpho-semantic
relations between classes of derived nouns and
their bases. Considering the ultimate aim of the
study, FarsNet derived nouns included 12 most
productive suffixes have been analysed and as
a consequence 45 morpho-semantic patterns
were distinguished leading to creation of 17
morpho-semantic relations. The approach
includes a close examination of beginners,
grammatical category and part of speech shifts
of bases undergoing the derivation process. In
this research the morpho-semantic relations are
considered at the word level and not at the
synset level which will represent a crosslingual validity, even if the morphological
aspect of the relation is not the same in the
studied languages. The resulting morphosemantic formulations notably increase
linguistic and operative competence and
performance of FarsNet while is considered an
achievement in Persian descriptive morphology
and its codification.
1

Introduction

According to Deléger et al. (2009), a morpho-semantic process decomposes derived, compound and complex words into their bases and associates this process with their semantic interpretation. Through morpho-semantic analysis, derived and compound words are analysed morphologically, and the relations between a base and its derived form are interpreted semantically (Namer & Baud 2007). Raffaelli & Kerovec (2008) consider “morphosemantics” the best expression for studies which deal with the links between form and meaning at the word level.

Derivation and compounding are the two main word formation processes. Persian derivational morphology consists of an affixal system in which suffixes outnumber prefixes. Persian derivational processes include suffixation, prefixation, a single case of circumfixation and no infixation (Davari and Arvin 2015). Affixation patterns in this language are generally regular, with few exceptions (Megerdoomian 2000). According to Keshani (1992), Persian derivation relies on almost 56 suffixes.

A comprehensive and detailed description of the relevant linguistic levels is a prerequisite for achieving progress in natural language processing (NLP). Wordnets are very popular lexical ontologies, relying on morphological, semantic and morpho-semantic descriptions and formulations. FarsNet, a Persian wordnet, was established in 2009 by the NLP research lab of Shahid Beheshti University. It follows closely the lines and principles of Princeton WordNet, EuroWordNet and BalkaNet (Shamsfard et al. 2010). The latest version of FarsNet (2.0) contains 22180 nouns.

The aim of the present study is to explore, formulate and classify the morpho-semantic patterns of derived nouns by analysing the relevant data in FarsNet. It is worth noting that the present article originates from a wider-scope study by Fakoornia (2013), in which all 2756 derived nouns in FarsNet were analysed in order to establish morpho-semantic relations between derived nouns and their bases. The derived nouns under study involved 26 different suffixes; here we focus on the derivatives of the 12 most productive noun-marker suffixes (2461 items). This study enriches FarsNet while improving the morpho-semantic codification of Persian.
After a brief introduction to FarsNet word entries in general and noun entries in particular, the process of morpho-semantic pattern formulation will be elaborated for the selected suffixes.

2

FarsNet Word Entries

Entries include a phonological transcription, the part of speech, synonyms and their classification into a synset, the word meaning and an example. A beginner is selected for each lexeme. According to Miller et al. (1990), a beginner is a primitive semantic component of any word in its hierarchically structured semantic field. Beginners can be used in the recognition of domain synsets. Different syntactic types can be related to each other in FarsNet; mapping each entry to its corresponding concept in Princeton WordNet 3.0 is also possible (Shamsfard et al., 2010). This information is essential for establishing morpho-semantic relations.

Besides specifying a noun type (such as common, proper, countable, uncountable, pronoun, number or infinitive), a classification on the basis of some more general semantic features (such as belonging to human, animal, location or time) is provided. The synsets which do not fall into any of these categories are tagged with the label nothing. Semantic relations are established among synsets with the same POS; synsets with a different POS are tagged with labels such as “related to”. There are three choices for mapping a synset to the corresponding one in Princeton WordNet 3.0: equivalence mapping, near-equivalence mapping and no mapping. Finally, the morphological relations among senses, such as derivational relations, are marked. Table 1 shows the prevailing noun beginners in FarsNet.

Table 1: list of noun beginners in FarsNet
1. act   2. animal   3. artifact   4. attribute   5. body   6. cognition   7. communication   8. event   9. feeling   10. group   11. location   12. motive   13. object   14. person   15. phenomenon   16. plant   17. possession   18. process   19. quantity   20. relation   21. shape   22. state   23. substance   24. time   25. food
3

Data Analysis

For the purpose of this study the noun corpus of FarsNet (22180 nouns) was thoroughly explored. First, the list of derived nouns (2756) was prepared. Then the nouns were broken into their roots and affixes. From among the 26 suffixes, the 12 most frequent (2461 derivatives) were selected, described and analysed in this paper. They are listed in Table 2; the morphological descriptions are compatible with Keshani's (1992) description of Persian suffixes.
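The suffix-stripping step described above can be sketched as follows; the Latin-transliterated suffix spellings and the sample nouns are illustrative assumptions for the sketch, not the actual FarsNet data or segmentation code.

```python
from collections import Counter

# Illustrative sketch of the data-analysis step: split each derived noun
# into base + suffix and count suffix frequencies. Suffix spellings are
# rough Latin transliterations chosen only for this example.
SUFFIXES = ["-i", "-e", "-ak", "-che", "-gah", "-dan",
            "-gar", "-ban", "-ande", "-ar", "-esh", "-ane"]

def segment(noun, suffixes=SUFFIXES):
    """Return (base, suffix) for the longest matching suffix, else None."""
    for suf in sorted(suffixes, key=len, reverse=True):
        ending = suf.lstrip("-")
        if noun.endswith(ending) and len(noun) > len(ending):
            return noun[:-len(ending)], suf
    return None

derived = ["barani", "goldan", "sobhane", "darmangah"]
pairs = [p for p in (segment(n) for n in derived) if p]
counts = Counter(suffix for _, suffix in pairs)
```

Sorting candidate suffixes by length first avoids, for example, stripping a short "-e" from a noun that actually ends in "-ane".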
(Table 2 lists, for each of the 12 selected suffixes, “-i”, “-e”, “-æk”, “-ʧe”, “-gah”, “-dan”, “-gær”, “-ban”, “-ænde”, “-ar”, “-eʃ” and “-ane”, the attested POS shifts from base to derivative (e.g. n-n, a-n, v-n) and the semantic load of the derivatives (e.g. diminution, similarity, location, body part, time, profession, food).)

Table 2: A list of selected suffixes

The following information is required to link each noun to its base:
- the morphological information of the noun, including the POS of the base and of the derivative, as well as other noun types (proper, common, number, etc.);
- the semantic category: human, animal, location, time or nothing;
- the beginner, such as act, person, feeling, event, etc. (Table 1);
- the derivational relation between the derived noun and its base.

4

Morpho-semantic Analysis of Selected Suffixes

In this section we scrutinize the 12 most productive noun-marker suffixes from a morpho-semantic point of view. More information about the other Persian suffixes can be found in Fakoornia (2013).

4.1 “-i”

In FarsNet, 4125 nouns end in the letter /i/, among which 1880 are considered derivatives of the suffix “-i”. “-i” is an extremely productive Persian suffix: it can connect to bases of different grammatical categories, to compound words and even to syntactic phrases.

a. “-i” connects to nouns and adjectives and makes abstract nouns expressing an attribute or a state. The process is highly productive in Persian. Thus if “-i” connects to a noun or an adjective with any type of beginner, the beginner of the resulting derivative will be attribute or state. Given this regularity, the relation can be expressed as “derivative attribute of base”, for example “bideGati attribute of bideGat” (carelessness attribute of careless). FarsNet includes 802 tokens of such nouns.

b. “-i” connects to agent nouns and present participles describing a job or an act and makes noun infinitives referring to a field, a job or an act. In Persian the beginner of an agent noun is person and the beginner of a gerund is act or cognition. So if “-i” connects to a noun belonging to person or to a present participle, the beginner of the derivative will be act or cognition, and the relation “base agent of derivative” is predictable. For example, “mohændes agent of mohændesi” (engineer agent of engineering). FarsNet includes 890 tokens of such nouns.

c. “-i” connects to agent nouns and makes nouns referring to a location or territory. So if “-i” connects to a base which is person and makes a derivative referring to location, we have the relation “derivative location of base”, for example “tælaforuʃi location of tælaforuʃ” (jewelry location of jeweler). FarsNet includes 15 tokens of such nouns.

d. Other structures include the use of “-i” to refer to colors, which inherit from property. Thus if the base beginner is anything and the derivative beginner is property, the relation “base the same color as derivative” is established. For example, “porteGal the same color as porteGali” (orange the same color as orange). FarsNet includes 15 tokens of such nouns.

e. “-i” connects to some other nouns, verbs and adjectives (excluding the above-mentioned ones) and makes derivatives referring to feeling, process, event, act, person, object, nothing, etc. So if the base POS is verb, noun or adjective (other than a present participle) and the derivative beginner can be anything, we have the relation “derivative related to base”. For example, “barani related to baran” (raincoat related to rain). FarsNet includes 144 tokens of such nouns.

f. There are also 14 derivatives of “-i” in FarsNet which can be classified under both (a) and (b). In this case both relations, “derivative attribute of base” and “base agent of derivative”, can be established. For example, “bædæxlagi attribute of bædæxlag” (irritability attribute of irritable) and also “bædæxlag agent of bædæxlagi”.

A summary of what has been explicated is given in Table 3:
      | base POS      | suffix | base beginner         | derivative POS | derivative beginner           | morpho-semantic relation                                  | number
a     | n/adj         | “-i”   | anything              | n              | attribute/state               | derivative attribute of base                              | 802
b     | n/pres. part. | “-i”   | person                | n              | act/cognition                 | base agent of derivative                                  | 890
c     | n             | “-i”   | person                | n              | location                      | derivative location of base                               | 15
d     | n             | “-i”   | anything              | n              | property                      | derivative the same color as base                         | 15
e     | v/n/adj       | “-i”   | anything except above | n              | anything                      | derivative related to base                                | 144
f     | n/pres. part. | “-i”   | person                | n              | attribute/state/act/cognition | derivative attribute of base and base agent of derivative | 14
total |               |        |                       |                |                               |                                                           | 1880

Table 3: morpho-semantic patterns of suffix “-i” derivatives
According to the above patterns, the word formation processes of “-i” can be formulated as follows. The beginners are given in parentheses and the frequency of each pattern in angle brackets.

a. Noun (person)/present participle + “-i” = noun (act/cognition) → base agent of derivative <890>.

b. Noun (anything)/adjective + “-i” = noun (attribute/state) → derivative attribute of base <802>.

c. Verb/noun (other)/adjective + “-i” = noun (anything) → derivative related to base <144>.

d. Noun (person) + “-i” = noun (location) → derivative location of base <15>.

e. Noun (anything) + “-i” = noun (property) → derivative the same color as base <15>.

f. Noun (person) + “-i” = noun (attribute/state/act/cognition) → derivative attribute of base / base agent of derivative <14>.

As can be seen, “-i” is frequently involved in forming derivatives with beginners such as act, cognition and attribute; few of its derivatives are categorized under location and property. Formula (c) covers those patterns not captured by the other structures.
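Patterns of this kind lend themselves to an ordered rule table. The sketch below is a hypothetical encoding, not the FarsNet implementation; rule order is used here to resolve the overlap between formulas (a) and (b) that pattern (f) highlights.

```python
# Hypothetical encoding of the "-i" patterns as an ordered rule table:
# each rule maps (base POS, base beginner) to a derivative beginner and a
# morpho-semantic relation. Earlier rules win, so person-based bases take
# the agent reading first (cf. pattern f).
RULES_I = [
    # (allowed base POS, required base beginner or None for any,
    #  derivative beginner, relation)
    ({"noun", "present participle"}, "person",
     "act/cognition", "base agent of derivative"),
    ({"noun", "adjective"}, None,
     "attribute/state", "derivative attribute of base"),
]

def assign_relation(base_pos, base_beginner, rules=RULES_I):
    """Return (derivative beginner, relation) for the first matching rule."""
    for pos_set, beginner, deriv_beginner, relation in rules:
        if base_pos in pos_set and beginner in (None, base_beginner):
            return deriv_beginner, relation
    # catch-all, corresponding to formula (c)
    return "anything", "derivative related to base"

assign_relation("present participle", "person")
# → ("act/cognition", "base agent of derivative")
```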
4.2 “-e”

Seven morpho-semantic patterns have been distinguished for suffix “-e”:

a. Verb + “-e” = noun (anything except act) → derivative related to base verb form: “sorude related to sorudan” (song related to sing) <47>.

b. Noun (anything) + “-e” = noun (other) → derivative related to base: “ruze related to ruz” (fast¹ related to day) <42>.

c. Adjective + “-e” = noun (anything) → base attribute of derivative: “jævan attribute of jævane” (young attribute of sprout) <23>.

d. Noun (object/body) + “-e” = noun (anything) → derivative similar to base: “dæhane similar to dæhan” (opening similar to mouth) <16>.

e. Noun (quantity) + “-e” = noun (time) → base quantity of derivative: “dæh quantity of dæhe” (ten quantity of decade) <4>.

f. Verb + “-e” = noun (act) → derivative act of base verb form: “xænde act of xændidæn” (laughter (n.) act of laugh (v.)) <3>.

g. Diminutive noun (person) + “-e” = noun (person) → derivative pejorative sense of base: “doxtæræke pejorative sense of doxtæræk” (bad girl pejorative sense of little girl) <1>.

As can be seen, “-e” often links to verbs and creates derivatives with different types of beginners; it seldom results in pejorative nouns.

4.3 “-æk”

Suffix “-æk” shows 8 morpho-semantic patterns in Persian:

a. Noun (anything) + “-æk” = noun (anything except food) → derivative similar to base: “surætæk similar to suræt” (mask similar to face) <22>.

b. Noun (anything except person/animal/food) + “-æk” = noun (anything except person, animal and food) → derivative similar to base and derivative diminutive of base: “ʃæhræk similar to ʃæhr” and “ʃæhræk diminutive of ʃæhr” (town similar to city) and (town diminutive of city) <11>.

c. Adjective + “-æk” = noun (anything) → base attribute of derivative: “sorx attribute of sorxæk” (red attribute of measles) <6>.

d. Verb + “-æk” = noun (anything) → derivative related to base verb form: “gæltæk related to gæltidæn” (roller related to roll) <4>.

e. Noun (anything) + “-æk” = noun (food) → derivative similar to base: “pæʃmæk similar to pæʃm” (cotton candy similar to wool) <3>*.

f. Noun (person/animal) + “-æk” = noun (person/animal) → derivative diminutive of base: “doxtæræk diminutive of doxtær” (little girl diminutive of girl) <2>.

g. Noun (body) + “-æk” = noun (act) → base agent of derivative: “naxon agent of naxonæk” (nail agent of pick) <1>.

h. Noun (body) + “-æk” = noun (body) → derivative related to base: “guʃæk** related to guʃ” (eardrum related to ear) <1>.

* Although similar, formulas (a) and (e) cannot be merged into a single category: in (a), although the beginner of both derivative and base can be anything, the tokens of each category are exclusive, whereas in pattern (e) the beginner of the derivative can be the same as that of the base.

** As the POS and the beginner of the word “guʃæk” (eardrum) do not change in the derivation process, during computational codification it is classified under the second formula; according to its meaning, however, it cannot be entered in that group, so it must be manually excluded and a general relation (derivative related to base) formulated for it.
¹ Fast: to abstain from certain foods, for religious or medical reasons (especially during the day).

4.4 “-ʧe”

Suffix “-ʧe” shows 2 morpho-semantic patterns:

a. Noun (anything) + “-ʧe” = noun (anything) → derivative diminutive of base and derivative similar to base: “dæryaʧe diminutive of dærya” and “dæryaʧe similar to dærya” (lake diminutive of sea) and (lake similar to sea) <28>.

In some nouns “-ʧe” refers neither to similarity nor to diminution but merely indicates a vague relatedness; an example is “ʔænbærʧe” (sachet). In such cases the relation “derivative related to base” is formulated, but during computational codification derivatives belonging to this structure are automatically classified under the previous structure and must be manually removed from it. In FarsNet there was only one derivative of this type. Thus the formula is:

b. Noun (anything) + “-ʧe” = noun (anything) → derivative related to base: “ʔænbærʧe related to ʔænbær” (sachet related to ambergris) <1>.

4.5 “-gah”

Suffix “-gah” shows 3 morpho-semantic patterns:

a. Noun (anything)/verb + “-gah” = noun (location) → derivative location of base: “dærmangah location of dærman” (health centre location of treatment) <83>.

b. Noun (anything) + “-gah” = noun (body) → derivative related to base: “gijgah related to gij” (temple related to dizzy) <6>.

c. Verb + “-gah” = noun (anything) → derivative related to base verb form: “didgah related to didæn” (viewpoint related to view) <1>.

The above shows that derivatives having location as their beginner outnumber those with other beginners. Moreover, the suffix rarely connects to a verb.

4.6 “-dan”

Suffix “-dan” shows a single morpho-semantic pattern in Persian:

a. Noun (anything) + “-dan” = noun (anything) → derivative location of base: “goldan location of gol” (vase location of flower) <11>.

4.7 “-gær”

Suffix “-gær” shows 4 morpho-semantic patterns:

a. Noun (act) + “-gær” = noun (person) → derivative agent of base: “arayeʃgær agent of arayeʃ” (stylist agent of makeup) <28>.

b. Noun (anything except act) + “-gær” = noun (person) → derivative related to base: “ahængær related to ahæn” (blacksmith related to iron) <13>.

c. Noun (anything) + “-gær” = noun (object) → derivative instrument of base: “næmayeʃgær instrument of næmayeʃ” (monitor instrument of display) <3>.

d. Verb + “-gær” = noun (object) → derivative agent of base verb form: “roftegær agent of roftæn” (dustman agent of sweep) <1>.

4.8 “-ban”

Suffix “-ban” shows 3 morpho-semantic patterns:

a. Noun (anything) + “-ban” = noun (person) → derivative protector of base: “jængælban protector of jængæl” (woodsman protector of wood) <17>.

b. Noun (anything) + “-ban” = noun (object) → derivative related to base: “sayeban related to saye” (sunshade related to shade) <3>.

c. Verb + “-ban” = noun (person) → derivative agent of base verb form: “dideban agent of didæn” (sentinel agent of guard) <2>.

4.9 “-ænde”

Suffix “-ænde” shows a single morpho-semantic pattern:

a. Verb + “-ænde” = noun (anything) → derivative agent of base verb form: “ʔafarinænde agent of ʔafæridæn” (creator agent of create) <76>.

4.10 “-ar”

Suffix “-ar” shows 4 morpho-semantic patterns:

a. Noun (anything) + “-ar” = noun (anything) → derivative related to base: “dadar related to dad” (God related to justice) <5>.

b. Verb + “-ar” = noun (act) → derivative act of base verb form: “goftar act of goftæn” (speech act of say) <2>.

c. Verb + “-ar” = noun (person) → derivative agent of base: “xæridar agent of xæridæn” (buyer agent of buy) <2>.

d. Verb + “-ar” = noun (anything except act and person) → derivative related to base verb form: “saxtar related to saxtæn” (structure related to construct) <2>.

4.11 “-eʃ”

Suffix “-eʃ” shows 3 morpho-semantic patterns:

a. Verb + “-eʃ” = noun (act) → base act of derivative verb form: “Gorridæn act of Gorreʃ” (roar (v.) act of roar (n.)) <68>.

b. Verb + “-eʃ” = noun (anything except act) → derivative related to base verb form: “deræxʃeʃ act of deræxʃidæn” (shine act of shine) <15>.

c. Noun (anything) + “-eʃ” = noun (anything) → derivative related to base: “yoneʃ related to yon” (ionization related to ion) <8>.

4.12 “-ane”

Suffix “-ane” shows 3 morpho-semantic patterns:

a. Noun (anything)/adverb + “-ane” = noun (food) → derivative food of base: “sobhane food of sobh” (breakfast food of morning) <7>.

b. Verb + “-ane” = noun (object) → derivative instrument of base verb form: “resane instrument of resandæn” (media instrument of broadcast) <6>.

c. Noun (anything) + “-ane” = noun (anything except food) → derivative related to base: “ʔængoʃtane related to ʔængoʃt” (thimble related to finger) <5>.

The two noted exceptions, “guʃæk” (eardrum) and “ʔænbærʧe” (sachet), would naturally fall under the formulated relations “derivative similar to base” or “derivative diminutive of base”. Considering the meaning of their bases and of the resulting derivatives, however, they do not belong to those relations, and other relations had to be formulated to include them.

5

Conclusion

Morpho-semantic analysis of a selection of 2461 derived nouns in FarsNet yielded 45 morpho-semantic patterns and 17 morpho-semantic relations (such as “derivative agent of base”, “derivative location of base”, etc.) for the 12 most productive suffixes. Considering that only 2 words out of 2461 (0.08%) did not fall into the patterns, it can be concluded that the patterns successfully provide the foundations for establishing automatic relations between derived or complex nouns and their bases in FarsNet. The joint consideration of the words' morphological features, such as their POS and their semantic and grammatical category (e.g. agent noun, participle noun, present participle, etc.), together with the recognition of the beginners of the bases (e.g. act, person, food, etc.) and of their change after affixation, has been the key criterion in formulating the relations, which was especially crucial for the majority of the studied suffixes, which are polysemous. Defining and codifying these morpho-semantic patterns leads to a coherent establishment of morpho-semantic relations in FarsNet and hence has a remarkable impact on the applicability of the database in machine translation, question answering systems, etc. Although in this research the morpho-semantic relations are considered at the word level and not at the synset level, mapping the results to the relations formulated in the wordnets of other languages will provide cross-lingual validity, even if the morphological aspect of a relation is not the same in the mapped languages.

References

Davari Ardakani, Negar and Mahdiyeh Arvin. 2015. Persian. In N. Grandi and L. Kortvelyessy, editors, Edinburgh Handbook of Evaluative Morphology. Edinburgh University Press, Edinburgh, pages 287-295.

Deléger, Louise, Fiammetta Namer and Pierre Zweigenbaum. 2009. Morphosemantic Parsing of Medical Compound Words: Transferring a French Analyzer to English. International Journal of Medical Informatics, 78(1): 48-55.

Fakoornia, Nasim. 2013. Morphosemantic Analysis of Nouns in Persian and English Aiming at Computational Codification. Master's thesis, Shahid Beheshti University, June.

Farshidvard, Khosrow. 2007. Derivation and Compounding in Persian. Zavar press, Tehran.

Keshani, Khosrow. 1992. Suffix Derivation in Contemporary Persian. Iran University Press, Tehran.

Megerdoomian, Karine. 2000. Persian Computational Morphology: A Unification-based Approach. NMSU, CRL, Memoranda in Computer and Cognitive Science (MCCS-00-320).

Miller, George A. et al. 1990. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4): 235-244. doi:10.1093/ijl/3.4.235

Namer, Fiammetta and Robert Baud. 2007. Defining and Relating Biomedical Terms: Towards a Cross-language Morphosemantics-based System. International Journal of Medical Informatics, 76(2-3): 226-233.

Raffaelli, Ida and Barbara Kerovec. 2008. Morphosemantic Fields in the Analysis of Croatian Vocabulary. Jezikoslovlje, 9(1-2): 141-169.

Shamsfard, Mehrnoush et al. 2010. Semi-Automatic Development of FarsNet: The Persian WordNet. 5th Global WordNet Conference (GWC 2010), Mumbai, India.
Using WordNet to Build Lexical Sets for Italian Verbs
Anna Feltracco
Fondazione Bruno Kessler
Università di Pavia, Italy
[email protected]

Simone Magnolini
Fondazione Bruno Kessler
Università di Brescia, Italy
[email protected]

Lorenzo Gatti
Fondazione Bruno Kessler
Università di Trento, Italy
[email protected]

Bernardo Magnini
Fondazione Bruno Kessler
Povo-Trento, Italy
[email protected]

Elisabetta Jezek
Università di Pavia
Pavia, Italy
[email protected]

Abstract

We present a methodology for building lexical sets for argument slots of Italian verbs. We start from an inventory of semantically typed Italian verb frames and, through a mapping to WordNet, we automatically annotate the sets of fillers for the argument positions in a corpus of sentences. We evaluate both a baseline algorithm and a syntax-driven algorithm, and show that the latter performs significantly better in terms of precision.

1 Introduction

In this paper we present a methodology for building lexical sets for argument slots of Italian verbs. Lexical sets (Hanks, 1996) are paradigmatic sets of words which occupy the same argument position for a verb, as found in a corpus. For example, for the verb read, the following set can be built by observing the lexical fillers of the object position in the BNC corpus:

(1) read {book, newspaper, bible, article, letter, poem, novel, text, page, passage, ...}

To collect lexical sets for Italian verbs, we use the lexical resource T-PAS (Jezek et al., 2014), an inventory of typed predicate argument structures for Italian manually acquired from corpora through inspection and annotation of actual uses of the analyzed verbs. In the current version of the T-PAS resource, only the verb is tagged in the annotated corpus, while the lexical items for the argument slots are not. Thus, the annotation of the lexical sets will enrich the current version of the resource and will open the way to experiments for automatically extending its coverage.

A relevant step in our methodology is the annotation of the lexical items for argument positions in sentences. A previous work (Jezek and Frontini, 2010) has already outlined an annotation scheme for this purpose, and highlighted its benefits for NLP applications. In that work, however, the annotation of lexical sets was intended as manual, whereas the methodology we propose here is conceived for automatic annotation, and exploits an existing external resource. Under this perspective our work is related to semantic role labeling (Palmer et al., 2010).

This paper is organized as follows. Section 2 introduces the T-PAS resource; in Section 3 the lexical set population task is defined, and in Section 4 the experimental setting is presented. Section 5 discusses the results and is followed by the error analysis in Section 6. Finally, Section 7 provides some conclusions and directions for future work.

2 Overview of the T-PAS Resource

T-PAS, Typed Predicate Argument Structures, is a repository of verb patterns acquired from corpora by manual clustering of distributional information about Italian verbs (Jezek et al., 2014). The resource has been developed following the lexicographic procedure called Corpus Pattern Analysis, CPA (Hanks, 2004). In particular, in the resource T-PASs are semantically motivated and are identified by analysing examples found in a corpus of sentences, i.e. a reduced version of ItWAC (Baroni and Kilgarriff, 2006).

After analyzing a sample of 250 concordances of the verb in the corpus, the lexicographer defines each T-PAS, recognising its relevant structure and identifying the Semantic Types (STs) for each argument slot by generalizing over the lexical sets observed in the concordances; as an example, Figure 1 shows the T-PAS#2 of the verb divorare: [[Human]] divorare [[Document]] (Eng. to devour), where [[Document]] stands for {libro, romanzo, saggio} (Eng. {book, novel, essay}) (Figure 2). STs are chosen among a list of about 230 corpus-derived semantic classes compiled by applying the CPA procedure to the analysis of concordances for about 1500 English and Italian verbs (Jezek et al., 2014)¹. If no generalization is possible, the lexical set is listed. Finally, the lexicographer associates the instances in the corpus to the corresponding T-PAS and adds a free-text description of its sense (Figure 1). The T-PAS resource thus lists the analyzed verbs², the identified T-PASs for each verb, and the annotated instances for each T-PAS in the corpus.

Figure 1: T-PAS#2 for the verb divorare.
Figure 2: Lexical Set identification for T-PAS#2 for the verb divorare.

For instance, example (2) shows the T-PAS#1 of the verb preparare (Eng. to prepare) and a sentence associated to it.

(2) [[Human]] preparare [[Food | Drug]]
"La nonna, prima di infornare le patate, prepara una torta"
(Eng. "the grandmother, before baking the potatoes, prepares a cake")

In this case, the system should identify nonna (Eng. grandmother) as a lexical item for [[Human]]-SUBJ and torta (Eng. cake) for [[Food]]-OBJ. If this annotation is repeated for all the sentences of the T-PAS#1 of the verb preparare, the system will build the lexical set for the ST [[Human]] in subject position in the T-PAS, such as {nonna, chef, Gino, bambina, ...}, and for [[Food]] in object position, such as {torta, zuppa, pasta, panino, ...}.

In the next Sections, we will define the lexical set population task and describe the experiment we ran and its evaluation.

3 Task Definition
The aim of our system is to automatically derive lexical sets corresponding to the STs in the T-PAS resource. The task is defined as follows. The system receives as input (i) a T-PAS of a certain verb and (ii) a sentence associated to that T-PAS in the resource. The system should correctly mark (where present) the lexical items or the multiword expressions corresponding to the STs of each argument position specified by the T-PAS (the sentence annotation step). By replicating this annotation for all the sentences of a T-PAS, the system builds the lexical set for a specific ST in a specific T-PAS (the lexical set population step).

4 Experimental Setting

In order to identify possible candidate items for a ST, the system uses information from MultiWordNet (Pianta et al., 2002) (from now on MWN), e.g. to derive that "grandmother" is a human being and associate it to the ST [[Human]], and that "cake" is a type of food and associate it to the ST [[Food]]. The task thus required an initial mapping between the T-PAS resource and MWN. We then compared a naive Baseline algorithm and a more elaborate algorithm that we call LEA, Lexical Set Extraction Algorithm. Finally, to evaluate the performance of our methodology we also created a gold standard.
ST to Synset mapping. For our experiment,
the list of STs used in the T-PAS resource was automatically mapped onto corresponding WordNet
1.6 synsets. For instance, the ST [[Human]] was
mapped to all the synsets for the noun human (i.e.
human#n). Manual inspection was limited to the
case in which there is no exact match between a
ST and a synset (e.g. by associating “atmosphericphenomenon” to [[Weather Event]]).
The Baseline algorithm. The Baseline algorithm identifies possible candidate members of the
lexical set corresponding to a certain ST for a certain T-PAS by (i) lemmatizing each sentence using
TextPro (Pianta et al., 2008), (ii) checking if each
lemma is in MWN and (iii) determining whether
1
Labels for STs in T-PAS are in English, as in the corresponding English resource PDEV (Hanks and Pustejovsky,
2005).
2
The current version of T-PAS contains 1000 analyzed average polysemy verbs, selected on the basis of random extraction of 1000 lemmas out of the total set of fundamental
lemmas of Sabatini Coletti (2007).
101
ST allowed by the T-PAS frame (e.g. Maria Rossi
→ Person → [[Human]]). Since the Baseline recognizes only named entities that are in MWN, we
expect this algorithm to identify more items.
Finally, LEA (iv) looks for multiword expressions in a chunk by checking if the combination exists in MWN. For instance, in “La nonna
prepara la conserva di frutta” (Eng.: the grandmother prepares the fruit conserve), LEA should
identify conserva di frutta as [[Food]] (while the
Baseline identifies only the token frutta).
The LEA algorithm, thus, should recognize as
valid only the items for a certain argument slot
of the analyzed verb (and not for other verbs in
the sentence), solve major cases of same ST in
different slots and identify named entities and
multiword expressions.
the lemma belongs to a synset that was mapped to
the ST, or if it is an hyponym of one such synsets.
For instance, in example (2), the Baseline
lemmatizes the sentence and selects as possible
candidates the nouns of the sentence, i.e. nonna,
torta and patate. The Italian lemma nonna is
thus searched in MWN and the correspondent English lemmas grandma#n#1, grandmother#n#1,
granny#n#1, grannie#n#1 are found. Since none
of these synset lemmas match with [[Human]],
[[Food]] or [[Drug]], the MWN hierarchy is
traversed until human#n#1 is found, which is
mapped to [[Human]]. The same is done for torta
and patate, until [[Food]] is found. Thus, for (2),
the Baseline identifies nonna as [[Human]] and
torta and patate as [[Food]] (with patate being a
misclassified item, as it is not referred to the verb
preparare).
Gold Standard. We created a gold standard
for the task by manually annotating 500 examples. We asked three annotators to mark the lexical items or the multiword expressions that correspond to the STs, without annotating pronouns
or relative clauses. We selected the 500 sentences
by extracting 10 sentences for 10 different STs in 5
different T-PASs (for a total of 50 different T-PASs
belonging to 47 verbs). In particular, we chose,
among all the STs within the [[Inanimate]] hierarchy, 10 types that are used in at least 5 different
T-PAS, each of them having at least 10 (potential) sentences associated in the corpus resource.
For example, we selected [[Food]] and annotated
10 sentences for T-PAS#1 of mangiare ”[[Human]] mangiare [[Food]]” (Eng. to eat), since (i)
there are at least 5 verbs with a T-PAS containing
[[Food]], like mangiare itself and (ii) we have at
least 10 sentences available for each of these five
T-PASs 3 . This selection of few STs was intended
to better compare performances of the algorithms
for different lexical sets.
The gold standard annotation resulted in a total
of 981 annotated tokens out of 15090 (the average
sentence length being 30.18 tokens).
The LEA algorithm. Compared to the Baseline, the LEA algorithm also takes into account the dependency tree of the sentence, named entities as recognized by TextPro, and multiword expressions.
It starts by (i) finding the position of the verb in an example and considering as valid candidates only the chunks that are a subject, direct object or complement of that verb according to the TextPro dependency tree. With respect to the Baseline, this leads to a more precise identification of the items filling the argument slots of the verb under consideration. For instance, in (2) we expect the algorithm to correctly identify nonna as [[Human]] and torta as [[Food]], but not to propose patate (as the Baseline does).
The LEA algorithm also (ii) checks whether the verb allows the same ST for both subject and object, as in T-PAS#3 of pettinare, [[Human1]] pettinare [[Human2]] (Eng. to comb someone's hair). In the sentence "La mamma pettina il bambino" (Eng. The mum combs the baby), LEA will correctly propose mamma as [[Human1]] and bambino as [[Human2]]. In this case, it also checks whether the verb is in the passive form and swaps the items in subject and object position as needed, improving precision with respect to the Baseline.
Furthermore, the algorithm (iii) checks whether the chunk contains or overlaps with proper names related to persons, organizations and locations detected by TextPro and, if this is the case, checks the corresponding named entity type against the ST.
5 Results

For what concerns sentence annotation, we evaluate overall precision, recall and F-measure, considering as a positive match the cases in which an algorithm agrees with the gold standard in recognizing a token as an item (or part of an item, in the case of multiword expressions) instantiating a ST in a precise position.

3 This is mainly a selection criterion. Considering that we analyzed a limited number of examples for each verb, and that more than one ST can be specified for each argument slot, it is also possible that none of the sentences extracted for a given ST of a verb instantiates that particular ST.
Compared to the Baseline, the LEA algorithm registers a significantly higher precision (see Automatic mapping in Table 1). This is not surprising, as the Baseline considers as valid all the items in the sentence that can correspond to the ST, without taking into account whether they are in the argument position required by the T-PAS. The LEA algorithm, on the contrary, also considers the syntactic structure, thus lowering the false positive rate; the downside is that its recall is lower than that of the Baseline. For these reasons, we believe that, on a broader scale, the higher precision of LEA makes it preferable to the Baseline.
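The token-level evaluation described above can be sketched as follows; `gold` and `predicted` are sets of (sentence id, token index) pairs marking annotated tokens, and all names are illustrative rather than taken from the paper.

```python
def precision_recall_f1(gold, predicted):
    """Token-level scores: a true positive is a token that both the
    algorithm and the gold standard mark as (part of) an item
    instantiating the ST in a given argument position."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy run: 4 gold tokens, 5 predicted, 3 in common.
gold = {("s1", 2), ("s1", 5), ("s2", 0), ("s2", 7)}
pred = {("s1", 2), ("s1", 5), ("s2", 0), ("s2", 3), ("s3", 1)}
p, r, f1 = precision_recall_f1(gold, pred)
```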
6 Error Analysis

Automatic mapping. The results presented in the first part of Table 1 were manually inspected to identify sources of errors. In particular, we noticed that many inaccuracies are due to the automatic mapping of STs to WordNet synsets. For instance, both algorithms failed to recognize casa (Eng.: house) as corresponding to the ST [[Building]], which was automatically mapped onto building#n; they would have succeeded had the ST been mapped to the more general construction#n.
Even when the automatic mapping works, the different structure of the two resources can lead to wrong results. For instance, vehicles such as elicottero (Eng.: helicopter) are frequently generalized by the ST [[Vehicle]] in T-PAS and are hyponyms of vehicle#n in MWN. However, while in T-PAS [[Machine]] is a hypernym of [[Vehicle]], the same is not true for machine#n in MWN. As a consequence, in sentences in which vehicles are considered members of the lexical set corresponding to [[Machine]], the algorithms cannot consider these items as valid candidates for the ST [[Machine]], even when traversing the MWN hierarchy.
To solve at least some of these problems, we manually inspected the 40 STs of the sentences of the gold standard and modified the automatic mapping of 11 of them; for example, we chose to map the ST [[Building]] to construction#n, and [[Machine]] to both transport#n and machine#n. This led to a significant improvement of the recall for both algorithms, and a minor improvement of the precision, as shown in Table 1. This improvement is also reflected in the second part of the task (i.e. the creation of the lexical set).
             Automatic mapping        Mapping with manual revision
           Precision  Recall  F1     Precision  Recall  F1
Baseline     0.28      0.42   0.34     0.38      0.54   0.45
LEA          0.70      0.25   0.37     0.72      0.42   0.44

Table 1: Results for sentence annotation for the Baseline Algorithm and the LEA Algorithm.

                                 Baseline   LEA
Cuocere#2-SBJ-[[Food]]             0.30     0.57
Crollare#1-SBJ-[[Building]]        0.40     0.25
Dirottare#1-OBJ-[[Vehicle]]        0.72     0.50
Prescrivere#2-OBJ-[[Drug]]         0.52     0.46
Togliere#4-OBJ-[[Garment]]         0.32     0.22

Table 2: Dice's value for lexical set annotation for the Baseline Algorithm and the LEA Algorithm.
We also measured the similarity between the 5 most populated lexical sets in the gold standard (from 6 to 15 tokens in 10 sentences) and the corresponding lexical sets built by the two algorithms (see Table 2), by calculating Dice's coefficient4 (van Rijsbergen, 1979). For example, we compare the lexical set of T-PAS#1 of crollare, [[Building]] crollare (Eng. to fall down), e.g. {casa, muro, torre}, with the lexical set for the same ST in the same T-PAS derived by the Baseline and by LEA.
Results show that neither the Baseline nor LEA reaches a high overlap. In fact, even if LEA has a high precision in identifying the members of the lexical set, its low recall limits the number of items it can detect, given the few sentences to annotate. The Baseline, on the contrary, is favored by a higher recall, but its low precision causes major differences with respect to the gold standard sets.
4 Dice's coefficient measures how similar two sets are by dividing twice the number of elements shared by the two sets by the total number of elements in both. It ranges from 1, when the two sets share all elements, to 0, when they have no element in common.
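The coefficient used for Table 2 can be computed directly from the two lexical sets; the snippet below is a sketch with invented Italian nouns, and the behavior on two empty sets is our own convention, not stated in the paper.

```python
def dice(a, b):
    """Dice's coefficient: twice the number of shared elements,
    divided by the total number of elements of the two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention for two empty sets (assumption)
    return 2 * len(a & b) / (len(a) + len(b))

gold = {"casa", "muro", "torre"}
system = {"casa", "torre", "tetto", "ponte"}
score = dice(gold, system)  # 2 shared items, sizes 3 and 4 -> 4/7
```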
For example, the Dice value for Crollare#1-SBJ-[[Building]] improves from 0.4 to 0.71 for the Baseline, and from 0.25 to 0.6 for LEA.
Another significant aspect concerns the recognition of proper names: out of the 185 tokens that are, or are part of, proper nouns (137 related to persons, locations or organizations), the Baseline correctly recognized only 10 (mainly common nouns used as proper names), while the LEA algorithm recognized only 26.
Finally, some errors are introduced in the PoS tagging and dependency parsing steps. In the former, an incorrect tag can be assigned to a word (e.g. a noun could be mis-tagged as an adjective) and hinder both algorithms, as the word would not be checked in MWN; the latter only undermines the recall of the LEA algorithm. Moreover, LEA does not yet deal with complex syntactic structures (e.g. when the verb under consideration is in an infinitive phrase that is the object of a main verb, as in "[..] e il presidente chiede agli italiani di ipotecare la casa [..]", Eng.: [..] and the president asks Italians to mortgage their houses [..]).
7 Conclusion and Further Work

In this paper we have presented an experiment on the automatic building of lexical sets for the argument positions of the Italian verbs in the T-PAS resource. The method relies on MWN to match the STs with the potential fillers of each argument position.

The experiment suggests that LEA can be used to automatically populate the lexical sets with good precision. We believe that significantly better results could be obtained with an accurate manual mapping of the STs to synsets, possibly narrowed to specific senses (e.g. mapping [[Building]] to just the third sense of construction#n).

Furthermore, recognizing proper nouns proved a difficult task, and even using named entity recognition in addition to MWN was not enough. A resource mapping these nouns to synsets in the WordNet hierarchy is therefore needed; BabelNet (Navigli and Ponzetto, 2012) could prove useful in this sense.

Further work includes extending the sentence annotation and lexical set population to all T-PASs, and comparing the same ST across different T-PASs in order to study Italian verbs' selectional preferences from the perspective of verb selectional classes (for example, all verbs that select [[Food]] as object).

References

Marco Baroni and Adam Kilgarriff. 2006. Large linguistically-processed web corpora for multiple languages. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, pages 87–90.

Patrick Hanks and James Pustejovsky. 2005. A pattern dictionary for natural language processing. Revue française de linguistique appliquée, 10(2):63–82.

Patrick Hanks. 1996. Contextual dependencies and lexical sets. The International Journal of Corpus Linguistics, 1(1).

Patrick Hanks. 2004. Corpus pattern analysis. In Proceedings of the 11th EURALEX International Congress, Lorient, France, Université de Bretagne-Sud, volume 1, pages 87–98.

Elisabetta Jezek and Francesca Frontini. 2010. From Pattern Dictionary to Patternbank. In G.M. De Schrijver, editor, A Way with Words: Recent Advances in Lexical Theory and Analysis, pages 215–239. Kampala: Menha Publishers.

Elisabetta Jezek, Bernardo Magnini, Anna Feltracco, Alessia Bianchini, and Octavian Popescu. 2014. T-PAS: a resource of corpus-derived Typed Predicate-Argument Structures for linguistic analysis and semantic processing. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Martha Palmer, Daniel Gildea, and Nianwen Xue. 2010. Semantic role labeling. Synthesis Lectures on Human Language Technologies, 3(1):1–103.

Emanuele Pianta, Luisa Bentivogli, and Christian Girardi. 2002. MultiWordNet: developing an aligned multilingual database. In Proceedings of the 1st International Conference on Global WordNet, volume 152, pages 55–63.

Emanuele Pianta, Christian Girardi, and Roberto Zanoli. 2008. The TextPro Tool Suite. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco.

Francesco Sabatini and Vittorio Coletti. 2007. Dizionario della lingua italiana 2008. Milano: Rizzoli Larousse.

C.J. van Rijsbergen. 1979. Information Retrieval. Butterworth.
A Taxonomic Classification of WordNet Polysemy Types
Abed Alhakim Freihat
Qatar Computing Research Institute
Doha, Qatar
Fausto Giunchiglia
University of Trento
Trento, Italy
Biswanath Dutta
Indian Statistical Institute (ISI)
Bangalore, India
[email protected]
[email protected]
[email protected]
Abstract

WordNet represents polysemous terms by capturing the different meanings of these terms at the lexical level, but without giving emphasis to the polysemy types such terms belong to. State-of-the-art polysemy approaches identify several polysemy types in WordNet, but they do not explain how to classify and organize them. In this paper, we present a novel approach for classifying the polysemy types which exploits taxonomic principles that, in turn, allow us to discover a set of polysemy structural patterns.

1 Introduction

Polysemy in WordNet (Miller, 1995) corresponds to various kinds of linguistic phenomena and can be grouped into various polysemy types (Falkum, 2011). Although WordNet was inspired by psycholinguistic and semantic principles (Miller et al., 1990), its conceptual dictionary puts greater emphasis on the lexical level rather than on the semantic one (Dolan, 1994). Lexicalizing polysemous terms without any further information about their polysemy type affects the usability of WordNet as a knowledge resource for semantic applications (Mandala et al., 1999).

In general, state-of-the-art approaches suggest different solutions to the polysemy problem. The most successful among these are the regular/systematic polysemy approaches, such as (Buitelaar, 1998), (Barque and Chaumartin, 2009), (Veale, 2004), (Peters, 2004). These approaches propose semantic regularity as a basis for the classification of the polysemy classes, and offer different solutions commensurate with the nature of the discovered polysemy types.

Despite the diversity and depth of the state-of-the-art solutions, no or very little attention has been given, so far, to the principles or rules used to identify polysemy types. In fact, none of these approaches can explain how to identify the polysemy types of the discovered polysemy structural patterns, or how to differentiate, for example, between homonymy and metaphoric structural patterns. Although Apresjan's semantic similarity criterion (Apresjan, 1974) can be used to account for regularity in polysemy, it cannot predict the polysemy type of the regular polysemy instances in WordNet. Our hypothesis in this paper is that identifying and differentiating between the polysemy types of the regular polysemy structural patterns requires understanding the hierarchical structure of WordNet and, thus, the criteria related to the taxonomic principles that the hierarchical structure of WordNet complies with or violates. In this paper, we show how to use two taxonomic principles as criteria for identifying the polysemy types in WordNet. Based on these principles, we introduce a semi-automatic method for discovering and identifying three polysemy types in WordNet.

The paper is organized as follows. In Section two, we discuss the problem. In Section three, we introduce the formal definitions we use. In Section four, we discuss the taxonomic principles that we use to discover three of the polysemy types in WordNet. In Section five, we give an overview of our approach. In Section six, we show how to use the taxonomic principles to identify metaphoric structural patterns. In Section seven, we demonstrate how to determine specialization polysemy structural patterns. In Section eight, we describe how to discover homonymy structural patterns. In Section nine, we explain how to handle false positives in the structural patterns. In Section ten, we present the results of our approach. In Section eleven, we conclude the paper and outline our future work.
2 Problem Statement

WordNet is a machine-readable online lexical database for the English language. Based on psycholinguistic principles, WordNet has been developed since 1985 by linguists and psycholinguists as a conceptual dictionary rather than an alphabetic one (Miller et al., 1990). Since that time, several versions of WordNet have been released. In this paper, we are concerned with WordNet 2.1, which contains 147,257 words, 117,597 synsets and 207,019 word-sense pairs. The number of polysemous words in WordNet is 27,006, of which 15,776 are nouns.

In this paper, we deal with polysemous nouns at the concept level only; we do not consider polysemy at the instance level. After removing the polysemous nouns that refer to proper names, 14,530 polysemous nouns remain. WordNet does not differentiate between the types of polysemous terms, and it does not contain any information, in terms of polysemy relations, that could be used to determine the polysemy type holding between the synsets of a polysemous term.

The researchers who tackled the polysemy problem in WordNet gave different descriptions of the polysemy types in WordNet. For example, polysemy reduction approaches (Edmonds and Agirre, 2008), (Mihalcea R., 2001), (Gonzalo J., 2000) differentiate between contrastive polysemy and complementary polysemy. Regular polysemy approaches such as (Barque and Chaumartin, 2009), (Veale, 2004), (Peters, 2004), (Freihat et al., 2013), (Lohk et al., 2014) give a more refined classification of the polysemy types into metonymy, metaphoric polysemy, specialization polysemy, and homonymy. In one of our recent papers, compound noun polysemy was introduced as a new polysemy type beside the former four polysemy types in WordNet (Freihat et al., 2015).

So far, no polysemy reduction approach has introduced a mechanism for classifying the polysemy types into contrastive and complementary. Instead, these approaches adopt semantic and probabilistic rules to discover redundant and/or very fine-grained senses. On the other hand, the regular polysemy approaches embrace a clear definition for classifying polysemous terms into regular and non-regular polysemy (Apresjan, 1974). Although the definition of regular polysemy in these approaches is useful to distinguish between regular and non-regular polysemy, these approaches do not reveal the principles or the criteria used to classify polysemous terms into polysemy types.

In this paper, we explain how to use the exclusiveness property and the collective exhaustiveness property (Bailey, 1994), (Marradi, 1990) for identifying the following polysemy types.
1 Metaphoric polysemy: Refers to the polysemy instances in which a term has literal and figurative meanings (Evans and Zinken, 2006). In the following example, the first meaning of the term fox is the literal meaning and the second is the figurative one.
#1 fox: alert carnivorous mammal.
#2 dodger, fox, slyboots: a shifty deceptive person.

2 Specialization polysemy: A type of related polysemy which denotes a hierarchical relation between the meanings of a polysemous term. In the case of abstract meanings, we say that a meaning A is a more general meaning of a meaning B. We may also use the taxonomic notions type and subtype instead of more general meaning and more specific meaning, respectively. For example, we say that the first meaning of turtledove is a subtype of the second meaning.
#1 australian turtledove, turtledove: small Australian dove.
#2 turtledove: any of several Old World wild doves.

3 Homonymy: Refers to the contrastive polysemy instances, where meanings are not related. Consider for example the following polysemy instance of the term bank.
#1 depository financial institution, bank: a financial institution.
#2 bank: sloping land (especially the slope beside a body of water).
3 Approach Notations
We begin with the basic notations. Lemma is the basic lexical unit in WordNet and refers to the base form of a word or a collocation. Based on this definition, we define a natural language term, or simply a term, as a lemma that belongs to a grammatical category, i.e., noun, verb, adjective or adverb.

Definition 1 (Term).
A term T is a pair ⟨Lemma, Cat⟩, where
a) Lemma is the term's lemma;
b) Cat is the grammatical category of the term.

Synset is the fundamental structure in WordNet, which we define as follows.

Definition 2 (WordNet synset).
A synset S is defined as ⟨Cat, Terms, Gloss, Relations⟩, where
a) Cat is the grammatical category of the synset;
b) Terms is an ordered list of synonymous terms that have the same grammatical category Cat;
c) Gloss is a text that describes the synset;
d) Relations is a set of semantic relations that hold between synsets.

Now, we move to the hierarchical structure of WordNet. WordNet uses the relation direct hypernym to organize the hierarchical relations between synsets. This relation denotes the superordinate relationship between synsets. For example, the relation direct hypernym holds between vehicle and wheeled vehicle, where vehicle is the hypernym of wheeled vehicle. The direct hypernym relation is transitive. In the following, we generalize the direct hypernym relation to reflect the transitivity property, where we use the notion hypernym instead of direct hypernym.

Definition 3 (hypernym relation).
For two synsets s and s′, s is a hypernym of s′ if the following holds: s is a direct hypernym of s′, or there exists a synset s″ such that s is a direct hypernym of s″ and s″ is a hypernym of s′.

For example, vehicle is a hypernym of car, because vehicle is a direct hypernym of wheeled vehicle and wheeled vehicle is a direct hypernym of car. We use the following symbols to denote the direct hypernym/hypernym relations:
a) s ≺ s′ if s is a direct hypernym of s′;
b) s ≺* s′ if s is a hypernym of s′.

Using the direct hypernym relation, WordNet organizes noun synsets in a hierarchy that we define as follows.

Definition 4 (WordNet hierarchy).
Let S = {s1, s2, ..., sn} be the set of noun synsets in WordNet. The WordNet hierarchy is defined as a connected and rooted digraph ⟨S, E⟩, where
a) entity ∈ S is the single root of the hierarchy;
b) E ⊆ S × S;
c) (s1, s2) ∈ E if s1 ≺ s2;
d) for any synset s ≠ entity, there exists at least one synset s′ such that s′ ≺ s.

In this definition, point (a) defines the single root of the hierarchy and point (d) defines the connectivity property of the hierarchy.

We move now to the semantics of WordNet. We define the subset of the semantics of the WordNet hierarchy that is relevant for our approach; a full definition of the WordNet semantics is described in approaches such as (Alvarez, 2000), (Rudolph, 2011), (Breaux et al., 2009). We define the semantics of WordNet using an interpretation I = ⟨Δ^I, f⟩, where Δ^I is a non-empty set (the domain of interpretation) and f is an interpretation function.

Definition 5 (Semantics of WordNet Hierarchy).
Let WH = ⟨S, E⟩ be the WordNet hierarchy. We define an interpretation of WH, I = ⟨Δ^I, f⟩, as follows:
a) entity^I = Δ^I;
b) ⊥^I = ∅;
c) ∀s ∈ S: s^I ⊂ Δ^I;
d) (s1 ⊓ s2)^I = s1^I ∩ s2^I;
e) (s1 ⊔ s2)^I = s1^I ∪ s2^I;
f) s1 ⊑ s2 if s1^I ⊆ s2^I.

In points a) and b), we define the universal and empty concepts. Point c) states that Δ^I is closed under the interpretation function f. In d) and e), we define the conjunction and disjunction operations. In f), we define the subsumption relation.

We present now the polysemy notations. A term is polysemous if it is found in the terms of more than one synset. A synset is polysemous if it contains at least one polysemous term. In the following, we define polysemous terms.

Definition 6 (polysemous term).
A term t = ⟨Lemma, Cat⟩ is polysemous if there is a term t′ and two synsets s and s′, s ≠ s′, such that
a) t ∈ s.Terms and t′ ∈ s′.Terms;
b) t.Lemma = t′.Lemma;
c) t.Cat = t′.Cat.

In the following, we define polysemous synsets.
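Definition 6 can be operationalized directly: a (lemma, category) pair is polysemous when it occurs in more than one synset. A minimal sketch over a toy synset list; the tuple layout and identifiers are illustrative assumptions, not the paper's data model.

```python
from collections import defaultdict

def polysemous_terms(synsets):
    """Definition 6, operationally: a (lemma, cat) pair is polysemous
    when it occurs in the Terms of at least two distinct synsets."""
    occurrences = defaultdict(set)
    for syn_id, cat, terms in synsets:
        for lemma in terms:
            occurrences[(lemma, cat)].add(syn_id)
    return {t: ids for t, ids in occurrences.items() if len(ids) > 1}

# Toy data mirroring the bazaar example below.
synsets = [
    ("bazaar#1", "n", ["bazaar", "bazar"]),
    ("bazaar#2", "n", ["bazaar", "bazar"]),
    ("bazaar#3", "n", ["bazaar", "fair"]),
    ("fair#2", "n", ["fair"]),
]
poly = polysemous_terms(synsets)
```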
Definition 7 (polysemous synset).
A synset s is polysemous if any of its terms is a polysemous term.

It is possible for two polysemous synsets to share more than one term. Two polysemous synsets and their shared terms constitute a polysemy instance. In the following, we define polysemy instances.

Definition 8 (polysemy instance).
A polysemy instance is a triple [{T}, s1, s2], where s1, s2 are two polysemous synsets that have the terms {T} in common.

For example, the term bazaar belongs to the following polysemy instances: [{bazaar, bazar}, #1, #2], [{bazaar}, #1, #3], and [{bazaar}, #2, #3].
#1 bazaar, bazar: a shop where a variety of goods are sold.
#2 bazaar, bazar: a street of small shops.
#3 bazaar, fair: a sale of miscellany; often for charity.

We move now to the last part of our definitions. We exploit the structural properties of the WordNet hierarchy to identify the polysemy types of the polysemy instances in WordNet. According to the connectivity property of the WordNet hierarchy in Definition 4, any two synsets in WordNet have at least one common subsumer, which we define as follows.

Definition 9 (common subsumer).
Let s1, s2, and s be synsets in WordNet. The synset s is a common subsumer of s1 and s2 if s ≺* s1 and s ≺* s2.

The WordNet hierarchy is a DAG (directed acyclic graph). This implies that it is possible for two synsets to have more than one common subsumer. We define the least common subsumer as the common subsumer with the least height. In the following, we define structural patterns.

Definition 10 (structural pattern).
A structural pattern of a polysemy instance I = [{T}, s1, s2] is a triple P = ⟨r, p1, p2⟩, where
a) r is the least common subsumer of s1 and s2;
b) r ≺ p1 and r ≺ p2;
c) p1 ≺* s1 and p2 ≺* s2.

We call r the pattern root and p1, p2 the pattern hyponyms. For example, the structural pattern of the polysemy instance [{bazaar, bazar}, s1, s2] is ⟨mercantile establishment, marketplace, shop⟩, as shown in Figure 1, where mercantile establishment is the pattern root and marketplace and shop are the pattern hyponyms.

Figure 1: Example of a structural pattern

A special structural pattern is the common parent structural pattern, as illustrated in Figure 2. A structural pattern P = ⟨r, p1, p2⟩ of a polysemy instance I = [{T}, s1, s2] is a common parent structural pattern if p1 = s1 or p2 = s2.

Figure 2: Common parent structural pattern

4 Taxonomic principles in WordNet

The WordNet hierarchy represents a classification hierarchy whose nodes are synsets. Classification hierarchies should fulfill, among other requirements, the exclusiveness property and the exhaustiveness property. We begin with the exclusiveness property.

Definition 11 (Exclusiveness property).
Two synsets s1, s2 ∈ S fulfill the exclusiveness property if s1^I ⊓ s2^I = ⊥^I.

For example, abstract entity and physical entity fulfill the exclusiveness property. On the other hand, expert and scientist do not fulfill this property, because expert^I ⊓ scientist^I ≠ ⊥^I.
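Definitions 9 and 10 can be sketched as a search over the hypernym hierarchy: walk each synset's path to the root, take the lowest shared ancestor as the pattern root, and read the pattern hyponyms off the first step below it. The hierarchy fragment and function names below are illustrative assumptions; for simplicity the sketch treats the hierarchy as a tree, whereas the real structure is a DAG.

```python
# direct hypernym edges, child -> parent (tree sketch; the real
# hierarchy is a DAG, which would need set intersection instead)
PARENT = {
    "marketplace": "mercantile establishment",
    "shop": "mercantile establishment",
    "mercantile establishment": "establishment",
    "establishment": "entity",
}

def path_to_root(s):
    """The synset followed by all of its hypernyms up to the root."""
    path = [s]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def structural_pattern(s1, s2):
    """Return (r, p1, p2) as in Definition 10: r is the least common
    subsumer of s1 and s2, and p1, p2 are r's hyponyms on the paths
    down to s1 and s2 (falling back to the synset itself in the
    common parent case)."""
    path1, path2 = path_to_root(s1), path_to_root(s2)
    r = next(x for x in path1 if x in path2)  # lowest shared ancestor
    i, j = path1.index(r), path2.index(r)
    p1 = path1[i - 1] if i > 0 else s1
    p2 = path2[j - 1] if j > 0 else s2
    return r, p1, p2
```

On the toy fragment, the bazaar instance yields exactly the pattern from Figure 1, and passing a synset together with its own direct hypernym yields a common parent pattern (p2 = s2).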
The exclusiveness property means that any two sibling nodes n_i, n_j in the hierarchy are disjoint, i.e., n_i^I ⊄ n_j^I and n_j^I ⊄ n_i^I. Analyzing the structural patterns in WordNet shows that the exclusiveness property is not always guaranteed in WordNet. For example, the pattern ⟨person, expert, scientist⟩ shown in Figure 3 does not fulfill this property, because forcing it would prevent a scientist from being an expert, or an expert from being a scientist.

Figure 3: An example of exclusiveness property violation

We are concerned with the cases where the synsets s1 and s2 are not disjoint and each of them subsumes a synset of the same polysemous term, such as the term statistician in Figure 3. The fact that the two synsets of the polysemous term are not disjoint implies that the polysemy type of these two synsets cannot be homonymy, metonymy, or metaphoric. This can be explained as follows. The polysemy type homonymy implies that the two synsets are unrelated, whereas the non-disjointness of the two synsets indicates a relation between them. Metonymy, on the other hand, means that one synset is a part of the other synset.

Now, we explain the exhaustiveness property.

Definition 12 (Collective Exhaustiveness).
Two synsets s1, s2 ∈ S are collectively exhaustive if it is possible to find a synset s such that s^I = s1^I ⊔ s2^I and s1, s2 fulfill the exclusiveness property.

For example, abstract entity and physical entity fulfill the collective exhaustiveness property because entity^I = abstract entity^I ⊔ physical entity^I. On the other hand, worker and female in the pattern ⟨person, worker, female⟩ do not fulfill this property, because worker corresponds to a role and female to a concept. This is because person is a direct hypernym of the concept organism and the role causal agent.

5 Approach Overview

We exclude the structural patterns whose pattern root resides in the first and second levels of the WordNet hierarchy. Accordingly, any structural pattern whose root belongs to the synsets {entity, abstract entity, abstraction, physical entity, physical object} was automatically excluded. Our hypothesis is that the pattern hyponyms in these structural patterns in general fulfill the exclusiveness and the exhaustiveness properties. These patterns are the subject of our current research on discovering metonymy structural patterns. On the other hand, the exclusiveness and exhaustiveness properties are not guaranteed for structural patterns whose roots reside in the third level and beyond. The input of the algorithm is the taxonomic structure of WordNet, starting from level 3, after removing lexical redundancy in compound nouns (Freihat et al., 2015). The output consists of three lists that contain specialization polysemy, metaphoric polysemy and homonymy instances. The first step of our algorithm is automatic, while the other two are manual.

S1. Structural pattern discovery: The input of this step is the current structure of WordNet after removing lexical redundancy. The algorithm returns structural patterns associated with their corresponding polysemy instances.

S2. Structural pattern classification: In this step, we manually classify the structural patterns returned in the previous step. The output consists of four lists of patterns associated with their polysemy instances:
Specialization polysemy patterns: the patterns whose corresponding instances are specialization polysemy candidates.
Metaphoric patterns: the patterns whose corresponding instances are metaphoric candidates.
Homonymy patterns: the patterns whose corresponding instances are homonymy candidates.
Singleton patterns: the patterns that have one polysemy instance only and thus cannot be considered regular.

S3. Identifying false positives: In this step, we manually process the polysemy instances in the four lists from the previous step. Our task is to decide the polysemy type for the instances in the singleton patterns list and to remove false positives from the other three lists.

6 Metaphoric Structural Patterns

Identifying metaphoric patterns is based on the distinction between the literal meaning and the figurative meaning. Our idea is that it is not possible for a literal and a figurative meaning to be collectively exhaustive. Violating the exhaustiveness property in a structural pattern ⟨r, p1, p2⟩ may be a result of the following:
a) p1 and p2 belong to different types and cannot be subsumed by the pattern root r, or
b) p1 ⊂ p2 or p2 ⊂ p1.
For example, female and worker cannot be subsumed by person in the pattern ⟨person, female, worker⟩, as shown in Figure 4. On the other hand, it is correct that person and animal are organisms in the structural pattern ⟨organism, animal, person⟩, but it is clear that person^I ⊂ animal^I.

Figure 4: Example of a metaphoric polysemy instance

In the following, we define metaphoric structural patterns.

Definition 13 (Metaphoric structural pattern).
A pattern p = ⟨r, p1, p2⟩ is metaphoric if p1 and p2 do not fulfill the collective exhaustiveness property.

In the following, we give examples of identified metaphoric patterns. The pattern ⟨organism, animal, person⟩ is metaphoric: although both synsets share the same hypernym organism, they are not collectively exhaustive, as explained above. The polysemy instances that belong to this pattern are 326 instances. Consider for example the following instance.
#1 snake, serpent, ophidian: limbless scaly elongate reptile.
#2 snake, snake in the grass: a deceitful or treacherous person.
Another example is the pattern ⟨attribute, property, trait⟩. Although both synsets share the same hypernym attribute, they are not collectively exhaustive, because trait^I is a special case of property^I (trait^I = property^I ⊓ person^I). The polysemy instances that belong to this pattern are 111 instances. Consider for example the following instance.
#1 softness: the property of giving little resistance to pressure and being easily cut or molded.
#2 gentleness, softness, mildness: acting in a manner that is gentle and mild and even-tempered.

7 Specialization Polysemy Structural Patterns

We use the exclusiveness property and the pattern root of a structural pattern to discover specialization polysemy candidates indirectly. The relation between the synsets in specialization polysemy is hierarchical. The hierarchical relation between the synsets in a specialization polysemy instance indicates that the exclusiveness property does not hold between the synsets, and thus between the structural pattern hyponyms. We define specialization polysemy patterns as follows.

Definition 14 (specialization polysemy structural pattern).
A pattern p = ⟨r, p1, p2⟩ is a specialization polysemy pattern if a) and b) hold:
a) p1 and p2 do not fulfill the exclusiveness property;
b) p1 and p2 fulfill the exhaustiveness property.

In the following, we give examples of identified specialization polysemy patterns. All instances that belong to common parent structural patterns are classified as specialization polysemy instances; 2879 polysemy instances belong to this pattern. Consider for example the following instances.
#1 capital, working capital: assets available for use in the production of further assets.
#2 capital: wealth in the form of money or property owned by a person or business and human resources of economic value.
#1 red fox: ... flowers.
#2 red fox, Vulpes fulva: New World fox; often considered the same species as the Old World fox.
Another example is the pattern ⟨vertebrate, bird, mammal⟩. The polysemy instances that belong to this pattern are 13 instances. Consider for example the following.
#3 griffon, wire-haired pointing griffon: breed of medium-sized long-headed dogs.
#4 griffon vulture, griffon, Gyps fulvus: large vulture of southern Europe and northern Africa.
Another example is the pattern ⟨act, action, activity⟩. The polysemy instances that belong to this pattern are 406 instances. Consider for example the following.
#1 employment, work: the occupation for which you are paid.
#2 employment, engagement: the act of giving someone a job.
Another example is the pattern ⟨animal, invertebrate, larva⟩. The polysemy instances that belong to this pattern are 17 instances. Consider for example the following.
#1 ailanthus silkworm, Samia cynthia: large green silkworm of the cynthia moth.
#2 cynthia moth, Samia cynthia, Samia walkeri: ...
9 False Positives Identification
In this section, we describe the third step of our
approach. Our task here is to process the four lists
returned at the end of the pattern classification
and remove false positives. These lists are the
metaphoric polysemy list, the specialization
polysemy list, the homonymy list, and a list of
non regular (singleton patterns) list. This task can
only be performed manually due to the implicit
and missing information in synset glosses. Our
procedure for determining the polysemy class of
a polysemy instance is based on the three definitions in the previous section, where we process
the polysemy instances instance by instance to
determine the the relation between the synsets of
the polysemy instances.
If a polysemy instance does not belong to the
polysemy type it was assigned to (false positive
instance), we assign it to its corresponding polysemy type.
In the following, we give examples for false
positives. The common parent structural pattern
which was automatically assigned to the specialization polysemy type (step 1 in Section 5)
contains 180 false positive polysemy instances, 98
of them were identified as homonymy instances.
One example is:
#1
cardholder: a person who holds a
credit card or debit card.
#2 cardholder: a player who holds a card
or cards in a card game.
Metaphoric false positives (82 instances) were
also identified in the common parent class. Consider for example the following instance.
#1 game plan: (figurative) a carefully
large Asiatic moth introduced
into the United States; larvae feed on
the ailanthus.
8
Homonymy Structural Patterns
We define homonymy patterns as follows.
Definition 15 (Homonymy structural pattern).
A pattern p “ xr, p1 , p2 y is homonymy pattern if
the following condition hold.
a) p1 and p2 fulfill the exclusiveness property;
b) p1 and p2 fulfill the exhaustiveness property;
c) There is no relation between p1 and p2 .
In the following we give examples for identified homonymy patterns.
The pattern
xorganism, person, planty.
The polysemy
instances that belong to this pattern are 40
instances. Consider for example the following
instance.
#1
spinster, old maid: an elderly
unmarried woman.
#2
zinnia, old maid, old maid flower:
any of various plants of the genus
Zinnia.
Another
example
is
the
pattern
xorganism, animal, planty.
The polysemy
instances that belong to this pattern are 41 instances. Consider for example the following.
#1
red fox, Celosia argentea: weedy
thought out strategy for achieving an
objective in war.
annual with spikes of silver-white
111
#2
game plan:
agreement of the evaluators with our approach was
on 96.5% of the instances. In the following Table
3, a refers to our approach, e1 , e2 refer to evaluator1 and evaluator 2 respectively.
(sports) a plan for
achieving an objective in some sport.
Another
example
is
the
pattern
xorganism, animal, persony which was assigned to the metaphoric polysemy type contains
326 polysemy instances, 74 of them were identified as homonyms such as the following instance.
#2
Minnesotan, Gopher: a native or
resident of Minnesota.
#3 ground squirrel, gopher, spermophile:
e1 “ e2 “ a
a “ e1
a “ e2
3665 (96.5%)
3621 (95.3%)
3600 (94.8%)
Table 3: Evaluation of the polysemy classification
any of various terrestrial burrowing
rodents of Old and New Worlds.
10
11 Conclusion and future Work
In this paper, we have presented how to use two
taxonomic principles for classifying the polysemy
types in WordNet. We have demonstrated the usefulness of our approach on classifying three polysemy types, namely, specialization, metaphoric
and homonymy. In this approach, we were
able to discover all specialization polysemy structural patterns and subsets of the metaphoric and
metonymy structural patterns. We aim to continue
our work to study the metonymy patterns in the
upper level of WordNet hierarchy, where we generalize our structural pattern definition as follows.
Results and Evaluation
The number of polysemy instances computed by
the polysemy instances discovery algorithm is
41306. We excluded 28318 instances because the
pattern roots of these instances reside in the first
and the second level of the hierarchy as per the
approach discussed in Section 5.The remaining
number of polysemy instances is 12988. These
instances are divided in two groups as follow.
12988 of these instances belong to 1028 regular
type compatible patterns and 1569 instances belong to single tone patterns. The classification of
the pasterns and the result of the false positive removing is shown in the following tables.
#Type
Specialization
Metaphoric
Homonymy
Total
#patterns
823
134
71
1028
Definition 16 (generalized structural pattern).
A structural pattern of polysemy instance I =
r tT u, s1 , s2 s is a triple P “ xr, p1 , p2 y, where
a) r is the least common subsumer of s1 and s2 ;
b) r ă˚ p1 and r ă˚ p2 ;
c) p1 ă˚ s1 and p2 ă˚ s2 .
Our hypothesis is that in case of metonymy structural patterns: the nodes p1 and p2 fulfill the exclusiveness and the exhaustiveness properties and
there is a part of relation between p1 and p2 . The
conditions for metaphoric and homonymy structural patterns obtained by adapting the new structural definition remain the same as explained in
this paper.
#instances
9902
1697
1389
12988
Table 1: Classification of the regular structural
patterns
In Table 2, we show the results removing false postive instances, where we see that the average false
positives is about 17%.
#Poly Type
Specialization
Metaphoric
Homonymy
Total
#Instances
9902
1697
1389
12988
Acknowledgment
The research leading to these results has received partially funding from the European Community’s Seventh Framework Program under
grant agreement n.
600854, Smart Society
(http://www.smart-society-project.eu/).
#False Positives
1740
175
295
2210
Table 2: False Positives in Pattern Classification
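The decision procedure implied by Definitions 13, 14, and 15 can be sketched as a small classifier. This is our illustrative reading, not code from the paper: the three boolean tests (exclusiveness, collective exhaustiveness, and whether any relation holds between p1 and p2) are assumed to be computed elsewhere over the WordNet hierarchy.

```python
# Sketch (not the authors' implementation): classify a structural pattern
# <r, p1, p2> from the exclusiveness / exhaustiveness / relatedness checks,
# following Definitions 13-15. The three flags are assumed inputs.

def classify_pattern(exclusive: bool, exhaustive: bool, related: bool) -> str:
    """Return the polysemy type of a structural pattern."""
    if not exhaustive:
        # Definition 13: p1 and p2 are not collectively exhaustive.
        return "metaphoric"
    if not exclusive:
        # Definition 14: exhaustive but not exclusive, i.e. a hierarchical
        # relation holds between the synsets.
        return "specialization"
    if not related:
        # Definition 15: exclusive, exhaustive, and unrelated.
        return "homonymy"
    # Exclusive, exhaustive, and related: a candidate for the metonymy
    # patterns studied in the generalized setting of Definition 16.
    return "unclassified"

print(classify_pattern(exclusive=True, exhaustive=True, related=False))
```

The final branch mirrors the paper's hypothesis that metonymy patterns are exclusive, exhaustive, and connected by a part-of relation.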
References
Jordi Alvarez. 2000. Integrating the WordNet ontology into a description logic system.

Ju. Apresjan. 1974. Regular polysemy. Linguistics, pages 5–32.

Kenneth D. Bailey. 1994. Typologies and Taxonomies: An Introduction to Classification Techniques. Sage Publications, Thousand Oaks, CA.

Lucie Barque and François-Régis Chaumartin. 2009. Regular polysemy in WordNet. JLCL, 24(2):5–18.

Travis D. Breaux, Annie I. Anton, and Jon Doyle. 2009. Semantic parameterization: A process for modeling domain descriptions. ACM Transactions on Software Engineering and Methodology, 18(2).

Paul Buitelaar. 1998. CoreLex: Systematic polysemy and underspecification. PhD thesis, Brandeis University, Department of Computer Science.

W. B. Dolan. 1994. Word sense ambiguation: clustering related senses. In Proceedings of COLING-94, pages 712–716.

Philip Edmonds and Eneko Agirre. 2008. Word sense disambiguation. Scholarpedia, 3.

Vyvyan Evans and Jörg Zinken. 2006. Figurative language in a modern theory of meaning construction: A lexical concepts and cognitive models approach.

Ingrid Lossius Falkum. 2011. The semantics and pragmatics of polysemy: A relevance-theoretic account. PhD thesis, University College London.

Abed Alhakim Freihat, Fausto Giunchiglia, and Biswanath Dutta. 2013. Regular polysemy in WordNet and pattern based approach. International Journal On Advances in Intelligent Systems, 6(3&4), January.

Abed Alhakim Freihat, Biswanath Dutta, and Fausto Giunchiglia. 2015. Compound noun polysemy and sense enumeration in WordNet. In Proceedings of the 7th International Conference on Information, Process, and Knowledge Management (eKNOW), pages 166–171.

J. Gonzalo, I. Chugur, and F. Verdejo. 2000. Sense clusters for information retrieval: Evidence from SemCor and the EuroWordNet InterLingual Index. In ACL-2000 Workshop on Word Senses and Multi-linguality, pages 10–18. Association for Computational Linguistics.

Ahti Lohk, Kaarel Allik, Heili Orav, and Leo Võhandu. 2014. Dense components in the structure of WordNet. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

Rila Mandala, Takenobu Tokunaga, and Hozumi Tanaka. 1999. Complementing WordNet with Roget's and corpus-based thesauri for information retrieval. In EACL, pages 94–101. Association for Computational Linguistics.

Alberto Marradi. 1990. Classification, typology, taxonomy. Quality & Quantity: International Journal of Methodology, 24(2):129–157.

R. Mihalcea and D. I. Moldovan. 2001. eZ.WordNet: principles for automatic generation of a coarse grained WordNet. In FLAIRS Conference, pages 454–458.

G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4):235–244.

George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41, November.

Wim Peters. 2004. Detection and characterization of figurative language use in WordNet. PhD thesis, Natural Language Processing Group, Department of Computer Science, University of Sheffield.

Sebastian Rudolph. 2011. Foundations of description logics. In A. Polleres, C. d'Amato, M. Arenas, et al., editors, Reasoning Web. Semantic Technologies for the Web of Data – 7th International Summer School 2011, volume 6848 of LNCS, pages 76–136. Springer.

Tony Veale. 2004. Pathways to creativity in lexical ontologies. In Proceedings of the 2nd Global WordNet Conference.
Some Strategies for the Improvement of a Spanish WordNet

Matías Herrera, Javier González, Luis Chiruzzo, Dina Wonsever
Facultad de Ingeniería
Universidad de la República

Abstract

Although there are currently several versions of Princeton WordNet for different languages, the lack of development of some of these versions does not make it possible to use them in different Natural Language Processing applications. Such is the case of the Spanish WordNet contained in the Multilingual Central Repository (MCR), which we tried, unsuccessfully, to incorporate into an anaphora resolution application and into search term expansion. In this situation, different strategies to improve the coverage of the MCR Spanish WordNet were put forward and tested, obtaining encouraging results. A specific process was conducted to increase the number of adverbs, and a few simple processes were applied which made it possible to increase, at a very low cost, the number of terms in the Spanish WordNet. Finally, a more complex method based on distributional semantics was proposed, using the relations between English WordNet synsets, also returning positive results.

1 Introduction

The Multilingual Central Repository (Agirre, Laparra, Rigau, & Donostia, 2012) follows the model proposed by the EuroWordNet project. EuroWordNet (Vossen, 1998) is a multilingual lexical database with wordnets for several European languages, structured in the same way as Princeton's WordNet. The MCR comprises five different languages: English, Spanish, Catalan, Basque and Galician. The Inter-Lingual-Index (ILI) allows us to link the words in one language with their equivalent translations in any of the other languages, thanks to the automatically generated mappings among WordNet versions. For example, the ILI identifier "ili-30-02084071-n" corresponds both to the English synset "eng-30-02084071-n" with lemmas "dog, domestic dog", and to the Spanish synset "spa-30-02084071-n" with lemmas "can, perro". In addition, it corresponds to the Basque synset "eus-30-02084071-n" with lemmas "zakur, or, txakur", to the synset "cat-30-02084071-n" for Catalan with lemmas "ca, canis familiaris", and also to "glg-30-02084071-n" for Galician with lemmas "can, Canis familiaris". The current ILI version corresponds to WordNet 3.0. All identifiers stem from the original synset in English. In the previous example there is a translation for each one of the languages; however, this is not the most common scenario. The MCR is incomplete, at least for the Spanish version. This document presents several strategies to extend the coverage of the Spanish version. An in-depth analysis of the different problems of the Spanish MCR is presented in Section 2, and Section 3 describes several processes to enhance it. Section 4 presents the evaluations carried out for the strategies proposed, and Section 5 presents final observations on the general results and the possibility of launching an enhanced version online.
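The ILI linkage just described can be pictured as a plain lookup table. The sketch below is illustrative only (it is not the MCR's actual API); the data is the "dog" example from the text, and the helper function name is our own.

```python
# Illustrative sketch of an ILI record: one interlingual identifier tied to
# language-specific synsets that share the same offset. Not the real MCR API.

ILI = {
    "ili-30-02084071-n": {
        "eng": ("eng-30-02084071-n", ["dog", "domestic dog"]),
        "spa": ("spa-30-02084071-n", ["can", "perro"]),
        "eus": ("eus-30-02084071-n", ["zakur", "or", "txakur"]),
        "cat": ("cat-30-02084071-n", ["ca", "canis familiaris"]),
        "glg": ("glg-30-02084071-n", ["can", "Canis familiaris"]),
    }
}

def translations(ili_id: str, source: str, target: str) -> list:
    """Map a lemma list from one language to another through the ILI."""
    record = ILI[ili_id]
    if source not in record or target not in record:
        return []  # the target synset may be empty or missing, as in the MCR
    return record[target][1]

print(translations("ili-30-02084071-n", "eng", "spa"))  # ['can', 'perro']
```

Returning an empty list for a missing language mirrors the incompleteness problem discussed next: the identifier exists, but the Spanish side may have no variants.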
2 Problems on the MCR Spanish WordNet

2.1 Deficiencies of the current Spanish MCR: first evaluation

For the purpose of finding the deficiencies of the MCR WordNet, our initial approach was to use it and test it out. Version 3.0 was used, since this is the latest version currently available. The web interface provided by the MCR (Benítez et al., 1998) was used for this stage. The MCR was requested to provide the results both in English and Spanish for all the searches made, in order to be able to compare them. Below we provide some examples from this initial informal evaluation; the following section presents a quantitative evaluation.

• Lack of common words. Some common words such as "cargador" and the adverb "no" were found to be missing.

• Empty synsets. Some Spanish synsets were available through the web interface but they were empty. For example, the synset "spa-30-00396699-r" did not contain any variants, but its English equivalent "eng-30-00396699-r" did. This shows that there were no Spanish translations in the MCR for the lemmas "meagerly", "sparingly", "slenderly" and "meagrely". When searching for the adverb "escasamente", which is a possible translation for "sparingly", it was not found.

• Very few entries for the grammatical category of adverbs. Once evaluated, it was concluded that the adverb coverage of the MCR was very low. We have already mentioned the example of the adverb "no". It was also found that the adverbs "recién" (just) and "rápidamente" (quickly) were not present, although these are very commonly used in Spanish.

• Lack of glosses or phrases that show the usage of the terms in Spanish. No Spanish gloss was found for many of the words searched. For example, we found that the results for the noun "cuchillo", "spa-30-03623556-n" and "spa-30-03624134-n", did not include a Spanish gloss for these synsets. Additionally, a generalized lack of phrases that illustrate the use of the lemmas and synsets was found.

2.2 Deficiencies in the current MCR: evaluation on a corpus

Several MCR WordNet coverage measures were applied taking the Corin corpus (Grassi, Malcuori, Couto, Prada, & Wonsever, 2001) as a baseline. Corin is a synchronous corpus that covers the years 1996-2000 and contains literary texts by Uruguayan authors (essays and fiction) and journalistic texts published in Montevideo (articles and interviews). Several other language processing tools were used in addition to Corin, such as Freeling (Carreras, Chao, Padró, & Padró, 2004) and the dictionaries Apertium (Armentano-Oller et al., n.d.) and Wiktionary (Wikimedia Foundation, 2008). The following aspects were studied:

1. The percentage of available lemmas in the Spanish version of WordNet.
2. The percentage of corpus lemmas for which there was a translation available.
3. The percentage of these lemmas that were not present in the Spanish MCR but did have an available translation in the English MCR.

The results obtained are presented as follows.

2.2.1 Percentage of Corin lemmas available in the Spanish version of WordNet

POS   | Lemmas not found | Lemmas found | Processed lemmas
N     |      69.29% 2780 |  30.71% 1232 |             4012
A     |      51.00%  840 |  49.00%  807 |             1647
V     |      75.35% 1235 |  24.65%  404 |             1639
R     |      32.79%  121 |  67.21%  248 |              369
Total |      48.70% 3734 |  51.30% 3933 |             7667

The previous chart shows the total number of lemmas processed, their parts of speech, and how many of them were found in WordNet. We can see that verbs are the grammatical category with the lowest coverage, at about 25%. The remaining POS show a higher coverage, with adverbs showing the highest one.

2.2.2 Percentage of corpus lemmas for which there was a translation available

POS   | Untranslated | Translated  | Lemmas
N     | 12.04%   483 | 87.96% 3529 |   4012
A     | 21.07%   347 | 78.93% 1300 |   1647
V     | 18.00%   295 | 82.00% 1344 |   1639
R     | 16.26%    60 | 83.74%  309 |    369
Total | 15.46%  1185 | 84.54% 6482 |   7667

Using the two mentioned dictionaries we were able to cover a large percentage of the lemmas present in the corpus. Even so, the results do not ensure the quality of the translations. Therefore, it is necessary to improve the resources used for this purpose.
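Coverage figures of the kind shown in the tables above can be computed with a simple per-POS tally. The sketch below is our own illustration, not the authors' tooling; `in_wordnet` is a stand-in for the real MCR lookup, and the toy data is invented.

```python
# A small sketch of per-POS coverage counting over (lemma, POS) pairs.
# `in_wordnet` stands in for a real query against the Spanish MCR.

from collections import Counter

def coverage(lemmas, in_wordnet):
    """Return {pos: (found, processed, percent_found)}."""
    found, processed = Counter(), Counter()
    for lemma, pos in lemmas:
        processed[pos] += 1
        if in_wordnet(lemma, pos):
            found[pos] += 1
    return {pos: (found[pos], processed[pos],
                  100.0 * found[pos] / processed[pos])
            for pos in processed}

# Toy data: three nouns, of which only "perro" is covered.
stats = coverage([("perro", "N"), ("cargador", "N"), ("cuchillo", "N")],
                 lambda lemma, pos: lemma == "perro")
print(stats["N"])  # (1, 3, 33.33...)
```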
2.2.3 Lemmas not found in the Spanish MCR but with a translation available in the English MCR

Out of the 6482 lemmas translated into English, we focused on those found in the English MCR, so that it was possible to compare the lemmas which were not found in the Spanish MCR but did have a translation available in the English MCR.

POS   | Not in Spanish MCR | In Spanish MCR | Total
N     |      43.40%   1349 |   56.60%  1759 |  3108
A     |      46.37%    492 |   53.63%   569 |  1061
V     |      15.02%    176 |   84.98%   996 |  1172
R     |      69.00%    187 |   31.00%    84 |   271
Total |      39.27%   2204 |   60.73%  3408 |  5612

We can conclude that verbs are the grammatical category with the widest coverage, and adverbs are the most incomplete. In addition, nouns and adjectives present a coverage of just over 50%.

3 Strategies to improve WordNet

To improve the existing Spanish WordNet we conducted tests with processes that we have called "selectors", following the terminology already used in the field (WoNeF). A selector is a mechanism that, when applied to an English synset, chooses the translation or translations for the Spanish synset based on the original in English. Previously defined selectors were tested, supported by the Apertium and Wiktionary translators, and in addition two new selectors were defined, one based on morphology and the other based on the exploitation of semantic relations between synsets, with frequentist criteria used in distributional semantics. Selectors are applied in two differentiated stages, which are separately evaluated.

3.1 Translation methods

The translation process used was key for the application of this method to create the Spanish WordNet based on the English WordNet. We used two different methods: automatic translation and dictionaries. With regard to dictionaries, Wiktionary was used, as well as a dictionary created from the XML stem files of the Apertium dictionary. The automatic translation used was the one provided by Bing Translator (Bing Online Translator, 2015). These tools were chosen mainly due to their availability, since they are either free and/or open. Wiktionary and Apertium were downloaded from their respective websites, and Bing Translator was used online through its API.

Microsoft's Bing Translator does not take into account the grammatical category of the word to be translated; therefore, there were cases where, if verbs were translated, it would return nouns, or even the same verb in a conjugated form instead of the infinitive form used in the search. In order to solve this problem, it was decided to take the results returned by the translator and conduct a morphological analysis applying Freeling. The procedure entails obtaining all the possible grammatical categories of the word and its lemma, to afterwards select the words with the same grammatical category as the originally translated English word.

We decided to use a dictionary created from the XML stem files of the Apertium dictionary rather than the already processed Apertium dictionary since, for some reason, when making a request the latter would only return one possible translation, even if the XML file contained more. It was possible to obtain all the available translations for each word using the XML stem files.

3.2 Phase 1: Initial selectors

Below we present the experiments conducted with simple selectors already reported in the literature: monosemy and single translation. It is surprising that these selectors are still productive over the currently available version of WordNet, as our experiments show.

Monosemy. Monosemy takes those words found in a single synset. This condition seems to show that there is no ambiguity and, therefore, all translations obtained are added to the corresponding synsets in the Spanish WordNet. For example, when applying this selector to the synset "eng-30-00048268-r", whose lemma is "currently", the three possible translations obtained by the translators, "hoy", "ahora" and "actualmente", are all selected, since "currently" is only found in one synset in the English WordNet.

Single translation. This selector takes all the words that have a single translation into Spanish and places it in all corresponding Spanish WordNet synsets. For example, when applying this selector to the synset "eng-30-00061528-r", whose lemma is "abruptly" and whose returned translation is "abruptamente", the latter is selected since it is the single translation.

Factorization. The factorization selector works at synset level. It takes all synsets from the English WordNet and returns all possible translations for each lemma. Once the set of translations for each lemma is put together, the selector selects those translations found as a common translation for all the lemmas in the synset, that is, the intersection of the translation sets for each lemma. For example, consider the synset "eng-30-01309991-a", whose lemmas are "artless" and "ingenuous". The translations for "artless" are "inocente", "ingenuo" and "cándido", and those for "ingenuous" are "inocente" and "ingenuo". In this case, by applying the selector we obtain "inocente" and "ingenuo" as the common translations.

Levenshtein. This selector uses Levenshtein's edit distance, based on the assumption that, if the distance between a word in English and its translation is short, they can be considered to have the same sense. Minor modifications are made to reduce the distance between a word and its translation. One example of these transformations is the inversion of the letters "r" and "e" applied to the word "tiger" and its corresponding translation "tigre"; after this transformation, the Levenshtein distance becomes 0. When implementing the initial selectors we decided not to use this selector, since it did not return good results during the initial experiments. A possible explanation for this is that Spanish and English do not share as many cognate terms as English and French do, as discussed in the WoNeF article.

Derived adverb. This selector obtains adverbs from the English WordNet and then the adjectives from which these derive. The property "is derived from" provided by the MCR was used to obtain the adjectives from which these adverbs derive. Once the adjective synsets are returned, we obtain all their variants. These are in turn translated so as to later apply morphological derivation rules to build adverbs in Spanish. By applying this selector to the synset "eng-30-00033562-r", whose lemma is "mildly" and which is linked to the adjective synset "eng-30-01508719-a" whose lemma is "mild", we obtain "suavemente" and "levemente". The latter are generated from both available translations for "mild", "suave" and "leve", by applying the following morphological derivation rules. If the adjective ends in an "o", it is replaced by the sequence "amente", for example "lento", resulting in "lentamente". If the adjective ends in an "r" or "n", the sequence "amente" is added, for example "encantador" and "fanfarrón", with their respective results "encantadoramente" and "fanfarronamente". The sequence "mente" is added to the rest of the adjectives that do not fall into the categories mentioned above, for example "educada" and "educadamente". Since this selector builds words by applying morphological derivation rules, we observed that sometimes it would return adverbs that do not exist in Spanish. Therefore, we decided to validate them against a corpus of Spanish news text. To do so, we extracted all adverbs from said corpus to put together a list of adverbs against which to validate the existence of the adverbs built by the selector. The weakness of this validation method lies in the fact that it may discard adverbs which are correct but are not found in the reference corpus. However, we considered it more pertinent to ensure that accurate words were added. Moreover, it is always possible to use a longer list of known adverbs to reduce the number of false negatives.

The single translation, monosemy and factorization selectors were inspired by (Atserias, Climent, Farreres, Rigau, & Guez, 1997), while Levenshtein was used in (Pradet, de Chalendar, & Desormeaux, 2014). The derived adverb selector is our own contribution.
3.3
Phase 2: distributional semantics
For the expansion stage we proposed a selector
that would exploit the relations between synsets
and frequencies of occurrence of both words
within a corpus, to determine which translation
is the correct one for each ambiguous synset. It
117
is worth noting that this selector would be used
when both related lemmas in English are known,
and one of them gets only translation but for the
other one there are several possible translations.
A detailed explanation of the implementation of
this phase is presented below:
Let’s suppose that we have a synset SA associated to synset SB in WordNet through a hypernymy relation. In addition, we have two English
lemmas LA and LB for SA and SB respectively.
The translations for LA are T A1 and T A2 , and the
translations for LB are T B. So to decide which
translation is correct for this lemma, we searched
for the occurrence of each translation in a corpus.
These searches are considered as a function and
represented with letter Θ. This process is called
disambiguation.
For example, for calculating Θ(T A1 , T B) we
count all occurrences of the words T A1 and T B
that happen within the same sentence.
O1 =
O2 =
is affiliated with another or with an organization”.
The semantic relations used for this process
were hypernymy, meronimy and antonymy, and
the frequency counts were performed over the
Spanish news text corpus.
4
Evaluation of results
We show evaluations for the initial selectors, for
the phase 2 process and a global evaluation of results within a lexical semantics effort.
4.1
Quantitative evaluation of phase 1 results
In the evaluation we randomly selected 1000
synsets for each POS (verb, adverb, noun and
adjective). The translations of every lemma in
all the sorted synsets were obtained and the four
selectors mentioned above were applied. The
results obtained were stored in a database.
(Θ(T A1 , T B)
Θ(T A1 ) + Θ(T B)
POS
R
V
A
N
All
(Θ(T A2 , T B)
Θ(T A2 ) + Θ(T B)
Translated
82,80% 1187
71,90% 1226
59,50%
969
71,20% 1036
71,00% 4418
Untranslated
17,20%
246
28,10%
478
40,50%
659
28,80%
419
29,00% 1802
Table 1: Translated lemmas
In case O1 ≥ O2 =⇒ T A1 is chosen as the
translation of LA.
As can be seen, 71 % of the lemmas processed
returned a translation. When we analyze the
data at grammatical category level, we see that
adverbs is the category with the highest translation
percentage, with over 80 %. The other categories
behave in a similar way to each other, adjectives
being the category with the least coverage with
almost 60 % of translations returned.
However, if O1 < O2 =⇒ T A2 is chosen as
the translation of LA.
An example of the application of this expansion
phase follows:
We know that SA = “eng-30-09776346-n”
and SB = “eng-30-09816771-n” are related
through the hypernym relation and they have the
lemmas LA = “affiliate” and LB = “associate”
respectively. Furthermore, we know that T B =
“asociado” and the translation candidates for “affiliate” are T A1 = “filial” and T A2 = “afiliado”.
Because O(f ilial, asociado)
=
0.0 and
O(af iliado, asociado) = 8.18129755379e−05 ,
then we know that O(af iliado, asociado) ≥
O(f ilial, asociado) =⇒ the word T A2 =
“afiliado” is chosen as the translation of LA.
The previous result is correct because the English gloss for SA = “eng-30-09776346-n” is: “a
subordinate or subsidiary associate; a person who
The following table shows the distribution of
the translation of the lemmas for each of the 4000
synsets selected. Our aim was to obtain the results returned for each selector over the total of
lemmas translated, but avoiding the overlapping
of results by providing an order of importance.
There follows the order applied: single selector,
monosemy selector, factorization selector and others. For “V”, “A” and “N” POS, the others include
the translations that were not selected by any selector. For “R” POS, as well as translations not selected by any selector, the translations determined
by the derived adverbs selector are also included.
118
POS
Singulars
Monosemic
and not
singular
R
V
A
N
All
56,40%
58,00%
77,60%
72,70%
70,20%
6,10%
2,00%
5,20%
4,60%
4,70%
Not
monosemic,
not singular
and factored
1,30%
0,80%
0,90%
1,40%
1,20%
A manual qualitative evaluation was conducted
to measure the accuracy of the results. We
randomly selected 25 synsets for each POS (verb,
adverb, noun and adjective) of the added ones, and
we verified if the result was correct or not. For the
selectors that work at synset level, the data in table
5 reflect the percentages of the resulting correct or
incorrect synsets, and for the selectors that work
at lemma level, the percentages correspond to the
resulting correct or incorrect synsets.
Table 2: Translation by selector
As seen here, verbs and adverbs had the worst
result, while adjectives had the best result: 16.3%.
We must remember that these data do not consider
the results of the derived adverbs selector. These
were excluded from the comparison because they
could not be compared with the rest of the POS.
4.2
Synsets for which the initial selectors
obtained results
POS | Yes          | No
R   | 73.90% (739) | 26.10% (261)
V   | 52.80% (528) | 47.20% (472)
A   | 59.90% (599) | 40.10% (401)
N   | 63.70% (637) | 36.30% (363)
All | 62.60% (2845) | 37.40% (1155)

Table 3: Synsets for which the initial selectors obtained results

As seen here, the POS with the highest coverage by the initial selectors is adverbs, with almost 74%; without distinguishing according to POS, there is a 62.60% coverage.

4.3 Accuracy for the initial selectors

Selector           | V       | A       | N       | R       | All
Monosemy           | 93.48%  | 96.08%  | 93.48%  | 97.14%  | 95.04%
Factorization      | 100.00% | 96.00%  | 100.00% | 92.00%  | 97.00%
Single translation | 98.39%  | 100.00% | 100.00% | 94.59%  | 98.25%
Derived adverb     | -       | -       | -       | 92.00%  | -

Table 5: Accuracy for the initial selectors

As seen in the table above, the results of the four selectors were very good: all show over 92% effectiveness, and some reach 100% for some POS. Although the derived adverbs selector was the least accurate one, it still returned a very good result: 92%.

5 Evaluation of phase 2 results

The 1040 synsets that were not translated in phase 1 because they were ambiguous were processed and evaluated in phase 2. As phase 2 can fail for various reasons, in this section we present detailed information about the results obtained in order to identify such reasons. Since phase 2 exploits the relations between the synsets existing in WordNet up to the present, if a synset is not related to any other synset, or if it is but the related synsets are empty for Spanish, the method returns no results. Therefore, three different groups can be observed in the following table.

POS   | Without relations | With relations and no trans. | With relations and trans.
R     | 83.10%            | 10.56%                       | 6.34%
V     | 1.20%             | 10.40%                       | 88.40%
A     | 1.79%             | 34.52%                       | 63.69%
N     | 0.00%             | 22.17%                       | 77.83%
Total | 12.21%            | 16.92%                       | 70.87%

As seen here, adverbs is the grammatical category with the least connected synsets, which shows that our method does not return good results for this POS. The other grammatical categories have enough relations, and those relations are sufficiently complete, for phase 2 to return results.

5.1 Comparison with current WordNet

POS | New          | Existent
R   | 83.80% (694) | 16.20% (134)
V   | 50.40% (390) | 49.60% (384)
A   | 62.50% (429) | 37.50% (257)
N   | 54.80% (423) | 45.20% (349)
All | 63.30% (1936) | 36.70% (1124)

Table 4: Comparison with current WordNet

As seen here, for each POS there was a high percentage of synsets whose translations were not found in the current Spanish WordNet (MCR 3.0). Adverbs is the grammatical category with the highest percentage: approximately 83%. In total, just over 63% of the synsets were new. As only the initial selectors had been applied at this point, we concluded that we would see a significant improvement at the end of the process.
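The phase 2 procedure described above gathers candidate translations for a lemma and validates them in a corpus against the Spanish variants of related synsets. The following sketch is our own reconstruction under stated assumptions: the toy relation graph, variant lists, candidate dictionary, corpus, and the `phase2` name are all illustrative, not the paper's implementation.

```python
# Toy data: relations between synset ids and Spanish variants already present.
RELATIONS = {"syn1": ["syn2", "syn3"]}
SPANISH_VARIANTS = {"syn2": ["perro"], "syn3": []}
# Candidate translations for the target lemma, e.g. from a bilingual dictionary.
CANDIDATES = {"dog": ["perro", "can"]}
CORPUS = "el perro ladra toda la noche"

def phase2(synset_id, english_lemma):
    """Pick candidate translations that occur in the corpus, provided the
    related synsets supply some Spanish context; fail when no relations help."""
    related = RELATIONS.get(synset_id, [])
    context = {v for r in related for v in SPANISH_VARIANTS.get(r, [])}
    if not context:
        return None   # no relations, or related synsets empty for Spanish
    tokens = set(CORPUS.split())
    return [c for c in CANDIDATES.get(english_lemma, [])
            if c in tokens and context & tokens]

print(phase2("syn1", "dog"))   # ['perro']
```

The `None` branch mirrors the two failure modes listed in the text: synsets with no relations, and relations whose target synsets have no Spanish variants yet.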
5.2 Lemmas processed in phase 2 with relations and with translations for these relations

It is important to highlight that, for lemmas corresponding to synsets associated with other already complete synsets, the method applied in phase 2 can still fail if there were no occurrences in the corpus of the possible candidates for all lemmas. This is shown in the following results.

POS   | With result  | Without result
R     | 33.33% (3)   | 66.67% (6)
V     | 63.80% (282) | 36.20% (160)
A     | 60.75% (65)  | 39.25% (42)
N     | 70.95% (127) | 29.05% (52)
Total | 64.72% (477) | 35.28% (260)

The lemma “cup” of synset “eng-30-03147901-n”, with the sense of “trophy”, is a good example of a phase 2 error. The translations obtained for the lemma were “taza” and “copa”, and when requesting disambiguation the process selected “taza”, which is not the correct meaning for this synset.

5.3 Comparison with current WordNet

In this section we compare the results obtained in phase 2 with those in the current WordNet, as only the results that do not appear in the current WordNet entail a real increase in the completeness of WordNet.
POS   | Not present  | Present
R     | 66.67% (2)   | 33.33% (1)
V     | 73.05% (206) | 26.95% (76)
A     | 52.31% (34)  | 47.69% (31)
N     | 62.20% (79)  | 37.80% (48)
Total | 67.30% (321) | 32.70% (156)
5.4 Manual evaluation of disambiguated synsets

A manual qualitative evaluation was conducted to measure the accuracy of the results. We randomly selected 25 synsets for each POS (verb, adverb, noun and adjective) and we verified whether the result was correct or not. We must remember that for adverbs there were only two results. It is important to note that most of the errors detected at this stage correspond to lemmas that had been accurately translated but whose translation was not the correct one for the synset in question.

POS   | Correct | Incorrect
R     | 100.00% | 0.00%
V     | 68.00%  | 32.00%
A     | 84.00%  | 16.00%
N     | 68.00%  | 32.00%
Total | 74.03%  | 25.97%

From these evaluations we can conclude that phase 2 was not as accurate as phase 1. These results could be improved by increasing the size of the corpus or by improving the method. A larger corpus would have more sentences, that is to say, more contexts where the meaning of candidates can be validated; it would also reduce the 35% of lemmas that obtained no result at all. The method itself could be improved by discarding translations that do not match the English lemma in gender. Doing this would discard cases like that of synset “spa-30-10129825-n”, whose gloss is “mujer joven”. For the lemma “girl”, which corresponds to said English synset, a possible translation obtained was “chico”. This is a clear example where the original lemma in English and the resulting translation do not match in gender. Another way to improve the method would be to prioritize some specific relations.

6 Evaluation of the results on the Corin lexicon

To evaluate the results obtained in both phases we implemented a task to measure the semantic coverage on a small corpus, in this case Corin. For this task we obtained all the lemmas in the corpus, applied FreeLing to determine the grammatical category of each, and then searched WordNet. This process was first executed with the original WordNet, our starting point, and then with the resulting WordNet. The aim was to measure the improvement of the resulting WordNet over the current WordNet in the coverage of the lemmas in the corpus under study. We must remember that the process to improve WordNet was executed on a random set of 1000 synsets per POS. The results obtained must be weighed considering the percentage these synsets represent within the total number of synsets for each POS. These percentages are shown in the following table.
POS | Total synsets | Processed synsets
V   | 13845         | 1000 (7.22%)
N   | 83090         | 1000 (1.20%)
R   | 3621          | 1000 (27.62%)
A   | 18156         | 1000 (5.51%)
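The coverage task itself amounts to checking, per POS, what fraction of the corpus lemmas is present in a given WordNet version. A minimal sketch of that computation (the toy lemma lists stand in for the FreeLing-tagged lemmas of the Corin corpus; the function name is ours):

```python
def coverage(corpus_lemmas, wordnet_lemmas):
    """Percentage of corpus lemmas, per POS, found in the wordnet."""
    result = {}
    for pos, lemmas in corpus_lemmas.items():
        known = wordnet_lemmas.get(pos, set())
        found = sum(1 for lemma in lemmas if lemma in known)
        result[pos] = 100.0 * found / len(lemmas)
    return result

# Toy example: two adverbs out of three appear in the lexicon.
corpus = {"R": ["bien", "mal", "regular"]}
lexicon = {"R": {"bien", "mal"}}
print(coverage(corpus, lexicon))
```

Running this once with the original WordNet's lemma sets and once with the expanded ones yields the two coverage columns reported below.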
There follows a table with the percentages of coverage obtained according to each POS, for the two versions of WordNet: the original one and the one expanded by this method.

POS | In the corpus | Original WordNet | Expanded WordNet
V   | 1639          | 1235 (75.35%)    | 1268 (77.36%)
N   | 4012          | 2780 (69.29%)    | 2812 (70.09%)
R   | 369           | 121 (32.79%)     | 232 (62.87%)
A   | 1647          | 840 (51.00%)     | 895 (54.34%)

We can conclude that adverbs was the category with the best results, reaching a coverage of almost 63% over the original 33%. Two reasons explain this: first, adverbs is the category least covered by the original WordNet; second, it is the POS where the strategy was applied to the largest share of synsets, just over 27%. The coverage also improved for the other POS. Though it is true that the improvement was relatively small (between 1% and 3%), we must remember that in these cases the method was applied to a small percentage of the synsets in WordNet.

7 Conclusions

Different strategies were designed and implemented in order to enrich the current Spanish WordNet from the English WordNet within the context of the expansion model. The strategy was to use a series of selectors, called “initial selectors”, as a first step. We then applied a method based on the exploitation of the semantic relations of WordNet so as to add variants that the initial selectors had not been able to add. The results obtained show that the strategy used is effective, as it entails a significant improvement of the current Spanish WordNet, thus meeting the initial expectations. One of the weaknesses lies in the translation methods and tools, as they provide the resources our proposals are based on; this is why they strongly condition the final results. Regarding the strategy implemented, the initial selectors are sufficient to significantly improve the current WordNet, with a 92% accuracy, while phase 2 reached a 74% accuracy.

References

Agirre, A. G., Laparra, E., Rigau, G., & Donostia, B. C. (2012). Multilingual Central Repository version 3.0: Upgrading a very large lexical knowledge base. In GWC 2012: 6th International Global WordNet Conference.

Armentano-Oller, C., Corbí-Bellot, A. M., Forcada, M. L., Ginestí-Rosell, M., Montava Belda, M. A., Ortiz-Rojas, S., . . . Sánchez-Martínez, F. (n.d.).

Atserias, J., Climent, S., Farreres, X., Rigau, G., & Rodríguez, H. (1997). Combining multiple methods for the automatic construction of multilingual wordnets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’97), Tzigov Chark (pp. 143–149).

Benítez, L., Cervell, S., Escudero, G., López, M., Rigau, G., & Taulé, M. (1998). Methods and tools for building the Catalan WordNet. In Proceedings of the ELRA Workshop on Language Resources for European Minority Languages.

Bing online translator. (2015). https://www.bing.com/translator.

Carreras, X., Chao, I., Padró, L., & Padró, M. (2004). FreeLing: An open-source suite of language analyzers. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC’04).

Grassi, M., Malcuori, M., Couto, J., Prada, J. J., & Wonsever, D. (2001). Corpus informatizado: textos del español del Uruguay (Corin). In SLPLT-2: Second International Workshop on Spanish Language Processing and Language Technologies, Jaén, España.

Pradet, Q., de Chalendar, G., & Desormeaux, J. B. (2014). WoNeF, an improved, expanded and evaluated automatic French translation of WordNet. In Proceedings of the Seventh Global WordNet Conference (p. 32).

Vossen, P. (1998). Introduction to EuroWordNet. In EuroWordNet: A multilingual database with lexical semantic networks (pp. 1–17). Springer.

Wikimedia Foundation. (2008). Wiktionary. http://www.wiktionary.org.
An Analysis of WordNet’s Coverage of Gender Identity Using Twitter and
The National Transgender Discrimination Survey

Amanda Hicks
University of Florida
Gainesville, FL, USA
aehicks@ufl.edu

Jiang Bian
University of Florida
Gainesville, FL, USA
bianjiang@ufl.edu

Michael Rutherford
University of Arkansas for Medical Sciences
Little Rock, AR, USA
mwrutherford@uams.edu

Christiane Fellbaum
Princeton University
Princeton, NJ, USA
fellbaum@princeton.edu

Abstract

While gender identities in the Western world are typically regarded as binary, our previous work (Hicks et al., 2015) shows that there is considerably more lexical variety in gender identity and in the ways people identify their gender. There is also a growing need to lexically represent this variety of gender identities. In our previous work, we developed a set of tools and approaches for analyzing Twitter data as a basis for generating hypotheses on the language used to identify gender and discuss gender-related issues across geographic regions and population groups in the U.S.A. In this paper we analyze the coverage and relative frequency of the word forms in our Twitter analysis with respect to the National Transgender Discrimination Survey data set, one of the most comprehensive data sets on transgender, gender non-conforming, and gender variant people in the U.S.A. We then analyze the coverage of WordNet, a widely used lexical database, with respect to these identities, and discuss some key considerations and next steps for adding gender identity words and their meanings to WordNet.

1 Introduction

Gender identity is richly lexicalized in American English. Nevertheless, a cursory investigation of gender identity in WordNet (Miller, 1995) suggests that coverage of non-binary gender identity is low. The goal of our research is to measure WordNet’s coverage of gender identity and to suggest steps to improve it.

There is increasing incentive to include gender identity terms and other words that are relevant to transgender, gender variant, non-binary, and gender non-conforming people in WordNet. For example, the Institute of Medicine (IOM) recently recommended (1) gathering data on sexual orientation and gender identity in Electronic Health Records (EHR) as part of the meaningful use objectives in EHRs, (2) developing standardization of sexual orientation and gender identity measures to facilitate synthesizing scientific knowledge about the health of sexual and gender minorities, and (3) supporting research to develop innovative methods of conducting research with small populations to determine the best ways to collect information on LGBT minorities. Furthermore, it is important for the medical community to use words that are common among patients and research participants, since the use of language that is familiar to the participant has been shown to improve response rates in data collection (Catania et al., 1996; Institute of Medicine, 2011; Alper et al., 2013).

However, there are challenges to determining which words to include in WordNet and how to define them. Based on the limited research available, some evidence (Dargie et al., 2015; Kuper et al., 2012; Scheim and Bauer, 2015) suggests that vocabulary for self-identifying gender and sexual orientation varies by community. There is clear evidence of lexical variation associated with geography in linguistics studies (Carver, 1987; Chambers, 2001; Nerbonne, 2013). Also, through discussions with members of the trans* community and health care providers at LGBT clinics across the country, we have learned that new words are frequently coined to describe gender identity and that the connotations of existing words may vary across communities. We use ‘trans*’ broadly to refer to transgender, transsexual, gender non-conforming, gender variant, and non-binary individuals.

User generated content on social media, such as Twitter, is a valuable resource because it can provide a source for gleaning information about people’s daily lives to answer scientific questions. In our previous work, we produced a data set to investigate words used to discuss gender in the general population and among self-identifying trans* persons using Twitter (Hicks et al., 2015). With ‘self-identifying’ we refer to people who have stated that they have a trans* identity either through their tweets or in the National Transgender Discrimination Survey (NTDS) (Grant et al., 2011). We believe that we can augment our Twitter data set with the NTDS data to produce a data set that is in sync with current speakers’ language, that can serve as a starting point for enriching WordNet’s coverage of gender identity, and that can contribute to the medical and clinical goals outlined at the beginning of this section.

2 Methods

Here we describe our language analysis of the Twitter data and the NTDS data.
2.1 Language Analysis of Twitter Data
The general idea underlying our approach is to identify tweets that are relevant to the discussion of trans* related issues and then examine the variations in language used for gender identification by different communities, that is, by population (trans* people vs. the general public) and by geographical location (U.S. states). The analysis workflow consists of five main steps, as depicted in Figure 1:

1) collect tweets that are potentially related to discussions about gender identification;
2) preprocess and geotag tweets with their corresponding U.S. state;
3) build supervised classification models based on textual features in the tweets to a) filter out irrelevant tweets and b) find people who are self-identified as trans*;
4) collect the Twitter timelines of relevant users (both self-identifying trans* users and users in the general public who discussed trans* related issues), consisting of all of their tweets in chronological order; and
5) compare the usage of gender identification words by geographical location (i.e., by U.S. state) and by population group (self-identifying trans* people vs. the general public).
Some of the search terms are ambiguous and
their meanings are context dependent. For example, the tweet ‘That Hot Pocket is full of trans fats’
is not related to discussions of gender identification even though it contains the keyword ‘trans’.
To account for this observation, we engineered a
binary classifier to determine the likelihood that a
tweet is relevant to the discussion of gender identification and to remove those that are unlikely
to be relevant from the corpus in step 3. We
also leverage a number of visualization techniques to provide straightforward and easy-to-understand visual representations, namely word clouds, co-occurrence matrices, and network graphs, to substantiate our findings. A full description of this work and analysis of terms can be found in (Hicks et al., 2015).
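The paper’s relevance filter is a supervised binary classifier over textual features. As a stand-in, here is a much simpler keyword-context scorer, a toy heuristic of our own rather than the authors’ model, that illustrates why the keyword ‘trans’ alone is insufficient:

```python
# Toy context vocabularies (our own assumption, not the paper's feature set).
GENDER_CONTEXT = {"gender", "identity", "rights", "community", "lgbt"}
FOOD_CONTEXT = {"fats", "fat", "pocket", "snack"}

def is_relevant(tweet):
    """Toy heuristic: keep a 'trans' tweet only if gender-related context
    words outnumber food-related ones."""
    words = set(tweet.lower().replace("*", "").split())
    if "trans" not in words and not any(w.startswith("trans") for w in words):
        return False
    return len(words & GENDER_CONTEXT) > len(words & FOOD_CONTEXT)

print(is_relevant("That Hot Pocket is full of trans fats"))   # False
print(is_relevant("proud member of the trans community"))     # True
```

A trained classifier generalizes far beyond such hand-picked word lists, but the underlying idea, scoring the lexical context around the ambiguous keyword, is the same.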
The National Transgender Discrimination Survey (NTDS) is the largest survey of the trans* population in the United States to date (Harrison et al.,
2012). The survey was designed to collect information about “the broadest possible swath of experiences of transgender and gender nonconforming people” in the U.S.A., including questions
about how participants identify their own gender
and an option to write in one’s own identity (Harrison et al., 2012). We have compiled a list of
the gender-identity word forms (henceforth simply ‘words’) from this survey and performed a normalized frequency analysis that can be compared
to our Twitter data set.
In our previous work we built a data set and
visualization tools that show relative frequency
and co-occurrence networks for American English
trans* words on Twitter (Grant et al., 2011). Our
goal in this paper is to perform a two-fold coverage analysis of WordNet with respect to American
English gender identity.
Our hypothesis is that a comprehensive list of
words used to self-identify gender will require examining the words trans* people use in different
contexts. In order to evaluate this hypothesis, we
perform a frequency analysis of words from both
sets.
Our approach is as follows. First, we compare
the trans* identity words that we identified in our
previous work with the words from the NTDS to
assess the coverage of the Twitter set. Next, we
produce an updated set of words using the NTDS
and compare WordNet’s coverage of gender identity against this list.
Figure 1: The analysis workflow for identifying tweets related to trans* issues

2.2 Language Analysis of NTDS Data

Unlike the Twitter study, the NTDS dataset did not require preprocessing for language filtering or geotagging, nor the mining techniques for the identification of relevant trans* individuals. Knowing that the records were all unique self-identified trans* individuals, we were able to skip ahead to Step 5, the term usage analysis. The Twitter data analysis methods were duplicated and restricted to the term extraction and usage analysis, including term frequencies and word cloud generation.

We utilized questions three and four from the NTDS. These questions asked what gender identity the respondent identified with at the time of the survey and how strongly they identified with certain identities. Figure 2 shows these questions. Term frequency analyses were generated based on all words utilized, no matter the degree the respondent specified (strongly, somewhat, or not at all). The frequencies were then measured both at a state and national level for coverage comparisons with the Twitter set.

2.3 Coverage Analysis of Twitter Words

We performed a coverage analysis of the words in the Twitter data set against those from the NTDS data set. We collated all of the words in the NTDS questions three and four as well as the identity words used in the write-in responses. We removed terms that were preceded by a hash tag in the Twitter set and words that were used only once in the NTDS set, and then we measured the number of common words from both the Twitter list and the NTDS list. Due to the character limit on Twitter, abbreviations are common in tweets, as are alternate spellings of words (e.g., ‘gender queer’ and ‘gender-queer’). We therefore gathered words into groups consisting of alternative spellings and abbreviations; ‘genderqueer’ and ‘gender-queer’ are in the same group. Henceforth we call these groups of word forms simply ‘groups’. We measured the degree of overlap of groups in Twitter and in the NTDS, which is reported in the results section of this paper.

2.4 Coverage Analysis of WordNet

Our next step was to generate a list of words to use in the coverage analysis of WordNet. We removed the Twitter terms that contained a hash tag from the Twitter data set and removed word forms that had only one occurrence in the NTDS set. We then took the union of these sets to produce a set of words for evaluating the coverage of WordNet. Similarly, we produced a list of groups with alternate spellings and abbreviations by taking the union of the corresponding groups for the Twitter list and the NTDS list. For example, the NTDS word groups contained the set of word forms (gender non-conforming, gender non conforming) and the Twitter word groups contained (gender non-conforming, gnc). The compiled set of groups contains (gender non-conforming, gender non conforming, gnc).
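Grouping alternate spellings such as ‘gender queer’, ‘gender-queer’, and ‘genderqueer’ can be done by normalizing away spaces and hyphens. This is a sketch of the idea only; the paper also groups abbreviations, which need an explicit mapping such as the hypothetical one below (‘gnc’ does appear in the paper’s own example).

```python
from collections import defaultdict

# Hypothetical abbreviation expansions (illustration only).
ABBREVIATIONS = {"gnc": "gender non-conforming"}

def norm(word):
    """Collapse spaces and hyphens so spelling variants share one group key."""
    word = ABBREVIATIONS.get(word, word)
    return word.lower().replace("-", "").replace(" ", "")

def group(words):
    groups = defaultdict(set)
    for w in words:
        groups[norm(w)].add(w)
    return dict(groups)

terms = ["gender queer", "gender-queer", "genderqueer", "gnc",
         "gender non-conforming", "gender non conforming"]
for key, members in sorted(group(terms).items()):
    print(key, sorted(members))
```

With these six forms, the sketch yields exactly two groups, matching the paper’s notion of a ‘group’ of word forms.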
Figure 2: Questions 3 and 4 from the National Transgender Discrimination Survey, which ask respondents to report their gender identity
We automatically searched for words and groups of synonymous words (‘synsets’) that corresponded to our words and groups using the Natural Language Tool Kit’s (NLTK) interface for WordNet 3.0 (Bird et al., 2009). We then manually evaluated which synsets were relevant to gender identity. We did not evaluate whether the WordNet definition accurately characterized the intended meaning of the word, in part because we do not have a reliable method for ascertaining the intended meaning of the word, and also because that is outside the scope of our coverage analysis.

Many of the groups that did not have a corresponding synset in WordNet 3.0 were compounds such as ‘trans person of color’. Our next step was therefore to produce a list of the words occurring in compounds and to search for corresponding synsets in WordNet. We manually identified compounds and then generated a set of the words in the compounds. We removed stop words from the set with NLTK. Once again we programmatically searched for synsets using NLTK and then manually evaluated whether the retrieved synset was relevant to gender identity. We classified the compounds into three groups: (1) those that were partially covered by WordNet, meaning they contained at least one word that corresponded to a relevant synset and at least one that did not, (2) those that were completely covered by WordNet, meaning every word in the compound (excluding stop words) was represented in WordNet, and (3) those that had no coverage in WordNet.

3 Results

First we discuss the results of the analysis of our Twitter data. Then we discuss our analysis of WordNet’s coverage of trans* related terms.

3.1 Language Analysis of Twitter Data

We collected over 53.8 million tweets matching the search queries during a 116-day period from January 17, 2015 to May 12, 2015 inclusive. Out of the collected tweets, about 29 million (54.2%) were in English. We were able to extract location information for 368,518 tweets (1.26% of English tweets, from 119,778 unique users), which we retained for further processing. We eliminated the tweets that were deemed irrelevant (15,478 tweets from 3,785 users) based on a classification model we developed (Hicks et al., 2015). From the remaining records, 115,993 Twitter users were classified as relevant, of which 1,921 users were classified as self-identifying trans*. In addition to the data we collected using the search API, we crawled more than 337.9 million tweets from the 115,993 relevant Twitter users’ timelines. Out of these 337.9 million tweets, 872,340 Twitter messages contain one or more of the keyword forms of interest. These 872k tweets comprise the corpus we used for language usage analysis.

3.2 Coverage of Twitter Word Groups

Table 1 contains a summary of the degree of overlap between the set of Twitter trans* words and their groups and the NTDS trans* words and their groups. Only about 18% of the NTDS groups were represented in the Twitter data set. Section 4.2 contains a discussion of some of the main reasons for the most frequent word forms not being in the Twitter data set.

Table 1: The percentage of overlap among NTDS and Twitter words and groups

The word clouds in Figure 3 illustrate two interesting facts about the words used to self-describe trans* identity. First, different words appear in different contexts. For example, ‘cis’ and ‘shemale’ are prevalent on Twitter but not in the NTDS. Second, even words that are common across contexts are used with different frequency. For example, ‘genderqueer’ is prominent in the NTDS word cloud but relatively small in the Twitter word cloud (top left-hand quadrant). Conversely, ‘transgender’ is more prominent in the Twitter word cloud than in the NTDS one.

Figure 3: Word clouds representing the relative frequency of trans* words used by self-identifying trans* people on Twitter in the U.S.A. (left) and self-identifying trans* people in questions three and four of the NTDS (right)

3.3 WordNet’s Coverage of Gender Identities

We found that 39% of the words in our compiled list of trans* groups have a corresponding synset in WordNet 3.0. Another 28% of the words were compounds that contain at least one component word with a corresponding synset in WordNet and one without. 33% of the words did not have any corresponding entries in WordNet. These results are summarized in Figure 4. Table 2 shows a numerical analysis of WordNet 3.0’s coverage of our trans* related words.

Figure 4: Summary of WordNet 3.0’s coverage of trans* word groups
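The three-way classification of compounds described in Section 2.4 can be sketched as follows. The lexicon and stop-word list are toy stand-ins for WordNet 3.0 lookups and NLTK’s stop words, and the function name is ours.

```python
STOP_WORDS = {"of", "the", "a"}
# Toy stand-in for "has a relevant synset in WordNet 3.0".
LEXICON = {"trans", "person", "color", "identity"}

def classify(compound):
    """Return 'full', 'partial', or 'none' coverage for a compound:
    full    -> every non-stop word has a relevant synset,
    partial -> at least one covered word and at least one uncovered,
    none    -> no word is covered."""
    words = [w for w in compound.lower().split() if w not in STOP_WORDS]
    covered = sum(1 for w in words if w in LEXICON)
    if covered == len(words):
        return "full"
    return "partial" if covered else "none"

print(classify("trans person of color"))   # full (component-wise)
print(classify("gender fluid"))            # none (with this toy lexicon)
```

Note that ‘full’ here means component-wise coverage only: as the paper stresses, a compound like ‘trans person of color’ may still lack a synset of its own.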
Table 2: Analysis of trans* word groups in WordNet 3.0 reported by number

4 Discussion

4.1 Limitations

We note that our previous study is limited by the user demographics available on social media
platforms. The users of social media tend to be
younger; 37% of Twitter users are under 30, while
only 10% are 65 or older, as of 2014 (Duggan, Ellison, Lampe, Lenhart, & Madden, 2014). There
are also power users who exhibit a substantially
greater level of activity than the average user (Pew
Research Center, 2015). These characteristics are
likely to create sample bias and impose limitations
on mining meaningful information from Twitter
that represents a broader population. For instance,
Twitter data may not be reliable for mining information about older people who may not use Twitter.
The NTDS was published in 2011, but more
current data are being collected at the time of
writing this paper. The Transgender Survey 2015
was launched in August 2015 (U.S, 2015) and the
PRIDE study in June 2015 (PRI, 2015). We expect
these newer data sources to be completed within
the next year or two. Both studies collect demographic data on trans* individuals, including identity words. This will provide insight into which
words are relatively stable over time and may also
reveal words that are emerging as more prevalent.
Table 3: Ten most frequent words in NTDS

4.2 Words Excluded From Twitter Search Terms

While compiling a list of words for Twitter, we observed the distinctions among trans* identities, intersex conditions, and sexual orientation. As a result we excluded words that were specifically intersex related or that described sexual orientation from the Twitter set. However, intersex and sexual orientation words were among participant responses in the NTDS and so were included in our NTDS data set. The heterogeneous nature of the Twitter term lists and NTDS term lists may skew the coverage analysis of our Twitter list. However, this heterogeneity is valuable for our analysis of WordNet’s coverage, since it provides a more comprehensive list of words that trans* people use to describe their own identities.
An examination of Tables 3 and 4 reveals three main reasons why words from the NTDS term lists were not included in the Twitter term lists. (1) Polysemy: ‘aggressive’ is polysemous and would result in too many false hits in the Twitter search; similarly, ‘androgynous’ produced too many false hits, since many people who used this word were tweeting about fashion. (2) Gender words that are not trans* specific: ‘male’, ‘female’, ‘woman’, and ‘man’ are used with such prevalence that we excluded them from the Twitter set, since they are unhelpful in identifying tweets about trans* issues. (3) Identity words that are not trans* specific: ‘butch’ and ‘intersex’ were deliberately excluded from the Twitter set, since we were following the conceptual distinctions among sexual orientation, gender identity, and intersex. However, the NTDS data set shows that when individuals describe their gender identities, they do not limit their descriptions to these high-level distinctions.
Table 4: The ten most frequent words in the NTDS write-in fields in questions three and four

4.3 Suggestions for Integrating Gender Identity Into the WordNet Database
Approximately one third of the compounds with
partial or no coverage have ‘gender’ as a component term. The synsets for ‘gender’ in WordNet
are tied to biological properties and reproductive
roles, and there is no synset for gender as a social role independently of reproductive features.
Other words that would have a significant effect
on WordNet’s coverage of compounds are ‘trans’,
‘genderqueer’, and ‘femme’. Some words that are relevant to trans* issues, such as ‘agender’, ‘cisgender’ (describing somebody who is not trans*), and ‘binarism’, are missing altogether.
In addition to adding more words to integrate
gender identity in WordNet, efforts should be
made to craft informed definitions and example
sentences of new words and to evaluate the accuracy of existing entries. Likewise, more work
needs to be done to identify synsets. The word
groups that we used for this study grouped morphologically similar words such as ‘gender queer’
and ‘gender-queer’. However, we did not group
words like ‘agender’ and ‘genderless’ into synsets.
Methods for reliably detecting synonyms of gender identity words should be developed and tested.
Finally, methods also need to be developed
for establishing hierarchy relations among gender identity words. Such methods may include
testing established lexical patterns with English
speakers who are competent with trans* vocabulary (Hearst, 1992). Another approach may in-
clude leveraging the responses in question 4 of the
NTDS to detect hierarchy relations. For example, if most participants who identify strongly as
transgender also identify strongly as genderqueer
but not vice versa, this could indicate that ‘genderqueer’ is a hypernym of ‘transgender’.
4.4 Future Work
Wordnets have been built in some seventy different languages, and each reflects the culture
of the speakers. Mapping gender identity words
across languages should reveal interesting similarities and differences. For example, India allows
its citizens to officially identify as ‘third gender’,
or hijra, a term that encompasses biological males
dressing in women’s clothes as well as intersex individuals. Future research within the global wordnet community could ask whether such officially sanctioned words cover distinct words used in specific communities and, if so, how they correspond to the English words identified in our work.
Twitter corpora can show which terms are used in
similar or identical contexts (n-grams), suggesting
synonymy and shared synset membership. Additionally, questionnaires could be developed and
submitted to the trans* population for input on
how to accurately represent the terms. Reflecting
geographic and group differences poses additional
challenges, akin to dialectal variation that is currently marked in WordNet with usage flags.
5
Conclusion
Our hypothesis was that a comprehensive list of words used to describe gender identity will require sets of words taken from different contexts. To test this hypothesis we performed a coverage analysis of trans* words taken from two different contexts, Twitter and the National Transgender Discrimination Survey. We found that while there was some overlap, there was significant variation in the words used between these contexts. As a result, we generated a more comprehensive list of trans* words from both sources. A second aim of this paper was to assess WordNet’s coverage of trans* identity. We found that, while there is some coverage of trans* words in WordNet, there is more work to be done to ensure more comprehensive coverage.
Acknowledgements
We are grateful to the National Center for Transgender Equality (NCTE) for providing the dataset from the National Transgender Discrimination Survey. This work was supported in part by the NIH/NCATS Clinical and Translational Science Awards to the University of Florida UL1 TR000064 and the University of Arkansas for Medical Sciences UL1 TR000039. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the NCTE.
References
Joe Alper, Monica N Feit, Jon Q Sanders, et al. 2013. Collecting Sexual Orientation and Gender Identity Data in Electronic Health Records: Workshop Summary. National Academies Press.
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media, Inc.
Craig M Carver. 1987. American Regional Dialects: A Word Geography. University of Michigan Press.
Joseph A Catania, Diane Binson, Jesse Canchola, Lance M Pollack, Walter Hauck, and Thomas J Coates. 1996. Effects of interviewer gender, interviewer choice, and item wording on responses to questions concerning sexual behavior. Public Opinion Quarterly, 60(3):345–375.
Jack K Chambers. 2001. Region and language variation. English World-Wide, 21(2):169–199.
Emma Dargie, Karen L Blair, Caroline F Pukall, and Shannon M Coyle. 2015. Somewhere under the rainbow: Exploring the identities and experiences of trans persons. The Canadian Journal of Human Sexuality.
Jaime M Grant, Lisa Mottet, Justin Edward Tanis, Jack Harrison, Jody Herman, and Mara Keisling. 2011. Injustice at Every Turn: A Report of the National Transgender Discrimination Survey. National Center for Transgender Equality.
Jack Harrison, Jaime Grant, and Jody L Herman. 2012. A gender not listed here: Genderqueers, gender rebels, and otherwise in the National Transgender Discrimination Survey. LGBTQ Public Policy Journal at the Harvard Kennedy School, 2(1).
Marti A Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, Volume 2, pages 539–545. Association for Computational Linguistics.
Amanda Hicks, William R. Hogan, Michael Rutherford, Bradley Malin, Mengjun Xie, Christiane Fellbaum, Zhijun Yin, Daniel Fabbri, Josh Hanna, and Jiang Bian. 2015. Mining Twitter as a first step toward assessing the adequacy of gender identification terms on intake forms. In Proceedings of the AMIA 2015 Annual Symposium. American Medical Informatics Association.
Institute of Medicine. 2011. The Health of Lesbian, Gay, Bisexual, and Transgender People: Building a Foundation for Better Understanding.
Laura E Kuper, Robin Nussbaum, and Brian Mustanski. 2012. Exploring the diversity of gender and sexual orientation identities in an online sample of transgender individuals. Journal of Sex Research, 49(2-3):244–254.
John Nerbonne. 2013. How much does geography influence language variation? Space in Language and Linguistics: Geographical, Interactional, and Cognitive Perspectives, pages 220–236.
2015. The Population Research in Identity and Disparities for Equality (PRIDE) Study. http://www.pridestudy.org. Accessed: 2015-08-24.
Ayden I Scheim and Greta R Bauer. 2015. Sex and gender diversity among transgender persons in Ontario, Canada: Results from a respondent-driven sampling survey. The Journal of Sex Research, 52(1):1–14.
2015. U.S. Trans Survey 2015. http://www.transsurvey.org.
Where Bears Have the Eyes of Currant: Towards a Mansi WordNet
Csilla Horváth1 , Ágoston Nagy1 , Norbert Szilágyi2 , Veronika Vincze3
1
University of Szeged, Institute of English–American Studies
Egyetem u. 2., 6720 Szeged, Hungary
[email protected], [email protected]
2
University of Szeged, Department of Finno-Ugrian Studies
Egyetem u. 2., 6720 Szeged, Hungary
[email protected]
3
Hungarian Academy of Sciences, Research Group on Artificial Intelligence
Tisza Lajos krt. 103., 6720 Szeged, Hungary
[email protected]
Abstract
Here we report the construction of a wordnet for Mansi, an endangered minority language spoken in Russia. We will pay special attention to challenges that we encountered during the building process, among which the most important ones are the low number of native speakers, the lack of thesauri and the bear language. We will discuss our solutions to these issues, which might have some theoretical implications for the methodology of wordnet building in general.
1
Introduction
Wordnets are lexical databases that are organized according to semantic and lexical relations between groups of words. They are supposed to reflect the internal organization of the human mind (Miller et al., 1990). The first wordnet was constructed for English (Miller et al., 1990), and since that time wordnets have been built for many languages, including several European languages, mostly in the framework of EuroWordNet and BalkaNet (Alonge et al., 1998; Tufiş et al., 2004), and other languages such as Arabic, Chinese, Persian, Hindi, Tulu, Dravidian, Tamil, Telugu, Sanskrit, Assamese, Filipino, Gujarati, Nepali, Kurdish, Sinhala (Tanács et al., 2008; Bhattacharyya et al., 2010; Fellbaum and Vossen, 2012; Orav et al., 2014). Synsets within wordnets for different languages are usually linked to each other, so concepts from one language can be easily mapped to those in another language. Wordnets can be beneficial for several natural language processing tasks, be they mono- or multilingual: for instance, in machine translation, information retrieval and so on.
In this paper, we aim at constructing a wordnet for Mansi, an indigenous language spoken in Russia. Mansi is an endangered minority language, with fewer than 1,000 native speakers. Most often, minority languages are not recognized as official languages in their respective countries, where there is an official language (in this case, Russian) and there are one or several minority languages (e.g. Mansi, Nenets, Saami etc.). Hence, the speakers of minority languages are bilingual, and usually use the official or majority language in their studies and work, and the language of administration is the majority language as well. The minority language is typically restricted to the private sphere, i.e. among family members and friends, and thus it is mostly used in oral communication, with only sporadic examples of writing in the minority language (Vincze et al., 2015). Also, the cultural and ethnographic background of the Mansi people may affect language use: certain artifacts used by Mansi people that are unknown to Western cultures have their own vocabulary items in Mansi and, vice versa, certain concepts used by Western people are unknown to Mansi people, therefore there are no lexicalized terms for them.
The construction of a Mansi wordnet helps us explore how a wordnet can be built for a minority language that is at the same time an endangered language. Thus, we will investigate the following issues in this paper:
• What are the specialties of constructing a wordnet for a minority language?
• What are the specialties of constructing a wordnet for an endangered language?
• What are the specialties of constructing a wordnet for Mansi?
The paper has the following structure. First, the Mansi language will be shortly presented from linguistic, sociolinguistic and language policy perspectives. Then our methods to build the Mansi wordnet will be discussed, with special emphasis on specific challenges as regards endangered and minority languages in general and Mansi in particular. Later, statistical data will be analysed and our results will be discussed in detail. Finally, a summary will conclude the paper.
2
The Mansi Language
Mansi (former term: Vogul) is an extremely endangered indigenous Uralic (more precisely Finno-Ugric, Ugric, Ob-Ugric) language, spoken in Western Siberia, especially on the territory of the Khanty-Mansi Autonomous Okrug. Among the approximately 13,000 people who declared themselves to be ethnic Mansi in the latest Russian federal census in 2010, only 938 stated that they could speak the Mansi language.
The Mansi have traditionally lived on hunting, fishing and, to a lesser extent, reindeer breeding; they became acquainted with agriculture and urban lifestyle essentially during the Soviet period. The principles of Soviet linguistic policy according to which the Mansi literary language was designed kept changing from time to time. After using Latin transcription for a short period, Mansi language planners had to switch to Cyrillic transcription in 1937. While until the 1950s the more general tendency was to create new Mansi words to describe formerly unknown phenomena, later on the usage of Russian loanwords became more dominant. As a result of these tendencies, some of the terms describing the contemporary environment, urban lifestyle and the Russian-dominated culture are Russian loanwords, while others are Mansi neologisms created by Mansi linguists and journalists. It is not uncommon to find two or even three different synonyms describing the same phenomenon (for example, hospital): borrowing the word from Russian (больница), using the Russian loanword in a form adapted to Mansi phonology (пульница), or using a Mansi neologism to describe it (махум пусмалтан кол ‘a house for healing people, hospital’, as opposed to няврам пусмалтан кол ‘children’s hospital, children’s clinic’ or уйхул пусмалтан кол ‘veterinary clinic’).
3
Semi-automatic construction of the Mansi WordNet
In this section, we will present our methods to construct the Mansi WordNet. We will also pay special attention to the most challenging issues concerning wordnet building.
3.1
Low number of native speakers
The first and greatest problem we met while creating the Mansi wordnet was that only a handful of native speakers have been trained in linguistics. Thus, we worked with specialists of the Mansi language who have been trained in linguistics and technology, but do not have native competence in Mansi.
As it is not worthwhile to build a WordNet from scratch and as our annotators are native speakers of Hungarian, we used the Hungarian WordNet (Miháltz et al., 2008) as a starting point. First, we decided to include basic synsets, and the number of synsets is planned to be expanded continuously later on. We used Basic Concepts – already introduced in EuroWordNet – as a starting point: this set contains the synsets that are considered the most basic conceptual units universally.
3.2
Already existing resources
In order to accelerate the whole task and to ease
the work of Mansi language experts, the WordNet
creating process was carried out semi-automatically.
Since there is no native speaker available who could
solve the problems requiring native competence, we
were forced to utilize the available sources as creatively as possible.
First, the basic concept sets of the Hungarian
WordNet XML file were extracted and at the same
time, the non-lexicalized elements were filtered as
in this phase, we intend to focus only on lexicalized
elements.
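The extraction-and-filtering step might be sketched as follows; note that the XML element and attribute names here are invented for illustration, and the Hungarian WordNet’s actual schema may well differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: synsets carry <literal> children, with an attribute
# marking whether the literal is lexicalized in the language.
xml = """<wordnet>
  <synset id="ENG20-00001740-n">
    <literal lexicalized="yes">dolog</literal>
    <literal lexicalized="no">entitás-szerű</literal>
  </synset>
</wordnet>"""

root = ET.fromstring(xml)
for synset in root.iter("synset"):
    for lit in synset.findall("literal"):
        if lit.get("lexicalized") == "no":
            synset.remove(lit)  # drop non-lexicalized elements in this phase

print([l.text for l in root.iter("literal")])
```

Only the lexicalized literals survive the filter, matching the decision to postpone non-lexicalized elements.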
Second, we used a Hungarian-Mansi dictionary
to create possible translations for the members of
the synsets. The dictionary we use in the process
is based on different Mansi-Russian dictionaries
(e.g. Rombandeeva (2005), Balandin and Vahruševa
(1958), Rombandeeva and Kuzakova (1982)). The
translation of all Mansi entries to Hungarian and to
English in the new dictionary is being done independently of WordNet developing (Vincze et al., 2015).
In order not to get all Hungarian entries of the
WordNet translated to Mansi again, a program code
was developed to replace the Hungarian terms with
the already existing translations from the dictionary.
Only literals are replaced, definitions and examples
are left untouched, so that the linguists can check
the actual meaning and can replace them with their
Mansi equivalents. The Mansi specialists’ role is
to check the automatic replacement and to give new
term candidates if there is no proper automatic translation.
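The replacement step could look roughly like the following sketch; the data structures and the TODO flag are our illustrative assumptions, not the project’s actual code:

```python
# Hypothetical sketch of the semi-automatic replacement step: Hungarian
# literals are swapped for dictionary translations where one exists;
# definitions and examples are left untouched for the linguists to check.
def replace_literals(synsets, hu_to_mansi):
    """synsets: list of dicts with 'literals', 'definition', 'examples'."""
    for synset in synsets:
        new_literals = []
        for literal in synset["literals"]:
            if literal in hu_to_mansi:
                new_literals.append(hu_to_mansi[literal])
            else:
                # no translation yet: keep the Hungarian literal, flagged
                # as a candidate for the Mansi specialists to translate
                new_literals.append(literal + "<TODO>")
        synset["literals"] = new_literals
    return synsets

synsets = [{"literals": ["kutya"], "definition": "hu def", "examples": []}]
print(replace_literals(synsets, {"kutya": "амп"}))
```

Keeping the untranslated literals flagged rather than deleted is what lets the specialists review the automatic replacement afterwards.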
In this work phase, as there are no synonym dictionaries or thesauri available for the Mansi language, the above-mentioned bilingual student dictionaries are used as primary resources. These dictionaries were designed to be used during school classes; they rarely contain any synonyms, antonyms or hypernyms, and hardly any phrases or set locutions. (Most of these dictionaries were written by the same authors, thus – besides the inconsistent marking of vowel length – fortunately we do not have to pay special attention to possible contradictions or incoherence.) Hence the unbalanced situation in which we are missing either the Mansi translation or the Mansi definition belonging to the same code, and we are able to present the translation, the definition and the examples of usage together only in a few exceptional instances. The sentences illustrating usage in the
synset come from our Mansi corpus, built from articles from the Mansi newspaper called Luima Seripos
published online semimonthly at http://www.khanty-yasang.ru/luima-seripos. In its final version, our corpus will contain over 1,000,000 tokens, roughly 400,000 coming from the
online publications and the rest from the archived
PDF files.
Even if based on the Hungarian WordNet, the elements of the Mansi WordNet can be matched to
the English ones and those of other wordnets since
the Hungarian WN itself is paired with the Princeton WordNet (Miller et al., 1990).
3.3
Bear language
Another special problem that occurred during wordnet building in Mansi is the status of the so-called “bear language”.
The bear is a prominently sacred animal venerated
by Mansi, bearing great mythical and ritual significance, and also surrounded by a detailed taboo language. Since the bear is believed to understand the
human speech (and also to have sharp ears), it is
respectful and cautious to use taboo words while
speaking about the bear, the parts of its body, or
any activity connected with the bear (especially bear
hunting) so that the bear would not understand it.
The taboo words of this “bear language” may be divided into two major subgroups: Mansi words which
have a different, special meaning when used in connection with the bear (e.g. сосыг ‘currant’, but also meaning ‘eye’ when speaking of the bear’s eyes), and those which may be used solely in connection with the bear (e.g. хащлы ‘to be angry’, as opposed to кантлы ‘to be angry’ speaking of a human). Even the word for bear belongs to the taboo words and has only periphrastic synonyms like Вортолн ойка ‘an old man from the forest’ etc.
As a first approach, taboo words were included
as literals in the synsets because their usage is restricted in the sense that they can solely be used in
connection with bears. Hence, first we marked the
special status of these literals, for which purpose we
applied the note “bear”. However, it would have
also been practical to well differentiate the synsets
that are connected to “bears”. This can be realized
in many ways: for example, the “bear”-variants of
the notions should be the hyponyms of their respective notions, like хащлы ‘to be angry’, which can be considered as a hyponym of кантлы ‘to be angry’ speaking of a human. However, this solution is
not a perfect one since (i) this is not a widespread
method either in WordNets of other languages and
therefore it would not facilitate WordNet-based dictionaries and (ii) it is not a true hyponym, that is, a
real subtype of their respective notion connected to
humans. Finally, we decided to put these notions in
separate synsets, which has the advantage that these
notions are grouped together and it is easier to do a
targeted search on these expressions.
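The separate-synset solution with a dedicated “bear” link might be represented along the following lines; this is an illustrative sketch only, and the Mansi WordNet’s actual storage format (an XML file) may differ:

```python
# Illustrative representation of the novel "bear" relation: taboo synsets
# are stored separately and linked to their "normal" counterparts.
class Synset:
    def __init__(self, literals, definition):
        self.literals = literals
        self.definition = definition
        self.relations = []  # list of (relation_name, target_synset)

    def add_relation(self, name, target):
        self.relations.append((name, target))

angry_human = Synset(["кантлы"], "to be angry (of a human)")
angry_bear = Synset(["хащлы"], "to be angry (used only of the bear)")
angry_bear.add_relation("bear", angry_human)

# A targeted search over the taboo vocabulary becomes a filter on the relation.
def bear_synsets(synsets):
    return [s for s in synsets if any(r == "bear" for r, _ in s.relations)]

print([s.literals for s in bear_synsets([angry_human, angry_bear])])
```

Because the relation is explicit, the taboo synsets stay grouped together without being forced into an artificial hyponymy.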
4
Results
The manual correction of the automatically translated Basic Concept Set 1 is in progress. Currently,
the online xml file contains 300 synsets. These
synsets had altogether 410 literals, thus a synset had
1.37 literals on average; this proportion was 1.88 in the original Hungarian WordNet xml file. Concerning the proportion of the two part-of-speech categories, nouns prevail over verbs, with 210 nouns (70%) as opposed to 90 verbs (30%).
Presumably 40% of all lexicon entries are multi-word expressions, regardless of word class or derivational processes. In many cases when the Russian word refers to a special post or professional person, the proper Mansi word is a roundabout phrase. For example, учитель ‘schoolteacher (masc.)’ can be translated as няврамыт ханисьтан хум, built up of the elements ‘children-teaching man’, and the feminine counterpart учительница ‘schoolteacher (fem.)’ as няврамыт ханисьтан нэ, from ‘children-teaching woman’. The multi-word expressions are highly variable in their elements: the dedicated parts can be replaced with synonyms, or new ones can be added to enrich the layers of senses. The number of multi-word expressions in this version of the Mansi WordNet is 74, that is, 18% of all literals.
Section 3.2 enumerated some challenges about
transforming an already existing WordNet to Mansi.
Some synsets in the Basic Concept Set also have
proved to be difficult to handle. For example, the
Mansi language is only occasionally (if ever) used
in scientific discourse. Therefore, the terms ‘unconscious process’, ‘physiology’ or ‘geographical creature’ cannot have any Mansi equivalents and therefore can be included in the Mansi WordNet only as
non-lexicalized items. The number of such literals
is 34, that is 16% of all literals.
5
Discussion
Building a wordnet for a minority or endangered language can pose several challenges. Some of these are also relevant for dead languages; however, wordnets for e.g. Latin (Minozzi, 2009), Ancient Greek (Bizzoni et al., 2014) and Sanskrit (Kulkarni et al., 2010) prove that these facts do not necessarily constitute an obstacle to wordnet construction. Here we summarize the most important challenges and how we solved them while constructing the Mansi wordnet.
5.1
Wordnet construction for minority and endangered languages
First, linguistic resources, e.g. mono- and multilingual dictionaries, may be at our disposal only to a limited extent and, second, there might be some areas of daily life where only the majority language is used, hence the minority language has only a limited vocabulary in that respect. As for the first challenge, we could rely on the Mansi-Russian-English-Hungarian dictionary under construction, which is itself based on Mansi-Russian dictionaries (see above), and we made use of its entries in the semi-automatic building process. However, if no such resources are available, wordnets for minority languages have to be constructed fully manually. For dead languages which are well documented and have a lot of linguistic descriptions and dictionaries (like Latin and Ancient Greek), this is a less serious problem.
As for the second challenge, we applied two strategies: we introduced non-lexicalized synsets for those concepts that do not exist in the Mansi language, or we included an appropriate loanword from Russian.
Besides being a minority language, Mansi is also an endangered language. Almost none of its native speakers have been trained in linguistics, which rules out the possibility of having native speakers as annotators. Thus, linguist experts specialized in the Mansi language have been employed as wordnet builders and, in case of need, they can contact native speakers for further assistance. This problem is also relevant for dead languages, where there are no native speakers at all; however, we believe that linguists with advanced knowledge of the given language can also fully contribute to wordnet building.
5.2
Specialties of wordnet construction for Mansi
Wordnet building for Mansi also led to some theoretical innovations. As there is a subvocabulary
of the Mansi language related to bears (see above),
we intended to reflect this distinction in the wordnet too. For that reason, we introduced the novel
relation “bear”, which connects the synsets that are only used in connection with bears to the synsets that include their “normal” equivalents. All this means that adding new languages to the spectrum may also have theoretical implications which contribute to the linguistic richness of wordnets.
6
Conclusions
In this paper, we reported the construction of a wordnet for Mansi, an endangered minority language spoken in Russia. As we intend to make the Mansi wordnet freely available for everyone, we hope that this newly created language resource will contribute to the revitalization of the Mansi language.
In the future, we would like to extend the Mansi wordnet with new synsets. Moreover, we intend to create applications that make use of this language resource, for instance, online dictionaries and linguistic games for learners of Mansi.
Acknowledgments
This work was supported in part by the Finnish Academy of Sciences and the Hungarian National Research Fund, within the framework of the project Computational tools for the revitalization of endangered Finno-Ugric minority languages (FinUgRevita). Project number: OTKA FNN 107883; AKA 267097.
References
Antonietta Alonge, Nicoletta Calzolari, Piek Vossen, Laura Bloksma, Irene Castellon, Maria Antonia Marti, and Wim Peters. 1998. The Linguistic Design of the EuroWordNet Database. Computers and the Humanities. Special Issue on EuroWordNet, 32(2-3):91–115.
A.N. Balandin and M.I. Vahruševa. 1958. Mansijsko-russkij slovar’ s leksičeskimi paralelljami iz južno-mansijskogo (kondinskogo) dialekta. Prosvešenije, Leningrad.
Pushpak Bhattacharyya, Christiane Fellbaum, and Piek Vossen, editors. 2010. Principles, Construction and Application of Multilingual Wordnets. Proceedings of GWC 2010. Narosa Publishing House, Mumbai, India.
Yuri Bizzoni, Federico Boschetti, Harry Diakoff, Riccardo Del Gratta, Monica Monachini, and Gregory Crane. 2014. The making of Ancient Greek WordNet. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 1140–1147, Reykjavik, Iceland, May. European Language Resources Association (ELRA). ACL Anthology Identifier: L14-1054.
Christiane Fellbaum and Piek Vossen, editors. 2012. Proceedings of GWC 2012. Matsue, Japan.
M. Kulkarni, C. Dangarikar, I. Kulkarni, A. Nanda, and P. Bhattacharya. 2010. Introducing Sanskrit WordNet. In Principles, Construction and Application of Multilingual Wordnets. Proceedings of the Fifth Global WordNet Conference (GWC 2010), Mumbai, India. Narosa Publishing House.
Márton Miháltz, Csaba Hatvani, Judit Kuti, György Szarvas, János Csirik, Gábor Prószéky, and Tamás Váradi. 2008. Methods and Results of the Hungarian WordNet Project. In Proceedings of the Fourth Global WordNet Conference (GWC 2008), pages 311–320, Szeged. University of Szeged.
George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4):235–244.
Stefano Minozzi. 2009. The Latin WordNet project. In Peter Anreiter and Manfred Kienpointner, editors, Latin Linguistics Today. Akten des 15. Internationalen Kolloquiums zur Lateinischen Linguistik, volume 137 of Innsbrucker Beiträge zur Sprachwissenschaft, pages 707–716.
Heili Orav, Christiane Fellbaum, and Piek Vossen, editors. 2014. Proceedings of GWC 2014. Tartu, Estonia.
E.I. Rombandeeva and E.A. Kuzakova. 1982. Slovar’ mansijsko-russkij i russko-mansijskij. Prosvešenije, Leningrad.
E.I. Rombandeeva. 2005. Russko-mansijskij slovar’. Mirall, Sankt-Peterburg.
Attila Tanács, Dóra Csendes, Veronika Vincze, Christiane Fellbaum, and Piek Vossen, editors. 2008. Proceedings of GWC 2008. University of Szeged, Department of Informatics, Szeged, Hungary.
Dan Tufiş, Dan Cristea, and Sofia Stamou. 2004. BalkaNet: Aims, Methods, Results and Perspectives. Romanian Journal of Information Science and Technology. Special Issue on BalkaNet, 7(1-2):9–43.
Veronika Vincze, Ágoston Nagy, Csilla Horváth, Norbert Szilágyi, István Kozmács, Edit Bogár, and Anna Fenyvesi. 2015. FinUgRevita: Developing Language Technology Tools for Udmurt and Mansi. In Proceedings of the First International Workshop on Computational Linguistics for Uralic Languages, Tromsø, Norway, January.
WNSpell: a WordNet-Based Spell Corrector
Bill Huang
Princeton University
[email protected]
Abstract
This paper presents a standalone spell corrector, WNSpell, based on and written for WordNet. It is aimed at generating the best possible suggestion for a mistyped query but can also serve as an all-purpose spell corrector. The spell corrector consists of a standard initial correction system, which evaluates word entries using a multifaceted approach to achieve the best results, and a semantic recognition system, wherein given a related word input, the system will adjust the spelling suggestions accordingly. Both feature significant performance improvements over current context-free spell correctors.
1
Introduction
WordNet is a lexical database of English words and serves as the premier tool for word sense disambiguation. It stores around 160,000 word forms, or lemmas, and 120,000 word senses, or synsets, in a large graph of semantic relations. The goal of this paper is to introduce a spell corrector for the WordNet interface, directed at correcting queries and aiming to take advantage of WordNet’s structure.
1.1
Previous Work
Work on spell checkers, suggesters, and correctors began in the late 1950s and has developed into a multifaceted field. First aimed at simply detecting spelling errors, the task of spelling correction has grown exponentially in complexity.
The first attempts at spelling correction utilized edit distance, such as the Levenshtein distance, where the word with minimal distance would be chosen as the correct candidate.
Soon, probabilistic techniques using noisy channel models and Bayesian properties were invented. These models were more sophisticated, as they also considered the statistical likeliness of certain errors and the frequency of the candidate word in literature.
Two other major techniques were also being developed. One was similarity keys, which used properties such as the word’s phonetic sound or first few letters to vastly decrease the size of the dictionary to be considered. The other was the rule-based approach, which implements a set of human-generated common misspelling rules to efficiently generate a set of plausible corrections and then matches these candidates against a dictionary.
With the advent of the Internet and the subsequent increase in data availability, spell correction has been further improved. N-grams can be used to integrate grammatical and contextual validity into the spell correction process, which standalone spell correction is not able to achieve. Machine learning techniques, such as neural nets, using massive online crowdsourcing or gigantic corpora, are being harnessed to refine spell correction more than could be done manually.
Nevertheless, spell correction still faces significant challenges, though most lie in understanding context. Spell correction in other languages is also incomplete, as despite significant work in English lexicography, relatively little has been done in other languages.
1.2
This Project
Spell correctors are used everywhere from simple spell checking in a word document to query completion/correction in Google to context-based in-passage corrections. This spell corrector, as it is for the WordNet interface, will focus on spell correction on a single word query, with the additional possibility of a user-inputted semantically related word on which to base corrections.
2
Correction System
The first part of the spell corrector is a standard context-free spell corrector. It takes in a query such as speling and will return an ordered list of three possible candidates; in this case, it returns the set {spelling, spoiling, sapling}.
The spell corrector operates similarly to the Aspell and Hunspell spell correctors (the latter of which serves as the spell checker for many applications varying from Chrome and Firefox to OpenOffice and LibreOffice). The spell corrector we introduce here, though not as versatile in terms of support for different platforms, achieves far better performance.
To tune the spell corrector to WordNet queries, stress is placed on bad misspellings over small errors. We will mainly use the Aspell data set (547 errors), kindly made public by the GNU Aspell project, to test the performance of the spell corrector. Though the mechanisms of the spell corrector are inspired by logic and research, they are included and adjusted mainly based on empirical tests on the above data set.
2.1
Generating the Search Space
To improve performance, the spell corrector needs to implement a fine-tuned scoring system for each candidate word. Clearly, scoring each word in WordNet’s dictionary of 150,000 words is not practical in terms of runtime, so the first step to an accurate spell corrector is always to reduce the search space of correction candidates.
The search space should contain all possible reasonable sources of the spelling error. These errors in spelling arise from three separate stages (Deorowicz and Ciura, 2005):
1. Idea → thought word
i.e. distrucally → destructfully
2. Thought word → spelled word
i.e. egsistance → existence
3. Spelled word → typed word
i.e. autocorrecy → autocorrect
The main challenges regarding search space generation are:
1. Containment of all, or nearly all, possible reasonable corrections
2. Reasonable size
3. Reasonable runtime
There have been several approaches to this search space problem, but all have significant drawbacks in one of the criteria of search space generation:
• The simplest approach is the lexicographic approach, which simply generates a search space of words within a certain edit distance away from the query. Though simple, this minimum edit distance technique, introduced by Damerau in 1964 and Levenshtein in 1966, only accounts for type 3 (and possibly type 2) misspellings. The approach is reasonable for misspellings of up to edit distance 2, as Norvig’s implementation of this runs in ∼0.1 seconds, but time complexity increases exponentially, and for misspellings such as funetik → phonetic that are a significant edit distance away, this approach will not be able to contain the correction without sacrificing both the size of the search space and the runtime.
• Another approach is using phonetics, as misspelled words will most likely still have similar phonetic sounds. This accounts for type 2 misspellings, though not necessarily type 1 or type 3 misspellings. Implementations of this approach, such as using the SOUNDEX code (Odell and Russell, 1918), are able to efficiently capture misspellings such as funetik → phonetic, but not misspellings like rypo → typo. Again, this approach is not sufficient in containing all plausible corrections.
• A similarity key can also be used. The similarity key approach stores each word under a key, along with other similar words. One implementation of this is the SPEED-COP spell corrector (Pollock and Zamora, 1984), which takes advantage of the usual alphabetic proximity of misspellings to the correct word. This approach accounts for many errors, but there are always a large number of exceptions, as the misspellings do not always have similar keys (such as the misspelling zlphabet → alphabet).
• Finally, the rule-based approach uses a set of common misspelling patterns, such as im → in or y → t, to generate possible sources of the typing error. The most complicated approach, these spell correctors are able to contain the plausible corrections for most spelling errors quite well, but will miss many of the bad misspellings. The implementation by Deorowicz and Ciura using this approach is quite effective, though it can be improved.
Our approach with this spell corrector is to use a combination of these approaches to achieve the best results. Each approach has its strengths and weaknesses, but cannot achieve a good coverage of the plausible corrections without sacrificing size and runtime. Instead, we take the best of each approach to much better contain the plausible corrections of the query.
To do this, we partition the set of plausible corrections into groups (not necessarily disjoint, but with a very complete union) and consider each separately:
• Close mistypings/misspellings:
This group includes typos of edit distance 1 (typo → rypo) and misspellings of edit distance 1 (consonent → consonant), as well as repetition of letters (mispel → misspell). These are easy to generate, running in O(n log n α) time, where n is the length of the entry and α is the size of the alphabet, to generate and check each word (though increasing the maximum distance to 2 would result in a significantly slower time of O(n^2 log n α^2)).
• Similar phonetic keys:
This group includes words with a phonetic key that differs by an edit distance of at most 1 from the phonetic key of the entry (funetik → phonetic), and also does a very good job of including typos/misspellings of edit distance greater than 1 (it actually includes the first group completely, but for pruning purposes, the first group is considered separately) in very little time O(Cn), where C ∼ 52 × 9.
• Exceptions:
This group includes words that are not covered by either of the first two groups but are still plausible corrections, such as lignuitic → linguistic. We observe that most of these exceptions either still have similar beginnings and endings to the original word and are close edit distance-wise, or are simply too far removed from the entry to be plausible. Searching through words with similar beginnings that also have similar endings (through an alphabetically-sorted list) proves to be very effective in including the exceptions, while taking very little time.
As many generated words, especially from the
later groups, are clearly not plausible corrections,
candidate words of each type are then pruned with
different constraints depending on which group
they are from. Words in later groups are subject to
tougher pruning, and the finding of a close match
results in overall tougher pruning.
For instance, many words in the second group
are quite far removed from the entry and completely implausible as corrections (e.g. zjpn →
[00325] → [03235] → suggestion), while those
that are simply caused by repetition of letters (e.g.
lllooolllll → loll) are almost always plausible, so
the former group should be more strictly pruned.
Finally, since the generated search space after
group pruning can be quite large (up to 200), depending on the size of the search space, the search
space may be pruned, repetitively, until the size of
the search space is of an acceptable size.
Some factors considered during pruning include:
• Words with similar phonetic key:
We implement a precalculated phonetic key
for each word in WordNet, which uses a numerical representation of the first five consonant sounds of the word:
0:
1:
2:
3:
4:
5:
6:
7:
8:
(ignored) a, e, i, o, u, h, w, [gh](t)
b, p
k, c, g, j, q, x
s, z, c(i/e/y), [ps], t(i o), (x)
d, t
m, n, [pn], [kn]
l
r
f, v, (r/n/t o u)[gh], [ph]
• Length of word
Each word in WordNet is then stored in an
array with indices ranging from [00000] (no
consonants) to [88888] and can be looked up
quickly.
• Letters contained in word
• Phonetic key of word
137
• First and last letter agreement
• Number of syllables
• Frequency of word in text (COCA corpus)
• Edit distance

This process successfully generates a search space that rarely misses the desired correction, while keeping both a small size in number of words and a fast runtime.

2.2 Evaluating Possibilities

The next step is to assign a similarity score to all of the candidates in the search space. The scoring must be accurate enough to discern that disurn → discern but disurn ↛ disown, and versatile enough to figure out that funetik → phonetic.

Our approach is a modified version of Church and Gale's probabilistic scoring of spelling errors. In this approach, each candidate correction c is scored following the Bayesian combination rule:

    P(c) = p(c) · max_t ∏_i p(t_i | c_i)

where p(c) is the frequency of the candidate correction and p(t_i | c_i) is the probability of each edit operation in a sequence of edit operations that generates the correction. The cost of each operation is then scored logarithmically based on its probability, with c(t_i | c_i) ∝ −log p(t_i | c_i). The correction candidates are then sorted, with lower cost meaning higher likelihood.

We use bigram error counts generated from a corpus (Jones and Mewhort, 2004) to determine the values of c(t | c). Two sets of counts were used:

• Error counts:
  – Deletion of letter β after letter α
  – Addition of letter β after letter α
  – Substitution of letter β for letter α
  – Adjacent transposition of the bigram αβ

• Bigram/monogram counts (log scale):
  – Monograms α
  – Bigrams αβ

First, we smooth all the counts using add-k smoothing (where we set k = 1/2), as there are numerous counts of 0. Since the bigram/monogram counts were retrieved in log format, for simplicity of data manipulation we only smooth the counts of 0, changing their values to −0.69 (originally undefined). We then calculate c(t_i | c_i) as:

    c(t_i | c_i) = k_1 log(1 / p(α → β)) + k_2

where p(α → β) is the probability of the edit operation, and k_1, k_2 are factors that adjust the cost depending on the uncertainty of small counts and the increased likelihood of errors if errors are already present.

For the different edit operations, p(x → y) is:

    deletion:      del′(xy) / N′(xy)
    addition:      add′(xy) · N / (N′(x) · N′(y))
    substitution:  sub′(xy) · N / (N′(x) · N′(y))
    reversal:      rev′(xy) / N′(xy)

and for deletion and addition of letters at the beginning of a word:

    deletion:      del′(.y) / N′(.y)
    addition:      add′(.y) · N · w / N′(y)

To evaluate the minimum cost min_t Σ_i c(t_i | c_i) of a correction, we use a modified Wagner–Fischer algorithm, which finds the minimum in O(mn) time, where m and n are the lengths of the entry and the correction candidate, respectively. This is done for all candidate corrections in the search space generated in (2.1).

Now, the probabilistic scoring by itself is not always accurate, especially in cases such as funetik → phonetic. Thus, we modify the scoring of each candidate correction to significantly improve the accuracy of the suggestions:

    C(c) = c(c) + min_t Σ_i c(t_i | c_i)

• Instead of setting c(c) = −log p(c), we find that using c(c) as a multiplicative constant in the form of a function f(c)^γ, where f(c) is the frequency of the word in the corpus and γ an empirically determined constant, yields significantly more accurate predictions.

• We add empirically determined multiplicative factors λ_i pertaining to the following properties of the entry and the candidate correction:
  – Same phonetic key (not restricted to first 5 consonant sounds)
  – Same aside from repetition of letters
  – Same consonants (ordered)
  – Same vowels (ordered)
  – Same set of letters
  – Similar set of letters
  – Same number of syllables
  – Same after removal of es
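As a concrete illustration, the cost evaluation described above can be sketched as a Wagner–Fischer dynamic program over per-operation costs, with a frequency-based prior folded in afterwards. This is a minimal sketch, not the WNSpell implementation: the flat operation costs, the treatment of the prior f(c)^γ as a divisor of the edit cost, and the omission of transpositions are all simplifying assumptions.

```python
# Hypothetical per-operation costs c(t_i | c_i); in the paper these are
# derived from smoothed bigram error counts, c ∝ −log p(t_i | c_i).
# Flat stand-in costs are used here for illustration only.
def op_cost(op, x, y):
    return {"del": 1.0, "add": 1.0, "sub": 1.2, "match": 0.0}[op]

def min_edit_cost(entry, cand):
    """Wagner–Fischer: minimum total cost of an edit sequence turning
    entry into cand, computed in O(mn) time."""
    m, n = len(entry), len(cand)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + op_cost("del", entry[i - 1], "")
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + op_cost("add", "", cand[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = "match" if entry[i - 1] == cand[j - 1] else "sub"
            d[i][j] = min(
                d[i - 1][j] + op_cost("del", entry[i - 1], ""),
                d[i][j - 1] + op_cost("add", "", cand[j - 1]),
                d[i - 1][j - 1] + op_cost(sub, entry[i - 1], cand[j - 1]),
            )
    return d[m][n]

def rank(entry, candidates, freq, gamma=0.1):
    """Rank candidates by ascending cost, applying the frequency prior
    f(c)**gamma as a multiplicative adjustment (an assumption here)."""
    scored = []
    for c in candidates:
        prior = freq.get(c, 1) ** gamma
        scored.append((min_edit_cost(entry, c) / prior, c))
    return [w for _, w in sorted(scored)]
```

With real counts, `op_cost` would depend on the specific letters involved, so that, for example, a vowel-for-vowel substitution is cheaper than a consonant swap.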
(Note that other factors were considered, but the factors pertaining to them were insignificant.)

The candidate corrections are then ordered by their modified costs C′(c) = C(c) · ∏_i λ_i, and the top three results, in order, are returned to the user.

3 Semantic Input

The second part of the spell corrector adds a semantic aspect to the correction of the search query. When users have trouble entering the query and cannot immediately choose a suggested correction, they are given the option to enter a semantically related word. WNSpell then takes this word into account when generating suggestions, harnessing WordNet's vast semantic network to further optimize results.

This added dimension in spell correction is very helpful for the more severe errors, which usually arise from the "idea → thought word" stage in spelling. These are much harder to deal with than conventional mistypings or misspellings, and are exactly the type of error WNSpell needs to be able to handle (as mistyped or even misspelled queries can be fixed without too much trouble by the user). The semantic anchor the related word provides helps WNSpell establish the "idea" behind the desired word and thus refine the suggestions for it.

To incorporate the related word into the suggestion generation, we add some modifications to the original context-free spell corrector.

3.1 Adjusting the Search Space

One of the issues in search space generation in the original system is that a small fraction of plausible corrections are still missed, especially for more severe errors. To improve coverage, we modify the search space to also include a nucleus of plausible corrections generated semantically, not just lexicographically. Since the missed corrections are lexicographically difficult to generate, a semantic approach is more effective at increasing coverage.

The additional group of word forms is generated as follows:

1. For each synset of the related word, we consider all synsets related to it by some semantic pointer in WordNet.
2. All lemmas (word forms) of these synsets are evaluated.
3. Lemmas that share the same first letter or the same last letter, and are not too far away in length, are added to the group.

The inclusion of the additional group is indeed very effective in capturing the missed corrections, and the group remains relatively small in size. Some examples of missed words captured in this group from the training set are (entry, correct, related):

• autoamlly, automatically, mechanically
• conibation, contribution, donation

3.2 Adjusting the Evaluation

We also modify the scoring process of each candidate correction to take semantic distance into account. First, each candidate correction is assigned a semantic distance d (higher means more similar) based on Lesk distance:

    d = max_i max_j s(r_i, c_j)

which takes the maximum similarity over all pairs of definitions of the related word r and the candidate c, where the similarity s is measured by:

    s(r_i, c_j) = Σ_{w ∈ R_i ∩ C_j, w ∉ S} (k − ln(n_w + 1))

This considers words w in the intersection of the two definitions that are not stopwords and weights them by the smoothed frequency n_w of w in the COCA corpus (as rarity is related to information content) and some appropriate constant k.

Additionally, if r or c is found in the other word's definition, we also add to the similarity s of the two definitions a term a(k − ln(n_{r/c} + 1)) for some appropriate constant a > 1. This resolves many issues that come up with hypernyms/hyponyms (among others), where two similar words are assigned a low score because the only words their definitions have in common may be the words themselves.

We also consider the number n of shared subsequences of length 3 between r and c, which is very helpful in ruling out semantically similar but lexicographically unlikely words.

We then adjust the cost function C′ by:

    C″ = C′ / ((d + 1)^α (n + 1)^β)

for some empirically determined constants α and β. The new costs are then sorted and the top three results returned to the user.

4 Results

We used the Aspell data set to train the system. The test set consists of 547 hard-to-correct words. This is ideal for our purposes, as we are focusing on correcting bad misspellings as well as the easy ones. Most of the empirically derived constants from (3.2) were determined based on results from this data set.

4.1 Without Semantic Input

We compare the results of WNSpell with a few popular spellcheckers: Aspell, Hunspell, Ispell, and Word, as well as with the system of Deorowicz and Ciura, which seems to have the best results on the Aspell test set so far (other approaches are based on unavailable or incompatible data sets). Ideally, for comparison, each spell checker would be run on the same lexicon and on the same computer for consistent results. However, due to technical constraints, this is rather infeasible. Instead, we use the results posted by the authors of the spell checkers, which, despite some uncertainty, still yield consistent and comparable results.

First, we compare our generated search space with the lists returned by Aspell, Hunspell, Ispell, and Word (Atkinson). We use a subset of the Aspell test set containing all entries whose corrections are in all five dictionaries. The results are shown in Table 1.

Table 1: Search Space Results

    Method             % found    Size (0/50/100%)
    WNSpell            97.4       1 / 10 / 66
    Aspell (0.60.6n)   90.1       2 / 12 / 100
    Hunspell (1.1.12)  83.2       1 / 4 / 15
    Ispell (3.1.20)    54.8       0 / 1 / 29
    Word 97            75.4       0 / 2 / 20

Compared to these spell correctors, WNSpell clearly does a significantly better job of containing the desired correction than Aspell, Hunspell, Ispell, or Word, within a set of words of acceptable size.

We now compare the top three words returned on the list with those returned by Aspell, Hunspell, Ispell, and Word. We also include data from Deorowicz and Ciura (DC), which also uses the Aspell test set. Since the dictionaries used were different, we also include Aspell results on their subset of the Aspell test set, denoted Aspell (n). The results are shown in Table 2, and a graphical comparison is shown in Figure 1.

Table 2: Aspell Test Set Results (% Identified)

    Method             Top 1   Top 2   Top 3   Top 10
    WNSpell            77.5    88.5    91.2    96.1
    Aspell (0.60.6n)   54.3    63.0    72.9    87.1
    Hunspell (1.1.12)  58.2    71.5    76.6    82.3
    Ispell (3.1.20)    40.1    47.9    50.4    54.1
    Word 97            62.6    69.4    72.7    75.4
    Aspell (n)         56.9    66.9    74.7    87.9
    DC                 66.3    75.5    79.6    85.5

Figure 1: Aspell test set results (% identified) against Top n, comparing WNSpell, Aspell (0.60.6n), Hunspell (1.1.12), Word 97, Aspell (n), and DC.

Once again, WNSpell significantly outperforms the other five spell correctors.

We also test WNSpell on the Aspell common misspellings test set, a list of 4206 common misspellings and their corrections. Since the word corrector was not trained on this set, it is a blind comparison. Once again, we use a subset containing all entries whose corrections are in all five dictionaries. The results are shown in Tables 3 and 4, and a graphical comparison is shown in Figure 2.
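The Top-n percentages reported in these tables can be reproduced with a small evaluation harness of the following shape. This is an illustrative sketch rather than the authors' tooling; `suggest` stands in for any corrector that returns a ranked candidate list.

```python
def top_n_accuracy(test_pairs, suggest, ns=(1, 2, 3, 10)):
    """test_pairs: list of (misspelling, correction) pairs.
    suggest: function mapping a misspelling to a ranked candidate list.
    Returns {n: percentage of entries whose correction is in the top n}."""
    hits = {n: 0 for n in ns}
    for entry, correct in test_pairs:
        ranked = suggest(entry)
        for n in ns:
            if correct in ranked[:n]:
                hits[n] += 1
    total = len(test_pairs)
    return {n: round(100.0 * hits[n] / total, 1) for n in ns}
```

Running such a harness over the 547-word Aspell set and the 4206-word blind set with each corrector's suggestion function would yield tables of the form shown here.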
Table 3: Blind Search Space Results

    Method             % found    Size (0/50/100%)
    WNSpell            98.4       1 / 4 / 50
    Aspell (0.60.6n)   97.7       1 / 9 / 100
    Hunspell (1.1.12)  97.3       1 / 5 / 15
    Ispell (3.1.20)    85.2       0 / 1 / 26

Table 4: Blind Test Set Results

    Method             Top 1   Top 2   Top 3   Top 10
    WNSpell            91.4    96.3    97.6    98.3
    Aspell (0.60.6n)   73.6    81.2    92.0    97.0
    Hunspell (1.1.12)  80.8    92.0    95.0    97.3
    Ispell (3.1.20)    77.4    82.7    84.3    85.2

Figure 2: Blind test set results (% identified) against Top n for WNSpell, Aspell (0.60.6n), Hunspell (1.1.12), and Ispell (3.1.20).

Additionally, WNSpell runs reasonably fast: it takes ∼13 ms per word, while Aspell takes ∼3 ms, Hunspell ∼50 ms, and Ispell ∼0.3 ms. Thus, WNSpell is a very efficient standalone spell corrector, achieving superior performance within acceptable runtime.

4.2 With Semantic Input

We test WNSpell with the semantic component on the original training set, this time with added synonyms. For each word in the training set, a human-generated related word is supplied as input. With the addition of the semantic adjustments, WNSpell performs considerably better than without them. The results are shown in Table 5, and a graphical comparison in Figure 3.

Table 5: Semantic Results (% Identified)

    Method     Top 1   Top 2   Top 3   Top 10
    with       87.4    93.0    96.5    99.1
    without    77.5    88.5    91.2    96.1

Figure 3: Semantic results (% identified) against Top n, with and without semantic input.

The runtime for WNSpell with semantic input, however, is rather slow, at an average of ∼200 ms.

5 Conclusions

The WNSpell algorithm introduced in this paper presents a significant improvement in accuracy on standalone spelling correction over other systems, including the most recent version of Aspell and other commercially used spell correctors such as Hunspell and Word, by approximately 20%. WNSpell takes into account a variety of factors regarding different types of spelling errors, using a carefully tuned algorithm to account for much of the diversity of spelling errors presented in the test data sets. An efficient sample space pruning system, strongly improved by a phonetic key, restricts the number of words to be considered, and an accurate scoring system then compares the words. The accuracy of WNSpell in correcting hard-to-correct words is quite close to that of most people's abilities and significantly stronger than that of other methods.

WNSpell also provides an alternative using a related word to help the system find the desired correction even if the user is far off the mark in terms of spelling or phonetics. This added feature once again significantly increases the accuracy of
WNSpell by approximately 10%, by directly connecting the idea word the user has in mind to the word itself. This link allows users who only know the rough meaning of their desired word, or the context it appears in, to actually find the word.

5.1 Limitations

The standalone algorithm currently does not take into consideration vowel phonetics, which are rather complex in the English language. For instance, the query spoak would be corrected into speak rather than spoke. While a person easily corrects spoak, WNSpell cannot use the fact that spoke sounds the same while speak does not. Rather, all three have the consonant sounds s, p, k and differ from spoak by one letter, but an evaluation of edit distance finds that speak is clearly closer, so the algorithm chooses speak instead.

WNSpell, a spell corrector targeting single-word queries, also does not have the benefit of the contextual clues most modern spell correctors use.

5.2 Future Improvements

As mentioned earlier, introducing a vowel phonetic system into WNSpell would increase its accuracy. The semantic feature of WNSpell can be improved either by pruning down the algorithm to improve performance, or by incorporating other closeness measures of words into the algorithm. One possible addition is the use of distributional semantics, such as searching for similar words with pre-trained word vectors (such as Word2Vec).

Additionally, WNSpell-like spell correctors can be implemented in many languages rather easily, as WNSpell does not rely very heavily on the morphology of the language (though it requires some statistics of letter frequencies as well as simplified phonetics). This portability is quite useful, as WordNet is implemented in over a hundred languages, so WNSpell can be ported to other, non-English WordNets.

References

K. Atkinson. "Spell Checker Test Kernel Results." http://aspell.net/test/.

K.W. Church and W.A. Gale. 1991. "Probability Scoring for Spelling Correction." AT&T Bell Laboratories.

Corpus of Contemporary American English. http://corpus.byu.edu/coca/.

S. Deorowicz and M.G. Ciura. 2005. "Correcting Spelling Errors by Modeling their Causes." Int. J. Appl. Math. Comp. Sci., Vol. 15, No. 2, 275-285.

M.N. Jones and J.K. Mewhort. 2004. "Case-Sensitive Letter and Bigram Frequency Counts from Large-Scale English Corpora." Behavior Research Methods, Instruments, & Computers, 36(3), 388-396.

D. Jurafsky and J.H. Martin. 1999. Speech and Language Processing. Prentice Hall.

R. Mishra and N. Kaur. 2013. "A Survey of Spelling Error Detection and Correction Techniques." International Journal on Computer Trends and Technology, Vol. 4, No. 3, 372-374.

P. Norvig. "How to Write a Spelling Corrector." http://norvig.com/spell-correct.html.
Sophisticated Lexical Databases - Simplified Usage: Mobile
Applications and Browser Plugins For Wordnets
Diptesh Kanojia
CFILT, CSE Department,
IIT Bombay,
Mumbai, India
[email protected]

Raj Dabre
School of Informatics,
Kyoto University,
Kyoto, Japan
[email protected]

Pushpak Bhattacharyya
CFILT, CSE Department,
IIT Bombay,
Mumbai, India
[email protected]

Abstract

India is a country with 22 officially recognized languages, and 17 of these have WordNets, a crucial resource. Web browser based interfaces are available for these WordNets, but are not suited for mobile devices, which deters people from effectively using this resource. We present our initial work on developing mobile applications and browser extensions to access WordNets for Indian languages.

Our contribution is twofold: (1) We develop mobile applications for the Android, iOS and Windows Phone OS platforms for the Hindi, Marathi and Sanskrit WordNets, which allow users to search for words and obtain more information about them, along with their translations in English and other Indian languages. (2) We also develop browser extensions for the English, Hindi, Marathi, and Sanskrit WordNets, for both Mozilla Firefox and Google Chrome. We believe that such applications can be quite helpful in a classroom scenario, where students would be able to access the WordNets as dictionaries as well as lexical knowledge bases. This can help in overcoming the language barrier along with furthering language understanding.

1 Introduction

India is among the topmost countries in the world in terms of language diversity. According to the 2001 census, there are 1,365 rationalized mother tongues, 234 identifiable mother-tongues and 122 major languages (http://en.wikipedia.org/wiki/Languages_of_India). Of these, 29 languages have more than a million native speakers, 60 have more than 100,000 and 122 have more than 10,000 native speakers. With this in mind, the IndoWordNet (Bhattacharyya, 2010) project for the construction of Indian WordNets was initiated, an effort undertaken by over 12 educational and research institutes headed by IIT Bombay. The Indian WordNets were inspired by the pioneering work of the Princeton WordNet (Fellbaum, 1998), and currently there exist WordNets for 17 Indian languages, with the smallest one having around 14,900 synsets and the largest one, Hindi, having 39,034 synsets and 100,705 unique words. Each WordNet is accessible through a web interface; among these, the Hindi WordNet (Narayan et al., 2002), Marathi WordNet and Sanskrit WordNet (Kulkarni et al., 2010) were developed at IIT Bombay (http://www.cfilt.iitb.ac.in/). The WordNets are updated daily, and the updates are reflected on the websites the next day. We have developed mobile applications for the Hindi, Marathi and Sanskrit WordNets, which are, to the best of our knowledge, the first of their kind.

This paper is organized as follows: Section 2 gives the motivation for the work. Section 3 describes the applications, with screenshots and the nitty-gritty details. We describe the browser extensions in Section 4, and we conclude the paper with conclusions and future work in Section 5. At the very end, some screenshots of the applications and browser extensions are provided.
2 Motivation

According to recent statistics, about 117 million Indians are connected to the Internet through mobile devices ("Internet Trends 2014" report by Mary Meeker, Kleiner Perkins Caufield & Byers). It is common knowledge that websites like Facebook, Twitter, LinkedIn, Gmail and so on can be accessed using their web browser based interfaces, but the mobile applications developed for them are much more popular. This is a clear indicator that browser based interfaces are inconvenient on mobile devices, which was the main motivation behind our work. We studied the existing interfaces and the WordNet databases and developed applications for the Android, iOS and Windows Phone platforms, which we have extensively tested and plan to release to the public as soon as possible.

Our applications and plugins are applicable in the following use cases:

1. Consider an educational classroom scenario, where students, often belonging to different cultural and linguistic backgrounds, wish to learn languages. They would be able to access the WordNets as dictionaries for multiple Indian languages. This would help overcome the language barrier which often hinders communication and, thus, understanding. The cost-effective and readily available "Aakash" tablet (http://www.akashtablet.com/) will be one of the means by which our application will be accessed by educational institutes across India.

2. Tourists traveling to India can use the WordNet mobile apps for basic survival communication, because Indian language WordNets contain many culture- and language-specific concepts, meanings for which may not even be available through an internet search.

3. People who read articles on the internet may come across words they do not understand, and can benefit from our plugins, which can help translate words and give detailed information about them at the click of a button.

4. Linguists, who happen to be experts at lexical knowledge, can use the WordNet apps as well as the plugins to acquire said knowledge irrespective of whether they have mobile phones or PCs.

Apart from the cases mentioned above, there are many other cases where our apps and plugins can be used effectively.

3 Mobile WordNet Applications

In the subsections below we describe the features of the applications, accompanied by screenshots.

3.1 Home Screen

When the user starts the application, the home screen (Figure 1) is shown, with a brief description of how to use it and the link which takes the user to the search interface.

Figure 1: Home Screen

3.2 Search Interface

We have provided the user with two types of input mechanisms: phonetic transliteration using the Google Transliteration API (https://developers.google.com/transliterate/), and a JavaScript based online keyboard (Figure 2) for the input of Hindi Unicode characters. Transliteration is very convenient for a native user; in case the user does not know the right combination of keys, the keyboard for Devanagari is provided. These two methods ensure that all words can be easily entered for searching. Thereafter, by touching / clicking on "Search", the synsets with all relevant information are retrieved.

Figure 2: Devanagari Keyboard

3.3 Search Process

Indian languages are fairly new to the web, and despite standard UTF encoding of characters, there remain a few steps to be taken to sanitize the input for a WordNet search. The steps taken by us are given below.

3.3.1 Nukta Normalization for Hindi

Hindi characters such as क (ka), ख (kha), ग (ga), ज (ja), ड (ḍa), ढ (ḍha), फ (pha), झ (jha) take a nukta symbol to form क़ (qa), ख़ (kḫa), ग़ (ġa), ज़ (za), ड़ (da), ढ़ (ṛha), फ़ (fa), झ़ (zha), respectively. These characters occur twice in the Unicode chart: once with the nukta as a separate Unicode character, and once adjoined to the parent character. We normalize the input to the standard Unicode encoding, with the nukta as a separate character, before search.

3.3.2 Morphological Analysis

Before searching in the databases, the word is first passed to a morphological analyzer to obtain its root form. We use the Hindi Morph Analyzer (Bahuguna et al., 2014) to return the root form of the input word for Hindi, since by principle WordNet only contains root forms of words. Due to the non-availability of morphological analyzers for the other languages, we are not yet able to include them in the search process, though in the future we can use a fully automated version of the "Human mediated TRIE based generic stemmer" (Bhattacharyya et al., 2014) to obtain root forms for the other languages.

3.3.3 Handling Multiple Root Forms

Our interface also requests the user to select the preferred root if more than one root form is returned after morphological analysis. The user can then select one, and the synset retrieval process is initiated on the server. This gives the user more control and choice over the results. We assume that while searching the WordNet, a user may not be familiar with all the senses of a word, or all its morphology; it may be that the user came across the word on the internet and is using our plugin to search the WordNet. This feature enables the user to select the appropriate root, or to check all the possibilities for the correct answer.

3.4 Application Design

We have used the WebView class and URL loading from the Android SDK (https://developer.android.com/) and the Windows Phone SDK (https://dev.windows.com/enus/develop/download-phone-sdk) to display a responsive layout of the WordNets. WebView renders the application pages seamlessly onto mobile / handheld devices, thus making the application usable on mobile, tablet, and any other handheld
device of any size. Similarly, for iOS, we have used the UIWebView class with some scaling measures to render the pages with a responsive layout on the device screen. Our application is compatible with all iOS devices and will be deployed to the Apple App Store soon.

A preliminary check of the internet connection is done before connecting to the web interface, and a retry button is provided up front in case an internet connection is not detected.

3.5 Search Results

The results returned by the server are interpreted by the application pages and displayed in a very simple manner. We display all synsets for each part of speech and all senses of the word, initially showing the synset words, gloss and example. The senses are categorized by their part of speech. We have conformed to the principles of good user interface design and provided for an incremental information display.

3.5.1 Additional Information

If the user wishes to see the synset relations and the translations of the word in other synsets, the link "Relations and Languages" should be clicked to give a list of all additional information that can be displayed. Relations like hypernymy and hyponymy, and the relevant synset in the other 16 languages, can be displayed. Please refer to Figure 3 for an example.

3.5.2 Current Drawbacks

The current version of Android OS (Lollipop 5.0), deployed on most smartphones, does not support rendering of the Gujarati, Punjabi, and Nepali scripts on all devices; the language support also depends on the device manufacturer. Hence, these languages are currently disabled in the interface.

Also, our applications are currently online only, and can be used only if the user is connected to the internet. We plan to implement an offline version of our applications.

4 Browser Extensions

Major WordNets of the world are available via web interfaces, enabling a user to search for senses using a web browser on a computer or mobile device. The process commonly involves a user navigating to a web page and searching the required 'word' for its senses. In a world where getting things done in one click is important, we feel that the process of searching needs to be simplified, and we develop browser extensions to ease this process. Google Chrome and Mozilla Firefox are the most popular web browsers among web users (http://gs.statcounter.com/#all-browser-ww-monthly-201506-201506-bar). Our approach makes the search quite simple and is summarized in the following 3 steps:

• The user highlights the word of interest and right-clicks the page, or clicks on the plugin shortcut.

• They click the context menu option 'Search <relevant> WordNet for . . .'.

• A new tab opens up showing the information from the relevant WordNet.

We present sample context menu screenshots, post installation, in Figures 6 and 7, respectively.

5 Conclusions and Future Work

In this era of handheld mobile devices, there is a great need to make traditional web services available as mobile applications, which are extremely popular. Our success in developing mobile applications for the Hindi, Marathi and Sanskrit WordNets, along with browser plugins for the English, Hindi, Marathi and Sanskrit WordNets to simplify word look-up, is the first step in providing people with easy access to such important knowledge bases. We have described a variety of quite realistic use cases for our apps and plugins, especially in India, where language and cultural diversity is very high. These can have a huge impact on language education, especially in rural areas, along with enabling people to understand a multitude of languages.

We plan to make offline search available in our apps. We also plan to improve this application to enable searching for words belonging to all languages through a common interface, via language detection, and to incorporate word suggestions as words are being typed, so that the user is presented with better lexical choices. Plugins like PeraPera for Japanese and Chinese are quite popular, since they simply provide lexical information when the user hovers the mouse over words; implementing such a feature is something we plan to do in the immediate future. We would also publish our application and browser plugin source code publicly for research purposes.
6
Acknowledgment
We gratefully acknowledge the support of the
Department of Electronics and Information
Technology, Ministry of Communications and
IT, Government of India. We also thank Ravi
Nambudripad, for replicating the application
in other languages and the entire computational linguistics group at Centre For Indian
Language Technology, IIT Bombay, which has
provided its valuable input and critique, helping us refine our work.
References
Dipak Narayan, Debasri Chakrabarti, Prabhakar Pande, and Pushpak Bhattacharyya. 2002. An Experience in Building the IndoWordNet - a WordNet for Hindi. In Proceedings of the First International Conference on Global WordNet (GWC 2002), Mysore, India.
Christiane D. Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.
Pushpak Bhattacharyya. 2010. IndoWordNet. In Proceedings of the Lexical Resources and Evaluation Conference (LREC 2010), Malta, May 2010.
Ankit Bahuguna, Lavita Talukdar, Pushpak Bhattacharyya, and Smriti Singh. 2014. HinMA: Distributed Morphology based Hindi Morphological Analyzer. In Proceedings of the 11th International Conference on Natural Language Processing (ICON 2014), December 2014.
Pushpak Bhattacharyya, Ankit Bahuguna, Lavita Talukdar, and Bornali Phukon. 2014. Facilitated Multi-Lingual Sense Annotation: Human Mediated Lemmatizer. In Proceedings of the Global WordNet Conference 2014 (GWC 2014), January 2014.
Malhar Kulkarni, Chaitali Dangarikar, Irawati Kulkarni, Abhishek Nanda, and Pushpak Bhattacharyya. 2010. Introducing Sanskrit Wordnet. In Proceedings of the Global WordNet Conference 2010 (GWC 2010), January 2010.
Figure 3: Screen-shots of Search Results
Figure 6: Browser Extensions Context Menu
for word ‘specific’
Figure 4: Search Results with Malayalam,
Tamil, and Telugu Synsets
Figure 7: Browser Extensions Context Menu
for word 'िहंसा' (hiMsaa) translated to ‘violence’
Figure 5: Search Results with English, Bengali, and Marathi Synsets
A picture is worth a thousand words: Using OpenClipArt
library to enrich IndoWordNet
Diptesh Kanojia, Shehzaad Dhuliawala, and Pushpak Bhattacharyya
Centre for Indian Language Technology,
Computer Science and Engineering Department,
IIT Bombay,
Mumbai, India
{diptesh,shehzaadzd,pb}@cse.iitb.ac.in
Abstract
WordNet has proved to be immensely useful for Word Sense Disambiguation, and thence Machine Translation, Information Retrieval, and Question Answering. It can also be used as a dictionary for educational purposes. The semantic nature of concepts in a WordNet motivates one to try to express this meaning in a more visual way. In this paper, we describe our work on enriching IndoWordNet with image acquisitions from the OpenClipArt library. We describe an approach used to enrich WordNets for eighteen Indian languages. Our contribution is threefold: (1) we develop a system which, given a synset in English, finds an appropriate image for the synset; the system uses the OpenClipArt library (OCAL) to retrieve images and ranks them; (2) after retrieving the images, we map the results along with the linkages between the Princeton WordNet and the Hindi WordNet, to link several synsets to corresponding images, choosing and sorting the top three images per synset based on our ranking heuristic; (3) we develop a tool that allows a lexicographer to manually evaluate these images. The top images are shown to a lexicographer by the evaluation tool for the task of choosing the best image representation; the lexicographer also selects the number of relevant images. Using our system, we obtain an Average Precision (P@3) score of 0.30.

1 Introduction
Our goal is to enrich the semantic lexicon of various Indian languages by mapping it with images from the OpenClipArt library (Phillips, 2005). India is currently experiencing a major enhancement in the digital education sector with its vision of the 'Digital India' program (http://www.digitalindia.gov.in/). In this paper, we introduce an approach to enrich IndoWordNet (http://www.cfilt.iitb.ac.in/indowordnet) (Bhattacharyya, 2010) with images, which can help students and language enthusiasts alike. We envision the use of WordNets in the education sector to promote language research among young students, and to provide them with a multilingual resource which eases their study of languages. WordNets have proven to be a rich lexical resource for many NLP subtasks such as Machine Translation (MT) and Cross-Lingual Information Retrieval.
India has 22 official languages, written in more than 8 scripts. When a user reads a concept in a language that is not known to them, and moreover in an unknown script, an image can provide helpful insight into the concept. Language learners in a multilingual country like this often face difficulty mainly due to: (a) not being able to find a mapping between the concept in the language being studied and their native language, and (b) not being able to decipher the script of the language being learnt. In such cases a pictorial representation of a concept is very useful.
Finally, systems for automatic image captioning and real-time video summarization can leverage the power of image-enriched WordNets.
1.1 WordNets and IndoWordNet
WordNets are lexical structures composed of synsets and semantic relations (Fellbaum, 1998). Such a lexical knowledge base is at the heart of an intelligent information processing system for Natural Language Processing and Understanding. IndoWordNet is one such rich online lexical database, containing more than twenty thousand parallel synsets for eighteen languages, including English. It uses the Hindi WordNet as a pivot to link all these languages. The first WordNet was built in English at Princeton University (http://www.wordnet.princeton.edu). WordNets for European languages (http://www.illc.uva.nl/EuroWordNet/) followed (Vossen, 1998), and then IndoWordNet. IndoWordNet has approximately 25,000 synsets linked to the Princeton WordNet. We use these linkages to mine English words from the Princeton WordNet, which form the basis of our query to the OpenClipArt API. We download the images via their URLs and store them locally, to map them to Hindi WordNet (http://www.cfilt.iitb.ac.in/wordnet/webhwn/) (Dipak Narayan and Bhattacharyya, 2002) synset IDs later.
The paper is organized as follows. In Section 2, we describe related work. In Sections 3 and 4, we describe our architecture and the retrieval procedure, along with the scoring algorithm. We describe the results obtained in Section 5, and the evaluation tool and qualitative analysis in Sections 6 and 7, respectively. We conclude in Section 8.
Figure 1: Our Architecture

2 Related Work
Bond et al. (2009) used OCAL to enhance the Japanese WordNet, and were able to mine 874 links for 541 synsets. On the basis of manual scoring, they found 62 illustrations that were best suited to the sense, 642 illustrations that were a good representation, and 170 that were suitable but imperfect. We extend their work to IndoWordNet, and use OCAL to mine the illustrations. ImageNet (Deng et al., 2009) is a similar project for the Princeton WordNet which provides images/URLs for a concept; it contains 21,841 synsets indexed with 14,197,122 images. We present a much simpler methodology: collecting images from the web, using the synset words to find overlaps with image tags, and then mapping them.

3 Our Architecture
This section gives the detailed architecture of our system; a diagrammatic representation is shown in Figure 1. We also discuss the structure of IndoWordNet and how we link it to the retrieved set of images.

3.1 Dataset
A linked Hindi-English synset mapping is required to mine the image-synset mapping for Hindi. OpenClipArt contains URL tags in English, and thus a linked Hindi-English synset data structure was required. For our work, we use the following data sets:

3.1.1 Hindi Database
The latest version of the Hindi WordNet is available for download at http://www.cfilt.iitb.ac.in/wordnet/webhwn/downloaderInfo.php, which provides an offline interface along with the database in text format.
3.1.2 English Database
The latest version of the Princeton WordNet is available for download at https://wordnet.princeton.edu/wordnet/download/. It provides both the latest database and standalone installers for WordNet.
3.1.3 Hindi-English Linkage Database
WordNets have been built for around 100 different languages, and efforts towards mapping synsets across WordNets have been going on for a while in various parts of the world. IndoWordNet contains 28,446 synsets linked to the Princeton WordNet, out of which 21,876 are nouns. For those Hindi concepts that have no direct linkage in the English WordNet, it was decided to link them to a
hypernymy synset in English. The idea was that, instead of having no linkage at all, there would be at least a super-ordinate concept and lexical item(s) to which the Hindi concept could be linked, providing weak translation candidates that could be exploited for various NLP tasks. IndoWordNet has 11,582 direct linkages and 8,184 hypernymy linkages. We use only the 11,582 directly linked noun concepts to mine OCAL.
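The linkage-based mining described above can be sketched in Python as follows; `direct_links`, `hypernym_links`, and `english_words` are hypothetical stand-ins for the linkage databases described in the paper, not its actual data.

```python
# Sketch of mining English query words via IndoWordNet linkages:
# prefer a direct Hindi-to-English linkage, otherwise fall back to the
# hypernymy linkage (a weak translation candidate).

direct_links = {101: 2001}      # Hindi synset ID -> linked English synset ID
hypernym_links = {102: 2002}    # Hindi synset ID -> English hypernym synset ID
english_words = {2001: ["candle", "taper"], 2002: ["artifact"]}

def english_query_words(hindi_id):
    """Return English words usable as an image-search query for a Hindi synset."""
    eng_id = direct_links.get(hindi_id) or hypernym_links.get(hindi_id)
    if eng_id is None:
        return []
    return english_words[eng_id]

print(english_query_words(101))  # via direct linkage
print(english_query_words(102))  # via hypernym fallback
```

Note that the paper restricts mining to directly linked noun concepts; the hypernym fallback above only illustrates how the weaker linkages could be used.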
4 Retrieval procedure and scoring
We use the OpenClipArt API (https://openclipart.org/search/json/) to retrieve a set of results using the head word from a synset as the query, since OpenClipArt is a free-to-use resource, unlike Google Search results, which might return copyrighted data. The API provides JSON output which can be easily parsed using any programming language; we use Java for this purpose. The result for each image provides the following data:
• The title of the image
• The tags for the image
• The URL of the image
To rank the results, we calculate a score based on overlaps between the synsets and the image meta-data. The score is derived as a weighted overlap between the words in the title and tags of the result image and the words of the synset. Words from each part are given a different weight owing to how useful the feature is in describing the image; for example, words from the title are given a higher weight than words from the image tags. The algorithm increases the score if an overlap occurs and decrements the score otherwise. The magnitude of this increase or decrease depends on the weights of the words being compared. Our system allows all these weights to be tweaked.
After the result images are scored, they are sorted by this score. Only the top three scoring images are downloaded. These downloaded images are then evaluated by lexicographers.

Algorithm 1 Image scoring algorithm
1:  procedure Image-Scoring
2:    score := 0
3:    weight(ImageTags) := w
4:    cost(ImageTags) := c
5:    for each token i ∈ ImageTags do
6:      for each token j ∈ Synset do
7:        if i = j then
8:          score := score + w
9:        else
10:         score := score - c
11:       end if
12:     end for
13:   end for
14: end procedure

5 Results
Using the methodology described above, we map several synsets of the Indian language WordNets to the available images. A total of 8,183 Hindi synsets for directly linked nouns were mapped to their corresponding images. We perform manual evaluation of the data using the tool described in Section 6, and have evaluated approximately 3,000 synsets for each of the languages; the manual evaluation of the mapping is ongoing.
Table 1 gives, for each language, the number of synsets for which images have been found, the number of evaluated images, the number of correctly mapped images, and the precision score.
The top three images are shown to a trained linguist, who decides the winner image and also calculates the precision (P@3) of the results for that synset. Over a set of 8,183 images, we obtain a precision (P@3) of 0.30.

6 Evaluation Tool
We create a PHP (http://php.net/) based interface and provide it to lexicographers and linguists for evaluation of the images obtained. The tool uses a MySQL (https://www.mysql.com/) database at the back-end to store both the Hindi and English WordNet databases, and uses the synset ID as a pivot to display the images obtained. The tool presents the Hindi synset words, the concept, and the English words to help the lexicographer identify the proper sense. The lexicographer chooses a winner image out of the top three, or none of
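The weighted-overlap scoring of Algorithm 1 can be sketched in Python as follows; the weight values and the sample synset and image metadata are illustrative, not the paper's.

```python
# Runnable sketch of Algorithm 1: compare every image metadata token with
# every synset word, adding w on a match and subtracting c on a mismatch.
# Title tokens would use a higher w than tag tokens, as the paper describes.

def image_score(image_tokens, synset_words, w=2.0, c=0.5):
    """Weighted overlap between an image's metadata tokens and a synset."""
    score = 0.0
    for token in image_tokens:
        for word in synset_words:
            if token == word:
                score += w
            else:
                score -= c
    return score

synset = ["candle", "taper", "wax light"]
images = {
    "img1": ["candle", "flame"],  # one overlapping token
    "img2": ["car", "vehicle"],   # no overlap
}
# Sort candidate images by score, best first; "img1" ranks first here.
ranked = sorted(images, key=lambda name: image_score(images[name], synset),
                reverse=True)
print(ranked)
```

In the paper only the top three scoring images per synset are downloaded for manual evaluation.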
Languages    Images Obtained   Evaluated Images   Accurate Images   Precision
Hindi        8183              3851               1154              0.30
Assamese     5198              2860               771               0.27
Bengali      7823              3851               1154              0.30
Bodo         5138              2835               765               0.27
Gujarati     7736              3787               1134              0.30
Kannada      5695              2883               870               0.30
Kashmiri     6705              3470               1043              0.30
Konkani      7548              3686               1110              0.30
Malayalam    6504              3427               954               0.28
Manipuri     5299              2907               780               0.27
Marathi      6863              3452               1031              0.30
Nepali       3959              2163               584               0.27
Sanskrit     7812              3851               1154              0.30
Tamil        7272              3611               1083              0.30
Telugu       5728              2980               834               0.28
Punjabi      5889              3186               896               0.28
Urdu         5096              2683               684               0.25
Oriya        7412              3660               1034              0.28

Table 1: No. of synsets linked to images
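The Precision column in Table 1 is consistent with the ratio of accurate to evaluated images, rounded to two decimals. A quick check in Python over three rows of the table:

```python
# Verify Precision = Accurate Images / Evaluated Images for sample rows
# of Table 1 (values copied from the table above).
rows = {
    "Hindi":    (3851, 1154, 0.30),
    "Assamese": (2860, 771, 0.27),
    "Urdu":     (2683, 684, 0.25),
}
for lang, (evaluated, accurate, reported) in rows.items():
    computed = round(accurate / evaluated, 2)
    assert computed == reported
    print(lang, computed)
```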
Figure 3: Accurately acquired images
Figure 2: Screen-shot of the Evaluation Tool
these, in case no image is relevant. They were also requested to tick the relevant images. A screen-shot of our interface is shown in Figure 2.
7 Qualitative Analysis
In this section, we explain the work done to evaluate the resultant images and the analysis of the results.

7.1 No images found
From the 11,573 synsets that were chosen to be tagged with images, we were unable to retrieve images for 3,390 synsets from OpenClipArt, due to unavailability in the source. Our analysis shows that most of the synsets for which a suitable image could not be retrieved fell into two major categories:

Abstract nouns: Several of the synsets for which no images could be retrieved fell into the category of abstract nouns. For example, the synset "गुलछरार्" ("gUlchharra") - 4939, which translates to "profligacy, extravagance", returned no images.

Complex synsets: Apart from abstract nouns, several complex synsets returned no results. For example, the synset "शारी रक तरल पदाथर्" ("shAririk TaRal PadArth") - 1644, which translates to "liquid body substance", was unable to fetch any results.

We believe that synsets falling into the first category, i.e. abstract nouns, were too vague for an image to do justice to the concept. However, synsets falling into the second category display the limitations of the OpenClipArt database, and further the need to look at more than one image source.
7.2 Images found
Amongst the synsets for which some images were retrieved, a link was noticed between the class of the noun and how well the image was able to explain the synset.

7.2.1 Common Nouns
Our methodology performs well in this case, and most of the images obtained were able to correctly and almost completely explain the concept. For example, the synset "मोमबत्ी" ("momBatti") - 9866, meaning "candle", and the synset "म स्जद" ("maszid") - 2900, meaning "mosque", retrieved very accurate results, as shown in figures 3.4 and 3.2, respectively.

7.3 Proper Nouns
Our retrieval performs well for proper nouns. We were able to obtain pictures for most of the synsets which represent a country; the country flag and map were retrieved for each country name. Several Indian monuments obtained good images, along with several Hindu deities. The illustration for the synset "िवष्णु" ("viShnU") - 2185, translating to the named entity "Vishnu", is shown in figure 3.3.

7.4 Abstract Nouns
Several images were unable to illustrate their corresponding abstract nouns. A few cases of good images were obtained; for example, the synset "हड़कंप" ("HaDKamp") - 3366, meaning "panic", was illustrated by image 3.1.

8 Conclusion and Future Work
We successfully identified images for synsets of Indian languages and described our work on enriching IndoWordNet. Many synsets could not be linked due to the lack of appropriate images on OCAL. We also created a tool for manual evaluation of the data, which can be reused for other such work in the future. We evaluated the images obtained, and report the highest precision score as 0.30. As future work, we aim to retrieve these images using other open-source image databases, and to utilize the gloss and examples for finding overlaps. The concept of Content-Based Image Retrieval (CBIR) also appears to be a viable option for the several Indian-language synsets which cannot be directly linked to a single corresponding English synset. Using CBIR, we can harness the resources of several untagged image databases, and thus further enrich IndoWordNet as a resource.

9 Acknowledgment
We gratefully acknowledge the support of the Department of Electronics and Information Technology, Ministry of Communications and IT, Government of India. We also acknowledge the annotation work done in this task by Rajita Shukla, Jaya Saraswati, Meghna Singh, Laxmi Kashyap, Ankit, and Amisha. Also not to be missed is the entire computational linguistics group at CFILT, IIT Bombay, which has provided valuable input and critique, helping us refine our task.

References
Pushpak Bhattacharyya. 2010. IndoWordNet. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).
Francis Bond, Hitoshi Isahara, Sanae Fujita, Kiyotaka Uchimoto, Takayuki Kuribayashi, and Kyoko Kanzaki. 2009. Enhancing the Japanese WordNet. In Proceedings of the 7th Workshop on Asian Language Resources (ALR7), pages 1-8, Stroudsburg, PA, USA. Association for Computational Linguistics.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR 2009.
Dipak Narayan, Debasri Chakrabarti, Prabhakar Pande, and Pushpak Bhattacharyya. 2002. An Experience in Building the IndoWordNet - a WordNet for Hindi. In Proceedings of the First International Conference on Global WordNet (GWC'02), Mysore, India, January.
Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA; London, May.
Jonathan Phillips. 2005. Introduction to the Open Clip Art Library.
Piek Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Norwell, MA, USA.
Using Wordnet to Improve Reordering in Hierarchical Phrase-Based
Statistical Machine Translation
Arefeh Kazemi†, Antonio Toral⋆, Andy Way⋆
† Department of Computer Engineering, University of Isfahan, Isfahan, Iran
{kazemi}@eng.ui.ac.ir
⋆ ADAPT Centre, School of Computing, Dublin City University, Ireland
{atoral,away}@computing.dcu.ie
Abstract
We propose the use of WordNet synsets in a syntax-based reordering model for hierarchical statistical machine translation (HPB-SMT) to enable the model to generalize to phrases not seen in the training data but that have equivalent meaning. We detail our methodology to incorporate synsets' knowledge in the reordering model and evaluate the resulting WordNet-enhanced SMT systems on the English-to-Farsi language direction. The inclusion of synsets leads to the best BLEU score, outperforming the baseline (standard HPB-SMT) by 0.6 points absolute.

1 Introduction
Statistical Machine Translation (SMT) is a data-driven approach for translating from one natural language into another. Natural languages vary in their vocabularies and also in the manner in which they arrange words in the sentence. Accordingly, SMT systems should address two interrelated problems: finding the appropriate words in the translation ("lexical choice") and predicting their order in the sentence ("reordering"). Reordering is one of the hardest problems in SMT and has a significant impact on the quality of the translation, especially between languages with major differences in word order. Although SMT systems deliver state-of-the-art performance in machine translation nowadays, they perform relatively weakly at addressing the reordering problem.
Phrase-based SMT (PB-SMT) is arguably the most widely used approach to SMT to date. In this model, the translation operates on phrases, i.e. sequences of words whose length is between 1 and a maximum upper limit. In PB-SMT, reordering is generally captured by distance-based models (Koehn et al., 2003) and lexical phrase-based models (Tillmann, 2004; Koehn et al., 2005), which are able to perform local reordering but cannot capture non-local (long-distance) reordering. The weakness of PB-SMT systems in handling long-distance reordering led to the Hierarchical Phrase-based SMT (HPB-SMT) model (Chiang, 2005), in which the translation operates on tree structures (either derived from a syntactic parser or unsupervised). Despite the relatively good performance offered by HPB-SMT in medium-range reordering, it is still weak on long-distance reordering (Birch et al., 2009).
A great deal of work has been carried out to address the reordering problem by incorporating reordering models (RM) into SMT systems. A RM tries to capture the differences in word order in a probabilistic framework and assigns a probability to each possible order of words in the target sentence. Most reordering models can perform reordering of common words or phrases relatively well, but they cannot be generalized to unseen words or phrases with the same meaning ("semantic generalization") or the same syntactic structure ("syntactic generalization"). For example, if in the source language the object follows the verb and in the target language it precedes the verb, these models still need to see particular instances of verbs and objects in the training data to be able to perform the required reordering between them. Likewise, if two words in the source language follow a specific reordering pattern in the target language, these models cannot generalize to unseen words with equivalent meaning in the same context.
In order to improve the syntactic and semantic generalization of the RM, it is necessary to incorporate syntactic and semantic features into the model. While there has been some encouraging work on integrating syntactic features into the RM, to the best of our knowledge, there has been no previous work on integrating semantic
Reordering Model                                   | Feature Types                | Features
Zens and Ney (2006)                                | lexical                      | surface forms of the source and target words; unsupervised class of the source and target words
Cherry (2013)                                      | lexical                      | surface forms of frequent source and target words; unsupervised class of rare source and target words
Green et al. (2010)                                | lexical, syntactic           | surface forms of the source words; POS tags of the source words; relative position of the source words; sentence length
Bisazza and Federico (2013) and Goto et al. (2013) | lexical, syntactic           | surface forms and POS tags of the source words; surface forms and POS tags of the source context words
Gao et al. (2011) and Kazemi et al. (2015)         | lexical, syntactic           | surface forms of the source words; dependency relation
The proposed method                                | lexical, syntactic, semantic | surface forms of the source words; dependency relation; synset of the source words

Table 1: An overview of the features used in state-of-the-art reordering models
features. In this paper we enrich a recently proposed syntax-based reordering model for the HPB-SMT system (Kazemi et al., 2015) with semantic features. To be more precise, we use WordNet (http://wordnet.princeton.edu/) (Fellbaum, 1998) to incorporate semantics into our RM. We report experimental results on a large-scale English-to-Farsi translation task.
The rest of the paper is organized as follows. Section 2 reviews the related work and puts our work in its proper context. Section 3 introduces our RM, which is then evaluated in Section 4.2. Finally, Section 5 summarizes the paper and discusses avenues for future work.
2 Related Work
Many different approaches have been proposed to capture long-distance reordering by incorporating a RM into PB-SMT or HPB-SMT systems. A RM should be able to perform the required reorderings not only for common words or phrases, but also for phrases unseen in the training data that hold the same syntactic and semantic structure. In other words, a RM should be able to make syntactic and semantic generalizations. To this end, rather than conditioning on actual phrases, state-of-the-art RMs generally make use of features extracted from the phrases of the training data. One useful way to categorize previous RMs is by the features that they use to generalize. These features can be divided into three groups: (i) lexical features, (ii) syntactic features, and (iii) semantic features. Table 1 shows a representative selection of state-of-the-art RMs along with the features that they use for generalization.
Zens and Ney (2006) proposed a maximum-entropy RM for PB-SMT that tries to predict the orientation between adjacent phrases based on various combinations of some features: surface forms of the source words, surface forms of the target words, unsupervised class of the source words, and unsupervised class of the target words. They show that unsupervised word-class based features perform almost as well as word-based features, and combining them results in small gains. This motivates us to consider incorporating supervised semantic-based word classes into our model.
Cherry (2013) integrates sparse phrase-orientation features directly into a PB-SMT decoder. As features, he used the surface forms of frequent words and the unsupervised cluster of uncommon words. Green et al. (2010) introduced a discriminative RM that scores different jumps in the translation depending on the source words, their Part-of-Speech (POS) tags, their relative position in the source sentence, and the sentence length. This RM fails to capture rare long-distance reorderings, since it typically over-penalizes long jumps, which occur much more rarely than short jumps (Bisazza and Federico, 2015). Bisazza and Federico (2013) and Goto et al. (2013) estimate, for each pair of input positions x and y, the probability of translating y right after x, based on the surface forms and POS tags of the source words and of the source context words.


Gao et al. (2011) and Kazemi et al. (2015) proposed a dependency-based RM for HPB-SMT which uses a maximum-entropy classifier to predict the orientation between pairs of constituents. They examined two types of features: the surface forms of the constituents and the dependency relation between them. Our approach is closely related to these two works, as we are also interested in predicting the orientation between pairs of constituents. Similarly to (Gao et al., 2011; Kazemi et al., 2015), we train a classifier based on features extracted from the constituent pairs, but on top of lexical and syntactic features, we use semantic features (WordNet synsets) in our RM. In this way, our model can generalize to unseen phrases that follow the same semantic structure.

3 Method
Following Kazemi et al. (2015), we implement a syntax-based RM for HPB-SMT based on the dependency tree of the source sentence. The dependency tree of a sentence shows the grammatical relation between pairs of head and dependent words in the sentence. As an example, Figure 1 shows the dependency tree of an English sentence. In this figure, the arrow with label "nsubj" from "fox" to "jumped" indicates that the dependent word "fox" is the subject of the head word "jumped". Given the assumption that constituents move as a whole during translation (Quirk et al., 2005), we take the dependency tree of the source sentence and try to find the ordering of each dependent word with respect to its head (head-dep) and also with respect to the other dependants of that head (dep-dep). For example, for the English sentence in Figure 1, we try to predict the orientation between (head-dep) and (dep-dep) pairs as shown in Table 2.
We consider two orientation types between the constituents: monotone and swap. If the order of two constituents in the source sentence is the same as the order of their translation in the target sentence, the orientation is monotone; otherwise it is swap. To be more formal, for two source words (S1, S2) and their aligned target words (T1, T2), with the alignment points (P_S1, P_S2) and (P_T1, P_T2), we find the orientation type between S1 and S2 as shown in Equation 1 (Kazemi et al., 2015):

    ori = { monotone   if (P_S1 - P_S2) × (P_T1 - P_T2) > 0
          { swap       otherwise                               (1)

For example, for the sentence in Figure 1, the orientation between the source words "brown" and "quick" is monotone, while the orientation between "brown" and "fox" is swap.
We use a classifier to predict the probability of the orientation between each pair of constituents being monotone or swap. This probability is used as one feature in the log-linear framework of the HPB-SMT model. Using a classifier enables us to incorporate fine-grained information in the form of features into our RM. Table 3 and Table 4 show the features that we use to characterize (head-dep) and (dep-dep) pairs, respectively.
As Table 3 and Table 4 show, we use three types of features: lexical, syntactic, and semantic. While semantic structures have been previously used for MT reordering, e.g. (Liu and Gildea, 2010), to the best of our knowledge, this is the first work that includes semantic features jointly with lexical and syntactic features in the framework of a syntax-based RM. Using syntactic features, such as dependency relations, enables the RM to make syntactic generalizations. For instance, the RM can learn that in translating between subject-verb-object (SVO) and subject-object-verb (SOV) languages, the object and the verb should be swapped. On top of this syntactic generalization, the RM should be able to make semantic generalizations. To this end, we use WordNet synsets as an additional feature in our RM. WordNet is a lexical database of English which groups words into sets of cognitive synonyms. In other words, in WordNet a set of synonymous words belong to the same synset. For example, the words "baby", "babe" and "infant" are in the same synset in WordNet. The use of synsets enables our RM to generalize from words seen in the training data to any of their synonyms present in WordNet.
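Equation 1 can be sketched as a small Python function; the positions in the example calls are illustrative, not taken from the paper's data.

```python
# Sketch of Equation 1: two source words are "monotone" when their relative
# order is preserved by the word alignment, and "swap" otherwise.

def orientation(p_s1, p_s2, p_t1, p_t2):
    """p_s*: positions of the source words; p_t*: positions of their
    aligned target words."""
    return "monotone" if (p_s1 - p_s2) * (p_t1 - p_t2) > 0 else "swap"

# Relative order kept in the target -> monotone:
print(orientation(2, 3, 5, 6))  # monotone
# Relative order reversed in the target -> swap:
print(orientation(2, 3, 6, 5))  # swap
```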
4 Experiments

4.1 Data and Setup
We used the Mizan English-Farsi parallel corpus (http://dadegan.ir/catalog/mizan) (Supreme Council of Information and Communication Technology, 2013), which contains
Figure 1: An example dependency tree for an English source sentence, its translation in Farsi and the
word alignments
head     dependant   |   dependant 1   dependant 2
jumped   fox         |   fox           dog
jumped   dog         |   brown         quick
fox      the         |   the           brown
fox      brown       |   the           quick
fox      quick       |   the           lazy
dog      the         |
dog      lazy        |

Table 2: head-dependant and dependant-dependant pairs for the sentence in Figure 1.
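The extraction of (head-dep) and (dep-dep) pairs can be sketched as follows; the `tree` dict is a hand-built stand-in for parser output for the sentence in Figure 1, not the paper's actual data structures.

```python
# Sketch: enumerate (head, dependant) pairs and pairs of co-dependants
# from a head -> dependants map of a dependency tree.
from itertools import combinations

tree = {
    "jumped": ["fox", "dog"],
    "fox": ["the", "brown", "quick"],
    "dog": ["the", "lazy"],
}

# Every head paired with each of its dependants:
head_dep = [(h, d) for h, deps in tree.items() for d in deps]
# Every pair of dependants sharing the same head:
dep_dep = [pair for deps in tree.values() for pair in combinations(deps, 2)]

print(len(head_dep), len(dep_dep))  # 7 head-dep and 5 dep-dep pairs
```

The counts match Table 2: seven (head-dep) pairs and five (dep-dep) pairs for this sentence.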
around one million sentences extracted from English novels and their translations into Farsi.
We randomly held out 3,000 and 1,000 sentence
pairs for tuning and testing, respectively, and used
the remaining sentence pairs for training. Table 5
shows statistics (number of words and sentences)
of the data sets used for training, tuning and testing.
         Unit        English      Farsi
Train    sentences   1,016,758    1,016,758
         words       13,919,071   14,043,499
Tune     sentences   3,000        3,000
         words       40,831       41,670
Test     sentences   1,000        1,000
         words       13,165       13,444

Table 5: Mizan parallel corpus statistics

We used GIZA++ (Och and Ney, 2003) to align the words in the English and Farsi sentences. We parsed the English sentences of our parallel corpus with the Stanford dependency parser (Chen and Manning, 2014) and used the "collapsed representation" of its output, which shows the direct dependencies between the words in the English sentence. Having obtained both the dependency trees and the word alignments, we extracted 6,391,956 (head-dep) and 5,247,526 (dep-dep) pairs from our training data set and determined the orientation for each pair based on Equation 1. We then trained a Maximum Entropy classifier (Manning and Klein, 2003) (henceforth MaxEnt) on the extracted constituent pairs from the training data set and used it to predict the orientation probability of each pair of constituents in the tune and test data sets. As mentioned earlier, we used WordNet to determine the synsets of the English words in the data set.

Our baseline SMT system is the Moses implementation of the HPB-SMT model with default settings (Hoang et al., 2009). We used a 5-gram language model and trained it on the Farsi side of the training data set. All experiments used MIRA for tuning the weights of the features used in the HPB model (Cherry and Foster, 2012).

The semantic features (synsets) are extracted from WordNet 3.0. For each word, we take the synset that corresponds to its first sense, i.e. the most common one. An alternative would be to apply a word sense disambiguation algorithm. However, these have been shown to perform worse than the first-sense heuristic when WordNet is the inventory of word senses, e.g. (Pedersen and Kolhatkar, 2009; Snyder and Palmer, 2004).
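The first-sense heuristic itself is simple: take the first synset WordNet lists for a word, since WordNet orders senses by frequency. A self-contained sketch with a toy sense inventory (a real system would query WordNet 3.0, e.g. through NLTK's `wordnet` corpus reader; the inventory below is invented for illustration):

```python
# Toy sense inventory: word -> synset IDs ordered by sense frequency,
# mimicking WordNet's convention of listing the most common sense first.
# Illustrative data only, not actual WordNet content.
SENSE_INVENTORY = {
    "fox": ["fox.n.01", "fox.n.02"],
    "jump": ["jump.v.01", "jump.v.02"],
    "dog": ["dog.n.01"],
}

def first_sense(word):
    """First-sense heuristic: return the first listed synset, or None
    if the word is not in the inventory."""
    senses = SENSE_INVENTORY.get(word.lower())
    return senses[0] if senses else None

assert first_sense("fox") == "fox.n.01"
assert first_sense("cat") is None
```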
4.2 Evaluation: MT Results
We selected different feature sets for the (head-dep) and (dep-dep) pairs from Table 3 and Table 4, respectively, and then used them in our MaxEnt classifier to determine the impact of our novel semantic features (WordNet synsets) on the quality of the MT system. Three different feature sets were examined in this paper, comprising information from (i) surface forms (surface), (ii) synsets (synset) and (iii) both surface forms and synsets (both). We built six MT systems, as shown in Table 6, according to the constituent pairs and feature sets examined.

Features                    Type       Description
lex(head), lex(dep)         lexical    surface forms of the head and dependent word
depRel(dep)                 syntactic  dependency relation of the dependent word
syn(head), syn(dep)         semantic   synsets of the head and dependent word

Table 3: Features for (head-dep) constituent pairs

Features                          Type       Description
lex(head), lex(dep1), lex(dep2)   lexical    surface forms of the mutual head and dependent words
depRel(dep1), depRel(dep2)        syntactic  dependency relation of the dependent words
syn(head), syn(dep1), syn(dep2)   semantic   synsets of the head and dependent words

Table 4: Features for (dep-dep) constituent pairs
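To illustrate how lexical, syntactic and semantic features of a constituent pair feed the orientation classifier, here is a sketch using scikit-learn's `LogisticRegression` (a maximum-entropy model) as a stand-in for the Stanford MaxEnt classifier the authors used. The training examples, feature values and labels are invented:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented (head-dep) training examples: lexical, syntactic and semantic
# (synset) features, each labelled with its orientation type.
X = [
    {"lex(head)": "jumped", "lex(dep)": "fox", "depRel(dep)": "nsubj",
     "syn(head)": "jump.v.01", "syn(dep)": "fox.n.01"},
    {"lex(head)": "jumped", "lex(dep)": "dog", "depRel(dep)": "nmod",
     "syn(head)": "jump.v.01", "syn(dep)": "dog.n.01"},
]
y = ["monotone", "swap"]

clf = make_pipeline(DictVectorizer(), LogisticRegression())
clf.fit(X, y)

# predict_proba yields the orientation probability that the reordering
# model consumes as a feature.
probs = clf.predict_proba(X[:1])
assert abs(probs[0].sum() - 1.0) < 1e-9
```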
We compared our MT systems to the standard HPB-SMT system. Each MT system was tuned three times and we report the average scores obtained with multeval3 (Clark et al., 2011) on the MT outputs. The results obtained by each of the MT systems according to two widely used automatic evaluation metrics, BLEU (Papineni et al., 2002) and TER (Snover et al., 2006), are shown in Table 7. The relative improvement of each evaluation metric over the baseline HPB is shown in the diff columns.
Compared to the use of surface features, our novel semantic features based on WordNet synsets lead to better scores for both (head-dep) and (dep-dep) constituent pairs according to both evaluation metrics, BLEU and TER (except for the dd system in terms of TER, where there is a slight but insignificant increase: 79.8 vs. 79.7).
5 Conclusions and Future Work

In this paper we have extended a syntax-based RM for HPB-SMT with semantic features (WordNet synsets), in order to enable the model to generalize to phrases not seen in the training data but that have equivalent meaning. The inclusion of synsets has led to the best BLEU score in our experiments, outperforming the baseline (standard HPB-SMT) by 0.6 absolute points.

As for future work, we propose to work mainly along the following two directions. First, an investigation of the extent to which using a WordNet-informed approach to classify words into semantic classes (as proposed in this work) outperforms an unsupervised approach via word clustering. Second, an in-depth human evaluation to gain further insight into the exact contribution of WordNet to the translation output.

3 https://github.com/jhclark/multeval

Acknowledgments

This research is supported by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Dublin City University, by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran) and by the University of Isfahan.

References

Alexandra Birch, Phil Blunsom, and Miles Osborne. 2009. A Quantitative Analysis of Reordering Phenomena. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 197–205, Athens, Greece.

Arianna Bisazza and Marcello Federico. 2013. Dynamically shaping the reordering search space of phrase-based statistical machine translation. Transactions of the ACL, (1):327–340.

Arianna Bisazza and Marcello Federico. 2015. A survey of word reordering in statistical machine translation: Computational models and language phenomena. arXiv preprint arXiv:1502.04938.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Empirical Methods in Natural Language Processing (EMNLP).
MT System   Features
hd-surface  Lex(head), Lex(dep), depRel(dep)
hd-synset   depRel(dep), Syn(head), Syn(dep)
hd-both     Lex(head), Lex(dep), depRel(dep), Syn(dep), Syn(head)
dd-surface  Lex(head), Lex(dep1), Lex(dep2), depRel(dep1), depRel(dep2)
dd-synset   Syn(head), Syn(dep1), Syn(dep2), depRel(dep1), depRel(dep2)
dd-both     Lex(head), Lex(dep1), Lex(dep2), Syn(head), Syn(dep1), Syn(dep2), depRel(dep1), depRel(dep2)

Table 6: Examined features for MT systems
             BLEU ↑                                 TER ↓
System       Avg    diff    ssel  sTest  p-value    Avg    diff     ssel  sTest  p-value
baseline     10.9   –       0.6   0.0    –          80.3   –        0.8   0.0    –
dd-surface   11.4   4.58%   0.7   0.1    0.00       79.7*  -0.74%   0.8   0.2    0.01
dd-syn       11.3   3.66%   0.6   0.2    0.01       79.8   -0.62%   0.8   0.2    0.05
dd-both      11.5*  5.50%   0.7   0.2    0.00       79.8   -0.62%   0.8   0.5    0.02
hd-surface   11.1   2.18%   0.6   0.1    0.08       80.9   0.74%    0.8   0.3    0.01
hd-syn       11.3   3.66%   0.6   0.1    0.00       80.5   0.24%    0.8   0.2    0.40
hd-both      11.1   2.18%   0.6   0.1    0.06       81.1   0.99%    0.8   0.3    0.00

Table 7: MT scores for all systems. p-values are relative to the baseline and indicate whether a difference of this magnitude (between the baseline and the system on that line) is likely to be generated again by some random process (a randomized optimizer). Metric scores are averages over three runs. ssel indicates the variance due to test set selection and has nothing to do with optimizer instability. The best result according to each metric (highest for BLEU and lowest for TER) is marked with an asterisk.
Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 427–436.

Colin Cherry. 2013. Improved reordering for phrase-based translation using sparse features. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 22–31.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 263–270.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, HLT '11, pages 176–181, Stroudsburg, PA, USA. Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Yang Gao, Philipp Koehn, and Alexandra Birch. 2011. Soft dependency constraints for reordering in hierarchical phrase-based translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 857–868.

Isao Goto, Masao Utiyama, Eiichiro Sumita, Akihiro Tamura, and Sadao Kurohashi. 2013. Distortion model considering rich context for statistical machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 155–165.

Spence Green, Michel Galley, and Christopher D. Manning. 2010. Improved models of distortion cost for statistical machine translation. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 867–875.

Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A unified framework for phrase-based, hierarchical, and syntax-based statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT, pages 152–159.

Arefeh Kazemi, Antonio Toral, Andy Way, Amirhassan Monadjemi, and Mohammadali Nematbakhsh. 2015. Dependency-based reordering model for constituent pairs in hierarchical SMT. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, pages 43–50, Antalya, Turkey, May.

Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 127–133.

Philipp Koehn, Amittai Axelrod, Alexandra Birch, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation, pages 68–75.

Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 716–724.

Christopher Manning and Dan Klein. 2003. Optimization, maxent models, and conditional estimation without magic. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials, pages 8–8.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318.

Ted Pedersen and Varada Kolhatkar. 2009. WordNet::SenseRelate::AllWords: A broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Demonstration Session, NAACL-Demonstrations '09, pages 17–20, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chris Quirk, Arul Menezes, and Colin Cherry. 2005. Dependency treelet translation: Syntactically informed phrasal SMT. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 271–279.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and Ralph Weischedel. 2006. A Study of Translation Error Rate with Targeted Human Annotation. In Proceedings of the Association for Machine Translation in the Americas.

Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Rada Mihalcea and Phil Edmonds, editors, Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 41–43, Barcelona, Spain, July. Association for Computational Linguistics.

Supreme Council of Information and Communication Technology. 2013. Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT-NAACL 2004: Short Papers, pages 101–104.

Richard Zens and Hermann Ney. 2006. Discriminative reordering models for statistical machine translation. In StatMT '06 Proceedings of the Workshop on Statistical Machine Translation, pages 55–63.
Eliminating Fuzzy Duplicates in Crowdsourced Lexical Resources
Yuri Kiselev
Yandex
Yekaterinburg, Russia
[email protected]
Dmitry Ustalov
Ural Federal University
Yekaterinburg, Russia
[email protected]
Sergey Porshnev
Ural Federal University
Yekaterinburg, Russia
[email protected]
Abstract

Collaboratively created lexical resources are a trending approach to building high quality thesauri in a short time span at a remarkably low price. The key idea is to invite non-expert participants to express and share their knowledge with the aim of constructing a resource. However, this approach tends to be noisy and error-prone, thus making data cleansing a highly topical task. In this paper, we study different techniques for synset deduplication, including machine- and crowd-based ones. Eventually, we put forward an approach that can solve the deduplication problem fully automatically, with quality comparable to the expert-based approach.

1 Introduction

A WordNet-like thesaurus is a dictionary of a special type that represents different semantic relations between synsets—sets of quasi-synonyms (Miller et al., 1990). It is a crucial resource for addressing such problems as word sense disambiguation, search query extension and many other problems in the fields of natural language processing (NLP) and artificial intelligence (AI). Typical semantic relations represented by thesauri are synonymy, antonymy (primarily for nouns and adjectives), troponymy (for verbs), hypo-/hypernymic relations, and meronymy.

A good linguistic resource should not contain duplicated lexical senses, because duplicates violate data integrity and complicate the addition of semantic relations to the resource. Therefore, removing duplicated synsets from thesauri is an important problem to be addressed, especially in collaboratively created lexical resources like Wiktionary, which is known to suffer from this problem (Kiselev et al., 2015). However, deduplication is rather problematic because thesauri may contain fuzzy duplicated synsets composed of different words.

This paper makes the following contributions: (1) it proposes an automatic approach to synset deduplication, (2) presents a synonym-dictionary-based technique for assessing synset quality, and (3) compares the proposed approach with a crowdsourcing-based one.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 defines the problem of synset duplicates existing in thesauri. Section 4 presents a novel approach to synset deduplication. Section 5 describes the experimental setup. Section 6 shows the obtained results. Section 7 discusses the interesting findings. Section 8 concludes the paper and defines directions for future work.

2 Related Work
One of the most straightforward ways to clear a thesaurus of sense duplicates is to align its entries with another resource of proven quality, e.g. using the OntoClean methodology proposed by Guarino and Welty (2009). Synsets that end up linked to the same synset in the other resource represent the same concept and should be merged. However, such alignment can only be performed manually. It is also a time-consuming process that requires careful examination of every synset by an expert. Therefore, it is crucial to focus on methods that are either automatic or involve a smaller amount of human intervention.
Many studies nowadays aim to evaluate the feasibility of crowdsourcing for various NLP problems. For instance, Snow et al. (2008) showed that non-expert annotators can produce data whose quality competes with expert annotation in such tasks as word sense disambiguation and word similarity estimation (they conducted their study using Amazon Mechanical Turk1 (AMT), a popular online labor marketplace).
Sagot and Fišer (2012) assumed that semantically related words tend to co-occur in texts.
Given such an assumption, they managed to find
and eliminate the words that had been added to
synsets by mistake. This approach can be used
to find sense duplicates, but it requires a large
amount of semantic relations to be present in a resource. It should be noted that some resources that
contain synsets may not contain any links between
them. For instance, Wiktionary represents certain
words and relations between them, but it does not
explicitly link its synsets.
Sajous et al. (2013) presented a method for
semi-automatic enrichment of the Wiktionaryderived synsets. First, they analyzed the contents
of Wiktionary and produced new synonymy relations that had not been previously included in the
resource. After that, they invited collaborators to
manually process the data using a custom Firefox
plugin to add missing synonyms to the data.
A similar approach was used by Braslavski et
al. (2014) to bootstrap YARN (Yet Another RussNet) project, which aims at creating a large open
WordNet-like machine-readable thesaurus for the
Russian language by means of crowdsourcing. In
this project, a dedicated collaborative synset editing tool was used by the annotators to construct
synsets by adding and removing words.
The most recognized crowdsourcing workflow
is the Find-Fix-Verify pattern proposed by Bernstein et al. and used in Soylent, a Microsoft Word
plugin that submits human intelligence tasks to
AMT for rephrasing and improving the original
text (Bernstein et al., 2010). As the name implies,
the workflow includes the three stages: 1) in the
Find stage crowd workers find the text area that
can be shortened without changing the meaning,
2) in the Fix stage the workers propose improvements for these text areas, and 3) in the Verify stage
the workers select the worst proposed fixes.
Inspired by this pattern, Ustalov and Kiselev (2015) presented the Add-Remove-Confirm workflow for improving synset quality. Similarly, it
contains three stages: 1) in the Add stage workers choose the words to be added to a synset from
a given list of candidates, 2) in the Remove stage
the workers choose the words that should be removed from a synset, 3) in the Confirm stage the
workers choose which synset is better—the initial
one or the fixed one.
3 Problem
In our study, we focus on the synsets represented
in a WordNet-like thesaurus. Hence, we regard a
thesaurus as a set of synsets S, where every synset
s ∈ S consists of different words and represents
some sense or concept.
In lexical resources created by expert lexicographers, synsets usually correspond to distinct meanings, so synset duplicates never arise. Unfortunately, this is not true for resources created by non-expert users, e.g. through crowdsourcing. One approach to synset creation is to combine manually constructed synsets with synsets imported from open resources. Obviously, this leads to a situation where there are plenty of synsets representing identical concepts. The crowdsourcing approach to synset creation is also prone to this drawback, as the crowd is likely to create duplicate synsets.
The following example from the Russian Wiktionary2 shows that it contains synsets with
identical meanings. For example, the synset
{стоматолог (stomatologist), дантист (dentist),
зубной врач (“tooth doctor”)} and the synset
{дантист (dentist), стоматолог (stomatologist)}
definitely describe the same concept “a person
qualified to treat the diseases and conditions that
affect the teeth”. Hence, such synsets should be
combined, yet they both are present in the Russian
Wiktionary. Note that in this example the second
synset is a full subset of the first one; however, it is
possible that two synsets may intersect only partly
while sharing the same meaning.
For a native speaker, it is relatively easy to detect whether two synsets share the same meaning. So, the detection may be done by non-experts via crowdsourcing. However, the key problem here is how to retrieve the pairs of synsets that presumably represent identical concepts. In the next section, we propose a simple, yet effective approach.

1 https://www.mturk.com/mturk/welcome
2 https://ru.wiktionary.org/

4 Approach

Suppose the word w has several meanings. According to Miller et al. (1990), it is usually enough to provide one synonym for every meaning of w for a native speaker of a language to be able to distinguish the meanings from each other (provided that the speaker is familiar with the corresponding concepts). This phenomenon is widely exploited by explanatory dictionaries. It is also utilized in some thesauri, which assume that a synset itself is enough to deduce its meaning, so that definitions of synsets may be omitted.

Hence, we formulate the meaning deduplication problem as follows. Given a pair of different synsets s1 ∈ S and s2 ∈ S, we treat them as duplicates if they share exactly two words:

∃ s1 ∈ S, s2 ∈ S : s1 ≠ s2 ∧ |s1 ∩ s2| = 2.

Obviously, this is a strong criterion that may be violated, so we propose the following two-stage workflow for synset deduplication.

Filtering. In this stage, the possible duplicates are retrieved using the criterion described above, resulting in a set of synset pairs (s1, s2) for further validation.

Voting. In this stage, the obtained synset pairs are subject to manual verification. The pairs voted as equivalent are combined.

The assessment required in the Voting stage may be provided by expert lexicographers; in crowdsourced resources, the contributors may be invited not only to add new data, but also to increase the quality of the created data and to deduplicate it.

5 Experiments

Since task submission to Amazon Mechanical Turk requires a U.S. billing address, this solution is not accessible to users from other countries. Although there are many other crowdsourcing platforms, e.g. CrowdFlower, Microworkers, Prolific Academic, etc., the proportion of Russian speakers on such platforms is still low (Pavlick et al., 2014).

Given the fact that our workers are native Russian speakers, we decided to use the open source crowdsourcing engine Mechanical Tsar3, which is designed for rapid deployment of mechanized labor workflows (Ustalov, 2015). Inspired by the similar annotation study conducted by Snow et al. (2008), we used the default configuration, i.e. the majority voting strategy for answer aggregation, the fixed answer number per task strategy for task allocation, and no worker ranking. The workers were invited from VK, Facebook and Twitter via a short-term open call for participation posted by us.

3 http://mtsar.nlpub.org/

5.1 Stage "Filtering"

We used two different electronic thesauri for the experiments. The first one was chosen from among crowdsourced lexical resources. Selecting between the Russian Wiktionary and YARN, we settled on the latter because it comprises one and a half times more synsets, and it is easier to parse because YARN4 synsets are available in the CSV format.

4 http://russianword.net/yarn-synsets.csv

We were also interested in applying the described approach to a resource created by expert lexicographers. The current situation with electronic thesauri for the Russian language is that there is only one resource that is large enough and available for study. This resource is RuThes-lite5, a publicly available version of the RuThes linguistic ontology, which has been under development for many years (Loukachevitch, 2011).

5 http://www.labinform.ru/pub/ruthes/

We retrieved 210 presumably duplicated synsets from each resource: 70 synsets with exactly two common words, 70 synsets with three, and 70 synsets with four or more common words. Such a stratification is motivated by our interest in analyzing how the number of shared words correlates with their meanings.

By randomly sampling pairs of possibly duplicated synsets from YARN, we concluded that the proposed criterion for synset equivalence is very robust. It appears that for YARN this approach may be used even without the Voting stage. Thus, we decided to study whether the manual annotation does increase the quality of synset deduplication. In order to do this, we selected synsets from YARN as follows.

Since synsets in YARN are not always accompanied by sense definitions, we asked an expert to manually align the selected synsets with an expert-built lexical resource. We chose the Babenko dictionary (2011) (hereinafter referred to as BAB) as the expert-built lexical resource because it is a relatively recent dictionary with a wide language coverage. As a result of the alignment, each YARN synset s was provided with a corresponding synset sBAB defined by a sense definition d.

5.2 Stage "Voting"

The goal of the Voting stage is to choose true equivalents among the prepared presumably equivalent synsets. The input of this stage is a pair of synsets (s1, s2) from a resource, and a worker is to determine whether the synsets share the same meaning (Figure 1).

Do the following synsets have the same meanings: "s1" and "s2"?
[ ] Yes
[ ] No

Figure 1: Task format for the Voting stage (the original text was in Russian).

6 Results

6.1 Quality Metrics

We use precision and recall to measure the quality of synsets in a thesaurus S. Precision P(s) of a synset s ∈ S is the fraction of the synset words with the meaning represented by s, relative to all the words in the language representing the meaning of the synset, L(s):

P(s) = |s ∩ L(s)| / |s|   (1)

Recall R(s) of a synset s is the fraction of all words in the language that have the meaning that s represents:

R(s) = |s ∩ L(s)| / |L(s)|   (2)

As may easily be noticed, it is impossible to precisely calculate the synset recall R(s), since the whole set of words that can correspond to a particular meaning is unknown. In order to estimate L(·), we used the data retrieved at the Filtering stage. We combined the YARN synsets in each pair (s1, s2) into a new synset s. Then, we provided the resulting synset s with the corresponding definition d from the BAB and asked the same expert as in the Filtering stage to remove the words from s which do not correspond to the definition d. The fixed synsets s′ were then combined with the corresponding synsets sBAB. These combined synsets were used as the gold standard synsets sGS for the concepts, as we considered that such synsets contained all the words representing the concepts.

6.2 Example of Quality Calculation

Consider the following example in order to better understand the described process of data preparation and the further evaluations. Let us say that YARN contains synset s1 = {think, opine, suppose, sleep} and synset s2 = {think, suppose, reckon}, and BAB contains synset sBAB = {think, opine, suppose, imagine} with definition d "expect, believe, or suppose" (|s1 ∩ s2| = |{think, suppose}| = 2 and |s1 ∩ sBAB| = |{think, opine, suppose}| = 3). Assume that the expert aligned s1 and sBAB in the Filtering stage. In that case the expert would be provided with synset s = s1 ∪ s2 = {think, opine, suppose, sleep, reckon} and definition d from BAB. After fixing this synset s (by removing the wrong word sleep), it is combined with the corresponding synset sBAB. So the synset that will be further treated as the gold standard for this concept is sGS = {think, opine, suppose, imagine, reckon}. This set is used as L for calculating (1) and (2) (for the corresponding s1 and sBAB, L(s1) = L(sBAB)). According to this,

P(s1) = |s1 ∩ L(s1)| / |s1| = 3/4 = 0.75,

R(sBAB) = |sBAB ∩ L(sBAB)| / |L(sBAB)| = 4/5 = 0.8.

Note that in the proposed evaluation method, the precision P of any synset sBAB from BAB is 1.0.

6.3 Quality Assessment

The procedure described in Section 6.1 allowed us to calculate the suggested quality measures for the resources (Table 1). The BAB row is calculated for the 210 synsets from the Babenko dictionary; the YARN, aligned row for the 210 synsets s1 from YARN that were aligned with the BAB by the expert; and the YARN, machine row for the automatic merge of all 210 presumably equivalent synset pairs (s1, s2) of YARN.

Table 1: Synset quality.

                Avg P   Avg R   Avg F1
BAB             1.000   0.661   0.796
YARN, aligned   0.901   0.634   0.744
YARN, machine   0.840   0.774   0.805

The F1-measure for YARN is expectedly lower than for the BAB, yet, after a simple merging of the presumably equivalent synsets, its average F1-measure became higher than for the BAB. However, this result was due to the significant increase in the recall, while the precision dropped.

To investigate how people's participation can improve the quality of automatic merging, we conducted a crowdsourcing experiment. Every task (Figure 1) was annotated by at least three different workers. The decision about merging was made by majority voting. Table 2 shows the share of synsets that the workers decided to merge.

Table 2: Crowdsourcing synset deduplication.

              # of common words
              2      3      4+
YARN          61/70  64/70  68/70
RuThes-lite   25/70  40/70  51/70

7 Discussion

Quite expectedly, the two analyzed lexical resources proved very different. Our equivalence criterion worked in only one third of the cases for RuThes-lite. Even the stronger version of the criterion (the one considering synsets that share 4+ words as sense duplicates) was true in only 32 cases according to the annotators. However, for YARN the criterion proved rather robust, so it can be applied without crowd checking, provided that the results of the merging are verified by a moderator of the resource.

This conclusion agrees with the quality estimates of the merging performed according to the human annotations (Table 3). The first row (YARN, machine) corresponds to the automatic merge of all 210 synsets and repeats the row of Table 1 with the same name, and the second row (YARN, crowd) corresponds to the selective merge performed according to the human judgements. So, 61+64+68 synset pairs (s1, s2) were merged (Table 2), and the remaining 17 synsets were left as they were (s1).

Table 3: YARN synset deduplication.

               Avg P   Avg R   Avg F1
YARN, machine  0.840   0.774   0.805
YARN, crowd    0.852   0.764   0.805

The F1-measure shows no change after applying the Voting stage, yet the precision increases by 0.012 while the recall drops by 0.01. Despite the fact that the overall quality is constant regardless of the human annotations, this is still an interesting finding, since people increase the precision of the merging. This is important because it allows us to compensate, at least partially, for the reduction in precision against the original synsets caused by the automatic merge (Table 3).

It is also of interest that YARN contains 24.8 thousand synsets that presumably have a duplicate (58% of the synsets with two or more words), while the Russian Wiktionary has 13.2 thousand (40%), and RuThes-lite has only 6.3 thousand (28%). We may therefore conclude that the proposed approach should mainly be applied to resources that are known a priori to contain duplicate synsets, rather than to improve the quality of expert-built resources.

7.1 Synset Ambiguity

The analysis of the results of the experiments and the annotations provided by our expert showed that in some cases it is almost impossible to derive a meaning from a synset. For instance, just a couple of synonyms is not enough to distinguish the meaning "a woman thought to have evil magic powers" from "a woman who uses magic or sorcery" (the latter definition does not imply an "evil" woman, which may not be obvious from a synonym row).

Another example of such ambiguity is the pair of concepts corresponding to "a bed with a back" and "a bed without a back". Given only a synset, it is barely possible to discern this shade of meaning and distinguish either of these two concepts from the more common one (simply "a bed"). With this observation in mind, we suggest that the authors of wordnets in which definitions of synsets are optional take this into account and include definitions for vague concepts.

7.2 Pairwise Annotation

Special attention should be given to the performance of the crowd workers. In our experiment, 25 workers provided 1262 answers to 420 pairwise comparison tasks (Figure 1). The workers repeatedly reported that the tasks were time consuming due to data inconsistency. Suppose that the synset sizes are n1 and n2, respectively, and an annotator spends O(n1 + n2) time to make a decision. Hence, even in the simplest case (Table 4) an annotator will perform 4 + 4 = 8 operations per pair, which is inconvenient.

8 Conclusion
In this study, we presented an automated approach
to synset deduplication. The results were obtained
from expert labels and annotations provided by
crowd work. At least three different annotations
per every synset pair from two different resources
(YARN and RuThes-lite) were used. The approach
allows to significantly increase the synset quality in crowdsourcing lexical resources. Participation of people does not notably affect the average synset quality, though the precision slightly
increases when people are involved.
The results showed that two synonyms are not
sufficient for defining a meaning, but three words
usually give a satisfactory result. So, it is three
words that should be used as a threshold value
for merging duplicate synsets when using the proposed deduplication approach in a fully automatic
mode. Our results, including the crowd answers
and the produced gold standard, are available6 under the terms of Creative Commons AttributionShareAlike 3.0 license.
As a possible future direction, we may suggest
using more sophisticated similarity measures to
select a threshold for fully automatic merging of
synsets. Another possible way to improve the approach is to detect not just pairs, but clusters of
synsets. This is hardly possible in resources that
are manually crafted by a team of experts, but it is
definitely worth exploring for crowdsourcing resources.
Table 4: Average synset sizes.
# of common words 2
3
4+
YARN
4.2 4.6 5.5
RuThes-lite
4.3 5.0 5.8
Further studies should avoid pairwise comparison in problems involving contextual or domain
knowledge for making a decision by annotators.
However, it still may be useful in various visual
recognition tasks, especially when the workers are
provided with an observable hint (Deng et al.,
2013). We should also note that this outcome
agrees well with the study conducted by Wang et
al. (2012), when cluster-based task generation led
to lower time spent rather than in pair-based tasks.
7.3
Conclusion
Agreement & Issues
We have analyzed all the cases when all the three
workers gave the same answer to the task (Table 5). For YARN , the number of cases when all
the workers agreed rises with the number of common words in synsets. This is quite expected considering that sharing more common words makes
it more obvious that the synsets have common
senses. However, we do not observe the same in
RuThes-lite.
Acknowledgments
This work is supported by the Russian Foundation for the Humanities, Project No. 13-04-12020
“New Open Electronic Thesaurus for Russian”.
The authors are grateful to Yulia Badryzlova for
proofreading the text, and to Alisa Porshneva for
labeling synsets. The authors would also like to
thank all those who participated in the crowdsourced experiment.
Table 5: # of merge decisions made unanimously.
# of common words
2
3
4+
32
47
57
YARN
/70
/70
/70
36/
35/
32/
RuThes-lite
70
70
70
Manual analyses of the data from RuThes-lite
showed that its authors tend to discriminate meanings of synsets with common words by means of
only one word, e.g. using a hyponym for a concept in one set and a corresponding hypernym in
another. It is enough to emphasize the difference
in meanings, but workers may find it problematic
to detect the only pair of words that defines the
difference in the pair of synsets. This task may
become even more complicated in large synsets,
as they grow in size along with the increase in the
number of common words in them (Table 4).
6
http://ustalov.imm.uran.ru/pub/
duplicates-gwc.tar.gz
166
References

Ljudmila G. Babenko, editor. 2011. Dictionary of Synonyms of the Russian Language. AST: Astrel, Moscow, Russia.

Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. 2010. Soylent: A Word Processor with a Crowd Inside. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, UIST '10, pages 313–322, New York, NY, USA. ACM.

Pavel Braslavski, Dmitry Ustalov, and Mikhail Yu. Mukhin. 2014. A Spinning Wheel for YARN: User Interface for a Crowdsourced Thesaurus. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 101–104, Gothenburg, Sweden. Association for Computational Linguistics.

Jia Deng, Jonathan Krause, and Li Fei-Fei. 2013. Fine-Grained Crowdsourcing for Fine-Grained Recognition. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 580–587.

Nicola Guarino and Christopher A. Welty. 2009. An Overview of OntoClean. In Steffen Staab and Rudi Studer, editors, Handbook on Ontologies, International Handbooks on Information Systems, pages 201–220. Springer Berlin Heidelberg.

Yuri Kiselev, Andrew Krizhanovsky, Pavel Braslavski, et al. 2015. Russian Lexicographic Landscape: a Tale of 12 Dictionaries. In Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference "Dialogue", volume 1, pages 254–271. RGGU, Moscow.

Natalia V. Loukachevitch. 2011. Thesauri in Information Retrieval Tasks. Moscow University Press, Moscow, Russia.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4):235–244.

Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev, and Chris Callison-Burch. 2014. The Language Demographics of Amazon Mechanical Turk. Transactions of the Association for Computational Linguistics, 2:79–92.

Benoît Sagot and Darja Fišer. 2012. Cleaning Noisy Wordnets. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey.

Franck Sajous, Emmanuel Navarro, Bruno Gaume, Laurent Prévot, and Yannick Chudy. 2013. Semi-Automatic Enrichment of Crowdsourced Synonymy Networks: The WISIGOTH System Applied to Wiktionary. Language Resources and Evaluation, 47(1):63–96.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and Fast—but is It Good?: Evaluating Non-expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 254–263, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dmitry Ustalov. 2015. A Crowdsourcing Engine for Mechanized Labor. Proceedings of the Institute for System Programming, 27(3):351–364.

Dmitry Ustalov and Yuri Kiselev. 2015. Add-Remove-Confirm: Crowdsourcing Synset Cleansing. In Application of Information and Communication Technologies (AICT), 2015 IEEE 9th International Conference on, pages 143–147. IEEE.

Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. 2012. CrowdER: Crowdsourcing Entity Resolution. Proc. VLDB Endow., 5(11):1483–1494.
Automatic Prediction of Morphosemantic Relations
Svetla Koeva, Svetlozara Leseva, Ivelina Stoyanova,
Tsvetana Dimitrova, Maria Todorova
Department of Computational Linguistics
Bulgarian Academy of Sciences
{svetla,zarka,iva,cvetana,maria}@dcl.bas.bg
Abstract

This paper presents a machine learning method for automatic identification and classification of morphosemantic relations (MSRs) between verb and noun synset pairs in the Bulgarian WordNet (BulNet). The core training data comprise 6,641 morphosemantically related verb–noun literal pairs from BulNet. The core data were preprocessed quality-wise by applying validation and reorganisation procedures, and were then supplemented with negative examples of literal pairs not linked by an MSR. The designed supervised machine learning method uses the RandomTree algorithm and is implemented in Java with the Weka package. A set of experiments was performed to test various approaches to the task. Future work on improving the classifier includes adding more training data, employing more features, and fine-tuning. Apart from the language-specific information about derivational processes, the proposed method is language independent.

1 Introduction

This paper investigates a machine learning method for identification and classification of morphosemantic relations (MSRs) between verb and noun synset pairs in the Bulgarian WordNet (BulNet). It is based on the MSR dataset from the Princeton WordNet (PWN) (Fellbaum et al., 2009), automatically imported into the Bulgarian WordNet (the core dataset), the PWN semantic primitives (henceforth, semantic primes), and the derivational relations (DRs) in the Bulgarian WordNet. The derivational relations had been previously assigned automatically in the Bulgarian WordNet using a string similarity algorithm combined with heuristics (Dimitrova et al., 2014), and had been manually post-edited.

The MSRs link verb–noun pairs of synsets that contain derivationally related literals. As semantic and morphosemantic relations refer to concepts, they are universal, and such a relation must hold between the relevant concepts in any language, regardless of whether it is morphologically expressed or not. This has enabled the automatic transfer of the relations to other languages, such as Polish (Piasecki et al., 2009), Bulgarian (Koeva, 2008; Stoyanova et al., 2013; Dimitrova et al., 2014), Serbian (Koeva et al., 2008), and Romanian (Barbu Mititelu, 2012; Barbu Mititelu, 2013). Other sets of MSRs have been proposed for Turkish (Bilgin et al., 2004), Czech (Pala and Hlaváčková, 2007), Estonian (Kahusk et al., 2010), Polish (Piasecki et al., 2012a; Piasecki et al., 2012b), and Croatian (Šojat and Srebačić, 2014).

The study is motivated by the fact that a considerable number of synsets – 67% (7,905 out of 11,751) of the noun synsets derivationally related to verb synsets, and 89% (7,962 out of 8,934) of the verb synsets derivationally related to noun synsets in PWN 3.0 – are not labelled with an MSR. In addition, the linguistic generalisations behind the existing MSRs were made on the basis of English derivational morphology; hence the proposed set of MSR instances may be extended based on evidence from the derivational morphology of other languages, including Bulgarian.

The present research builds on Leseva et al. (2014), where all plausible MSRs were assigned by intersecting the following pairs registered in BulNet: <noun literal suffix – semantic prime of the noun synset> and <noun literal suffix – MSR between the noun and a verb synset>. The probability of each MSR was then estimated from the frequency of occurrence of the triples <MSR – noun synset semantic prime – verb synset semantic prime> in the PWN, and was used to filter out less probable MSRs.
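The frequency-based filtering of candidate MSRs can be sketched roughly as follows. This is a minimal illustration with hypothetical data structures and a hypothetical cut-off value; the original estimation in Leseva et al. (2014) is described in the paper only at this level of detail.

```python
from collections import Counter

def triple_probabilities(triples):
    """Estimate P(MSR | noun prime, verb prime) from observed
    (msr, noun_prime, verb_prime) triples, e.g. drawn from the PWN."""
    triple_counts = Counter(triples)
    pair_counts = Counter((n, v) for _, n, v in triples)
    return {
        (msr, n, v): c / pair_counts[(n, v)]
        for (msr, n, v), c in triple_counts.items()
    }

def filter_candidates(candidates, probs, min_prob=0.5):
    """Keep only candidate MSR assignments whose estimated probability
    for the given prime pair reaches min_prob (threshold illustrative)."""
    return [
        (msr, n, v) for (msr, n, v) in candidates
        if probs.get((msr, n, v), 0.0) >= min_prob
    ]
```

With counts of 3 Agent triples against 1 Undergoer triple for the same prime pair, for instance, only the Agent candidate survives a 0.5 cut-off.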
In a follow-up paper (Leseva et al., 2015), a
decision-tree based supervised machine learning
method was designed, implemented and tested for
classification of MSRs. In the present paper, we
upgrade the previous research along the following
lines – we propose a method designed to identify
new synset pairs that have a high probability of
being MSR related and to classify the respective
MSRs; we test new sets of features combined in
different ways (as described in the experiments),
which gives us insights into possible extensions
and improvements of the method.
Our task is three-fold: (i) to identify potential derivational verb–noun pairs in BulNet; (ii) for a given potential derivational pair, to determine whether a derivational relation actually exists (or whether the resemblance is a mere formal coincidence); and (iii) if a DR exists, to decide what type of MSR connects the relevant synsets.
The first part of the task was implemented by
identifying common substrings shared by noun–
verb literal pairs and by mapping the resulting endings to the canonical suffixes. The implementation
of (ii) and (iii) was performed using a machine
learning classifier. The suffixes of the noun–verb
derivational pairs and the semantic primes of the
verb and noun synsets were used as features in the
learning, while the types of MSR between these
pairs of synsets were the classes in the classification task. Our research is focused on Bulgarian but
the results are transferable across languages and
the methodology can be used to enhance wordnets
for other languages with semantic content.
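The input side of this pipeline, extracting a shared stem from a literal pair and building the feature vector, might look as follows. The helper names, the longest-common-prefix heuristic, and the example primes are illustrative assumptions on our part; the actual mapping of endings to canonical suffixes is language-specific and not reproduced here.

```python
def common_stem(noun, verb):
    """Longest common prefix of the two word forms, used here as a
    rough stand-in for the shared substring locating the stem."""
    i = 0
    while i < min(len(noun), len(verb)) and noun[i] == verb[i]:
        i += 1
    return noun[:i]

def make_features(noun, verb, noun_prime, verb_prime):
    """Feature vector for the classifier: noun ending, verb ending,
    and the two semantic primes (endings not yet canonicalised)."""
    stem = common_stem(noun, verb)
    return {
        "noun_suffix": noun[len(stem):] or "-0-",  # "-0-": zero derivation
        "verb_suffix": verb[len(stem):] or "-0-",
        "noun_prime": noun_prime,
        "verb_prime": verb_prime,
    }
```

For the pair polivach/polivam the shared prefix is "poliva", leaving the endings "ch" and "m" to be mapped onto their canonical suffixes in a later step.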
2 Linguistic Motivation

2.1 Morphosemantic Relations

MSRs hold between synsets containing literals that are derivationally related, and they express knowledge additional to that conveyed by semantic relations such as synonymy, hypernymy, etc. We use the inventory of MSRs from the PWN 3.0 morphosemantic database (http://wordnetcode.princeton.edu/standoff-files/morphosemantic-links.xls), which includes 17,740 links connecting 14,877 unique synset pairs. The MSRs were mapped to the equivalent Bulgarian synsets using the cross-language relation of equivalence between synsets.

The PWN specifies 14 types of MSRs between verbs and nouns: Agent, By-means-of (inanimate Agents or Causes, but also Means and possibly other relations), Instrument, Material, Body-part, Uses ((intended) purpose or function), Vehicle (means of transportation), Location, Result, State, Undergoer, Destination, Property, and Event (linking a verb to its eventive nominalisation). These relations are assigned between verb–noun synset pairs containing at least one derivationally related verb–noun literal pair; e.g., teacher:2 ('a person whose occupation is teaching') is the Agent of teach:2 ('impart skills or knowledge to'). Most of the relations correspond to or are subsumed by eponymous semantic roles (Agent, Instrument, Location, Destination, Undergoer, Vehicle, Body-part, etc.).

2.2 Semantic Primes

All the verb and noun synsets in the PWN are classified into a number of language-independent semantic primes. The nouns are categorised into 25 groups, such as noun.act (acts or actions) and noun.artifact (man-made objects). The verbs fall into 15 groups, such as verb.body (verbs of grooming, dressing and bodily care) and verb.change (verbs of size, temperature change, intensifying, etc.), as defined in the PWN lexicographer files (https://wordnet.princeton.edu/man/lexnames.5WN.html).

2.3 Derivational Relations

Derivational relations are language-specific lexical relations (between pairs of literals in related synsets). A DR may signal the existence of a morphosemantic relation between the relevant synsets, which may or may not be defined explicitly in the wordnet. A DR is formally expressed by means of a (combination of) morphological device(s), such as suffixation, prefixation, suffixation plus root vowel mutation, etc.

Most suffixes in Bulgarian can be associated with more than one MSR. Consider the suffix -ach/-yach. Its prototypical meaning is Agent, e.g., polivach:1 (waterer:2 – 'someone who waters plants or crops'), but it also denotes an instrumental meaning, e.g., rezach:1 (cutter:1; cutlery:2; cutting tool:1 – 'cutting implement; a tool for cutting'), and other relations, such as Vehicle – prehvashtach:1 (interceptor:1 – 'a fast maneuverable fighter plane designed to intercept enemy aircraft'); Body-part – privezhdach:1 (adductor:1 – 'a muscle that draws a body part toward the median line'); and others.
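This suffix-ambiguity observation suggests a simple filtering rule: a suffix licenses a set of candidate MSRs, and the noun synset's semantic prime rules most of them out. A minimal sketch, with deliberately partial, illustrative tables rather than the full inventories derived from the data:

```python
# Illustrative, partial entry for the suffix -ach/-yach; in practice
# such tables are derived from the training data, not hard-coded.
SUFFIX_MSRS = {
    "-ach": {"Agent", "Instrument", "Vehicle", "Body-part"},
}

# Noun primes compatible with each MSR (a subset of the paper's lists).
MSR_NOUN_PRIMES = {
    "Agent": {"noun.person", "noun.animal", "noun.plant", "noun.group"},
    "Instrument": {"noun.artifact", "noun.communication", "noun.cognition"},
    "Vehicle": {"noun.artifact"},
    "Body-part": {"noun.body", "noun.animal", "noun.plant"},
}

def candidate_msrs(suffix, noun_prime):
    """MSRs licensed by the suffix and compatible with the noun prime."""
    return {
        msr for msr in SUFFIX_MSRS.get(suffix, set())
        if noun_prime in MSR_NOUN_PRIMES.get(msr, set())
    }
```

Under these tables, -ach with noun.person leaves only Agent, while noun.artifact still leaves both Instrument and Vehicle, so further evidence (e.g. the gloss) is needed.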
The distinction between (part of) the meanings of a suffix corresponds to a distinction in the semantic primes of the relevant noun synsets. Polivach:1 (Agent) has the semantic prime noun.person; interceptor:1 (Vehicle) and rezach:1 (Instrument) bear the semantic prime noun.artifact; privezhdach:1 (Body-part) bears the prime noun.body. We can thus derive general rules for disambiguation or for partial reduction of the number of MSRs associated with a suffix. Given a derivationally related verb–noun literal pair which has not been assigned an MSR, and a relevant suffix, we are then able to rule out the MSRs that are possible for that suffix but incompatible with the semantic primes of the related verb and noun synsets.

3 Linguistic Preprocessing

We performed the following consistency procedures on the wordnet structure: (i) manual inspection and disambiguation of MSRs in cases of multiple relations assigned to a synset pair; (ii) validation of the consistency of the semantic primes along the hypernym–hyponym paths; and (iii) a consistency check of the type of the assigned MSR against the semantic primes. The quality analysis and validation were performed only on the core dataset and are language independent, i.e., they concern the wordnet structure rather than any language data, and are transferable across wordnets. This is a one-off task, ensuring the quality of the data used for machine learning, as well as for any future tasks based on these data.

3.1 Disambiguation of Multiple MSRs

We identified 450 cases of multiple MSRs assigned between pairs of synsets, representing 50 different combinations of two (rarely three) relations. As we assume that two unique concepts are linked by a unique semantic relation, we kept only one MSR per pair of synsets to ensure the consistency of the data. The following observations served as the main point of departure.

(I) The relations are mutually exclusive (24 combinations of MSRs). Consider the assignments <Agent, Destination> and <Agent, Undergoer>. Except in a reflexive interpretation, an entity cannot be an Agent, on the one hand, and a Destination (Recipient) or an Undergoer (Patient or Theme), on the other. The actual relation is signalled by the synset gloss and usually by the suffix; e.g., the choice of Agent over Destination for the pair pensioner:2 (retiree:1 – 'someone who has retired from active working') – pensioniram se:2 (retire:7 – 'go into retirement') was based both on the gloss and on the noun suffix -er. In other cases, e.g. <Agent, Event> and <Agent, Instrument>, the choice of relation depends on the semantic prime, as a noun.artifact or a noun.act cannot be an Agent, and vice versa – a noun.person cannot be an Instrument or an Event.

(II) One of the relations implies or overlaps with the other (16 combinations of MSRs). Examples of such combinations are <Instrument, Uses>, <By-means-of, Instrument>, and <Body-part, Uses>. The choice is based mainly on which relation is more informative rather than more abstract. For example, Instrument is preferred to Uses, as instruments are by definition used for a certain purpose. The semantics of the suffix, e.g. -tel in usilvatel:1 (amplifier:1) – usilvam:7 (amplify:1), also plays a role in the choice of the relation (Instrument).

(III) There is no strict distinction between the semantics of the relations (10 combinations of MSRs), e.g., <Result, Event>, <Result, State>, <Result, Material>, <State, Event>, <Property, State>. The choice is motivated by semantic information from the synsets, such as the literals, the gloss, or the semantic primes. For instance, the eventive and the resultative meanings of deverbal nouns are not always distinguished as different senses. In such cases, a noun.state synset would suggest the relation Result, while a noun.act or a noun.event synset points to Event. Definitions often give additional information about the type of MSR, e.g. 'the act of...', 'a state of...', etc., especially where the semantic prime is more specific. By inspecting the triples <verb.prime–noun.prime–MSR>, we established prime combinations that strongly indicate the type of relation: e.g., <noun.state–verb.state> points to State, and <noun.event/noun.process/noun.act–verb.change> – to Event. On their own, noun.act and noun.event point to Event, noun.person – to Agent, etc.

3.2 Validation of Semantic Primes

There are many hypernym–hyponym trees in which the semantic primes shift along the tree path. For instance, the majority of the 11,574 hypernyms with the prime noun.artifact have hyponyms classified as noun.artifact, but other prime labels are also found, such as noun.substance –
for nouns denoting raw materials or synthetic substances: e.g., pina cloth:1 ('a fine cloth made from pineapple fibers'), noun.substance, is a hyponym of fabric:1 ('artifact made by weaving or felting or knitting or crocheting natural or synthetic fibers'), noun.artifact. Moreover, some synsets are linked to two hypernyms but inherit the semantic prime of only one of the two, as in prednisolone:1 ('a glucocorticoid (trade names Pediapred or Prelone) used to treat inflammatory conditions'), noun.substance, which is a hyponym of both glucocorticoid:1, noun.substance, and anti-inflammatory drug:1, noun.artifact.

The most variation in the semantic primes of the noun synsets down a hypernym–hyponym tree is observed with noun.state (16 other primes), noun.attribute (15), noun.group (14), etc. For example, the paths down the trees with the prime noun.group on the hypernym(s) involve noun synsets with the primes noun.person (a group of persons – e.g., synsets for ethnic groups, nationalities, etc.), noun.animal (a group of animals – animal taxons, etc.), noun.plant (a group of plants – plant taxons), and so on.

We manually analysed the cases where hyponyms have semantic primes different from those of their immediate hypernym. The primes of 33 nouns labelled noun.Tops were changed to the prime they give name to and that is predominantly found in their hyponyms; e.g., state:2 was relabelled as noun.state, and process:6; physical process:1 – as noun.process. The prime labels of 66 hyponyms were aligned with those of their immediate hypernym in order to reflect more precisely the semantics of the words with which they are linked. For example, dance:2 ('move in a pattern; usually to musical accompaniment; do or perform a dance') is classified as verb.creation, while its hypernym move:14 ('move so as to change position, perform a non-translational motion') has the prime verb.motion, and dance:2's hyponyms are a mix of verbs with the primes verb.creation and verb.motion. As dance:2's semantics is consistent with verb.motion, the semantic prime of the verb and of its hyponyms (where needed) was changed accordingly.

The majority of the shifts in the semantic primes, however, reflect specific features of the hypernym–hyponym paths – for example, the shifts between noun.substance and noun.artifact, or between noun.body and noun.animal or noun.plant, and so forth, especially in the cases of two hypernyms.

3.3 Cross-check of Primes and MSRs

Semantic restrictions on the combinations of semantic primes and MSRs were formulated after cross-checking their compatibility (with subsequent changes either to the semantic primes of the nouns and/or verbs, or to the MSR) in order to reduce the number of possible combinations of <verb.prime–noun.prime–MSR> against those from the PWN 3.0. The purpose of the procedure is to ensure the consistency of the training data.

The role Agent is associated with persons (noun.person), social entities such as organisations (noun.group), animals (noun.animal), and plants (noun.plant) that are capable of acting so as to bring about a result. Instruments are concrete man-made objects (noun.artifact), but nouns with the primes noun.communication (debugger:1) and noun.cognition (stemmer:3), which may function as instruments, are also possible.

Inanimate causes (Fellbaum et al., 2009) – non-living (and non-volitional) entities that bring about a certain effect or result – are expressed by the MSRs Body-part, Material, Vehicle, and By-means-of. A Body-part may be an inanimate cause that is an inalienable part of an actor and is expressed by nouns with the prime noun.body (rarely noun.animal or noun.plant). The relation Material denotes a subclass of inanimate causes – substances that may bring about a certain effect, e.g. inhibitor:1 ('a substance that retards or stops an activity'). Besides noun.substance, noun.artifact nouns (synthetic substances or products) also qualify for the relation, e.g. depilatory:2 (hair-removal cosmetics). The relation Vehicle represents a subclass of artifacts (means of transportation); consequently, the respective synsets have the prime noun.artifact and are generally hyponyms of the synset conveyance:3; transport:8. Inanimate causes whose semantics differ from those of the other three relations are assigned the generic relation By-means-of, e.g. geyser:2 ('a spring that discharges hot water and steam') (noun.object).

The relation Event denotes processual nominalisation and involves nouns such as noun.act, noun.event, and noun.phenomenon, and rules out concrete entities such as animate beings and natural (noun.object) or man-made (noun.artifact) objects. The relation State denotes abstract entities such as feelings, cognition, etc. The relation Undergoer denotes entities which are affected by the event or state. The relation Result involves
entities that are produced or have come into existence as a result of the event or state. The relation Property denotes various attributes and qualities. These relations involve nouns with various primes.

The relation Location denotes a concrete (natural or man-made) or an abstract location where an event takes place, and therefore relates verbs to nouns with various primes – noun.location, but also noun.object, noun.plant, noun.artifact, noun.cognition, etc. The relation Destination is associated with the primes noun.person, noun.location, and noun.artifact, which corresponds to two distinct interpretations of the relation – Recipient (noun.person) and Goal (noun.artifact, noun.location). The relation Uses denotes a function or purpose, e.g. lipstick:1 – lipstick:3; it allows nouns with various primes, both concrete and abstract.

We examined the combinations of noun primes and MSRs in the PWN 3.0 with a view to the semantic restrictions, and in some cases the MSRs were modified accordingly. For instance, some noun.body nouns had originally been assigned the relation Instrument, some noun.person nouns – Event, etc. As a result, the noun primes associated with a given MSR were reduced: for Agent from 17 to 4 (person, animal, plant, group); for Instrument from 9 to 3 (artifact, communication, cognition); for Material from 6 to 2 (artifact, substance); for State from 10 to 5 (state, feeling, attribute, cognition, communication); for Body-part from 4 to 3 (body, animal, plant); and for Event from 24 to 13 (act, communication, attribute, event, feeling, cognition, process, state, time, phenomenon, group, possession, relation). Result, Property, By-means-of, Uses, Location, and Undergoer are more heterogeneous, and few of their semantic primes were ruled out. The relations Vehicle and Destination and the corresponding semantic primes did not need any changes.

The reduction of the noun.prime–verb.prime combinations for a given MSR rules out the corresponding branches in the decision trees.

The changes made to the relations and semantic primes in these validation procedures are available at http://dcl.bas.bg/en/wordnetMSRs.

4 Training Data for the ML Task

4.1 Core data

The core training data include examples for which we are sure that an MSR exists, and we know the type of the relation. The dataset comprises a total of 6,641 literal pairs in 4,016 unique synset pairs and was compiled in two stages.

Initially, the core dataset included 6,220 instances of derivationally related verb–noun literal pairs in BulNet verb–noun synset pairs (automatically detected and manually validated as described in Dimitrova et al. (2014)) which had been assigned an MSR by automatic transfer from the PWN. We took into consideration the pairs obtained by suffixation and zero derivation.

We supplemented the core data with additional instances from BulNet extracted in the following way: (1) we identified literal pairs from BulNet which exhibited a possible DR but whose respective synsets had not been assigned an MSR; (2) after measuring the similarity of the disambiguated PWN glosses (http://wordnet.princeton.edu/glosstag.shtml) for the pairs of synsets identified in step (1), using a wordnet-based measure of text similarity (Mihalcea et al., 2006), we filtered out the low-similarity pairs (below a threshold of 2.0); and (3) the glosses of high similarity were examined for certain structural patterns in order to determine the MSR where possible (e.g., a gloss of the type 'someone who <verb, active voice>' points to Agent, and 'instrument used for <verb>ing' points to Instrument). As a result, 421 additional instances of morphosemantically related literal pairs were added to the core dataset.

4.2 Negative Examples Dataset

The task of determining whether an MSR holds between a given verb–noun pair is a binary classification task with the classes true and false. To be able to train a classifier for this task, we needed a set of examples of the class false, i.e., instances of (potentially) derivationally related verb–noun literal pairs which do not have an MSR. This can be due to various reasons: (a) one of the words has acquired an additional, usually metaphorical, meaning; (b) the similarity in the form of the noun and the verb literals is coincidental (due to historical changes in the forms, etc.) and there is no transparent DR; or (c) the relation does not fit into the pre-designed system of relations in the PWN.

The negative examples were extracted automatically from BulNet and include: (i) (potentially) derivationally related verb–noun literal pairs from synsets which have mutually exclusive
to 44 canonical verb suffixes.
In this way the number of suffix values for each
MSR is reduced, while the number of examples
per relation and pair of semantic primes increases,
thus reducing the noise in the data that arises from
the contextual suffix variants.
tic primes (i.e., not occurring among MSR pairs
in PWN) and thus cannot be semantically related, e.g., verb.weather – noun.animal; and (ii)
verb–noun literal pairs linked by a DR but not
by an MSR in BulNet which formally coincide
with pairs of literals that have an MSR in BulNet. For example, the literal gotvya is a member of the synsets gotvya:2 (cook:1 – ’transform
and make suitable for consumption by heating’,
verb.change) and gotvya:4 (prepare:6 – ’to prepare verbally, either for written or spoken delivery’, verb.creation). The noun synset gotvach:1 (cook:6 – ’someone who cooks food’,
noun.person) derived from the verb gotvya bears
an MSR (Agent) only to gotvya:2, thus the pair
gotvach:1 - gotvya:4 is extracted as a negative example.
A total of over 170,000 negative instances
(verb–noun literal pairs) were extracted from BulNet. As the number and quality of the negative
examples (and the number of training instances in
general) affect the performance of the classifier,
they usually need to be balanced against the number of positive examples and only a selection of
roughly the same number as positive data were applied in each task.
5
ML Method for Identification of MSRs
5.1
The following features were used in the analysis
of the data: (i) the canonical verb suffix; (ii) the
canonical noun suffix; (iii) the semantic prime of
the verb; and (iv) the semantic prime of the noun.
Our data are in string format, but the sets of values for both the canonical suffixes (the 121 noun and 44 verb suffixes) and the synset primes (25 semantic primes for nouns and 15 for verbs) are finite.
Additional features were also considered and tested, such as the similarity between the glosses of the verb–noun synset pair; this feature was disregarded in the end because only a limited number of instances exhibit similarity above the threshold. Instead, these examples were used to extend the training data (see section 4.1).
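The four nominal features can be pictured as follows. A minimal sketch with illustrative names (the suffix and prime values are taken from the running example, not from the authors' feature files):

```python
# Each training instance carries four finite, string-valued (nominal)
# attributes plus the class label (the MSR, or null).
from collections import namedtuple

Instance = namedtuple("Instance",
                      ["verb_suffix", "noun_suffix",
                       "verb_prime", "noun_prime", "msr"])

# One positive instance modelled on gotvya -> gotvach (Agent).
example = Instance(verb_suffix="-ya", noun_suffix="-ach",
                   verb_prime="verb.change", noun_prime="noun.person",
                   msr="Agent")

# The predictors map directly onto nominal attributes as used in Weka.
features = (example.verb_suffix, example.noun_suffix,
            example.verb_prime, example.noun_prime)
print(features)
```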
5.2
4.3
Features
Preprocessing of the Data
Implementation
The machine learning implementation is written in Java using the Weka library (Witten et al., 2011), which offers various capabilities and advanced techniques for data mining.4
We analysed and tested various classifiers from the Weka package in order to select the one best suited to the task: decision tree algorithms, the Naive Bayes classifier, the K* classifier, SMO (Sequential Minimal Optimisation), linear logistic regression, etc., as well as some complex classifiers applying several algorithms in sequence. The Naive Bayes classifier was not suitable due to data scarcity and the fact that not all combinations of feature values were covered in the data. The K* classifier relies on an entropy-based distance measure between instances and is not particularly suitable for string and nominal data. The decision tree was considered most relevant to the task. After empirically comparing several decision tree classifiers in Weka, based on performance evaluation using 10-fold cross-validation, we selected the RandomTree algorithm, which consistently outperformed the rest.
The decision tree built by the RandomTree algorithm tests a given number of random features at each node, and no pruning is performed. As a baseline, we applied to the same dataset the OneR classifier, which chooses the one parameter that best correlates with the class value so as to provide the best prediction accuracy, and which is particularly suited for discrete data.

Three approaches were considered with respect to the method of classification. The first one uses two separate classifiers applied in sequence: first, a binary classifier that identifies pairs of derivationally related verb–noun literals in synsets linked via an MSR, and then a multiclass classifier that selects the type of relation. The second approach merges the above two classifiers and applies a single multiclass classifier to assign MSRs, where the set of classes includes an additional value null for the instances which do not have an MSR. The third method combines a set of separate binary classifiers, one for each of the 14 MSRs; a verb–noun pair can be assigned more than one relation, or none (in the latter case the pair is considered unrelated). The results are presented in the following section.

4.3   Preprocessing of the Data

The Bulgarian synsets connected with MSRs from PWN were processed using previously proposed methods and datasets. The derivationally related literal pairs found in the MSR-linked synsets were assigned an appropriate DR, following Dimitrova et al. (2014). The particular derivational devices were automatically established and manually validated, and the variants of the affixes (suffixes in particular) were associated with a canonical suffix form, as proposed in Leseva et al. (2014).

As a first step, the word endings of each pair of verb–noun literals were identified by removing the common substring (base) shared by the two literals. In order to discard pairs that coincide in form by chance, the base was required to be at least 75% of each literal's length. Secondly, as the endings usually do not coincide with a literal's suffix (they may also include part of the literal's root or stem), they were mapped to the canonical forms of the suffixes using lists of suffixes with their contextual variants. The training data contain 294 different noun endings, which were mapped to 121 canonical noun suffixes, and 172 verb endings, which were mapped to 44 canonical verb suffixes.

4 http://www.cs.waikato.ac.nz/ml/weka/

5.3   Experiments

Test 1. The first experiment tests the performance of the approach which first discovers whether a verb–noun pair has an MSR and subsequently applies a multiclass classifier to assign a particular relation to the pair. The core dataset extended with negative examples is used as training data for the binary classifier, and the classes are 'true' (there is an MSR) and 'false' (no MSR). The RandomTree classifier shows an F1 score of 0.815 (against a baseline of 0.687) using 10-fold cross-validation.

The multiclass classifier is trained on the core dataset, and the classes are represented by the 14 MSRs. Its F1 score on 10-fold cross-validation is 0.842 (baseline 0.808) but varies considerably across classes: from as high as 0.975 for Agent down to 0.333 for By-means-of (relations with fewer than 10 examples in the data are not considered reliable).

The F1 score of the overall method is 0.682, since errors propagate from one phase to the other. The results also show that for certain MSRs the OneR algorithm performs slightly better than RandomTree (usually RandomTree outperforms OneR by more than 25%), which suggests that a more complex approach combining case-specific classifiers may prove more reliable.

Test 2. The second experiment tests a classifier with a list of 15 classes: the 14 MSRs and the class null used to label instances with no MSR. The training data include the core dataset supplemented with a limited number (6,700) of randomly selected negative examples. The results from the 10-fold cross-validation show an F1 score of 0.769 (baseline 0.654), which is significantly better than the results in Test 1. The performance again varies across relations: the highest rates are for true negatives (0.811), State (0.809), Agent (0.788), etc. In this case the RandomTree classifier significantly outperforms the baseline for all relations.

The experiment raises the question whether the negative data should be selected at random, or whether the training data should conform to certain selection criteria aiming at representativeness of the patterns and varieties in terms of feature values and the combinations between them. Tests in this direction might be considered in the future.

Test 3. The third test examines the performance of a complex classifier combining a set of separate binary classifiers, one for each type of relation between a noun and a verb: there is a binary classifier (true/false) for Agent, another for Undergoer, etc. This method allows the assignment of more than one relation to a given pair. In this way we can observe when uncertainty or ambiguity occurs and look for ways to tackle it. When no relation is assigned, the pair is considered unrelated. The core dataset was used for training the model. In this case, for each MSR, the subset of this relation's instances constitutes the positive dataset, and the subset of instances of the other relations serves as the set of negative examples.

If we look for exact matches, the results are lower: the F1 score varies from 0.81 (Agent, Event) down to 0.30-0.35 (Result, By-means-of, etc.). But since in this method more than one MSR can be assigned, we can evaluate whether the correct relation is in the set of assigned relations.

The method was also tested on a dataset of 300 new examples having a DR or formally coinciding with a DR, independently extracted from BulNet (not used in the training data), preprocessed, and with their class (or lack of an MSR) manually verified. Using the complex classifier, we obtained the following results: (i) exact matches account for 64.00%; (ii) in another 3.33% the real class is contained in the set of guessed relations; (iii) 28.33% of the test instances are labelled as null while in fact they have an MSR; and (iv) the remaining 4.33% comprise incorrectly assigned relations.

Test                    Baseline (OneR)   RandomTree
Test 1
  MSR true-false        0.687             0.815
  Type of MSR           0.808             0.842
  Overall               0.498             0.682
Test 2                  0.654             0.769
Test 3
  Exact MSR             0.653             0.713
  MSR in set            0.699             0.746
  Reclassify null       0.710             0.781

Table 1: Evaluation results: F1 score on the 10-fold cross-validation in Tests 1-3.

The large number of instances incorrectly labelled as null (28.33%) points to the need either to introduce more features to fine-tune the classifier, or to apply an additional classifier to these data using a different method and merge the results. We ran a second classifier on all data labelled by the first classifier as null, using only the noun semantic prime as a feature, in order to assign the most probable relation according to the semantic prime of the noun. In this case the precision increased to 78.13% by taking the most frequent relation associated with each noun prime. However, in this case we assign an MSR to all test instances and thus mislabel the true negatives correctly recognised by the first classifier. A more fine-tuned method and feature design, as well as training on different sets/features in each phase, may be more effective.

5.4   Follow-up

In further tests we experimented with variations in the data, i.e., the addition of new training instances exhibiting specific features. To this end, we assigned a second semantic prime to the synsets which either have two hypernyms (with two different semantic primes) and inherit the prime of only one of the two, or have a hypernym with another, different semantic prime which does not clash with the semantic prime of the hyponym (see the observations in 3.2). The purpose was to test whether the inherited semantic prime impacts the result. For instance, the assignment of a second prime noun.substance to synsets denoting synthetic substances or raw materials (noun.artifact) is expected to make the data more consistent, as these noun.artifact synsets behave like substances as regards the choice between certain relations, e.g., Material and Instrument. At present this shows only an insignificant increase in precision, due to the small amount of data affected. However, as the training data grow in the future, the number of added instances may grow as well, which can potentially yield a significant improvement.

The observations on the constructed decision trees also show that the features are insufficient to fully distinguish between the different MSRs, as the tree structures are too shallow to achieve better results. By introducing more features, we can also test the RandomForest classification method, which requires more features in order to construct a properly sized forest of RandomTree classifiers and usually outperforms the single RandomTree method. If several learning schemes are available, it may be advantageous not to choose the best-performing one for a dataset but to use all of them and merge the results.

6   Conclusion and Future Work

Our future work will focus on enhancing the method by exploring at least two mutually related directions: (i) automatic harvesting of more labelled data from other wordnets; and (ii) the incorporation of new features for the classification and assignment of relations, including heuristics derived from the WordNet structure.

Alongside the introduction of new features, it is necessary to develop techniques for reducing redundant features, as well as for correlation-based feature selection, feature ranking or principal component analysis.

We have devised experiments to extend the datasets with more data for English and Romanian. The multilingual data can contribute to the training with respect to the possible pairs of verb–noun primes and the relevant semantic restrictions. While part of the information employed in this paper, such as the suffix lists and the mappings from word endings to canonical suffixes, is language specific, the proposed method is language independent, including the linguistic processing of the data. Testing it on other languages is a task we envisage for the future.
References
Verginica Barbu Mititelu. 2012. Adding morpho-semantic relations to the Romanian Wordnet. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 2596–2601.

Verginica Barbu Mititelu. 2013. Increasing the effectiveness of the Romanian Wordnet in NLP applications. Computer Science Journal of Moldova, 21(3):320–331.

Orhan Bilgin, Ozlem Cetinoglu, and Kemal Oflazer. 2004. Morphosemantic relations in and across Wordnets – a study based on Turkish. In Proceedings of the Second Global Wordnet Conference (GWC 2004), pages 60–66.

Tsvetana Dimitrova, Ekaterina Tarpomanova, and Borislav Rizov. 2014. Coping with derivation in the Bulgarian WordNet. In Proceedings of the Seventh Global Wordnet Conference (GWC 2014), pages 109–117.

Christiane Fellbaum, Anne Osherson, and Peter E. Clark. 2009. Putting semantics into WordNet's "morphosemantic" links. In Proceedings of the Third Language and Technology Conference, Poznan, Poland. [Reprinted in: Responding to Information Society Challenges: New Advances in Human Language Technologies. Springer Lecture Notes in Informatics], volume 5603, pages 350–358.

Neeme Kahusk, Kadri Kerner, and Kadri Vider. 2010. Enriching Estonian WordNet with derivations and semantic relations. In Proceedings of the 2010 Conference on Human Language Technologies – The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010, pages 195–200, Amsterdam, The Netherlands. IOS Press.

Svetla Koeva, Cvetana Krstev, and Dusko Vitas. 2008. Morpho-semantic relations in Wordnet – a case study for two Slavic languages. In Proceedings of the Fourth Global WordNet Conference (GWC 2008), pages 239–254.

Svetla Koeva. 2008. Derivational and morphosemantic relations in Bulgarian Wordnet. Intelligent Information Systems, pages 359–368.

Svetlozara Leseva, Ivelina Stoyanova, Borislav Rizov, Maria Todorova, and Ekaterina Tarpomanova. 2014. Automatic semantic filtering of morphosemantic relations in WordNet. In Proceedings of CLIB 2014, Sofia, Bulgaria, pages 14–22.

Svetlozara Leseva, Maria Todorova, Tsvetana Dimitrova, Borislav Rizov, Ivelina Stoyanova, and Svetla Koeva. 2015. Automatic classification of wordnet morphosemantic relations. In Proceedings of BSNLP 2015, Hissar, Bulgaria, pages 59–64.

Rada Mihalcea, Courtney Corley, and Carlo Strapparava. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the 21st National Conference on Artificial Intelligence – Volume 1, AAAI'06, pages 775–780. AAAI Press.

Karel Pala and Dana Hlaváčková. 2007. Derivational relations in Czech WordNet. In Proceedings of the Workshop on Balto-Slavonic Natural Language Processing, pages 75–81.

Maciej Piasecki, Stanislaw Szpakowicz, and Bartosz Broda. 2009. A Wordnet from the Ground Up. Wroclaw: Oficyna Wydawnicza Politechniki Wroclawskiej.

Maciej Piasecki, Radoslaw Ramocki, and Marek Maziarz. 2012a. Automated generation of derivative relations in the Wordnet expansion perspective. In Proceedings of the 6th Global Wordnet Conference (GWC 2012), pages 273–280.

Maciej Piasecki, Radoslaw Ramocki, and Pawel Minda. 2012b. Corpus-based semantic filtering in discovering derivational relations. In A. Ramsay and G. Agre, editors, Artificial Intelligence: Methodology, Systems, and Applications – 15th International Conference, AIMSA 2012, Varna, Bulgaria, September 12–15, 2012. Proceedings. LNCS 7557, pages 14–22. Springer.

Ivelina Stoyanova, Svetla Koeva, and Svetlozara Leseva. 2013. Wordnet-based cross-language identification of semantic relations. In Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, pages 119–128.

Krešimir Šojat and Matea Srebačić. 2014. Morphosemantic relations between verbs in Croatian WordNet. In Proceedings of the Seventh Global WordNet Conference, pages 262–267.

Ian Witten, Eibe Frank, and Mark Hall. 2011. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition.
Tuning Hierarchies in Princeton WordNet
Ahti Lohk
Department of Informatics
Tallinn University of Technology
Tallinn, Estonia
[email protected]
Christiane D. Fellbaum
Department of Computer
Science
Princeton University
New Jersey, USA
[email protected]
Leo Võhandu
Department of Informatics
Tallinn University of Technology
Tallinn, Estonia
[email protected]

Abstract

Many new wordnets are being created around the world, and most take the original Princeton WordNet (PWN) as their starting point. This arguably central position imposes a responsibility on PWN to ensure that its structure is clean and consistent. To validate the PWN hierarchical structures we propose the application of a system of test patterns. In this paper, we report on how to validate the PWN hierarchies using the system of test patterns. In sum, test patterns provide lexicographers with a very powerful tool, which we hope will be adopted by the global wordnet community.

1   Introduction and background

Many new wordnets are being created around the world, and most take the original Princeton WordNet (PWN) as their starting point. This arguably central position imposes a responsibility on PWN to ensure that its structure is clean and consistent. This is particularly true for hierarchical relations, which are the most frequently encoded relations and which form the backbone of the network. To validate the PWN hierarchical structures we propose the application of a system of test patterns developed in (Lohk, 2015). Importantly, all instances returned by the test pattern system were manually validated by two members of the Estonian Wordnet (EstWN) team (Kadri Vare and Heili Orav). The results were encouraging, and we applied the algorithms to PWN. We propose that other wordnets apply the algorithm to their own resources and, after a couple of iterations, compare their structures with that of PWN, which can serve as a kind of Gold Standard for wordnets. Alternatively, the analysis is commercially available from the first author.

In this paper we report on how to validate the PWN hierarchies using the system of test patterns. A test pattern is a description of a specific substructure in the wordnet hierarchy. The system of test patterns and the descriptions of all the patterns can be found in (Lohk, 2015). This system consists of ten test patterns that all involve multiple inheritance, an important property that can point to different semantic inaccuracies going back to lexicographic errors. Because it is semantic, every test pattern applies cross-lingually and sheds new light on wordnets by examining their hierarchies and helping to detect and correct possible errors.

These patterns were used to validate the semantic hierarchies of the Estonian Wordnet over four years (2011–2014) and ten versions. During this time the structure of the Estonian Wordnet changed significantly, as described in Section 3.

The aim of this paper is to show that the same specific substructures that have been found in the Estonian Wordnet also exist in Princeton WordNet. Moreover, some experiments on Princeton WordNet confirm the promising benefits of test pattern application (Section 4). We therefore propose test patterns as a method for validating and tuning the hierarchies of PWN and all other wordnets.

This paper is structured as follows: Section 2 provides an overview of the validation methods applied to wordnet hierarchies. Section 3 presents the results of using the test patterns iteratively on EstWN. Section 4 demonstrates that the same pattern instances can be found in PWN as well as in other wordnets, and describes some experiments. We close with a conclusion and proposals for future work.
2   State of the art in validating the semantic hierarchies of wordnet

To give a better understanding of the test patterns approach, we provide a short overview of the validation methods applied to the semantic hierarchies of wordnets. (Lohk, 2015) argues that the methods can be divided into three groups based on two features, as shown in Table 1. These features can be formulated as the following questions: do the methods rely on corpus data and lexical resources? Do they make use of the contents of a synset?

Group of methods   use of corpus data,   use the contents
                   lexical resources     of a synset
Group I            +                     +
Group II           –                     +
Group III          –                     –

Table 1: Features that classify the groups of validation methods

Group I comprises all methods based on lexical resources and corpora; group II includes rules or rule-based methods; group III consists of graph-based methods.

2.1   Corpus-based methods

The most frequently used validation methods for wordnet hierarchies rely on corpora and lexical resources. Different techniques for extracting the relevant information have been applied. Some of the well-known approaches include:
- Lexico-syntactic patterns (Hearst, 1992), (Nadig et al., 2008)
- Similarity measurements (Sagot and Fišer, 2012)
- Mapping and comparing to wordnet (Pedersen et al., 2013)
- Applying wordnet in NLP tasks (Saito et al., 2002)

The resources used in this group of methods are:
- Monolingual text corpora (Sagot and Fišer, 2012)
- Bilingual aligned corpora (Krstev et al., 2003)
- Monolingual explanatory dictionaries (Nadig et al., 2008)
- Wordnets (Peters et al., 1998; Pedersen et al., 2012)
- Ontologies (Gangemi et al., 2002)

2.2   Rule-based methods

These methods for validating hierarchies rely on lexical relations (word–word), semantic relations (concept–concept) and the rules among them. This includes the rules applied in the construction of WordNet (Fellbaum, 1998), as well as additional rules, such as the following:
- Metaproperties (rigidity, identity, unity and dependence) described in ontology construction (Guarino and Welty, 2002)
- Top Ontology concepts or "unique beginners" (Object, Substance, Plant, Comestible, …) (Atserias et al., 2005; Miller, 1998)
- Specific rules for particular error detection (Gupta, 2002; Nadig et al., 2008). For instance, a rule proposed by (Nadig et al., 2008): "If one term of a synset X is a proper suffix of a term in a synset Y, X is a hypernym of Y"

2.3   Graph-based methods

These methods are purely formal and do not take into account the semantics of word forms. Specific substructures of a wordnet's hierarchies are checked and validated. Target substructures include:
- Cycles (Šmrz, 2004), (Kubis, 2012)
- Shortcuts (Fischer, 1997)
- Rings (Liu et al., 2004; Richens, 2008)
- Dangling uplinks (Koeva et al., 2004; Šmrz, 2004)
- Orphan nodes (null graphs) (Čapek, 2012)
- Small hierarchies (Lohk et al., 2014c)
- Unique beginners (Lohk et al., 2014c)
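A purely structural check of this kind can be sketched in a few lines. The following toy illustration (not the authors' implementation) detects a shortcut: a synset linked to a hypernym both directly and via a longer path, which makes the direct link redundant. The hypernym graph is modelled on the {group action} example discussed later in the paper.

```python
# Toy hypernym graph: child -> list of direct parents.
hypernyms = {
    "group action": ["act", "event"],   # direct link to "event" ...
    "act": ["event"],                   # ... plus an indirect path via "act"
    "event": [],
}

def reachable(graph, start, goal, skip_direct=False):
    """Is goal reachable from start, optionally ignoring the direct edge?"""
    stack = [p for p in graph[start] if not (skip_direct and p == goal)]
    seen = set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

def shortcuts(graph):
    """All (child, parent) edges that are also reachable indirectly."""
    return [(child, parent)
            for child, parents in graph.items()
            for parent in parents
            if reachable(graph, child, parent, skip_direct=True)]

print(shortcuts(hypernyms))  # [('group action', 'event')]
```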
In addition, (Lohk, 2015) proposed different, previously undiscovered substructures and showed that applying them to validate the semantic hierarchies of a wordnet may improve the wordnet structure significantly. When such specific substructures are used in wordnet assessment, they are called test patterns. Next, we explain the idea of test patterns and demonstrate their efficient use with the Estonian Wordnet.
3   A case study: applying test patterns to Estonian Wordnet

Since 2011, different types of test patterns have been developed and applied progressively to EstWN. Currently, ten test patterns exist. For
every test pattern we implemented a program to find the relevant instances. Four programs are implemented for semi-automatic application (closed subsets, closed subset with a root, the largest closed subset and connected roots) and six for automatic use (the test patterns shown in italics in Table 2). Instances found with the test patterns for semi-automatic application have been discussed elsewhere (Lohk, 2015). Instances found with the test patterns for automatic use are employed in the process of constant validation.

Version  Noun   Verb   Multiple     Shortcut  Ring   Synset     Heart-shaped  Dense      "Compound"
         roots  roots  inheritance                   with many  substructure  component  pattern
                       cases                         roots
60       142    24     1,296        235       3,445  1,123      1,825         104        301
61       183    22     1,592        259       3,560  1,309      1,861         121        380
62       102    16     1,700        299       3,777  1,084      1,941         128        415
63       114    16     1,815        321       3,831  1,137      2,103         141        447
64       149    15     1,893        337       3,882  1,173      2,232         149        471
65       248    14     1,717        194       2,171  791        451           132        459
66       144    4      1,677        119       1,796  613        259           121        671
67       129    4      1,164        79        928    477        167           24         407
68       131    4      691          60        537    232        38            18         54
69       121    4      102          18        291    35         1             8          23
70       118    4      51           7         21     30         0             3          7
Table 2: A numerical overview of EstWN spanning eleven versions
Table 2 shows the number of instances that each
test pattern returned after its automatic application. The first two patterns (shortcut and ring) are
inspired by (Fischer, 1997; Liu et al., 2004; Richens, 2008). There are also some cases of synset
with many roots, called dangling uplinks in
(Koeva et al., 2004) and (Šmrz, 2004). Bold font in the table marks the version in which a test pattern's instances were given to a lexicographer for verification. For example, the "shortcut" instances were each manually verified by lexicographers in the 63rd version of EstWN; the effect, as reflected in the next version, can be clearly seen in the table. It is clear that the application of the heart-shaped substructure and dense component patterns had a considerable effect on the lexicography.
As all instances of the test patterns involve multiple inheritance, the fourth column (Multiple inheritance cases) demonstrates the influence of using the test patterns most clearly. For example, a comparison between versions 66 and 70 shows that the number of cases has gone down about 32-fold (97%). Note that of the 118 noun hierarchies, about 75% are shallow hierarchies whose roots are connected to only one level of subordinates.
According to (Lohk, 2015), over ten versions of EstWN the most frequent correction operation has been the removal of hypernymy and hyponymy relations (21,911 times). Secondly, the lexical units in synsets were changed 5,344 times (including deleted and added lexical units). Thirdly, 4,122 times a hypernymy or hyponymy relation was replaced by another semantic relation, mainly by near-synonymy and fuzzynymy.
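The reduction figures quoted above follow directly from the Table 2 counts:

```python
# Multiple-inheritance cases in EstWN (Table 2): version 66 vs. version 70.
cases_v66, cases_v70 = 1677, 51

fold_reduction = cases_v66 / cases_v70            # "about 32 times"
percent_drop = (1 - cases_v70 / cases_v66) * 100  # "97%"
print(round(fold_reduction, 1), round(percent_drop))  # 32.9 97
```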
4   Validating Princeton WordNet
Substructures connected with multiple inheritance have been used to validate PWN before. Fischer (1997), Liu et al. (2004) and Richens (2008) examined shortcuts and rings; Koeva et al. (2004) and Šmrz (2004) examined dangling uplinks. There are also some examples of closed subsets in (Lohk et al., 2012) and one example of a heart-shaped substructure in (Lohk and Võhandu, 2014). Lohk gave an example of a connected roots case in his poster presentation at the Estonian Applied Linguistics Conference in Tallinn in April 2013.
Next we demonstrate some examples of test pattern instances to show their structure and how they may help to discover specific inconsistencies in the PWN semantic hierarchies. A complete overview of the test patterns is given in the dissertation of the first author (Lohk, 2015).

4.1   Shortcut

A shortcut is a pattern in which a synset (in Figure 1, {event}) is simultaneously connected to another synset ({group action}) directly and indirectly. In that case, {group action} is not an ambiguous concept; it just contains a redundant link (dotted line).

Figure 1. An instance of shortcut, PWN (version 3.1): {event} 'something that happens at a given place and time'; {act, deed, …} 'something that people do or cause to happen'; {group action} 'action taken by a group of people'.

4.2   Heart-shaped substructure

In a heart-shaped substructure, two nodes (in Figure 2, {hard drug} and {cannabis, …}) have a direct connection through an identical parent ({controlled substance}) and an indirect connection through a semantic relation ({soft drug} – {narcotic}) that links their second parents.

In the case of PWN we have seen that instances of the heart-shaped substructure tend to show cases where hypernymy is used instead of a role or type relation. An example of that kind is presented in Figure 2, where {hard drug} is actually a certain type of {narcotic} as well as in the role of a {controlled substance}.

It is remarkable that the first time the heart-shaped substructure was used in EstWN the number of its instances was 451 (see Table 2), and five versions later it was 0. Moreover, during the correction operations no hypernymy/hyponymy relation was changed into a role or type relation (Lohk, 2015).

Figure 2. An instance of heart-shaped substructure, PWN (version 3.1): {narcotic} 'a drug that produces numbness or stupor; often taken for pleasure or to reduce pain'; {soft drug} 'a drug of abuse that is considered relatively mild and not likely to cause addiction'; {controlled substance} 'a drug or chemical substance whose possession and use are controlled by law'; {hard drug} 'a narcotic that is considered relatively strong and likely to cause addiction'; {cannabis, marijuana, …} 'the most commonly used illicit drug'.

4.3   "Compound" pattern

The "compound" pattern is an exception among the test patterns in that it considers the content of synsets. More precisely, this kind of substructure satisfies the following two conditions. First, it contains a case where a lexical unit of a superordinate (in Figure 3, {ball}) is connected to at least two subordinates (1-{baseball}, 2-{basketball}, …, 24-{volleyball}) which contain that lexical unit (ball). Second, at least one subordinate has an extra superordinate ({baseball equipment}, {basketball equipment}, …, {golf equipment}).

Figure 3. An instance of the "compound" pattern, PWN (version 3.1): {ball} 'round object that is hit or thrown or kicked in games' with subordinates 1-{baseball} 'a ball used in playing baseball', 2-{basketball} 'an inflated ball used in playing basketball', 3-{cricket ball} 'the ball used in playing cricket', 4-{croquet ball} 'a wooden ball used in playing croquet', 5-{golf ball} 'a small hard ball used in playing golf', …, 9-{football} 'the inflated oblong ball used in playing American football', …, 24-{volleyball} 'an inflated ball used in playing volleyball'; subordinates 1 to 5 have the extra superordinates {baseball equipment}, {basketball equipment}, {cricket equipment}, {croquet equipment} and {golf equipment} 'sports equipment used in playing …'.

To validate the kind of instance shown in Figure 3, the lexicographer has to ask why subordinates 1 to 5 have an extra superordinate and why the same does not hold for subordinates 6 to 24. Studying this figure more carefully, we see that {basketball} is a {basketball equipment}. However, {football} and {volleyball}, although quite similar in their definitions, do not follow the same logic.
That is to say, {football} and {volleyball} are not
equipment.
care} is a {beauty treatment} beside the {aid, attention, care, …}.
4.4   Dense component
The dense component pattern provides the opportunity to uncover substructures where, due to multiple inheritance, the density of the interrelated concepts in the semantic hierarchy is higher (Lohk et al., 2014a), (Lohk et al., 2014b). This substructure (subgraph) consists of two synsets (nodes) (in Figure 4, {manicure} and {pedicure}) with at least two identical parents ({beauty treatment} and {aid, attention, care, …}); it corresponds to a complete bipartite graph. The overall size of an instance of a dense component depends on how many synsets (nodes) with at least two parents are interconnected through multiple inheritance and/or the same parents (Lohk, 2015).
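The core condition, two synsets sharing at least two parents (a complete bipartite subgraph K(2,2)), can be sketched as follows. A toy illustration modelled on Figure 4, not the authors' implementation:

```python
# Toy data: synset -> set of direct parents.
from itertools import combinations

parents = {
    "manicure": {"beauty treatment", "aid, attention, care"},
    "pedicure": {"beauty treatment", "aid, attention, care"},
    "facial":   {"beauty treatment"},
}

def dense_pairs(parents):
    """Pairs of nodes sharing at least two parents, with the shared set."""
    return [(a, b, parents[a] & parents[b])
            for a, b in combinations(sorted(parents), 2)
            if len(parents[a] & parents[b]) >= 2]

for a, b, shared in dense_pairs(parents):
    print(a, b, sorted(shared))
```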
4.5   Connected roots
The connected roots test pattern involves different hierarchies connected through multiple inheritance cases. This pattern helps to see how big and deep the connections between POS hierarchies are. Every node that acts as a unique beginner is annotated with the number of hierarchy levels and the number of subordinates in its hierarchy (Figure 5). The first number of an edge label indicates the number of common subordinates of the two hierarchies. The next two numbers, separated by "|", denote the hierarchy levels at which the first common concept is located in the two hierarchies.
Figure 5. An instance of connected roots, PWN (version 3.1)
In Figure 5, there is only one large hierarchy, with the unique beginner {entity}. It heads a hierarchy of 19 levels and 74,023 subordinates. By contrast, the two hierarchies ({South_1} and {Spain_1, ...}) are very small: each dominates only one additional level. The edge labels reveal that the common concepts of the hierarchies lie on the first lower level of each of the two smaller hierarchies. Both unique beginners ({South_1} and {Spain_1}) seem too specific to be the highest concepts.
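The root statistics and edge labels of this pattern are easy to compute once hypernymy is available as a graph. The sketch below is illustrative only; the "Alabama" and "Madrid" leaves are made-up stand-ins for the concepts shared between hierarchies in Figure 5.

```python
# Toy forest loosely modelled on Figure 5: hypernymy as child -> parents.
parents = {
    "Alabama": {"entity-branch", "South"},  # concept shared by two hierarchies
    "entity-branch": {"entity"},
    "Madrid": {"entity", "Spain"},
}

# Invert to parent -> children for downward traversal.
children = {}
for c, ps in parents.items():
    for p in ps:
        children.setdefault(p, set()).add(c)

# Roots (unique beginners): synsets that never appear as a child.
roots = {s for ps in parents.values() for s in ps} - set(parents)

def subordinates(node):
    """All concepts below node in the hierarchy."""
    out, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

# Edge label of the connected-roots pattern: the number of
# subordinates two hierarchies have in common.
for r1, r2 in [("South", "entity"), ("Spain", "entity")]:
    print(r1, "&", r2, "share", len(subordinates(r1) & subordinates(r2)))
```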
Table 3 compares the structure of PWN with that of other wordnets.
Figure 4. An instance of dense component, PWN (version 3.1)
In Figure 4, the dense component pattern is emphasized with bold lines. Since this substructure contains at least two multiple inheritance cases, we see it as a case of regular multiple inheritance. The aim of the dense component is thus to help detect whether this regularity is justified or, conversely, whether it has to be expanded.
In the case of Figure 4, the regularity of multiple inheritance has to be expanded. Two reasons for that are the concepts {facial} and {hair care, ...}. In addition to {beauty treatment}, {facial} also fits under {aid, attention, care, ...}. Moreover, {hair care} is a {beauty treatment} besides being an {aid, attention, care, ...}.
4.6 Short numerical overview of the test patterns' instances
In Table 3, it is easy to see that the wordnets are very different. Finnish Wordnet was manually translated from PWN (Lindén and Niemi, 2014), so it is not surprising that the first two rows are essentially identical.
The table shows a clear need for a deep structural analysis of all wordnets. Of course, it must be remembered that the hierarchies of different
languages will never show a one-to-one correspondence, as the lexicons necessarily differ.

Version | Noun roots | Verb roots | Multiple inheritance cases | Short cut | Ring | Synset with many roots | Heart-shaped substructure | Dense component | "Compound" pattern
Princeton WordNet, v3.0 | 12 | 334 | 1,453 | 40 | 2,991 | 18 | 155 | 115 | 358
Finnish Wordnet, v2.0 | 12 | 334 | 1,453 | 40 | 2,991 | 18 | 155 | 115 | 394
Cornetto, v2.0 | 2 | 2 | 2,438 | 351 | 5,309 | 62 | 1,226 | 217 | 549
Polish Wordnet, v2.0 | 637 | 42 | 10,942 | 553 | 57,887 | 205,254 | 5,037 | 778 | 541
Estonian Wordnet, v70 | 118 | 4 | 51 | 7 | 21 | 30 | 0 | 3 | 7
Table 3: Five wordnets in comparison
5 Conclusions

Test patterns are a unique form of validating hierarchies. They are not language-specific and can be applied cross-lingually. Their value lies in aiding lexicographers to detect and correct errors and thus provide more accurate resources.

Every test pattern has the property of multiple inheritance. In most cases, lexical polysemy lies behind the multiple inheritance, except for the pattern of short cut (Sec. 4.1).

Multiple inheritance is not always wrong. However, PWN still contains many cases where the hypernymy relation has been used instead of a role or type relation. This is one reason why multiple inheritance cases sometimes appear in PWN (see Figure 2).

In sum, the analysis of wordnet structures using test patterns provides lexicographers with a very powerful tool, which we hope will be adopted by the global wordnet community.
References

Atserias Batalla, J., Climent Roca, S., Moré López, J., Rigau Claramunt, G., 2005. A Proposal for a Shallow Ontologization of WordNet. Procesamiento del Lenguaje Natural 35, 161–167.

Čapek, T., 2012. SENEQA – System for Quality Testing of Wordnet Data, in: Proceedings of the 6th International Global Wordnet Conference. Toyohashi University of Technology, Matsue, Japan, pp. 400–404.

Fellbaum, C., 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA.

Fischer, D.H., 1997. Formal Redundancy and Consistency Checking Rules for the Lexical Database WordNet 1.5, in: Workshop Proceedings on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. Association for Computational Linguistics (ACL), Madrid, Spain, pp. 22–31.

Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., Schneider, L., 2002. Sweetening Ontologies with DOLCE, in: Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web. Springer, pp. 166–181.

Guarino, N., Welty, C., 2002. Evaluating Ontological Decisions with OntoClean. Communications of the ACM 45, 61–65.

Gupta, P., 2002. Approaches to Checking Subsumption in GermaNet, in: Proceedings of the 3rd International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), Las Palmas, Canary Islands, Spain, pp. 8–13.

Hearst, M.A., 1992. Automatic Acquisition of Hyponyms from Large Text Corpora, in: Proceedings of the 14th Conference on Computational Linguistics – Volume 2, COLING '92. Association for Computational Linguistics (ACL), Stroudsburg, PA, USA, pp. 539–545.

Koeva, S., Mihov, S., Tinchev, T., 2004. Bulgarian Wordnet – Structure and Validation. Romanian Journal of Information Science and Technology 7, 61–78.

Krstev, C., Pavlović-Lažetić, G., Obradović, I., Vitas, D., 2003. Corpora Issues in Validation of Serbian Wordnet, in: Matoušek, V., Mautner, P. (Eds.), Text, Speech and Dialogue, Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp. 132–137.

Kubis, M., 2012. A Query Language for WordNet-Like Lexical Databases, in: Pan, J.-S., Chen, S.-M., Nguyen, N.T. (Eds.), Intelligent Information and Database Systems, Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp. 436–445.

Liu, Y., Yu, J., Wen, Z., Yu, S., 2004. Two Kinds of Hypernymy Faults in WordNet: the Cases of Ring and Isolator, in: Proceedings of the 2nd Global Wordnet Conference. Brno, Czech Republic, pp. 347–351.

Lohk, A., 2015. A System of Test Patterns to Check and Validate the Semantic Hierarchies of Wordnet-type Dictionaries. PhD thesis, Tallinn University of Technology, Tallinn, Estonia.

Lohk, A., Allik, K., Orav, H., Võhandu, L., 2014a. Dense Component in the Structure of Wordnet, in: Proceedings of the 9th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), Reykjavik, Iceland, pp. 1134–1139.

Lohk, A., Norta, A., Orav, H., Võhandu, L., 2014b. New Test Patterns to Check the Hierarchical Structure of Wordnets, in: Information and Software Technologies. Springer, pp. 110–120.

Lohk, A., Orav, H., Võhandu, L., 2014c. Some Structural Tests for WordNet with Results, in: Proceedings of the 7th Global Wordnet Conference, pp. 313–317.

Lohk, A., Vare, K., Võhandu, L., 2012. First Steps in Checking and Comparing Princeton Wordnet and Estonian Wordnet, in: Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH. Association for Computational Linguistics (ACL), pp. 25–29.

Lohk, A., Võhandu, L., 2014. Independent Interactive Testing of Interactive Relational Systems, in: Gruca, D.A., Czachórski, T., Kozielski, S. (Eds.), Man-Machine Interactions 3, Advances in Intelligent Systems and Computing. Springer International Publishing, pp. 63–70.

Miller, G.A., 1998. Nouns in WordNet, in: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA, pp. 24–45.

Nadig, R., Ramanand, J., Bhattacharyya, P., 2008. Automatic Evaluation of WordNet Synonyms and Hypernyms, in: Proceedings of ICON-2008: 6th International Conference on Natural Language Processing. CDAC Pune, India.

Pedersen, B.S., Borin, L., Forsberg, M., Kahusk, N., Lindén, K., Niemi, J., Nisbeth, N., Nygaard, L., Orav, H., Rögnvaldsson, E., et al., 2013. Nordic and Baltic Wordnets Aligned and Compared Through "WordTies", in: Proceedings of the 19th Nordic Conference of Computational Linguistics. Linköping University Electronic Press, Oslo University, Norway, pp. 147–162.

Pedersen, B.S., Forsberg, M., Borin, L., Lindén, K., Orav, H., Rögnvaldsson, E., 2012. Linking and Validating Nordic and Baltic Wordnets, in: Proceedings of the 6th International Global Wordnet Conference. Matsue, Japan, pp. 254–260.

Peters, W., Peters, I., Vossen, P., 1998. Automatic Sense Clustering in EuroWordNet, in: Proceedings of the 1st International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), Granada, Spain, pp. 409–416.

Richens, T., 2008. Anomalies in the WordNet Verb Hierarchy, in: Proceedings of the 22nd International Conference on Computational Linguistics – Volume 1. Association for Computational Linguistics (ACL), pp. 729–736.

Sagot, B., Fišer, D., 2012. Cleaning Noisy Wordnets, in: Proceedings of the 8th International Conference on Language Resources and Evaluation. European Language Resources Association (ELRA), Istanbul, Turkey, pp. 23–25.

Saito, J.-T., Wagner, J., Katz, G., Reuter, P., Burke, M., Reinhard, S., 2002. Evaluation of GermaNet: Problems Using GermaNet for Automatic Word Sense Disambiguation, in: Proceedings of the LREC Workshop on WordNet Structure and Standardization and How These Affect WordNet Applications and Evaluation. European Language Resources Association (ELRA), Las Palmas, Canary Islands, Spain, pp. 14–29.

Šmrz, P., 2004. Quality Control and Checking for Wordnet Development: A Case Study of BalkaNet. Romanian Journal of Information Science and Technology 7, 173–181.
Experiences of Lexicographers and Computer Scientists in Validating
Estonian Wordnet with Test Patterns
Ahti Lohk
Tallinn University of Technology
Akadeemia tee 15a
Tallinn, Estonia
[email protected]
Heili Orav
University of Tartu
Liivi 2
Tartu, Estonia
[email protected]
Leo Võhandu
Tallinn University of Technology
Akadeemia tee 15a
Tallinn, Estonia
[email protected]
Kadri Vare
University of Tartu
Liivi 2
Tartu, Estonia
[email protected]
Abstract

New concepts and semantic relations are constantly added to Estonian Wordnet (EstWN) to increase its size. In addition, with the use of test patterns, the validation of EstWN hierarchies is also performed. This parallel work was carried out over the past four years (2011–2014) with 10 different EstWN versions (60–70). It has been a collaboration between the creators of the test patterns and the lexicographers currently working on EstWN. This paper describes the usage of test patterns from the points of view of the information scientists (the creators of the test patterns) as well as the users (the lexicographers). Using EstWN as an example, we illustrate how the continuous use of test patterns has led to a significant improvement of the semantic hierarchies in EstWN.

1 Introduction and background

1.1 About Estonian Wordnet

The Estonian Wordnet began as a part of the EuroWordNet project (Vossen, 1998) and was built by translating basic concepts from English to allow for monolingual extension. Words (literals) to be included were selected on a frequency basis from corpora. Extensions have been compiled manually from Estonian monolingual dictionaries and other monolingual resources. In this process, several methods have been used. For example, domain-specific methods, i.e. semantic fields like architecture, transportation, etc., have been covered. Moreover, there have been endeavors to automatically add derivatives, and the results have been used in the sense disambiguation process. Version 70 of EstWN consists of 67,674 synsets, including 110,869 lexical units.

1.2 Previous experience of validation

Before the introduction of test patterns, EstWN was validated and revised by adding new synsets and semantic relations into its semantic network. Information about new lexical concepts (synsets) originated from the Estonian language explanatory dictionary (EKSS1), text corpora and even from feedback on applying EstWN to the word sense disambiguation (WSD) task (Kahusk and Vider, 2002). In addition, EstWN participated in the META-NORD project, which aims to link and validate Nordic and Baltic wordnets (Danish, Estonian, Finnish, Icelandic, Latvian, Lithuanian, Norwegian and Swedish) and to make these resources widely available for different categories of user communities in academia and in industry. Under this project, the preliminary task is to "upgrade several wordnet resources to agreed standards" and "let them undergo cross-lingual comparison and validation in order to ensure that they become of the highest possible quality and usefulness" (Pedersen et al., 2012).

1 http://www.eki.ee/dict/ekss/

The first attempt to check the structure of EstWN took place with version 55 (by the first
author of this paper). One of the aspects studied
was the number of branches a synset goes through
before arriving at one or several root synsets.
These results were presented at the Estonian
Applied Linguistics Conference in spring 2011,
where Kadri Vider2 provided our first feedback.
Her comments elucidated that EstWN requires
this kind of structure checking. In the same year,
the first attempt was made to validate EstWN with
the test pattern3 of closed subset. Test pattern instances were evaluated by Kadri Vare and some
of the results were reflected in two papers (Lohk
et al., 2012a), (Lohk et al., 2012b). Later Lohk et
al. (2014b) discovered more test patterns, all related to multiple inheritance cases. Presently,
there is a system of ten test patterns (Lohk, 2015).
This paper aims to introduce these test patterns and to show that using them to validate the semantic hierarchies of a wordnet may significantly improve the wordnet structure. In addition, lexicographers Heili Orav and Kadri Vare share their experiences of working with test pattern instances (Section 5).

The paper is structured as follows: Section 2 elaborates on the motivation for this work. Section 3 provides a general description of the test patterns, followed by examples of test pattern instances. Section 4 demonstrates the efficiency of test pattern instances in validating the semantic hierarchies of a wordnet. Section 5 describes the experiences of the lexicographers in using test pattern instances.
2 Motivation

There are many reasons why test patterns should be chosen as a way to validate multiple inheritance in the wordnet hierarchical structure (formed by its semantics). To begin with, multiple inheritance, due to its nature, requires checking. More precisely, multiple inheritance is prone to semantic errors:
1) Inappropriate use of multiple inheritance (Kaplan and Schubert, 2001). There are many cases where multiple inheritance is not used as a conjunction of two properties (Gangemi et al., 2001).
2) Sometimes an IS-A relation is used instead of other semantic relations (Martin, 2003). Multiple inheritance makes it possible to compare the relations that connect the parents of a synset.
3) In many cases, multiple inheritance causes topological rings (Liu et al., 2004), (Richens, 2008). According to Liu et al. (2004), one synset cannot inherit properties from both parents.
4) Multiple inheritance may point to a short cut problem (Fischer, 1997), (Liu et al., 2004), (Richens, 2008). One synset has a two-fold connection to another one, both directly and indirectly; the direct link is illegal.
5) Multiple inheritance may point to dangling uplinks in the hierarchical structure (Šmrz, 2004).

Secondly, the use of test patterns has many advantages:
1) Using a test is always quicker than "[doing] a full revision in top-down or alphabetical order" (Čapek, 2012).
2) The use of "manual verification and correction" is the most reliable (Lindén and Niemi, 2014).
3) Test pattern instances highlight substructures that point to possible errors, and they simplify the work of the expert lexicographer (Lohk et al., 2012a), (Lohk et al., 2012b), (Lohk et al., 2014b).
4) Test patterns are applicable to wordnets in any language (Lohk et al., 2014c).

3 Test patterns

3.1 General knowledge about test patterns

As mentioned above, test patterns are, by their nature, descriptions of substructures of a specific kind in the wordnet semantic hierarchy, intended to validate its structure. All patterns have the property of multiple inheritance. In most cases, there is lexical polysemy behind the multiple inheritance. In the remaining cases, there are synsets that simultaneously inherit specific and general concepts (the test pattern of short cut).

Test pattern instances help to detect possible errors in the semantic hierarchies of a wordnet. Each test pattern provides a different perspective on the semantic hierarchy. Thus, the patterns vary in their capability to discover various types of possible semantic errors. Test pattern instances are identified by programs and have to be validated by an expert lexicographer.

2 A computational linguist from the University of Tartu.
3 A test pattern is a description of a substructure of a specific nature in the wordnet semantic network, intended to validate the semantic hierarchies of a wordnet.
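The short cut problem of item 4 translates directly into a graph check. The following Python sketch is illustrative only (a made-up hierarchy, not the authors' actual program): it flags direct hypernym links whose target is also reachable indirectly through another parent.

```python
def short_cuts(parents):
    """Flag short cuts: a child linked to the same ancestor both
    directly and indirectly through another parent.  The direct
    link is the redundant (illegal) one."""
    def ancestors(node):
        seen, stack = set(), list(parents.get(node, ()))
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(parents.get(n, ()))
        return seen

    hits = []
    for child, direct in parents.items():
        for p in direct:
            # everything reachable through the *other* parents
            via_others = set()
            for q in direct - {p}:
                via_others |= {q} | ancestors(q)
            if p in via_others:
                hits.append((child, p))
    return hits

# Toy example: {car} points at {vehicle} both directly and via
# {motor vehicle}; the direct link is a short cut.
hier = {
    "car": {"vehicle", "motor vehicle"},
    "motor vehicle": {"vehicle"},
}
print(short_cuts(hier))  # -> [('car', 'vehicle')]
```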
Test pattern structures partially or entirely overlap with each other. However, they offer different perspectives on the substructures of the hierarchies and may typically point to different semantic errors therein.

There are only two ways to cover all multiple inheritance cases in a certain semantic hierarchy of a wordnet: by using test pattern instances of closed subset, or by using test pattern instances of ring and synset with many roots together.

We developed algorithms and created programs (in the framework of the doctoral thesis (Lohk, 2015)) to find instances of the different types of test patterns automatically. However, some algorithms and programs are implemented to find instances of their test patterns only semi-automatically. Table 1 gives an overview of the developed test patterns and of the automation level of finding their instances. The table shows that six of the test patterns are implemented to find their instances automatically and the remaining four semi-automatically. In addition, it should be mentioned that the first two patterns (short cut and ring) are inspired by other authors (Fischer, 1997), (Liu et al., 2004), (Richens, 2008). Three of the patterns are closed subset patterns; the second and third of them have a specific property. Moreover, the test pattern instances of synset with many roots may in some cases correspond to the substructure called dangling uplink noted by (Koeva et al., 2004) and (Šmrz, 2004).

Test pattern | Automation level
Short cut | automatic
Ring | automatic
Closed subset | semi-automatic
Closed subset with a root | semi-automatic
The largest closed subset | semi-automatic
Dense component | automatic
Heart-shaped substructure | automatic
Synset with many roots | automatic
"Compound" pattern | automatic
Connected roots | semi-automatic

Table 1: Automation level of finding test pattern instances

Even though there exist ten test patterns (Table 1), only the instances findable in an automatic way were delivered to the lexicographer. Below, four of them are described, while short cut and ring are treated by their original authors and the main author of this paper. However, it may be useful to mention that short cut indicates redundancy in the semantic hierarchy, and ring may point to problematic synsets which are simultaneously co-hyponyms and co-hypernyms and additions from the same domain category (Liu et al., 2004).

All of the following examples are described by the first author of this paper. Moreover, all ten test patterns are described as mathematical models (more precisely, as graphs) in the thesis (Lohk, 2015).

In the examples, every synset is equipped with its equivalent synonyms from Princeton WordNet version 1.5, introduced by the abbreviation "(Eq_s)". If the equivalent synonyms are unknown, a free translation has been used.

3.2 Dense component

The dense component pattern provides an opportunity to uncover substructures where, due to multiple inheritance, the density of the interrelated concepts in the semantic hierarchy is higher (Lohk et al., 2014a), (Lohk et al., 2014b). This pattern contains at least two ambiguous concepts (in Figure 1, {hotel_1} and "hostel"), which have a minimum of two identical parents ("a housing enterprise" and "accommodation building"). The benefit of this pattern is its ability to uncover all regular polysemy cases that reveal themselves as the regularity of multiple inheritance.

The lexicographer has to establish:
- whether that kind of regularity is justified, and
- whether multiple inheritance can be extended to other synset(s).

In order to better understand the semantic field of the dense component in Figure 1, the synsets with dotted lines provide additional information to the dense component (synsets with bold lines) so that its content can be grasped more clearly. The first number in the brackets after a synset indicates the number of its subordinates inside the dense component. The second number in the brackets displays the count of all the subordinates of that synset.

It is a well-known fact that there are several concepts related to polysemic patterns (Langemets, 2010). Based on Figure 1, {hotel_1} and "hostel" exhibit that kind of pattern through institution-building polysemy. Checking the concept(s) additional to {hotel_1} and "hostel", {motel_1, ...} is found, which in its nature is quite similar to {hotel_1} and "hostel". Hence, it appears reasonable to also connect it to "accommodation building".
Figure 1. An instance of the dense component (rotated 90 degrees)
In the latest version of EstWN, it emerged that
{hotel_1} and “hostel” are no longer connected to
building through a hypernymy relation. (Instead,
the connection is through near_synonymy.) Meanwhile, in the current version of Princeton WordNet4, {hotel_1} is only a building and {hostel_1}
is its subordinate. For a solution, let us look at another concept similar to motel, hotel, and hostel –
the hospital. EstWN organizes this concept into
two synsets. Firstly, it denotes a medical institution, and secondly, a medical building. A similar
idea is followed in Princeton WordNet. Thus, in
both wordnets, hospital is related to an institution
as well as a building. According to this example,
it is advised to organize the concepts hotel, motel
and hostel in a similar manner.
3.3 Heart-shaped substructure
The heart-shaped substructure pattern describes a substructure in the wordnet hierarchy where two synsets (in Figure 2, {homoeopathy_1} and "mud cure, mud treatment") are interconnected with their two parents: they share a common parent ({curative_1, cure_1}), while their remaining parents ({naturopathy_1} and {alternative medicine_1, ...}) are linked by a hypernymy relationship.
Figure 2. An instance of the heart-shaped substructure
In the report file on the instances of the heart-shaped substructure delivered to the lexicographers, additional subordinates of the two topmost nodes are shown. This helps to assess why these two synsets with two parents are so specific that they join the superordinates while their co-members under both parents are not linked.

Secondly, this pattern indicates instances where a super-concept ({curative_1, cure_1, ...}) seems to be connected to a sub-concept from a different taxonomy level ("mud cure, mud treatment"). On the one hand, this situation might be a particular feature of the language; on the other hand, it might point to an error.
An example of a heart-shaped substructure in
Figure 2 originates from (Lohk et al., 2014b). The
question arises why {homoeopathy_1} is not a
subcase of {naturopathy_1}. Secondly, are “mud
cure, mud treatment” and {homoeopathy_1} subcases of {alternative medicine_1} or of {curative_1, cure_1, …}? On the basis of the definitions of these concepts, the lexicographers decided that both are subcases of {curative_1,
cure_1, …} and that {alternative medicine_1} is
connected to them via a holonymy relation.
There is still no thorough analysis of the heart-shaped substructure. Therefore, there is no such instance in the latest version of EstWN. In addition, as discovered in (Lohk and Võhandu, 2014), most of the cases of heart-shaped substructures in Princeton WordNet pointed to situations where a role or type relation should have been used instead of a hypernymy relation.

4 http://wordnetweb.princeton.edu/perl/webwn
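Detecting this pattern is again a small graph check. The sketch below is illustrative (a child-to-parents map with names simplified from Figure 2, not the actual tooling): it looks for two synsets that share one parent while their other parents are themselves linked by hypernymy.

```python
from itertools import combinations

def heart_shaped(parents):
    """Two synsets share one parent while their other parents are
    themselves linked by hypernymy -- the 'heart' of the pattern."""
    hits = []
    for x, y in combinations(sorted(parents), 2):
        for c in parents[x] & parents[y]:            # common parent
            for n in parents[x] - {c}:
                for m in parents[y] - {c}:
                    if m in parents.get(n, set()):   # m is hypernym of n
                        hits.append((x, y, c, n, m))
    return hits

# Toy hierarchy modelled on Figure 2, simplified.
hier = {
    "homoeopathy": {"naturopathy", "curative"},
    "mud cure": {"alternative medicine", "curative"},
    "naturopathy": {"alternative medicine"},
}
print(heart_shaped(hier))
```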
{olev_2}
(Eq_s) {entity_1}
{mäletsejalised_1}
ruminantia
...
{imetaja_1, mammaal_1}
(Eq_s) {mammal_1}
{pärisimetaja_1, imetaja_2}
(Eq_s) {eutherian mammal_1, …}
{kabiloom_1}
(Eq_s) {hoofed mammal_1, ungulate_1}
{sõraline_1}
(Eq_s) {artiodactyl_1, ...}
{mäletseja_1}
(Eq_s) {ruminant_1}
Figure 3. An instance of the connected roots
3.4 Synsets with many roots

Quite a similar pattern to rings is the synset with many roots. This pattern differs from the former one by its unconnected branches. On the one hand, this signifies that some of the detectable errors are similar to rings. On the other hand, this pattern is capable of discovering errors related to root synsets. Figure 3 demonstrates how one root synset is a dangling uplink5 – "ruminant animals". It means that the synset ({ruminant_1}) is connected to the second parent ("ruminantia"), which represents a root synset but in fact carries a lower-level concept. The root synset "ruminantia" is a taxon, i.e. it represents a group of animals with particular properties. Therefore, it was correct to change the hypernymy relationship between {ruminant_1} and "ruminantia" to holonymy. Thus, {ruminant_1} belongs to the group "ruminantia".
1 – {boa_1, boamadu_1}
(Eq_s) {boa_1}
{sall_1}
(Eq_s) {scarf_1}
2 – {lõgismadu_1}
(Eq_s) {Crotalus 1 genus Crotalus 1}
3 – {mürkmadu_1}
venomous snake; asp; viper
{madu_1}
(Eq_s) {ophidian_1, serpent_1, snake_1}
Figure 4. An instance of the "compound" pattern
3.5 Substructure that considers the content of synsets ("compound" pattern)

Nadig et al. (2008) consider a relationship between synsets where a member of one synset is a suffix of a member of another synset. They use examples such as {work}, {paperwork} and {racing}, {auto racing, car racing}. In that manner, it is possible to check whether the synsets have a hypernymy relation. In this pattern, the idea of Nadig et al. (2008) is employed to uncover all the cases where this condition is true. Additionally, we have to consider that at least one of the subordinates has an additional superordinate, as in Figure 4, where {boa_1} has the superordinate {scarf_1}. In that case, the lexicographer must consider why {boa_1}, with an extra superordinate, has no connections to the other subordinates. Upon checking this additional concept ({scarf_1}), it emerges that it is totally unsuitable, because while {boa_1} is a serpent, a scarf is a garment. However, scarf is still related to boa, but in a different meaning, {boa_2, feather boa_1}.

5 Dangling uplink is a special case of the synset with many roots.
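The suffix condition of Nadig et al. (2008) is straightforward to implement. The sketch below is illustrative (toy synsets with invented ids, not the actual program): it lists pairs where a member of one synset is a proper suffix of a member of another, i.e. candidate hypernym-hyponym pairs whose links should be checked.

```python
def compound_candidates(synsets):
    """If a member of one synset is a suffix of a member of another
    ('work' / 'paperwork'), the shorter one is a candidate hypernym
    of the longer one."""
    pairs = []
    for short_id, short_words in synsets.items():
        for long_id, long_words in synsets.items():
            if short_id == long_id:
                continue
            if any(lw != sw and lw.endswith(sw)
                   for sw in short_words for lw in long_words):
                pairs.append((short_id, long_id))  # (hypernym?, hyponym?)
    return pairs

# Toy synsets; the ids are illustrative.
syns = {
    "work": ["work"],
    "paperwork": ["paperwork"],
    "racing": ["racing"],
    "auto_racing": ["auto racing", "car racing"],
}
print(compound_candidates(syns))  # -> [('work', 'paperwork'), ('racing', 'auto_racing')]
```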
4
In the iterative evolution of EstWN, test pattern instances were separated with help of our programs and subsequently delivered to lexicographers who validated all instances and corrected
wordnet semantic hierarchies where necessary.
Table 2 reflects the number of test pattern instances over 11 EstWN versions. As background
information, the noun roots, verb roots and multiple inheritance cases are also presented. Every
number in this table indicates the condition of a
specific version in the light of the number of test
pattern instances. These numbers are found immediately after the addition of new concepts and semantic relations, and the release of the new version. Thus, the correction of semantic hierarchies
is revealed in the next version of wordnet.
The bold font in Table 2 indicates the versions
in which a specific pattern was applied. We may
notice that in the range from 60 to 62 no test patterns are used. As a matter of fact, at that time we
conducted some experiments with the closed subset pattern for our first two papers. Beside the
numbers of test pattern instances, it is important
to observe the number of multiple inheritance
cases, as every test pattern instance contains at
least one. The last row in this table confirms that
one multiple inheritance case may be contained in
many different types of test pattern instances,
while the total of the last row of instances
(7+21+30+0+3+7) is bigger than the multiple inheritance cases (51).
The largest changes in the number of multiple
inheritance cases appear when dense components
are taken into use in version 66. This is due to the
fact that dense component contains at least two or
more multiple inheritance cases in one instance.
In the paper of (Lohk et al., 2014a), it was discovered that only 12% (14) of 121 dense component
instances do not need any correction. Nevertheless, the next version (67) revealed 8 new instances.
The decrease in the number of multiple inheritance cases continues even after version 67 when
two more patterns are applied (heart-shaped substructure and “compound” pattern). In the last
version, there are only 3 dense component instances and 0 heart-shaped substructure instances. Comparing the numbers of multiple inheritance cases in versions 66 and 70, it is noted
that the last number (51) is approximately 32
times smaller, i.e. multiple inheritance cases have
been shrunk by approximately 97%.
Experiences of lexicographers in using
test pattern instances
The activities of a lexicographer are rather diverse. Compiling a thesaurus requires access to
vast amounts of linguistic data (e.g. corpora, different dictionaries, databases) as well as
knowledge of how to analyze these data.
Test patterns provide lexicographers with a
broader overview than daily work with a lexicographic tool could ever give. All the patterns were
checked individually. In many cases, additional
descriptions of usage context or definitions help
to ascertain the correct relations between the concepts and may also provide additional relations
found to be missing.
On occasion, synsets with many hypernyms
were left unaltered. For example, morphine is
simultaneously both a narcotic and pain medicine.
This illustrates a well-known problem: “Rigidity
property plays an important role when we distinguish semantic relations of type and role” because “every type is a rigid concept and every role
is a non-rigid concept” (Hicks and Herold, 2011).
It is suspected that the hyponymy relation may
sometimes be a role or type relationship.
There were also instances where a hypernym
had several hyponyms which in turn indicated a
problem, namely that some hyponyms had hypernyms that were too general. Revising the hypernymy trees often reduced the amount of direct hyponyms, resulting in a more precise and systematic hierarchy.
Thus, lexicographers should also know how to use their own intuition in the decision-making process. As these test patterns only indicate possible problems, it is not sensible to apply them automatically. However, it could be very useful if the test pattern results were available directly in a wordnet editing tool, so that the lexicographer is provided with complementary information.
5 Iterative evolution of EstWN
Applying the test patterns to EstWN has taken place gradually. As mentioned earlier, we began validating EstWN with the closed subset test pattern. At that time, we studied approximately 20 instances in EstWN and Princeton WordNet. Some of the results are reflected in two joint papers with Kadri Vare (Lohk et al., 2012a; Lohk et al., 2012b). Later, we started to use the short cut pattern as well as other patterns.
Version | Noun roots | Verb roots | Multiple inheritance cases | "Compound" pattern | Synset with many roots | Rings | Heart-shaped substructure | Dense component | Short cuts
60 | 142 | 24 | 1,296 | 235 | 3,445 | 1,123 | 1,825 | 104 | 301
61 | 183 | 22 | 1,592 | 259 | 3,560 | 1,309 | 1,861 | 121 | 380
62 | 102 | 16 | 1,700 | 299 | 3,777 | 1,084 | 1,941 | 128 | 415
63 | 114 | 16 | 1,815 | 321 | 3,831 | 1,137 | 2,103 | 141 | 447
64 | 149 | 15 | 1,893 | 337 | 3,882 | 1,173 | 2,232 | 149 | 471
65 | 248 | 14 | 1,717 | 194 | 2,171 | 791 | 451 | 132 | 459
66 | 144 | 4 | 1,677 | 119 | 1,796 | 613 | 259 | 121 | 671
67 | 129 | 4 | 1,164 | 79 | 928 | 477 | 167 | 24 | 407
68 | 131 | 4 | 691 | 60 | 537 | 232 | 38 | 18 | 54
69 | 121 | 4 | 102 | 18 | 291 | 35 | 1 | 8 | 23
70 | 118 | 4 | 51 | 7 | 21 | 30 | 0 | 3 | 7

Table 2: A numerical overview of EstWN spanning 11 versions
6 Conclusion and future works

The main collaboration between computer scientists and lexicographers in order to validate EstWN (version 60) began with the closed subset test pattern. The closed subset was successful in finding possible errors in semantic relations. Later, nine other test patterns dealing with multiple inheritance were developed (see Lohk, 2015). Two patterns, namely the short cut and ring patterns, were inspired by other authors, and one pattern can in certain cases include a dangling uplink. In this paper, six test patterns were described, but the examples covered four test patterns.

Typically, the work of using the test patterns was organized as follows: the first author of this paper generated the instances of the test patterns; then, based on that document, the lexicographer made corrections using the EstWN editing tool.

The experience of validating Estonian Wordnet showed that the continuous usage of test patterns can significantly improve the semantic hierarchy. Multiple inheritance decreased 32 times, or by 97%, over the last five versions of EstWN.

In the future, we plan to apply these test patterns to other types of semantic relations, for instance to near synonymy, fuzzynymy and holonymy. Moreover, as there are about 70 wordnets in the world, we believe that applying these test patterns to them may "automatically characterize their modelling decisions (i.e. potential modelling errors)"6.

References
Čapek, T., 2012. SENEQA-System for Quality Testing
of Wordnet Data, in: Proceedings of the 6th International Global Wordnet Conference. Toyohashi
University of Technology, Matsue, Japan, pp. 400–
404.
Gangemi, A., Guarino, N., Oltramari, A., 2001. Conceptual Analysis of Lexical Taxonomies: The Case
of WordNet Top-Level, in: Proceedings of the International Conference on Formal Ontology in Information Systems-Volume 2001. ACM, pp. 285–
296.
Kaplan, A.N., Schubert, L.K., 2001. Measuring and
Improving the Quality of World Knowledge Extracted from WordNet (No. 751). The University of
Rochester Computer Science Department, Rochester, New York.
Koeva, S., Mihov, S., Tinchev, T., 2004. Bulgarian
Wordnet–Structure and Validation. Romanian J.
Inf. Sci. Technol. 7, 61–78.
Langemets, M., 2010. Nimisõna süstemaatiline polüseemia eesti keeles ja selle esitus eesti keelevaras.
Eesti Keele Sihtasutus, Tallinn, Eesti.
6 Comment by a reviewer.
Lindén, K., Niemi, J., 2014. Is It Possible to Create a
Very Large Wordnet in 100 Days? An Evaluation.
Language Resources and Evaluation 48, 191–201.
Liu, Y., Yu, J., Wen, Z., Yu, S., 2004. Two Kinds of
Hypernymy Faults in WordNet: the Cases of Ring
and Isolator, in: Proceedings of the 2nd Global
Wordnet Conference. Brno, Czech Republic, pp.
347–351.
Lohk, A., 2015. A System of Test Patterns to Check and Validate the Semantic Hierarchies of Wordnet-type Dictionaries. Tallinn University of Technology, Tallinn, Estonia.
Lohk, A., Allik, K., Orav, H., Võhandu, L., 2014a.
Dense Component in the Structure of Wordnet, in:
Proceedings of the 9th International Conference on
Language Resources and Evaluation. European
Language Resources Association (ELRA), Reykjavik, Iceland, pp. 1134–1139.
Lohk, A., Norta, A., Orav, H., Võhandu, L., 2014b.
New Test Patterns to Check the Hierarchical Structure of Wordnets, in: Information and Software
Technologies. Springer, pp. 110–120.
Lohk, A., Orav, H., Võhandu, L., 2014c. Some Structural Tests for WordNet with Results. Proceedings
of the 7th Global Wordnet Conference 313–317.
Lohk, A., Vare, K., Võhandu, L., 2012a. Visual Study
of Estonian Wordnet Using Bipartite Graphs and
Minimal Crossing Algorithm, in: Proceedings of
the 6th International Global Wordnet Conference.
Matsue, Japan, pp. 167–173.
Lohk, A., Vare, K., Võhandu, L., 2012b. First Steps in
Checking and Comparing Princeton Wordnet and
Estonian Wordnet, in: Proceedings of the EACL
2012 Joint Workshop of LINGVIS & UNCLH. Association for Computational Linguistics (ACL), pp.
25–29.
Lohk, A., Võhandu, L., 2014. Independent Interactive
Testing of Interactive Relational Systems, in:
Gruca, D.A., Czachórski, T., Kozielski, S. (Eds.),
Man-Machine Interactions 3, Advances in Intelligent Systems and Computing. Springer International Publishing, pp. 63–70.
Martin, P., 2003. Correction and Extension of WordNet 1.7, in: Conceptual Structures for Knowledge
Creation and Communication. Springer, pp. 160–
173.
Nadig, R., Ramanand, J., Bhattacharyya, P., 2008. Automatic Evaluation of WordNet Synonyms and Hypernyms, in: Proceedings of ICON-2008: 6th International Conference on Natural Language Processing. CDAC Pune, India.
Pedersen, B.S., Forsberg, M., Borin, L., Lindén, K.,
Orav, H., Rögnvaldsson, E., 2012. Linking and
Validating Nordic and Baltic wordnets, in: Proceedings of the 6th International Global Wordnet
Conference. Matsue, Japan, pp. 254–260.
Richens, T., 2008. Anomalies in the Wordnet Verb Hierarchy, in: Proceedings of the 22nd International
Conference on Computational Linguistics-Volume
1. Association for Computational Linguistics
(ACL), pp. 729–736.
Smrž, P., 2004. Quality Control and Checking for
Wordnet Development: A Case Study of BalkaNet.
Science and Technology 7, 173–181.
Vossen, P., 1998. Introduction to EuroWordNet. Computers and the Humanities 32, 73–89.
African WordNet: A Viable Tool for Sense Discrimination in the
Indigenous African Languages of South Africa
Stanley Madonsela
Department of African Languages
University of South Africa
[email protected]
Munzhedzi James Mafela
Department of African Languages
University of South Africa
[email protected]
Mampaka Lydia Mojapelo
Department of African Languages
University of South Africa
[email protected]
Rose Masubelele
Department of African Languages
University of South Africa
[email protected]
Abstract

In promoting a multilingual South Africa, the government is encouraging people to speak more than one language. In order to comply with this initiative, people choose to learn the languages which they do not speak as home language. The African languages are mostly chosen because they are spoken by the majority of the country's population. Most words in these languages have many possible senses. This phenomenon tends to pose problems to people who want to learn these languages. This article argues that the African WordNet may be the best tool to address the problem of sense discrimination. The focus of the argument will be on the primary sense of the word 'hand', which is part of the body, as lexicalized in three indigenous languages spoken in South Africa, namely Tshivenḓa, Sesotho sa Leboa and isiZulu. A brief historical background of the African WordNet will be provided, followed by the definition of the word 'hand' in the three languages and the analysis of the word in context. Lastly, the primary sense of the word 'hand' across the three languages will be discussed.

1 Introduction

Thoughtful lexicography work for the indigenous African languages of South Africa commenced just after the introduction of democratic elections in 1994. With the establishment of the Pan South African Language Board, national lexicography units were constituted in all the official languages of South Africa. The lexicography units were tasked with the duty of establishing dictionaries in the different official languages of South Africa. Although many of the dictionaries are bilingual, they give very little information regarding sense discrimination, especially for non-mother tongue speakers who are interested in learning indigenous African languages. The South African government encourages people to learn one indigenous African language in addition to their first language. Lexicography work in African languages produced so far does not address the needs of indigenous African language learners because the equivalents provided do not address the problem of sense discrimination. Similarly, indigenous African language learners take it for granted that a lexical item has the same sense across these languages, whereas sometimes the sense of a word is different in these languages even if the languages are related.

This paper argues that African WordNet could be a viable tool to address problems such as those mentioned above. The equivalents of 'hand' in Tshivenḓa (Venda), Sesotho sa Leboa (Northern Sotho) and isiZulu (Zulu) are tshanḓa, seatla (letsogo) and isandla, respectively. The indigenous official languages of South Africa belong to the same family of languages; they are Bantu languages belonging to the Niger-Congo family. They are further divided into groups that are, to a certain extent, mutually intelligible. The Nguni language group and the Sotho language group, for example, are not mutually intelligible, whereas languages within any of the two groups are. A majority of the people in the country are multilingual, but they may nevertheless not be competent in all the languages. Being a rainbow nation with a myriad of people and languages, everyday life dictates that one has some understanding or awareness, however limited, of other languages. The fact that the official African languages in the country belong to the same family often tempts people, knowingly or unknowingly, to clamp them together with the saying 'if you know one you know them all' – and this is far from the truth. The lexicons and the senses reflect similarities, overlaps and unrelatedness to an extent that they may result in miscommunication unless sense discrimination is taken care of.

2 African WordNet defined

African WordNet is based on the Princeton WordNet. It is a multilingual WordNet of the official indigenous languages of South Africa. WordNets for African languages were introduced with a training workshop for linguists, lexicographers and computer scientists facilitated by international experts in 2007. The development of WordNet prototypes for four official African languages started in 2008 as the African WordNet Project. This project was based on collaboration between the Department of African Languages at the University of South Africa (UNISA) and the Centre for Text Technology (CTexT) at the North-West University (NWU), as well as support from the developers of the DEBVisDic tools at Masaryk University. The initiative resulted in first versions of WordNets for isiZulu [zul], isiXhosa [xho], Setswana [tsn] and Sesotho sa Leboa [nso], all members of the Bantu language family (Griesel and Bosch, 2014). Currently Tshivenḓa is the fifth of the nine official African languages of the country that are part of the project.

3 Word sense

Sense is defined as one of a set of meanings a word or phrase may bear, especially as segregated in a dictionary entry (Merriam-Webster Online). Frege (1892) argues that sense is the mode of presentation of the referent. There are multiple ways of describing and conveying information about one and the same referent, and to each of these ways corresponds a distinct sense. Every word is associated with a sense, and the sense specifies the condition for being the word's referent.

According to Fellbaum (1998), in WordNet each occurrence of a word form indicates a different sense of the word, which provides for its unique identification. A word in a synset is represented by its orthographic word form, syntactic category, semantic field and identification number. Together these items make a "sense key" that uniquely identifies each word/sense pair in the database. The sense of a word can be derived from the semantic relations that it has with other words. The manner in which word sense is viewed has a great appeal for the discussion of the word 'hand' in this article.

We have used the English word 'hand' to demonstrate lexicalisation and sense discrimination in the languages Sesotho sa Leboa (Northern Sotho), Tshivenḓa (Venda) and isiZulu (Zulu). Whilst there are other examples that could be used in the African WordNet to indicate sense discrimination across the indigenous African languages of South Africa, the choice of the word 'hand' stems from its cultural significance in the African value system. The word 'hand' has as its underpinning the 'Ubuntu' element (a value system that promotes humanity to others), which regards humanity as a fundamental part of the eco-systems that lead to a communal responsibility to sustain life.

The underlying hypothesis of this paper relies on previous studies that used multiplicative models of composition by exploring methods to extend the models to exploit richer contexts. Studies by Gale et al. (1993) and Dagan et al. (1991) have used parallel texts for sense discrimination to identify semantic properties of and relations among lexemes (Dyvik, 1998). Whilst there are different approaches to sense discrimination, this paper adopts the approach of Akkaya, Wiebe and Mihalcea (2012), which is to cluster target word instances so that the induced clusters contain instances used with the same sense.
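The sense-key and shared synset identifier machinery described above can be made concrete with a small sketch. The synset ID for 'hand' is the one cited in this paper; the `lexicon` dictionary and the `equivalents` function are hypothetical illustrations, not part of the African WordNet tooling:

```python
# Hypothetical sketch of cross-lingual linking through a shared
# Princeton WordNet synset ID (the ID for 'hand' as cited in the
# paper). The data structure and function are illustrative only.

HAND_ID = "ENG20-05246212-n"  # "the (prehensile) extremity of the superior limb"

# (language, lemma) -> synset ID; the entries follow the
# equivalents discussed in the text.
lexicon = {
    ("Tshivenḓa", "tshanḓa"): HAND_ID,
    ("Sesotho sa Leboa", "seatla"): HAND_ID,
    ("isiZulu", "isandla"): HAND_ID,
    ("English", "hand"): HAND_ID,
}

def equivalents(lemma, language):
    """All lemmas in other languages linked to the same synset ID."""
    target = lexicon.get((language, lemma))
    return sorted(w for (lang, w), sid in lexicon.items()
                  if sid == target and w != lemma)

print(equivalents("hand", "English"))
# ['isandla', 'seatla', 'tshanḓa']
```

Linking through a language-neutral synset ID rather than through bilingual equivalents is what lets a learner see that tshanḓa, seatla and isandla share one concept even when their full sense ranges differ, which is exactly the sense-discrimination problem the paper addresses.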
4 The primary sense of 'hand' in the three African languages

The primary meaning of a word is its literal meaning. This section looks into the dictionary equivalents of the primary meaning of the English word 'hand' in the three languages Tshivenḓa, Sesotho sa Leboa and isiZulu. The concept under discussion in this paper is defined in WordNet as "the (prehensile) extremity of the superior limb". It is sense 1 of the domain Anatomy and SUMO Bodypart [POS: n ID: ENG 20-05246212-n BCS: 3].

4.1 Tshivenḓa

The equivalent of hand in Tshivenḓa is tshanḓa. Whereas hand in English refers to the part at the end of a person's arm, including the fingers and thumb (Longman Dictionary of Contemporary English, 1995), tshanḓa in Tshivenḓa refers to both arm and hand taken as one. Tshivenḓa does not distinguish between arm and hand as languages such as English do; both are taken as one.

There is a slight difference among the Tshivenḓa lexicographers in defining the lexical entry tshanḓa. Wentzel and Muloiwa (1982:65 and 173) define tshanḓa and 'hand' differently. They define tshanḓa (pl. zwanḓa) as arm, hand; whereas hand is defined as tshanḓa (pl. zwanḓa). According to these lexicographers, tshanḓa has two senses: that of the whole arm, and that of the part at the end of a person's arm.

The same applies to the Tshivenḓa – English Ṱhalusamaipfi Dictionary (2006); the equivalent of hand is tshanḓa and the equivalents of tshanḓa are hand and arm. It would seem the Tshivenḓa – English Ṱhalusamaipfi Dictionary (2006) adopted the definitions of the two lexical entries directly from Wentzel and Muloiwa (1982). To them both hand and arm are called tshanḓa. Van Warmelo (1989:388), on the other hand, provides the equivalent of tshanḓa as hand. He does not differentiate between arm and hand; according to him the whole limb is tshanḓa. However, he also refers to the upper arm as tshishasha. Tshikota (2012a) and Tshikota (2012b), in his two monolingual dictionaries, Ṱhalusamaidioma ya Luamboluthihi ya Tshivenḓa (Tshivenḓa monolingual dictionary of idioms) and Ṱhalusamaipfi ya Luamboluthihi ya Tshivenḓa (Tshivenḓa monolingual dictionary), defines tshanḓa as follows:

tshanḓa dzin tshipiḓa tsha muvhili tshi re na minwe miṱanu tshine tsha shumiswa u fara ngatsho (Tshikota, 2012a:57)
'part of the body with five fingers, which is used to hold'

tshanḓa (zwanḓa) dzin 1 tshipiḓa tsha muvhili tshi re na minwe miṱanu tshine tsha shumiswa u fara ngatsho (Tshikota, 2012b:258)
'part of the body with five fingers, which is used to hold'

The definitions of the lexical entry tshanḓa in the two dictionaries are similar, and they refer to the English word hand. The lexicographers of these dictionaries were influenced by the English definition of hand, and their definitions do not reflect what the word tshanḓa refers to in the spoken language. The word tshanḓa in spoken Tshivenḓa refers to English arm plus hand. This is attested by Wentzel and Muloiwa (1982), Van Warmelo (1989) and the Tshivenḓa – English Ṱhalusamaipfi Dictionary (2006). The word tshanḓa also refers to the palm.

4.2 Sesotho sa Leboa

The word for 'hand' in Northern Sotho is seatla (plural: diatla). Ziervogel and Mokgokong's (1975) trilingual dictionary gives entries in Northern Sotho and equivalents in Afrikaans and English. The English equivalents of the word seatla in the dictionary are 'hand', 'palm of hand' and 'handwriting'. The dictionary then continues to use the word in various linguistic contexts in order to lay bare different senses. Of the three English equivalents mentioned above, only 'handwriting' seems to be non-literal, not representing the sense under the domain Anatomy. The first two equivalents refer to the physical part of the body. Only the first equivalent has a one-to-one conceptual correspondence with the concept defined in WordNet as "the (prehensile) extremity of the superior limb". The other equivalent, 'palm of hand', is part of the whole concept defined above. Another trilingual dictionary (Northern Sotho Language Board, 1988) gives entries in English and equivalents in Afrikaans and Northern Sotho. The latter is not only a dictionary, but a terminology and orthography standardizing document as well. The entry 'hand' has the Northern Sotho equivalent seatla. Following this entry is a number of English compound nouns and two-word entries which include 'hand'. Of these entries, seven are clearly built on the primary meaning of 'hand'. The seven entries reflect that 'hand' is also referred to as letsogo in Northern Sotho. For example, the Northern Sotho equivalent of 'handwork' is modiro wa diatla, 'hand muscle' is mošifa wa seatla, 'hand movement' is tshepedišo ya letsogo, 'hand drill' is borotsogo, and 'handbag' is sekhwamatsogo.

It emerges from the Northern Sotho dictionary definitions and equivalents that the concept is lexicalized as seatla and/or letsogo. The English dictionary equivalent of Northern Sotho letsogo is 'arm'. Letsogo refers to the whole superior limb, which includes seatla 'hand'. The two are understood to be in a holonym-meronym relationship, while being used as synonyms as well.

4.3 IsiZulu

Mbatha (2006:9) in his isiZulu monolingual dictionary defines 'hand' as isitho somuntu okuyisona abamba ngaso 'a body part which a human uses to hold'. Mbatha's definition shows a dearth of the lexicographic features needed to provide the quality of definition required to give clarity. However, Doke and Vilakazi (1972:9) in their Zulu-English dictionary define 'hand' as forearm (including the hand). From the definitions of these lexicographers, it is apparent that they do not define the concept in exactly the same way. Mbatha seems to be focusing mostly on the functional aspect of the word 'hand' rather than striving to describe its meaning. The definition by Doke and Vilakazi, on the other hand, is not detailed enough: it lacks the defining criteria and the characteristics that are necessary to understand what the word means. What makes Doke and Vilakazi's definition incomplete is that it does not give enough information about the word. In Collins English Dictionary (1991:704) the word hand is defined as 'the prehensile part of the body at the end of the arm, consisting of a thumb, four fingers and a palm'. Considering the definitions given by Mbatha and by Doke and Vilakazi, it becomes clear that the information they have provided has a tentative validity.

5 Discussion

Across the three languages, the primary sense of 'hand' is a physical part of the human body. Lexicographers have to constantly strive to enhance the quality of definitions in monolingual dictionaries to best suit the needs and level of their target users (Gouws, 2001:143). Landau (2001:162) also maintains that the definition must define and not just talk about the word or its usage. It is clear from the argument given above that the definitions discussed do not provide the answer to the question of 'what it is' that is being defined, as Gouws (ibid.) suggests. Lombard (1991:166) pinpoints defining criteria that would result in good definitions, namely completeness, clarity, accuracy, consistency, independency, objectivity and neutrality.

Although the words for 'hand' in the three languages may refer to different parts of the limb, starting at the end of the shoulder and ending at the fingers, the parts constitute the same limb. Whereas in Tshivenḓa and isiZulu 'hand' is referred to as tshanḓa and isandla respectively, in Sesotho sa Leboa it is referred to as seatla or letsogo. In Tshivenḓa, tshanḓa is that part of the human body from the shoulder to the fingers. This means that the whole limb is referred to as tshanḓa. The sense in isiZulu is slightly different from that in Tshivenḓa because isandla refers to the forearm, including the wrist and fingers. Whereas Tshivenḓa tshanḓa refers to the whole limb, isiZulu isandla refers to part of the limb, i.e. the forearm. Sesotho sa Leboa refers to the whole limb as letsogo 'arm' and to the 'hand' as seatla; additionally, 'hand' is referred to as letsogo. Seatla is part of the whole limb, a meronym of letsogo 'arm', but is also used synonymously with letsogo. Unlike Tshivenḓa and Sesotho sa Leboa, isiZulu recognises the forearm as part of the hand, which is referred to as isandla. In Tshivenḓa and Sesotho sa Leboa, the palm of the hand is referred to as tshanḓa and seatla, respectively. The diagram in Appendix 1 illustrates the situation sketched above.
6 Conclusion

The empirical conclusion in this paper provides a new understanding of words with different senses, which pose a challenge to the different speakers of the indigenous South African languages, particularly the three languages mentioned. Considering the hypothesis posed at the beginning of this paper, it can be concluded that the primary sense of hand in the three languages, although related, is different. People learning these languages should not conclude that because they are grouped as African languages the senses of their lexicons are similar throughout. It is also noted that the sense of hand in English is different from that in the African languages. WordNet is a good tool to investigate the senses of African languages' lexicons, in that the word 'arm' has a comparable sense and an ID, namely arm: 1 [POS: n ID: ENG 20-05245410-n BCS: 3], and belongs to a specific domain: Anatomy.

The discussion in this paper has gone some way towards enhancing our understanding of the degree to which African WordNet can be a tool used to differentiate word senses. This research has thrown up many questions in need of further investigation regarding the other senses, such as the metaphoric and idiomatic uses of the word under discussion. It became evident from the discussion that the same word can have different senses in the different languages.

References

Akkaya, C., Wiebe, J. and Mihalcea, R. 2012. Utilizing Semantic Composition in Distributional Semantic Models for Word Sense Discrimination and Word Sense Disambiguation. Sixth IEEE International Conference on Semantic Computing (IEEE ICSC2012). Amsterdam and Philadelphia: John Benjamins and Cognitive Processes, 6 (1), 1–28.

Dagan, I., Itai, A. and Schwall, U. 1991. Two languages are more informative than one. Proceedings of the 29th Annual Meeting of the ACL, 18-21 Berkeley, California, 130-137.

Doke, C.M. and Vilakazi, B.W. 1972. Zulu-English Dictionary. Johannesburg: Witwatersrand University Press.

Dyvik, H. 1998. Translations as Semantic Mirrors. Proceedings of Workshop Multilinguality in the Lexicon II, ECAI 98, Brighton, UK, 24-44.

Fellbaum, C. (ed.) 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

Frege, G. 1892. "On Sense and Reference". Translated by M. Black in Geach, P. and Black, M. (eds.) (1970) Translations from the Philosophical Writings of Gottlob Frege. Oxford: Basil Blackwell.

Gale, W.A., Church, K.W. and Yarowsky, D. 1993. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26, 415-439.

Gouws, R.H. and Prinsloo, D.J. 2005. Principles and Practices of South African Lexicography. Stellenbosch: Sun Press.

Griesel, M. and Bosch, S. 2014. Taking stock of the African Wordnets project: 5 years of development. In Proceedings of the Seventh Global WordNet Conference, January 2014. University of Tartu, Estonia.

Landau, S.I. 2001. Dictionaries: The Art and Craft of Lexicography. Second Edition. Cambridge University Press.

Lombard, F. 1991. Die aard en aanbieding van die leksikografiese definisie. In: Lexikos 1: 158-180.

Longman Dictionary of Contemporary English. 1995. 3rd Edition. Harlow: Longman Group.

Mbatha, M.O. 2006. Isichazamazwi SesiZulu. Pietermaritzburg: New Dawn Publishers.

Northern Sotho Language Board. 1988. Sesotho sa Leboa Mareo le Mongwalo No. 4 / Northern Sotho Terminology and Orthography No. 4 / Noord-Sotho Terminologie en Spelreëls No. 4. Pretoria: Government Printer.

Tshikota, S.L. 2012a. Ṱhalusamaidioma ya Luamboluthihi ya Tshivenḓa. Ṱhohoyanḓou: Tshivenḓa National Lexicography Unit.

Tshikota, S.L. 2012b. Ṱhalusamaipfi ya Luamboluthihi ya Tshivenḓa. Ṱhohoyanḓou: Tshivenḓa National Lexicography Unit.

Tshivenḓa NLU. 2006. Tshivenḓa/English Ṱhalusamaipfi Dictionary. Cape Town: Phumelela Publishers.

Van Warmelo, N.J. 1989. Venda Dictionary: Tshivenḓa – English. Pretoria: Van Schaik.

Wentzel, P.J. 1982. Improved Trilingual Dictionary: Venda – Afrikaans – English. Pretoria: University of South Africa.

Ziervogel, D., Lombard, D.P. and Mokgokong, P.C. 1969. A Handbook of the Northern Sotho Language. Pretoria: Van Schaik.
Appendix 1: Lexicalisation of ‘hand’ in the three languages
An empirically grounded expansion of the supersense inventory
Héctor Martínez Alonso† Anders Johannsen Sanni Nimb‡
Sussi Olsen Bolette Sandford Pedersen†
University of Copenhagen (Denmark)
‡Danish Society of Language and Literature (Denmark)
†[email protected], [email protected]
Abstract

In this article we present an expansion of the supersense inventory. All new supersenses are extensions of members of the current inventory, which we postulate by identifying semantically coherent groups of synsets. We cover the expansion of the already-established supersense inventory for nouns and verbs, the addition of coarse supersenses for adjectives in the absence of a canonical supersense inventory, and supersenses for verbal satellites. We evaluate the viability of the new senses by examining the annotation agreement, frequency and co-occurrence patterns.

1 Introduction

Coarse word-sense disambiguation is a well-established discipline (Segond et al., 1997; Peters et al., 1998; Lapata and Brew, 2004; Alvez et al., 2008; Izquierdo et al., 2009) that has acquired more momentum in recent years under the name of supersense tagging (SST). SST uses a coarse sense inventory to label spans of variable word length (Ciaramita and Johnson, 2003; Ciaramita and Altun, 2006; Johannsen et al., 2014). This coarse sense inventory is obtained from the list of WordNet first beginners, i.e. the names of the lexicographer files that hold the synsets.

However, lexicographer files were devised for practical reasons, namely as an organization method for the development of WordNet (Miller, 1990; Gross and Miller, 1990; Fellbaum, 1990), and not as final target categories to annotate with or disambiguate from.

Nevertheless, the organization of lexicographer files is semantically motivated, and supersenses have proven useful for natural language processing tasks such as metaphor detection or relation extraction (Ciaramita and Johnson, 2003; Tsvetkov et al., 2014a; Søgaard et al., 2015). According to Ciaramita and Altun (2006), supersenses extend the named entity recognition (NER) inventory so that the predictions of an SST model subsume the output of NER. Schneider et al. (2015) provide a full SSI for prepositions.

The current supersense inventory (henceforth SSI) enjoys de facto standardness, but in spite of its potential usefulness, it is used acritically. The current SSI provides 26 noun supersenses and 15 verb supersenses. Adjective and adverb lexicographer files are disregarded. We provide a revision of the SSI by an extension of its supersenses, using the Danish wordnet as starting point.

This revision is empirically backed by four evaluation criteria, namely inter-annotator agreement, sense frequency after adjudication, sense co-occurrence, and NER compliance (whenever possible). Note that we do not suggest merging existing supersenses, but only extending the current SSI in a backwards-compatible manner.

We conduct our extension in three steps. First, we propose new supersenses when a projection between a EuroWordNet (EWN) ontological type and a supersense is not univocal (Section 2). Second, we evaluate the distribution of supersenses in terms of agreement after an annotation task, frequency and sense-sense relations (Section 4) and analyze the results across the different parts of speech (Section 5). Lastly, we suggest new supersenses (underlined in Table 2) when large sections of the data have been assigned to back-off categories.

The main contributions of this paper are i) a set of guidelines for the inclusion of new supersenses in the SSI, ii) an empirically motivated expansion of the SSI with new senses for nouns, verbs and adjectives, respectively,1 and iii) a projection from ontological types to supersenses that can be used to enrich any wordnet that is not organized in lexicographer files or where synsets are not fully connected to Princeton synsets.

1 https://github.com/coastalcph/semdax
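Because every new supersense extends a member of the current inventory, annotations made with the expanded SSI can always be collapsed back into the original inventory. The sketch below illustrates this backwards compatibility; the lowercase tag names and the `collapse` function are our own illustration (following the pairs in Table 2), not part of the paper's tooling:

```python
# Illustrative sketch: collapsing the extended supersense inventory
# back into the original SSI. The mapping pairs follow Table 2 of
# the paper; the lowercase dotted rendering of the tags is ours.

PARENT = {
    "noun.vehicle": "noun.artifact",
    "noun.building": "noun.artifact",
    "noun.container": "noun.artifact",
    "noun.domain": "noun.cognition",
    "noun.abstract": "noun.cognition",
    "noun.institution": "noun.group",
    "noun.disease": "noun.state",
    "noun.language": "noun.communication",
    "noun.document": "noun.communication",
    "verb.aspectual": "verb.stative",
    "verb.phenomenon": "verb.change",
}

def collapse(tag):
    """Map an extended supersense to its original SSI supersense;
    tags already in the original SSI pass through unchanged."""
    return PARENT.get(tag, tag)

tags = ["noun.disease", "noun.artifact", "verb.aspectual"]
print([collapse(t) for t in tags])
# ['noun.state', 'noun.artifact', 'verb.stative']
```

Since the mapping is a function from new tags to old ones, any model or corpus annotated with the extended inventory remains usable by tools that only know the original SSI.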
2 Extending the supersense inventory

This section describes the extension of the SSI that results from an analysis of projections into supersenses from ontological types, ensuring both retro-compatibility with the existing inventory (i.e. all new supersenses are extensions of an existing supersense) and compatibility with NER tags.

We use the Danish wordnet (Pedersen et al., 2009), DanNet, as a starting point. DanNet is not organized in lexicographer files. However, its synsets are associated with ontological types (Vossen et al., 1998). We map from the ontological type of the synsets to a supersense. Table 1 provides one example for each lexical part of speech.

Ontological type | Supersense
Property+Physical+Colour | ADJ.PHYSICAL
Liquid+Natural | NOUN.SUBSTANCE
Dynamic+Agentive+Mental | VERB.COGNITION

Table 1: Supersense mapping examples.

New supersense | Subsumed by
Noun:
VEHICLE, BUILDING, CONTAINER | ARTIFACT
DOMAIN, ABSTRACT | COGNITION
INSTITUTION | GROUP
DISEASE | STATE
LANGUAGE, DOCUMENT | COMMUNICATION
Verb:
ASPECTUAL | STATIVE
PHENOMENON | CHANGE
Adjective:
MENTAL, PHYSICAL, SOCIAL, TIME, FUNCTION | ALL
Satellite:
COLLOCATION, PARTICLE, REFLPRON | none

Table 2: Extensions to the sense inventory. Items in grey do not fulfill the inclusion criteria; underlined items have been suggested during post-annotation analysis.
We establish a projection into supersenses with the following steps. If an ontological type t_i:

1. does not have a straightforward one-to-one mapping to a supersense,

2. is the subtype of an ontological type t_j (e.g. Liquid+Natural is a subtype of Liquid),

3. and has enough support (in terms of how many synsets make up t_i),

then we propose a new supersense for t_i as an extension of the supersense of t_j. We consider the support substantial enough when a subtype has at least 500 synsets out of the 65k synsets in DanNet, and it makes up at least 12% of its parent supersense.
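The support test above can be sketched in a few lines. This is an illustrative reading of the criteria, not the authors' code; the constant and function names are ours:

```python
# Thresholds stated in the paper: a subtype must cover at least 500 of
# DanNet's ~65k synsets AND at least 12% of its parent type's synsets.
MIN_SYNSETS = 500
MIN_PARENT_SHARE = 0.12

def propose_extension(subtype_synsets: int, parent_synsets: int) -> bool:
    """Return True if an ontological subtype has enough support to be
    proposed as a new supersense extending its parent's supersense."""
    if parent_synsets == 0:
        return False
    share = subtype_synsets / parent_synsets
    return subtype_synsets >= MIN_SYNSETS and share >= MIN_PARENT_SHARE

# Worked example from the paper: Property+Physical+Condition has 527
# synsets, about 70% of its parent type Condition, so DISEASE is
# proposed as an extension of STATE.
print(propose_extension(527, 753))  # True
```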
We exemplify this method by explaining how we extend DISEASE from STATE. The subtype Property+Physical+Condition is associated with 527 synsets and makes up 70% of the synsets of the type Condition. All the synsets of this subtype are diseases, and we propose the supersense DISEASE as an extension of STATE, which is otherwise the supersense translation of Condition.
In addition to providing new supersenses for the three main lexical parts of speech, we devise three additional tags for verbal satellites (collocations, particles and reflexive pronouns) as an aid for the annotation of verbal multiwords (cf. Section 5.4). Table 2 lists the new supersenses. Underlined supersenses have been determined in post-annotation analysis (cf. Section 5), while the rest have been determined during the projection step described in this section. Supersenses in grey do not meet the inclusion criteria, and are thus not incorporated in our proposal for SSI extension.
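The extended inventory of Table 2 is, in effect, a mapping from each new supersense to the established supersense that subsumes it; projecting back to the regular SSI (as done later for the DA-RG corpus in Section 4.3) is then a dictionary lookup. A minimal sketch, with identifier names that are ours rather than the paper's:

```python
# Table 2 extensions: new supersense -> subsuming regular supersense.
EXTENDED_TO_REGULAR = {
    "n.vehicle": "n.artifact", "n.building": "n.artifact",
    "n.container": "n.artifact", "n.domain": "n.cognition",
    "n.abstract": "n.cognition", "n.institution": "n.group",
    "n.disease": "n.state", "n.language": "n.communication",
    "n.document": "n.communication", "v.aspectual": "v.stative",
    "v.phenomenon": "v.change", "a.mental": "a.all",
    "a.phys": "a.all", "a.social": "a.all",
    "a.time": "a.all", "a.function": "a.all",
}

def to_regular(supersense: str) -> str:
    """Project an extended supersense back onto the regular SSI;
    regular supersenses pass through unchanged."""
    return EXTENDED_TO_REGULAR.get(supersense, supersense)

print(to_regular("n.vehicle"))  # n.artifact
print(to_regular("n.person"))   # n.person (already regular)
```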
3 Annotation task
We perform an annotation task on 5,500 sentences from a Danish contemporary corpus (Asmussen and Halskov, 2012) made up of newswire, parliamentary speeches, blog posts, internet forum discussions, chatroom logs and magazine articles, plus the test section of the Danish Dependency Treebank (Buch-Kromann et al., 2003).

Any corpus choice imposes a bias, and we base ours on a twofold need: to tune the sense inventory to the needs of the contemporary genres used for information extraction, without sacrificing its adequacy for more conventional domains. Generally speaking, another corpus choice would yield a different supersense expansion.
The corpus was pre-annotated using the supersense projection list described in Section 2. Even though the size of the specific wordnet is a determining factor for the quality of the pre-annotation, it does not determine the coverage of the final supersense annotation, which provides full coverage because an SSI covers all content words.

Two in-house native annotators with a background in linguistics annotated the data, choosing the best pre-annotated sense or selecting a new one. A third annotator performed adjudication in case of disagreement. The overall kappa score before adjudication is 0.62. Olsen et al. (2015) provide more details on the annotation task. The resulting data has been used for automatic supersense tagging by Martínez Alonso et al. (2015).

4 Metrics

This section describes the metrics applied to the supersense-annotated corpus in order to assess the distribution of the new supersenses.

4.1 Sense-wise agreement variation

Inter-annotator agreement is a source of information on the reliability of semantic categories (Lopez de Lacalle and Agirre, 2015). In this section, we examine the variation in agreement for noun and verb supersenses. Cf. Olsen et al. (2015) for a more detailed account.

Figures 1 and 2 portray the variation of agreement across noun and verb supersenses. Each cell in the matrix indicates the probability of a token being annotated with a row-column tuple of supersenses (r_i, c_j) by the two annotators. The matrix is normalized row-wise, and each row describes the probability distribution of a certain supersense r_i being annotated with any other supersense c_j. When r_i and c_j have the same value, the annotators agree. Rows are sorted in descending order of agreement, i.e. by the size of the r_i = c_j box on the diagonal. The larger the box on the diagonal, the higher the agreement for a given supersense r_i.

Among the standard supersenses, for instance, N.GROUP is very seldom assigned by both annotators, with frequent disagreement with N.QUANTITY. Other senses like N.BODY have very few off-diagonal values and near-perfect agreement.

Out of the new supersenses, N.INSTITUTION has very high agreement. However, the new supersense N.DOMAIN has very low agreement. A domain (i.e. a field of knowledge or professional discipline) is difficult to distinguish from its semantically related senses N.COGNITION and N.COMMUNICATION. Low agreement also compromises the reliability of some of the established supersenses such as NOUN.SHAPE. However, the goal of these measurements is to evaluate the new supersenses: we do not advocate for a reduction of the canonical SSI, but for an extension of the existing list of supersenses.

Figure 1: Agreement variation for nouns.
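The row-normalized matrices behind Figures 1 and 2 are simple to compute from paired annotations. The following is an illustrative sketch with toy data invented for the example, not the paper's code:

```python
from collections import Counter

def agreement_matrix(pairs, labels):
    """Row-normalized agreement matrix: cell (r, c) is the probability
    that a token labeled r by annotator 1 was labeled c by annotator 2.
    Cells with r == c lie on the agreement diagonal."""
    counts = Counter(pairs)
    matrix = {}
    for r in labels:
        row_total = sum(counts[(r, c)] for c in labels)
        matrix[r] = {c: counts[(r, c)] / row_total if row_total else 0.0
                     for c in labels}
    return matrix

# Toy annotation pairs (annotator 1, annotator 2).
pairs = [("n.group", "n.group"), ("n.group", "n.quantity"),
         ("n.body", "n.body"), ("n.body", "n.body")]
m = agreement_matrix(pairs, ["n.group", "n.quantity", "n.body"])
print(m["n.group"]["n.quantity"])  # 0.5 -> half the n.group tokens disagree
print(m["n.body"]["n.body"])       # 1.0 -> perfect agreement
```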
Figure 2: Agreement variation for verbs.
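Section 4.3 below ranks supersense pairs by sentence-level PMI. A minimal sketch of that computation, with a toy corpus invented for the example (the function name and data are ours):

```python
import math
from itertools import combinations

def sentence_pmi(sentences):
    """PMI between supersenses co-occurring in a sentence:
    pmi(a, b) = log(p(a, b) / (p(a) * p(b))), with probabilities
    estimated over sentences (each sense counted once per sentence)."""
    n = len(sentences)
    single, joint = {}, {}
    for sent in sentences:
        senses = set(sent)
        for s in senses:
            single[s] = single.get(s, 0) + 1
        for a, b in combinations(sorted(senses), 2):
            joint[(a, b)] = joint.get((a, b), 0) + 1
    return {
        (a, b): math.log((c / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), c in joint.items()
    }

# Toy corpus: v.consumption co-occurs with n.food more than chance.
corpus = [["v.consumption", "n.food"], ["v.consumption", "n.food"],
          ["n.person", "v.stative"], ["n.person", "n.food"]]
scores = sentence_pmi(corpus)
print(round(scores[("n.food", "v.consumption")], 2))  # 0.29 (above chance)
print(round(scores[("n.food", "n.person")], 2))       # -0.41 (below chance)
```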
Agreement also varies across parts of speech. Diagonal boxes take up 69% of the probability mass for verbs, while agreed nouns take up 58%. In other words, 31% of the annotations for verbs are mismatched, whereas 42% of the nouns have mismatching annotations. We consider this difference a consequence of the respective sizes of the noun and verb inventories, and not an indication that verbs are per se easier to annotate than nouns.

4.2 Supersense frequency

Frequency is the most straightforward way of assessing whether a certain sense has been given to enough examples to be considered relevant. If a new sense is very frequent, there is sufficient reason to consider it a valid addition to the SSI. Figure 3 provides the absolute frequency for the 28 most frequent supersenses, namely half of the total SSI, after disagreements had been resolved by the adjudicator.

Figure 3: Distribution of frequent senses. Observed counts for the 28 most frequent supersenses, in increasing order of frequency: n.building, n.quantity, a.mental, v.phenomenon, v.emotion, a.social, v.motion, v.change, n.event, a.phys, n.cognition, v.possession, n.location, s.particle, a.time, n.abstract, n.act, v.act, n.institution, n.artifact, v.cognition, s.coll, v.communication, a.all, n.time, n.communication, n.person, v.stative.

Presence in the top half of the sense ranking is one of the criteria for inclusion in the SSI.

4.3 Association between supersenses

A third source of information on the appropriateness of a supersense is its relation with the other established senses. This section offers an overview of how supersenses co-occur. To capture relevant associations between senses, we use PMI (pointwise mutual information). Higher PMI values indicate stronger association, i.e. a higher conditional probability of one sense appearing in a sentence given the other, controlled for the frequency of both senses in order not to overestimate the co-occurrence of frequent senses.

Table 3 shows the twelve pairs of supersenses with the highest PMI calculated across sentences. We compare the supersense-wise PMI for three corpora:

1. Danish extended (DA-EX): the Danish corpus annotated with the extended SSI described in Section 3,

2. Danish regular (DA-RG): the Danish corpus from Section 3 with regular supersenses, where the extended senses have been replaced by their subsuming original sense, e.g. all the occurrences of N.VEHICLE in DA-EX are N.ARTIFACT in DA-RG,

3. English regular (EN-RG): the English SemCor (Miller, 1990) with the regular supersense annotation.

Danish (extended)             Danish (regular)              English (regular)
v.consumption  n.food         v.consumption  n.food         v.consumption  n.food
v.contact      n.body         v.stative      n.plant        v.weather      n.object
n.food         n.container†   n.person       n.animal       v.weather      n.phenomenon
v.body         n.body         v.competition  n.relation     n.plant        n.food
n.disease†     n.body         v.competition  v.event        n.plant        n.animal
v.competition  n.event        v.change       n.substance    n.substance    n.process
v.motion       v.contact      n.state        n.feeling      v.body         n.body
v.contact      n.artifact     v.consumption  v.change       v.weather      n.substance
n.substance    n.object       v.motion       n.object       v.emotion      n.motive
n.shape        n.body         v.stative      v.consumption  n.plant        n.tops
n.vehicle†     n.substance    v.stative      n.substance    v.contact      n.body
v.competition  n.relation     n.substance    n.person       n.food         n.animal

Table 3: Sense pairs ranked by PMI; bold and underlined pairs are described in Section 4.3, † marks a new sense.

Some of the associations are prototypical selectional restrictions like V.CONSUMPTION + N.FOOD. Other associations are topical across parts of speech, like VERB.COMPETITION and NOUN.EVENT ('They won the final'). Finally, there are associations within a part of speech, like N.DISEASE and N.BODY, or N.FOOD and N.CONTAINER. In these associations, one sense is a strong indicator for the other at the topic level (diseases are bodily, food is kept somewhere, etc.).

In DA-EX we observe that three of the new nominal senses appear strongly associated with standard supersenses. These relations are topical and easy to interpret. The vehicle-substance relation is the least straightforward one and describes vehicles and the fuel they use, or the materials they are built from.

Projecting back to the regular SSI is not equivalent to annotating from scratch with it. Nevertheless, if we examine the top supersense pairs for DA-RG, we observe that the V.STATIVE sense appears three times. By ignoring the aspectual difference, the tag receives associations with N.PLANT, V.CONSUMPTION and N.SUBSTANCE. Upon manual examination we deem these relations spurious, i.e. caused by the presence of the verb være ('be') somewhere in the sentence, except for the relation between V.STATIVE and V.CONSUMPTION, which is aspectual in nature. The effect on the distribution of supersenses when projecting back to the original SSI becomes apparent for the pair V.COMPETITION + N.RELATION, which becomes the fourth highest PMI in DA-RG.

The English supersense associations of EN-RG provide an example of the effect of corpus choice when annotating. The fairly uncommon N.PLANT appears in several of the top associations, which is a sign of plant senses being used in very restricted contexts in this corpus (biology and recipes). Moreover, we also find a strong association with one of the backoff senses, namely N.TOPS, which is not desirable.

5 Supersenses across parts of speech

5.1 Nouns

This section describes the extended SSI for nouns. To the extent that nouns denote entities, they are very often the focus of interest of ontologies. To the extent that entities often have physical denotation, and thus concrete meaning, they are the easiest concepts to categorize semantically. Indeed, many ontologies are largely nominal, cf. Suchanek et al. (2008) or Wu and Weld (2008).

WordNet lexicographer files were developed before the consolidation of NER, and named-entity coverage in wordnets is irregular. If, as stated in Section 1, NER compatibility is a favorable side effect of SST, we consider improved NER compatibility of the new SSI a plus.

Even though NER inventories are application dependent (cf. Nadeau and Sekine (2007) for a survey), our reference is the de facto standard CoNLL inventory (Tjong Kim Sang and De Meulder, 2003), with the labels PERSON, LOCATION and ORGANIZATION, as well as a MISCELLANEOUS label, needed for full coverage but not present in e.g. the 7-label inventory of MUC-7 (Chinchor and Robinson, 1997).

Concrete meaning is easier to annotate (Passonneau et al., 2009) and can be the easiest to extend with new senses. As a matter of fact, the concrete N.ARTIFACT supersense is the one that yields the most new supersenses in our analysis, namely N.BUILDING, N.CONTAINER and N.VEHICLE. In particular, N.BUILDING extends N.ARTIFACT because artifactual locations, already noted as a semantic type in the SIMPLE ontology (Lenci et al., 2000), like houses and highways, are very often predicated as locations (following locative prepositions, etc.) instead of having the typical distribution of artifacts, i.e. with the verb use or the preposition with. Moreover, N.BUILDING maps better into the LOCATION type of NER. We leave the potential supersenses for instruments and machines as parts of N.ARTIFACT and do not specify them further, because they hold the prototypical meaning of the supersense.

In spite of the expected higher difficulty of dealing with abstract meaning, we examine two extensions for the abstract supersense N.COGNITION
yielded by the ontological type projection from Section 2, namely N.DOMAIN and N.ABSTRACT. The supersense N.DOMAIN covers fields of knowledge such as philosophy, but also other disciplines, to cover sense alternations like 'I enjoyed this dance' (N.ACT) vs. 'I studied dance at the Performing Arts Academy' (N.DOMAIN). The supersense N.ABSTRACT aims at covering concepts like idea, and serves as a label for metaphorical usages of concrete words like pattern in 'behavioral pattern'.

The fairly abstract supersense N.STATE yields a concrete sense DISEASE, which is much easier to annotate than its original parent supersense (cf. Figure 1). Lastly, we extend N.GROUP with N.INSTITUTION. The original sense does not map neatly into NER, as the overlap is only partial: while ministry would fall under the ORGANIZATION type of NER, pack (of rats) and school (of fish) would not.

5.1.1 Sense-wise evaluation

In this section we evaluate the extended noun supersenses according to four properties summarized in Table 4: whether the agreement for a supersense is high enough (Agr.), whether its frequency is high enough (Freq.), whether we identify relevant associations using PMI (Assc.), and whether it potentially improves NER compliance (NER). Moreover, we suggest two new supersenses, N.LANGUAGE and N.DOCUMENT, indicated in the lower section of Table 4.

The first three properties are obtained from the metrics in Section 4. We consider agreement high enough when there is at least 51% agreement for a supersense. We consider frequency high enough when the sense belongs to the first 28 senses out of 56 (i.e. the first half of the frequency-ranked SSI). None of the thresholds is particularly high, but we consider a noun supersense a candidate for inclusion in the final SSI if two of the four properties are satisfied. In other words, none of the criteria is necessary, but fulfilling two of them is sufficient.

We observe that most of the new senses fulfill at least two of the criteria, with the exception of N.DOMAIN, which fulfills none. Thus, we do not endorse using the N.DOMAIN supersense and keep N.COGNITION for fields of knowledge. Nevertheless, the N.ABSTRACT sense seems a valuable extension because it satisfies the agreement and frequency criteria.

Table 4: Inclusion criteria for new noun senses (rows: ABSTRACT, BUILDING, CONTAINER, DISEASE, DOMAIN, INSTITUTION and VEHICLE, with LANGUAGE and DOCUMENT in a lower section; columns: Agr., Freq., Assc. and NER).

The strongest nominal candidate for inclusion is N.INSTITUTION, which satisfies the first two criteria and additionally improves NER compatibility.

During the annotation task, we observed that a large number of examples of the standard N.COMMUNICATION supersense were document names, movie titles, and so on. One of the authors of this article reviewed all the N.COMMUNICATION spans and classified them into three categories, two of them mapped from the EWN top ontology, N.DOCUMENT and N.LANGUAGE, and a third back-off category for N.COMMUNICATION. Notice how, in spite of having spawned three senses (N.CONTAINER, N.VEHICLE and N.BUILDING), N.ARTIFACT is still a very frequent supersense.

The document-language distinction is a high-level type in the SIMPLE ontology (Lenci et al., 2000). Note that these two new communication subsenses do not solve the artifact-information ambiguity commonly found in lexical semantics (Pustejovsky, 1991). While N.LANGUAGE more often has an eventual reading (e.g. conversation, remark), N.DOCUMENT refers more often to works and other entities with a non-temporal denotation. We also use N.LANGUAGE for the metalinguistic usage of words (e.g. 'The word drizzle sounds funny'). This re-annotation produces examples like the following:

    H. C. Andersen er jo verdensberømt, fordi hans forfatterskab/N.DOCUMENT er blevet oversat til alle sprog/N.LANGUAGE.

    'H. C. Andersen is world famous, because his writings have been translated into all languages.'

Out of the 1,513 N.COMMUNICATION cases, 360 fall under N.LANGUAGE and 928 under N.DOCUMENT, and the remaining were left with the original label. Out of the 929 N.DOCUMENT spans, 382 are named entities, of which 248 are two or more tokens in length. This metric aims at justifying document as an NER label, where span identification is as relevant as proper labeling.

We believe the frequency of document-name named entities makes a good case for considering the N.DOCUMENT class as an addition to the SSI and to NER. However, we do not find enough support to recommend a N.LANGUAGE supersense and prefer using the original N.COMMUNICATION instead.

5.2 Verbs

Verbs are central to the theory of lexical semantics, yet their semantic characterization has been closer to the syntax-semantics interface (Levin, 1993; Kipper et al., 2000; Kipper et al., 2006). In this respect, the wordnet SSI for verbs is very different: verbs like jump and displace both belong to V.MOTION, even though their argument structures are very different. Nevertheless, verbal sense alternations are often associated with different argument structures (Grimshaw, 1990).

The V.CHANGE supersense is populated with semantically disparate categories and is very difficult to annotate, even though it is a very frequent sense, both in terms of annotated words and of synsets ascribed to it. According to Fellbaum (1990), 'the concept of change is flexible enough to accommodate verbs whose semantic description makes them unfit for any other semantically coherent group'. In other words, the rummage-box category for verbs is actually the majority class. Indeed, an expansion of change into its subsenses CHANGE-VARY, CHANGE-STATE, CHANGE-REVERSAL, CHANGE-INTEGRITY, CHANGE-SHAPE and CHANGE-ADAPT could potentially make the supersense more useful, if one is willing to incur the cost of annotating with five more labels.

The V.PHENOMENON supersense extends V.CHANGE by delimiting events that have no agency and are not weather-related, such as happen or occur. WordNet shows a systematic ambiguity between V.STATIVE and V.CHANGE for aspectual readings of verbs, and we also propose V.ASPECTUAL for constructions like 'start the engine' or 'begin to hope'.

We evaluate verb senses using the criteria we used for nouns in Section 5.1, but discarding NER compliance, which does not apply to verbs. Table 5 shows the criteria for verbs.

Table 5: Inclusion criteria for new verb senses (rows: ASPECTUAL and PHENOMENON; columns: Agr., Freq. and Assc.).

Both new verbal supersenses satisfy two out of three of the criteria, and we can consider them candidates for the SSI extension. We leave it for further discussion whether the aspectual verb reading deserves a full-fledged supersense or should be used as a satellite tag (cf. Section 5.4).

5.3 Adjectives

SST as defined by Ciaramita and Johnson (2003) only labels nouns and verbs. Adjectives have received much less attention than nouns and verbs, arguably because of the inherent difficulty of their analysis; cf. Boleda et al. (2012) for a survey of adjective classifications. In addition to the theoretical complications, adjectives are not regarded as core elements of meaning when building applications. For instance, in WordNet 3.0 there are 82k synsets for nouns, 14k for verbs, 18k for adjectives and 4k for adverbs. However, the base concepts from EWN (Vossen et al., 1998), with 4,869 synsets in total, hold 37 adjectives in contrast to 3,210 nouns and 1,442 verbs.

Moreover, the supersense-synset relation is hyponymic, but adjectives in WordNet are not taxonomically organized (Gross and Miller, 1990). For instance, there is no way to retrieve the fact that ashamed and exasperated are emotional in nature (Tsvetkov et al., 2014b).

The meaning plasticity of adjectives also makes it hard to determine whether adjectives hold any meaning on their own, or whether their meaning is an emergent property of the relation they establish with the noun they modify. Murphy and Andrew (1993) consider adjectives monosemous elements that define their sense when predicated alongside nouns. Under this light, adjective supersenses would be superfluous if adjective meaning is an epiphenomenon of noun meaning.

However, insofar as adjectives can help disambiguate nominal polysemy (Tsvetkov et al.,
2014a), and have different listed synsets, we advocate for providing a set of supersenses for adjectives. This addition therefore makes SST truly all-words for the three main lexical parts of speech. Adjective classifications into supersenses or coarse classes do exist, notably in GermaNet (Hamp and Feldweg, 1997), which Tsvetkov et al. (2014b) apply to English.

When applying the projection method from Section 2, we extend A.ALL with A.MENTAL, A.PHYS, A.SOCIAL and A.TIME. These supersenses do not distinguish descriptive (i.e. extensional) from reference-modifying (intensional) adjectives, e.g. former is A.TIME while imaginary is A.MENTAL. These senses do not distinguish relational adjectives either, to the extent that ecologic and one of the senses of green should fall under the same supersense.

The new adjective SSI cannot be evaluated in the same manner as nouns. The adjective SSI is much smaller, and the agreement and frequency metrics can be misleadingly positive. Indeed, all adjective supersenses satisfy the agreement and frequency criteria specified in Section 5.1.1.

However, A.ALL is the most frequent supersense for adjectives, covering 40% of the annotated adjectives. This proportion is too large, and indicates that the sense inventory needs to be further specified in order to minimize how many tokens get assigned the backoff sense.

Many of the adjectives under A.ALL are function- or appraisal-related, such as god ('good'), bedre ('better'), stor ('large' as in 'grand') and vigtig ('important'). While polarity is an important property of adjectives (Chesley et al., 2006), we do not consider it a desirable trait for supersenses, which are oriented towards conveying sense denotation rather than connotation. Hence, we suggest a new supersense A.FUNCTION to account for function-related senses, what in the terminology of Pustejovsky (1991) would be the telic role. We observe that the ALLGEMEIN ('general') category of GermaNet and Tsvetkov et al.'s MISCELLANEOUS hold similar senses.

5.4 Satellites

When annotating nouns in Section 3, we annotate continuous NER-like spans. But verb-headed multiwords pose a challenge because they are not necessarily continuous, and pose attested challenges for their annotation and automatic recognition (Hoppermann and Hinrichs, 2014; Baldwin, 2005b; Baldwin, 2005a).

We use three satellite tags: S.COLLOCATION, S.PARTICLE and S.REFLPRON (for reflexive pronouns). While the particle distinction is more relevant for satellite-framed languages (Talmy, 1985) like the Germanic languages, light-verb constructions are pervasive in many languages, including characteristically verb-framed languages like Spanish or French, where we find verb-headed multiwords like llevar a cabo (lit. 'take to ending', 'carry out') or avoir l'air (lit. 'to have the air', 'seem'), respectively. A similar approach has been used by Schneider and Smith (2015).

The intention of these tags is to help isolate the head of a verb-headed multiword. We assign the sense label to the syntactic head, even though a light-verb construction would arguably be best headed by the noun it introduces. In this manner, gøre grin af ('make fun of') would be labeled as gøre/V.COMMUNICATION grin/S.COLLOCATION af/S.COLLOCATION, and we thus avoid giving gøre ('make') the V.CREATION sense.

6 Conclusions and further work

We suggest an extension of the SSI for the three main lexical parts of speech. We obtain new supersenses using a mapping from ontological types, and evaluate their distribution after an annotation task. Most of the newly suggested senses satisfy the inclusion criteria we determine. In particular, we advocate for the inclusion of the senses N.DOCUMENT and N.INSTITUTION, which improve NER compatibility.

The extension method can be applied to any wordnet where the synsets are associated with EWN ontological types. Nevertheless, the inclusion criteria might change when dealing with different languages or corpus types. Moreover, the SSI proposed in this article can be applied retroactively to any EWN-aligned synset-annotated corpus.

With regard to adjectives, the backoff A.ALL category still constitutes 40% of the annotated adjectives. In future work, we consider including senses from the GermaNet inventory, and experimenting with data-driven approaches to infer lexical categories for adjectives by means of their relations to other words in wordnets, following the work of Alonge et al. (2000), Mendes (2006) and Nimb and Pedersen (2012), and corpus-based approaches like Lapata (2001).
Acknowledgements

We wish to thank Nathan Schneider and Yulia Tsvetkov for their useful comments.

References

Antonietta Alonge, Francesca Bertagna, Nicoletta Calzolari, Adriana Roventini, and Antonio Zampolli. 2000. Encoding information on adjectives in a lexical-semantic net for computational applications. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 42–49. Association for Computational Linguistics.

Javier Alvez, Jordi Atserias, Jordi Carrera, Salvador Climent, Egoitz Laparra, Antoni Oliver, and German Rigau. 2008. Complete and consistent annotation of WordNet using the Top Concept Ontology. In LREC.

Jørg Asmussen and Jakob Halskov. 2012. The CLARIN DK Reference Corpus. In Sprogteknologisk Workshop.

Timothy Baldwin. 2005a. Deep lexical acquisition of verb–particle constructions. Computer Speech & Language, 19(4):398–414.

Timothy Baldwin. 2005b. Looking for prepositional verbs in corpus data. In Proceedings of the Second ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, pages 180–189.

Gemma Boleda, Sabine Schulte im Walde, and Toni Badia. 2012. Modeling regular polysemy: A study on the semantic classification of Catalan adjectives. Computational Linguistics, 38(3):575–616.

Matthias Buch-Kromann, Line Mikkelsen, and Stine Kern Lynge. 2003. Danish Dependency Treebank. In TLT.

Paula Chesley, Bruce Vincent, Li Xu, and Rohini K. Srihari. 2006. Using verbs and adjectives to automatically classify blog sentiment. Training, 580(263):233.

Nancy Chinchor and Patricia Robinson. 1997. MUC-7 named entity task definition. In Proceedings of the 7th Conference on Message Understanding, page 29.

Massimiliano Ciaramita and Yasemin Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of EMNLP, pages 594–602, Sydney, Australia, July.

Massimiliano Ciaramita and Mark Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 168–175. Association for Computational Linguistics.

Christiane Fellbaum. 1990. English verbs as a semantic net. International Journal of Lexicography, 3(4):278–301.

Jane Grimshaw. 1990. Argument Structure. The MIT Press.

Derek Gross and Katherine J. Miller. 1990. Adjectives in WordNet. International Journal of Lexicography, 3(4):265–277.

Birgit Hamp and Helmut Feldweg. 1997. GermaNet - a lexical-semantic net for German. In Proceedings of the ACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, pages 9–15.

Christina Hoppermann and Erhard Hinrichs. 2014. Modeling prefix and particle verbs in GermaNet. Global WordNet Conference, page 49.

Rubén Izquierdo, Armando Suárez, and German Rigau. 2009. An empirical study on class-based word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 389–397. Association for Computational Linguistics.

Anders Johannsen, Dirk Hovy, Héctor Martínez, Barbara Plank, and Anders Søgaard. 2014. More or less supervised supersense tagging of Twitter. In Lexical and Computational Semantics (*SEM 2014).

Karin Kipper, Hoa Trang Dang, Martha Palmer, et al. 2000. Class-based construction of a verb lexicon. In AAAI/IAAI, pages 691–696.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2006. Extending VerbNet with novel verb classes. In Proceedings of LREC.

Maria Lapata. 2001. A corpus-based account of regular polysemy: The case of context-sensitive adjectives. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–8. Association for Computational Linguistics.

Mirella Lapata and Chris Brew. 2004. Verb class disambiguation using informative priors. Computational Linguistics, 30(1):45–73.

Alessandro Lenci, Nuria Bel, Federica Busa, Nicoletta Calzolari, Elisabetta Gola, Monica Monachini, Antoine Ogonowski, Ivonne Peters, Wim Peters, Nilda Ruimy, et al. 2000. SIMPLE: A general framework for the development of multilingual lexicons. International Journal of Lexicography, 13(4):249–263.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Oier Lopez de Lacalle and Eneko Agirre. 2015. Crowdsourced word sense annotations and difficult words and examples. IWCS.

Héctor Martínez Alonso, Anders Johannsen, Anders Søgaard, Sussi Olsen, Anna Braasch, Sanni Nimb, Nicolai Hartvig Sørensen, and Bolette Sandford Pedersen. 2015. Supersense tagging for Danish. In NODALIDA.

Sara Mendes. 2006. Adjectives in WordNet.PT. In Proceedings of the GWA 2006 – Global WordNet Association Conference.

George A. Miller. 1990. Nouns in WordNet: a lexical inheritance system. International Journal of Lexicography, 3(4):245–264.

Gregory Leo Murphy and Jane M. Andrew. 1993. The conceptual basis of antonymy and synonymy in adjectives. Journal of Memory and Language, 32(3):301–319.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Sanni Nimb and Bolette Sandford Pedersen. 2012. Towards a richer wordnet representation of properties - exploiting semantic and thematic information from thesauri. LREC 2012, pages 3452–3456.

Sussi Olsen, Bolette Sandford Pedersen, Héctor Martínez Alonso, and Anders Johannsen. 2015. Coarse-grained sense annotation of Danish across textual domains. In NODALIDA.

Rebecca J. Passonneau, Ansaf Salleb-Aouissi, and Nancy Ide. 2009. Making sense of word sense variation. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics.

Frédérique Segond, Anne Schiller, Gregory Grefenstette, and Jean-Pierre Chanod. 1997. An experiment in semantic tagging using hidden Markov model tagging. In Piek Vossen, Geert Adriaens, Nicoletta Calzolari, Antonio Sanfilippo, and Yorick Wilks, editors, Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications: ACL/EACL-97 Workshop Proceedings, pages 78–81, Madrid, Spain, July.

Anders Søgaard, Barbara Plank, and Héctor Martínez Alonso. 2015. Using frame semantics for knowledge extraction from Twitter. In AAAI.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2008. Yago: A large ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217.

Leonard Talmy. 1985. Lexicalization patterns: Semantic structure in lexical forms. Language Typology and Syntactic Description, 3:57–149.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL.

Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. 2014a. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL'14).

Yulia Tsvetkov, Nathan Schneider, Dirk Hovy, Archna Bhatia, Manaal Faruqui, and Chris Dyer. 2014b. Augmenting English adjective senses with supersenses. In Proceedings of LREC.
Bolette Sandford Pedersen, Sanni Nimb, Jørg Asmussen, Nicolai Hartvig Sørensen, Lars TrapJensen, and Henrik Lorentzen. 2009. Dannet: the
challenge of compiling a wordnet for Danish by
reusing a monolingual dictionary. Language resources and evaluation, 43(3):269–299.
Piek Vossen, Laura Bloksma, Horacio Rodriguez, Salvador Climent, Nicoletta Calzolari, Adriana Roventini, Francesca Bertagna, Antonietta Alonge, and
Wim Peters. 1998. The EuroWordNet base concepts and top ontology. Deliverable D017 D,
34:D036.
Wim Peters, Ivonne Peters, and Piek Vossen. 1998.
Automatic sense clustering in EuroWordNet. In
LREC. Paris: ELRA.
Fei Wu and Daniel S Weld. 2008. Automatically refining the wikipedia infobox ontology. In Proceedings
of the 17th international conference on World Wide
Web, pages 635–644. ACM.
James Pustejovsky. 1991. The generative lexicon.
Computational linguistics, 17(4):409–441.
Nathan Schneider and Noah A Smith. 2015. A corpus
and model integrating multiword expressions and
supersenses. Proc. of NAACL-HLT. Denver, Colorado, USA. To appear.
Nathan Schneider, Vivek Srikumar, Jena D. Hwang,
and Martha Palmer. 2015. A hierarchy with, of, and
for preposition supersenses. In Proc. of The 9th Linguistic Annotation Workshop, pages 112–123, Denver, Colorado, USA, June.
208
Adverbs in plWordNet: Theory and Implementation
Marek Maziarz (A), Stan Szpakowicz (B), Michał Kaliński (A)

(A) Department of Computational Intelligence, Wrocław University of Technology, Wrocław, Poland
(B) School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Ontario, Canada
(A) [email protected], [email protected]
(B) [email protected]
Abstract

Adverbs are seldom well represented in wordnets. Princeton WordNet, for example, derives from adjectives practically all its adverbs and whatever involvement they have. GermaNet stays away from this part of speech. Adverbs in plWordNet will be emphatically present in all their semantic and syntactic distinctness. We briefly discuss the linguistic background of the lexical system of Polish adverbs. We describe an automated generator of accurate candidate adverbs, and introduce the lexicographic procedures which will ensure high consistency of wordnet editors' decisions about adverbs.

1 Adverbs in wordnets and monographs

Adverbs have yet to receive their due in wordnets. There are only a few adverbs in WordNet (hardly, mostly, really, etc.), as the majority of English adverbs are straightforwardly derived from adjectives via morphological affixation (surprisingly, strangely, etc.).[1] GermaNet shares with WordNet the basic division of the database into the four lexical categories noun, adjective, verb, and adverb, although it is not planned to implement adverbs in the current work phase.[2] Curiously, English monographs on lexical semantics (Cruse, 1997; Geeraerts, 2010) give the adverb short shrift. The term does not even appear in the index of either book!

Yes, most adverbs do derive from adjectives.[3] And yet, the adverb is a bona fide open-class part of speech. Its distinctness and its peculiarities cannot be “swept under the carpet” by making it merely an inflected adjective.

Polish morphology acknowledges the adverb grudgingly, but at least it is present in several monographs, notably in (Grzegorczykowa, 1975).

The paper presents a definition of adverbs in plWordNet (section 2), a procedure to generate candidate adverbs (section 3), a manual verification (section 4) and a few conclusions (section 5).

[1] https://wordnet.princeton.edu/
[2] http://www.sfs.uni-tuebingen.de/lsd/germanet_structure.shtml – dated 2009
[3] Calculations on dictionary material show that only 1% of all adverbs are not derived from adjectives (Grzegorczykowa, 1998, p. 524).

2 Adverbs in plWordNet

The designers of plWordNet established a spectrum of relations for nouns, verbs and adjectives (Maziarz et al., 2011a; Maziarz et al., 2011b; Maziarz et al., 2012). Table 1 lists the relations for adverbs, with examples.[4] The list is based on the adjective model (Maziarz et al., 2012); we have assumed that those relations will fit adverbs, given that most adverbs are transposition derivatives from adjectives.

[4] See http://tinyurl.com/okdc5w7 for all relations and wordnet editors' instructions (in Polish).

Every relation type has its own test expressions. (The substitution of lexical units for variables yields correct expressions in Polish.) Language forces the tests to be polymorphic. That is because an adverb can modify a verb, an adjective or an adverb, and it can appear in a predicative position (jest ‘to be, 3rd person’ + adverb).
Synset relations:
• hyponymy: gorączkowo1 ‘frantically’ → nerwowo1 ‘anxiously’
• value of the attribute: intensywnie2 ‘intensively’ → intensywność1 ‘intensity’
• gradation: brązowawo1 ‘in brownish colour’ → brązowo2 ‘in brown colour’
• fuzzynymy: weselnie1 ‘in a wedding mood’ → wódka1 ‘vodka’
• inter-register synonymy: dziwnie1 ‘strangely’ → dziwno1 ‘strangely (obsolete)’

Lexical unit relations:
• antonymy: apriorycznie1 ‘a priori’ ↔ aposteriorycznie1 ‘a posteriori’
• converseness: lepiej1 ‘better’ ↔ gorzej1 ‘worse’
• XPOS synonymy: gorączkowo1 adv. ‘frantically’ → gorączkowy1 adj. ‘frantic’
• degree: lepiej1 ‘better’ → dobrze1 ‘well’
• derivation: intonacyjnie1 ‘with regard to intonation’ → intonacja3 ‘intonation’

Table 1: Relations in plWordNet with examples.

2.1 Synset relations

Synset relations are short-cuts for a bundle of links between lexical units belonging to two different synsets (Maziarz et al., 2013, pp. 774-775). Our test expressions, then, admit pairs of lexical units belonging to synsets which are supposed to be linked by a synset relation.

We present four such tests for hyponymy.[5] The symbols x, y denote adverb lexical units. The awkward phrase “does it x” is meant as “does it in a manner x”, etc.

[5] We give separate tests for the adjective modifier, the predicative position, and the modifiers of intentional and unintentional verbs; Laskowski (1998) gives an exact definition.

Listing 1: Hyponymy. Modifier of intentional verbs.
  Jeżeli ktoś/coś robi coś x, to robi to y.
  Jeżeli ktoś/coś robi coś y, to niekoniecznie robi to x.
  ‘If someone/something does something x, they do it y.’
  ‘If someone/something does something y, they do not necessarily do it x.’

Listing 2: Hyponymy. Modifier of unintentional verbs.
  Jeżeli coś dzieje się x, to dzieje się y.
  Jeżeli coś dzieje się y, to niekoniecznie dzieje się x.
  ‘If something happens x, it happens y.’
  ‘If something happens y, it does not necessarily happen x.’

Listing 3: Hyponymy. Adjective modifier.
  Jeżeli ktoś/coś jest x jakiś, to jest też y jakiś.
  Jeżeli ktoś/coś jest y jakiś, to niekoniecznie jest x jakiś.
  ‘If someone/something is x so, they are also y so.’
  ‘If someone/something is y so, they are not necessarily x so.’

Listing 4: Hyponymy. Predicative adverb.
  Jeżeli jest x, to jest też y.
  Jeżeli jest y, to niekoniecznie jest x.
  ‘If it is x, it is also y.’
  ‘If it is y, it is not necessarily x.’

When we insert actual words into these tests, we can decide whether the resulting assertion is true. For example, let x and y in Listing 1 be gorączkowo1 ‘frantically’ and nerwowo1 ‘anxiously’:
• Jeżeli ktoś robi coś gorączkowo1, to robi to nerwowo1. ‘If someone does something frantically, he does it anxiously.’
• Jeżeli ktoś/coś robi coś nerwowo1, to niekoniecznie robi to gorączkowo1. ‘If someone does something anxiously, he does not necessarily do it frantically.’
Both these statements hold for Polish: the relation hypo(gorączkowo1, nerwowo1), then, is an instance of hyponymy in plWordNet.

Let us now put the hyponymous pair fiołkowo1 ‘± like a violet’ and słodko2 ‘sweetly’ in Listing 2, and replace the generic non-volitional dzieje się ‘it happens’ with its hyponym pachnie ‘it smells’:
• Jeżeli coś pachnie fiołkowo2, to pachnie słodko3. ‘If something smells like a violet, it smells sweetly.’
• Jeżeli coś pachnie słodko3, to niekoniecznie pachnie fiołkowo2. ‘If something smells sweetly, it does not necessarily smell like a violet.’

In Listing 3, we put the hyponymous pair bordowo1 ‘maroon (adv.)’ and ciemnoczerwono1 ‘dark-red (adv.)’ and a specific passive participle zabarwiony ‘*-hued’ to replace the generic “so”:
• Jeżeli coś jest bordowo1 zabarwione, to jest też ciemnoczerwono1 zabarwione. ‘If something is maroon-hued, it is also dark-red-hued.’
• Jeżeli coś jest ciemnoczerwono1 zabarwione, to niekoniecznie jest bordowo1 zabarwione. ‘If something is dark-red-hued, it is not necessarily maroon-hued.’

Finally, two hyponymous adverbs in a predicative context (to be, 3rd person + adverb):[6]
• Jeżeli jest słonecznie6, to jest też bezchmurnie4. ‘If it is sunny (adv.), it is also cloudless (adv.).’
• Jeżeli jest bezchmurnie4, to niekoniecznie jest słonecznie6. ‘If it is cloudless (adv.), it is not necessarily sunny (adv.).’

[6] Unlike English, Polish allows both adjectives and adverbs in this position.

If any of these four tests admits a given pair of lexical units, we will say they are a hyponymy pair.

The relation value of the attribute resembles hyponymy. It holds between an adverb, treated as a feature value, and a noun, which represents a certain category (attribute). For example, the attribute intensywność1 ‘intensity’ has several values, among them the adverbs intensywnie2 ‘intensively’, fanatycznie1 ‘fanatically’ and wydajnie3 ‘about cough in medicine: efficiently’. Actual hyponymy and value of the attribute together form the backbone of plWordNet’s adverb structure.

The gradation relation is applied when a series of adverbs can be arranged into a sequence according to some scale. The adverbs brązowawo1 ‘in brownish colour’ and brązowo2 ‘in brown colour’ represent the same attribute hue and could be ordered according to that attribute. Adverb sequences can be quite long. Consider adverbs of temperature: lodowato1 ‘icily’, zimno5 ‘coldly’, zimnawo1 ‘coldishly’, chłodno6 ‘coolly’, chłodnawo1 ‘coolishly’, letnio1 ‘lukewarmly’, ciepło1 ‘warmly’, gorąco1 ‘hotly’.

Inter-register synonymy links adverbs which would be synonymous if not for minor differences in register (in usage). For example, the adverbs dziwnie1 and dziwno1 occupy nearly the same place in plWordNet’s lexico-semantic relation net. They are related to the same lexical units, except for hyponymy (see Figure 1 at the end of section 3). They cannot be in the same synset: dziwno1 is obsolete, so it is a poor hypernym choice for contemporary vocabulary, while dziwnie1 belongs to the general register.

2.2 Lexical unit relations

The most prominent relation among lexical units is cross-categorial synonymy, which we refer to as XPOS synonymy. It binds the adjectival net with the adverbial net. Almost all plWordNet adverbs are related to their derivative bases.[3] An adverb x and its adjective base a are XPOS-synonymous if they can be replaced in the nominalisation process – see (Nagórko, 1987, p. 140) and (Jędrzejko, 1993, p. 61). Two transpositions are possible from a verb context to a nominalised phrase (denoted by the symbol ⇒):
• krzątał się gorączkowo ‘he bustled frantically’ ⇒ gorączkowa krzątanina ‘frantic bustle’,
• jest zimno na ulicy ‘it is cold in the street’ ⇒ zimna ulica ‘cold street’.

The test expressions make use of these transpositions. Let us present a test for a modifier of intentional verbs (Listing 5; x is an adverb, a is an adjective).

Listing 5: XPOS synonymy. Modifier of intentional verbs.
  Jeżeli ktoś/coś robi coś x, to jest to a robienie czegoś przez kogoś/coś.
  Jeżeli to jest a robienie czegoś przez kogoś/coś, to ktoś/coś robi to x.
  ‘If someone/something does something x, then it is a doing something by someone/something.’
  ‘If it is a doing something by someone/something, then someone/something does it x.’

For gorączkowo1 and gorączkowy1, we get the following test expressions:
• Jeżeli ktoś/coś robi coś gorączkowo1, to jest to gorączkowe1 robienie czegoś przez kogoś/coś. ‘If someone/something does something frantically, then it is frantic doing something by someone/something.’
• Jeżeli jest to gorączkowe1 robienie czegoś przez kogoś/coś, to ktoś/coś robi coś gorączkowo1. ‘If it is frantic doing something by someone/something, then someone/something does something frantically.’

The tests check the truth of two hyponymy-like implications which go in opposite directions.
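The substitution tests above are template pairs: replacing the variables x and y with lexical units yields the Polish sentences an editor judges. A minimal Python sketch of this mechanism (the constant and function names are illustrative, not part of the plWordNet tooling):

```python
# Minimal sketch of the substitution tests (cf. Listings 1-5):
# a test is a pair of sentence templates; filling the variables
# x and y with lexical units yields the expressions to be judged.
# Names here are illustrative, not part of the plWordNet tooling.

HYPONYMY_INTENTIONAL = (
    "Jeżeli ktoś/coś robi coś {x}, to robi to {y}.",
    "Jeżeli ktoś/coś robi coś {y}, to niekoniecznie robi to {x}.",
)

def instantiate(test, x, y):
    """Substitute two adverb lexical units into a template pair."""
    return tuple(sentence.format(x=x, y=y) for sentence in test)

for sentence in instantiate(HYPONYMY_INTENTIONAL, "gorączkowo", "nerwowo"):
    print(sentence)
```

If an editor accepts both resulting sentences, the pair passes the test, as with hypo(gorączkowo1, nerwowo1) above.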
Since synonymy can be seen as bi-directional hyponymy, the tests effectively investigate synonymy conditions for the two parts of speech.

Apart from XPOS synonymy, the adverbial plWordNet has two more derivationally motivated relations: degree and derivation. The former caters for synthetic comparatives and superlatives.[7] The latter is a catch-all for other derivational relations.

[7] Degree in Polish adverbs is either synthetic (affix -ej for comparatives and naj-…-ej for superlatives) or analytic (precede with the adverb bardzo ‘more’ or najbardziej ‘most’, respectively) (Grzegorczykowa, 1998, pp. 533-534).

Antonymy links two adverb lexical units if they satisfy the conditions in Listing 6.

Listing 6: Antonymy. Predicative context.
  – Jest x? – Wręcz przeciwnie: jest y.
  Jeżeli jest x, to nie jest y.
  Jeżeli nie jest x, to niekoniecznie jest y.
  ‘– Is it x? – On the contrary: it is y.’
  ‘If it is x, then it is not y.’
  ‘If it is not x, then it is not necessarily y.’

Semantic opposition was introduced into this test with a short dialogue, with the key word przeciwnie ‘on the contrary, conversely’ (note the predicative context):[8]
• – Jest x? ‘– Is it x?’
• – Wręcz przeciwnie: jest y. ‘– On the contrary: it is y.’

[8] We follow here a very interesting synonymy test (Cruse, 1997, pp. 257-258): “[N]ot all lexical items are felt to have opposites. Ask someone for the opposite of table, or gold, or triangle, and he will be unable to oblige. Some lexical items, it seems, are inherently non-opposable.” The dialogue from our test suggests a language-game in oppositions (“[a]sk someone for the opposite of…”). This helps us throw out those lexical unit pairs which only satisfy the main condition of antonymy, i.e., the incompatibility implication x ⇒ ∼y (Lyons, 1981, 154-155).

Consider the pair słonecznie6 ‘sunny (adv.)’ and deszczowo1 ‘rainy (adv.)’:
• – Jest słonecznie6? – Nie, wręcz przeciwnie: jest deszczowo1. ‘– Is it sunny? – On the contrary: it is rainy.’
• Jeżeli jest słonecznie6, to nie jest deszczowo1. ‘If it is sunny, then it is not rainy.’
• Jeżeli nie jest słonecznie6, to niekoniecznie jest deszczowo1. ‘If it is not sunny, then it is not necessarily rainy.’

According to Lyons (1981), converseness is quite frequent among adverbs in the comparative degree whose positive degree is involved in antonymy. We found many such pairs. Listing 7 shows tests for an adjective modifier.

Listing 7: Converseness. Predicative context.
  Jeżeli p robi coś x niż q, to q robi to y niż p.
  ‘If p does something x than q, then q does it y than p.’

For example, the lexical units wolno6 ‘slowly’ and szybko3 ‘quickly’ have the comparatives wolniej1 ‘more slowly’ and szybciej1 ‘more quickly’. The test becomes:
• Jeżeli p robi coś wolniej niż q, to q robi to szybciej niż p. ‘If p does something more slowly than q, then q does it more quickly than p.’

3 Automatic generation of candidate adverbs

We followed six steps in the generation of new adverbs from their adjective bases. We worked all along with a copy of plWordNet, which we denote plWordNetc.

1. Derivator. Consider every existing adjective lemma X within the domain qualitative in plWordNetc. Using the Derivator tool (Piasecki et al., 2012), create all possible adverbial derivatives A of the adjectives X housed in plWordNetc. The resulting lexicon L contains adverb-adjective pairs of lemmas (A, X).

Table 2 presents the statistics of the derivation process. Since mainly qualitative adjectives form their adverbs, it is interesting that more than one-third of them have their derivatives. For example, for the adjective czyściutki ‘pleasantly clean, clear, pure’ the Derivator created its adverb derivative czyściutko ‘≈cleanly, neatly; purely’, whereas for the adjective poszkodowany ‘injured, damaged’ no adverb was derived.

2. Adverbial lexical units. For every given qualitative adjective lexical unit x in plWordNetc representing a lemma X which is present in L, create its counterpart lexical unit a representing the lemma A. Omit the lexical units housed in artificial (non-lexical) synsets (Piasecki et al., 2009, p. 30). Equip every thus created adverb lexical unit with register labels and glosses copied from the corresponding adjective unit.
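Steps 1 and 2 above can be sketched as follows. The real Derivator (Piasecki et al., 2012) is a separate morphological tool, so a toy suffix rule stands in for it here; the lemmas and sense lists are merely illustrative:

```python
# Illustrative sketch of steps 1-2: build the lexicon L of
# (adverb, adjective) lemma pairs, then give every qualitative
# adjective sense an adverb counterpart sense.
# A toy suffix rule stands in for the real Derivator.

def toy_derivator(adj_lemma):
    """Hypothetical stand-in: derive an adverb lemma, or None."""
    if adj_lemma.endswith(("y", "i")):
        return adj_lemma[:-1] + "o"   # e.g. gorączkowy -> gorączkowo
    return None

# lemma -> its sense numbers in the domain "qualitative"
adjective_senses = {
    "gorączkowy": [1],
    "czyściutki": [1, 2, 3, 4, 5],
}

# Step 1: the lexicon L of (A, X) pairs, keyed by the adverb lemma A.
L = {}
for X in adjective_senses:
    A = toy_derivator(X)
    if A is not None:
        L[A] = X

# Step 2: every qualitative sense x of X gets a counterpart sense a of A
# (register labels and glosses would be copied from x at this point).
adverb_senses = {A: list(adjective_senses[X]) for A, X in L.items()}
print(adverb_senses)   # czyściutko inherits all 5 senses
```

As in the paper's example, czyściutki with 5 qualitative senses yields czyściutko with 5 counterpart senses.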
Lemma type                     Freq.    [%]
Adj. lemmas                    27,042   100.0
Qualitative Adj. lemmas        17,045   63.0
Adv. derivative lemmas, |L|    6,321    23.4

Table 2: Statistics for automatic adverb derivation by the Derivator and plWordNetc. Abbreviations: Adj. – adjective, Adv. – adverb, |L| – cardinality of the set L.

The rule states that whenever an adjective lexical unit x from the domain qualitative has an entry (A, X) in the dictionary L, we create for it its counterpart lexical unit a. For example, the lemma czyściutki has 5 senses in plWordNetc in the domain qualitative, so the lemma czyściutko would also have 5 senses (the units a).

3. Filtering rules. Having created counterparts a for the senses x, we perform filtering based on six rules. Two of them are shown in Listings 8-9. If a rule's premise holds, we remove from plWordNetc the considered sense a0 of a given adverb lemma A.

Listing 8: Illustration for rule #1.
  mod(x0, istota1) ∨
  ∃y [mod(x0, y) ∧ hypo′(y, istota1)] ∨
  ∃y [hypo′(x0, y) ∧ mod(y, istota1)] ∨
  ∃y, n [hypo′(x0, y) ∧ mod(y, n) ∧ hypo′(n, istota1)]

The symbols in Listing 8 are lexical units: x0 and y are adjectives, a0 is the adverb counterpart of the adjective x0, n is a noun. The noun istota1 means ‘being, causal agent, human being, spirit or animal’; hypo′(x, y) holds if y is a direct or indirect hypernym of x; mod(x, n) holds if x is a modifier of n; val(x, n) holds if x is a value of the attribute n.

Listing 9: Illustration for rule #4.
  val(x0, zachowanie7) ∨
  ∃y [hypo′(x0, y) ∧ val(y, zachowanie7)]

The symbols in Listing 9 are as in Listing 8. The noun zachowanie7 means ‘behaviour, manner of acting or controlling oneself’.

Rules #2 and #3 are derived from rule #1 by replacing istota1 with organizm1 ‘living organism’ and grupa5 ‘group of people’, respectively. Rules #5 and #6 arise from rule #4 by replacing zachowanie7 with cecha osobowości1 ‘character trait’ and pochodzenie5 ‘origin, source of someone/something’, respectively. The rules are based on a simple random sample of 69 adjective lexical units from plWordNetc (more in Section 4).

4. Synsets. Group all adverbial lexical units into synsets, mirroring their counterpart adjective synsets: two adverb units a1, a2 are in the same synset iff the corresponding adjective lemmas x1, x2 are in the same synset. An adjective lemma can also correspond to two or more adverb lemmas (each with perhaps a slightly different meaning). In such cases, all adverb lexical units a1, a2, … are considered counterparts of the same adjective lexical unit x; the register obsolete (Maziarz et al., 2014; Maziarz et al., 2015) is assigned to all ak except the unit of the most frequent adverb lemma.

For example, the lemma żmudny ‘arduous; laborious’ has only one meaning in plWordNet, but two adverbial derivatives in the lexicon L: żmudnie, żmudno ‘arduously; laboriously’ (of which the second one is almost absent from modern Polish texts). It also has one synonym, mozolny. Since mozolny has its own adverb derivative mozolnie, we finally get a 3-element synset: {żmudnie1, żmudno1 (obsolete), mozolnie1}.

5. XPOS synonymy. Add the cross-categorial (XPOS) synonymy between the adverb lexical units a and the corresponding adjective lexical units x.

For the adverbs described above, the XPOS synonymy relation instances are the following: żmudnie → żmudny, żmudno → żmudny, mozolnie → mozolny.

The last step is to copy relations from the adjective part of plWordNetc.

6. Copying relations. Copy relations from the adjective part of plWordNetc onto the adverbial part. This step is split in two sub-steps, one for copying hyponymy chains, and another for copying various other relations.

(a) Hypernymy/value. If there is hyponymy between adjectives x and y, their counterpart adverbs a and b are also connected by hyponymy. There also may be “holes” in hyponymy chains, created by adjective synsets which do not have any corresponding adverb synsets (either not generated or filtered
out). Such “holes” are stepped over; see Listing 10.[9] For example, given an adjective chain x1 → x2 → x3 such that only the adverbs a1 and a3 exist, the link a1 → a3 is created. The relation “value of the attribute” is treated specially here; it may connect a top adjective hypernym in a chain to a noun. When copying this relation, a top adverb in a hypernymy chain will be linked to that noun if there is a hypernymy + value-of-the-attribute path from its counterpart to the noun; see Listing 11. Figure 1 is a descriptive example of this process.

(b) Other relations. Four other adjective-linking relations are copied onto their counterpart adverbs: gradation, inter-register synonymy, antonymy, and converseness. So, if one of these relations connects adjectives x1, x2, their counterparts a1, a2 will also be connected. Since these relations do not form chains, only immediate neighbours are considered; if one of the connected adjectives has no adverb counterpart, the relation will not be copied.

Listing 10: Illustration for hyponymy chain copying conditions.
  ∀a, b ∃x, y  hypo′(a, b) ⇐ hypo′(x, y) ∧ xpos(a, x) ∧ xpos(b, y)

Listing 11: Illustration for value-of-the-attribute relation repair conditions.
  ∀a, b ∃x, y, n  val(a, n) ⇐ (val(x, n) ∧ xpos(a, x)) ∨ (hypo′(x, y) ∧ xpos(a, x) ∧ val(y, n))

The symbols x, y, a, b, n in Listings 10-11 are lexical units: x, y are adjectives, a and b are adverbs, n is a noun; hypo′(x, y) holds if y is a direct or indirect hypernym of x; val(x, n) holds if x is a value of the attribute n; xpos(a, x) holds if a is a cross-categorial synonym of x.

[9] hypo′(•, •) stands for direct or indirect hyponymy.

Figure 1 illustrates the rule with the hyponymy chain of the synset {postrzelony2} ‘crazy’. There are 6 elements in the adjective path (on the left), including the value of the attribute relation. The Derivator did not create some derivatives, so the adverb structure (on the right) is not an exact copy of the adjective part. Luckily, in this case only derivatives forbidden in Polish (marked with “X” in the Figure) were omitted. Instances lacking a relation were stepped over by pointing to the closest possible synset (dziwnie – podobieństwo).

Figure 1: The hyponymy path for postrzelony ‘crazy’. “X” marks synsets left empty by the algorithm in plWordNetc. (The adjective chain runs, top to bottom: {podobieństwo1} ‘similarity’ (noun, linked by value of the attribute) – {inny3} ‘unlike in nature, form, or quality; different’ – {dziwny1} ‘strange’ – {zwariowany3} ‘crazy’ – {świrnięty2, świśnięty1} ‘≈crazy’ – {postrzelony2} ‘≈crazy’. On the adverb side only {dziwnie1}, {†dziwno1} ‘strangely’ and {zwariowanie3} ‘crazily’ exist; the other synsets have no derivatives.)

4 Manual verification

We evaluate the procedure from section 3 in three experiments, two before copying plWordNetc onto plWordNet (SL, ST), and one afterwards (SV). The former two were based on simple random samples of 69 (SL) and 70 (ST) adjective lexical units from plWordNet. The development set SL helped write and check the filtering rules in Section 3. As a baseline BL we chose the procedure's performance, without filtering, on the first set of 69 adjectives. The test set ST was used to reassess the measures of efficiency. The randomly drawn adjectives were checked manually by plWordNet editors (all of them linguists with a university degree) for correspondence with adverbial lexical units.

[10] In Tables 3-5, A+ / A− denote lexical units which are / are not proper Polish adverbs. W+ / W− denote lexical units present / not present in plWordNetc, because either the Derivator did not create them, or they were filtered out by rules #1-#6 from step 3 in section 3. P(W+) and R(A+) are the precision and recall of recognising real adverb lexical units. CI is the confidence interval.

In the SL sample (Table 3),[10] two of 27 adverbs in plWordNetc are our procedure's “creation”, and
25 of 36 existing adverbs were introduced into plWordNetc. Let us calculate the precision of introducing adverbs into plWordNet, P(W+), and the recall of the automatic recognition of adverbial lexical units, R(A+), the most important measures of reliability in this case (N(•) is set cardinality):

  P(W+) = N(W+ ∩ A+) / N(W+) = 93%   (1)
  R(A+) = N(W+ ∩ A+) / N(A+) = 69%   (2)

          BL (n = 69)        SL (n = 69)
          W−      W+         W−      W+
  A−      22      11         31       2
  A+      10      26         11      25
  M           11,402             10,190
  P(W+)   70%* [53÷84%]     93%* [76÷99%]
  R(A+)   72%  [55÷86%]     69%  [52÷84%]

Table 3: The confusion matrix for our automatic procedure on the development set. BL – baseline, the procedure without filtering; SL – the development set; M is the plWordNetc size, n is the sample size, both in lexical units. The asterisks mark statistically significant differences between BL and SL at the confidence level 95%.
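The arithmetic of equations (1)-(2) can be checked directly against the SL counts in Table 3 (a quick sanity check, not project code):

```python
# Sanity check of equations (1)-(2) against the SL counts in
# Table 3: N(W+ ∩ A+) = 25, N(W+) = 25 + 2 = 27, N(A+) = 25 + 11 = 36.
n_both = 25            # generated units that are genuine adverbs
n_generated = 25 + 2   # all generated units, N(W+)
n_genuine = 25 + 11    # all genuine adverb units in the sample, N(A+)

precision = n_both / n_generated   # P(W+)
recall = n_both / n_genuine        # R(A+)

print(f"P(W+) = {precision:.0%}")  # P(W+) = 93%
print(f"R(A+) = {recall:.0%}")     # R(A+) = 69%
```

25/27 ≈ 92.6% and 25/36 ≈ 69.4%, which round to the 93% and 69% reported in Table 3.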
Precision and recall answer two questions:
• How many automatically generated lexical units are real adverb lexical units?
• How many adverb lexical units that could be generated by copying structure from the adjective part of plWordNet were indeed created?

The set W+ ∩ A− contains false positives: adverbs which do not exist in reality but were introduced by the algorithm. The set W− ∩ A+ contains false negatives: adverbs which do exist in the language but were omitted by the algorithm. For illustration, we present their elements.
• W+ ∩ A− = {kurczliwy1 ‘contractible’, żeński3 ‘female’}
• W− ∩ A+ = {redukowalny1 ‘reducible’, jednosetowy1 ‘one-set [e.g., in tennis]’, polarny1 ‘arctic or antarctic’, ropuchowaty1 ‘toadlike’, włókienkowaty1 ‘fibrillose’, brutalny2 ‘brutal’, warzywny3 ‘vegetable (adj.)’, jednopasmowy1 ‘single-lane’, równobrzmiący1 ‘consonant’, pilśniowaty1 ‘felt-like’, dwupolowy2 ‘bi-polar’}

Our procedure performed better on the SL sample, with a statistically significant increase of precision (from 70% to 93%), and a small, not significant, decrease of recall (from 72% to 69%). The size of the adverbial base in plWordNetc was only 10% smaller after filtering the original base (see the row M in Table 3).

The results were promising, so we drew yet another sample, ST. Now precision was still high, but recall was lower; however – since we ran the very same algorithm as in SL – the size M of the adverb plWordNetc (in lexical units) did not change.

          ST (n = 70)
          W−      W+
  A−      20       4
  A+      24      22
  M           10,190
  P(W+)   85% [65÷96%]
  R(A+)   45% [33÷63%]

Table 4: The confusion matrix for our automatic procedure on the test set. M is the plWordNetc size, n is the sample size, both in lexical units.

          SV (n = 517)
          W−      W+
  A−      NA      86
  A+      100    331
  Z            241
  P(W+)   79% [75÷83%]
  R(A+)   78% [72÷81%]

Table 5: The confusion matrix for our automatic procedure on the validation set. SV – the validation set; Z – the number of adverb lemmas in SV, and n – the sample size in lexical units. Note that the cell W− ∩ A− is empty because we changed the interpretation of recall.

With high precision and a reasonably slight “leakage” of lexical units (reasonably high M), we finally decided to copy plWordNetc onto the live base plWordNet. The plWordNetc set consisted of 10,190 lexical units. We gave the resulting “adverbial” plWordNet to a team of 10 editors, asking them to build upon this automatically generated
91%
88%
Table 6 shows that our procedure does not miss
much. For example (row 3), it only omitted 1418
adverbs with frequency above 10.
77%
62%
10M+
1M-10M
100k-1M
1
2
3
4
10k-100k
Figure 2: Coverage of lexicon built from plWordNet Corpus with regard to different frequency
bins.
wordnet. Table 5 presents the results of manual
verification of part of the automatically generated
adverb wordnet; that is the validation set SV . The
conditions of the validation were different than in
two earlier experiments SL and ST , in which the
starting point were adjective lexical units. SV contained only the adverb lemmas generated by the
procedure and worked upon by the editors. In SV ,
we were not interested in recall of adverbs derivable from the existing adjectives. We changed the
interpretation:
• How many adverb lexical units which could have been introduced into plWordNet from the generated adverb lemmas were indeed created? Roughly one in four or five lexical units is not an appropriate adverb lexical unit; one in four or five existing senses of a given lemma is missing.11

Adverb class                                  lemmas      %
in plWN, f > 10                                3,720   42.8
in plWN, f <= 10                               2,601   29.9
not in plWN, f > 10                            1,418   16.3
multi-word adverbs, po polsku type, f > 10       958   11.0
Total                                          8,697  100.0
(with multi-word adverbs, a guess)      ≈9,000–10,000

Table 6: The estimated size of plWordNet's adverb list, based on frequencies (f) in the plWordNet corpus.

Row 4 in Table 6 refers to a productive class of multi-word adverbs such as (mówić) po polsku, po angielsku '(speak) Polish, English'. There are also other productive patterns, e.g., (ubierać się) z polska, z niemiecka '(dress) Polish-style, German-style', as well as non-compositional constructions, e.g., z dobroci serca 'out of the goodness of one's heart'. All such adverbial expressions must be added to plWordNet. The "po polsku" type is much more frequent than the other types; we found almost 1,000 such word combinations in the corpus. Thus we estimate the number of all other multi-word adverb lexical units at yet another 1,000. We expect, all told, 9 to 10 thousand lemmas.

Clearly, adding adverbs to plWordNet is work in progress. Detailed instructions for the editors,4 in keeping with our practice over the years, are meant to ensure the consistency of editorial decisions. Editors now verify, add to and complete the list of adverb lexical units automatically generated from plWordNet's adjectives. Next, we plan to add multi-word lexical units of the po polsku type and of other types.

5 Whither adverbs in plWordNet?
We have so far only considered adverbs which can
be generated from adjectives in plWordNet. It
stands to reason that coverage could increase if we
worked instead with corpus-based frequency lists.
Figure 2 presents the coverage of a lexicon built from the plWordNet corpus.12 The more frequent an adverb is, the more likely it is to appear in plWordNet. Even for the least frequent adverbs, the coverage is still a high 62%.
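The coverage analysis behind Figure 2 can be reproduced in a few lines; the sketch below is illustrative (the function name, bin edges and toy counts are invented, not the authors' tooling), assuming corpus lemma frequencies in a Counter and the wordnet lemma list as a set:

```python
from collections import Counter

def coverage_by_bin(corpus_freqs, wordnet_lemmas, bin_edges):
    """For each frequency bin [lo, hi), report the percentage of corpus
    lemmas in that bin which already appear in the wordnet lemma list."""
    results = {}
    for lo, hi in bin_edges:
        in_bin = [w for w, f in corpus_freqs.items() if lo <= f < hi]
        if not in_bin:
            continue
        covered = sum(1 for w in in_bin if w in wordnet_lemmas)
        results[(lo, hi)] = 100.0 * covered / len(in_bin)
    return results

# toy adverb counts; the real input would be the plWordNet corpus frequency list
freqs = Counter({"szybko": 500, "wolno": 120, "pięknie": 15, "dziwacznie": 3})
wn_lemmas = {"szybko", "wolno", "pięknie"}
print(coverage_by_bin(freqs, wn_lemmas, [(1, 11), (11, 1000)]))
```

Rarely seen lemmas then fall into the low-frequency bins, where, as the paper reports, coverage is lowest.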
Acknowledgments

Work supported by the Polish Ministry of Education and Science, Project CLARIN-PL, the European Innovative Economy Programme project POIG.01.01.02-14-013/09, and by the EU's 7FP under grant agreement No. 316097 [ENGINE]. Thanks to Paweł Kędzia for help with the adverb generation algorithm. Thanks to Agnieszka Dziob and Justyna Wieczorek, the co-authors of the adverb guidelines, for the manual verification of the learning and test sets.
11
Note that this is no longer a simple random sample: editors work on packages with lists of senses of the same lemma,
also synonyms and hyponyms/hypernyms of the senses. The
sampling design most resembles cluster sampling. The confidence interval must be treated here as an approximation.
12
The corpus consists of 250M tokens in the ICS PAS
Corpus (Przepiórkowski, 2004); 113M tokens of news items
(Weiss, 2008); ≈80M tokens in a corpus made of Polish
Wikipedia (Wikipedia, 2010); an annotated corpus KPWr
with ≈0.5M tokens (Broda et al., 2012); ≈60M tokens of
shorthand notes from the Polish parliament; and ≈1.2 billion
tokens collected from the Internet.
References

[Broda et al.2012] Bartosz Broda, Michał Marcińczuk, Marek Maziarz, Adam Radziszewski, and Adam Wardyński. 2012. KPWr: Towards a Free Corpus of Polish.

[Cruse1997] Alan Cruse. 1997. Lexical Semantics. Cambridge University Press.

[Geeraerts2010] Dirk Geeraerts. 2010. Theories of Lexical Semantics. Oxford University Press.

[Grzegorczykowa1975] Renata Grzegorczykowa. 1975. Funkcje semantyczne i składniowe polskich przysłówków [The semantic and syntactic functions of Polish adverbs]. Ossolineum, Wrocław.

[Grzegorczykowa1998] Renata Grzegorczykowa. 1998. IV. Słowotwórstwo: Przysłówek [IV. Derivation: The adverb]. In Renata Grzegorczykowa, Roman Laskowski, and Henryk Wróbel, editors, Gramatyka współczesnego języka polskiego [Grammar of Contemporary Polish], volume 2 of Morfologia [Morphology], pages 524–535. PWN, 2nd edition.

[Jędrzejko1993] Ewa Jędrzejko. 1993. Nominalizacje w systemie i w tekstach współczesnej polszczyzny [Nominalisations in the language system and in texts of contemporary Polish]. University of Silesia Press, Katowice.

[Laskowski1998] Roman Laskowski. 1998. Kategorie morfologiczne języka polskiego – charakterystyka funkcjonalna [The morphological categories of Polish – a functional characterisation]. In Renata Grzegorczykowa, Roman Laskowski, and Henryk Wróbel, editors, Gramatyka współczesnego języka polskiego [Grammar of Contemporary Polish], volume 1 of Morfologia [Morphology], pages 151–224. PWN, 2nd edition.

[Lyons1981] John Lyons. 1981. Language and Linguistics: An Introduction. Cambridge University Press.

[Maziarz et al.2011a] Marek Maziarz, Maciej Piasecki, Stan Szpakowicz, and Joanna Rabiega-Wiśniewska. 2011a. Semantic Relations among Nouns in Polish WordNet Grounded in Lexicographic and Semantic Tradition. Cognitive Studies / Études Cognitives, 11:161–181.

[Maziarz et al.2011b] Marek Maziarz, Maciej Piasecki, Stan Szpakowicz, Joanna Rabiega-Wiśniewska, and Bożena Hojka. 2011b. Semantic Relations Between Verbs in Polish Wordnet. Cognitive Studies / Études Cognitives, 11:183–200.

[Maziarz et al.2012] Marek Maziarz, Stan Szpakowicz, and Maciej Piasecki. 2012. Semantic Relations among Adjectives in Polish WordNet 2.0: A New Relation Set, Discussion and Evaluation. Cognitive Studies / Études Cognitives, 12:149–179.

[Maziarz et al.2013] Marek Maziarz, Maciej Piasecki, and Stanisław Szpakowicz. 2013. The chicken-and-egg problem in wordnet design: synonymy, synsets and constitutive relations. Language Resources and Evaluation, 47(3):769–796. DOI 10.1007/s10579-012-9209-9.

[Maziarz et al.2014] Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, and Stan Szpakowicz. 2014. Registers in the System of Semantic Relations in plWordNet. In Proc. 7th International Global Wordnet Conference, pages 330–337.

[Maziarz et al.2015] Marek Maziarz, Maciej Piasecki, and Stan Szpakowicz. 2015. The system of register labels in plWordNet. Cognitive Studies / Études Cognitives, 15: in print.

[Nagórko1987] Alicja Nagórko. 1987. Zagadnienia derywacji przymiotników [Issues in adjective derivation]. Warsaw University Press.

[Piasecki et al.2009] Maciej Piasecki, Stanisław Szpakowicz, and Bartosz Broda. 2009. A Wordnet from the Ground Up. Wrocław University of Technology Press. http://www.eecs.uottawa.ca/~szpak/pub/A_Wordnet_from_the_Ground_Up.zip.

[Piasecki et al.2012] Maciej Piasecki, Radosław Ramocki, and Marek Maziarz. 2012. Recognition of Polish Derivational Relations Based on Supervised Learning Scheme. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 916–922, Istanbul, Turkey. European Language Resources Association (ELRA).

[Przepiórkowski2004] Adam Przepiórkowski. 2004. The IPI PAN Corpus, Preliminary Version. Institute of Computer Science PAS.

[Weiss2008] Dawid Weiss. 2008. Korpus Rzeczpospolitej. http://www.cs.put.poznan.pl/dweiss/rzeczpospolita. Corpus of text from the online edition of Rzeczpospolita.

[Wikipedia2010] Wikipedia. 2010. Polish Wikipedia. https://pl.wikipedia.org, accessed in 2010.
A Language-independent Model for Introducing a New Semantic Relation
Between Adjectives and Nouns in a WordNet
Miljana Mladenović, Faculty of Mathematics, University of Belgrade
Jelena Mitrović, Faculty of Philology, University of Belgrade
Cvetana Krstev, Faculty of Philology, University of Belgrade
[email protected] [email protected] [email protected]
Abstract

The aim of this paper is to show a language-independent process of creating a new semantic relation between adjectives and nouns in wordnets. The existence of such a relation is expected to improve the detection of figurative language and sentiment analysis (SA). The proposed method uses an annotated corpus to explore the semantic knowledge contained in linguistic constructs performing as the rhetorical figure Simile. Based on the frequency of occurrence of similes in an annotated corpus, we propose a new relation, which connects the noun synset with the synset of an adjective representing that noun's specific attribute. We elaborate on adding this new relation in the case of the Serbian WordNet (SWN). The proposed method is evaluated by human judgement in order to determine the relevance of automatically selected relation items. The evaluation has shown that 84% of the automatically selected and most frequent linguistic constructs, with a frequency threshold equal to 3, were also selected by humans.

1 Introduction

Automatic detection of figurative language is a new area of interest in the field of SA that can improve the existing SA methods. Reyes and Rosso (2012) showed that the precision of classification in an SA task can be improved significantly (from 54% to 89.05% max.) when predictors detecting figurative speech are involved, compared to a set of predictors that treat the text literally. Similarly, Rentoumi et al. (2010) improved an SA method based on machine learning by integrating it with a rule-based method which detects the usage of figurative language; the integrated methods achieved better precision than the baseline.

In this paper, we want to demonstrate that a WordNet (WN) can be expanded by a new semantic relation between adjectives and nouns in a way that could allow for its usage in detecting figurative language and in existing methods of sentiment analysis. WN is used successfully for the analysis of the literal meaning of texts using SA methods (Pease et al., 2012), (Reyes and Rosso, 2012), (Rademaker et al., 2014). Resources that came out of the Princeton WordNet (PWN), such as SentiWordNet (Esuli and Sebastiani, 2006), (Baccianella et al., 2010), WordNetAffect (Strapparava and Valitutti, 2004) and others, which define the prior sentiment polarity (taken out of context) of synsets, are also being used. Still, the intensity of the sentiment polarity of the lexical representation of synsets can be reduced, increased or completely changed in a given context through the use of rhetorical figures from the group of Tropes, i.e., figures that change the meaning of the words or phrases over which the figure itself is formed. These figures can be metaphor, metonymy, irony, sarcasm, oxymoron, simile, dysphemism, euphemism, hyperbole, litotes, etc. (Mladenović and Mitrović, 2013). Analysing the usage of figurative language in the form of ironic similes, Hao and Veale (2010) noticed that they act similarly to the valence shifters (Kennedy and Inkpen, 2006) "not", "never" and "avoid" in text, because they change the polarity of sentiment words or phrases. In general, modifiers decrease, increase or change the sentiment polarity of words or phrases. Tropes work in a similar way. By definition, irony and sarcasm change the polarity, dysphemism and hyperbole increase the existing level of sentiment expressiveness, while litotes and euphemism decrease that expressiveness. Metaphor, metonymy, oxymoron and simile have a more complex mechanism, affecting both the strength and the polarity of sentiment.

2 Related work

WordNet is a dynamic, flexible structure that can be expanded in different ways and for various purposes. In certain cases, introducing morphosemantic relations helps solve problems that stem from the specificities of a language with rich morphology and derivation (Koeva et al., 2008). Introducing new semantic relations can also improve the representation of relations between synsets; e.g., Kuti et al. (2008) present a semantic relation scalar middle with which the antonymy relation of two descriptive adjective synsets is transformed into a triple gradable structure lower-upper-middle. Angioni et al. (2008) define a new relation Commonsense with which a literal in a synset is connected with the Wikipedia links in which it is described, while Maziarz et al. (2012) introduce a series of relations pertinent to adjectives; e.g., the derivational relations comparative and superlative define gradable forms of descriptive adjectives. The derivational relation similarity defines a relation between an adjective and a noun such that, based on a given adjective, the structure or form of the object described by the noun can be discovered. Similarly, the derivational relation characteristic defines a relation between an adjective and a noun where the contents or quality of an object described by the noun is known based on the adjective; e.g., based on the statement "If someone is famous, then he is characterised by fame", the relation characteristic will be set between the noun fame and the adjective famous.

The new semantic relation between nouns and adjectives in the Portuguese WordNet is described in (Marrafa et al., 2006) and (Mendes, 2006). This relation is given in the form of a pair of inverse relations a characteristic of / has as a characteristic. According to the authors, although the purpose of the relation is to mark significant characteristics of a noun expressed by an adjective (e.g. '{carnivorous} is a characteristic of {shark}'), the status of this relation in the sense of lexical knowledge is not completely clear. The authors also point out that introducing this new relation enriches a WordNet, that it can contribute to the process of determining the semantic domain of an adjective, and that it can be included in reasoning applications. Veale and Hao also suggest specific enrichment of WordNet in their papers (Veale and Hao, 2008) and (Hao and Veale, 2010). As a source for that enrichment, the authors suggest the semantic knowledge contained in language constructs of the form as ADJ as a NOUN which, in fact, are similes (e.g. "as free as a bird", "as busy as a bee"). In order to obtain examples of simile, the authors first extracted all antonymous pairs of adjectives in PWN and made a list of candidate adjectives. For each adjective ADJ from that list, a query of the form as ADJ as a * was made and sent to the Google search engine. Of the obtained results, the first 200 snippets were kept. A collection of as ADJ as a NOUN constructs was made and a disambiguation task was performed over it. In this process, one noun (peacock) can semantically be connected to many adjectives on different semantic grounds. The structure, named by the authors frame:slot:filler, consists of a noun (frame), a property of the noun (slot) and an adjective as a value of the property (filler). For one noun there can be a number of instances of such a structure. The authors point out that the average number of slot:filler constructs per noun obtained in this particular research was 8. For instance, the noun peacock contains the following set of slot:filler values: {Has feather: brilliant; Has plumage: extravagant; Has strut: proud; Has tail: elegant; Has display: colorful; Has manner: stately; Has appearance: beautiful}; therefore, the suggested enrichment of WordNet for the noun peacock alone leads to the addition of 7 relations, of which the first one is of the form '{peacock} Has feather {brilliant}'.

3 Motivation

The research described in this paper is based on the previously mentioned research results by Marrafa et al. (2006) and Mendes (2006), because we are searching for specific relations between nouns and adjectives. However, unlike the relation has as a characteristic, which connects a number of nouns {shark, cobra, orca, predator, ...} to the same adjective {carnivorous}, we consider those descriptive adjectives that are specific to a small set of nouns, or only to a single noun. In the process of generating the new relation, we propose using the rhetorical figure simile, which has a relatively high frequency of occurrence in texts written in a natural language. In that case, the relation '{peludo} is a characteristic of {abelha}',
meaning (‘{furry} is a characteristic of {bee}’),
which exists in the Portuguese WordNet, would
not be an adequate example, but the new relation
would be created based on the common rhetorical
figure simile “as busy as a bee” in which case the
relation would be ‘{busy} specific of {bee}’.
With the suggested relation specificOf/specifiedBy we can determine the nature of the semantic connection between the concepts arrow, light and rabbit, which cannot be achieved with the existing PWN relations. Namely, from the simile constructs brz kao zec "as fast as a rabbit", brz kao svetlost "as fast as light" and brz kao strela "as fast as an arrow", obtained by querying the Corpus of Contemporary Serbian, we can confirm that '{strela, svetlost, zec} specifiedBy {brz}', i.e. '{arrow, light, rabbit} specifiedBy {fast}', holds true.
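A pair of mutually inverse relations like specificOf/specifiedBy can be kept consistent by construction: asserting one direction automatically asserts the other. The sketch below is a toy illustration (the class and its storage scheme are invented, not the SWN implementation):

```python
class MiniWordNet:
    """Toy store for synset-to-synset relation triples; the relation names
    specificOf/specifiedBy follow the paper, everything else is invented."""
    INVERSE = {"specificOf": "specifiedBy", "specifiedBy": "specificOf"}

    def __init__(self):
        self.relations = set()  # (source, relation, target) triples

    def add(self, source, relation, target):
        self.relations.add((source, relation, target))
        inv = self.INVERSE.get(relation)
        if inv:  # keep the pair of mutually inverse relations in sync
            self.relations.add((target, inv, source))

    def targets(self, source, relation):
        return {t for s, r, t in self.relations if s == source and r == relation}

wn = MiniWordNet()
for noun in ("arrow", "light", "rabbit"):
    wn.add(noun, "specifiedBy", "fast")
print(wn.targets("fast", "specificOf"))  # the inverse direction comes for free
```

Querying either direction then yields the same knowledge, which is exactly what the paper needs when linking one adjective to a small set of nouns.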
On the other hand, significant research on which the work described in this paper builds is presented in the papers by Veale and Hao (2008) and (2010), regarding the development of automatic methods for extracting semantic knowledge from examples of the usage of the simile figure. We suggest the extraction of linguistic constructs of the form as ADJ as a NOUN from a corpus annotated with PoS and lemmas, which means that, in contrast to using the results of the Google search engine, the search would be faster and more precise, because in one step we would obtain the set of those potential simile figures that have only nouns positioned at the end of the observed linguistic structure. Furthermore, if we do not take into account all of
the attributes that are characteristic for a certain
noun, but only those that are used the most in everyday language (measured by the frequency of
occurrence of the corresponding figure simile in
the observed corpus) we would get the possibility
to describe the set of “noun-adjective” candidates
for expansion of the existing structure of WordNet
with one unique relation (specificOf/specifiedBy).
Introduction of a single relation would eliminate
the risk pointed out in (Veale and Hao, 2008) that
the introduction of a large number of relations expressed by the structure slot:filler would reduce
the system's ability to recognize similar properties. In the case of a single relation, for example, {frame: Has strut: proud} and {frame: Has gait: majestic} would be transformed into {frame: specifiedBy: proud} and {frame: specifiedBy: majestic}. Apart from that, by taking into account only the most frequent pairs, the described transformation would not involve all of the slot:filler structures of a certain noun, but only the most frequent one, which would, in the case of the noun peacock, result in generating only one relation '{peacock} specifiedBy {proud}', and not
all seven of them. If we introduce the frequency
threshold as a parameter, its change can affect the
number of specificOf/specifiedBy relations for the
single noun synset, as well as for the total number
of relations of that type.
4 Language-independent Model for WordNet Expansion

The procedure of expansion with the relation specificOf/specifiedBy that we are proposing will be shown on the example of the Serbian WordNet (SWN) (Krstev, 2008), but it can also be used for other wordnets. The procedure consists of the following steps:

1) From the annotated corpus of a natural language Kl, extract linguistic constructs of the form pridev kao imenica (for English, as ADJ as a NOUN) and create the set Sims such that:

Sims = {"as ADJ as a NOUN"}, sims ∈ Sims ⊂ Kl
In our case, 59 concordances of the form <as ADJ as a NOUN> were generated from the Corpus of Contemporary Serbian Language1 (Utvić, 2014), such as the following:

ri više. -Kakva je? -<Bela kao mleko>. Ona traži isto ('<White as milk>')
crnog mrežastog šala, <lakog kao pero>, smele zelene dan ('<light as a feather>')
od zatvorenika; lica <žuta kao limun>, radosno polete ('<yellow as a lemon>')
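Over a corpus annotated with PoS and lemmas, step 1) amounts to a simple window scan. The tuple representation and the one-letter tags below are illustrative assumptions, not the actual corpus format:

```python
def extract_similes(tagged_sentence):
    """Scan a sentence given as (lemma, pos) pairs and collect
    ADJ kao NOUN windows; 'A' and 'N' are an assumed tagset."""
    found = []
    for i in range(len(tagged_sentence) - 2):
        (l1, p1), (l2, _), (l3, p3) = tagged_sentence[i:i + 3]
        if p1 == "A" and l2 == "kao" and p3 == "N":
            found.append((l1, l3))  # (adjective lemma, noun lemma)
    return found

sent = [("brz", "A"), ("kao", "C"), ("zec", "N"), ("i", "C"),
        ("lak", "A"), ("kao", "C"), ("pero", "N")]
print(extract_similes(sent))  # [('brz', 'zec'), ('lak', 'pero')]
```

Because only windows ending in a noun are kept, the extraction already filters out hits where the comparison continues into a proper name or a longer phrase.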
2) Eliminate from the set Sims all elements whose adjectives are not descriptive:

SimsRedycByAdj = {sims ∈ Sims | ADJ 'is descriptive'}

as in the following examples, where the adjectives are possessive:

za taj dan. Jer reč je <ljudska kao glad>. Nema za ('<human as hunger>')
Drugog? Ljubav <majčinska kao vernost>, ljubav muško ('<motherly as loyalty>')

1 http://www.korpus.matf.bg.ac.rs/index.html/

In our case, the result was |SimsRedycByAdj| = 2030 elements.
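The filtering-and-counting core of the procedure can be sketched as counting surviving candidates and keeping those at or above a threshold k; the function name and the toy data are illustrative:

```python
from collections import Counter

def most_frequent_pairs(pairs, k):
    """Count ADJ-NOUN candidate pairs and keep those whose corpus
    frequency is at least the threshold k."""
    freq = Counter(pairs)
    return {pair: n for pair, n in freq.items() if n >= k}

# invented counts: 'brz kao zec' seen 4 times, 'lak kao pero' twice, etc.
candidates = [("brz", "zec")] * 4 + [("lak", "pero")] * 2 + [("žut", "limun")]
print(most_frequent_pairs(candidates, 3))  # {('brz', 'zec'): 4}
```

Raising k trades coverage for precision, which is exactly the parameter the paper later evaluates against human judgement.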
3) From the set SimsRedycByAdj, eliminate all elements whose nouns are proper names or have been replaced by acronyms (third example):

SimsRedycByNoun = {sims ∈ SimsRedycByAdj | NOUN 'is a common noun'}

as in the following examples:

Pljevlja bi bila bogata i <bleštava kao Las> Vegas ('<glistening as Las> Vegas')
da bude slavna i <bogata kao Monika> Seleš. ('<rich as Monika> Seleš')
zatvoru u Beogradu, <opštepoznatom kao CZ>, naći u ('<generally known as CZ>')

In our case, the result was |SimsRedycByNoun| = 1059.

4) From the set SimsRedycByNoun, generate a subset of the most frequent elements:

SimsMostFreq = {sims ∈ SimsRedycByNoun | freq(sims) ≥ k}

where k is the minimal frequency of occurrence of as ADJ as a NOUN in the observed corpus Kl. In our case, for the value k = 1, the total number of ADJ-NOUN pairs, candidates for wordnet expansion, is |SimsMostFreq| = 1059.

5) From the set SimsMostFreq, create a text file Adjective As Noun with ADJ-NOUN pairs over which an algorithm for wordnet expansion is executed (see Algorithm).

Algorithm
Input: Adjective As Noun text file
Output: 1. a pair of mutually inverse WordNet semantic relations (specificOf/specifiedBy) for each input adjective-noun pair
        2. a file containing adjectives and all their senses
        3. a file containing nouns and all their senses

foreach adjective-noun pair in adjective-noun pairs
  if ((adjective exists in Wordnet.adjective.literals)
      and (noun exists in Wordnet.noun.literals)) {
    if ((Wordnet.senses(adjective).Count == 1)
        and (Wordnet.senses(noun).Count == 1)
        and (Wordnet.sense(adjective).FirstSense)
        and (Wordnet.sense(noun).FirstSense)) {
      Create Relation(specificOf, adjective, noun);
      Create Relation(specifiedBy, noun, adjective);
    }
    else {
      foreach (sense in Wordnet.senses(adjective)) {
        add to adjective senses(adjective, sense, synsetId) }
      foreach (sense in Wordnet.senses(noun)) {
        add to noun senses(noun, sense, synsetId) }
    }
  }

The presented algorithm sequentially processes the input candidate ADJ-NOUN pairs. For each pair, it checks whether the given wordnet contains adjective and noun synsets lexicalized by literals of the observed adjective and noun. After that, the relation specificOf/specifiedBy is automatically created between the synsets of an adjective and a noun under a restriction: both of them have to be lexicalized by only one literal whose sense is the first sense. The first sense of a literal is considered to be the sense of a word in a certain language which is defined by a relevant dictionary or a corpus as the most commonly used one. The intuition behind this restriction is related to minimal pairing errors in the case when there are no synonyms in the observed synsets and the sense of the literals is the first sense. In that case, the possibility of error exists only if at least one of the synsets is not correctly complemented with synonyms and there are no correctly assigned senses, or the desired sense is not the first one and does not exist. In this regard, since the source of errors is known in advance, it is possible to check for it before applying the algorithm. On the other hand, if at least one of the synsets has more than one synonym, or has one whose sense is not the first one, the new relation is not created and the adjective-noun pair is separated into two independent files: the file containing adjectives and all their senses from a wordnet (named adjective senses) and the file containing nouns and all their senses (named noun senses). These resources are later used in a web application for the manual pairing of adjectives and nouns and their connection through the desired relation. Finally, pairs for which it is determined at the very beginning of the process that they do not exist in the form of literals in a given wordnet become candidates for later regular wordnet expansion, by adding new synsets.

Prior to the implementation of the given algorithm, we examined the SWN in order to determine its structure in terms of the previously described restrictions. SWN has more than 22,000 synsets and contains 1,660 synsets of adjectives with one literal, out of which in 1,452 synsets the sense
of that literal is the first sense, while the number of noun synsets with one literal, where the sense of that literal is the first sense, is 15,035. By implementing the suggested algorithm, out of a total of 1059 ADJ-NOUN pairs, 69 pairs were found whose members both have one sense, that sense being the first sense. In SWN there are 302 ADJ-NOUN pairs in which there is more than one sense or the sense is not the first one. The remaining 688 pairs pertain to those cases where at least one member of the ADJ-NOUN pair does not exist as a literal in SWN. Therefore, the proposed method produces 372 candidates that can be connected in SWN by the relation specificOf/specifiedBy after approval. For the 302 ADJ-NOUN pairs present in SWN, but with many senses or with one sense that is not the first sense, a web page was created in the SWNE2 application (Mladenović et al., 2014) which allows users to input adjectives, thus generating a column with synsets lexicalized by the given adjective, while inputting nouns generates a second column, with synsets lexicalized by the noun at hand. New relations can be generated by looking for the appropriate synsets and senses in the adjective senses and noun senses files, as well as by choosing the desired relation from the third column.
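The expansion algorithm of Section 4 can be sketched in a few dozen lines. The wordnet accessor below is a hypothetical stub (the Sense/StubWN classes and their interface are invented for illustration, not the SWNE implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sense:
    literal: str
    is_first_sense: bool

class StubWN:
    """Minimal stand-in for a wordnet accessor; entirely invented."""
    def __init__(self, entries):
        self.entries = entries  # (literal, pos) -> list of Sense objects

    def has_literal(self, literal, pos):
        return (literal, pos) in self.entries

    def senses(self, literal, pos):
        return self.entries[(literal, pos)]

def expand_wordnet(wn, adj_noun_pairs):
    """Link unambiguous first-sense literals with the inverse pair
    specificOf/specifiedBy, export ambiguous ones for manual pairing,
    and collect pairs whose literals are missing from the wordnet."""
    relations, adjective_senses, noun_senses, missing = [], [], [], []
    for adj, noun in adj_noun_pairs:
        if not (wn.has_literal(adj, "ADJ") and wn.has_literal(noun, "N")):
            missing.append((adj, noun))  # candidates for new synsets
            continue
        a, n = wn.senses(adj, "ADJ"), wn.senses(noun, "N")
        if len(a) == 1 and len(n) == 1 and a[0].is_first_sense and n[0].is_first_sense:
            relations.append((adj, "specificOf", noun))
            relations.append((noun, "specifiedBy", adj))
        else:  # ambiguous: export every sense for the web application
            adjective_senses += [(adj, s) for s in a]
            noun_senses += [(noun, s) for s in n]
    return relations, adjective_senses, noun_senses, missing

wn = StubWN({
    ("busy", "ADJ"): [Sense("busy", True)],
    ("bee", "N"): [Sense("bee", True)],
    ("fast", "ADJ"): [Sense("fast", True), Sense("fast", False)],
    ("light", "N"): [Sense("light", True)],
})
rels, adj_s, noun_s, missing = expand_wordnet(
    wn, [("busy", "bee"), ("fast", "light"), ("sly", "fox")])
print(rels)     # the unambiguous pair yields both inverse relations
print(missing)  # pairs absent from the wordnet
```

The three return channels mirror the paper's three outcomes: 69 automatically linked pairs, 302 exported for manual pairing, and 688 deferred to regular synset addition.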
5 Evaluation

In order to assess whether the frequency of occurrence is a valid parameter for finding ADJ-NOUN pairs which are parts of similes used in everyday life, we used an online survey carried out through Google Forms. We compared the list (marked here as List1), which was automatically generated from the Corpus, filtered using steps 1-4 explained in Section 4, and ordered in decreasing order of pair frequency, with the list representing the subset of List1 containing those pairs that were marked positively in the anonymized survey (marked as List2); we wanted to assess which frequency threshold value entails the results obtained in the survey.

The survey itself was conducted over a period of 5 days, in which a total of 4 forms were published successively. Anonymous users of the social network Facebook were asked to answer each question generated on the basis of the ADJ-NOUN pairs from List1, with the goal of finding out whether "in everyday language we can say that someone/something is ADJ as NOUN". The answers were Yes or No, and answering all questions in a form was mandatory. Table 1 gives an overview of the distribution of questions in each form, as well as the number of participants who answered them.

Google form    Number of questions per form    Participants per form
1              30                              46
2              42                              138
3              41                              150
4              41                              100
Total          154                             434

Table 1: Distribution of questions and participants per form.

A PhD student at the Faculty of Philology, as a linguistic expert, manually selected 154 items from List1 for which it could be presumed with some degree of certainty that they may be used in everyday language; namely, we retrieved a lot of noisy data from the Corpus, and some items stopped carrying meaning when taken out of context. Linguistic constructs chosen from List1 included čist kao apoteka "clean as a pharmacy"; čist kao suza "pure as a teardrop"; hladan kao led "cold as ice"; lak kao pero "as light as a feather"; veran kao pas "as faithful as a dog", whereas constructs such as dobar kao oblik "good as shape"; dobar kao pisac "good as a writer"; poznat kao vodja "famous as a leader" were not used, as they represented occasional occurrences. As we could not predict how willing to help the potential participants would be, we were aiming for at least 30 participants. Also, the first form had fewer constructs than the rest (30), as we wanted to test the method and to see what would be an optimal number of fields in a form. We obviously wanted to test as many constructs as possible, but also had to keep the forms interesting and easy to fill in. The rest of the forms were balanced unit-wise. The number of participants was not pre-chosen; it depended on the turnout on a particular day.

The problem with this kind of participant involvement, and with posts on Facebook in general, is that the novelty wears off fast: if some post is very popular today, it might not be popular at all tomorrow. The call for participation in this project did receive a lot of attention in the first few hours

2 http://resursi.mmiljana.com/
acceptable level of reliability, the works of (Hayes
and Krippendorff, 2007), (Lombard et al., 2002)
and (Maggetti, 2013) show that agreements whose
values are α ≥ 0.667 are reliable, and that agreements whose values are α ≥ 0.8 can be considered very reliable. The results we obtained using
the Kalpha test over the set of 5 annotators for
each of the subsets of the forms is given in Table 2. Provided that for the first two forms and a
after being posted on Facebook. The privacy for
the post was set to Public, which meant that everyone could participate and share the link leading
to the Google Forms. Due to the fact that people
did share the link, and some of their friends did
the same thing, we could see that the forms were
being filled in quickly and that our research was
getting a lot of attention. In the following three
days, we posted another three forms on the same
URL address (precisely because the post received
a lot of attention and shares) and we were able to
get enough responses in order to get valid results.
On the fourth day, the novelty wore off and we
were getting significantly fewer responses, which
only proved our assumption that we had to move
fast and to post new forms every day.
First, we measured the contribution of participants and determined the set of those participants
whose results were to be taken into account as relevant, on the basis that there was no substantial
difference between arithmetic means of their answers. In order to measure the participants’ contribution we generated 7 subsets of questions and
answers where each set had less than 30 questions (units) using four spreadsheets containing
participants’ answers, as it is shown in Table 2
(each Google Form, except the first one, was divided into two parts). All 7 units were converted
into matrices where each row represented answers
of each participant and each column represented
one question in the form <adjective>as<noun>.
Content of each cell of the matrix had the value 1
if the participant marked a certain expression with
“Yes” and the value 0 if the participant marked that
expression with “No”. Rows of the matrix were
compared against each other with a paired t-test
in order to determine that there was no substantial difference between arithmetic means of participants’ answers. From each set we selected,
among all participants belonging to that set, five
participants whose difference in the paired t-test
was the slightest.
Form
set
No of
participants
No of
questions
Kalpha
value
1
2a
2b
3a
3b
4a
4b
Total
5
5
5
5
5
5
5
30
21
21
21
20
21
19
154
α = 0.757∗
α = 0.713∗
α = 0.698∗
α = 0.688∗
α = 0.484
α = 0.434
α = 0.375
No of
quest.
annot.
with Yes
16
17
15
5
53
Table 2: Inter-annotator agreement over Google
Forms and number of items which belong to reliable forms and were annotated with “Yes”.
After that, inter-annotator (participant) agreement was evaluated using the Krippendorff α coefficient (Kalpha). When the value of α is in the [0, 1] interval, it represents the agreement level, ranging from complete disagreement (α = 0) to complete agreement (α = 1). The α measure can also take negative values, down to -1, when two kinds of error are present: sampling error and systematic disagreement. For the form sets in which the value of Kalpha was such that the annotator agreement could be considered reliable, the following rule was applied to all of the constructs in those forms: if a majority of annotators (3 or more out of 5) annotated a certain question with "Yes", that item was taken as an element of List2'. Thus, we obtained 53 items in total; their distribution over form sets is given in the last column of Table 2. Furthermore, we want to draw attention to a phenomenon which we did not study in depth but which is visible in Table 2: the decline of the Kalpha coefficient over the same questionnaire structure, related to the time period in which the participants filled in the Google Forms.

Finally, we wanted to assess how much the change of the frequency threshold influenced the relevance of the automatically selected ADJ-NOUN pairs, measured on the results obtained through the surveys. The list List1 was reduced so that it contained forms 1, 2a, 2b and 3a, which amounted to 93 elements, that is to say, all ADJ-NOUN pairs for which evaluation by the participants proved relevant. That list was named List1'. In contrast, the list named List2' contained only those ADJ-NOUN pairs from List1' that were marked positively. First, we wanted to set the frequency threshold to k = 4, which meant that the algorithm was
used to process only those pairs whose frequency
of occurrence in the Corpus was k ≥ 4. There
were 23 such pairs in the list List1’. Out of
those 23, 19 pairs were present in the list List2’,
which meant that the participants in the survey
did not recognize 4 pairs that were recognized
by the algorithm. The entire statistics showing
the percentage of pairs we obtained using the
algorithm as well as human judgement is given
in Table 3, and the graph showing the relation
between human selection, as opposed to automatic
selection, when the frequency threshold is being
changed, is given in Figure 1.
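The Krippendorff α computation used for Table 2 can be sketched as follows. This is a minimal implementation for nominal data (coincidence-matrix formulation), not the authors' code.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of per-item rating lists (one value per annotator).
    """
    # Build the coincidence matrix from items rated by >= 2 annotators.
    o = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # items with fewer than two ratings carry no pairs
        for c, k in permutations(ratings, 2):
            o[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    # Observed and expected disagreement (nominal delta: 1 iff c != k).
    d_o = sum(w for (c, k), w in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1 - d_o / d_e

# Five annotators in perfect agreement on three items -> alpha = 1
print(krippendorff_alpha_nominal([[1] * 5, [0] * 5, [1] * 5]))  # 1.0
```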
Frequency threshold | by algorithm | by humans | humans / algorithm
k=1                 | 93           | 53        | 57%
k=2                 | 44           | 32        | 73%
k=3                 | 32           | 27        | 84%
k=4                 | 23           | 19        | 83%

Table 3: Relationship of manually and automatically selected pairs depending on the frequency threshold.

Figure 1: Relationship of selected pairs obtained with the survey method compared to the ones obtained with the method of the most frequent occurrence for different frequency thresholds.

Figure 1 shows how, on the sample of 93 ADJ-NOUN pairs contained in the list List1' (Kalpha reliable), the percentage of manually selected pairs changes in the subset obtained by choosing only those pairs from the same list whose frequency is equal to or higher than the set threshold, as the threshold changes. The achieved result of 84% gives us the manually measured accuracy of the Algorithm for automatic WordNet expansion with the frequency threshold of k=3.

6 Conclusions

In this work, we presented a general method for automatically expanding a WordNet with the semantic relation specificOf/specifiedBy, produced by extracting the semantic knowledge contained in the relation of comparison from the annotated corpus. The results of the proposed method of selecting the most frequent ADJ-NOUN pairs, extracted from the described linguistic construct "as ADJ as a NOUN" with the frequency threshold k ≥ 3, matched in 84% of cases the results obtained from anonymous evaluators on identical sets of ADJ-NOUN pairs. The Algorithm for automatic WordNet expansion can be improved in step 5) by including a word sense disambiguation (WSD) method. That would enable literals with more than one sense to be used in the automatic adding of the new relation. In future work we plan to implement WSD and to use other linguistic constructs which indicate simile.

Using the relation specificOf/specifiedBy between a noun and its specific adjective, the hidden meaning of another word or a phrase can be detected. For example, in sentences such as "My sister is like a bee" or "My sister is a bee", based on the relation specificOf/specifiedBy between the noun bee and its specific adjective busy, the sentiment-neutral noun sister can take the same sentiment polarity as the adjective busy, i.e. positive polarity. If we say "My sister is like a lizard", by the same principle the noun changes its sentiment polarity to negative, since the noun lizard is connected by the relation specifiedBy with the adjective lazy. In the example "My sister is as fast as a turtle", the indirect connection of the antonymous pair fast-slow in the construct "as fast as a turtle" indicates the rhetorical figure irony; therefore, in the given context, the noun sister can have negative sentiment polarity. In our future work, we plan to analyse whether the process of sentiment classification can be improved by changing the default sentiment polarity of n-gram predictors, depending on the figurative context detected in the previously described way.
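The polarity transfer described above can be sketched with a hypothetical mini-lexicon. The nouns and adjectives are the paper's examples, but the lookup structure and function are illustrative, not the authors' implementation.

```python
# Hypothetical mini-lexicon of specifiedBy links (noun -> its specific
# adjective) and adjective polarities; illustration only.
SPECIFIED_BY = {"bee": "busy", "lizard": "lazy"}
POLARITY = {"busy": "positive", "lazy": "negative"}

def simile_polarity(noun):
    """Sentiment polarity a simile 'X is like a <noun>' transfers to X."""
    adjective = SPECIFIED_BY.get(noun)
    return POLARITY.get(adjective, "neutral")

print(simile_polarity("bee"))     # positive
print(simile_polarity("lizard"))  # negative
```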
Acknowledgments
This research was partly supported by the Serbian
Ministry of Education and Science under the grant
47003.
References
Marek Maziarz, Stanisław Szpakowicz, and Maciej Piasecki. 2012. Semantic Relations among Adjectives
in Polish WordNet 2.0: A New Relation Set, Discussion and Evaluation. Cognitive Studies / Études
Cognitives, 12:149–179.
Manuela Angioni, Roberto Demontis, Massimo Deriu,
and Franco Tuveri. 2008. Semanticnet: a WordNet-based Tool for the Navigation of Semantic Information. Proceedings of the 4th International Global
Wordnet Conference (GWC2008), 21–34.
Sara Mendes. 2006. Adjectives in WordNet. Proceedings of the 3rd International Global Wordnet Conference (GWC2006), 225–230.
Stefano Baccianella, Andrea Esuli and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced
Lexical Resource for Sentiment Analysis and Opinion Mining. Proceedings of the 7th Language Resources and Evaluation Conference (LREC 2010),
2200–2204.
Miljana Mladenović and Jelena Mitrović. 2013. Ontology of rhetorical figures for Serbian. LNAI,
Springer, 8082:386–393.
Miljana Mladenović, Jelena Mitrović and Cvetana
Krstev. 2014. Developing and Maintaining a
WordNet: Procedures and Tools. Proceedings of
the 7th International Global Wordnet Conference
(GWC2014), 55–62.
Andrea Esuli and Fabrizio Sebastiani. 2006. Sentiwordnet: A Publicly Available Lexical Resource
for Opinion Mining. Proceedings of the 5th Conference on Language Resources and Evaluation (LREC
2006), 417–422.
Adam Pease, John Li, and Karen Nomorosa 2012.
WordNet and SUMO for Sentiment Analysis. Proceedings of the 6th International Global Wordnet
Conference (GWC2012).
Yanfen Hao and Tony Veale. 2010. An Ironic Fist in
a Velvet Glove: Creative Mis-Representation in the
Construction of Ironic Similes. Journal Minds and
Machines, 20(4):635–650.
Alexandre Rademaker, Valeria de Paiva, Gerard de
Melo, Livy Maria Real Coelho, and Maira Gatti.
2014. OpenWordNet-PT: A Project Report. Proceedings of the 7th Global WordNet Conference
(GWC2014), 383–390.
Andrew F. Hayes and Klaus Krippendorff. 2007. Answering the Call for a Standard Reliability Measure for Coding Data. Communication Methods and
Measures, 1(1):77–89.
Alistair Kennedy and Diana Inkpen. 2006. Sentiment Classification of Movie Reviews Using Contextual Valence Shifters. Computational Intelligence, 22(2):110–125.
Vassiliki Rentoumi, Stefanos Petrakis, Manfred Klenner, George A. Vouros, and Vangelis Karkaletsis.
2010. United we stand - improving sentiment analysis by joining machine learning and rule based methods. Proceedings of the 7th Language Resources and
Evaluation Conference (LREC 2010).
Svetla Koeva, Cvetana Krstev, and Duško Vitas. 2008.
Morpho-semantic Relations in WordNet. A Case
Study for two Slavic Languages. Proceedings of
the 4th International Global Wordnet Conference
(GWC2008), 239–253.
Antonio Reyes and Paolo Rosso. 2012. Making objective decisions from subjective data: Detecting irony
in customer reviews. Decision Support Systems,
53(4):754–760.
Cvetana Krstev. 2008. Processing of Serbian - Automata, Texts and Electronic dictionaries. Faculty of
Philology, University of Belgrade, Belgrade.
Carlo Strapparava and Alessandro Valitutti. 2004.
Wordnet-affect: An Affective Extension of Wordnet.
Proceedings of the 4th International Conference on
Language Resources and Evaluation (LREC 2004),
1083–1086.
Judit Kuti, Károly Varasdi, Ágnes Gyarmati, and Péter
Vajda. 2008. Language Independent and Language
Dependent Innovations in the Hungarian WordNet.
Proceedings of the 4th International Global Wordnet
Conference (GWC2008), 254–269.
Miloš Utvić. 2014. Liste učestanosti Korpusa savremenog srpskog jezika [Corpus of Contemporary
Serbian – Frequency Lists]. Naučni sastanak slavista u Vukove dane, 241–262. Faculty of Philology,
University of Belgrade, Belgrade.
Matthew Lombard, Jennifer Snyder-Duch and Cheryl
Campanella Bracken. 2002. Content analysis in
mass communication: Assessment and reporting of
intercoder reliability. Human Communication Research, 28(4):587–604.
Tony Veale and Yanfen Hao. 2008. Enriching WordNet with folk knowledge and stereotypes. Proceedings of the 4th International Global Wordnet Conference (GWC2008), 453–461.
Martino Maggetti. 2013. Regulation in Practice: The
de facto Independence of Regulatory Agencies. Swiss Political Science Review, 19(1):111–113.
Palmira Marrafa, Raquel Amaro, Rui Pedro Chaves,
Susana Lourosa, Catarina Martins, and Sara
Mendes. 2006. WordNet.PT new directions. Proceedings of the 3rd International Global Wordnet
Conference (GWC2006), 319–321.
Identifying and Exploiting Definitions in Wordnet Bahasa
David Moeljadi, Francis Bond
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Singapore
[email protected], [email protected]
Abstract

This paper describes our attempts to add Indonesian definitions to synsets in the Wordnet Bahasa (Nurril Hirfana Mohamed Noor et al., 2011; Bond et al., 2014), to extract semantic relations between lemmas and definitions for nouns and verbs, such as synonym, hyponym, hypernym and instance hypernym, and to generally improve Wordnet. The original, somewhat noisy, definitions for Indonesian came from the Asian Wordnet project (Riza et al., 2010). The basic method of extracting the relations is based on Bond et al. (2004). Before the relations could be extracted, the definitions were cleaned up and tokenized. We found that the definitions cannot be completely cleaned up because of many misspellings and bad translations. However, we could identify four semantic relations in 57.10% of noun and verb definitions. For the remaining 42.90%, we propose to add 149 new Indonesian lemmas and make some improvements to Wordnet Bahasa and Wordnet in general.

1 Introduction

A lexical database with comprehensive data about words, definitions, and examples is very useful in language research. In Princeton Wordnet, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets) which are interlinked through a number of semantic relations (Fellbaum, 1998; Fellbaum, 2005). Since its creation, many other wordnets in different languages have been built based on Princeton Wordnet (PWN) (Bond and Paik, 2012; Bond and Foster, 2013). One of them, Wordnet Bahasa, is built as a lexical database of the Malay language. At present, it consists of two language variants: Indonesian and Standard Malay. It combines data from several lexical resources: the French-English-Malay dictionary (FEM), the KAmus Melayu-Inggeris (KAMI), and wordnets for English, French and Chinese (Nurril Hirfana Mohamed Noor et al., 2011, p. 258).

We added Indonesian definitions from the Asian Wordnet project (Riza et al., 2010) to Wordnet Bahasa. To the best of our knowledge, the Asian Wordnet project is the only project that translated the English definitions of some synsets in PWN into Indonesian. However, the definitions were crowd-sourced and had little quality control, so not all of the 14,190 definitions could be directly transferred. Many of the definitions had problems and needed to be cleaned up. The definitions for nouns and verbs which had been cleaned up were exploited to extract relations, such as synonym, hyponym, hypernym and instance hypernym, between lemmas and definitions. The method of extracting these relations was used in Bond et al. (2004) to build an ontology. We used Python (3.4, Python Software Foundation) and the Natural Language Toolkit (NLTK) (Bird et al., 2009) to process the data.

This paper is organized as follows: Section 2 describes the process of cleaning up the definitions, Section 3 explains the process of extracting hypernyms and other relations from the definitions, Section 4 presents the results and discussion, and Section 5 concludes.
2 Cleaning up the definitions
As mentioned in Section 1 above, the definitions we had available were not clean. Many infelicities were found, such as misspellings, definitions using abbreviations, typos, synsets having more than one similar definition, definitions written in English, improper use of hyphens, and lemmas written as the first word in the definitions. Each error is illustrated in the following subsections.
2.1 Correcting and deleting definitions

Words in the definitions which are not spelled correctly according to standard Indonesian, such as dimana "where" and lain lain "others", as well as typos such as enerji "energy" and bagain "part", were semi-automatically corrected. Since the typos are many and scattered throughout the file, we may have missed some. Abbreviations, most of them prepositions, such as dgn "with" and utk "for", were also normalized to their full forms (see Table 1).

Before correction | After correction | Meaning     | Number of hits
(double space)    | (single space)   |             | 416
dimana            | di mana          | "where"     | 313
dengans           | dengan           | "with"      | 121
dgn               | dengan           | "with"      | 93
utk               | untuk            | "for"       | 52
kpd               | kepada           | "to"        | 25
pd                | pada             | "at"        | 23
lain lain         | lain-lain        | "others"    | 21
enerji            | energi           | "energy"    | 12
bagain            | bagian           | "part"      | 12
spt               | seperti          | "like"      | 12
dr                | dari             | "from"      | 10
thdp              | terhadap         | "toward"    | 10
sst               | sesuatu          | "something" | 3

Table 1: Some examples of misspellings, abbreviations and typos, before and after the correction

Definitions which are obviously written in English or are just names were deleted (see Table 2).

Synset     | Definition
03491491-n | Hanging Gardens of Babylon
09164241-n | ho chi minh city
10875910-n | George Herbert Walker Bush
11252392-n | rain in the face
13615557-n | a unit of measure for capacity officially adopted in the British Imperial System

Table 2: Some examples of deleted definitions

Some definitions had hyphens separating the words. In this case, the hyphens were deleted (see Table 3).

Synset: 14118423-n 'severe diabetes mellitus with an early onset'
Before correction: diabetes-mellitus-tergantung-insulin "diabetes mellitus depending on insulin"
After correction: diabetes mellitus tergantung insulin

Table 3: An example of a definition having hyphens, before and after the correction

For definitions in which the first word is the same as the lemma, with the real definition placed between brackets afterwards, the first word and the brackets were deleted (see Table 4).

Synset: 09543673-n 'an evil spirit or ghost'
Before correction: Ghoul (roh jahat atau hantu) "Ghoul (an evil spirit or ghost)"
After correction: roh jahat atau hantu "an evil spirit or ghost"

Table 4: An example of a definition with lemma as the first word, before and after the correction

2.2 Choosing definitions

Some synsets have two or more different definitions, as shown in Table 5. The longest one, which includes the other definitions, is assumed to be the correct one and is automatically selected as the best definition.

Synset: 07904637-n 'gin flavored with sloes (fruit of the blackthorn)'
Before cleaning up: buah dari semak "fruit of the blackthorn"; gin yang diberi rasa sloea "gin flavored with sloes"; gin yang diberi rasa sloea (buah dari semak) "gin flavored with sloes (fruit of the blackthorn)"
After cleaning up: gin yang diberi rasa sloea (buah dari semak)

Table 5: An example of a synset with many parts of definition, before and after the cleaning up

However, if the definitions are all completely different and one of them was considered good based on the English and Japanese definitions, that one was chosen as the correct one (see Table 6). This manual checking was done by the first author, who has a good command of Indonesian, English, and Japanese.

If we found no satisfying definition after checking and comparing with the English and Japanese definitions, one or two of the words in the definitions were manually corrected (see Table 7).

After the cleaning up process, we made the Indonesian definitions available in the Open Multilingual Wordnet (1.2) hosted by Nanyang Technological University in Singapore (http://compling.hss.ntu.edu.sg/omw/). Figure 1 shows a screenshot of synset 06254371-n 'heliogram' with its Indonesian definition.
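The normalization step described in this section can be sketched as follows. The substitution pairs come from Table 1 (abridged); the function itself is a hypothetical illustration, not the authors' script.

```python
import re

# Substitutions from Table 1 (abridged); whole-word matches only, so
# e.g. "dr" occurring inside another word is left alone.
SUBS = {"dgn": "dengan", "utk": "untuk", "kpd": "kepada", "pd": "pada",
        "dimana": "di mana", "lain lain": "lain-lain",
        "enerji": "energi", "bagain": "bagian"}

def normalize(definition: str) -> str:
    # Collapse double spaces to single spaces.
    text = re.sub(r"\s+", " ", definition).strip()
    # Replace misspellings and abbreviations by their full forms.
    for wrong, right in SUBS.items():
        text = re.sub(r"\b%s\b" % re.escape(wrong), right, text)
    return text

print(normalize("tempat  dimana enerji disimpan utk digunakan"))
# -> "tempat di mana energi disimpan untuk digunakan"
```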
Figure 1: A screenshot of synset 06254371-n 'heliogram'

Synset: 01711910-a 'causing a sharply painful or stinging sensation'
Before correction: kedinginannya menggigit ke tulang "the coldness bites to bones"; kedinginannya menusuk ke tulang "the coldness stings to bones"; sejuk hingga menggigit ke tulang "cool biting to bones"; sejuk hingga menusuk ke tulang "cool stinging to bones"
After correction: sejuk hingga menusuk ke tulang "cool stinging to bones"

Table 6: An example of a synset having many definitions, before and after the correction

Synset: 00731471-a 'supported by both sides'
Before correction: didukung oleh dua negara "supported by both countries"; didukung oleh dua partai "supported by both parties"
After correction: didukung oleh dua pihak "supported by both sides"

Table 7: An example of a synset having two definitions, before and after the correction

3 Extracting relations from the definitions

Unlike Bond et al. (2004), who parsed the definition sentences using a grammar before extracting hypernyms and other relations, we simply used regular expressions. Indonesian has a strong tendency to be head-initial (Sneddon et al., 2010, pp. 160-162). In a noun phrase with an adjective, a demonstrative or a relative clause, the head noun precedes the adjective, the demonstrative or the relative clause. Numerals and classifiers typically precede the head noun (Alwi et al., 2014, pp. 251-255).

Example (1) shows the Indonesian definition of synset 09500625-n 'Pegasus', the head of which is preceded by a numeral prefix se- "one" and a classifier ekor (lit. "tail") and followed by an attributive verb bersayap (lit. "having wings") and a prepositional phrase.

(1) seekor kuda bersayap dalam mitologi Yunani
    one-CL horse winged in mythology Greece
    "a winged horse in Greek mythology"

Example (2) contains a part of the Indonesian definition of synset 05316175-n 'ocular muscle'. Its head otot-otot "muscles" is in the plural (reduplicated) form, preceded by satu dari "one of" and followed by the adjective kecil "small".

(2) satu dari otot-otot kecil pada mata...
    one of muscle-RED small at eye
    "one of the small muscles of the eye"
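This regular-expression approach to finding the head can be sketched as follows. The stop-word list is an abridged, hypothetical version of the lists described in Section 3.1, and the function name is ours, not the authors' code.

```python
import re

# Hypothetical, abridged lists of words removed from the start of a
# definition (numerals, determiners, the relativizer, prepositions,
# other stop words); the paper's full lists are in Section 3.1.
SKIP = {
    "satu", "tiga", "5",                       # numerals
    "setiap", "sejenis", "semacam", "suatu",   # determiners
    "sebuah", "seorang", "seekor", "beberapa",
    "yang",                                    # relativizer
    "untuk", "dari", "dalam",                  # prepositions
    "seperti", "tentang", "biasanya",          # other stop words
}

def genus_candidate(definition: str) -> str:
    """Return the first lexical word (potential genus term) of a definition."""
    # Punctuation dividing two words is replaced by a space.
    text = re.sub(r"[/;,]", " ", definition.lower())
    tokens = text.split()
    # Drop leading skip words; sequences like "satu dari" disappear
    # token by token.
    while tokens and tokens[0] in SKIP:
        tokens.pop(0)
    if not tokens:
        return ""
    head = tokens[0]
    # Reduplicated plural -> singular, e.g. otot-otot -> otot.
    m = re.fullmatch(r"(\w+)-\1", head)
    return m.group(1) if m else head

print(genus_candidate("seekor kuda bersayap dalam mitologi Yunani"))  # kuda
print(genus_candidate("satu dari otot-otot kecil pada mata"))         # otot
```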
We assume that, after modifying the definitions, relations between lemmas and definitions can be extracted from the first lexical word (i.e. the head) in the definitions.

3.1 Modifying the definitions

For each definition for nouns and verbs, we removed the following words at the beginning:
(i) words which are written between brackets, such as (Ilmu komputer) "(Computer science)", relating to domain
(ii) numerals, such as satu "one", tiga "three", and 5 "five"
(iii) determiners, such as setiap "every", sejenis "a kind of", semacam "a sort of", sembarang "any kind of", salah satu "one of", suatu "a (for thing)", sebuah "a (for thing)", seorang "a (for person)", seekor "a (for animal)", selembar "a piece of", sekelompok "a group of", beberapa "some", berbagai "various", and segala "all"
(iv) the relativizer yang "which"
(v) prepositions, such as untuk "for", dari "of", and dalam "in"
(vi) other stop words, such as seperti "like", tentang "about", termasuk "including", and biasanya "usually"

We also changed the plural (reduplicated) form of the head to its singular (non-reduplicated) form; for example, otot-otot "muscles" was changed to otot "muscle" and daun-daunan "foliage, a cluster of leaves" was changed to daun "leaf". Punctuation marks such as slashes (/), semicolons (;), and commas (,) dividing two words were replaced by a space. After we made these changes, the first word in the definition was taken as a potential genus term.

3.2 Extracting relations

The first step was to check whether each first word of the definitions is in Wordnet or not. If it is not in Wordnet, we checked whether it is in the Kamus Besar Bahasa Indonesia (KBBI) "The Great Dictionary of the Indonesian Language of the Language Center". KBBI is published by the language institute which provides support for the standardization and propagation of Indonesian. Its third edition has been made available to the public at an official site (http://badanbahasa.kemdikbud.go.id/kbbi/) (Alwi et al., 2008).

The next step was to check whether the lemma synset is the same as the synset of the first word in the definition. This allows us to identify when the same word is used to define the lemma. Besides synonyms, hyponyms can also be employed to define the lemma. In order to confirm this, the lemma synset was compared to the hyponyms of the first word in the definition.

The next important step was to check whether the hypernym is used to define the lemma, by comparing the hypernyms and instance hypernyms of the lemma synset with the synsets of the first word in the definition. If a lemma does not have any hypernym in Wordnet, we checked whether it has an instance hypernym. Finally, lemmas having neither hypernyms nor instance hypernyms were checked by hand.

4 Results and discussion

The definition file, which originally had 14,190 lines of definitions, was cleaned up and 1,522 definitions (10.7%) were deleted. The remaining 12,668 definitions consist of 10,549 definitions for nouns, 1,663 definitions for adjectives, 409 definitions for verbs, and 47 definitions for adverbs. Although these definitions are considered quite clean, they may still contain small errors, as mentioned in Section 2.1. Since adjectives and adverbs do not have relations such as hypernym in Wordnet, we only examined nouns and verbs. Out of 10,958 definitions for nouns and verbs, we could extract four relations from 6,257 definitions (57.10%), as shown in Table 8. The remaining 4,701 definitions (42.90%) have problems, such as words which could not be found in Wordnet and lemmas without explicit relations, as shown in Table 9.

Most of the relations we extracted (95.89%) are hypernym and instance hypernym. The remainder are synonym and hyponym, as shown in Table 8 for synsets 00004475-n and 00029677-n. Synset 00004475-n has six Indonesian lemmas. One of these lemmas, i.e. makhluk "being", is used as the head of its definition, and thus we regard the lemma as synonymous with the definition. Synset 00029677-n has proses "process" as one of its lemmas, which is the hypernym of the head of the definition fenomena "phenomenon".

Relation          | Number of synsets | Example synset       | Example definition
Hypernym          | 5,451             | 00021939-n artifact  | suatu objek buatan manusia "a man-made object"
Instance hypernym | 549               | 02956500-n Capitol   | gedung DPR di AS "the government building in the United States"
Synonym           | 252               | 00004475-n organism  | makhluk hidup yang dapat mengembangkan kemampuan bertindak independen "a living thing that can develop the ability to act independently"
Hyponym           | 5                 | 00029677-n process   | sebuah fenomena yang berkelanjutan "a sustained phenomenon"
Total             | 6,257             |                      |

Table 8: Relations extracted from lemmas and definitions

Out of the 4,701 definitions for which we could not find the relations, most (83.88%) have hypernyms which are different from the first word in the definitions. We found five patterns for this problem (see Table 9):

1. The genus term is correct but Wordnet Bahasa does not have the right synset for the lemma. For example, synset 14350206-n 'myelitis' has 14336539-n 'inflammation' as its hypernym, which is also the first word in the English and Indonesian definitions. Wordnet Bahasa does have inflamasi "inflammation", but only in a different synset.

2. The semantic relation is not written explicitly in the definition. For example, synset 14573846-n 'viremia' has kehadiran "presence" as the first word in the English and Indonesian definitions, which has nothing to do with the semantic relation.

3. The genus candidate is a relational noun. For example, synset 13251154-n has istilah "terms" and synset 07603411-n has singkatan "abbreviation" as the first word in the definition. To get the real genus term requires more parsing.

4. Compounds were not extracted. For example, although the head of the definition of synset 14364217-n was bekas luka "scar" (lit. "former wound"), we extracted only the first word bekas "former, past".

5. The definition is incomplete. For example, the Indonesian definition for synset 00046344-n lacks the head noun usaha "feat".

Problem                           | Number of synsets | Example synset                        | Example definition
No match                          | 3,943             | 14350206-n myelitis                   | inflamasi pada syaraf tulang belakang "inflammation of the spinal cord"
                                  |                   | 14573846-n viremia                    | kehadiran suatu virus di dalam aliran darah "the presence of a virus in the blood stream"
                                  |                   | 13251154-n clobber                    | istilah informal untuk harta pribadi "informal terms for personal possessions"
                                  |                   | 07603411-n choc                       | singkatan dalam bahasa Inggris untuk coklat "colloquial British abbreviation for chocolates"
                                  |                   | 14364217-n sword-cut                  | bekas luka dari sayatan pedang "a scar from a cut made by a sword"
                                  |                   | 00046344-n stunt                      | tidak biasa atau berbahaya "not usual or dangerous"
Word not in Wordnet, in KBBI      | 252               | 13436063-n automatic data processing  | pemrosesan data secara otomatis "automatic data processing"
                                  |                   | 07865105-n chili dog                  | hot dog dengan daging sapi diberi cabai bubuk "a hotdog with chili con carne on it"
                                  |                   | 14099050-n visual aphasia             | ketidakmampuan memahami kata-kata tertulis "inability to perceive written words"
Word not in Wordnet, not in KBBI  | 495               | 09603258-n Pluto                      | karakter kartun anjing ciptaan Walt Disney "a cartoon character created by Walt Disney"
                                  |                   | 14155506-n cystic fibrosis            | disebabkan kerusakan suatu gen "caused by defect in a single gene"
                                  |                   | 00662589-v insure                     | membagikan kawasan untuk kawalan tentara "allot regions for soldiers"
No explicit relations             | 11                | 01773734-v grudge                     | terpaksa menerima atau mengakui "accept or admit unwillingly"
Total                             | 4,701             |                                       |

Table 9: Problems found in extracting relations
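The relation checks described in Section 3.2 can be sketched over a toy synset graph. The synset identifiers and graph below are hypothetical; the actual system compares synsets in Wordnet Bahasa.

```python
# Toy hypernym graph: synset -> its hypernym synset (hypothetical IDs).
HYPERNYM = {"dog.n": "animal.n", "animal.n": "organism.n"}
INSTANCE_HYPERNYM = {"fido.n": "dog.n"}

def relation(lemma_synset, genus_synset):
    """Relation between a lemma's synset and the synset of the
    definition's first word (the genus candidate)."""
    if lemma_synset == genus_synset:
        return "synonym"             # the same word defines the lemma
    if HYPERNYM.get(lemma_synset) == genus_synset:
        return "hypernym"            # the genus is the lemma's hypernym
    if INSTANCE_HYPERNYM.get(lemma_synset) == genus_synset:
        return "instance hypernym"
    if HYPERNYM.get(genus_synset) == lemma_synset:
        return "hyponym"             # a hyponym is used to define the lemma
    return None                      # left for manual checking

print(relation("dog.n", "animal.n"))   # hypernym
print(relation("fido.n", "dog.n"))     # instance hypernym
print(relation("animal.n", "dog.n"))   # hyponym
```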
230
The second problem we found is that the first word in 747 definitions (15.89%) is not in Wordnet. In this case, we checked whether the word is in the Indonesian dictionary (KBBI), as mentioned in the previous section. We found 252 definitions having 149 unique words (the heads) which are in KBBI but not in Wordnet. Some of them are compounds, as in synset 07865105-n with the definition hot dog dengan daging sapi diberi cabai bubuk "a hotdog with chili con carne on it". We did not distinguish compounds and thus failed to extract hot dog as the head. The word hot does exist in KBBI, as an adjective meaning 'sexually excited or exciting'.

The remaining 495 definitions have 235 unique words which are not in KBBI. We found four patterns for this:

1. Derived words with negation are not listed as lexical items in KBBI. For example, the word ketidakmampuan "inability" (lit. "not able-ness") has the stem tidak mampu "not able" with a nominalizing circumfix ke-...-an. Also in this group are ketidakadaan "absence" (lit. "not present-ness") and ketidaksempurnaan "imperfection" (lit. "not perfect-ness").

2. The online KBBI data is not perfect; it does not include all Indonesian words listed in the paper dictionary. For example, the word karakter "character" is listed in the paper dictionary but not in the online version.

3. The Indonesian definition is incomplete. For example, the Indonesian definition for synset 14155506-n lacks the head noun penyakit "disease".

4. The Indonesian definition is incorrect. For example, the Indonesian definition for synset 00662589-v.

We found that 11 lemmas have no explicit semantic relations with their definitions. They are all verbs: 01773734-v 'grudge', 00616857-v 'neglect', 01336635-v 'overlay', 01767949-v 'strike', 01944252-v 'hover', 02086805-v 'stampede', 02119241-v 'ignore', 02150510-v 'watch', 02413480-v 'work', 02581477-v 'prosecute', and 02673965-v 'stand out'.

5 Summary and future work

We have presented the process of cleaning up the definitions and extracting relations from them. While doing the relation extraction, we spotted errors, such as incompleteness and incorrectness in the definitions, which we could not detect by cleaning up the definitions alone. The reason for these errors is probably the limited quality control in the translation process. In addition, we found things to be improved in Wordnet Bahasa and Wordnet in general. Based on our findings, we propose to:

1. Edit the incomplete Indonesian definitions. For example, the definitions for synset 00046344-n, which lacks the head noun usaha "feat", and 14155506-n, which lacks the head noun penyakit "disease", as mentioned in Section 4

2. Delete the incorrect Indonesian definitions. For example, the definition for synset 00662589-v 'insure', which has the Indonesian definition membagikan kawasan untuk kawalan tentara "allot regions for soldiers"

3. Add 149 new lemmas from KBBI, and possibly derived words with negation, to Wordnet Bahasa

4. Add existing lemmas in Wordnet Bahasa to the correct synsets. For example, inflamasi to be added to synset 14336539-n 'inflammation'

5. Edit definitions in Wordnet to make them more informative, possibly adding the hypernyms. For example, instead of having the definition jenis dari genus Soleidae "type genus of the Soleidae" for synset 02664136-n 'Solea', we propose jenis ikan dari genus Soleidae "type of fish from the Soleidae genus"

6. Standardize the definitions in Wordnet, possibly making some guidelines for definitions. For example, regarding numerals, some of them are written alphabetically, as in synset 09506337-n 'Fury' tiga monster berambut ular... "three snake-haired monsters...", but some are written in digits, as in synset 09549416-n 'Hyades' 7 putri Atlas... "7 daughters of Atlas...". Another problematic case is circular definitions. For example, for synset 04658942-n 'inhospitableness' memiliki sifat tidak ramah "having an unfriendly and inhospitable disposition" and synset 04657876-n 'unfriendliness' "an unfriendly disposition"
Acknowledgments

Thanks to Hammam Riza, who gave us permission to use the Indonesian definitions from the Asian WordNet project. Thanks to Randy Sugianto and Ruli Manurung for their help. This research was supported in part by the MOE Tier 2 grant That's what you meant: a Rich Representation for Manipulation of Meaning (MOE ARC41/13).

References

Hasan Alwi, Dendy Sugono, and Sri Sukesi Adiwimarta. 2008. Kamus Besar Bahasa Indonesia Dalam Jaringan (KBBI Daring). 3rd edition.

Hasan Alwi, Soenjono Dardjowidjojo, Hans Lapoliwa, and Anton M. Moeliono. 2014. Tata Bahasa Baku Bahasa Indonesia. Balai Pustaka, Jakarta, 3rd edition.

Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Francis Bond, Eric Nichols, Sanae Fujita, and Takaaki Tanaka. 2004. Acquiring an ontology for a fundamental vocabulary. In 20th International Conference on Computational Linguistics (COLING-2004), pages 1319–1325, Geneva.

Francis Bond, Lian Tze Lim, Enya Kong Tang, and Hammam Riza. 2014. The combined wordnet bahasa. NUSA: Linguistic studies of languages in and around Indonesia, 57:83–100.

Christiane Fellbaum. 1998. WordNet: an electronic lexical database. MIT Press, Cambridge.

Christiane Fellbaum. 2005. WordNet and wordnets. In Encyclopedia of language and linguistics, pages 665–670. Elsevier, Oxford, 2nd edition.

Nurril Hirfana Mohamed Noor, Suerya Sapuan, and Francis Bond. 2011. Creating the open Wordnet Bahasa. In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation (PACLIC 25), pages 258–267, Singapore.

Hammam Riza, Budiono, and Chairil Hakim. 2010. Collaborative work on Indonesian WordNet through Asian WordNet (AWN). In Proceedings of the 8th Workshop on Asian Language Resources, pages 9–13, Beijing, China. Asian Federation for Natural Language Processing.

James Neil Sneddon, Alexander Adelaar, Dwi Noverini Djenar, and Michael C. Ewing. 2010. Indonesian Reference Grammar. Allen & Unwin, New South Wales, 2nd edition.
Semantics of body parts in African WordNet: a case of Northern Sotho
Mampaka Lydia Mojapelo
University of South Africa
Department of African Languages
[email protected]
Abstract

This paper presents a linguistic account of the lexical semantics of body parts in African WordNet, with special reference to Northern Sotho. It focuses on external human body-part synsets in Northern Sotho. The paper seeks to support the effectiveness of African WordNet as a resource for services such as healthcare and medicine in South Africa. This exploration showed that the examined synsets display either a one-to-one correspondence or some form of misalignment of lexicalisation. The paper concludes by making suggestions on how African WordNet can deal with such semantic misalignments in order to improve its efficiency as a resource for the targeted purpose.

1 Introduction

African WordNet is a project that aims to build a lexical database for all indigenous official languages of South Africa, which will be linked to one another. It is modelled on Princeton WordNet¹ through the expand approach (Vossen, 1998). The approach was informed by experiences shared by earlier wordnets such as BalkaNet, MultiWordNet and the languages in EuroWordNet, to name but a few. The expand approach takes synonym sets (synsets) from Princeton WordNet, with their relations, and converts them into the target language. The approach thus already commits the development of African Wordnets to the use of more than one language, that is, English and the target language concerned. African WordNet is further internally multilingual, with five of the nine official African languages of South Africa currently part of the project. Northern Sotho (Sesotho sa Leboa)² is one of the languages involved. The premise in building African Wordnets is to model them on the Princeton structure while staying true to the African context.

Among the challenges encountered in the process of building African Wordnets was that some of the synsets extracted from Princeton for the development of African WordNet did not make immediate sense for African languages and the African context, for a number of reasons. Among them, for example, are synsets for concepts that are geographically distant from the South African context, such as animal and plant species. This situation results in non-lexicalised concepts. Some non-lexicalised concepts were left blank, and for some it was decided that available linguistic resources would be used for coinage and borrowing. The envisaged convenience of African WordNet became clearer to the writer (a linguist, project translator or lexicographer) through other synsets of a more general nature that were easy to work with. One of the semantic domains considered generally applicable to any context was Anatomy, BodyPart. It was assumed that this kind of domain would have relatively fewer gaps than domains that are geographically or culturally more restricted. BodyPart also ranks ninth among the 50 most frequent Suggested Upper Merged Ontology (SUMO) classes in Princeton WordNet (PWN), as at 2014-03-11.

The downside of BodyPart was that the synsets extracted for Northern Sotho showed that none of the synsets done so far were aimed at human anatomy. The SUMO_BodyPart entries consisted of words unrelated to humans, such as 'scale' (as in fish-scale), 'shell', 'paw', 'feather' and 'wool'. Further examples of this unrelatedness to humans are that the senses of the word seatla 'hand' were limited to Domain_Transport, SUMO_Device and Domain_Factotum, SUMO_ConstantQuantity, and denotation of parts of the human body did not feature. Similarly, the senses of leoto 'leg' were limited to Domain_Factotum, SUMO_ShapeAttribute and Domain_Zoology, SUMO_Mammal, which is a different synset from Domain_Anatomy, SUMO_BodyPart. This paper is premised on the understanding that, comparatively speaking, non-human body parts and the other domains mentioned here may not demonstrate the immediate and direct societal impact of African WordNet to the extent that may be achieved with human body parts.

¹ http://wordnet.princeton.edu
² Cf. Guthrie's zone S30
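The expand approach described above can be sketched in a few lines of Python. This is a minimal illustration, not project code: the synset fragment uses the two synset IDs quoted later in this paper ('arm' and 'hand'), and the translation table is an illustrative stand-in for the translator's work.

```python
# Sketch of the "expand" approach: copy a Princeton-style synset structure,
# preserving IDs and relations, and replace lemmas with target-language
# equivalents (illustrative Northern Sotho forms, not actual project data).

PWN_FRAGMENT = {
    "ENG20-05245410-n": {"lemmas": ["arm"], "part_meronyms": ["ENG20-05246212-n"]},
    "ENG20-05246212-n": {"lemmas": ["hand"], "part_meronyms": []},
}

# Target-language equivalents supplied by a translator/lexicographer.
TRANSLATIONS = {"arm": ["letsogo"], "hand": ["seatla", "letsogo"]}

def expand(source):
    """Copy each synset, keeping IDs and relations, translating the lemmas."""
    target = {}
    for sid, syn in source.items():
        lemmas = []
        for lemma in syn["lemmas"]:
            # A concept with no equivalent simply stays non-lexicalised (no lemmas).
            lemmas.extend(TRANSLATIONS.get(lemma, []))
        target[sid] = {"lemmas": lemmas, "part_meronyms": list(syn["part_meronyms"])}
    return target

nso = expand(PWN_FRAGMENT)
print(nso["ENG20-05246212-n"]["lemmas"])  # -> ['seatla', 'letsogo']
```

The point of the sketch is that the relational skeleton (here the part_meronyms link) is inherited unchanged from the source wordnet; only the lemma layer is re-expressed in the target language.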
2 About the body-parts lexicon in Northern Sotho

Since the available body-parts synsets in the Northern Sotho Wordnet were deemed not immediately useful for human healthcare and medicine purposes, the writer considered exploring external human body parts first, to be followed later by internal ones to complete the healthcare and medical intent. A list was drawn up, then verified and augmented against Northern Sotho Language Board (1988) as well as Ziervogel & Mokgokong (1975) and a paper in progress on verbs expressing physical pain. The list had Northern Sotho entries and English equivalents. Already when giving equivalents outside WordNet it emerged that there may be misalignment in the form of general-specific lexicalisation of senses. For example, Northern Sotho uses the same word for 'finger' and 'toe'. Unless the difference is readily apparent from the context, a descriptive phrase is used for ease of denotation. The question is: how big is the misalignment, and how are we going to solve the problem linguistically? The sample used here serves as an index of misalignments, as well as of possible solutions, for the rest of the development of the Northern Sotho Wordnet. The next step was to match the body parts on the list with English synsets.

South Africa is a multilingual and multicultural country. According to the latest South African statistics (Statistics South Africa, 2012) on the use of home languages, only 9.6% of the general population speak English as their home language (L1), while the majority speak the other ten official languages and their dialects as L1. The remainder (>90%) speak English either as a second, third or fourth language, or not at all. Among this vast majority are healthcare workers, medical students and practitioners, as well as individuals and communities who should receive healthcare and medical services. Another issue is that studies incidental to most academic qualifications in South Africa are presented through the medium of English, which inevitably means that most students learn through a foreign medium. For some, English schooling starts before they have duly mastered their L1. This apparent disadvantage is balanced by the foundation laid in English, which gives the student a significant head start in his or her academic career, albeit with insufficient knowledge of his or her L1. L1 English speakers, on the other hand, are not motivated to learn other languages until they have completed their studies and happen to find themselves in an occupational environment where they have to adjust to a different language medium. It may therefore be useful to provide a multilingual platform for accessing domain lexicons on a level that is more than just a dictionary. Terminology lists and glossaries are being developed for various purposes in South Africa, including healthcare and medicine, but none of these is an African-language wordnet. African WordNet will not only provide definitions and contextual usages of words, but will be based on synsets. Synsets are sets of lexicalisations of a particular concept, and WordNet links them to other concepts through semantic relations such as hyponymy and meronymy, in the case of nouns. African WordNet will further link the languages spoken in the country to each other.
3 Lexical entries in Northern Sotho Wordnet
In keeping with Princeton the lexical entries in
African WordNet are guided by information such
as part of speech (POS), domain, SUMO,
definition, usage and the English ID. This paper
focuses on the Northern Sotho nouns under the
Domain_Anatomy, SUMO_BodyPart. According
to the definition and usage provided in English as
well as the ID, only body parts that are
specifically human were picked out. Fellbaum
(1998) contends that although the majority of
lexicalised concepts are shared among languages,
not every language will have words denoting
concepts that are lexicalised in other languages.
Therefore it is expected that some concepts may
be lexicalised in English and not in Northern Sotho, and vice versa. It is deemed necessary for this semantic domain to have as many lexicalised concepts as possible, given the envisaged use in the healthcare sector. The paper will also look into these semantic relations and ensure that the Northern Sotho synsets are presented in a manner that is not misconstrued.

Lexicalisation is defined as the realisation of meaning in a single word or morpheme, both where words are already present in a language and through the addition of new words as new concepts enter the language in due course. The addition of new words involves word-formation strategies such as compounding, derivation and borrowing. Another aspect of lexicalisation is some level of acceptability among the speakers of a language, which will lead to general acceptability. The body-parts synsets in Northern Sotho reflect different types of lexicalisation, including the addition of new words by the strategies mentioned above. There are also cases of non-lexicalisation which have yet to be resolved.

Although the expand approach has proved to be most expedient for new wordnets, lexicalisation challenges are inevitable for most of them. For example, challenges were experienced in building the Konkani WordNet from the closely related Hindi WordNet (Walawalikar et al., 2010). The challenges also involved the English source, and they include linking errors, missing entries and definitions, concept misalignment and lexicalisation. The issue of culture-specificity is also reported as one of the causes of misalignment. In dealing with alignment in the Hebrew WordNet, which was also built on the expand approach, Ordan and Wintner (2007) distinguish between contingent and systematic instances of non-equivalence. The two cases attest to the fact that the lexicons of different languages mirror misalignments of both a cultural and an internal language-structural nature.

Vincze and Almási (2014) also treat lexicalisation challenges encountered in dealing with the Hungarian WordNet. The intention of this paper is not to reinvent the wheel but to learn from others' experiences, in the realisation that languages may be dissimilarly resourced, materially and structurally. Northern Sotho is a Bantu language of the Niger-Congo language family, which is agglutinating with productive morphology. Therefore one lexicalisation type or mechanism may prove to be more practical than another. For the purpose of this paper it is assumed that Northern Sotho may be differently resourced, given the objective of exploring how the project can try to solve extant misalignment challenges without losing the Princeton structure while remaining true to the African context, a manoeuvre requiring a certain amount of finesse.

4 Queries and results

To begin, the items on the list were queried from the English dictionary in DEBVisDic (the WordNet editor and browser). Only sense 1 of SUMO_BodyPart under Domain_Anatomy was selected. The definitions, usages and synset IDs were used to obtain correct matches. General personal knowledge of Northern Sotho, as a mother-tongue speaker, was complemented and verified against the Northern Sotho-English bilingual and Northern Sotho-English-Afrikaans trilingual dictionaries. The results gained from the queries confirmed some degree of misalignment between Northern Sotho and English. Clearly no comment is required on the one-to-one matches. The examples used here represent one-to-many and many-to-one mappings as well as lexicalisation gaps.

A sample of words representing 88 Northern Sotho concepts, with English equivalents, was used. The list is not exhaustive, but it is a fair representation of external human body parts. Also, not all possible connections have been indicated in the illustrations. While the initial focus was on external body parts, parts of the oral cavity were included, as they are close to the external facial body parts and not as concealed as other internal body parts. The English equivalents of the Northern Sotho words on the list were browsed and their IDs noted in order that their definitions and usages establish correct matches.
Queried senses in English (anatomy, human body part) were not found for the following words:

head
big hair
hair on arms and legs
protruding forehead
eye ridge
cheek
tongue
adam's apple
below the buttock (where the thigh starts)
back of hand
back
back of knee
foot
heel

When queried, the relevant senses of the words above could not be matched with the IDs found in DEBVisDic. A peculiar gap in English on human body parts relates to 'head', 'cheek', 'tongue', 'adam's apple', 'back', 'foot' and 'heel'. It is assumed that the rest of the words may be more physiologically or culturally relevant in Northern Sotho than in English. While it is still peculiar to some extent that 'back' was not found, because physiologically, especially in the healthcare and medical context, the concept should have the same denotative significance in both languages, the gap was understood in the context of possible cultural dissimilarities. Mokokotlo 'back', as in the 'back part of the human torso', is one of the most recognisable lexical items in Northern Sotho because of what the concept represents. It is the part of the body on which a baby or toddler is carried and strapped for guaranteed safety and protection. In this context the back is culturally associated with care, nurturing, raising, acceptance and protection. The concept (and therefore the word) is culturally significant. With regard to setšhitšhi 'big hair' (not the same as 'long hair', which would be natural in the English lexicon), the gap in English is understood to be due to physiological difference.

Halliday et al. (2004) explicate at length the problems of cross-language mapping, even for concepts that seem simple, such as kinship terms. The examples of siblings and cousins in English and Australian Pitjantjatjara resonate with Northern Sotho and other Bantu languages. Therefore the issue of misalignment is not only a matter of lexical items, but of concepts as well.

The following diagrams provide reference for the current discussion. For every Northern Sotho lexical item, an English translation equivalent is provided. For combined connections, refer to Appendix 1.

Diagram 1: Arm connections

Diagram 2: Leg connections
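The matching and gap-finding procedure described in this section can be sketched as follows. This is an illustrative stand-in for the DEBVisDic queries, not the project's actual workflow: only the synset IDs are taken from this paper, and the miniature synset index and word list are assumptions for the example.

```python
# Sketch: for each Northern Sotho item and its English equivalent(s), look for
# a matching sense-1 BodyPart synset and record mappings versus gaps.

# Tiny stand-in for the English synsets browsed in DEBVisDic.
ENGLISH_SYNSETS = {
    "finger": "ENG20-05247839-n",
    "toe": "ENG20-05258265-n",
    "eyebrow": "ENG20-05007503-n",
    "eyelash": "ENG20-05008887-n",
}

# Northern Sotho items with English equivalents (one word may map to many).
NSO_LIST = {
    "monwana": ["finger", "toe"],
    "ntši": ["eyebrow", "eyelash"],
    "ntahle": ["back of hand"],  # no matching synset: a lexicalisation gap
}

def classify(nso_list, synsets):
    """Split the word list into matched mappings and lexicalisation gaps."""
    matches, gaps = {}, []
    for word, equivalents in nso_list.items():
        ids = [synsets[e] for e in equivalents if e in synsets]
        if ids:
            matches[word] = ids   # one-to-one or one-to-many mapping
        else:
            gaps.append(word)     # no English sense found when queried
    return matches, gaps

matches, gaps = classify(NSO_LIST, ENGLISH_SYNSETS)
print(matches["monwana"])  # -> ['ENG20-05247839-n', 'ENG20-05258265-n']
print(gaps)                # -> ['ntahle']
```

A one-to-many entry such as monwana is exactly the kind of result the rest of this section analyses: one target-language word covering two source-language synsets.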
236
Diagram 3: Torso connections

Diagram 4: Head connections

4.1 One-to-many and many-to-one

Two types of misalignment will be used for illustration here. There are cases of Northern Sotho lexicalisation of human body parts that mingle synonymy and meronymy, though not in a confusing way. In the context of WordNet, words are synonymous if they express the same concept and can be interchanged in some contexts (Fellbaum, 1998). Meronymy is explained by Croft and Cruse (2004) as a sense relation between meanings rather than between individual entities, that is, when the meaning of one word is part of the meaning of another. The word for 'hand' in Northern Sotho is seatla. It expresses the same concept expressed by [POS: n ID: ENG20-05246212-n BCS: 3], which is sense 1 of Domain_Anatomy, SUMO_BodyPart, defined in English as "the (prehensile) extremity of the superior limb". Letsogo is Northern Sotho for 'arm' [POS: n ID: ENG20-05245410-n BCS: 3], Arm: 1, defined in English as "a human limb; technically part of the superior limb between the shoulder and the elbow but commonly used to refer to the whole superior limb". In Northern Sotho letsogo refers to the whole superior limb, which includes the hand. According to the definition provided above, the common usage of the English 'arm' is the same as that of the Northern Sotho letsogo, but the technical usage is not. In Northern Sotho the word letsogo is also used to refer to seatla 'hand', but the whole limb is never called seatla. That is, while seatla 'hand' is a meronym of letsogo 'arm', the two are also synonymous. Similarly, leoto 'leg' [ENG20-05242579-n] is used for both 'leg' and 'foot', while a separate specific word for 'foot' is lenao. These examples illustrate lexicalisation that reflects the occurrence of meronymy between lexical items that are also synonymous.

Another scenario relates to the case of monwana for both 'toe' [ENG20-05258265-n] and 'finger' [ENG20-05247839-n], and ntši for 'eyebrow' [ENG20-05007503-n] and 'eyelash' [ENG20-05008887-n]. In this case Northern Sotho uses one word to express separate concepts, or concepts that are viewed as separate in English. These two examples illustrate that the words monwana and ntši are used in Northern Sotho as hypernyms. Descriptive phrases 'of the foot' and 'of the hand' are used to form hyponyms of monwana in cases where distinction is deemed necessary. A similar descriptive strategy is not used for ntši; it would also be cumbersome, as both 'eyebrow' and 'eyelash' belong to the eye.
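The coexistence of synonymy and meronymy described in Section 4.1 can be made concrete with a small sketch. The record layout below is a simplification assumed for illustration (it is not the DEBVisDic format); the two synset IDs and the seatla/letsogo facts come from the discussion above.

```python
# Sketch: seatla 'hand' shares a synset with letsogo (synonymy), while the
# 'hand' synset is also a part-meronym of the 'arm' synset (meronymy).

SYNSETS = {
    "ENG20-05245410-n": {"lemmas": ["letsogo"],
                         "gloss": "arm: the superior limb"},
    "ENG20-05246212-n": {"lemmas": ["seatla", "letsogo"],
                         "gloss": "hand: the (prehensile) extremity of the superior limb"},
}
PART_MERONYMS = {"ENG20-05245410-n": ["ENG20-05246212-n"]}  # arm has-part hand

def are_synonyms(word_a, word_b):
    """Words are synonymous if they share at least one synset."""
    return any(word_a in s["lemmas"] and word_b in s["lemmas"]
               for s in SYNSETS.values())

def is_meronym_lemma(part_word, whole_word):
    """True if some synset of part_word is a part-meronym of a synset of whole_word."""
    for whole_id, syn in SYNSETS.items():
        if whole_word not in syn["lemmas"]:
            continue
        for part_id in PART_MERONYMS.get(whole_id, []):
            if part_word in SYNSETS[part_id]["lemmas"]:
                return True
    return False

print(are_synonyms("seatla", "letsogo"))      # -> True (shared 'hand' synset)
print(is_meronym_lemma("seatla", "letsogo"))  # -> True (hand is part of arm)
```

Because the two relations are stored independently (synset membership versus a meronymy link between synset IDs), nothing prevents a pair of lemmas from participating in both at once, which is exactly the Northern Sotho situation described above.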
4.2 Possible non-lexicalisation in English

Another concept that is lexicalised in Northern Sotho but could not be found by querying the English in DEBVisDic is nyaraga (Ziervogel and Mokgokong, 1975), also pronounced nyarago. The English trees relating to 'leg' and 'buttock' were examined, as the concept is understood to be either a body part below the buttock or the uppermost back part of the leg. Its absence in the two trees pointed to possible non-lexicalisation.

The following section proposes possible linguistic means of catering in African WordNet for the misalignment issues mentioned above.

5 Handling misalignments

It is necessary to provide linguistic solutions to the misalignment challenges mentioned above. Vincze and Almási (2014) suggest a number of strategies for the Hungarian lexicalisation issues, namely to shorten the tree, flatten the tree, restructure the tree, or lexicalise the concepts. They are also of the opinion that the merge approach would have alleviated some of the challenges. For Konkani, Walawalikar et al. (2010) suggest, among others, that target-language synsets for which there were gaps in the source language could be used to fill those gaps, thereby strengthening the Hindi WordNet. Ordan and Wintner (2007) detail strategies for building Hebrew synsets, which include linking Hebrew word senses to related PWN synsets both from Hebrew to English and from English to Hebrew. Lexical gaps on both sides are acknowledged and used to preserve and link semantic information.

This paper takes a linguistic view of addressing the challenges mentioned above, which relate to the lexicalisation of the concepts. The first group of Northern Sotho words, which could not be matched from English, seems to be a matter of misses that can be addressed if probed further.

The next situation concerns seatla 'hand' and lenao 'foot', which are meronyms of letsogo 'arm' and leoto 'leg', respectively, and proved to be synonymous with them as well. Therefore the lexical items seatla and letsogo will be in the same synset while also being meronymically related. The same applies to lenao 'foot' and leoto 'leg'.

The next issue concerns monwana 'finger' and 'toe', and ntši 'eyelash' and 'eyebrow'. In the language, synonyms for monwana are provided in the form of descriptive phrases to distinguish 'finger' and 'toe'. The descriptions wa lenao and wa leoto 'of the foot', and wa seatla and wa letsogo 'of the hand', are consistent with language usage and are not expected to pose any problems. The same solution cannot work in the case of ntši, since eyebrow and eyelash are both 'of the eye'. Northern Sotho Language Board (1988) uses compounding as a strategy to distinguish the two. While they are both ntši, the source coined ntšikgolo as an additional lexicalisation for 'eyebrow'. The second component of the compound, -kgolo (-golo) 'big', suggests that an eyebrow is dominant. The source was produced by a standardising body (the Northern Sotho Language Board), which was obviously cognisant of the gaps in terms of lexicalisation. They probably considered either the overarching position of the eyebrow in relation to the eyelashes, or the perceived amount of hair in each, to come up with the suggestion that an eyebrow is the main ntši. Another example of compounding from the same source is khurumelakhuru for 'kneecap': -khurumela is a verb stem meaning to close or to cover, and khuru is 'knee', so the conceptualisation points to something that covers, closes off or protects the knee. Lexicalisation strategies such as these provide promising resources for African WordNet. What remains to be seen is whether or not such lexical items will filter down to everyday usage.

The last issue relates to the apparent English non-lexicalisation of concepts that are lexicalised in Northern Sotho, and vice versa. Nyaraga 'below the buttocks' is part of the Northern Sotho lexicon whose English lexicalisation could not be ascertained; the English equivalent is provided in Northern Sotho dictionaries as a phrase. The English lexicalisation of the Northern Sotho ntahle 'back of hand' could also not be ascertained. Over and above being a body part, part of a hand, ntahle has an added connotation relating to slapping (a backhand slap). That is, slapping someone with the inner part of the hand and with the outer part of the hand would be reflected by the use of different lexical items. Such words need to be added, as they represent concepts that are intertwined with the idiom of the language.

An expected scenario of the expand approach, where English is the source language, would obviously reveal Northern Sotho non-lexicalisation of concepts that are lexicalised in English. With regard to the domain under discussion, descriptive phrases are common; for example, 'nose' is nko and 'nostril' is lešoba la nko, literally 'hole of nose'. 'Pubis' is lerapo la pele la noka, literally 'bone of front of waist'. Another lexicalisation mechanism that is productive in Bantu languages, though not observed in the current sample, is derivation: affixes are used productively to form words from different word categories. Direct borrowing is also not evident in the current sample, but it is commonly used in the lexicalisation of technological concepts and specific disease names. From this sample an example of indirect borrowing is evident in coinage that resembles English formations, such as khurumelakhuru above and moropana wa tsebe, literally 'small drum of ear', for 'eardrum'. The lexicalisation mechanisms employed for this sample hint at linguistic routes to follow in the further development of human body parts.

6 Challenges

While the linguistic side of the project may prove exciting, there are challenges of an IT nature. These include changes in the IT infrastructure at the hosting institutions, as well as problems with the DEBVisDic editor. Such challenges hamper the development of the wordnets, as they result in interrupted access to the server and inconsistent functionality of the editor. This becomes a problem when one wants to browse and edit existing synsets, or add new synsets. Nonetheless, manual and semi-automatic data-gathering methods are used so that when a permanent IT solution is reached there is enough linguistic data to fast-track the development of the wordnets.

7 Conclusion

The paper presented actual and possible scenarios that may pose challenges when developing the Northern Sotho Wordnet on Domain_Anatomy, SUMO_BodyPart. Human body parts are targeted in this paper because of their connection to human healthcare and medicine. Many speakers whose L1 is not Northern Sotho may benefit from the database, as it will link Northern Sotho not only to English but to other South African indigenous languages as well. Not only were different types of lexical misalignment presented, but also the lexicalisation mechanisms that are used in the language. While the mechanisms mentioned may be grammatically sound and fill lexicalisation gaps, the words also need to attain general acceptability to the point of being used with reasonably high frequency rather than merely existing.

It is envisaged that the proposed strategies will fill the gaps, and that the inclusion of internal body parts and functions, as well as verbs expressing physical pain, will produce trees that mirror the language. It remains to be seen how far the translators in the project will go in utilising the lexicalisation strategies mentioned in this paper. To assist with acceptability and standardisation, the synsets will also be shared with selected practitioners in the target field for comment.

Acknowledgement

Ms Marissa Griesel, for support with the illustrations.

References

Christiane Fellbaum (Ed.). 1998. WordNet: an electronic lexical database. Cambridge, Mass.: The MIT Press.

Dirk Ziervogel and Pothinus C. Mokgokong. 1975. Pukuntšu ye kgolo ya Sesotho sa Leboa / Comprehensive Northern Sotho dictionary / Groot Noord-Sotho woordeboek. Pretoria: J. L. Van Schaik.

M.A.K. Halliday, Wolfgang Teubert, Colin Yallop and Anna Čermáková. 2004. Lexicology and Corpus Linguistics: an introduction. London: Continuum.

Noam Ordan and Shuly Wintner. 2007. Hebrew WordNet: a test case of aligning lexical databases across languages. International Journal of Translation 19(1):39-58.

Northern Sotho Language Board. 1988. Sesotho sa Leboa Mareo le Mongwalo No. 4 / Northern Sotho Terminology and Orthography No. 4 / Noord-Sotho Terminologie en Spelreëls No. 4. Pretoria: Government Printer.

Piek Vossen (Ed.). 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Dordrecht: Kluwer Academic Publishers.

Shantaram Walawalikar, Shilpa Desai, Ramdas Karmali, Sushant Naik, Damodar Ghanekar, Chandralekha D'Souza and Jyoti Pawar. 2010. Experiences in Building the Konkani WordNet Using the Expansion Approach. In Proceedings of the Fifth Global WordNet Conference, January 2010, Mumbai, India.

Statistics South Africa. http://www.stassa.gov.za, accessed on 19 August 2015.

Veronika Vincze and Attila Almási. 2014. In Proceedings of the Seventh Global WordNet Conference, January 2014, University of Tartu, Estonia.

William Croft and D. Alan Cruse. 2004. Cognitive Linguistics. Cambridge: Cambridge University Press.
Appendix 1: Combined connections
WME: Sense, Polarity and Affinity based Concept Resource for Medical
Events
Anupam Mondal¹, Dipankar Das¹, Erik Cambria², Sivaji Bandyopadhyay¹
¹ Department of CSE, Jadavpur University, India
² School of Computer Engineering, Nanyang Technological University, Singapore
[email protected], [email protected], [email protected], [email protected]
Abstract

In order to overcome the scarcity of medical corpora, we developed the WordNet for Medical Events (WME) for identifying medical terms and their sense-related information using a seed list. The initial WME resource contains 1654 medical terms. In the present task, we report the enhancement of WME to 6415 medical terms along with their conceptual features, viz. gloss, semantics, polarity, sense and affinity. Several polarity lexicons, viz. SentiWordNet, SenticNet, Bing Liu's subjectivity list and Taboada's adjective list, were combined with WordNet synonyms and hyponyms for the expansion. The affinity feature helped us to prepare a medical ConceptNet containing the medical terms for visualization. Finally, we evaluated the resource with respect to the Adaptive Lesk Algorithm and conducted an agreement analysis to validate the expanded WME resource.

1 Introduction

In the domain of clinical text processing, sense-based information extraction is considered a challenging task due to the unstructured nature of the corpus. The difficulty of preparing structured corpora for the clinical domain stems from the limited involvement of domain experts (Smith and Fellbaum, 2004), although several lexicons have been developed and used to overcome the complexity present in the conventional NLP domain (Miller, 1995; Fellbaum, 1998). For the medical domain, in contrast, researchers have introduced only a few resources, e.g. Medical WordNet, to overcome such problems (Burgun and Bodenreider, 2001; Bodenreider et al., 2003). The WME resource was developed to provide sense-based medical information for both expert and non-expert groups of people (Mondal et al., 2015).

In the present attempt, we have expanded the WME resource with new features, namely semantics and affinity. The semantic feature helps to extract the relative sense-based words for the medical words and to assign the type of each medical word (e.g. medicine, disease, etc.). The affinity feature helps to develop a medical Concept Network (ConceptNet) for visualization (Cambria et al., 2010). Starting from an initial seed list of medical terms, WordNet synonyms and hyponyms along with several polarity lexicons were employed to enrich the WME resource. The polarity lexicons, viz. SentiWordNet¹, SenticNet², Bing Liu's subjectivity list³ and Taboada's adjective list⁴, were applied to the extracted synonyms and hyponyms to identify the proper sense.

In the next section, we discuss related work on the preparation of lexical resources for the clinical domain. In Section 3, the WME expansion techniques are described along with statistics on the building of WME. The feature selection and identification techniques are discussed in Section 4. The evaluation of the expanded WME resource and the agreement studies conducted are described in Section 5. Finally, in Section 6, we conclude and mention the future scope of the task.

2 Related Work

In the context of biomedical corpora, the extraction of medical terms (events) and their related information can help to develop an annotation system, which is essential for representing a structured corpus (UzZaman and Allen, 2010; Hogenboom et al., 2011). The polarity, sense and concept-related features play a crucial role in preparing the structured corpus in this domain.

¹ http://sentiwordnet.isti.cnr.it/
² http://sentic.net/
³ http://www.cs.uic.edu/~liub/
⁴ https://www.sfu.ca/~mtaboada/research/pubs.html
Several taxonomies have been designed by researchers to help non-experts understand medical terms and their related information (Tse, 2003; Zeng et al., 2003). In this vein, a research group built a medical information system that uses a vocabulary to arbitrate the extracted information and recognize the context for experts and non-experts (Patel et al., 2002).

Fellbaum and Smith proposed Medical WordNet (MEN) with two subnetworks, Medical FactNet (MFN) and Medical BeliefNet (MBN), for justifying consumer health information (Smith and Rosse, 2004). MEN followed the formal architecture of the Princeton WordNet (Fellbaum, 1998). The MFN guides the extraction and understanding of generic medical information for the non-expert group, whereas the MBN identifies the fraction of beliefs about medical phenomena (Smith and Rosse, 2004). Their primary motivation was to develop a network for medical information retrieval with a visualization effect.

The extraction of information (medical terms) from clinical corpora has been treated as an ambiguous task (Pustejovsky, 1995). A group of researchers introduced sense selection and pruning strategies for expanding the ontology of the medical domain (Toumouh et al., 2006). The WordNet of Medical Events (WME) resource was introduced as a lexical resource for identifying medical events and their related features, viz. POS, gloss, polarity and sense, from the corpus (Mondal et al., 2015). The POS signifies the lexical category of a medical event, while the gloss, polarity and sense features help to provide the semantic and knowledge-based information related to the medical event.
3 WME1.0 Building

Keyword extraction is essential for identifying sense-related information (e.g., the keywords "improves" and "capability" convey the positive sense of the sentence "A supplementary component that improves capability."). Sense-based word identification is a tedious job in the Bio-NLP domain. In this regard, the conventional WordNet helps to extract word-related information, viz. part-of-speech (POS), synonyms, hyponyms and definitions. To capture the syntactic behavior of the medical corpus, the WME1.0 resource was prepared with medical terms, and it supplies their related information, by which we can identify the syntactic and semantic behavior of the medical corpus.

The seed list of the WME resource was prepared from the trial and training datasets of SemEval-2015 Task 6.⁵ The conventional WordNet and an English medical dictionary were applied to the seed list to develop the initial WME resource. Initially, the resource extracted 2,479 medical events along with their attributes, such as type, span-context and sense (positive/negative), from the provided datasets (e.g., <tumor>, <event>, <An abnormal new mass of tissue that serves no purpose.>, <negative>). WordNet provides lexical information such as the POS, synonyms and definition of each word (medical event) (e.g., <Abdomen>, <Noun>, <1. abdomen 2. abdominal cavity>, <1. "The region of the body of a vertebrate between the thorax and the pelvis." 2. "The cavity containing the major viscera; in mammals it is separated from the thorax by the diaphragm.">). Meanwhile, an English medical dictionary identifies the POS descriptions or glosses of the words. The English Medical Dictionary was developed by H. Bateman and her group in 2007.⁶ A large amount of manual editing was carried out during preprocessing, and the preprocessed dictionary covers 11,750 English medical words along with their POS and gloss (e.g., <Adenoma>, <Noun>, <A benign tumor of a gland>).

Several polarity lexicons, such as SentiWordNet and Taboada's adjective list, were used to identify the appropriate gloss of the medical events from the file context, the WordNet definition and the dictionary gloss. Sense-based gloss identification was treated as a Word Sense Disambiguation (WSD) task (Basili et al., 1997). Sequential and combined WSD algorithms were applied to identify the proper sense-based gloss of the medical terms (events) (Mondal et al., 2015).

4 WME2.0 Building

The inclusion of semantic and knowledge-based features is crucial for preparing the expanded version of the existing resource, WME1.0.

5 http://alt.qcri.org/semeval2015/task6/
6 http://alexabe.pbworks.com/f/Dictionary+of+Medical+Terms+4th+Ed.-+(Malestrom).pdf
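The attribute layout described above (term, POS, gloss, polarity, sense, plus WordNet synonyms) can be sketched as a simple record. The record type below is our illustration, not part of the released resource; the polarity value is the one quoted in the paper's example, and the gloss and synonyms follow Princeton WordNet.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WMEEntry:
    """One WordNet of Medical Events entry, following the attribute
    layout described in the paper (the structure itself is illustrative)."""
    term: str
    pos: str
    gloss: str
    polarity: float          # lexicon-derived polarity score
    sense: str               # "Positive" or "Negative"
    semantics: List[str] = field(default_factory=list)  # WordNet synonyms

# Field values drawn from the paper's example and Princeton WordNet
mismanage = WMEEntry("mismanage", "Verb",
                     "manage badly or incompetently",
                     -0.625, "Negative",
                     ["mishandle", "misconduct"])
```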
The semantic, polarity, sense and affinity features have been employed, as these features help to identify and extract medical events from clinical corpora.
4.1 Feature Selection for Expansion
In order to select features, we have considered glosses and senses. The semantic and polarity features have been used for conceptual visualization (Cambria et al., 2015) and co-referencing, along with the affinity relations that exist among the medical events.
Gloss: It is obvious that not all the words of a sentence carry concept-related information (e.g., "achievable" carries the knowledge information of the sentence "The state of being achievable."). As gloss identification based on concept words is crucial, in WME 2.0 we have used the sequential and combined WSD approaches to extract the proper gloss of the medical terms present in the seed lists. The extracted gloss provides the sense-based knowledge of the medical terms.
Polarity and Sense: Nowadays, opinion and sense identification is treated as an emerging task. Thus, the polarity and sense features were extracted using several polarity lexicons, viz. SentiWordNet, SenticNet, Bing Liu's subjectivity list and Taboada's adjective list. Figure 1 shows the procedure for identifying the polarity and sense features of WME2.0 (e.g., <mismanage>, <-0.625>, <Negative>).
Semantic: The semantic feature is included to identify words with similar senses. In WME 2.0, the semantics of a medical term have been extracted with the help of WordNet synonyms. The following example illustrates the semantic feature of WME 2.0 (e.g., <maltreatment>, <abuse, misuse, mismanage, overlook>).
Affinity: The affinity feature is introduced in the present task to build a medical ConceptNet, which is essential for visualization as well as for identifying co-reference relationships. The affinity score between a pair of medical terms is calculated from the number of semantic words the two terms share. It is measured by the following equations:
Affinity(s) = |MT1(s) ∩ MT2(s)|                          (1)

Affinity-Score(s) = Affinity(s) / Σi |MTi(s)|             (2)

where i ranges over the two terms, and MT1(s) and MT2(s) represent the semantic sets of the two medical terms. Affinity(s) is the number of semantics the two medical terms have in common, and Affinity-Score(s) normalizes Affinity(s) with respect to all the semantics of the medical terms.

Figure 1. Sense-based technique for WME 2.0 representation
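Under this reading of equations (1) and (2), the affinity score can be sketched as follows. The synonym sets below are illustrative, not taken from the released resource.

```python
def affinity_score(sem1, sem2):
    """Affinity-Score between two medical terms, given their
    WordNet-derived semantic (synonym) sets."""
    # Affinity(s): number of semantics shared by the two terms (eq. 1)
    affinity = len(set(sem1) & set(sem2))
    # Normalize by the total number of semantics of both terms (eq. 2)
    total = len(set(sem1)) + len(set(sem2))
    return affinity / total if total else 0.0

# Hypothetical semantic sets for two related terms
maltreatment = {"abuse", "misuse", "mismanage", "overlook"}
mismanagement = {"misuse", "mismanage", "neglect"}
print(affinity_score(maltreatment, mismanagement))  # 2 shared / 7 total ≈ 0.286
```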
Figure 2 shows a part of the medical ConceptNet along with its affinity relations.

Figure 2. Partial visualization of the affinity-score-based medical ConceptNet
4.2 Statistics
We have tabulated the statistics of the initial and expanded versions of WME with respect to the number of medical terms and the POS and sense distributions in Table 1. The initial and expanded WME resources are termed WME 1.0 and WME 2.0, respectively, throughout the paper. These statistics indicate that it is difficult to expand the WME resource by word-level lexical analysis alone (such as the POS distribution); sense-based approaches were therefore introduced to overcome this challenge. The detailed statistics of the expanded medical terms using the above-mentioned polarity lexicons, along with a combined polarity lexicon, are given in Table 2. The combined polarity lexicon has been prepared from all the above-mentioned polarity lexicons by considering the common occurrences of medical terms.
                         WME1.0   WME2.0
No. of medical terms       1654     6415
POS          Noun          1019     4219
distribution Verb           488     2026
             Adjective      124      111
Sense        Positive      1338     2800
distribution Negative       316     3615

Table 1. Comparative Statistics

Taboada's adjective list, Bing Liu's subjectivity list and the SentiWordNet polarity lexicon gave satisfactory outputs for expanding the WME resource, while SenticNet (Cambria et al., 2016) guided us in introducing the semantic feature.

       SW     SN     BL     TA      CM
O    2938    210   1250   2509    6698
U    4125   1136   5301   9901   19328
S    1151    196    615   1017    1592
H    1623    698   2761   4833    6584

SW → SentiWordNet, SN → SenticNet, BL → Bing Liu's subjectivity list, TA → Taboada's adjective list, CM → Combined Medical List; O → Original terms, U → Unique terms, S → Synonyms, H → Hyponyms

Table 2. Statistics based on the senses of the different polarity lexicons

5 Discussion

5.1 Evaluation

We have carried out a preliminary evaluation of WME 2.0 against WME 1.0 with the help of the sense feature. The gloss sense of the medical terms of WME 2.0 was compared with the sense extracted from the polarity lexicon SentiWordNet. For clinical corpora, however, SentiWordNet has a coverage limitation with respect to medical words: we observed that it covers only about 40% of the medical terms of WME 2.0. The Lesk WSD algorithm was therefore also used to validate the senses of the medical terms of WME 2.0.

The simplified version of the Lesk algorithm primarily compares a term with its dictionary definition and generates the sense-based output of the term. We found this simplified version unsuitable for the WME 2.0 resource, due to the unavailability of dictionary definitions for most of the medical terms. To resolve this, we applied an Adaptive Lesk algorithm to extract the sense-based descriptions. The Adaptive Lesk algorithm not only compares the dictionary definitions but also considers the definitions of WordNet synsets.

We have evaluated WME 2.0, with the Adaptive Lesk algorithm applied to identify the proper sense-based gloss of the medical terms, in terms of F-measure, which is calculated from Recall (R) and Precision (P):

F-Measure = 2 * [(R * P) / (R + P)]                       (3)

Precision and Recall are 82% and 62% for WME 2.0, and 57% and 29% for the Lesk algorithm, respectively. The calculated F-measure values are 71% for WME 2.0 and 38% for the Lesk algorithm. The evaluation indicates that the WME 2.0 resource provides much more accurate sense-based gloss information than the simplified Lesk baseline.

5.2 Agreement Analysis

We have conducted a manual evaluation of the WME 2.0 resource to validate the expanded medical terms and their features. The agreement study was carried out by manual annotators because no medical sense-based gold-standard lexicon was available. Agreement has been measured using Cohen's kappa.⁷ The Cohen's kappa (k) value is computed from the proportionate agreement (Pr(a)) and the random agreement (Pr(e)) as follows:

k = [Pr(a) - Pr(e)] / [1 - Pr(e)]                         (4)

Table 3 presents the agreed (Y) and non-agreed (N) medical terms and their related information for the two annotators (denoted A and B). The agreement score indicates a satisfactory result for the WME 2.0 resource, with a kappa (k) value of 0.73.

No. of medical terms: 6415

          B: Y    B: N
A: Y      6094      51
A: N        77     193

Table 3. Agreement study of WME 2.0

7 https://en.wikipedia.org/wiki/Cohen's_kappa
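As a sanity check, Cohen's kappa can be recomputed directly from the Table 3 counts. The helper below is a sketch, not part of the released resource; small differences from the reported 0.73 can arise from rounding conventions.

```python
def cohens_kappa(yy, yn, ny, nn):
    """Cohen's kappa for a 2x2 agreement table:
    yy = both annotators say Y, nn = both say N,
    yn / ny = the annotators disagree."""
    n = yy + yn + ny + nn
    pr_a = (yy + nn) / n                      # proportionate (observed) agreement
    a_yes, b_yes = yy + yn, yy + ny           # marginal Y counts for A and B
    # random (chance) agreement from the marginals
    pr_e = (a_yes * b_yes + (n - a_yes) * (n - b_yes)) / (n * n)
    return (pr_a - pr_e) / (1 - pr_e)

# Counts from Table 3 (annotators A vs. B over 6415 medical terms)
print(round(cohens_kappa(6094, 51, 77, 193), 2))
```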
6 Conclusion and Future Work

The present task was initially concerned with expanding the WME 1.0 resource. Several polarity lexicons were applied to a seed list of medical terms, and their synonyms and hyponyms were also used, with sense mapping, for expansion. The WME 2.0 resource contains 6,415 medical terms along with several features, viz. POS, gloss, semantics, polarity, sense and affinity. The affinity feature helps us to build a medical ConceptNet for visualization. The extracted features help to build a clinical-domain system that can support both expert and non-expert groups of people. In future, we will attempt to enrich the WME 2.0 resource with more medical terms, along with some concept-based features, to improve both the quality and the coverage of the resource.

References

Basili, R., Della Rocca, M. and Pazienza, M. T. 1997. Contextual word sense tuning and disambiguation. Applied Artificial Intelligence, pp. 235-262.

Bodenreider, O., Burgun, A. and Mitchell, J. A. 2003. Evaluation of WordNet as a source of lay knowledge for molecular biology and genetic diseases: a feasibility study. Studies in Health Technology and Informatics, pp. 379-384.

Burgun, A. and Bodenreider, O. 2001. Comparing terms, concepts and semantic classes in WordNet and the Unified Medical Language System. In: NAACL Workshop on WordNet and Other Lexical Resources, pp. 77-82.

Cambria, E., Hussain, A., Havasi, C. and Eckl, C. 2010. SenticSpace: Visualizing opinions and sentiments in a multi-dimensional vector space. In: LNAI, vol. 6279, pp. 385-393.

Cambria, E., Fu, J., Bisio, F. and Poria, S. 2015. AffectiveSpace 2: Enabling affective intuition for concept-level sentiment analysis. In: AAAI, pp. 508-514, Austin.

Cambria, E., Poria, S. and Schuller, B. 2016. SenticNet 4: A semantic resource for sentiment analysis based on conceptual primitives. In: AAAI, Phoenix.

Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge.

Hogenboom, F., Frasincar, F., Kaymak, U. and de Jong, F. 2011. An overview of event extraction from text. In: DeRiVE Workshop, Bonn.

Miller, G. A. 1995. WordNet: a lexical database for English. Communications of the ACM, pp. 39-41.

Mondal, A., Chaturvedi, I., Bajpai, R., Das, D. and Bandyopadhyay, S. 2015. Lexical Resource for Medical Events: A Polarity Based Approach. In: IEEE 15th International Conference on Data Mining Workshops, Atlantic City.

Patel, V. L., Arocha, J. F. and Kushniruk, A. 2002. Patients' and physicians' understanding of health and biomedical concepts: relationship to the design of EMR systems. Journal of Biomedical Informatics, 35(1), pp. 8-16.

Pustejovsky, J. 1995. The Generative Lexicon. MIT Press, Cambridge.

Smith, B. and Fellbaum, C. 2004. Medical WordNet: A New Methodology for the Construction and Validation of Information Resources for Consumer Health. In: Coling, Geneva, pp. 31-38.

Smith, B. and Rosse, C. 2004. The role of foundational relations in the alignment of biomedical ontologies. In: Medinfo, San Francisco.

Toumouh, A., Lehireche, A., Widdows, D. and Malki, M. 2006. Adapting WordNet to the Medical Domain using Lexicosyntactic Patterns in the Ohsumed Corpus. In: IEEE/ACS International Conference on Computer Systems and Applications (AICCSA).

Tse, A. Y. 2003. Identifying and characterizing a consumer medical vocabulary. Doctoral dissertation, College of Information Studies, University of Maryland, College Park.

UzZaman, N. and Allen, J. F. 2010. Extracting Events and Temporal Expressions from Text. In: Proceedings of the IEEE International Conference on Semantic Computing.

Zeng, Q., Kogan, S., Ash, N., Greenes, R. A. and Boxwala, A. A. 2003. Characteristics of consumer terminology for health information retrieval: A formal study of use of a health information service. Methods of Information in Medicine.
Mapping and Generating Classifiers using an Open Chinese Ontology

Luis Morgado da Costa,♠ Francis Bond♠ and Helena Gao♦
♠ Linguistics and Multilingual Studies, ♦ Chinese
Nanyang Technological University, Singapore
<[email protected], [email protected], [email protected]>
Abstract

In languages such as Chinese, classifiers (CLs) play a central role in the quantification of noun phrases. This can be a problem when generating text from input that does not specify the classifier, as in machine translation (MT) from English to Chinese. Many solutions to this problem rely on dictionaries of noun-CL pairs. However, there is no open large-scale machine-tractable dictionary of noun-CL associations. Many published resources exist, but they tend to focus on how a CL is used (e.g. what kinds of nouns can be used with it, or what features seem to be selected by each CL). In fact, since nouns are an open class, producing an exhaustive definitive list of noun-CL associations is not possible, as it would quickly get out of date. Our work tries to address this problem by providing an algorithm for the automatic construction of a frequency-based dictionary of noun-CL pairs, mapped to concepts in the Chinese Open Wordnet (Wang and Bond, 2013), an open machine-tractable dictionary for Chinese. All results will be released under an open license.

1 Introduction

Classifiers (CLs) are an important part of the Chinese language. Different scholars treat this class of words very differently. Chao (1965), the traditional and authoritative native Chinese grammar, splits CLs into nine different classes. Cheng and Sybesma (1998) draw a binary distinction between count-classifiers and massifiers. Erbaugh (2002) splits CLs into three categories (measure, collective and sortal classifiers): measure classifiers describe quantities (e.g. 'a bottle of', 'a mouthful of'), collective classifiers describe arrangements of objects ('a row of', 'a bunch of'), and sortal classifiers refer to a particular noun category (which can be defined, for example, by shape). Huang et al. (1997) identify four main classes: individual classifiers, mass classifiers, kind classifiers, and event classifiers. And Bond and Paik (2000) define five major types of CLs: sortal (which classify the kind of the noun phrase they quantify); event (which are used to quantify events); mensural (which are used to measure the amount of some property); group (which refer to a collection of members); and taxonomic (which force the noun phrase to be interpreted as a generic kind). This enumeration is far from complete, and Lai (2011) provides a detailed literature review of the most prominent views on Chinese classifiers.

Most languages make use of some of these classes (e.g. most languages have measure CLs, as in a kilo of coffee, or group CLs, as in a school of fish). What appears to be specific to some languages (e.g. Chinese, Japanese, Thai, etc.) is a class of CLs (sortal classifiers: S-CLs) that exhibits a selective association between quantifying morphemes and specific nouns. This association is licensed by a number of features (e.g. physical, functional, etc.) that are shared between CLs and the nouns they can quantify, and these morphemes add little (but redundancy) to the semantics of the noun phrase they quantify.

Consider the following examples of S-CL usage in Mandarin Chinese:
(1) 两 只 狗
    liǎng zhī gǒu
    2 CL dog
    "two dogs"

(2) 两 条 狗
    liǎng tiáo gǒu
    2 CL dog
    "two dogs"

(3) 两 条 路
    liǎng tiáo lù
    2 CL road
    "two roads"

(4) 三 台 电脑
    sān tái diànnǎo
    3 CL computer
    "three computers"

(5) *三 只 电脑
    sān zhī diànnǎo
    3 CL computer
    "three computers"

Examples (1) through (4) show how the simple act of counting in Mandarin Chinese involves pairing up nouns with specific classifiers; if incompatible nouns and classifiers are put together, the noun phrase is infelicitous, see (5).

Different S-CLs can be used to quantify the same noun, see (1) and (2), and the same S-CL can be used with many different nouns, so long as the semantic features are compatible between the S-CL and the noun, see (2) and (3). Extensive work on these features is provided by Gao (2010), where more than 800 classifiers (both sortal and non-sortal) are linked in a database according to the nominal features they select, but with only a few example nouns that can be quantified by each CL. These many-to-one selective associations are hard to keep track of, especially since they depend greatly on context, which often restricts or coerces the sense in which the noun is being used (Huang et al., 1998).

(6) 一 个 木头
    yī ge mùtou
    1 CL log (of wood) / blockhead
    "a log / blockhead"

(7) 一 位 木头
    yī wèi mùtou
    1 CL blockhead
    "a blockhead"

(8) 一 根 木头
    yī gēn mùtou
    1 CL log (of wood)
    "a log"

Examples (6-8) show how the use of different CLs with ambiguous senses can help resolve this ambiguity. In (6), we can see that with the use of 个 ge, the most general S-CL in Mandarin Chinese, 木头 mùtou is ambiguous, because this CL does not restrict the noun's semantic features. With the use of 位 wèi (7), an honorific S-CL used almost exclusively with people, it can only be interpreted as "blockhead". And the reverse happens when using 根 gēn (8), an S-CL for long, slender, inanimate objects: the log (of wood) sense of 木头 mùtou is selected.

Even though written resources concerning CLs are abundant, they are not machine tractable, and their usage is limited by copyright. Natural Language Processing (NLP) tasks depend heavily on open, machine-tractable resources. Wordnets (WNs) are a good example of the joint efforts to develop machine-tractable dictionaries, linked in rich hierarchies. Resources like WNs play a central role in many NLP tasks (e.g. Word Sense Disambiguation, Question Answering, etc.).

Huang et al. (1998) argue that the integration of corpora and knowledge-rich resources, like dictionaries, can offer good insights and generalizations on linguistic knowledge. In this paper, we follow the same line of thought by integrating both a large collection of Chinese corpora and a knowledge-rich resource (the Chinese Open Wordnet: COW (Wang and Bond, 2013)). COW is a large, open, machine-tractable Chinese semantic ontology, but it lacks information on noun-CL associations. We believe that enriching this resource with concept-CL links will increase the domain of its applicability. Information about CLs could be used to generate CLs in MT tasks, or even to improve Chinese Word Sense Disambiguation.

The remainder of this paper is structured as follows: Section 2 presents related work, followed by a description of the resources used in Section 3; Section 4 describes the algorithms applied, and Section 5 presents and discusses our results; Section 6 describes ongoing and future work; and Section 7 presents our conclusion.

2 Related Work

Mapping CLs to semantic ontologies has been attempted in the past (Sornlertlamvanich et al., 1994; Bond and Paik, 2000; Paik and Bond, 2001; Mok et al., 2012). Sornlertlamvanich et al. (1994) is the first description of leveraging hierarchical semantic classes to generalize noun-CL pairs (in Thai). Still, their contribution was mainly theoretical, as it failed to report on the performance of their algorithm. Bond and Paik (2000) and Paik and Bond (2001) further developed these ideas into similar works for Japanese and Korean. In their work, CLs are assigned to semantic classes by hand, achieving up to 81% generation accuracy by propagating CLs down the semantic classes of Goi-Taikei (Ikehara et al., 1997). Mok et al. (2012) developed a similar approach using the Japanese Wordnet (Isahara et al., 2008) and the Chinese Bilingual Wordnet (Huang et al., 2004), and report generation scores of 78.8% and 89.8% for Chinese and Japanese, respectively, on a small news corpus.

As is common in dictionary building, all the works mentioned made use of corpora to identify and extract CLs. Nevertheless, extracting noun-CL associations from corpora is not a straightforward task. Quantifier phrases are often used without a noun, resorting to anaphoric or deictic references to what is being quantified (Bond and Paik, 2000). Similarly, synecdoches also generate noise when pattern matching (Mok et al., 2012).

3 Resources

Our corpus joins data from three sources: the latest dump of the Chinese Wikipedia, the second version of the Chinese Gigaword (Graff et al., 2005) and the UM-Corpus (Tian et al., 2014). This data was cleaned, sentence delimited and converted to simplified Chinese script. It was further preprocessed using the Stanford Segmenter and POS tagger (Chang et al., 2008; Tseng et al., 2005; Toutanova et al., 2003). The final version of this corpus has over 30 million sentences (950 million words). For comparison, the largest reported corpus from previous studies contained 38,000 sentences (Mok et al., 2012). In addition, we also used the latest version (2012) of the Google Ngram corpus for Chinese (Michel et al., 2011).

There are some differences in the usage of classifiers across the dialects and variants of Chinese found in these corpora, but our current goal was to collect generalizations. Future work could single out differences across dialects and variants.

We used COW (Wang and Bond, 2013) as our lexical ontology, which shares the structure of the Princeton Wordnet (PWN) (Fellbaum, 1998). To minimize coverage issues, we enriched it with data from the Bilingual Ontological Wordnet (BOW) (Huang et al., 2004), the Southeast University Wordnet (SEW) (Xu et al., 2008), and automatically collected data from Wiktionary and CLDR, made available by the Extended OMW (Bond and Foster, 2013). The final version of this resource had information for over 261k nominal lemmas, of which over 184k were unambiguous (i.e. have only a single sense).

We filtered all CLs against a list of 204 S-CLs provided by Huang et al. (1997). Following Lai (2011), we treated both Huang's individual classifiers and event classifiers as S-CLs.

4 Our Algorithm

Our algorithm produces two CL dictionaries with frequency information: a lemma based dictionary, and a concept based dictionary using COW's extended ontology. We tested both dictionaries with a generation task, automatically validated against a held out portion of the corpus.

4.1 Extracting Classifier-Noun Pairs

Extracting CL-noun pairs is done by matching POS patterns against the training section of our corpus. To avoid, as much as possible, noise in the extracted data, we chose to take advantage of our large corpus and apply restrictive pattern variations of the basic form: (determiner or numeral) + (CL) + (noun) + (end of sentence punctuation / select conjunctions). Our patterns ensure that no long dependencies exist after the CL, and try to maximally reduce the noise introduced by anaphoric, deictic or synecdochic uses of classifiers (Mok et al., 2012). Variations of this pattern were also included to cover different segmentations produced by the preprocessing tools.

If an extracted CL matches the list of S-CLs, we include the noun-CL pair in the lemma based dictionary. The frequency with which a specific noun-CL pair is seen in the corpus is also stored, showing the strength of the association.

Extracting noun-CL pairs from the Chinese Google NGram corpus required special treatment. We used the available 4-gram version of this corpus to match a pattern (and variations) similar to the one mentioned above: (determiner or numeral) + (CL) + (X) + (end of sentence punctuation / select conjunctions). Given that we had no POS information available for the NGram corpus, we used regular expression matching, listing common determiners, numerals, punctuation, and our list of 204 S-CLs. We did not restrict the third gram. We also transferred the frequency information provided for matched ngrams to our lemma based dictionary.

Our training set included 80% of the text portion of the corpus, from which we extracted over 435k tokens of noun-CL associations, along with the full Chinese Google NGram corpus, from which we extracted 13.5 million tokens of noun-CL associations.

This lemma based dictionary contained, for example, 59 noun-CL pairs containing the lemma 类别 lèibié "category". It occurred 58 times with the CL 个 ge, and once with the CL 项 xiàng. Despite the large difference in frequencies, both CLs can be used with this lemma. Another example, where the relevance of the frequency becomes evident, is the word 养鸡场 yǎngjīchǎng "chicken farm", which was seen in our corpus 12 times: 6 times with the CL 个 ge, 3 times with the CL 家 jiā, twice with the CL 只 zhī, and once with the CL 座 zuò. Chinese native speaker judgments identified three of the four extracted CLs as correct (个 ge, 家 jiā and 座 zuò). In addition, two other classifiers would also be possible: 间 jiān and 所 suǒ. This second example shows that while the automatic matching process is still somewhat noisy, and incomplete, the frequency information can help to filter out ungrammatical examples. When used to generate a classifier, our lemma based dictionary can use the frequency information stored for each identified CL for a particular lemma, and choose the most frequent CL, which increases the likelihood of producing a valid CL. Also, by setting a minimum frequency threshold with which a noun-CL pair must be seen before being added to the dictionary, we can exchange coverage for precision.

4.2 Concept Based Dictionary

The concept based dictionary is created by mapping and expanding the lemma based dictionary onto COW's expanded concept hierarchy. Since ambiguous lemmas can, in principle, use different CLs depending on their sense, we map only unambiguous lemmas (i.e. those that belong to a single concept). This way, each unambiguous entry from the lemma based dictionary matching COW contributes information to a single concept. Frequency information and possible CLs are collected for each matched sense. The resulting concept-based mapping, for each concept, is the union of the CLs of each unambiguous lemma, along with the sum of their frequencies.

Following one of the examples above, the lemma 类别 lèibié was unambiguously mapped to the concept 05838765-n, defined as "a general concept that marks divisions or coordinations in a conceptual scheme". This concept provides two other synonyms: 范畴 fànchóu and 种类 zhǒnglèi. In the concept based dictionary, the concept 05838765-n aggregates the information provided by all its unambiguous senses. This results in a frequency count of 132 for the CL 个 ge, and of 2 for 项 xiàng (both valid uses).

As has been shown in previous works, semantic ontologies should, in principle, be able to simulate the taxonomic feature hierarchy that links nouns and CLs. We use this to further expand the concept based dictionary of CLs.

For each concept that did not receive a classifier, we collect information concerning ten levels of hypernymy and hyponymy around it. If any hypernym-hyponym pair is associated with the same CL, we assign this CL to the current concept. Since we are interested in the task of generating the best (or most common) CL, we rank the CLs inside these expanded concepts by summing the frequencies of all hypernyms and hyponyms that share the same CL. If more than one CL can be assigned this way, we assign them all.

Figure 1 exemplifies this expansion. While concepts A, B and C did not get classifiers directly assigned to them, they are still assigned one or more classifiers based on their place in the concept hierarchy. For every concept that did not receive any CL information, if it has at least a hypernym and a hyponym sharing a CL (within a distance of 10 jumps), then it inherits this CL and the sum of their frequencies. Assuming a full concept hierarchy is represented in Figure 1, concept A would inherit two classifiers, and concepts B and C would inherit one each.

Figure 1: Classifier Expansion

This expansion provides extra coverage to the concept based dictionary. But we differ from previous works in that we do not blindly assign CLs down the concept hierarchy, making the assignment depend on previously extracted information for both hypernyms and hyponyms. By following this stricter approach, we hope to provide results of better quality.

4.3 Automatic Evaluation

We evaluated both the lemma and concept based dictionaries on two tasks: predicting the validity of CLs, and generating CLs. We used roughly 10% of held out data (dev-set), from which we extracted about 37.4k tokens of noun-CL pairs, as described in 4.1. We used this data to evaluate the prediction and generation capabilities of both dictionaries in the following ways: predicting the validity of a CL was measured by comparing every noun-CL pair extracted from the dev-set to the data contained in the dictionary for that particular lemma (i.e. whether that particular classifier was already predicted by the dictionary); generation was measured by selecting the most likely classifier, based on the cumulative frequencies of noun-CL pairs in the dictionary (i.e. whether the classifier seen in the example matched the most frequent classifier). This was done separately for both dictionaries.

When no other classifier had been assigned, we used 个 ge, the most frequent CL in the corpus, as the default classifier. A baseline was established by assigning 个 ge as the only CL for every entry.

The dev-set was used to experiment with different minimum frequency thresholds (τ), from one to five, with which noun-CL pairs would have to be seen in the train-set in order to be considered for the dictionaries. These different minimum frequency thresholds were compared between both tasks.

The best performing τ was then tested on a second held-out set of data (test-set), also containing roughly 10% of the text corpus, with roughly 39.9k tokens of noun-CL pairs. The test-set is used to report our final results.

The results are presented in Table 1, and are discussed in the following section.

                     τ=1    τ=3    τ=5    Test
baseline             44.2   44.2   44.2   40.4
All lemmas
  lem-all            92.7   88.5   86.2   93.6
  lem-all-mfcl       75.1   73.8   72.8   78.9
  lem-all-no-info     4.7    9.2   12.1    4.1
Unamb. lemmas
  lem-unamb          93.2   88.2   85.5   94.5
  wn-unamb           95.1   90.9   88.3   95.9
  lem-unamb-mfcl     77.0   75.5   74.1   77.9
  wn-unamb-mfcl      72.3   71.6   70.7   73.5
  lem-unamb-no-info   3.4    9.5   13.6    2.8
  wn-unamb-no-info    1.7    5.3    8.3    1.5
Coverage
  lemmas-w/cl        32.4k  10.4k   7.0k
  wn-concepts-w/cl   22.7k  15.0k  12.3k

Table 1: Automatic Evaluation Results
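The lemma-based prediction and generation procedures described above can be sketched as follows. The counts for 类别 lèibié are the ones reported in Section 4.1; the dictionary object and helper names are illustrative, not the released resource.

```python
from collections import Counter

# Noun-CL pair counts as extracted from the corpus (lemma based dictionary).
# The entry for 类别 "category" uses the frequencies reported in the paper.
lemma_dict = {
    "类别": Counter({"个": 58, "项": 1}),
}

DEFAULT_CL = "个"  # the most frequent CL in the corpus, used as fallback

def predict(noun, cl, tau=1):
    """Prediction task: is `cl` a known classifier for `noun`,
    seen at least `tau` times in the training data?"""
    return lemma_dict.get(noun, Counter())[cl] >= tau

def generate(noun, tau=1):
    """Generation task: return the most frequent classifier for `noun`,
    falling back to the default CL when nothing above threshold is known."""
    counts = {c: f for c, f in lemma_dict.get(noun, Counter()).items() if f >= tau}
    return max(counts, key=counts.get) if counts else DEFAULT_CL

print(generate("类别"))              # 个 (58 of 59 occurrences)
print(predict("类别", "项"))         # True: seen once, valid at tau=1
print(predict("类别", "项", tau=3))  # False: filtered out at tau=3
```

Raising `tau` prunes rare (often noisy) pairs before lookup, which is the precision/coverage trade-off discussed above.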
5 Discussion and Results
In Table 1 we can first note that the baseline of consistently assigning 个 ge to every entry in the dictionary is fairly high, at roughly 40%.
In order to allow a fair comparison, since we
decided that the concept based dictionary would
contain only unambiguous lemmas, we only use
unambiguous lemmas to compare the performance
across dictionaries. All results can be compared
across the different thresholds discussed in 4.3.
τ = 1, 3 and 5 present the results obtained in the
automatic evaluation, using minimum frequencies
of one, three and five, respectively.
The first three rows report exclusively on the lemma dictionary (including both ambiguous and unambiguous lemmas): lem-all reports the results of the prediction task, lem-all-mfcl reports the results of the generation task, and lem-all-no-info reports the relative frequency of lemmas for which there was no previous information in the dictionary, and for which both tasks' performance could have been boosted by falling back on the default CL 个 ge.
These initial results show that it was easy to perform better than baseline, and that τ = 1 achieved
the best results on both predicting noun-CL pairs,
and generating CLs that matched the data.
Comparing different τs shows that, even considering the reduction in over-generation that imposing minimum frequencies brings (validated but not presented here), the best generation performance is achieved by not filtering the training data. This is consistent across the remainder of the results.
When comparing both dictionaries, we look
only at unambiguous lemmas. Similar to what was
explained above, lem-unamb and wn-unamb report
the results of the prediction task for the lemma
based and concept based dictionary, respectively.
The labels lem-unamb-mfcl and wn-unamb-mfcl report the results for the generation task, and lem-unamb-no-info and wn-unamb-no-info report the lack of informed coverage (where backing off to the default CL might have helped performance).
Between the lemma and the concept based dictionaries, this automatic evaluation shows that while the concept based dictionary is better at predicting whether a noun-CL pair is valid, the lemma based dictionary outperforms it in the generation task.
The final results of this automatic evaluation are
shown in column Test, where we re-evaluated the
dictionary produced by τ = 1 on the test-set. Test
shows slightly better results, perhaps because the
random sample was easier than the dev-set, but the
same tendencies as reported above.
Considering that the concept based dictionary should be able to provide CL information for lemmas that have not been seen in the training data (either by expansion or by leveraging a single lemma to provide information about its synonyms), we expected the concept based dictionary to present the best results.
Many different reasons could be influencing
these results, such as errors in the ontology, the
fact that Chinese CLs relate better to specific
senses than to concepts (i.e. different lemmas inside a concept prefer different CLs), or noise introduced by the test and dev-set (since we don’t
have a hand curated golden test-set). For this reason, we decided to hand validate a sample of each dictionary.
Based on a random sample of 100 concepts and 100 lemmas extracted from each dictionary, a native Chinese speaker checked whether the top ranked CL (i.e. the one with the highest frequency), which would be used to generate a CL for each of the randomly selected entries, was in fact a valid CL for that lemma or
concept. This human validation showed the concept based dictionary outperforming the lemma
based dictionary by a wide margin: 87% versus
76% valid guesses. This inversion of performance,
when compared to the automatic evaluation, was
confirmed to be mainly due to noisy data in the
test-set caused by the automatic segmentation and
POS tagging.
We then looked at a bigger sample of 200 lemmas and found roughly 7.5% invalid lemmas in the lemma based dictionary. In contrast, the concept based dictionary assigns CLs to 'bags of lemmas' (i.e. synsets), which allows noise introduced by a few senses to be attenuated by the 'bag'
nature of the concept. More importantly, most of
the nominal lemmas included in the extended version of COW are human validated, so the quality
of the concept based dictionary was confirmed to
be better – since most lemmas included in it are
attested to be valid.
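The 'bag of lemmas' attenuation can be illustrated with a toy example: pooling noun-CL frequencies over all lemmas in a synset dilutes noise attached to any single lemma. The synset contents and counts below are invented for illustration, not COW data:

```python
from collections import Counter

# Toy data: a concept id mapped to its synonym lemmas, and per-lemma CL counts
synsets = {"c1": {"轿车", "汽车"}}
lemma_counts = {"轿车": Counter({"辆": 8, "个": 1}),   # one noisy 个 count
                "汽车": Counter({"辆": 5})}

def concept_counts(concept, synsets, lemma_counts):
    """Pool noun-CL frequencies over every lemma in the synset, so noise
    attached to one lemma is attenuated by its synonyms."""
    total = Counter()
    for lemma in synsets[concept]:
        total += lemma_counts.get(lemma, Counter())
    return total
```

Any lemma in the synset, even one unseen in training, can then inherit the pooled concept-level counts.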
Comparing the sizes of both dictionaries in Table 1, even though the τ = 1 lemma based dictionary is considerably larger (32.4k entries compared to 22.5k in the concept based dictionary), we have shown that noise is a problem for the lemma based approach. Also, since the extended COW has, on average, 2.25 senses per concept, the concept based dictionary provides CL information for over 50.6k lemmas. Comparing the sizes of both dictionaries across τs also effectively verifies the potential of the expansion step, which is possible only for the concept based dictionary: as τ increases, the size of the concept based dictionary increases relative to the lemma based one. When applied to other tasks where noise reduction plays a more important role (which can be achieved by raising τ), the concept based dictionary is able to produce more informed decisions with less data.
Lastly, coverage was also tested against data
from a human curated database of noun-CL associations (Gao, 2014), by replicating the automatic
evaluation generation task described in 4.3. This
dictionary contains information about more than
800 CLs and provides a few hand-selected examples for each CL; it was hence not designed with the same mindset as ours. Testing the best performing dictionaries (τ = 1) against the data provided for S-CLs, we achieved only 43.9% and 28.3% for prediction and generation, respectively, using the lemma based dictionary, compared to 49.8% and 22.4% using the concept based dictionary.
The same trends in prediction and generation are observed: the concept based dictionary is able to predict better than the lemma based one, but it is outperformed by the latter in the generation task. Ultimately, these weak results show that even though we used a very large quantity of data, our restrictive matching patterns, in conjunction with infrequent noun-CL pairs, still leave a long tail of difficult predictions.
6 Conclusions
Our work shows that it is possible to create a high quality dictionary of noun-CLs, with generation capabilities, by extracting frequency information from large corpora. We compared a lemma based and a concept based approach, and our best results show a human-validated performance of 87% on the generation of classifiers using a concept based dictionary. This is roughly a 9% improvement over the only other known work on Chinese CL generation using wordnet (Mok et al., 2012).
Finally, we will merge all three data sets and, from them, produce a release of this data. We commit to making both lemma and WN mappings available under an open license, released along with the Chinese Open Wordnet at http://compling.hss.ntu.edu.sg/cow/.
7 Ongoing and Future Work
Since our method is mostly language independent, we would like to replicate it with other classifier languages for which there are open linked WN resources (such as Japanese, Indonesian and Thai). This would require access to large amounts of segmented, POS-tagged text, and adapting the matching expressions for extracting noun-CL pairs.
More training data would not only help improve overall performance on open data by minimizing unseen data, but would also allow us to make better use of frequency threshold filters for noise reduction. Since lack of training data is our biggest drawback on performance, we would like to repeat this experiment with more data, including, for example, a very large web-crawled corpus.
In addition, we would also like to perform WSD
on the training set, using UKB (Agirre and Soroa,
2009) for example. This would allow an informed
mapping of ambiguous senses onto the semantic ontology and, arguably, comparable performance on generating CLs for ambiguous lemmas.
We will also investigate further how to deal with
words not in COW: first looking them up in the
lemma dictionary, and then associating CLs to the
head (character / noun) of unseen noun-phrases, as
proposed in Bond and Paik (2000).
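The lookup cascade just described (lemma dictionary first, then the head of the unseen noun phrase, then the default CL) might be sketched as follows. Treating the final character as the head is a simplification of ours, and the data is illustrative:

```python
def classifier_with_backoff(phrase, lemma_dict, default="个"):
    """Look the phrase up in the lemma dictionary; if it is unseen, back off
    to its head (taken here, simplistically, as the final character), in the
    spirit of Bond and Paik (2000); finally fall back on the default CL."""
    for key in (phrase, phrase[-1]):
        cls = lemma_dict.get(key)
        if cls:
            return max(cls, key=cls.get)
    return default
```

A real implementation would need proper head detection for multi-character heads rather than the single-character shortcut used here.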
Even though this work was mainly focused on producing an external resource linked to COW, we are also investigating adding a new set of sortal classifier concepts to COW. The absence of this class of words in COW currently prevents us from using the internal ontology structure to link nouns and classifiers. Once classifiers are represented as concepts in this lexical ontology, we will make use of this work to link nominal concepts and corresponding valid classifiers.
8 Acknowledgments
This research was supported in part by the MOE
Tier 2 grant That’s what you meant: a Rich Representation for Manipulation of Meaning (MOE
ARC41/13).
References
Eneko Agirre and Aitor Soroa. 2009. Personalizing pagerank for word sense disambiguation. In
Proceedings of the 12th Conference of the European Chapter of the Association for Computational
Linguistics, pages 33–41. Association for Computational Linguistics.
Francis Bond and Ryan Foster. 2013. Linking and extending an open multilingual wordnet. In Proceedings of the 51st Annual Meeting of the Association
for Computational Linguistics, pages 1352–1362,
Sofia, Bulgaria. Association for Computational Linguistics.
Francis Bond and Kyonghee Paik. 2000. Reusing an
ontology to generate numeral classifiers. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics, pages 90–96.
Pi-Chuan Chang, Michel Galley, and Christopher D. Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation, StatMT '08, pages 224–232, Stroudsburg, PA, USA. Association for Computational Linguistics.
Wan-chun Lai. 2011. Identifying True Classifiers in Mandarin Chinese. Master's thesis, National Chengchi University, Taiwan.
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser
Aiden, Adrian Veres, Matthew K Gray, Joseph P
Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig,
Jon Orwant, et al. 2011. Quantitative analysis of
culture using millions of digitized books. Science
14 January 2011, 331(6014):176–182.
Y.R. Chao. 1965. A Grammar of Spoken Chinese.
University of California Press.
Lisa Lai-Shen Cheng and Rint Sybesma. 1998. Yi-wan
tang, yi-ge tang: Classifiers and massifiers. Tsing
Hua journal of Chinese studies, 28(3):385–412.
Hazel Mok, Eshley Gao, and Francis Bond. 2012.
Generating numeral classifiers in Chinese and
Japanese. In Proceedings of the 6th Global WordNet Conference (GWC 2012), Matsue. 211-218.
Mary S Erbaugh. 2002. Classifiers are for specification: Complementary functions for sortal and general classifiers in Cantonese and Mandarin. Cahiers
de linguistique-Asie orientale, 31(1):33–69.
Kyonghee Paik and Francis Bond. 2001. Multilingual
generation of numeral classifiers using a common
ontology. In 19th International Conference on Computer Processing of Oriental Languages: ICCPOL2001, Seoul. 141–147.
Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.
Helena Gao. 2010. Computational lexicography: A
feature-based approach in designing an e-dictionary
of Chinese classifiers. In Proceedings of the 2nd
Workshop on Cognitive Aspects of the Lexicon,
pages 56–65. Coling 2010.
Virach Sornlertlamvanich, Wantanee Pantachat, and
Surapant Meknavin. 1994. Classifier assignment by
corpus-based approach. In Proceedings of the 15th
conference on Computational linguistics-Volume 1,
pages 556–561. Association for Computational Linguistics.
Helena Gao. 2014. Database design of an online elearning tool of Chinese classifiers. In Proceedings
of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex), pages 126–137.
Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo
Quaresma, Francisco Oliveira, and Lu Yi. 2014.
UM-Corpus: A large English-Chinese parallel corpus for statistical machine translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).
David Graff, Ke Chen, Junbo Kong, and Kazuaki
Maeda. 2005. Chinese Gigaword Second Edition LDC2005T14. Web Download. Linguistic Data
Consortium.
Chu-Ren Huang, Keh-Jiann Chen, and Ching-Hsiung
Lai, editors. 1997. Mandarin Daily Dictionary of
Chinese Classifiers. Mandarin Daily Press, Taipei.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-ofspeech tagging with a cyclic dependency network.
In Proceedings of the NAACL HLT 2003 2003 - Volume 1, NAACL ’03, pages 173–180, Stroudsburg,
PA, USA. Association for Computational Linguistics.
Chu-Ren Huang, Keh-jiann Chen, and Zhao-ming Gao.
1998. Noun class extraction from a corpus-based
collocation dictionary: An integration of computational and qualitative approaches. Quantitative and
Computational Studies of Chinese Linguistics, pages
339–352.
Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel
Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for sighan bakeoff 2005. In Proceedings of the Fourth SIGHAN
Workshop on Chinese Language Processing, pages
168–171.
Chu-Ren Huang, Ru-Yng Chang, and Hshiang-Pin
Lee. 2004. Sinica BOW (Bilingual Ontological Wordnet): Integration of bilingual wordnet and
sumo. In Proceedings of the Fourth International
Conference on Language Resources and Evaluation
(LREC’04), pages 825–826. European Language
Resources Association (ELRA).
Shan Wang and Francis Bond. 2013. Building the
Chinese Open Wordnet (COW): Starting from core
synsets. In Proceedings of the 11th Workshop on
Asian Language Resources, a Workshop at IJCNLP2013, pages 10–18, Nagoya.
Satoru Ikehara, Masahiro Miyazaki, Satoshi Shirai,
Akio Yokoo, Hiromi Nakaiwa, Kentaro Ogura,
Yoshifumi Ooyama, and Yoshihiko Hayashi. 1997.
Goi-Taikei — A Japanese Lexicon. Iwanami Shoten,
Tokyo. 5 volumes/CDROM.
Renjie Xu, Zhiqiang Gao, Yingji Pan, Yuzhong Qu, and
Zhisheng Huang. 2008. An integrated approach for
automatic construction of bilingual Chinese-English
wordnet. In John Domingue and Chutiporn Anutariya, editors, The Semantic Web, volume 5367 of
Lecture Notes in Computer Science, pages 302–314.
Springer Berlin Heidelberg.
Hitoshi Isahara, Francis Bond, Kiyotaka Uchimoto,
Masao Utiyama, and Kyoko Kanzaki. 2008. Development of the Japanese WordNet. In Sixth International conference on Language Resources and
Evaluation (LREC 2008), Marrakech.
IndoWordNet Conversion to Web Ontology Language (OWL)
Apurva Nagvenkar
DCST, Goa University
Jyoti Pawar
DCST, Goa University
Pushpak Bhattacharyya
CSE, IIT Bombay
[email protected]
[email protected]
[email protected]
Abstract

WordNet plays a significant role in the Linked Open Data (LOD) cloud. It has numerous applications, ranging from ontology annotation to ontology mapping. IndoWordNet is a linked WordNet connecting 18 Indian language WordNets with Hindi as the source WordNet. The Hindi WordNet was initially developed by linking it to English WordNet.

In this paper, we present a data representation of IndoWordNet in Web Ontology Language (OWL). The schema of Princeton WordNet has been enhanced to support the representation of IndoWordNet. This IndoWordNet representation in OWL format is now available to link other web resources. The representation is implemented for eight Indian languages.

1 Introduction

The World Wide Web (WWW) has brought about a revolution in data availability: there is no other place in the world where we can find so much information, but the current web structure fails to make the best of it. The user can access limitless data from the web, yet it is a tedious task to retrieve relevant information. Data available on the Web covers diverse structures, formats and content. It also lacks a uniform organization scheme that would allow easy access to data and information (Candan et al., 2001). Many frameworks have been proposed to support search engines and information access. Resource Description Framework1 (RDF) and Web Ontology Language2 (OWL) are frameworks which provide a platform for the standardization and organization of data from the Web; they have been highly influenced by the web standards community.

WordNet (Fellbaum, 1998) is a lexical knowledge base that has been adopted by the Semantic Web research community. The current essential need is to link WordNet with different resources in order to assist Natural Language Processing applications. IndoWordNet (Bhattacharyya, 2010) is an Indian community effort which builds WordNets for Indian languages. It is a multilingual WordNet which links the WordNets of different Indian languages through a common identification number, called a synset id, given to each concept (Bhattacharyya, 2010). It is constructed using the expansion model, where Hindi WordNet synsets are taken as a source. The concepts provided along with the Hindi synsets are first conceived, and appropriate concepts in the target language are manually provided by the language experts. Table 1 shows the statistics of the Indradhanush Consortium, which consists of seven Indian languages belonging to the Indo-Aryan family and is part of the IndoWordNet Consortium.

Language   Noun    Verb   Adjective   Adverb   Total
Bengali    27281   2804   5815        445      36346
Gujarati   26503   2805   5828        445      35599
Hindi      29106   3306   6178        482      39072
Kashmiri   21041   2660   5365        400      29469
Konkani    23144   3000   5744        482      32370
Odia       27216   2418   5273        377      35284
Punjabi    23255   2836   5830        443      32364
Urdu       22990   2801   5786        443      34280

Table 1: POS wise statistics for Indradhanush

To use WordNet in the Semantic Web, the data model for WordNet should be extensible, interoperable and flexible. WordNet was created as a semantic network of word meanings, which at the conceptual level is a directed graph with labeled nodes and arcs (Graves and Gutierrez, 2006). Hence, OWL can be used to model WordNet, since it facilitates data manipulations and queries over the graph structure. The main objective of this paper is to represent IndoWordNet in OWL.

1 http://www.w3.org/RDF
2 http://www.w3.org/TR/owl-features
The rest of the paper is organized as follows: Section 2 describes related work; Section 3 introduces the Semantic Web Layer Cake Model; Section 4 presents the architecture of IndoWordNet OWL; Section 5 gives the implementation details, followed by the conclusion and future work.
2 Related Work
WordNets for languages other than the Indian ones are already available in RDF form. The conversion of Princeton WordNet (Assem et al., 2006) to RDF/OWL was carried out by the WordNet Task Force3. The main goal of this conversion was to represent a language in use by the Semantic Web community and to provide application developers with a resource. The representation was also done in such a way that it maintained WordNet's conceptual model.
There are other projects focusing on lexical meta-models, such as the Lexical Markup Framework (LMF) (Francopoulo et al., 2009). IndoWordNet is already available in this format through IndoNet (Bhatt et al., 2013), which proposes modifications to LMF to integrate the Universal Word Dictionary (Uchida et al., 1999) and the Suggested Upper Merged Ontology (SUMO) (Pease et al., 2002).
[Figure 1: Semantic web layer cake model]

3 Semantic Web Layer Cake Model

The Semantic Web is not a separate web but a vision for the future of the Web in which information is given explicit meaning, making it easier for machines to automatically process and integrate the information available on the web. OWL is part of the growing stack of W3C recommendations related to the semantic web (McGuinness and Harmelen, 2004).

Figure 1 shows the semantic web layer cake model (Hendler, 2001). This model is divided into three sections:

1. Hypertext Web technologies: The bottom layer contains the technologies used by the hypertext web: Unicode, Uniform Resource Identifier (URI), XML and XML Schema. Unicode is used to represent and manipulate text in different languages. A URI represents a resource uniquely. XML provides the syntax for structured documents, but does not provide any meaning to the document. XML Schema restricts the structure of the document and extends XML with datatypes.

2. Standardized Semantic Web technologies: The middle layer contains technologies already standardized by the Semantic Web community: RDF, RDFS, OWL and SPARQL. RDF is a data model for representing triples, i.e. objects and the relationships between them. It provides simple semantics and is represented in XML syntax. RDF Schema can be viewed as an extensible, object-oriented type system based on RDF (Huang and Zhou, 2007). OWL is an envelope around the RDF schema that enriches its expressibility with further properties such as transitivity, symmetry, cardinality, etc.

3. Unrealized Semantic Web technologies: The top layer contains technologies such as digital signatures, trust and proof. These technologies are not yet standardized by the Semantic Web community and need to be implemented in order to realize the Semantic Web.
3 http://www.w3.org/TR/wordnet-rdf/

4 OWL for IndoWordNet

The architecture of the IndoWordNet OWL representation is adopted from the WordNet Task Force (Assem et al., 2006). The architecture of IndoWordNet OWL contains three main classes, i.e. Synset4, WordSense and Word5.
The schema for representing IndoWordNet6 using OWL is shown in Figure 2.

[Figure 2: IndoWordNet OWL schema]

The schema includes three layers, namely the Concept layer, the WordSense layer and the Word layer, which were previously described in (Huang and Zhou, 2007). Every synset has a unique concept and can have several words associated with it sharing that concept. A WordSense represents a unique sense of a word; a word may also have many WordSenses. The IndoWordNet OWL schema handles the relations by dividing them into two properties, i.e. the Semantic property and the Lexical property. The Semantic property represents the semantic relations, which are handled in the Concept layer, whereas the Lexical property represents the lexical relations, which are handled in the WordSense layer. All the remaining types of semantic and lexical relations become sub-properties of the Semantic and Lexical property. The schema uses several predicates7, i.e. properties. The IndoWordNet OWL schema elaborates semantic relationships like meronymy and holonymy by classifying them into sub-properties based on their attributes8, whereas in Princeton WordNet there is no such division.

In IndoWordNet OWL, the RDF files are organized in such a way that they can be managed systematically. Unlike (Assem et al., 2006), all the RDF files are placed in one directory. The URIs for IndoWordNet are formatted as follows:

• URI representation of a synset: http://nlp.unigoa.ac.in/indonet/owl/hindi/v1/synset/noun/24.rdf

• URI representation of a wordSense: http://nlp.unigoa.ac.in/indonet/owl/hindi/v1/wordSense/ł/noun/1930.rdf

• URI representation of a word: http://nlp.unigoa.ac.in/indonet/owl/hindi/v1/word/ł.rdf

5 Implementation Details

The IndoWordNet OWL is currently available for seven Indian languages. It is developed on the JAVA platform, using Apache Jena9 and the IndoWordNet Application Programming Interface (API). The above architecture can be used by other Indian languages to represent their respective WordNets in OWL format. The repository of IndoWordNet OWL is available at http://nlp.unigoa.ac.in/indonet/owl/.

6 Conclusion and Future Work
The heart of the Semantic Web is Linked Data, which provides integration and reasoning over data on the web. The representation of IndoWordNet in OWL will benefit the semantic web community, as WordNet is a strong lexical resource that has strengthened, enlarged and built up other resources thanks to its taxonomy. In this paper we have presented a framework to represent the Indian WordNets in OWL format. Currently, we have represented eight Indian language WordNets in OWL format. In the future, we would like to represent the WordNets of other Indian languages in OWL format as well. The following are some directions for future work.
4 http://nlp.unigoa.ac.in/indonet/owl/web/syn.php
5 http://nlp.unigoa.ac.in/indonet/owl/web/wdSenseAndWord.php
6 http://nlp.unigoa.ac.in/indonet/owl/IndoWNetSchema.rdf
7 http://nlp.unigoa.ac.in/indonet/owl/web/prop.php
8 http://nlp.unigoa.ac.in/indonet/owl/web/propdist.php
9 https://jena.apache.org/
Interlinking of WordNets: IndoWordNet is developed using the ILI. The advantage of this approach is that it preserves the semantic structure, but it also has some disadvantages, namely the lexical gap and the semantic gap (Fellbaum and Vossen, 2012). As a result, an effort should be made to interlink the WordNets using the Common Concept Hierarchy (Bhatt et al., 2013) as a backbone to link the lexicons of different languages.
Linking to DBpedia: Work on linking IndoWordNet to DBpedia should be carried out, as DBpedia is the nucleus of the web of data and most resources are already linked to it.
Linking to other resources: We expect the OWL representation of IndoWordNet to serve as infrastructure to enrich and link other web resources in India.
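As an aside, the URI scheme described in Section 4 is regular enough to be generated mechanically. The following sketch (the function name and parameter names are ours, not from the IndoWordNet API) reproduces the synset pattern shown earlier:

```python
BASE = "http://nlp.unigoa.ac.in/indonet/owl"

def synset_uri(language, version, pos, synset_id):
    """Build a synset URI following the pattern from Section 4,
    e.g. .../hindi/v1/synset/noun/24.rdf."""
    return f"{BASE}/{language}/{version}/synset/{pos}/{synset_id}.rdf"
```

The wordSense and word patterns could be generated analogously, substituting the path segment and key.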
References
[Graves and Gutierrez2006] Alvaro Graves, Claudio
Gutierrez. 2006. Data Representation for WordNet:
A Case for RDF. 3rd Global WordNet Association
Conference.
[Bhatt et al.2013] Brijesh Bhatt, Lahari Poddar, Pushpak Bhattacharyya. 2013. IndoNet: A Multilingual
Lexical Knowledge Network for Indian Languages.
Association for Computational Linguistics.
[Fellbaum1998] Christiane Fellbaum. 1998. WordNet:
An Electronic Lexical Database.. Cambridge, MA:
MIT Press.
[Kuroda et al.2010] Kow Kuroda, Francis Bond, Kentaro Torisawa. 2010. Why Wikipedia needs to make
friends with WordNet. 5th Global WordNet Association Conference.
[Assem et al.2006] Mark van Assem, Aldo Gangemi, Guus Schreiber. 2006. Conversion of WordNet to a standard RDF/OWL representation. Proceedings of LREC.
[Casado et al.2005] Maria Ruiz-Casado, Enrique Alfoneseca, Pablo Castells. 2005. Automatic assignment of Wikipedia encyclopedic entries to WordNet
synsets. In: Proceedings of the Atlantic Web Intelligence Conference, AWIC-2005.
[Bhattacharyya2010] Pushpak Bhattacharyya. 2010. IndoWordNet. Proceedings of LREC.
[Auer et al.2007] Soren Auer, Christian Bizer, Georgi
Kobilarov, Jens Lehmann, Richard Cyganiak,
Zachary Ives. 2007. DBpedia: A Nucleus for a
Web of Open Data. ISWC’07/ASWC’07 Proceedings of the 6th international The semantic web and
2nd Asian conference on Asian semantic web conference.
[Huang and Zhou2007] Xiao-xi Huang, Chang-le Zhou. 2007. An OWL-based WordNet lexical ontology. Journal of Zhejiang University Science A.
[Pease et al.2002] Adam Pease, Ian Niles and John Li
2002. The Suggested Upper Merged Ontology: A
Large Ontology for the Semantic Web and its Applications. In Working Notes of the AAAI-2002 Workshop on Ontologies and the Semantic Web.
[Uchida et al.1999] H. Uchida, M. Zhu, and T. Della
Senta 1999. UNL- a Gift for the Millenium. United
Nations University Press, Tokyo.
[Fellbaum and Vossen2012] Christiane Fellbaum and Piek Vossen. 2012. Challenges for a multilingual wordnet. Lang. Resour. Eval., 46(2):313–326.
[McGuinness and Harmelen2004] Deborah L. McGuinness, Frank van Harmelen. 2004. OWL Web Ontology Language. http://www.w3.org/TR/owl-features.
[Francopoulo et al.2009] Gil Francopoulo, Nuria Bel,
Monte George, Nicoletta Calzolari, Monica Monachini, Mandy Pet, and Claudia Soria. 2009. LMF
for Multilingual Specialized Lexicons. LREC Workshop on Acquiring and Representing Multilingual,
Specialized Lexicons.
[Hendler2001] J. Hendler. 2001. Agents and the Semantic Web. IEEE Intelligent Systems.
[Candan et al.2001] K. Selçuk Candan, Huan Liu, and Reshma Suvarna. 2001. Resource description framework: metadata and its applications. SIGKDD Explor. Newsl. 3, 1 (July 2001), 6-19.
A Two-Phase Approach for Building Vietnamese WordNet
Phuong-Thai Nguyen
VNU University of Engineering and Technology
[email protected]
Van-Lam Pham
VASS Institute of Linguistics
[email protected]
Hoang-An Nguyen
Naiscorp Inc.
[email protected]
Huy-Hien Vu
VNU University of Engineering and Technology
[email protected]
Ngoc-Anh Tran
Le Quy Don Technical University
[email protected]
Thi-Thu-Ha Truong
VASS Institute of
Lexicography and Encyclopedia
[email protected]
Abstract
Wordnets play an important role not only in
linguistics but also in natural language processing (NLP). This paper reports major results of a project which aims to construct a
wordnet for Vietnamese language. We propose a two-phase approach to the construction
of Vietnamese WordNet employing available
language resources and ensuring Vietnamese
specific linguistic and cultural characteristics.
We also give statistical results and analyses to
show characteristics of the wordnet.
Length   Words    Percentage
1        6,303    15.69
2        28,416   70.72
3        2,259    5.62
4        2,784    6.93
5        419      1.04
Total    40,181   100

Table 1: Word length statistics from a popular Vietnamese dictionary, made by the Vietnam Lexicography Center (Vietlex).

1 Introduction
In order to solve various problems in NLP, including information retrieval, machine translation, text classification, etc., we need language resources such as corpora and dictionaries. Wordnet is one of the most important resources for solving such problems. The first wordnet was created at Princeton University for the English language. After that, diverse wordnets were constructed, such as EuroWordNet for European languages, Asian WordNet for Asian languages, etc.

There are a number of important characteristics of the Vietnamese language that impact the construction of a wordnet. Firstly, the smallest unit in the formation of Vietnamese words is the syllable. Words can have just one syllable, for example 'đẹp' (beautiful), or be a compound of two or more syllables, for example 'màu sắc' (color). As shown in Table 1, single-syllable words cover only a small proportion, while two-syllable words account for the largest proportion of the whole vocabulary. Forming that vocabulary is a set of 7,729 syllables, higher than the number of single words. As in many other Asian languages, such as Chinese, Japanese and Thai, there is no word delimiter in Vietnamese. The space is a syllable delimiter but not a word delimiter, so a Vietnamese sentence can often be segmented in many ways. Secondly, Vietnamese is an isolating language in which words do not change their forms according to their grammatical function in a sentence.

Constructing wordnets is a complicated task. This task involves answering questions including: which approach is appropriate, how to ensure the specific characteristics of the language, and how to take full advantage of available resources. This paper makes an attempt to answer these fundamental questions and reports major results of a project aiming to construct a wordnet for the Vietnamese language, whose database includes 30,000 synonym sets and 50,000 words, 30,000 of which are commonly used by the Vietnamese. Figure 1 represents the major steps in the construction process of Vietnamese WordNet. We put these steps in two phases: phase 1 involves steps 1-3, and phase 2 involves steps 4 and 5. We exploit a number of language resources including Princeton's WordNet, a Vietnamese dictionary and an English-Vietnamese bilingual dictionary. Section 4 presents statistics and analyses of the wordnet being constructed. Section 5 gives a number of conclusions and future works.
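The segmentation ambiguity mentioned above can be made concrete with a toy lexicon: a naive enumerator over syllable groupings already yields multiple readings for a two-syllable sequence. The lexicon below is illustrative only:

```python
def segmentations(syllables, lexicon):
    """Enumerate every way to group a syllable sequence into lexicon words,
    illustrating why a Vietnamese sentence can be segmented in many ways."""
    if not syllables:
        return [[]]
    results = []
    for i in range(1, len(syllables) + 1):
        word = " ".join(syllables[:i])  # candidate word of i syllables
        if word in lexicon:
            results += [[word] + rest
                        for rest in segmentations(syllables[i:], lexicon)]
    return results
```

With a lexicon containing both the single syllables and the compound, the sequence 'màu sắc' has two readings: as one compound word, or as two single-syllable words.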
2 Existing Wordnets

2.1 Princeton's WordNet
Since 1978, George Miller (Fellbaum, 1998) has researched and developed a database of words and the semantic relations between them. This database was called wordnet and is considered a model of the mental lexicon. Conceivably, a wordnet is a large discrete graph in which nodes are synonym sets (synsets) and edges are semantic relations between synsets. A synset is a collection of synonymous words of the same part of speech in which each word can be replaced by one of the others in certain contexts. For example, car, auto, automobile, machine, motorcar form a synset. This synset has a hyponymy relation with the synset vehicle because a car is a kind of vehicle.
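The graph-of-synsets view described above can be sketched as a small data structure. This is only an illustration of the model; the class and field names are ours and the real WordNet database format is different.

```python
# Minimal sketch of a wordnet as a graph of synsets (illustrative only;
# the actual WordNet database format differs).
class Synset:
    def __init__(self, words, pos):
        self.words = words      # synonymous words of the same part of speech
        self.pos = pos
        self.hypernyms = []     # edges to more general synsets

    def add_hypernym(self, other):
        self.hypernyms.append(other)

vehicle = Synset(["vehicle"], "n")
car = Synset(["car", "auto", "automobile", "machine", "motorcar"], "n")
car.add_hypernym(vehicle)       # a car is a kind of vehicle

# Any member of a synset can stand in for another in suitable contexts:
assert "automobile" in car.words
# Hyponymy is the hypernym edge read in the opposite direction:
assert vehicle in car.hypernyms
```
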
Figure 1: Steps in Vietnamese WordNet construction.

2.2  EuroWordNet
EuroWordNet (Vossen, 2002) is a multilingual lexical database of nine European languages. Each language has its own wordnet. These component wordnets are linked via Princeton's WordNet version 1.5. More specifically, their synsets are linked to the Princeton WordNet synsets which are equivalent or closest in meaning. EuroWordNet accepts different levels of lexicalization. For example, Princeton's WordNet contains both lexicalized and unlexicalized synsets, while the Dutch wordnet contains only lexicalized ones. Component wordnets have been built by exploiting available resources such as monolingual dictionaries, bilingual dictionaries, and Princeton's WordNet.
Figure 1 represents the major steps in the construction process of Vietnamese WordNet. We put these steps in two phases: phase 1 involves steps 1-3, and phase 2 involves steps 4 and 5. We exploit a number of language resources, including Princeton's WordNet, a Vietnamese dictionary and an English-Vietnamese bilingual dictionary.
The class of adverbs in Vietnamese is a closed class (a class of function words), while in English the class of adverbs is an open class (a class of content words). Vietnamese adverbs express time (such as 'đã' (past), 'đang' (continuous)), degree (such as 'rất' (very), 'hơi' (rather)), and negation (such as 'không' (not)). Therefore the number of adverbs in Vietnamese is much smaller than in English. For that reason, there are only three parts of speech in Vietnamese WordNet: noun, verb, and adjective. Semantic relations in Vietnamese WordNet are similar to those in Princeton WordNet, except for a number of relations such as derivationally related form, participle of verb, etc.
2.3  Asian WordNet
This project (Virach et al., 2009) aims to create wordnets for Asian languages such as Thai, Japanese, Korean, etc. Currently, Asian WordNet contains data for 13 languages. The authors adopted a semi-automatic approach to translate Princeton WordNet synsets into Asian languages using bilingual dictionaries. The authors also built an online tool for editing and visualizing the contents of the wordnet. Using this tool, many people can easily participate in the translation task. They can also modify translations and vote for the best one.
The remaining part of this paper is organized as follows: Section 2 gives a review of several existing wordnets. Section 3 introduces our method to construct Vietnamese WordNet. Section 4 presents statistics and analyses of the wordnet being constructed. Section 5 gives a number of conclusions and future work.
In terms of wordnet design, Asian WordNet is a special case of EuroWordNet because it was built by the translation approach. The major limitation of Asian WordNet is that it lacks concepts specific to Asian languages.
ởn' (white), 'làng xã' (village), etc., or words relating to the history, society and culture of Vietnam such as 'truyện Kiều' (a famous Vietnamese story), 'bánh chưng' (a kind of cake), etc. Therefore in phase 2, we select coordinated compound words, reduplicative words, and subordinated compound words to add to the Vietnamese WordNet. We choose words from a popular Vietnamese dictionary, compiled by the Vietnam Lexicography Center (Vietlex).
2.4  Laconec
This is a semantic-based multilingual dictionary available on the Internet1. According to the information on its website, the dictionary has been under development since 2007. The goal of Laconec is to provide multilingual lexical knowledge lookup based on semantics. The core of the system is a set of large-scale, Princeton WordNet-like monolingual dictionaries linked to each other. The dictionary acknowledges Dr. Francis Bond's work (Bond and Paik, 2012) and four wordnets: English, Thai, Japanese, and Finnish.
3  A Method to Construct Vietnamese WordNet

3.2  Guideline Development
Editing data for a wordnet is not an easy task; guideline documents are required to ensure the correctness and consistency of the data. In a wordnet, words are linked by semantic relations; therefore, in the guideline document we focus on describing how to identify semantic relations, especially synonymy, antonymy, hypernymy, hyponymy, holonymy, meronymy, and troponymy. We created diagnostic tests to verify relations between synsets. For instance, the synonymy relation is identified on the basis of the possibility of a word being replaced by another in a specific context. This can be verified by the possibility of the words being mutually substitutable in the sentence 'X is a Noun1 therefore X is a Noun2'. In addition to the tests, there are a number of principles which can be used for encoding the relations, for example the Economy principle and the Compatibility principle (Fellbaum, 1998). We also give guidelines for handling Vietnamese-specific linguistic and cultural characteristics. Last but not least, the guideline document contains instructions on how to give definitions and examples, how to exploit resources such as existing dictionaries, and spelling rules.
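The substitution diagnostic above can be turned into a simple sentence-template generator. This is only an illustrative sketch with names of our own choosing; in the project the generated test sentences are judged by lexicographers, not by code.

```python
# Sketch of the substitution-based synonymy diagnostic described above
# (illustrative; acceptability is judged by human annotators).
def synonymy_test(noun1, noun2, x="X"):
    """Build the diagnostic sentence 'X is a Noun1 therefore X is a Noun2'.

    If speakers accept the sentence with the nouns in both orders,
    noun1 and noun2 are candidates for the same synset."""
    return f"{x} is a {noun1} therefore {x} is a {noun2}"

# The diagnostic must be run in both substitution directions:
forward = synonymy_test("car", "motorcar")
backward = synonymy_test("motorcar", "car")
```
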
3.1  Two Phases in Constructing Vietnamese WordNet
We construct Vietnamese WordNet through two phases (Figure 1). In phase 1 (steps 1 to 3), we focus on translating a part of Princeton's WordNet into Vietnamese. In phase 2 (steps 4 and 5), we make use of Vietnamese resources to create the wordnet. The contents and requirements of these two phases are different and separate.
The major work of phase 1 is translating a part of
English Wordnet into Vietnamese. Thus, we firstly
need to determine a list of English synsets to translate. Because of the significantly smaller size of our
target Vietnamese wordnet, we choose to translate
only a part of Princeton’s WordNet. Our criteria for
selecting English synsets include: (1) the lexicalization possibility in Vietnamese; (2) the connectivity
of the selected part; (3) the inclusion of common
base concepts.
Since the set of lexicalized concepts in English and the set of lexicalized concepts in Vietnamese are different, the data of the wordnet built in phase 1 does not contain Vietnamese-specific words such as 'âm dương' (yin and yang), 'trắng
3.3  Treatment of Vietnamese Specific Words
With regard to their structure, Vietnamese words can be divided into a number of types: single-syllable words, coordinated compound words, subordinated compound words, reduplicative words, and accidental compound words. The syllables which are not single words are bound morphemes2, which can only be used as part of a word and not as words on their own. The coordinated compound words (CCWs), specific to Vietnamese, are

2 They may have a meaning ('trường' (long), 'hàn' (cold)) or not ('lẽo', 'nhánh').
1 www.laconec.com
words whose parts (each part can be a single or a compound word) are parallel in the sense that their meanings are similar and their order can be reversed. The meaning of a coordinated compound is often more abstract than the meanings of its parts. Words of this kind account for about 10% of the compound words according to statistics from the Vietlex dictionary. Reduplicative words (RWs) such as 'đất đai' (land), 'làm lụng' (work) are compounds whose parts have a phonetic relationship. This kind of word is specific to Vietnamese despite its small proportion. The identification of reduplicative words is normally deterministic and not ambiguous. Accidental compounds are non-syntactic compounds containing at least two meaningless syllables, such as 'đười ươi' (orangutan), 'bù nhìn' (puppet). Subordinated compound words (SCWs) are the most problematic. An SCW can be considered as having two parts, a head and a modifier. Normally, the head goes first, followed by the modifier. SCWs make up the largest proportion in the Vietnamese dictionary. Generally, distinguishing between an SCW and a phrase is difficult because an SCW's (syntactic) structure is similar to that of a phrase. This is a classical but persistent problem in Vietnamese linguistics.
The following are a number of synsets from Princeton's WordNet that were translated into Vietnamese. Words added to the synsets in phase 2 are in italics.

• (n) tree (a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown): cây; cây cối, cây cỏ (CCW)
• (v) laugh, express joy, express mirth (produce laughter): cười; cười đùa (CCW), cười cợt (RW)
• (adj) strong (having strength or power greater than average or expected): mạnh, mạnh mẽ, khoẻ; khoẻ mạnh (CCW), khoẻ khoắn (RW)
• (adj) black (being of the achromatic color of maximum darkness): đen, màu đen, có màu đen, mun, thâm, ô, ác, mực, huyền; đen sì, đen sì sì, đen thui, đen trũi, đen nhẻm (SCW), đen đen (RW)

POS        Synsets  Words   Word-synset pairs
Noun       17,084   32,122  37,452
Verb       9,483    21,180  32,273
Adjective  5,846    13,590  18,289
Total      32,413   66,892  88,014

Table 2: Vietnamese wordnet statistics.

3.4  Treatment of Vietnamese Proper Names

Proper names (place names, personal names, work names, etc.) represent important information about Vietnamese history, society, culture and thought. Vietnamese WordNet contains about 4,000 such linguistic expressions. Vietnamese WordNet also has to include worldwide famous names such as Amazon, Yangtze, Bacon, Nehru, etc. However, such names occupy only a small proportion in comparison with the Vietnamese ones. The following are a few examples.

• 'nhân vật' (character) > 'nhân vật kịch' (drama character) > 'nhân vật chèo' (character in Vietnamese traditional operetta) > 'hề' (clown) / 'mẹ Đốp' (mother Dop)
• 'làng' (village) > 'Đường Lâm' / 'Mộ Trạch' / 'Hành Thiện'
• 'dân tộc' (ethnic group) > 'Kinh' / 'Tày' / 'Thái'
• 'bánh' (cake) > 'bánh chưng' (square glutinous rice cake) / 'bánh trôi' (floating cake) / 'bánh rán' (fried cake)
• 'hồ' (lake) > 'Hồ Gươm' (Sword Lake) / 'Hồ Tây' (West Lake)

4  Empirical Analyses of Vietnamese WordNet

4.1  Vietnamese WordNet Statistics

Table 2 shows basic statistics of Vietnamese WordNet. Nouns take the largest proportion, while the numbers of verbs and adjectives are smaller. Like Princeton's WordNet, Vietnamese WordNet can be considered as comprising three subwordnets corresponding to the different parts of speech. The subwordnet of nouns has a unique root, 'thực thể' (entity). The
subwordnet of verbs has 255 roots. The subwordnet of adjectives has 2,201 clusters.

POS        CCWs   RWs    SCWs
Noun       976    186    2,068
Verb       2,347  772    138
Adjective  1,406  1,217  505
Total      4,729  2,175  2,711

Table 3: Vietnamese WordNet statistics: phase 2.

As shown in Table 4, there are 61,509 semantic relations: 34,161 between noun synsets, 18,465 between verb synsets, and 8,883 between adjective synsets. The most frequent semantic relations include hypernymy-hyponymy, synonymy, antonymy, and similar-to. Vietnamese WordNet inherits the WordNet Domains Hierarchy (Bentivogli et al., 2004), including 164 domain labels organized as a tree structure.

Relation    Noun    Verb    Adjective
Antonymy    572     667     2,658
Hypernymy   15,240  8,661   -
Hyponymy    15,240  8,661   -
Holonymy    1,362   -       -
Meronymy    1,362   -       -
Entailment  -       307     -
Cause       -       169     -
Attribute   385     -       385
Similar to  -       -       5,840
Total       34,161  18,465  8,883

Table 4: Semantic relation statistics.

4.2  Synset Size Distributions

Figure 2: Synset size distributions.

Figure 2 shows the synset size distributions of nouns, verbs, and adjectives. The horizontal axis represents synset size and the vertical axis represents the proportion. These distributions are not significantly different. On average, each synset contains 2.42 words. As synset size increases, the corresponding proportion decreases.

4.3  Phase 2 Contributions

Table 3 represents word statistics for phase 2 of the Vietnamese WordNet construction. The number of words added in this phase is 9,615. These words are specific to Vietnamese and different from the words in phase 1. In addition, we add nearly 4,000 proper nouns to Vietnamese WordNet. These nouns reflect Vietnamese anthroponyms, toponyms (rivers, mountains, etc.), social events, etc.

5  Conclusions

This paper has presented the most up-to-date results of the process of constructing Vietnamese WordNet. Since the project is approaching its final stage, there may be slight differences between the current version and the final version. We continue to revise the data by lexical phenomenon or following statistical methods. Vietnamese WordNet will be published online and will be available for research and development purposes.

Acknowledgments

This paper has been supported by the national project number KC.01.20/11-15.

References

Luisa Bentivogli, Pamela Forner, Bernardo Magnini and Emanuele Pianta. 2004. Revising WordNet Domains Hierarchy: Semantics, Coverage, and Balancing. Proceedings of the Workshop on Multilingual Linguistic Resources, COLING 2004.

Francis Bond and Kyonghee Paik. 2012. A Survey of WordNets and Their Licenses. Proceedings of the 6th Global WordNet Conference (GWC 2012). Matsue. 64-71.

Dhanon Leenoi, Thepchai Supnithi, Wirote Aroonmanakun. 2008. Building a Gold Standard for Thai WordNet. Proceedings of IALP.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Virach Sornlertlamvanich, Thatsanee Charoenporn, Kergrit Robkop, Chumpol Mokarat, and Hitoshi Isahara. 2009. Review on Development of Asian WordNet. JAPIO 2009 Year Book, Japan Patent Information Organization, Tokyo, Japan.

Piek Vossen. 2002. Wordnet, EuroWordnet and Global Wordnet. Pub. linguistiques, 2002/1 - Vol. VII, pages 27-38.

Piek Vossen. 2002. EuroWordNet General Document. Online document.
Extending the WN-Toolkit: dealing with polysemous words in the
dictionary-based strategy
Antoni Oliver
Universitat Oberta de Catalunya (UOC)
Barcelona - Catalonia - (Spain)
[email protected]
Abstract
In this paper we present an extension of the dictionary-based strategy for wordnet construction implemented in the WN-Toolkit. This strategy allows the extraction of information for polysemous English words if definitions and/or semantic relations are present in the dictionary. The WN-Toolkit is a freely available set of programs for the creation and expansion of wordnets using dictionary-based and parallel-corpus-based strategies. In previous versions of the toolkit, the dictionary-based strategy was only used for translating monosemous English variants. In our experiments we have used Omegawiki and Wiktionary, and we present automatic evaluation results for 24 languages that have wordnets in the Open Multilingual Wordnet project. We have used these existing wordnets to perform an automatic evaluation.
1  Introduction

1.1  The WN-Toolkit
The WN-Toolkit1 (Oliver, 2014) is a set of programs developed in Python for the automatic creation of wordnets following the expand model (Vossen, 1998), that is, by translating the variants (words) associated with Princeton WordNet synsets. The toolkit also provides some free language resources. These resources are preprocessed so they can be easily used with the toolkit.

The WN-Toolkit implements the following strategies for wordnet creation:
• Dictionary-based methodology: This strategy uses bilingual dictionaries to translate the English variants associated with each synset. In previous versions of the toolkit, this direct translation using dictionaries could be performed only on monosemic English variants, that is, variants associated with a single synset. About 82% of the English variants in Princeton WordNet 3.0 are monosemic, but frequent words tend to be polysemic. With the extension of the toolkit presented in this paper we are able to deal with polysemic English variants.

1 The WN-Toolkit can be freely downloaded from http://sourceforge.net/projects/wn-toolkit/
• BabelNet-based strategies: BabelNet (Navigli and Ponzetto, 2010) is a semantic network and a multilingual encyclopedic dictionary with lexicographic and encyclopedic coverage of terms. In this methodology we simply extract the data from the BabelNet file to get the target wordnet. This strategy can only be applied to old versions of BabelNet, as new versions have a use restriction that does not allow the creation of wordnets from their data.
• Parallel-corpus-based methodologies: In order to extract wordnets from a parallel corpus, we need the parallel corpus to be semantically tagged with Princeton WordNet synsets on the English side. As such corpora are not easily available, we use two strategies for the automatic construction of the required corpora:
  – By machine translation of sense-tagged corpora.
  – By automatic sense-tagging of English-target language parallel corpora.
The WN-Toolkit also provides some resources, such as dictionaries and preprocessed bilingual corpora.
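The core idea of the dictionary-based strategy for monosemous variants can be sketched as follows. This is a simplified reconstruction, not the toolkit's actual code; the synset ids, dictionary entries and function name are invented for illustration.

```python
# Sketch of the dictionary-based strategy for monosemous English variants
# (simplified reconstruction; not the WN-Toolkit's actual implementation).

# English variant -> list of synset ids it belongs to (toy data)
variant_synsets = {
    "motorcar": ["02958343-n"],              # monosemous: one synset
    "bank": ["08420278-n", "09213565-n"],    # polysemous: skipped here
}

# English -> target-language bilingual dictionary (toy data)
bilingual = {
    "motorcar": ["cotxe"],
    "bank": ["banc", "riba"],
}

def translate_monosemous(variant_synsets, bilingual):
    """Assign dictionary translations to a synset only when the English
    variant belongs to exactly one synset (so no disambiguation is needed)."""
    target = {}
    for variant, synsets in variant_synsets.items():
        if len(synsets) == 1 and variant in bilingual:
            target.setdefault(synsets[0], set()).update(bilingual[variant])
    return target

wn = translate_monosemous(variant_synsets, bilingual)
# 'motorcar' is monosemous, so its translation is attached to its synset;
# 'bank' is polysemous and is left for the extended strategy.
```

The extension presented in the paper addresses exactly the polysemous case that this sketch skips, by exploiting definitions and semantic relations in the dictionary.
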
Language       Code  Synsets  Words    Senses   Core
Albanian       sqi   4,676    5,990    9,602    31%
Arabic         arb   10,165   14,595   21,751   48%
Basque         eus   29,413   26,240   48,934   71%
Bulgarian      bul   4,999    6,783    9,056    100%
Catalan        cat   45,826   46,531   70,622   81%
Chinese        cmn   42,312   61,533   79,809   100%
Croatian       hrv   23,122   29,010   47,906   100%
Danish         dan   4,476    4,468    5,859    81%
Finnish        fin   116,763  129,839  189,227  100%
French         fra   59,091   55,373   102,671  92%
Galician       glg   19,312   23,124   27,138   36%
Greek          ell   18,049   18,227   24,106   57%
Hebrew         heb   5,448    5,325    6,872    27%
Indonesian     ind   38,085   36,954   106,688  94%
Italian        ita   35,001   41,855   63,133   83%
Japanese       jpn   57,184   91,964   158,069  95%
Norwegian N.   nno   3,671    3,387    4,762    66%
Norwegian B.   nob   4,455    4,186    5,586    81%
Polish         pol   36,054   61,393   88,889   66%
Portuguese     por   43,895   54,071   74,012   84%
Slovene        slv   42,583   40,233   70,947   86%
Spanish        spa   38,512   36,681   57,764   76%
Swedish        swe   6,796    5,824    6,904    99%
Thai           tha   73,350   82,504   95,517   81%

Table 1: Statistics for the wordnets in OMW
1.2  The Open Multilingual Wordnet project

The Open Multilingual Wordnet2 (OMW) (Bond and Paik, 2012) provides free access to several wordnets in a common format. We have performed experiments for 24 languages out of the 28 available wordnets. In Table 1 we can observe some statistics about the wordnets for these languages. These wordnets have been used to perform an automatic evaluation of the results.
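The automatic evaluation against an existing OMW wordnet can be sketched as a simple precision check over proposed (synset, word) pairs. This is our own illustration with invented data; the paper's exact evaluation procedure may differ.

```python
# Sketch of automatic evaluation against an existing reference wordnet
# (illustration only; the paper's exact procedure may differ).

def evaluate(extracted, reference):
    """extracted/reference: dicts synset_id -> set of target-language words.
    A proposed (synset, word) pair counts as correct if the reference
    wordnet contains the same pair; synsets absent from the reference
    cannot be judged automatically and are skipped."""
    correct = evaluated = 0
    for synset, words in extracted.items():
        if synset not in reference:
            continue
        for w in words:
            evaluated += 1
            correct += w in reference[synset]
    return correct / evaluated if evaluated else 0.0

reference = {"02958343-n": {"cotxe", "auto"}, "00002137-n": {"cosa"}}
extracted = {"02958343-n": {"cotxe", "vehicle"}, "99999999-n": {"nou"}}
precision = evaluate(extracted, reference)   # 1 of 2 judged pairs correct
```

Note that this kind of automatic evaluation underestimates real precision: a correct translation that happens to be missing from the reference wordnet is counted as wrong.
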
1.3  Omegawiki
Omegawiki3 is a free collaborative dictionary that can be accessed through the Internet as well as downloaded as a relational database. The downloads are provided as MySQL dumps, so it is easy to set up a MySQL database holding a local copy of Omegawiki. For our experiments we have downloaded all the SQL dumps corresponding to the lexical data and created our own copy of Omegawiki. From this database we have extracted all the required data and filled a new MySQL database according to the layout explained in section 2.1.
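A local store of extracted translation pairs can be sketched with an embedded database. We use SQLite here only to keep the example self-contained; the authors use MySQL, and the table layout below is invented rather than the one from section 2.1.

```python
# Sketch of storing extracted English-target translation pairs locally.
# The authors load the Omegawiki SQL dumps into MySQL; SQLite is used here
# only for a self-contained example, and the table layout is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE translation (
                  english TEXT, target TEXT, lang TEXT)""")
rows = [("car", "cotxe", "cat"), ("car", "coche", "spa")]
conn.executemany("INSERT INTO translation VALUES (?, ?, ?)", rows)

# Look up the Catalan translations of an English entry:
cat = [t for (t,) in conn.execute(
    "SELECT target FROM translation WHERE english=? AND lang=?",
    ("car", "cat"))]
```
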
In Table 2 we can observe the number of English-target language entries in Omegawiki for the languages in our experiments.
2 http://compling.hss.ntu.edu.sg/omw/
3 http://www.omegawiki.org/
Omegawiki uses a complex set of semantic relations between its entries. There seems to be a great degree of freedom for users to create new relations. A total of 77 relations are found in the English Omegawiki, but only 22 of them have at least 50 occurrences. These relations can be observed in Table 3.
We tried to relate the names of the relations in Omegawiki to the standard relation names used in WordNet and Wiktionary (hypernym, hyponym, holonym, meronym, antonym and synonym). As holonym, meronym and antonym are already used in Omegawiki, we try to find out the names used for hypernym, hyponym and synonym by looking at examples of these relations in Wiktionary and observing whether some of these examples are also present in Omegawiki. In this way we could establish the correspondence between relation codes and names in Omegawiki and standard relation names. A special case are synonyms, which are expressed as translations into the same language. In Table 4 we can observe these correspondences.
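The procedure for identifying Omegawiki's relation names can be sketched as counting shared example pairs between the two dictionaries. The word pairs, relation ids and candidate labels below are toy data of our own; the real ids and examples differ.

```python
# Sketch of mapping Omegawiki relation ids to standard names by checking
# which Omegawiki relation holds between word pairs that Wiktionary
# labels with a known relation (toy data; real ids and pairs differ).
from collections import Counter

# (word1, word2) pairs labelled as hypernym pairs in Wiktionary (toy data)
wiktionary_hypernyms = {("dog", "animal"), ("car", "vehicle"), ("oak", "tree")}

# Omegawiki relation instances: relation_id -> set of (word1, word2) pairs
omegawiki = {
    "4": {("dog", "animal"), ("car", "vehicle")},   # hypothetical candidate
    "7574": {("animal", "dog")},                    # hypothetical candidate
}

def best_relation(labelled_pairs, omegawiki):
    """Return the Omegawiki relation id sharing the most pairs with the
    labelled Wiktionary examples."""
    overlap = Counter()
    for rel, pairs in omegawiki.items():
        overlap[rel] = len(pairs & labelled_pairs)
    return overlap.most_common(1)[0][0]

rel_id = best_relation(wiktionary_hypernyms, omegawiki)   # -> "4"
```
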
In Table 5 the numbers of definitions and semantic relations in Omegawiki and Wiktionary can be observed.
Language       Code  Omegawiki  Wiktionary
Albanian       sqi   417        4,431
Arabic         arb   3,293      17,157
Basque         eus   5,293      3,834
Bulgarian      bul   5,851      24,983
Catalan        cat   4,001      24,625
Chinese        cmn   3,368      70,553
Croatian       hrv   1,687      34,485*
Danish         dan   7,177      18,625
Finnish        fin   9,654      94,193
French         fra   26,492     70,178
Galician       glg   1,636      7,832
Greek          ell   6,193      30,161
Hebrew         heb   3,447      12,452
Indonesian     ind   2,219      6,669
Italian        ita   25,083     51,098
Japanese       jpn   6,674      45,135
Norwegian N.   nno   787        5,842
Norwegian B.   nob   6,399      6,395
Polish         pol   8,280      32,486
Portuguese     por   11,858     58,925
Slovene        slv   5,102      9,036
Spanish        spa   36,139     63,512
Swedish        swe   10,271     45,016
Thai           tha   1,614      6,339

Table 2: Number of English-target language entries for each language
relation: is part of theme, parent, child, broader terms, narrower terms, is spoken in, related terms, borders on, is written in, antonym, official language, capital, country, wordt gevolgd door, currency, holonym, demonym, flows through, dialectal variant, meronym, flows into, is practiced by a

Table 3: Relations with at least 50 occurrences in English Omegawiki

Code OW: 4, 5, 7574, 375074, 375078, -

1.4  Wiktionary
Wiktionary4 is also a free collaborative dictionary. The project is related to Wikipedia and is developed in the MediaWiki format. It can be accessed through the Internet and can also be downloaded. The download format is an XML file with sections in MediaWiki markup, and for this reason it is difficult to parse.

The Dbnary project5 (Sérasset, 2012) parses the Wiktionary content as soon as a new dump is available and provides this content in an easy-to-parse format.
In our first experiments we used our own parser to extract the information from the English Wiktionary dumps, but we missed a lot of information and it was very difficult and time-consuming to correct the errors and expand the parser, so we started to use the results of the Dbnary project. We have