Lecture Notes in Artificial Intelligence
3171
Edited by J. G. Carbonell and J. Siekmann
Subseries of Lecture Notes in Computer Science
Ana L.C. Bazzan Sofiane Labidi (Eds.)
Advances in
Artificial Intelligence –
SBIA 2004
17th Brazilian Symposium on Artificial Intelligence
São Luis, Maranhão, Brazil
September 29 – October 1, 2004
Proceedings
Springer
eBook ISBN: 3-540-28645-4
Print ISBN: 3-540-23237-0
©2005 Springer Science + Business Media, Inc.
Print ©2004 Springer-Verlag
Berlin Heidelberg
All rights reserved
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic,
mechanical, recording, or otherwise, without written consent from the Publisher
Created in the United States of America
Visit Springer's eBookstore at: http://ebooks.springerlink.com
and the Springer Global Website Online at: http://www.springeronline.com
Preface
SBIA, the Brazilian Symposium on Artificial Intelligence, is a biennial event
intended to be the main forum of the AI community in Brazil. SBIA 2004
was the 17th edition of the series, which was initiated in 1984. Since 1995 SBIA has been
accepting papers written and presented only in English, attracting researchers
from all over the world. At that time it also started to have an international
program committee, keynote invited speakers, and proceedings published in the
Lecture Notes in Artificial Intelligence (LNAI) series of Springer (SBIA 1995,
Vol. 991, SBIA 1996, Vol. 1159, SBIA 1998, Vol. 1515, SBIA 2000, Vol. 1952,
SBIA 2002, Vol. 2507).
SBIA 2004 was sponsored by the Brazilian Computer Society (SBC). It was
held from September 29 to October 1 in the city of São Luis, in the northeast
of Brazil, together with the Brazilian Symposium on Neural Networks (SBRN).
This followed a trend of joining the AI and ANN communities to make the joint
event a very exciting one. In particular, in 2004 these two events were also held
together with the IEEE International Workshop on Machine Learning and Signal
Processing (MMLP), formerly NNLP.
The organizational structure of SBIA 2004 was similar to other international
scientific conferences. The backbone of the conference was the technical program
which was complemented by invited talks, workshops, etc. on the main AI topics.
The call for papers attracted 209 submissions from 21 countries. Each paper
submitted to SBIA was reviewed by three referees. From this total, 54 papers
from 10 countries were accepted and are included in this volume. This made
SBIA a very competitive conference with an acceptance rate of 25.8%. The
evaluation of this large number of papers was a challenge in terms of reviewing
and maintaining the high quality of the preceding SBIA conferences. All these
goals would not have been achieved without the excellent work of the members
of the program committee – composed of 80 researchers from 18 countries – and
the auxiliary reviewers.
Thus, we would like to express our sincere gratitude to all those who helped
make SBIA 2004 happen. First of all we thank all the contributing authors;
special thanks go to the members of the program committee and reviewers for
their careful work in selecting the best papers. Thanks go also to the steering
committee for its guidance and support, to the local organization people, and to
the students who helped with the website design and maintenance, the paper
submission site, and with the preparation of this volume. Finally, we would like
to thank the Brazilian funding agencies and Springer for supporting this book.
Porto Alegre, September 2004
Ana L.C. Bazzan
(Chair of the Program Committee)
Sofiane Labidi
(General Chair)
Organization
SBIA 2004 was held in conjunction with SBRN 2004 and with IEEE MMLP
2004. These events were co-organized by all co-chairs involved in them.
Chair
Sofiane Labidi (UFMA, Brazil)
Steering Committee
Ariadne Carvalho (UNICAMP, Brazil)
Geber Ramalho (UFPE, Brazil)
Guilherme Bitencourt (UFSC, Brazil)
Jaime Sichman (USP, Brazil)
Organizing Committee
Allan Kardec Barros (UFMA)
Aluízio Araújo (UFPE)
Ana L.C. Bazzan (UFRGS)
Geber Ramalho (UFPE)
Osvaldo Ronald Saavedra (UFMA)
Sofiane Labidi (UFMA)
Supporting Scientific Society
SBC
Sociedade Brasileira de Computação
Program Committee
Luis Otavio Alvares, Univ. Federal do Rio Grande do Sul (Brazil)
Analia Amandi, Universidad Nacional del Centro de la Provincia de Buenos Aires (Argentina)
John Atkinson, Universidad de Concepción (Chile)
Bráulio Coelho Avila, Pontifícia Universidade Católica, PR (Brazil)
Flávia Barros, Universidade Federal de Pernambuco (Brazil)
Guilherme Bittencourt, Universidade Federal de Santa Catarina (Brazil)
Olivier Boissier, École Nationale Superieure des Mines de Saint-Etienne (France)
Rafael H. Bordini, University of Liverpool (UK)
Dibio Leandro Borges, Pontifícia Universidade Católica, PR (Brazil)
Bert Bredeweg, University of Amsterdam (The Netherlands)
Jacques Calmet, Universität Karlsruhe (Germany)
Mario F. Montenegro Campos, Universidade Federal de Minas Gerais (Brazil)
Fernando Carvalho, Universidade Federal do Ceará (Brazil)
Francisco Carvalho, Universidade Federal de Pernambuco (Brazil)
Cristiano Castelfranchi, Institute of Psychology, CNR (Italy)
Carlos Castro, Univ. Técnica Federico Santa María (Chile)
Stefano Cerri, Université Montpellier II (France)
Ibrahim Chaib-draa, Université Laval (Canada)
Helder Coelho, Universidade de Lisboa (Portugal)
Vincent Corruble, Université Pierre et Marie Curie (France)
Ernesto Costa, Universidade de Coimbra (Portugal)
Anna Helena Reali Costa, Universidade de São Paulo (Brazil)
Antônio C. da Rocha Costa, Universidade Católica de Pelotas (Brazil)
Augusto C.P.L. da Costa, Universidade Federal da Bahia (Brazil)
Evandro de Barros Costa, Universidade Federal de Alagoas (Brazil)
Kerstin Dautenhahn, University of Hertfordshire (UK)
Keith Decker, University of Delaware (USA)
Marco Dorigo, Université Libre de Bruxelles (Belgium)
Michael Fisher, University of Liverpool (UK)
Peter Flach, University of Bristol (UK)
Ana Cristina Bicharra Garcia, Universidade Federal Fluminense (Brazil)
Uma Garimella, AP State Council for Higher Education (India)
Lúcia Giraffa, Pontifícia Universidade Católica, RS (Brazil)
Claudia Goldman, University of Massachusetts, Amherst (USA)
Fernando Gomide, Universidade Estadual de Campinas (Brazil)
Gabriela Henning, Universidad Nacional del Litoral (Argentina)
Michael Huhns, University of South Carolina (USA)
Nitin Indurkhya, University of New South Wales (Australia)
Alípio Jorge, University of Porto (Portugal)
Celso Antônio Alves Kaestner, Pontifícia Universidade Católica, PR (Brazil)
Franziska Klügl, Universität Würzburg (Germany)
Sofiane Labidi, Universidade Federal do Maranhão (Brazil)
Lluis Godo Lacasa, Artificial Intelligence Research Institute (Spain)
Marcelo Ladeira, Universidade de Brasília (Brazil)
Nada Lavrac, Josef Stefan Institute (Slovenia)
Christian Lemaitre, Lab. Nacional de Informatica Avanzada (Mexico)
Victor Lesser, University of Massachusetts, Amherst (USA)
Vera Lúcia Strube de Lima, Pontifícia Universidade Católica, RS (Brazil)
Jose Gabriel Pereira Lopes, Universidade Nova de Lisboa (Portugal)
Michael Luck, University of Southampton (UK)
Ana Teresa Martins, Universidade Federal do Ceará (Brazil)
Stan Matwin, University of Ottawa (Canada)
Eduardo Miranda, University of Plymouth (UK)
Maria Carolina Monard, Universidade de São Paulo at São Carlos (Brazil)
Valérie Monfort, MDT Vision (France)
Eugenio Costa Oliveira, Universidade do Porto (Portugal)
Tarcisio Pequeno, Universidade Federal do Ceará (Brazil)
Paolo Petta, Austrian Research Institut for Artificial Intelligence (Austria)
Geber Ramalho, Universidade Federal de Pernambuco (Brazil)
Solange Rezende, Universidade de São Paulo at São Carlos (Brazil)
Carlos Ribeiro, Instituto Tecnológico de Aeronáutica (Brazil)
Francesco Ricci, Istituto Trentino di Cultura (Italy)
Sandra Sandri, Artificial Intelligence Research Institute (Spain)
Sandip Sen, University of Tulsa (USA)
Jaime Simão Sichman, Universidade de São Paulo (Brazil)
Carles Sierra, Institut d’Investigació en Intel. Artificial (Spain)
Milind Tambe, University of Southern California (USA)
Patricia Tedesco, Universidade Federal de Pernambuco (Brazil)
Sergio Tessaris, Free University of Bozen-Bolzano (Italy)
Luis Torgo, University of Porto (Portugal)
Andre Valente, Knowledge Systems Ventures (USA)
Wamberto Vasconcelos, University of Aberdeen (UK)
Rosa Maria Vicari, Univ. Federal do Rio Grande do Sul (Brazil)
Renata Vieira, UNISINOS (Brazil)
Jacques Wainer, Universidade Estadual de Campinas (Brazil)
Renata Wasserman, Universidade de São Paulo (Brazil)
Michael Wooldridge, University of Liverpool (UK)
Franco Zambonelli, Università di Modena Reggio Emilia (Italy)
Gerson Zaverucha, Universidade Federal do Rio de Janeiro (Brazil)
Sponsoring Organizations
By the publication of this volume, the SBIA 2004 conference received financial
support from the following institutions:
CNPq – Conselho Nacional de Desenvolvimento Científico e Tecnológico
CAPES – Fundação Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
FAPEMA – Fundação de Amparo à Pesquisa do Estado do Maranhão
FINEP – Financiadora de Estudos e Projetos
Additional Reviewers
Mara Abel
Nik Nailah Bint Abdullah
Diana Adamatti
Stephane Airiau
João Fernando Alcântara
Teddy Alfaro
Luis Almeida
Marcelo Armentano
Dipyaman Banerjee
Dante Augusto Couto Barone
Gustavo Batista
Amit Bhaya
Reinaldo Bianchi
Francine Bica
Waldemar Bonventi
Flávio Bortolozzi
Mohamed Bouklit
Paolo Bouquet
Carlos Fisch de Brito
Tiberio Caetano
Eduardo Camponogara
Teddy Candale
Henrique Cardoso
Ariadne Carvalho
André Ponce de Leon F. de Carvalho
Ana Casali
Adelmo Cechin
Luciano Coutinho
Damjan Demsar
Clare Dixon
Fabrício Enembreck
Paulo Engel
Alexandre Evsukoff
Anderson Priebe Ferrugem
Marcelo Finger
Ricardo Freitas
Leticia Friske
Arjita Ghosh
Daniela Godoy
Alex Sandro Gomes
Silvio Gonnet
Marco Antonio Insaurriaga Gonzalez
Roderich Gross
Michel Habib
Juan Heguiabehere
Emilio Hernandez
Benjamin Hirsch
Jomi Hübner
Ullrich Hustadt
Alceu de Souza Britto Junior
Branko Kavsek
Alessandro Lameiras Koerich
Boris Konev
Fred Koriche
Luís Lamb
Michel Liquière
Peter Ljubic
Andrei Lopatenko
Gabriel Lopes
Emiliano Lorini
Teresa Ludermir
Alexei Manso Correa Machado
Charles Madeira
Pierre Maret
Graça Marietto
Lilia Martins
Claudio Meneses
Claudia Milaré
Márcia Cristina Moraes
Álvaro Moreira
Ranjit Nair
Marcio Netto
André Neves
Julio Cesar Nievola
Luis Nunes
Maria das Graças Volpe Nunes
Valguima Odakura
Carlos Oliveira
Flávio Oliveira
Fernando Osório
Flávio Pádua
Elias Pampalk
Marcelino Pequeno
Luciano Pimenta
Aloisio Carlos de Pina
Joel Plisson
Ronaldo Prati
Carlos Augusto Prolo
Ricardo Prudêncio
Josep Puyol-Gruart
Sergio Queiroz
Violeta Quental
Leila Ribeiro
María Cristina Riff
Maria Rifqi
Ana Rocha
Linnyer Ruiz
Sabyasachi Saha
Luis Sarmento
Silvia Schiaffino
Hernan Schmidt
Antônio Selvatici
David Sheeren
Alexandre P. Alves da Silva
Flávio Soares Corrêa da Silva
Francisco Silva
Klebson dos Santos Silva
Ricardo de Abreu Silva
Roberto da Silva
Valdinei Silva
Wagner da Silva
Alexandre Simões
Eduardo do Valle Simoes
Marcelo Borghetti Soares
Marcilio Carlos P. de Souto
Renata Souza
Andréa Tavares
Marcelo Andrade Teixeira
Clésio Luis Tozzi
Karl Tuyls
Adriano Veloso
Felipe Vieira
Fernando Von Zuben
Alejandro Zunino
Table of Contents
Logics, Planning, and Theoretical Methods
On Modalities for Vague Notions
Mario Benevides, Carla Delgado, Renata P. de Freitas,
Paulo A.S. Veloso, and Sheila R.M. Veloso
1
Towards Polynomial Approximations of Full Propositional Logic
Marcelo Finger
11
Using Relevance to Speed Up Inference. Some Empirical Results
Joselyto Riani and Renata Wassermann
21
A Non-explosive Treatment of Functional Dependencies
Using Rewriting Logic
Gabriel Aguilera, Pablo Cordero, Manuel Enciso, Angel Mora,
and Inmaculada Perez de Guzmán
31
Reasoning About Requirements Evolution
Using Clustered Belief Revision
Odinaldo Rodrigues, Artur d’Avila Garcez, and Alessandra Russo
41
Analysing AI Planning Problems in Linear Logic –
A Partial Deduction Approach
Peep Küngas
52
Planning with Abduction: A Logical Framework
to Explore Extensions to Classical Planning
Silvio do Lago Pereira and Leliane Nunes de Barros
62
High-Level Robot Programming:
An Abductive Approach Using Event Calculus
Silvio do Lago Pereira and Leliane Nunes de Barros
73
Search, Reasoning, and Uncertainty
Word Equation Systems: The Heuristic Approach
César Luis Alonso, Fátima Drubi, Judith Gómez-García,
and José Luis Montaña
83
A Cooperative Framework
Based on Local Search and Constraint Programming
for Solving Discrete Global Optimisation
Carlos Castro, Michael Moossen, and María Cristina Riff
93
Machine Learned Heuristics to Improve Constraint Satisfaction
Marco Correia and Pedro Barahona
103
Towards a Natural Way of Reasoning
José Carlos Loureiro Ralha and Célia Ghedini Ralha
114
Is Plausible Reasoning a Sensible Alternative
for Inductive-Statistical Reasoning?
Ricardo S. Silvestre and Tarcísio H. C. Pequeno
124
Paraconsistent Sensitivity Analysis for Bayesian Significance Tests
Julio Michael Stern
134
Knowledge Representation and Ontologies
An Ontology for Quantities in Ecology
Virgínia Brilhante
144
Using Color to Help in the Interactive Concept Formation
Vasco Furtado and Alexandre Cavalcante
154
Propositional Reasoning for an Embodied Cognitive Model
Jerusa Marchi and Guilherme Bittencourt
164
A Unified Architecture to Develop Interactive Knowledge Based Systems
Vládia Pinheiro, Elizabeth Furtado, and Vasco Furtado
174
Natural Language Processing
Evaluation of Methods for Sentence and Lexical Alignment
of Brazilian Portuguese and English Parallel Texts
Helena de Medeiros Caseli, Aline Maria da Paz Silva,
and Maria das Graças Volpe Nunes
184
Applying a Lexical Similarity Measure
to Compare Portuguese Term Collections
Marcirio Silveira Chaves and Vera Lúcia Strube de Lima
194
Dialog with a Personal Assistant
Fabrício Enembreck and Jean-Paul Barthès
204
Applying Argumentative Zoning in an Automatic Critiquer
of Academic Writing
Valéria D. Feltrim, Jorge M. Pelizzoni, Simone Teufel,
Maria das Graças Volpe Nunes, and Sandra M. Aluísio
214
DiZer: An Automatic Discourse Analyzer for Brazilian Portuguese
Thiago Alexandre Salgueiro Pardo, Maria das Graças Volpe Nunes,
and Lucia Helena Machado Rino
224
A Comparison of Automatic Summarizers
of Texts in Brazilian Portuguese
Lucia Helena Machado Rino, Thiago Alexandre Salgueiro Pardo,
Carlos Nascimento Silla Jr., Celso Antônio Alves Kaestner,
and Michael Pombo
235
Machine Learning, Knowledge Discovery,
and Data Mining
Heuristically Accelerated Q–Learning: A New Approach
to Speed Up Reinforcement Learning
Reinaldo A.C. Bianchi, Carlos H.C. Ribeiro, and Anna H.R. Costa
245
Using Concept Hierarchies in Knowledge Discovery
Marco Eugênio Madeira Di Beneditto and Leliane Nunes de Barros
255
A Clustering Method for Symbolic Interval-Type Data
Using Adaptive Chebyshev Distances
Francisco de A.T. de Carvalho, Renata M.C.R. de Souza,
and Fabio C.D. Silva
266
An Efficient Clustering Method for High-Dimensional Data Mining
Jae-Woo Chang and Yong-Ki Kim
276
Learning with Drift Detection
João Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues
286
Learning with Class Skews and Small Disjuncts
Ronaldo C. Prati, Gustavo E.A.P.A. Batista,
and Maria Carolina Monard
296
Making Collaborative Group Recommendations
Based on Modal Symbolic Data
Sérgio R. de M. Queiroz and Francisco de A.T. de Carvalho
307
Search-Based Class Discretization
for Hidden Markov Model for Regression
Kate Revoredo and Gerson Zaverucha
317
SKDQL: A Structured Language
to Specify Knowledge Discovery Processes and Queries
Marcelino Pereira dos Santos Silva and Jacques Robin
326
Evolutionary Computation, Artificial Life,
and Hybrid Systems
Symbolic Communication in Artificial Creatures:
An Experiment in Artificial Life
Angelo Loula, Ricardo Gudwin, and João Queiroz
336
What Makes a Successful Society?
Experiments with Population Topologies in Particle Swarms
Rui Mendes and José Neves
346
Splinter: A Generic Framework
for Evolving Modular Finite State Machines
Ricardo Nastas Acras and Silvia Regina Vergilio
356
An Hybrid GA/SVM Approach for Multiclass Classification
with Directed Acyclic Graphs
Ana Carolina Lorena and André C. Ponce de Leon F. de Carvalho
366
Dynamic Allocation of Data-Objects in the Web,
Using Self-tuning Genetic Algorithms
Joaquín Pérez O., Rodolfo A. Pazos R., Graciela Mora O.,
Guadalupe Castilla V., José A. Martínez., Vanesa Landero N.,
Héctor Fraire H., and Juan J. González B.
376
Detecting Promising Areas by Evolutionary Clustering Search
Alexandre C.M. Oliveira and Luiz A.N. Lorena
385
A Fractal Fuzzy Approach to Clustering Tendency Analysis
Sarajane Marques Peres and Márcio Luiz de Andrade Netto
395
On Stopping Criteria for Genetic Algorithms
Martín Safe, Jessica Carballido, Ignacio Ponzoni, and Nélida Brignole
405
A Study of the Reasoning Methods Impact on Genetic Learning
and Optimization of Fuzzy Rules
Pablo Alberto de Castro and Heloisa A. Camargo
414
Using Rough Sets Theory and Minimum Description Length Principle
to Improve a
Fuzzy Revision Method for CBR Systems
Florentino Fdez-Riverola, Fernando Díaz, and Juan M. Corchado
424
Robotics and Computer Vision
Forgetting and Fatigue in Mobile Robot Navigation
Luís Correia and António Abreu
434
Texture Classification Using the Lempel-Ziv-Welch Algorithm
Leonardo Vidal Batista and Moab Mariz Meira
444
A Clustering-Based Possibilistic Method for Image Classification
Isabela Drummond and Sandra Sandri
454
An Experiment on Handshape Sign Recognition
Using Adaptive Technology: Preliminary Results
Hemerson Pistori and João José Neto
464
Autonomous Agents and Multi-agent Systems
Recent Advances on Multi-agent Patrolling
Alessandro Almeida, Geber Ramalho, Hugo Santana, Patrícia Tedesco,
Talita Menezes, Vincent Corruble, and Yann Chevaleyre
474
On the Convergence to and Location
of Attractors of Uncertain, Dynamic Games
Eduardo Camponogara
484
Norm Consistency in Electronic Institutions
Marc Esteva, Wamberto Vasconcelos, Carles Sierra,
and Juan A. Rodríguez-Aguilar
494
Using the MOISE+ for a Cooperative Framework
of MAS Reorganisation
Jomi Fred Hübner, Jaime Simão Sichman, and Olivier Boissier
506
A Paraconsistent Approach for Offer Evaluation in Negotiations
Fabiano M. Hasegawa, Bráulio C. Ávila,
and Marcos Augusto H. Shmeil
516
Sequential Bilateral Negotiation
Orlando Pinho Jr., Geber Ramalho, Gustavo de Paula,
and Patrícia Tedesco
526
Towards to Similarity Identification to Help in the Agents’ Negotiation
Andreia Malucelli and Eugénio Oliveira
536
Author Index
547
On Modalities for Vague Notions
Mario Benevides1,2, Carla Delgado2, Renata P. de Freitas2,
Paulo A.S. Veloso2, and Sheila R.M. Veloso2
1 Instituto de Matemática
2 Programa de Engenharia de Sistemas e Computação, COPPE
Universidade Federal do Rio de Janeiro, Caixa Postal 68511, 21945-970
Rio de Janeiro, RJ, Brasil
{mario,delgado,naborges,veloso,sheila}@cos.ufrj.br
Abstract. We examine modal logical systems, with generalized operators, for the precise treatment of vague notions such as ‘often’, ‘a meaningful subset of a whole’, ‘most’, ‘generally’ etc. The intuition of ‘most’
as “all but for a ‘negligible’ set of exceptions” is made precise by means
of filters. We examine a modal logic, with a new modality for a local
version of ‘most’ and present a sound and complete axiom system. We
also discuss some variants of this modal logic.
Keywords: Modal logic, vague notions, most, filter, knowledge representation.
1 Introduction
We examine modal logical systems, with generalized operators, for the precise
treatment of assertions involving some versions of vague notions such as ‘often’,
‘a meaningful subset of a whole’, ‘most’, ‘generally’ etc. We wish to express these
vague notions and reason about them.
Vague notions, such as those mentioned above, occur often in ordinary language and in some branches of science, some examples being “most bodies expand when heated” and “typical birds fly”. Vague terms such as ‘likely’ and
‘prone’ are often used in more elaborate expressions involving ‘propensity’, e.g.
“A patient whose genetic background indicates a certain propensity is prone
to some ailments”. A precise treatment of these notions is required for reasoning about them. Generalized quantifiers have been employed to capture some
traditional mathematical notions [2] and defaults [10]. A logic with various generalized quantifiers has been suggested to treat quantified sentences in natural
language [1] and an extension of first-order logic with generalized quantifiers for
capturing a sense of ‘generally’ is presented in [5]. The idea of this approach
is formulating ‘most’ as ‘holding almost universally’. This seems quite natural,
once we interpret ‘most’ as “all, but for a ‘negligible’ set of exceptions”.
Modal logics are specification formalisms which are simpler to handle
than first-order logic, due to the hiding of variables and quantifiers through the
modal operators (box and diamond). In this paper we present a modal counterpart of filter logic, internalizing the generalized quantifier through a new
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 1–10, 2004.
© Springer-Verlag Berlin Heidelberg 2004
modality whose behavior is intermediate between those of the classical modal
operators (box and diamond). Thus one will be able to express assertions such as
“a reply to a message will be received almost always”, “eventually a reply to a
message will be received almost always”, “the system generally operates
correctly”, etc.
An important class of problems involves the stable property detection. In
a more concrete setting consider the following situation. A stable property is
one which, once it becomes true, remains true forever: deadlock, termination
and loss of a token are examples. In these problems, processes communicate
by sending and receiving messages. A process can record its own state and the
messages it sends and receives, nothing else.
Many problems in distributed systems can be solved by detecting global
states. An example of this kind of algorithm is the Chandy and Lamport Distributed Snapshots algorithm for determining global states of distributed systems
[6]. Each process records its own state and the two processes that a channel is
incident on cooperate in recording the channel state. One cannot ensure that
the states of all processes and channels will be recorded at the same instant,
because there is no global clock; however, we require that the recorded process
and channel states form a meaningful global state. The following text illustrates
this problem [6]: “The state detection algorithm plays the role of a group of
photographers observing a panoramic, dynamic scene, such as a sky filled with
migrating birds – a scene so vast that it cannot be captured by a single photograph. The photographers must take several snapshots and piece the snapshots
together to form a picture of the overall scene. The snapshots cannot all be taken
at precisely the same instant because of synchronization problems. Furthermore,
the photographers should not disturb the process that is being photographed;
(...) Yet, the composite picture should be meaningful. The problem before us
is to define ‘meaningful’ and then to determine how the photographs should be
taken.”
If we take the new modality to capture the notion of ‘meaningful’, then a
formula built with it means: “the formula is true in a meaningful set of states”.
Returning to the example of the Chandy and Lamport algorithm, a formula of
this kind would mean “if, in a meaningful set of states, for each pair of processes
i and j, the snapshot of process i’s local state has a given property, the snapshot
of process j’s local state has a given property, and the snapshot of the state of
channel ij has a given property, then it is always the case that the global stable
property holds forever”. So we can express
relationships among local process states, global system states and distributed
computation properties even if we cannot perfectly identify the global state at
each time; for the purpose of evaluating stable properties, a set of meaningful
states that can be figured out from the local snapshots collected should suffice.
Another interesting example comes from Game Theory. In Extensive Games
with Imperfect Information (well defined in [9]), a player may not be sure about
TEAM LinG
On Modalities for Vague Notions
3
the complete past history that has already been played. But, based on a meaningful part of the history he/she has in mind, he/she may still be able to decide
which action to choose. The following formula can express this fact
The formula above means: “it is always the case that, if it is player’s turn and
properties
are true in a meaningful part of his/her history, then player
should choose action to perform”. This is in fact the way many absent-minded
players reason, especially in games with lots of turns like ‘War’, Chess, or even
a financial market game.
We present a sound and complete axiomatization for generalized modal logic
as a first step towards the development of a modal framework for generalized
logics where one can take advantage of the existing frameworks for modal logics
extending them to these logics.
The structure of this paper is as follows. We begin by motivating families,
such as filters, for capturing some intuitive ideas of ‘generally’. Next, we briefly
review a system for reasoning about generalized assertions in Sect. 3. In Sect.
4, we introduce our modal filter logic. In Sect. 5 we comment on how to adapt
our ideas to some variants of vague modal logics. Sect. 6 gives some concluding
remarks.
2 Assigning Precise Meaning to Generalized Notions
We now indicate how one can arrive at the idea of filters [4] for capturing some
intuitive ideas of ‘most’, ‘meaningful’, etc. Our approach relies on the familiar
intuition of ‘most’ as “all but for a ‘negligible’ set of exceptions” as well as on
some related notions. We discuss, trying to explain, some issues in the treatment
of ‘most’, and the same approach can be applied in treating ‘meaningful’, ‘often’,
etc.
2.1 Some Accounts for ‘Most’
Various interpretations seem to be associated with vague notions of ‘most’. The
intended meaning of “most objects have a given property” can be given either
directly, in terms of the set of objects having the property, or by means of the
set of exceptions, those failing to have it. In either case, a precise formulation
hinges on some ideas concerning these sets. We shall now examine some proposals
stemming from accounts for ‘most’.
Some accounts for ‘most’ try to explain it in terms of relative frequency
or size. For instance, one would understand “most Brazilians like football” as
the “the Brazilians that like football form a ‘likely’ portion”, with more than,
say, 75% of the population, or “the Brazilians that like football form a ‘large’
set”, in that their number is above, say, 120 million. These accounts of ‘most’
may be termed “metric”, as they try to reduce it to a measurable aspect, so
to speak. They seek to explicate “most people have a given property” as “the
people having the property form a ‘likely’ (or ‘large’) set”, i.e. a set having ‘high’ relative frequency
(or cardinality), with ‘high’ understood as above a given threshold. The next
example shows a relaxed variant of these metric accounts.
Example 1. Imagine that one accepts the assertions “most naturals are larger
than fifteen” and “most naturals do not divide twelve” about the universe of
natural numbers. Then, one would probably accept also the assertions:
“Most naturals are larger than fifteen or even”
“Most naturals are larger than fifteen and do not divide twelve”
Acceptance of the first two assertions, as well as inferring the disjunctive assertion
from them, might be explained by metric accounts, but this does not seem to be
the case with the conjunctive assertion. A possible account for this situation is as
follows. Both sets F, of naturals below fifteen, and T, of divisors of twelve, are
finite. So, their union still forms a finite set.
This example uses an account based on sizes of the extensions: it explains
“most naturals have a given property” as “the naturals failing to have it form a ‘small’
set”, where ‘small’ is taken as finite. Similarly, one would interpret “most reals
are irrational” as “the rational reals form a ‘small’ set”, with ‘small’ now understood as (at most) denumerable. This account is still quantitative, but more
relaxed. It explicates “most objects have a given property” as “those failing to have it
form a ‘small’ set”, in a given sense of ‘small’.
As more neutral names encompassing these notions, we prefer to use ‘sizable’,
instead of ‘large’ or ‘likely’, and ‘negligible’ for ‘unlikely’ or ‘small’. The previous
terms are vague, and even more so the new ones. This, however, may be advantageous. The reliance on a – somewhat arbitrary – threshold is less stringent and
they have a wider range of applications, stemming from the liberal interpretation
of ‘sizable’ as carrying considerable weight or importance. Notice that these notions of ‘sizable’ and ‘negligible’ are relative to the situation. (In “uninteresting
meetings are those attended only by junior staff”, the sets including only junior
staff members are understood as ‘negligible’.)
2.2 Families for ‘Most’
We now indicate how the preceding ideas can be conveyed by means of families,
thus leading to filters [4] for capturing some notions of ‘most’. One can understand “most birds fly” as “the non-flying birds form a ‘negligible’ set”. This
indicates that the intended meaning of “most objects have a given property” may be rendered
as “the set of objects failing to have it is negligible”, in the sense that it is in
a given family of negligible sets. The relative character of ‘negligible’ (and ‘sizable’) is embodied in the family of negligible sets, which may vary according
to the situation. Such families, however, can be expected to share some general
properties, if they are to be appropriate for capturing notions of ‘sizable’, such
as ‘large’ or ‘likely’. Some properties that such a family may, or may not, be
expected to have are illustrated in the next example.
Example 2. Imagine that one accepts the assertions:
“Most American males like beer”
“Most American males like sports” and
“Most American males are Democrats or Republicans”
In this case, one is likely to accept also the two assertions:
“Most American males like beverages”
“Most American males like beer and sports”
Acceptance of the first of these should be clear. As for the second,
its acceptance may be explained by exceptions. (As the exceptional sets of non beer-lovers and of non-sports-lovers have negligibly few elements, it is reasonable to say that “negligibly
few American males fail to like beer or sports”, so “most American males like
beer and sports”.) In contrast, even though one accepts the assertion about
Democrats and Republicans, neither one of the
assertions “most American males are Democrats” and “most American males
are Republicans” seems to be equally acceptable.
This example hinges on the following ideas: if B is included in W and B has ‘most’
elements, then W also has ‘most’ elements; if two sets both have ‘negligibly few’
elements, then their union will also have ‘negligibly few’ elements; and a union of
two sets D and R may have ‘most’ elements without either D or R having ‘most’ elements.
We now postulate reasonable properties of a family of negligible
sets (in the sense of carrying little weight or importance) of a universe V:
“subsets of negligible sets are negligible”;
“the empty set is negligible”;
“the universe V is not negligible”;
“unions of negligible sets are negligible”.
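These four postulates can be stated compactly. The following LaTeX rendering is a reconstruction from the verbal glosses above; the symbol for the family of negligible sets is notation assumed here for illustration, not taken from the text.

% Reconstruction of the postulates; \mathcal{N} is assumed notation for the family of negligible sets.
\begin{align*}
(\subseteq)\;& N \in \mathcal{N} \text{ and } N' \subseteq N \;\Rightarrow\; N' \in \mathcal{N} && \text{subsets of negligible sets are negligible}\\
(\emptyset)\;& \emptyset \in \mathcal{N} && \text{the empty set is negligible}\\
(V)\;& V \notin \mathcal{N} && \text{the universe } V \text{ is not negligible}\\
(\cup)\;& N, N' \in \mathcal{N} \;\Rightarrow\; N \cup N' \in \mathcal{N} && \text{unions of negligible sets are negligible}
\end{align*}

Under postulates of this shape the family of negligible sets is a non-empty, proper collection closed under subsets and unions, i.e. an ideal over V, and the dual family of complements (the ‘sizable’ sets) is a proper filter, matching the observation below.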
These postulates can be explained by means of a notion of ‘having about the
same importance’ [12]. The postulate that the empty set is negligible and postulate (V)
concern the non-triviality of our notion of ‘negligible’. Also, the postulate on unions
is not necessarily satisfied by families that may
be appropriate for some weaker notions, such as ‘several’ or ‘many’. In virtue
of these postulates, the family
of negligible sets is non-empty and proper as
well as closed under subsets and union, thus forming an ideal. Dually, a family
of sizable sets – of those having ‘most’ elements – is a proper filter (but not
necessarily an ultrafilter [4]).
Conversely, each proper filter gives rise to a non-trivial notion of ‘most’. Thus,
the interpretation of “most objects have a given property” as “the set of objects failing
to have it is negligible” amounts to “the set of objects having it belongs to a
given proper filter”. The properties of the family of negligible sets are intuitive and coincide
with those of ideals. As the notion of ‘most’ was taken as the dual of ‘negligible’,
it is natural to explain families of sizable sets in terms of filters (dual of ideals).
So, generalized quantifiers, ranging over families of sets [1], appear natural to
capture these notions.
3 Filter Logic
Filter logic extends classical first-order logic by a generalised quantifier
whose
intended interpretation is ‘generally’. In this section we briefly review filter logic:
its syntax, semantics and axiomatics.
Given a signature, we let the usual first-order language (with equality) of that
signature be extended by the new operator: the formulas of the extended language
are built by the usual formation rules together with a new variable-binding
formation rule for generalized formulas: for each variable, if a formula belongs to
the language, then so does the generalized formula binding that variable.
Example 3. Consider a signature with a binary predicate L (on persons), where
L stands for “loves”. Some assertions expressed by sentences of the extended
language are: “people generally love everybody”, “somebody loves people in
general”, and “people generally love each other”. With a binary predicate for
“is taller than”, we can also express “people generally are taller than a given
person” and “a given person is taller than people in general”.
The semantic interpretation for ‘generally’ is provided by enriching first-order
structures with families of subsets and extending the definition of satisfaction to
the new quantifier. A filter structure for a signature consists of a usual structure
for it together with a filter over the universe A of the structure. We extend the
usual definition of satisfaction of a formula in a structure under an assignment to
its (free) variables as follows: a generalized formula is satisfied iff the set of
elements satisfying its matrix is in the filter.
As usual, satisfaction of a formula hinges only on the realizations assigned to
its symbols. Thus, satisfaction for purely first-order formulas (without the new
quantifier) does not depend on the family of subsets. Other semantic notions, such
as reduct, model and validity, are as usual [4, 7]. The behavior of the generalized
quantifier is intermediate between those of the classical quantifiers.
A deductive system for the logic of ‘generally’ is formalized by adding axiom
schemata, coding properties of filters, to a calculus for classical first-order logic.
To set up a deductive system for filter logic one takes a sound and complete
deductive calculus for classical first-order logic, with Modus Ponens (MP) as
the sole inference rule (as in [7]), and extends its set A of axiom schemata by
adding a set of new axiom schemata (coding properties of filters). This set
consists of all the generalizations of the following five schemata, where the
schematic letters range over formulas of the extended language and the last
schema involves a new variable not occurring in the formula.
These schemata express properties of filters, the last one covering alphabetic
variants. Other usual deductive notions, such as (maximal) consistent sets,
witnesses and conservative extension [4, 7], can be easily adapted. So, filter
derivability amounts to first-order derivability from the filter schemata. Hence,
we have monotonicity and substitutivity of equivalents.
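As a hedged reconstruction, schemata of the following shape are standardly used in the literature on filter logic for a ‘generally’ quantifier written here as ∇ (notation assumed); they are offered as a plausible sketch of the five schemata referred to above, not as a verbatim quotation of this paper.

% Plausible reconstruction of the filter schemata; notation and exact formulation assumed.
\begin{align*}
&\forall x\,\varphi \rightarrow \nabla x\,\varphi\\
&\nabla x\,\varphi \rightarrow \exists x\,\varphi\\
&(\nabla x\,\varphi \wedge \nabla x\,\psi) \rightarrow \nabla x\,(\varphi \wedge \psi)\\
&\forall x\,(\varphi \rightarrow \psi) \rightarrow (\nabla x\,\varphi \rightarrow \nabla x\,\psi)\\
&\nabla x\,\varphi \rightarrow \nabla y\,\varphi[y/x] \quad (y \text{ a new variable not occurring in } \varphi)
\end{align*}

The first two schemata place the new quantifier between the universal and the existential ones, the third and fourth code closure of the filter under intersections and supersets, and the last covers alphabetic variants, in line with the verbal description above.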
This deductive system is sound and complete for filter logic, which is a proper
conservative extension of classical first-order logic. It is not difficult to see that
we have a conservative extension of classical logic: for formulas without the new
quantifier, derivability in filter logic coincides with classical derivability. We have
a proper extension of classical logic, because sentences involving the new
quantifier cannot be expressed without it.
4 Serial Local Filter Logic
In this section, we examine modal logics to deal with vague notions. As pointed
out in Sect. 1, these notions play an important role in computing, knowledge
representation, natural language processing, game theory, etc.
In order to introduce the main ideas, consider the following situation. Imagine
we wish to examine properties of animals and their offspring. For this purpose,
we consider a universe of animals and a binary relation corresponding to “being
an offspring of”. Suppose we want to express “every offspring of a black animal
is dark”; this can be done with the box modality. Similarly, the diamond modality
expresses “some offspring of a black animal is dark”. Now, how
do we express the vague assertion “most offspring of black animals are dark”?
A natural candidate would be a formula built with a new vague modality
for ‘most’, interpreted as “a sizable portion of the offspring is
dark”. Thus, the new modality captures a notion of “most states among the reachable ones”.
This is a local notion of vagueness. (In the FOL version, sorted generalized
quantifiers were used for local notions.) One may also encounter global notions
of vagueness. For instance, in “most animals are herbivorous”, ‘most’ does not
seem to be linked to the reachable states (see also Sect. 6).
The alphabet of serial local filter logic (SLF) is that of basic modal logic with
a new modality
The formulas are obtained by closing the set of formulas of
basic modal logic by the rule:
Frames, models and rooted models of SLF are much as in basic modal
logic. For each state, we consider the set of states in the frame that are accessible
from it. Semantics of the new modality is given by a family of filters, one for each
state in a frame. A model of SLF is a 4-tuple consisting of a serial frame (R is
serial, i.e., every state has at least one accessible state), a valuation V, as usual,
and a family of filters, one filter over S for each state.
Satisfaction of a formula in a rooted arrow model is defined as in basic modal
logic, with an extra clause for the new modality, stated in terms of the set of
states that satisfy a formula in a model. A formula is a consequence of a set of
formulas in SLF when every rooted arrow model that satisfies the set also
satisfies the formula, as usual.
A deductive system for SLF is obtained by extending the deductive system
for normal modal logic [14] with the axiom
for seriality and the
following modal versions of the axioms for filter first-order logic:
We write
to express that formula is derivable from set in SLF. The
notion of derivability is defined as usual, considering the rules of necessitation
and Modus Ponens.
Completeness
It is an easy exercise to prove the Soundness Theorem for SLF.
We now prove the Completeness Theorem for SLF, using the
canonical model construction.
We start with the canonical model
of basic modal logic
[3]1. Since we have axiom
model
is a serial model2.
It remains to define a family
of filters over
For this purpose, we will
introduce some notation and obtain a few preliminary results.
Define
and
Proposition 1. For every
closed under intersection.
(i)
(ii)
and (iii)
is
Proof. (i) For all
(as
is an MCS). Thus,
Given
by Necessitation and
we have
Thus
(ii) Assume
Then, for some formula we have
i. e.,
for some By
we have
i. e., there is some
with
But since
for all
a contradiction. (iii)
From
we have
As a result, each family
has the finite intersection property. Now, let
be the closure of
under supersets. Note that
is a proper filter over
Proposition 2.
Proof.
Thus, by
Define
Clear.
(i. e.,
and
iff
Suppose
and
we have
Then,
for some
Now,
Hence,
for some
3
.
Define the canonical SLF model to be
Then we can prove the Satisfiability Lemma
by induction on formulas. Completeness is an easy corollary.
1
2
3
to be
iff
is the set of maximal consistent sets of formulæ.
Recall that
iff
and
Also, given
if
then there is some
s. t.
[3].
If
then
for if
then there exists
s. t.
Thus
(by consistency and maximality), i. e.,
and
Thus we have
as
and
Hence
a contradiction.
5 Variants of Vague Modal Logics
We now comment on some variants of vague modal logics.
Variants of Local Filter Logics. First note that the choice of serial models
is a natural one, in the presence of
and
i. e.,
whence
An alternative choice would be non-serial local filter
logics where one takes a filter over the extended universe
for
each
and the corresponding axiom system
where
and
with
iff
Soundness and completeness
can be obtained in analogous fashion.
Other Local Modal Logics. The serial local filter axioms encode properties of
filters through closure under supersets, closure under intersections,
and non-emptiness axioms. Our approach is modular, being easily adapted
to encode properties of other structures, e.g., to encode families that are up-closed, one removes the intersection axiom;
to encode lattices one replaces
axiom
by the
where
For those systems one
obtains soundness and completeness results with respect to semantics of the
being given by a family of up-closed sets and a family of lattices,
respectively, along the same lines we provided for SLF logics. (In these cases, one
takes
in the construction of the canonical model.)
6 Conclusions
Assertions and arguments involving vague notions, such as ‘generally’, ‘most’
and ‘typical’ occur often in ordinary language and in some branches of science.
Modal logics are specification mechanisms simpler to handle than first-order
logic.
We have examined a modal logic, with a new generality modality for expressing and reasoning about a local version of ‘most’ as motivated by the hereditary
example in Sect. 4. We presented a sound and complete axiom system for local
generalized modal logic, where the locality aspect corresponds to the intended
meaning of the generalized modality: “most of the reachable states have the
property”. (We thank one of the referees for an equivalent axiomatization for
SLF. It seems that it works only for filters, being more difficult to adapt to other
structures.)
Some global generalized notions could appear in ordinary language, for instance; “most black animals will have most offspring dark”. The first occurrence
of ‘most’ is global (among all animals) while the second is a local one (referring
to most offspring of each black animal considered). In this case one could have
two generalized operators: a global one,
and a local one,
Semantically
would refer to a filter (over the set of states) in a way analogous to the universal
modality [8].
Other variants of generalized modal logics occur when one considers multimodal generalized logics as motivated by the following example. In a chess game
setting, a state is a chessboard configuration. States can be related by different
ways, depending on which piece is moved. Thus, one would have
for: is
a chessboard configuration resulting from a queen’s move (in state
for:
is the chessboard configuration resulting from a pawn’s move (in state
etc.
This suggests having
etc. Note that with pawn’s moves one can reach
fewer states of the chessboard than with queen’s moves, i. e.,
is (absolutely) large, while
is not. Thus, we would have
holding in all states and
not holding in all states. On the other hand,
among the pawn’s moves many may be good, that is:
is large within
(i. e.
In this fashion one has a wide spectrum of new modalities and relations
among them to be investigated. We hope the ideas presented in this paper provide a first step towards the development of a modal framework for generalized
logics where vague notions can be represented and be manipulated in a precise
way and the relations among them investigated (e. g. relate important with very
important, etc.). By setting this analysis in a modal environment one can further
take advantage of the machinery for modal logics [3], adapting it to these logics
for vague notions.
References
1. Barwise, J., Cooper, R.: Generalized quantifiers and natural language. Linguistics
and Philosophy 4 (1981) 159–219
2. Barwise, J., Feferman, S.: Model-Theoretic Logics, Springer, New York (1985)
3. Blackburn, P., de Rijke, M., Venema, Y.: Modal Logic, Cambridge University Press,
Cambridge (2001)
4. Chang, C., Keisler, H.: Model Theory, North-Holland, Amsterdam (1973)
5. Carnielli, W., Veloso, P.: Ultrafilter logic and generic reasoning. In Abstr. Workshop
on Logic, Language, Information and Computation, Recife (1994)
6. Chandy, K., Lamport, L.: Distributed Snapshot: Determining Global States of
Distributed Systems. ACM Transactions on Computer Systems 3 (1985) 63–75
7. Enderton, H.: A Mathematical Introduction to Logic, Academic Press, New York
(1972)
8. Goranko, V., Passy, S.: Using the Universal Modality: Gains and Questions. Journal of Logic and Computation 2 (1992) 5–30
9. Osborne, M., Rubinstein, A.: A Course in Game Theory, MIT, Cambridge (1998)
10. Schlechta, K.: Defaults as generalized quantifiers. Journal of Logic and Computation 5 (1995) 473–494
11. Turner, W.: Logics for Artificial Intelligence, Ellis Horwood, Chichester (1984)
12. Veloso, P.: On ‘almost all’ and some presuppositions. Manuscrito XXII (1999)
469–505
13. Veloso, P.: On modulated logics for ‘generally’. In EBL’03 (2003)
14. Venema, Y.: A crash course in arrow logic. In Marx, M., Pólos, L., Masuch, M.
(eds.), Arrow Logic and Multi-Modal logic, CSLI, Stanford (1996) 3–34
Towards Polynomial Approximations
of Full Propositional Logic
Marcelo Finger*
Departamento de Ciência da Computação, IME-USP
[email protected]
Abstract. The aim of this paper is to study a family of logics that approximates classical inference, in which every step in the approximation
can be decided in polynomial time. For clausal logic, this task has been
shown to be possible by Dalal [4, 5]. However, Dalal’s approach cannot be
applied to full classical logic. In this paper we provide a family of logics,
called Limited Bivaluation Logics, via a semantic approach to approximation that applies to full classical logic. Two approximation families
are built on it. One is parametric and can be used in a depth-first approximation of classical logic. The other follows Dalal’s spirit, and with
a different technique we show that it performs at least as well as Dalal’s
polynomial approach over clausal logic.
1 Introduction
Logic has been used in several areas of Artificial Intelligence as a tool for modelling an intelligent agent's reasoning capabilities. However, the computational
costs associated with logical reasoning have always been a limitation. Even if
we restrict ourselves to classical propositional logic, deciding whether a set of
formulas logically implies a certain formula is a co-NP-complete problem [9].
To address this problem, researchers have proposed several ways of approximating classical reasoning. Cadoli and Schaerf have proposed the use of approximate entailment as a way of reaching at least partial results when solving
a problem completely would be too expensive [13]. Their influential method is
parametric, that is, a set S of atoms is the basis to define a logic. As we add more
atoms to S, we get “closer” to classical logic, and eventually, when S contains all
propositional symbols, we reach classical logic. In fact, Schaerf and Cadoli proposed two families of logics, intending to approximate classical entailment from
two ends. The first family approximates classical logic from below, in the following
sense: as the parameter set S of propositions grows, the entailment relation of the
corresponding logic in the family grows as well, and it is always contained in the
entailment relation of classical logic (CL).
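In symbols, with the subscripted entailment notation assumed here for illustration (it is not quoted from the paper), the intended picture is:

% Assumed notation: \models_S is the approximate entailment parameterized by the atom set S,
% and CL stands for classical logic.
\[
S \subseteq S' \;\Longrightarrow\; {\models_S} \;\subseteq\; {\models_{S'}} \;\subseteq\; {\models_{CL}}
\]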
* Partly supported by CNPq grant PQ 300597/95-5 and FAPESP project 03/00312-0.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 11–20, 2004.
© Springer-Verlag Berlin Heidelberg 2004
Approximating classical logic from below is useful for efficient theorem proving. Conversely, approximating classical logic from above is useful for disproving
theorems, which is the satisfiability (SAT) problem and has a similar formulation. In this work we concentrate only on theorem proving and approximations
from below.
The notion of approximation is also related to the notion of an anytime
decision procedure, that is, an algorithm that, if stopped at any time during the
computation, provides an approximate answer, that is, an answer of the form
“up to a given logic in the family, the result is/is not provable”. This kind of
anytime algorithm has been suggested by the proponents of the Knowledge Compilation
approach [14, 15], in which a theory was transformed into a set of polynomially
decidable Horn-clause theories. However, the compilation process is itself NP-complete.
Dalal’s approximation method [4] was the first one designed such that each
reasoner in an approximation family can be decided in polynomial time. Dalal’s
initial approach was algebraic only. A model-theoretic semantics was provided
in [5]. However, this approach was restricted to clausal form logic only.
In this work, we generalize Dalal’s approach. We create a family of logics
of Limited Bivalence (LB) that approximates full propositional logic. We provide a model-theoretic semantics and two entailment relations based on it. The
first entailment is a parametric approximation on the set of formulas and follows
Cadoli and Schaerf’s approximation paradigm. The second entailment follows
Dalal’s approach, and we show that for clausal form theories it is polynomially
decidable and serves as a semantics for Dalal’s inference.
This family of approximations is useful in defining families of efficiently decidable sets of theorems with increasing complexity.
This paper proceeds as follows. The next section briefly presents Dalal’s approximation strategy, its semantics and its limitations. In Section 3 we present
the family of Limited Bivaluation Logics; the semantics for full propositional
logic is provided in Section 3.1, and the parametric entailment is presented in
Section 3.2. The Dalal-style entailment is presented in Section 3.4, and the
soundness and completeness of Dalal’s inference with respect to it are shown in
Sections 3.3 and 3.4.
Notation: We fix a countable set of propositional letters. We concentrate on
the classical propositional language formed by the usual boolean connectives
→ (implication), ∧ (conjunction), ∨ (disjunction) and ¬ (negation).
Throughout the paper, we use lowercase Latin letters to denote propositional
letters, lowercase Greek letters to denote formulas, clauses and literals, and
uppercase Greek letters to denote sets of formulas. We also refer to the set of all
propositional letters occurring in a formula, and similarly for a set of formulas.
Due to space limitations, some proofs of lemmas have been omitted.
2 Dalal’s Polynomial Approximation Strategy
Dalal specifies a family of anytime reasoners based on an equivalence relation
between formulas [4]. The family is composed of a sequence of reasoners
such that each one is tractable, each one is at least as complete (with respect
to classical logic) as its predecessor, and for each theory there is a reasoner in
the family that is complete to reason with it.
The equivalence relation that serves as a basis for the construction of a family
has to obey several restrictions to be admissible, namely it has to be sound,
modular, independent, irredundant and simplifying [4].
Dalal provides as an example a family of reasoners based on the classically
sound but incomplete inference rule known as BCP (Boolean Constraint Propagation) [12], which is a variant of unit resolution [3]. For the initial presentation,
no proof-theoretic or model-theoretic semantics were provided for BCP, but an
algebraic presentation of a BCP-equivalence between clause sets was provided. For that,
consider a theory as a set of clauses, where a disjunction of zero literals is denoted by f and the conjunction of zero clauses is denoted t. The complement of a
formula is obtained by pushing negation inside in the usual way using De Morgan’s
Laws until the atoms are reached, at which point an atom and its negation are
exchanged. The BCP-equivalence is then defined by a set of identities between
clause sets, stated in terms of literals.
The idea is to use the equivalence relation to generate an inference in which
a clause can be inferred from a theory if adding the complement of the clause
to the theory yields a set that is BCP-equivalent to an inconsistency.
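Since the equational presentation is not reproduced here, the following Python sketch illustrates the underlying procedure in a generic form: unit resolution run to saturation, plus the refutation-style inference test just described. It is an illustration of BCP as unit propagation, not Dalal’s exact equational system; the clause encoding (sets of signed atoms) and the function names are choices made for this sketch.

from typing import Iterable, Set, Tuple

Literal = Tuple[str, bool]   # ("p", True) stands for the atom p, ("p", False) for its negation

def neg(lit: Literal) -> Literal:
    atom, positive = lit
    return (atom, not positive)

def bcp_inconsistent(clauses: Iterable[Set[Literal]]) -> bool:
    """Unit propagation run to saturation; reports whether the empty clause (f) is derived."""
    cs = [set(c) for c in clauses]
    while True:
        if any(len(c) == 0 for c in cs):
            return True                                # empty clause reached: inconsistency
        unit = next((c for c in cs if len(c) == 1), None)
        if unit is None:
            return False                               # no unit clause left: propagation stabilized
        lit = next(iter(unit))
        # Commit to lit: drop clauses satisfied by it and delete its complement from the rest.
        cs = [c - {neg(lit)} for c in cs if lit not in c]

def bcp_infers(theory: Iterable[Set[Literal]], goal: Set[Literal]) -> bool:
    """Refutation-style test: add the complements of the goal's literals as unit clauses
    and check whether unit propagation derives the empty clause."""
    return bcp_inconsistent(list(theory) + [{neg(l)} for l in goal])

# Example: from the clauses {p} and {not-p, q}, the clause {q} is inferred.
p, q = ("p", True), ("q", True)
assert bcp_infers([{p}, {neg(p), q}], {q})

Each propagation step removes at least one clause, so this sketch runs in time polynomial in the size of the clause set, which is in keeping with the tractability of each approximation step.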
Dalal presents an example1 of a theory for which certain BCP inferences hold
while another classically expected one fails. This example shows that BCP
is unable to use a previously inferred clause to infer a further consequence.
Based on this fact comes the proposal of an anytime family of
reasoners.
2.1 The Family of Reasoners
Dalal defines a family of incomplete reasoners, where each member of the family
is given by two rules, in which the size of a clause is the number of literals it
contains.
1 This example is extracted from [5].
The first rule tells us that every BCP inference is also an inference at each level
of the family. The second rule tells us that if a clause was inferred from a theory,
it can be used as a further hypothesis to infer another formula, and its size is at
most the level k, then that formula can also
be inferred from the theory.
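Read literally, the two conditions just described suggest rules of the following shape; this is a reconstruction from the verbal description above, with the notation for the level-k reasoner assumed here rather than quoted from the paper.

% Reconstruction from the verbal description; notation assumed.
\begin{align*}
&\text{(1)}\quad \Gamma \vdash_{\mathrm{BCP}} \alpha \;\Longrightarrow\; \Gamma \vdash^{k}_{\mathrm{BCP}} \alpha\\
&\text{(2)}\quad \Gamma \vdash^{k}_{\mathrm{BCP}} \gamma,\ \ \Gamma \cup \{\gamma\} \vdash^{k}_{\mathrm{BCP}} \alpha,\ \ |\gamma| \le k
   \;\Longrightarrow\; \Gamma \vdash^{k}_{\mathrm{BCP}} \alpha
\end{align*}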
Dalal shows that this is indeed an anytime family of reasoners, that is, for
each level the reasoner is tractable and is at least as complete as the previous
one, and if you remove the restriction on the size of the auxiliary clause in rule 2,
then the reasoner becomes complete, that is, each classically inferable clause is
inferred at some level.
2.2 Semantics
In [5], Dalal proposed a semantics for the BCP inference, based on a notion
which we briefly present here.
Dalal’s semantics is defined for sets of clauses. Given a clause, the support
set of the clause is defined as the set of all literals occurring in it. Support sets
ignore multiple occurrences of the same literal and are used to extend valuations
from atoms to clauses. According to Dalal’s semantics, a propositional valuation
is a function from atoms to real numbers.
A valuation is then extended to literals and clauses: first to the two literals on
each atom, and then to clauses via their support sets.
Valuations of literals are real numbers in [0,1], but valuations of clauses are
non-negative real numbers that can exceed 1. A valuation is a model of a clause
when the clause’s value is sufficiently high, and a countermodel of the clause
when its value is sufficiently low. Therefore it is
possible for a clause to have neither a model nor a countermodel under a given
valuation. A valuation is a model of a theory (set of clauses) if it is a model of
all clauses in it. Entailment is then defined by requiring that no model of the
theory be a countermodel of the entailed clause.
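A small sketch can make the support-set semantics concrete. The rules below, the extension of a valuation to a negated atom as one minus the atom's value, the value of a clause as the sum of the values of the literals in its support set, and the thresholds (model when the value is at least 1, countermodel when it is 0), are assumptions made for this illustration; they do reproduce the stated facts that clause values may exceed 1 and that a clause can have neither a model nor a countermodel.

def lit_value(v, lit):
    # lit is (atom, polarity); assumed rule: a negative literal gets 1 minus the atom's value
    atom, positive = lit
    return v[atom] if positive else 1.0 - v[atom]

def clause_value(v, clause):
    # assumed rule: the value of a clause is the sum over its support set
    support = set(clause)          # the literals occurring in the clause, without repetition
    return sum(lit_value(v, l) for l in support)

def is_model(v, clause):           # assumed threshold
    return clause_value(v, clause) >= 1.0

def is_countermodel(v, clause):    # assumed threshold
    return clause_value(v, clause) == 0.0

# With v(p) = 0.5, a one-literal clause on p has value 0.5 under these rules:
# the valuation is neither a model nor a countermodel of it.
v = {"p": 0.5}
clause = [("p", True)]
assert not is_model(v, clause) and not is_countermodel(v, clause)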
Proposition 1 ([5]). For every theory
and every clause
iff
So
is sound and complete with respect to
The next step is to generalize this approach to obtain a semantics of
For that, for any
a set
V of valuations is a
iff for each clause of size at most
if V has
a non-model of then V has a countermodel of
V is a
of if each
is a model of this notion extends to theories as usual.
It is then possible to define
iff there is no countermodel of in any
of
Proposition 2 ([5]). For every theory
Thus the inference
and every clause
iff
is sound and complete with respect to
2.3 Analysis of the BCP Approximation
Dalal’s notion of a family of anytime reasoners has very nice properties. First,
every step in the approximation is sound and can be decided in polynomial
time. Second, the approximation is guaranteed to converge to classical inference.
Third, every step in the approximation has a sound and complete semantics,
enabling an anytime approximation process.
However, the method based on BCP also has its limitations:
1. It only applies to clausal form formulas. Although every propositional formula is classically equivalent to a set of clauses, this equivalence may not
be preserved in any of the approximation steps. The conversion of a formula
to clausal form is costly: one either has to add new propositional letters
(increasing the complexity of the problem) or the number of clauses can be
exponential in the size of the original formula. With regards to complexity,
BCP is a form of resolution, and it is known that there are theorems that
can be proven by resolution only in exponentially many steps [2].
2. Its non-standard semantics makes it hard to compare with other logics known
in the literature, especially other approaches to approximation. Also, the semantics presented is based on support sets, which makes it impossible to
generalize to non-clausal formulas.
3. The proof theory for the BCP family is poor in computational terms. In fact,
to prove a consequence using the second rule we would have to guess an
auxiliary clause of bounded size that is inferable from the theory and that,
added to the theory, yields the desired consequence.
Since the BCP-approximations provide no method to guess this
clause, a computation would have to generate and test
all the possible clauses over the propositional
symbols occurring in the theory and the goal formula.
In the rest of this paper, we address problems 1 and 2 above. That is, we are
going to present a family of anytime reasoners for the full fragment of propositional logic, in which every approximation step has a semantics and can be
decided in polynomial time. Problem 3 will be treated in further work.
3
The Family of Logics
We present here the family of logics of Limited Bivalence,
This is a parametric family that approximates classical logic, in which every approximation
step can be decided in polynomial time. Unlike
is parameterized
by a set of formulas
when
contains all formulas of size at most
can simulate an approximation step of
The family
can be applied to the full language of propositional logic,
and not only to clausal form formulas, with an alphabet consisting of a countable
set of propositional letters (atoms)
and the connectives
and
and the usual definition of well-formed propositional formulas; the set
of all well-formed formulas is denoted by
The presentation of LB is made in
terms of a model-theoretic semantics.
3.1
Semantics of
The semantics of
is based on a three-level lattice,
where L is
a countable set of elements
is the least upper bound,
is the greatest lower bound, and is defined, as usual, as
iff
iff
1 is the top element and 0 is the bottom element.
L is subject to the conditions:
(i)
for every
and (ii)
for
This three-level lattice
is illustrated in Figure 3.1(a).
(a) The 3-Level Lattice; (b) The Converse Operation ~
This lattice is enhanced with a converse operation, ~, defined as: ~0 = 1, ~1 = 0, and every other element of L is its own converse.
This is illustrated in Figure 3.1(b).
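The lattice and its converse operation can be modelled directly; the sketch below is illustrative (the names BOT and TOP, and the treatment of two distinct middle elements as having join 1 and meet 0, are assumptions read off the three-level shape of the lattice, not the paper's notation).

```python
# A minimal model of the three-level lattice: bottom 0, top 1 and a middle
# level of pairwise-incomparable elements (represented here by strings).
BOT, TOP = 0, 1

def join(x, y):
    """Least upper bound: two distinct middle elements are assumed to join to TOP."""
    if x == y:
        return x
    if BOT in (x, y):
        return y if x == BOT else x
    return TOP

def meet(x, y):
    """Greatest lower bound: two distinct middle elements are assumed to meet at BOT."""
    if x == y:
        return x
    if TOP in (x, y):
        return y if x == TOP else x
    return BOT

def conv(x):
    """The converse operation ~: swaps 0 and 1 and fixes every middle element."""
    if x == BOT:
        return TOP
    if x == TOP:
        return BOT
    return x

# Two distinct middle elements "l1" and "l2":
assert join("l1", "l2") == TOP and meet("l1", "l2") == BOT
assert conv(conv("l1")) == "l1" and conv(BOT) == TOP
```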
We next define the notion of an unlimited valuation, and then we present
its limitations. An unlimited propositional valuation is a function
that maps atoms to elements of the lattice. We extend
to all propositional
formulas,
in the following way:
A formula can be mapped to any element of the lattice. However, the formulas
that belong to the set
are bivalent, that is, they can only be mapped to the
top or the bottom element of the lattice. Therefore, a limited valuation must
satisfy the restriction of Limited Bivalence given by, for every
In the rest of this work, by a valuation
we mean a limited valuation
subject to the condition above.
A valuation
satisfies if
and is said to be satisfiable; a set of
formulas is satisfied by
if all its formulas are satisfied by
A valuation
contradicts if
if is neither satisfied nor contradicted by
we say that
is neutral with respect to
A valuation is classical if it assigns
only 0 or 1 to all proposition symbols, and hence to all formulas.
For example, consider the formula
and
Then
if
if
then
then
if
if
if
then
then
and
then
The first four valuations coincide with a classical behavior. The last one
shows that if and are mapped to distinct neutral values, then
will be
satisfiable. Note that, in this case,
will also be satisfiable, and that
will be contradicted.
3.2
LB-Entailment
The notion of a parameterized LB-Entailment,
follows the spirit of Dalal’s
entailment relation, namely
if it is not possible to satisfy and contradict at the same time. More specifically,
if no valuation
such
that
also makes
Note that since this logic is not classical,
if
and
it is possible that the
is either neutral or satisfied
by
For example, we reconsider Dalal’s example, where
and make
We want to show that
but
To see that
suppose there is a
such that
Then we
have
and
Since it is not possible to
satisfy both, we cannot have
so
To show that
suppose there is a
such that
and
Then
and
Again,
it is not possible to satisfy both, so
Finally, to see that
take a valuation
such that
Then
However, if we make
then we have only two possibilities for
If
we have
already seen that no valuation that contradicts will satisfy
If
we
have also seen that no valuation that contradicts will satisfy
So for
we obtain
This example indicates that
behave in a similar way to
and that
by adding an atom to
we have a behavior similar to
We now have to
demonstrate that this is not a mere coincidence.
An Approximation Process. As defined in [8], a family of logics, parameterized with a set,
is said to be an approximation of classical logic “from below”
if, for increasing size of the parameter set we get closer to classical logic. That
is, for
we have that,
Lemma 1. The family of logics is an approximation of classical logic from below.
Note that for a given pair
the approximation of
can be done
in a finite number of steps. In fact, if
any formula made up of and
has the property of bivalence. In particular, if all atoms of and are in
then only classical valuations are allowed.
An approximation method as above is not in the spirit of Dalal’s approximation, but follows the paradigm of Cadoli and Schaerf [13,1], also applied by
Massacci [11,10] and Finger and Wassermann [6–8].
We now show how Dalal’s approximations can be obtained using LB.
3.3
Soundness and Completeness of
with Respect to
For the sake of this section and the following, let be a set of clauses and let
and denote clauses, and
denote literals. We now show that, for
iff
Lemma 2. Suppose BCP transforms a set of clauses into a set of clauses ; then iff
Lemma 3. iff for all valuations
Theorem 1. Let be a set of clauses and a clause. Then iff
Proof. iff for no iff, by Lemma 3, and iff iff for no
Lemma 4 (Deduction Theorem for ). Let be a set of clauses, a literal and a clause. Then the following are equivalent statements:
3.4
Soundness and Completeness of
As mentioned before, the family of entailment relations
does not follow
Dalal’s approach to approximation, so in order to obtain a sound and complete
semantics for
we need to provide another entailment relation based on
which we call
For that, let be a set of sets of formulas and define
iff there
exists a set
such that
We concentrate on the case where
is a set of clauses,
is a clause and each
is a set of atoms. We define
That is, is a set of sets of atoms of size Note that if we restrict our attention to atoms, sets of atoms. For a fixed we only have to consider a polynomial number of sets of atoms. We then write to mean
Theorem 2. Let be a set of clauses and a clause. Then iff
Proof.
By induction on the number of uses of rule 2 in the definition of
For the base case, Theorem 1 gives us the result. Assume that
due to
and
Suppose for contradiction that
then
for all
there exists
such that
and
By
the induction hypothesis,
which implies
and
which implies
So
for some
which implies that
but this cannot hold for all
a contradiction. So
Suppose
Then for some
with
and
suppose that
is a smallest set with such property. Therefore, for all with
with
we have
Choose one such
and define the set of
literals
is a literal whose atom is in
We first show that
for every
Suppose for contradiction that
for some
then there is a
with
and
but
Let
If does not occur in
then
which contradicts the minimality of
So
or
Consider
a
such that
if
maps to 0 or 1 it is a
so
if
for some
then clearly we have that
so
which contradicts the minimality of
It
follows that
We now show that
Suppose for contradiction that
Then, by Theorem 1,
that is, there exists
such that
and
However, such
maps all atoms of to 0 or 1, so it is actually
a
that contradicts
So
If
then clearly
So suppose
In this case, we
show that
Let
we prove by induction that for
From
and Theorem 1 we know that
there is a valuation
such that
and
From
we infer that there must exist a
such that
without loss of
generality, let
Suppose for contradiction that
Then there exists a valuation
such that
but
which contradicts
So
Now note that for
otherwise the minimality
of would be violated. From Theorem 1 we know that there is a valuation
such that
and
From
we infer
that there must exist a
such that
without loss
of generality, let
Suppose for contradiction that
Then there exists a valuation
such that
but
but this contradicts
So
Thus we have that
It
follows that
as desired. Finally, from
and
we obtain that
and the result is proved.
The technique above differs considerably from Dalal’s use of the notion of
vividness. It follows from Dalal’s result that each approximation step
is
decidable in polynomial time.
4
Conclusions and Future Work
In this paper we presented the family of logics
and provided it with a
lattice-based semantics. We showed that it can be a basis for both a parametric
and a polynomial clausal approximation of classical logic. This semantics is sound
and complete with respect to Dalal’s polynomial approximations
Future work should extend polynomial approximations to non-clausal logics.
It should also provide a proof-theory for these approximations.
References
1. Marco Cadoli and Marco Schaerf. The complexity of entailment in propositional
multivalued logics. Annals of Mathematics and Artificial Intelligence, 18(1):29–50,
1996.
2. Alessandra Carbone and Stephen Semmes. A Graphic Apology for Symmetry and
Implicitness. Oxford Mathematical Monographs. Oxford University Press, 2000.
3. C. Chang and R. Lee. Symbolic Logic and Mechanical Theorem Proving. Academic
Press, London, 1973.
4. Mukesh Dalal. Anytime families of tractable propositional reasoners. In International Symposium of Artificial Intelligence and Mathematics AI/MATH-96, pages
42–45, 1996.
5. Mukesh Dalal. Semantics of an anytime family of reasoners. In 12th European
Conference on Artificial Intelligence, pages 360–364, 1996.
6. Marcelo Finger and Renata Wassermann. Expressivity and control in limited reasoning. In Frank van Harmelen, editor, 15th European Conference on Artificial
Intelligence (ECAI02), pages 272–276, Lyon, France, 2002. IOS Press.
7. Marcelo Finger and Renata Wassermann. The universe of approximations. In Ruy
de Queiroz, Elaine Pimentel, and Lucilia Figueiredo, editors, Electronic Notes in
Theoretical Computer Science, volume 84, pages 1–14. Elsevier, 2003.
8. Marcelo Finger and Renata Wassermann. Approximate and limited reasoning: Semantics, proof theory, expressivity and control. Journal of Logic And Computation,
14(2):179–204, 2004.
9. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the
Theory of NP-Completeness. Freeman, 1979.
10. Fabio Massacci. Anytime approximate modal reasoning. In Jack Mostow and
Charles Rich, editors, AAAI-98, pages 274–279. AAAIP, 1998.
11. Fabio Massacci. Efficient Approximate Deduction and an Application to Computer
Security. PhD thesis, Dottorato in Ingegneria Informatica, Università di Roma I
“La Sapienza”, Dipartimento di Informatica e Sistemistica, June 1998.
12. D. McAllester. Truth maintenance. In Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90), pages 1109–1116, 1990.
13. Marco Schaerf and Marco Cadoli. Tractable reasoning via approximation. Artificial
Intelligence, 74(2):249–310, 1995.
14. Bart Selman and Henry Kautz. Knowledge compilation using horn approximations.
In Proceedings AAAI-91, pages 904–909, July 1991.
15. Bart Selman and Henry Kautz. Knowledge compilation and theory approximation.
Journal of the ACM, 43(2):193–224, March 1996.
Using Relevance to Speed Up Inference
Some Empirical Results
Joselyto Riani and Renata Wassermann
Department of Computer Science
Institute of Mathematics and Statistics
University of São Paulo, Brazil
{joselyto,renata}@ime.usp.br
Abstract. One of the main problems in using logic for solving problems
is the high computational cost involved in inference. In this paper, we
propose the use of a notion of relevance in order to cut the search space
for a solution. Instead of trying to infer a formula directly from a large
knowledge base K, we consider first only the most relevant sentences in
K for the proof. If those are not enough, the set can be increased until,
in the worst case, we consider the whole base K.
We show how to define a notion of relevance for first-order logic with
equality and analyze the results of implementing the method and testing
it over more than 700 problems from the TPTP problem library.
Keywords: Automated theorem provers, relevance, approximate reasoning.
1
Introduction
Logic has been used as a tool for knowledge representation and reasoning in
several subareas of Artificial Intelligence, from the very beginning of the field.
Among these subareas, we can cite Diagnosis [1], Planning [2], Belief Revision
[3], etc.
One of the main criticisms against the use of logic is the high computational
costs involved in the process of making inferences and testing for consistency.
Testing satisfiability of a set of formulas is already an NP-complete problem
even if we stay within the realms of propositional logic [4]. And propositional
logic is usually not rich enough for most problems we want to represent. Adding
expressivity to the language comes at the cost of adding to the computational
complexity.
In the area of automatic theorem proving [5], the need for heuristics that help
on average cases has long been established. Recently, there have been several
proposals in the literature of heuristics that not only help computationally, but
are also based on intuitions about human reasoning. In this work, we concentrate
on the ideas of approximate reasoning and the use of relevance notions.
Approximate reasoning consists in, instead of attacking the original problem
directly, performing some simplification such that, if the simplified problem is
solved, the solution is also a solution for the original problem. If no solution is
found, then the process is restarted for a problem with complexity lying between
those of the original and the simplified problem.
That is, we are looking for a series of deduction mechanisms
with
computationally less expensive than
for
such that if
represents the theorems which can be proved using
and is a sound and
complete deduction mechanism for classical logic, we get:
An example of such kind of system is Schaerf and Cadoli’s “Approximate
Entailment” [6] for propositional logic. The idea behind their work is that at
each step of the approximation process, only some atoms of the language are
considered.
Given a set S of propositional letters, their system
disregards those
atoms outside S by allowing both and
to be assigned the truth value 1
when is not in S. If is in S, then its behavior is classical, i.e., is assigned
the truth value 1 if and only if
is assigned 0. The system
is sound but
incomplete with respect to classical logic. This means that for any S, if a formula
is an
consequence of a set of formulas, it is also a classical consequence. Since
the system is incomplete, the fact that a formula does not follow from the set
according to
does not give us information about its classical status.
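For clausal theories, the behaviour just described can be captured by a small brute-force check; the sketch below is only an illustration of the idea (the clause encoding and function names are ours), not the original system.

```python
from itertools import product

# S-3-style entailment for clausal theories: atoms outside S have both the
# atom and its negation treated as true, while atoms in S behave classically.
# A literal is (atom, positive); a clause is a collection of literals.

def lit_true(lit, assignment, S):
    atom, positive = lit
    if atom not in S:
        return True                 # both p and not-p count as true outside S
    return assignment[atom] if positive else not assignment[atom]

def s3_entails(clauses, goal_clause, S):
    """True iff every S-3 model of `clauses` also satisfies `goal_clause`."""
    S = set(S)
    atoms = sorted(S)
    for bits in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, bits))
        if all(any(lit_true(l, v, S) for l in c) for c in clauses):
            if not any(lit_true(l, v, S) for l in goal_clause):
                return False
    return True
```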
There are several other logical systems found in the literature which are also
sound and incomplete, such as relevant [7] and paraconsistent logics [8].
In this work, we present a sound and incomplete system based on a notion of
relevance. We try to prove that a sentence follows from a set of formulas K by
first considering only those elements of K which are most relevant to
If this
fails, we can add some less relevant elements and try again. In the worst case,
we will end up adding all the elements of K, but if we are lucky, we can prove
with less.
The system presented here is based on the one proposed in [9]. The original
framework was developed for propositional logic. In this paper, we extend it to
deal with first order logic and show some empirical results.
The paper proceeds as follows: in the next section, we present the idea of
using a relevance graph to structure the knowledge base, proposed in [9]. In
Section 3, we introduce a particular notion of relevance, which is based purely
on the syntactical analysis of the knowledge base. In Section 4, we show how
these ideas were implemented and the results obtained. We finally conclude and
present some ideas for future work.
2
The Relevance Graph Approach
In this section, we assume that the notion of relevance which will be used is given
and show some general results proven in [9]. In the next section, we consider a
particular notion of relevance which can be obtained directly from the formulas
considered, without the need of any extra-logical resources.
Let
be a relation between two formulas with the intended meaning that
if and only if the formulas and are directly relevant to each other.
Given such a relatedness relation, we can represent a knowledge base (a set of
formulas) as a graph where each node is a formula and there is an edge between
and if and only if
This graph representation gives us immediately
a notion of degrees of relatedness: the shorter the path between two formulas
of the base is, the more closely related they are. Another notion made clear is that
of connectedness: the connected components partition the graph into unrelated
“topics” or “subjects”. Sentences in the same connected component are somehow
related, even if far apart (see Figure 1).
Fig. 1. Structured Knowledge Base
Fig. 2. Degrees of Relevance
Definition 1. [9] Let K be a knowledge base and
be a relation between
formulas. A
between two formulas
and
in K is a sequence
of formulas such that:
1.
2.
3.
and
and
If it is clear from the context to which relation we refer we will talk simply
about a path in K.
We represent the fact that P is a path between and by
The length of a path
is
Note that the extremities of a path in K are not necessarily elements of K.
Definition 2. [9] Let K be a knowledge base and a relation between formulas of the language. We say that two formulas and are related in K by if and only if there is a path P such that
Given two formulas and
and a base K, we can use the length of the
shortest path between them in K as the degree of unrelatedness of the formulas.
If the formulas are not related in K, the degree of unrelatedness is set to infinity.
Formulas with a shorter path between them in K are more closely related in K.
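As an illustration, the degree of unrelatedness can be computed by a breadth-first search over the base; the sketch below assumes formulas are hashable objects, that the relation is given as a Boolean function, and that the length of a path is counted as its number of steps (these are our assumptions for the example).

```python
from collections import deque
from math import inf

def unrelatedness_degree(phi, psi, K, related):
    """Length of a shortest path from phi to psi whose intermediate formulas
    all belong to K, or infinity if phi and psi are not related in K."""
    if related(phi, psi):
        return 1
    visited = set()
    frontier = deque()
    for k in K:
        if related(phi, k):
            visited.add(k)
            frontier.append((k, 1))
    while frontier:
        current, steps = frontier.popleft()
        if related(current, psi):
            return steps + 1
        for k in K:
            if k not in visited and related(current, k):
                visited.add(k)
                frontier.append((k, steps + 1))
    return inf
```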
Definition 3. [9] Let K be a knowledge base,
a relation between formulas of
the language and and formulas. The unrelatedness degree of and in K
is given by:
We now show, given the structure of a knowledge base, how to retrieve the
set of formulas relevant for a given formula
Definition 4. [9] The set of formulas of K which are relevant for with degree is given by:
Definition 5. [9] The set of formulas of K which are relevant for up to degree is given by:
We say that is the set of relevant formulas for
In Figure 2, we see an example of a structured knowledge base
The dotted circles represent different
levels of relevance for
We have:
We can now define our notion of relevant inference as:
Definition 6.
if and only if
Since
is a subset of K, it is clear that if for any
then
Note however that if
we cannot say anything about whether
or not.
An interesting point of the framework above is that it is totally independent of which relevance relation is chosen. In the next section, we explore one
particular notion of relevance, which can be used with this framework.
3
Syntactical Relevance
We have seen that, given a relevance relation, we can use it to structure a
set of formulas so that the most relevant formulas can be easily retrieved. But
where does the relevance relation come from? Of course, we could consider very
sophisticated notions of relevance. But in this work, our main concern is to find
a notion that does not require any extra information to be added to the set K.
In [9], a notion of syntactical relevance is proposed (for propositional logic),
which makes
if and only if the formulas and share an atom. It can
be argued that this notion is very simplistic, but it has the advantage of being
very easy to compute (this is the relation used in Figure 1). We can also see that
it gives intuitive results. Consider the following example, borrowed from [10]1.
Example 1. Consider Paul, who is finishing school and preparing himself for the
final exams. He studied several different subjects, like Mathematics, Biology,
Geography. His knowledge base contains (among others) the beliefs in Figure 3.
When Paul gets the exam, the first question is: Do cows have molar teeth? Of course Paul cannot reason with all of his knowledge at once. First he recalls what he knows about cows and about molar teeth:
Cows eat grass.
Mammals have canine teeth or molar teeth.
From these two pieces of knowledge alone, he cannot answer the question. Since all he knows (explicitly) about cows is that they eat grass, he recalls what he knows about animals that eat grass:
Animals that eat grass do not have canine teeth.
Animals that eat grass are mammals.
Fig. 3. Student’s knowledge base
From these, Paul can now derive that cows are mammals, that mammals
have canine teeth or molar teeth, but that cows do not have canine teeth, hence
cows have molar teeth.
The example shows that usually, a system does not have to check its whole
knowledge base in order to answer a query. Moreover, it shows that the process
of retrieving information is made gradually, and not in a single step. If Paul had
to go too far in the process, he would not be able to find an answer, since the
time available for the exam is limited. But this does not mean that if he was
given more time later on, he would start reasoning from scratch: his partial (or
approximate) reasoning would be useful and he would be able to continue from
more or less where he stopped.
Using the syntactical notion of relevance, the process of creating the relevance
graph can be greatly simplified. The graph can be implemented as a bipartite
graph, where some nodes are formulas and some are atoms. The list of atoms
is organized in lexicographic order, so that it can be searched efficiently. For
every formula which is added to the graph, one only has to link it to the atoms
1 The example is based on an example of [6].
occurring in it. In this way, it will be automatically connected to every other
formula with which it shares an atom.
This notion of relevance gives us a “quick and dirty” method for retrieving
the most relevant elements of a set of formulas.
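A sketch of this bipartite organisation follows; the class and method names are ours, and atoms_of stands for whatever routine extracts the atoms of a formula.

```python
from collections import defaultdict

class RelevanceGraph:
    """Bipartite index: formula nodes linked to the atoms occurring in them,
    so that two formulas are directly relevant iff they share an atom."""

    def __init__(self, atoms_of):
        self.atoms_of = atoms_of          # function: formula -> set of atoms
        self.by_atom = defaultdict(set)   # atom -> formulas containing it

    def add(self, formula):
        for atom in self.atoms_of(formula):
            self.by_atom[atom].add(formula)

    def neighbours(self, formula):
        """Formulas of the base directly relevant to `formula`."""
        result = set()
        for atom in self.atoms_of(formula):
            result |= self.by_atom.get(atom, set())
        result.discard(formula)
        return result
```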
Epstein [11] proposes some desiderata for a binary relation intended to represent relevance. Epstein’s conditions are:
R1
R2
R3
R4
R5
iff
iff
iff
iff
or
It is easy to see that syntactical relevance satisfies Epstein’s desiderata. Moreover, Rodrigues [12] has shown that this is actually the smallest relation satisfying the conditions given in [11].
Unfortunately, propositional logic is very often not enough to express many
problems found in Artificial Intelligence. We would like to move to first-order
logic. As is well known, this makes the inference problem much harder. On the
other hand, having a problem which is hard enough is a good reason to abandon
completeness and try some heuristics.
In what follows, we adapt the definition of syntactical relevance relation to
deal with full first-order logic with equality.
Definition 7. Let be a formula. Then
is the set of non-logical constants
(constants, predicate, and function names) which occur in
Definition 8 (tentative). Let be a binary relation defined as: if and only if
It is easy to see that this relation satisfies Epstein’s desiderata.
One problem with the definition above is that we very often have predicates,
functions or constants that appear in too many formulas of the knowledge base,
and could make all formulas seem relevant to each other. In this work, we consider
one such case, which is the equality predicate (~).
Based on the work done by Epstein on relatedness for propositional logic,
Krajewski [13] has considered the difficulties involved in extending it to first-order logic. He notes that the equality predicate should be dealt with in a different way and presents some options. The option we adopt here is that of handling
equality as a connective, i.e., not considering it as a symbol which would contribute to relevance.
Definition 9. Let be a binary relation defined as:
if and only if
We can now use as the relatedness relation needed to structure the relevance
graph, and instantiate the general framework.
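A sketch of this relatedness relation for first-order formulas is given below, using an illustrative tuple encoding of terms and formulas; the encoding, the set of logical symbols and the handling of quantified variables are our assumptions, not the actual implementation.

```python
# Non-logical symbols of a first-order formula, with '=' treated as a
# connective so that it does not contribute to relevance.  Formulas are
# encoded as nested tuples, e.g. ('forall', 'x', ('=', ('f', 'x'), 'c')).
LOGICAL = {'forall', 'exists', 'and', 'or', 'not', 'implies', '='}

def symbols(formula, bound=frozenset()):
    """Constant, function and predicate names occurring in `formula`,
    excluding '=' and variables bound by quantifiers."""
    if isinstance(formula, str):
        return set() if formula in bound else {formula}
    head, *rest = formula
    if head in ('forall', 'exists'):
        var, body = rest
        return symbols(body, bound | {var})
    collected = set() if head in LOGICAL else {head}
    for part in rest:
        collected |= symbols(part, bound)
    return collected

def related(f1, f2):
    """Two formulas are directly relevant iff they share a non-logical symbol."""
    return bool(symbols(f1) & symbols(f2))

# Example: symbols(('forall', 'x', ('=', ('f', 'x'), 'c'))) == {'f', 'c'}
```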
In the next section, we describe how this approximate inference has been implemented and some results obtained, which show that the use of the relatedness
relation does bring some gains in the inference process.
4
Implementation and Results
In this section, we show how the framework for approximate inference based on
syntactical relevance has been implemented and the results which were obtained.
The idea is to have the knowledge base structured by the relatedness relation
and to use breadth-first search in order to retrieve the most relevant formulas.
The algorithm receives as input the knowledge base K, the formula which we
are trying to prove, the relation a global limit of resources (time, memory)
for the whole process, a local limit which will be used at each step of the approximation process, an inference engine I, which will be called at each step and
a function H which decides whether it is time to move to the next approximation
step.
The basic algorithm is as follows:
Input:
(Global limit of resources), (Local limit of resources), I
(inference engine, returns Yes, No, or Fail), H (function that decides whether to
apply next inference step).
Output: Yes, No or Fail.
Data Structures: Q (a queue),
(a subset of K)
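A sketch of the overall approximation loop is shown below, under stated assumptions: the knowledge base is assumed to be already ordered by relevance to the goal (e.g. using the relevance graph of the previous sections), prover stands for the call to the underlying inference engine I (OTTER in the paper) and returns "yes", "no" or "fail", and the schedule of subset sizes plays the role of the function H. Names and signatures are illustrative, not the actual RR-OTTER interface.

```python
import time

def approximate_prove(K_ordered_by_relevance, goal, prover,
                      schedule=(25, 50, 100, 200, 250, 300, 350, 400, 450, 500),
                      local_limit=12.5, global_limit=150.0):
    """Try to prove `goal` from increasingly large, relevance-ordered subsets."""
    start = time.monotonic()
    for n in schedule:
        if time.monotonic() - start > global_limit:
            break
        subset = K_ordered_by_relevance[:n]      # the n most relevant sentences
        answer = prover(subset, goal, local_limit)
        if answer == "yes":                      # sound: a proof from a subset
            return "yes"                         # is also a proof from K
        # "no" or "fail" from a subset is inconclusive (the method is
        # incomplete), so the loop moves to the next approximation step.
    return "fail"
```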
In our tests, the inference engine used (the function I) was the theorem prover
OTTER [14]. OTTER is an open-source theorem prover for first-order logic
written in C. The code and documentation can be obtained from http://www-unix.mcs.anl.gov/AR/OTTER. OTTER was modified so that it could receive as
a parameter the maximum number of sentences to be considered at each step.
It was also modified to build the relevance graph after reading the input file.
We call the modified version RR-OTTER (Relevance-Reasoning OTTER). The
algorithm was implemented in C and the code and complete set of tests are
available in [15].
The function H looks at the number of formulas retrieved at each step. At
the first step, only the 25 most relevant formulas are retrieved, i.e., for
when
H returns true.
In order to test the algorithm, two knowledge bases were created, putting
together problems of the TPTP2 (Thousands of Problems for Theorem Provers)
benchmark. Base 1 was obtained by putting together the axioms of the problems
in the domains “Set theory”, “Geometry”, and “Management”, and it contained
1029 clauses. Base 2 was obtained adding to Base 1 the axioms of the problems in
“Group Theory”, “Natural Language Processing”, and “Logic Calculi”, yielding
1781 clauses. Only problems in which the formula was a consequence of the base
were considered.
Two sets of tests were run. The first one (Tests 1) contained 285 problems
from the “Set Theory” domain, and used as the knowledge base Base 1 described
above. The function H was set to try to solve the problems with 25, then 50,
100, 200, 250, 300, 350, 400, 450, 500, 550, and 600 clauses at each step. For
each step, the maximum time allowed was 12.5 seconds. This gives a global time
limit of 150 seconds.
The second set of tests (Tests 2) contained 458 problems from the “Group
Theory” domain and used Base 2. It was tested with 25, 50, 100, 200, 250, 300,
350, 400, 450, and 500 clauses at each step, with the time limit at each step
being 15 seconds. Again, the global limit was 150 seconds.
In order to compare the results obtained, each problem was also given to the
original implementation of OTTER, with the time limit of 150 seconds.
The table below shows the results for six problems from the set Tests 1.
Problem     Time OTTER (s)   Time RR-OTTER (s)   # of sentences used
SET003-1          –               13.06                  50
SET018-1          –               63.21                 300
SET024-6         0.76             12.96                  50
SET031-3         0.45              0.71                  25
SET183-6          –               98.46                 400
SET296-6         0.74             38.08                 200
We can see that for the problems SET003-1, SET018-1, and SET183-6, which
OTTER could not solve given the limit of 150 seconds, RR-OTTER could find
a solution, considering 50, 300 and 400 clauses respectively. In these cases, it is
clear that limiting the attention to relevant clauses brings positive results. For
problem SET031-3, the heuristic proposed did not bring any significant gain.
And for problems SET024-6 and SET296-6, we can see that OTTER performed
better than RR-OTTER. These last two problems illustrate the importance of
choosing a good function H. Consider problem SET024-6. RR-OTTER spent the
first 12.5 seconds trying to prove it with 25 clauses and failed. Once it started
with 50 clauses, it took 0.46 seconds. The same happened in problem SET296-6,
where the first 37.5 seconds were spent with unsuccessful steps.
The following is a summary of the results which were obtained:
2 http://www.tptp.org/
                                Tests 1 (285 problems)   Tests 2 (458 problems)
Solutions found by OTTER                  212                      111
Solutions found by RR-OTTER               196                      258
Average time 1 OTTER                     93 sec                   128 sec
Average time 1 RR-OTTER                  61 sec                   138 sec
Average time 2 OTTER                    3.04 sec                 11.6 sec
Average time 2 RR-OTTER                  6.9 sec                 23.07 sec
We can see that, given the global limit of 150 seconds, RR-OTTER solved
more problems than the original OTTER. The lines “Average time 1” consider
the average time for all the problems, while “Average time 2” takes into account
only those problems in which the original version of OTTER managed to find a
solution.
An interesting fact which can be seen from the tests is the influence of a bad
choice of function H. For the problems in Tests 1, if we had started with 50
sentences instead of 25, the Average time 2 of RR-OTTER would have been 3.1
instead of 6.9 (for the whole set of results, please refer to [15]).
As would be expected, when RR-OTTER manages to solve a problem
considering only a small amount of sentences, the number of clauses it generates
is much lower than what OTTER generates, and therefore, the time needed is
also shorter. As an example, the problem SET044-5 is solved by RR-OTTER
at the first iteration (25 sentences) in 0.46 seconds, generating 29 clauses, while
OTTER takes 9.8 seconds and generates 3572 new clauses. This shows that, at
least for simple problems, the idea of restricting attention to relevant sentences
helps to avoid the generation of more irrelevant data and by doing so, keeps the
search space small.
5
Conclusions and Future Work
In this paper, we have extended the framework proposed in [9] to deal with
first-order logic and shown how it can be used to perform approximate theorem
proving.
The method was implemented using the theorem prover OTTER. Although
the implementation is still naive, we could see that in many cases, we could obtain
some gains. The new method, RR-OTTER, managed to solve some problems
that OTTER could not prove, given a time limit.
The tests show that the strategy of considering the most relevant sentences
first can be fruitful, by keeping the search space small.
Future work includes more tests in order to better determine the parameters
of the method, such as the function H, and improving the implementation.
Instead of external calls to OTTER, we plan to use otterlib [16], a C library
developed by Flavio Ribeiro. The idea is that we could then keep the inference
state after each step of the approximation (for example, all the clauses that were
generated), instead of restarting from scratch.
Acknowledgements
Renata Wassermann is partly supported by the Brazilian Research Council
(CNPq), grant PQ 300196/01-6. This work has been supported by FAPESP
project 03/00312-0.
References
1. Hamscher, W., Console, L., de Kleer, J., eds.: Readings in Model-Based Diagnosis.
Morgan Kaufmann (1992)
2. Allen, J., Hendler, J., Tate, A., eds.: Readings in Planning. Morgan Kaufmann
Publishers (1990)
3. Gärdenfors, P.: Knowledge in Flux - Modeling the Dynamics of Epistemic States.
MIT Press (1988)
4. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory
of NP-Completeness. Freeman (1979)
5. Robinson, J.A., Voronkov, A., eds.: Handbook of Automated Reasoning. MIT Press
(2001)
6. Schaerf, M., Cadoli, M.: Tractable reasoning via approximation. Artificial Intelligence 74 (1995) 249–310
7. Anderson, A., Belnap, N.: Entailment: The Logic of Relevance and Necessity, Vol.
1. Princeton University Press (1975)
8. da Costa, N.C.: Calculs propositionnels pour les systèmes formels inconsistants.
Comptes Rendus d’Academie des Sciences de Paris 257 (1963)
9. Wassermann, R.: Resource-Bounded Belief Revision. PhD thesis, Institute for
Logic, Language and Computation — University of Amsterdam (1999)
10. Finger, M., Wassermann, R.: Expressivity and control in limited reasoning. In van
Harmelen, F., ed.: 15th European Conference on Artificial Intelligence (ECAI02),
Lyon, France, IOS Press (2002) 272–276
11. Epstein, R.L.: The semantic foundations of logic, volume 1: Propositional Logic.
Nijhoff International Philosophy Series. Kluwer Academic Publishers (1990)
12. Rodrigues, O.T.: A Methodology for Iterated Information Change. PhD thesis,
Imperial College, University of London (1997)
13. Krajewski, S.: Relatedness logic. Reports on Mathematical Logic 20 (1986) 7–14
14. McCune, W., Wos, L.: Otter: The cade-13 competition incarnations. Journal of
Automated Reasoning (1997)
15. Riani, J.: Towards an efficient inference procedure through syntax based relevance.
Master’s thesis, Department of Computer Science, University of São Paulo (2004)
Available at http://www.ime.usp.br/~joselyto/mestrado.
16. Ribeiro, F.P.: otterlib - a C library for theorem proving. Technical Report RT-MAC
2002-09, Computer Science Department, University of São Paulo (2002) Available
from http://www.ime.usp.br/~fr/otterlib/.
A Non-explosive Treatment of Functional
Dependencies Using Rewriting Logic*
Gabriel Aguilera, Pablo Cordero, Manuel Enciso,
Angel Mora, and Inmaculada Perez de Guzmán
E.T.S.I. Informática, Universidad de Málaga, 29071, Málaga, Spain
[email protected]a.uma.es
Abstract. The use of rewriting systems to transform a given expression
into a simpler one has promoted the use of rewriting logic in several areas
and, particularly, in Software Engineering. Unfortunately, this application has not reached the treatment of Functional Dependencies contained
in a given relational database schema. The reason is that the different
sound and complete axiomatic systems defined up to now to manage
Functional Dependencies are based on the transitivity inference rule. In
the literature, several authors illustrate different ways of mapping inference systems into rewriting logics. Nevertheless, the explosive behavior of
these inference systems avoids the use of rewriting logics for classical FD
logics. In a previous work, we presented a novel logic named
whose
axiomatic system did not include the transitivity rule as a primitive rule.
In this work we consider a new complexity criterion which allows us
to introduce a new minimality property for FD sets named atomic-minimality. The
logic has allowed us to develop the heart of this
work, which is the use of Rewriting Logic and Maude 2 as a logical
framework to search for atomic-minimality.
Keywords: Knowledge Representation, Reasoning, Rewriting Logic,
Redundancy Removal
1
Introduction
E.F. Codd introduced the Relational Model [1], which has both a formal framework
and a practical orientation. Codd’s database model is conceived to store and
manage data in an efficient and smart way. In fact, its formal basis is the
main reason for its success and longevity in Computer Science. In this formal
framework the notion of Functional Dependency (FD) plays an outstanding role
in the way in which the Relational Model stores, recovers and manages data.
FDs were introduced in the early 70s and, after an initial period in which
several authors studied their power in depth, they fell into oblivion, as it was considered
that the research concerning them had been completed. Recently, some works
have proved that there is still a set of FD problems which can be revisited in a
successful way with novel techniques [2,3].
* This work has been partially supported by TIC-2003-08687-CO2-01.
On the other hand, rewriting systems have been used in databases for database query optimization, analysis of binding propagation in deductive databases
[4], and for proposing a new formal semantics for active databases [5]. Nevertheless, we have not found in the literature any work which uses rewriting logic
(RL) to tackle an FD problem. FD problems can be classified into two classes
according to one dimension: their abstraction level. So, we have instance problems (for example, the extraction of all the FDs which are satisfied in a given
instance relation) and schema problems (for example, the construction of all the
FDs which are inferred from a given set of FDs). The first kind of problem is
being faced successfully with Artificial Intelligence techniques. Schema problems
seem to be suitable to be treated with RL.
There are some authors who introduce several FD logics [6–8]. All of these
logics are cast in the same mold. In fact, they are strongly based on Armstrong’s
Axioms [6], a set of expressions which illustrates the semantics of FDs. These FD
logics cited above were created to formally specify FDs and as a metatheoretical
tool to prove FD properties. Unfortunately, all of these FD axiomatic systems
have a common heart: the transitivity rule. The strong dependence with respect
to the transitivity inference rule prevents their executable implementation in RL.
The most famous problem concerning FDs is the Implication Problem: we
have a set of FDs and we would like to prove if a given FD can be deduced
from using the axiomatic system. If we incorporate any of these axiomatic
systems into RL, the exhaustive use of the inference rule would make this rewrite
system inapplicable, even for trivial FD sets. This limitation caused a set
of indirect methods with polynomial cost to be created to solve the Implication
Problem. Furthermore, other well-known FD problems are also tackled with
indirect methods [9].
As the authors say about Maude in [10]: “The same reasons that make it a
good semantic framework at the computational level make it also a good logical
framework at the logical level, that is, a metalogic in which many other logics can
be naturally represented and implemented”. To do that, we need a new FD logic,
which does not have the transitivity rule in its axiomatic system. Such a logic
was presented in [11] and we named it the Functional Dependencies Logic with
Substitution
In this work we use for the first time RL to manage FDs.
Particularly, we apply Maude as a metalogical framework for representing FD
logics illustrating that “Maude can be used to create executable environments
for different logics” [10].
The main novelty of
is the replacement of the transitivity rule by another inference rule, named Substitution rule1 with a non-explosive behavior.
This rule preserves equivalence and reduces the complexity of the original expression in linear time. The Substitution rule allows the design of a new kind of FD
logic with a sound and complete inference system. These characteristics allow
the development of some FD preprocessing transformations which improve the
use of indirect methods (see [12]) and open the door to the development of a
future automated theorem prover for FDs.
1 We would like to remark that our novel rule does not appear in the literature either as a primitive rule or as a derived rule.
The implication problem for FDs was motivated by the search for sets of
FDs with smaller size, where the measure is the number of FDs2. In this work we
introduce another criterion for FD complexity. We present the notion of atomic-minimality and we show how the Substitution rule may be used to develop a
rewriting system which receives a set of FDs and produces an atomic-minimal
FD set. As a general conclusion, we show that Rewriting Logic and Maude are
very appropriate to tackle this problem.
The work is organized as follows: In Section 2 we show the implication problem for FDs and the classical FD logics. Besides, we provide a Maude 2 implementation of Paredaens FD logic. Section 3 introduces the atomic-minimality
concept, a novel criterion to detect redundancy. Atomic-minimality can be used
to design a rewriting system to depurate FDs sets. In Section 4 we use RL and
Maude 2 to develop such a system. We conclude this section with several illustrative examples. The work ends with the conclusions and future work section.
2
The Implication Problem for FDs
The problem of removing redundancy in FD sets is presented exhaustively in [7].
In this paper, the authors illustrate the strong relation between the implication
problem and redundancy removal. Thus, they introduce the notion of minimality,
a property of FD sets which ensures that every FD contained in the set cannot
be deduced from the others, i.e., it is not redundant. As P. Atzeni and V. de
Antonellis note in [7], the soundness and completeness of the inference rules for
FDs guaranteed the decidability of the implication problem: given
a set of
FDs, we can exhaustively apply the inference rules to generate the closure of
This new set of FDs is used to test whether a given FD is implied by
Obviously, the method is not used in practice, because the size of this set of FDs
is exponential with respect to the cardinality of
This situation is due to both
the axiom and the transitivity that are shown below.
We select Paredaens’ FD logic (with no loss of generality) to illustrate its
explosive behavior:
Definition 1 (The
language). 3 Let
be an infinite enumerable set of
atoms (attributes) and let
be a binary connective, we define the language
Definition 2 (The
pair
where
axiomatic system).
is the logic given by the
has one axiom scheme and two inference rules:
Transitivity Rule
Augmentation Rule
2 The treatment of a set of FDs is normally focussed on the reduction of the size of the set. Nevertheless, we would like to remark that this treatment is not deterministic.
3 As usual, XY is used as the union of sets X, Y; as X included in Y; Y – X as the set of elements in Y that are not in X (difference); and as the empty set.
In we have the following derived rules (these rules appear in [8]):
Union
Composition
Intersection
Reduction
Fragmentation
where
Rule
Rule
Rule
Rule
Rule
and
Generalized Augmentation Rule
where
and
Generalized Transitivity Rule
Unfortunately,
and all the other classical FD axiomatic systems are not
suitable tools to develop automated deduction techniques, because all of them
are based on the transitivity rule, which has an inherent explosive behavior. This
is a well-known problem in other deduction methods, like tableaux-like methods,
based on a distribution rule which limits their use.
Primitive rules of
have been implemented in Maude 2 [13]. The direct translation of the inference rules into conditional equations is remarkable.
Some basic modules in Maude 2 have been necessary for the implementation4:
ostring.maude (this module is defined for ordered string management), dependency.maude (this module defines the sort Dep (dependency) and related operators and sorts) and subgenerator.maude (this module produces all the dependencies generated by the axiom through the operators subdeps and subfrag).
The axiom
has been implemented by way of two equations called
“raxiom” and “laxiom”. The first one adds all the dependencies of the form
if
The second one does the same but applied to the right part
of any dependency. The corresponding module in Maude is shown below. As
stated in [10], “Maude’s equational logic is so expressive as to offer very good
advantages as a logical framework”.
The application of this Maude 2 code to a given set of FDs produces all
the inferrable FDs. The cardinality of this equivalent output set grows in an exponential way. This is an unsurprising result due to the inherent exponentiality
of the problem. Even for trivial examples (up to two FDs), the execution of this
rewriting module generates a huge FD set. It is clear that this situation requires
us to investigate in another direction.
4 The complete specification is available at http://www.satd.uma.es/gabri/fd/sources.htm.
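A small illustrative experiment (in Python rather than the paper's Maude module) that exhaustively closes a set of FDs under standard augmentation and transitivity rules conveys the same point: even two input FDs over three attributes generate a large equivalent set.

```python
from itertools import combinations

def fd_closure(fds, attributes):
    """Exhaustive closure of a set of FDs, each a (frozenset_lhs, frozenset_rhs)
    pair, under augmentation and transitivity over the given attributes."""
    attributes = list(attributes)
    closed = set(fds)
    while True:
        new = set(closed)
        # augmentation: from X -> Y infer XZ -> YZ for every non-empty Z
        for (x, y) in closed:
            for r in range(1, len(attributes) + 1):
                for z in combinations(attributes, r):
                    z = frozenset(z)
                    new.add((x | z, y | z))
        # transitivity: from X -> Y and Y -> Z infer X -> Z
        for (x, y) in closed:
            for (y2, z) in closed:
                if y2 == y:
                    new.add((x, z))
        if new == closed:
            return closed
        closed = new

if __name__ == "__main__":
    a, b, c = "a", "b", "c"
    fds = {(frozenset({a}), frozenset({b})), (frozenset({b}), frozenset({c}))}
    # the closed set is already much larger than the two input FDs
    print(len(fd_closure(fds, [a, b, c])))
```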
If we are looking for an efficient method to solve the implication problem, we
do not use
Instead, a closure operator for attributes is used. Thus,
if we have to prove whether
is a consequence of
we compute
(the closure
of X in
and we test if Y is a subset of
In the literature there are several
algorithms to compute the attribute closure operator in linear time (see [7,9]
for further details). This ensures that we can solve the implication problem in
polynomial time.
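A standard attribute-closure procedure of the kind referred to above can be sketched as follows; this particular code is illustrative and is not taken from [7] or [9].

```python
def attribute_closure(X, fds):
    """Closure of the attribute set X under the FDs, given as (lhs, rhs) pairs
    of attribute collections."""
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def implies(fds, lhs, rhs):
    """Decide the implication problem: does `fds` entail lhs -> rhs?"""
    return set(rhs) <= attribute_closure(lhs, fds)
```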
Nevertheless, this efficient method has a very important disadvantage: it
does not allow giving an explanation about the answer. When we use an indirect
method we are not able to translate the final solution into an inference chain
to explain the answer in terms of the inference system. This limits the use of
the indirect methods, because we cannot apply them in artificial intelligence
environments, where the explanation is as important as the final answer.
3
The Minimality and the Optimality Problems.
A New Intermediate Solution
The number of FDs in a set is a critical measure, because it is directly related
to the cost of every problem concerning FDs. The search for a set of FDs with
minimal cardinality that is equivalent to a given one is known as the Minimality
problem.
Nevertheless, as Atzeni et al. [7] remark, the problem is not always the number of FDs in the set but it is sometimes the number of attributes of the FD
set. This second approach to the size of an FD set leads to the Optimality
problem.
Firstly, we define formally the concept of size of an FD set.
Definition 3. Let
be finite. We define the size of
as
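One common way to measure the size of a finite set of FDs, assumed here for illustration, is the total number of attribute occurrences on both sides:

```latex
\|\Sigma\| \;=\; \sum_{X \to Y \,\in\, \Sigma} \bigl(|X| + |Y|\bigr)
```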
Secondly, we outline the problems mentioned above as follows.
Minimality: the search for a set equivalent to such that any set of FDs with lower cardinality is non-equivalent to
Optimality: the search for a set equivalent to such that any set of FDs with lower size is non-equivalent to
As they demonstrate, optimality implies minimality and, while minimality
can be checked in polynomial time using indirect algorithms, optimality is NP-hard. Besides, the exponential cost of optimality is due, particularly, to the need
to test cycles in the FD graph.
Now we formalize these problems and a new non-NP-hard problem more
useful than minimality. We will show in Section 4 that this new problem has
linear cost. Moreover, we propose the use of RL to solve this new problem.
Definition 4. Let be finite. We say that is minimal if the following condition holds
We say that is optimal if the following condition holds
The minimality condition is in practice unapproachable with the axiomatic system and the optimality condition takes us to an NP-hard problem. We are interested in an intermediate point between minimality and optimality. To this end
we characterize the minimality using the following definition.
Definition 5. We define Union to be a rewriting rule which is applied
to condense FDs with the same left-hand side. That is, if
is finite,
Union systematically makes the following transformation:
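Read operationally, the Union rule condenses all FDs that share a left-hand side; a direct illustrative rendering (the representation of FDs as pairs of frozensets is our choice):

```python
from collections import defaultdict

def union_rule(fds):
    """fds: iterable of (frozenset_lhs, frozenset_rhs) pairs; FDs with the
    same left-hand side are condensed into a single FD."""
    grouped = defaultdict(frozenset)
    for lhs, rhs in fds:
        grouped[lhs] |= rhs
    return {(lhs, rhs) for lhs, rhs in grouped.items()}
```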
Therefore, when we say that a set is minimal, we mean that this set is a
minimal element of its equivalence class. In this case we use the order given by
the inclusion of sets. However, when we say that a set is optimal, we refer to the
“minimality” in the preorder given by
if and only if
Now
we define a new order to improve the concept of minimality.
Definition 6. Let and be finite subsets of We define the atomic inclusion, denoted by as follows: we say that if and only if
Obviously, this relation is an order5. Now, we introduce a new concept of minimality based on this order.
Definition 7. Let be finite. We say that is atomic-minimal if the following conditions hold:
If and then
Example 1. Let us consider the following sets of FDs:
5 Note that, if we extend this relation to all subsets of (finite and infinite subsets), this relation is a preorder but not an order.
These sets are equivalent in Paredaen’s logic and we have that:
is optimal.
is not optimal because
(the FD
of
has been replaced
by
in
is atomic-minimal.
is not atomic-minimal because
and
(notice the FDs
and
of
and their
corresponding
and
in
is minimal because there are no
superfluous FDs. Finally,
is not minimal because
can be obtained by
transitivity from
and
The relation among these sets is depicted
in the following table:
Finally, we remark that
However,
Let us remark that a set is minimal if we cannot obtain an equivalent set by
removing some FD of
Therefore, we may design an algorithm to obtain minimal sets through elimination of redundant FDs. In the same way, is atomicminimal if we cannot obtain an equivalent set by removing an attribute of one
FD belonging to
This fact guide the following results.
Definition 8. Let
is superfluous in
and
if
is l-redundant in
if there exist
such that
is r-redundant in
if there exist
such that
The following theorem is directly obtained from Definition 8.
Theorem 1. Let
be a finite set of FDs such that
Then
is atomic-minimal if and only if there do not exist
superfluous,
or
in
such that
is
This theorem relates atomic-minimality and the three situations included in
Definition 8. The question is, which situations in Definition 8 are not covered
by minimality? The superfluous FDs are covered trivially.
In the literature, the algorithms which deal with sets of FDs consider a
preprocessing transformation which renders FDs with only one attribute in
the right-hand side. This preprocessing transformation applies exhaustively the
rule in Definition 2.
In these algorithms, r-redundant attributes are captured as superfluous FDs.
The l-redundant attribute situation is a novel notion in the literature and the
classical FD algorithms do not deal with it.
4
The Search for Atomic-Minimality
As shown in Section 2, the implication problem cannot be solved directly using Paredaens’
logic. Thus, we use a novel logic, the
logic presented in [11], which avoids
the disadvantages of classical FD logics. The axiomatic system of
logic is
more appropriate to automate.
Definition 9. We define the
has one axiom scheme:
is an axiom scheme.
The inferences are the following:
logic as the pair
where
where
Particular,
Fragmentation rule
Composition rule
Substitution rule
Theorem 2. The
and
systems are equivalent.
The proof of this equivalence and the advantages of the
were shown in
[11].
is sound and complete (see [11]), thus, we have all the derived rules
presented in
Besides that, we have the following novel derived rule:
r-Substitution Rule
Obviously,
does not avoid the exponential complexity of the problem
of searching for all the inferrable FDs. Nevertheless, the replacement of the
transitivity law by the substitution law allows us to design a rewriting method
to search for atomic-minimality in an FD set.
Next, we show how to use Maude to create an executable environment to
search for atomic-minimal FD sets. The
inference system can be directly translated to a rewriting system which allows a natural implementation of
FD set transformations. This rewriting view is directly based on the following
theorem.
Theorem 3. Given
we have the following
Reduction
Union
Fi If
then
Fi-r If
and
then
These equivalences are used to rewrite the FD set into a simpler one. Atomic-minimality induces the application of these equivalences from left to right. Fi
and Fi-r are only applied when they render a proper reduction, i.e.:
6 It is easily proven that the reduction rule and the union rule are transformations.
If
If
or
then Fi is applied to eliminate
atoms of
then Fi-r is applied to eliminate
atoms of
or
Now, we give the corresponding rewriting rules in Maude.
Let be an FD set. Since the size of is reduced in every rewrite, the number
of rewrites is linear in the size of
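In place of the Maude rules, which are not reproduced here, the following Python sketch illustrates the left-to-right use of Union and Reduction; the substitution-based rules Fi and Fi-r are omitted because their side conditions are not spelled out above, and dropping FDs whose right-hand side becomes empty is an additional assumption.

```python
def reduce_fd_set(fds):
    """Rewrite a set of FDs, given as (frozenset_lhs, frozenset_rhs) pairs,
    by exhaustively applying Reduction and Union from left to right."""
    current = set(fds)
    while True:
        # Reduction: X -> Y is rewritten to X -> (Y - X); FDs whose right-hand
        # side becomes empty are dropped (assumed to follow from the axiom).
        reduced = {(lhs, rhs - lhs) for lhs, rhs in current}
        reduced = {(lhs, rhs) for lhs, rhs in reduced if rhs}
        # Union: X -> Y and X -> Z are condensed into X -> YZ.
        grouped = {}
        for lhs, rhs in reduced:
            grouped[lhs] = grouped.get(lhs, frozenset()) | rhs
        new = {(lhs, rhs) for lhs, rhs in grouped.items()}
        if new == current:
            return new
        current = new
```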
Below, some examples are shown. We reduce several sets of FDs and we show
the results that Maude 2 offers. The low cost of these reductions is remarkable.
Example 2. This example is used in [14]. The size of the FD set decreases from
26 to 18.
Example 3. This example is simplified in [9] using
Our reduction
in RL and Maude obtains the same result without using a closure operator.
5
Conclusions and Future Work
In this work we have studied the relation between RL and the treatment of sets
of FDs. We have illustrated the difficulties of facing the implication problem with
a method directly based on FD logics. We have introduced the notion of atomic-minimality, which guides the treatment of sets of FDs in a rewriting style. Given
a set of FDs, we rewrite
into an equivalent and simpler FD set.
This goal has been reached using
This axiomatic system avoids the use of
transitivity paradigm and introduces the application of the substitution paradigm.
axiomatic system
is easily translated to RL and Maude 2.
The implementation of
in Maude 2 allows us to have an executable
rewriting system to reduce the size of a given FD set in the direction guided
by atomic-minimality. Thus, we open the door to the use of RL and Maude 2 to
deal with FDs.
As short-term future work, our intention is to develop a Maude 2 system
to obtain atomic-minimal FD sets. As medium-term future work, we will use
Maude strategies to fully treat the redundancy contained in FD sets.
References
1. Codd, E.F.: The relational model for database management: Version 2. Reading,
Mass.: Addison Wesley (1990)
2. Bell, D.A., Guan, J.W.: Computational methods for rough classifications and discovery. J. American Society for Information Sciences. Special issue on Data Mining
49 (1998)
3. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N., Lakhal, L.: Functional and
embedded dependency inference: a data mining point of view. Information Systems
26 (7) (2002) 477–506
4. Han, J.: Binding propagation beyond the reach of rule / goal graphs. Information
Processing Letters 42 (5) (1992 Jul 3) 263–268
5. Montesi, D., Torlone, R.: Analysis and optimization of active databases. Data &
Knowledge Engineering 40 (3) (2002 Mar) 241–271
6. Armstrong, W.W.: Dependency structures of data base relationships. Proc. IFIP
Congress. North Holland, Amsterdam (1974) 580–583
7. Atzeni, P., Antonellis, V.D.: Relational Database Theory. The Benjamin/Cummings Publishing Company Inc. (1993)
8. Paredaens, J., De Bra, P., Gyssens, M., Van Gucht, D.: The structure of the relational database model. EATCS Monographs on TCS (1989)
9. Diederich, J., Milton, J.: New methods and fast algorithms for database normalization. ACM Transactions on Database Systems 13 (3) (1988) 339–365
10. Clavel, M., Durán, F., Eker, S., Lincoln, P., Martí-Oliet, N., Meseguer, J., Quesada, J.F.: Maude: specification and programming in rewriting logic. Theoretical
Computer Science (TCS) 285 (2) (2002) 187–243
11. Cordero, P., Enciso, M., Guzmán, I.P.d., Mora, Á.: Slfd logic: Elimination of data
redundancy in knowledge representation. (Advances in AI, Iberamia 2002. LNAI
2527 141-150. Springer-Verlag.)
12. Mora, Á., Enciso, M., Cordero, P., Guzmán, I.P.d.: An efficient preprocessing transformation for funtcional dependencies set based on the substitution paradigm.
CAEPIA 2003. To be published in LNAI. (2003)
13. Clavel, M., Durán, F., Eker, S., Lincoln, P., Martí-Oliet, N., Meseguer, J., Quesada,
J.: A Maude Tutorial. SRI International. (2000)
14. Ullman, J.D.: Database and knowledge-base systems. Computer Science Press
(1988)
Reasoning About Requirements Evolution
Using Clustered Belief Revision
Odinaldo Rodrigues¹, Artur d'Avila Garcez², and Alessandra Russo³
¹ Dept. of Computer Science, King's College London, UK
[email protected]
² Department of Computing, City University London, UK
[email protected]
³ Department of Computing, Imperial College London, UK
[email protected]
Abstract. During the development of system requirements, software
system specifications are often inconsistent. Inconsistencies may arise
for different reasons, for example, when multiple conflicting viewpoints
are embodied in the specification, or when the specification itself is at
a transient stage of evolution. We argue that a formal framework for
the analysis of evolving specifications should be able to tolerate inconsistency by allowing reasoning in the presence of inconsistency without
trivialisation, and circumvent inconsistency by enabling impact analyses
of potential changes to be carried out. This paper shows how clustered
belief revision can help in this process.
1 Introduction
Conflicting viewpoints inevitably arise in the process of requirements analysis.
Conflict resolution, however, may not necessarily happen until later in the development process. This highlights the need for requirements engineering tools
that support the management of inconsistencies [12,17].
Many formal methods of analysis and elicitation rely on classical logic as
the underlying formalism. Model checking, for example, typically uses temporal
operators on top of classical logic reasoning [10]. This facilitates the use of well-behaved and established proof procedures. On the other hand, it is well known
that classical logic theories trivialise in the presence of inconsistency and this is
clearly undesirable in the context of requirements engineering, where inconsistency often arises [6].
Paraconsistent logics [3] attempt to ameliorate the problem of theory trivialisation by weakening some of the axioms of classical logic, often at the expense
of reasoning power. While appropriate for concise modelling, logics of this kind
are too weak to support practical reasoning and the analysis of inconsistent
specifications.
Clustered belief revision [15] takes a different view and uses theory prioritisation to obtain plausible (i.e., non-trivial) conclusions from an inconsistent
theory, yet exploiting the full power of classical logic reasoning. This allows the
requirements engineer to analyse the results of different possible prioritisations
by reasoning classically, and to evolve specifications that contain conflicting viewpoints in a principled way. The analysis of user-driven cluster prioritisations can
also give stakeholders a better understanding of the impact of certain changes
in the specification.
In this paper, we investigate how clustered belief revision can support requirements analysis and evolution. In particular, we have developed a tool for
clustered revision that translates requirements given in the form of “if then else”
rules into the (more efficient) disjunctive normal form (DNF) for classical logic
reasoning and cluster prioritisation. We have then used a simplified version of
the light control case study [9] to provide a sample validation of the clustered
revision framework in requirements engineering.
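As a hedged illustration of this preprocessing step (our own sketch, not the authors' tool), the following Python fragment uses sympy to turn a single "if ... then ..." requirement into DNF; the rule and the atom names occ, alm and lights_on are invented for the example.

```python
# Illustrative sketch only: converting an "if-then" requirement into DNF
# with sympy. The rule below is a made-up example, not one of the
# authors' LCS requirements.
from sympy import symbols, Implies, And, Not
from sympy.logic.boolalg import to_dnf

occ, alm, lights_on = symbols("occ alm lights_on")

# "if the office is occupied and the alarm is not sounding, then the
# lights are on", read as a material implication.
rule = Implies(And(occ, Not(alm)), lights_on)

# Classically, (p -> q) is (~p | q); to_dnf expands this into a
# disjunction of conjunctions suitable for cluster-based reasoning.
print(to_dnf(rule, simplify=True))   # e.g. alm | lights_on | ~occ
```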
The rest of the paper is organised as follows. In Section 2, we present the
clustered revision framework. In Section 3, we apply the framework to the simplified light control case study and discuss the results. In Section 4, we discuss
related work and, in Section 5, we conclude and discuss directions for future
work.
2 Clustered Belief Revision
Clustered belief revision [15] is based on the main principles of the well-established field of belief revision [1,7], but has one important feature not present
in the original work: the ability to group sentences with a similar role into a
cluster. As in other approaches [11,8], extra-logical information is used to help
in the process of conflict resolution. Within the context of requirements evolution, such extra-logical information is a (partial) ordering relation on sentences,
expressing the relative level of preference of the engineer on the requirements
being formalised. In other words, less preferred requirements are the ones the
engineer is prepared to give up first (as necessary) during the process of conflict
resolution.
The formalism uses sentences in DNF in order to make the deduction and
resolution mechanisms more efficient. The resolution extends classical deduction
by using the extra-logical information to decide how to solve the conflicts. A cluster can be resolved and simplified into a single sentence in DNF. Clusters can
be embedded in other clusters and priorities between clusters can be specified in
the same way as priorities can be specified within a single cluster. The embedding allows for the representation of complex structures which can be useful in
the specification of requirements in software engineering. The behaviour of the
selection procedure in the deduction mechanism – that makes the choices in the
resolution of conflicts – can be tailored according to the ordering of individual
clusters and the intended local interpretation of that ordering.
Our approach has the following main characteristics: i) it allows users to
specify clusters of sentences associated with some (possibly incomplete) priority
information; ii) it resolves conflicts within a cluster by taking into account the
priorities specified by the user and provides a consistent conclusion whenever
possible; iii) it allows clusters to be embedded in other clusters so that complex
priority structures can be specified; and finally iv) it combines the reasoning
about the priorities with the deduction mechanism itself in an intuitive way.
In the resolution of a cluster, the main idea is to specify a deduction mechanism that reasons with the priorities and computes a conclusion based on these
priorities. The priorities themselves are used only when conflicts arise, in which
case sentences associated with higher priorities are preferred to those with lower
priorities. The prioritisation principle (PP) used here is that a sentence with a given priority cannot block the acceptance of another sentence with a higher priority.
In the original AGM theory of belief revision, the prioritisation principle
exists implicitly but is only applied to the new information to be incorporated.
We also adopt the principle of minimal change (PMC) although to a limited
extent. In the original AGM theory PMC requires that old beliefs should not
be given up unless this is strictly necessary in order to repair the inconsistency
caused by the new belief. In our approach, we extend this idea to cope with
several levels of priority by stating that “information should not be lost unless
it causes inconsistency with information conveyed by sentences with higher priority". As a result, when a cluster is provided without any relative
priority between its sentences, the mechanism behaves in the usual way and
computes a sentence whose models are logically equivalent to the models of the
(union of) the maximal consistent subsets of the cluster. On the other extreme,
if the sentences in the cluster are linearly prioritised, the mechanism behaves in
a way similar to Nebel’s linear prioritised belief bases [11].
Unfortunately, we do not have enough space to present the full formalism of clustered belief revision and its properties here. Further details can be found in [15]. The main idea is to associate labels from a set J with propositional formulae via a function, and to define a partial order on J according to the priorities one wants to express. This order is then extended to the power set of J in the following way¹.
Definition 1. Let J be a cluster of sentences and let X and Y be subsets of J. Then X precedes Y iff one of three conditions, i)–iii), comparing the labels occurring in X and Y under the order on J holds (the formal statement is given in [15]).
The ordering above is intended to extend the user’s original preference relation on the set of requirements to the power set of these requirements. This
allows one to compare how subsets of the original requirements relate to each
other with respect to the preferences stated by the user on the individual requirements. Other extensions of the ordering to the power set could be devised according to the needs
of specific applications.
A separate mechanism selects some sets in the power set of J according to some criteria. For our purposes here, this mechanism calculates the sets in the power set of J that are associated with consistent combinations of sentences². In order to choose the best consistent sets (according to the priorities), we use the ordering of Definition 1, i.e., we take the minimal elements among the consistent sets. Since the ordering forms a lattice on the power set of J in which J itself is always the minimum, if the labelled belief base is consistent then the choice of the best consistent sets yields just J itself. Otherwise, this choice identifies some subsets of J according to the ordering. The search for consistent combinations of sentences and for minimal elements of the ordering can be combined and optimised (see [14]).
¹ In the full formalism, the function can map an element of J to another cluster as well, creating nested revision levels, i.e., when the object mapped to by a label is not a sentence, the corresponding cluster is recursively resolved first.
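The selection mechanism just described can be pictured with a small, self-contained sketch. It is ours, not the authors' tool, and it deliberately replaces the priority-aware comparison of Definition 1 with plain set inclusion, so it only reproduces the case in which no relative priorities are given; the labelled sentences are invented.

```python
# Minimal sketch (ours, not the authors' implementation): labelled
# sentences over a few atoms, consistency via truth-table enumeration,
# and selection of the maximal consistent label sets.
from itertools import chain, combinations, product

ATOMS = ["p", "q"]

# Invented cluster: label -> sentence, each sentence a predicate over a
# truth assignment (a dict atom -> bool).
CLUSTER = {
    "a": lambda v: v["p"],                  # p
    "b": lambda v: not v["p"],              # ~p (conflicts with "a")
    "c": lambda v: (not v["p"]) or v["q"],  # p -> q
}

def consistent(labels):
    """True iff some assignment satisfies every sentence in `labels`."""
    for values in product([True, False], repeat=len(ATOMS)):
        v = dict(zip(ATOMS, values))
        if all(CLUSTER[lab](v) for lab in labels):
            return True
    return False

def maximal_consistent_subsets():
    subsets = [set(s) for s in chain.from_iterable(
        combinations(CLUSTER, r) for r in range(len(CLUSTER) + 1))]
    cons = [s for s in subsets if consistent(s)]
    return [s for s in cons if not any(s < t for t in cons)]

print(maximal_consistent_subsets())  # e.g. [{'a', 'c'}, {'b', 'c'}]
```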
Example 1. Consider a cluster whose sentences are labelled by a set J, the partial order on J given in the middle of Figure 1 (where an arrow from one label to another indicates priority of the former over the latter), and the function assigning a sentence to each label. The sentences taken conjunctively are inconsistent, so we look for consistent subsets of the base. It can be shown that the maximal consistent subsets of J are those associated with a handful of label sets. According to the extended ordering, two of them are the ones which best verify PP, whereas the others do not verify PP: one of the latter has lower priority even than a smaller set, since it does not contain the label associated with the most important sentence in J, while another is strictly worse than one of the preferred sets, since that set contains a label which is strictly better according to the order. The resolution of the cluster would therefore produce a result which accepts the sentences associated with the two undisputed labels and includes the consequences of the disjunction of the sentences associated with the two remaining labels. This signals that whereas it is possible to consistently accept the former sentences, it is not possible to consistently include both of the latter. Not enough information is given in the ordering to make a choice between the two, and hence their disjunction is taken instead.
3 The Light Control Example
In what follows, we adapt and simplify the Light Control Case Study (LCS)
[13] in order to illustrate the relevant aspects of our revision approach. LCS
describes the behaviour of light settings in an office building. We consider two
possible light scenes: the default light scene and the chosen light scene. Office
lights are set to the default level upon entry of a user, who can then override
this setting to a chosen light scene.
If an office is left unoccupied for more than a given number of minutes, the system turns the office's lights off. When an unoccupied office is reoccupied within a second, separately configured number of minutes, the light scene is re-established according to its immediately previous setting. The first value is set by the facilities manager, whereas the second is set by the
office user [9]. For simplicity, our analysis does not take into account how these
two times relate.
² As suggested for the extension of the ordering, this selection procedure can be tailored to fit other requirements. One may want, for instance, to select amongst the subsets of J those that satisfy a given requirement.
Fig. 1. Examples of orderings and the corresponding final ordering.
A dictionary of the symbols used in the LCS case study is given in Table 1.
As usual, unprimed literals denote properties of a given state of the system, and primed literals denote properties of the state immediately after (e.g., occ denotes that the office is occupied at the current time point and occ′ that it is occupied at the next one).
A partial specification of the LCS is given below, grouped into behaviour rules, safety rules and economy rules.
We assume that LCS should satisfy two types of properties: safety properties and economy properties.
The following are safety properties: i) the lights are not off in the default light scene; ii) if the fire alarm (alm) is triggered, the default light scene must be re-established in all offices; and iii) a given number of minutes after the alarm is triggered, all lights must be turned off (i.e., only emergency lights must be on). This number of minutes is set by the facilities manager. The above requirements are represented by the safety rules of the specification.
The economy properties include the fact that, whenever possible, the system
ought to use natural light to achieve the light levels required by the office light
scenes. Sensors can check i) whether the luminosity coming from the outside is
enough to surpass the luminosity required by the current light scene; and ii)
whether the luminosity coming from the outside is greater than the maximum
luminosity achievable by the office lights. The latter is useful because it can be
applied independently of the current light scene in an office. Two thresholds are used: the luminosity required by the current light scene and the maximum luminosity achievable by the office lights. Then: i) if the natural light is at least the luminosity required by the current light scene and the office is in the chosen or default light scene, the lights must be turned off; and ii) if the natural light is at least the maximum luminosity achievable by the office lights, the lights must be turned off. These requirements are represented by the two economy rules of the specification.
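For concreteness, one plausible propositional reading of the properties described above is sketched below as plain data. All rule labels and atom names are our own guesses at the vocabulary of Table 1, not the authors' formalisation; a trailing apostrophe marks the successor-state version of an atom, as in the text.

```python
# Hedged sketch (ours) of the LCS properties described above, written as
# named propositional rules grouped into clusters. All names are guesses.
SAFETY = {
    "S1": "dls -> lights_on",                 # lights are not off in the default scene
    "S2": "alm -> dls'",                      # on alarm, re-establish the default scene
    "S3": "alm & t1_elapsed -> ~lights_on'",  # set number of minutes after alarm: lights off
}
ECONOMY = {
    "E1": "daylight_covers_scene & (cls | dls) -> ~lights_on'",
    "E2": "daylight_exceeds_max -> ~lights_on'",
}
BEHAVIOUR = {
    "B1": "ui -> dls'",           # on user entry, the default light scene is established
    "B2": "dls' -> lights_on'",   # which implies the lights are on
}
# The update cluster would hold the scenario facts described below:
# ui, alm, t1_elapsed, daylight_exceeds_max.
```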
Now, consider the following scenario. On a bright Summer’s day, John is
working in his office when suddenly the fire alarm goes off. He leaves the office
immediately. Once outside the building, he realises that he left his briefcase
behind and decides to go back to fetch it. By the time he enters his office, more
than the configured number of minutes have elapsed. This situation can be formalised as follows: John enters the office (ui); the alarm is sounding (alm); the configured number of minutes or more have elapsed since the alarm went off; and daylight provides luminosity enough to dispense with artificial lighting.
We get inconsistency in two different ways:
1. Because John walks into the office, the lights go to the default setting. By the rules governing the default light scene, the lights must be on in this setting. This contradicts the safety rule which states that lights should be turned off the set number of minutes after the alarm goes off.
2. Similarly, as John walks into the office, the lights go to the default setting and are therefore turned on. However, by the economy rules this is not necessary, since it is bright outside and the luminosity coming through the window is higher than the maximum luminosity achievable by the office lights.
This is a situation where inconsistency in the light scenes occurs due to violations of safety and economy properties. We need to reason about how to resolve
the inconsistency. Using clustered belief revision, we can arrange the components
of the specification in different priority settings, by grouping rules in clusters,
e.g., a safety cluster, an economy cluster, etc. It is possible to prioritise the clusters internally as well, but this is not considered here for reasons of space and
simplicity.
The organisation of the information in each cluster can be done independently
but the overall prioritisation of the clusters at the highest level requires input
from all stakeholders. For example, in the scenario described previously, we might
wish to prioritise safety rules over the other rules of the specification and yet not
have enough information from stakeholders to decide on the relative strength of
economy rules. In this case, we would ensure that the specification satisfies the
safety rules but not necessarily the economy ones.
Fig. 2. Linearly (L1, L2 and L3) and partially (P1 and P2) ordered clusters.
Let us assume that sensor and factual information is correct and therefore
not subject to revision. We combine this information in a cluster called “update”
and give it highest priority. In addition, we assume that safety rules must have
priority over economy rules. At this point, no information on the relative priority
of behaviour rules is available. With this in mind, it is possible to arrange the
clusters with the update, safety, behaviour and economy rules as depicted in
Figure 2. Prioritisations L1, L2 and L3 represent all possible linear arrangements
of these clusters with the assumptions mentioned above, whereas prioritisations
P1 and P2 represent the corresponding partial ones.
The overall result of the clustered revision will be consistent as long as the
cluster with the highest priority (factual and sensor information) is not itself
inconsistent. When the union of the sentences in all clusters is indeed inconsistent, in order to restore consistency, some rules may have to be withdrawn. For
example, take prioritisation L1. The sentences in the safety cluster are consistent
with those in the update cluster; together, they conflict with one of the behaviour rules (see Figure 3). Since that rule is in a cluster with lower priority in L1, it cannot be consistently kept and is withdrawn from the intermediate result. The final step is to incorporate what can be consistently accepted from the economy cluster. For example, one economy rule is consistent with the (partial) result given in Figure 3 and is therefore included in the revised specification, and similarly for the other. Notice, however, that the withdrawn behaviour rule might be kept given a different arrangement of the priorities. The refinement process occurs by allowing one to reason about these
Fig. 3. Conflict with behaviour rule
different arrangements and the impact on the rules in the specification, without
trivialising the results. Eventually, one aims to reach a final specification that is
consistent regardless of the priorities between the clusters, i.e., consistent in the
classical logic sense, although this is not essential in our framework.
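The incremental incorporation just described for L1 can be written as a short loop. This is our sketch: it treats each cluster as a dictionary of named sentences and greedily withdraws any sentence that cannot be added consistently, whereas the full framework may instead keep a disjunction of the conflicting alternatives. The `consistent` argument is an assumed propositional consistency test, such as the truth-table one sketched in Section 2.

```python
# Sketch (ours) of resolving a linear prioritisation such as L1:
# update > safety > behaviour > economy. `consistent` is an assumed
# consistency check over a list of sentences.
def resolve(clusters_in_priority_order, consistent):
    accepted, withdrawn = [], []
    for cluster in clusters_in_priority_order:
        for name, sentence in cluster.items():
            if consistent(accepted + [sentence]):
                accepted.append(sentence)   # keep it: no conflict so far
            else:
                withdrawn.append(name)      # e.g. the behaviour rule under L1
    return accepted, withdrawn

# Usage: resolve([UPDATE, SAFETY, BEHAVIOUR, ECONOMY], consistent)
```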
Prioritisations L2 and P2 give the same results as L1, i.e., withdrawal of the behaviour rule is recommended. On the other hand, in prioritisation L3 the sentence in the behaviour cluster is consistent with those in the update cluster; together, they conflict with a safety rule (see Figure 4). Since the safety cluster is given lower priority in L3, the two safety sentences involved cannot both be consistently kept and one of them has to be given up. Keeping one of them, however, would force yet another sentence to be withdrawn; minimal change to the specification therefore forces us to keep the other, as it allows that additional sentence to be included.
Fig. 4. Conflict with safety rule
Finally, prioritisation P1 offers a choice between the sets of clusters {update,
safety, economy} and {update, behaviour, economy}. The former corresponds to
withdrawing the behaviour rule (reasoning in the same way as for L1, L2 and P2), whereas the latter corresponds to withdrawing the safety rule, as in the case of L3.
In summary, from the five different cluster prioritisations analysed, a recommendation was made to withdraw a behaviour rule in three of them, to withdraw
a safety rule in one of them, and to withdraw either a behaviour or a safety rule
in one of them. From these results and the LCS context, the withdrawal of the behaviour rule seems the more plausible choice. In more complicated cases, a decision support system could be used to help choose among the recommendations made by
the clustered revision framework.
4 Related Work
A number of logic-based approaches for handling inconsistency and evolving requirements specifications have been proposed in the literature. Zowghi and Offen
[18] proposed belief revision for default theories as a formal approach for resolving inconsistencies. Specifications are formalised as default theories where each
requirement may be defeasible or non-defeasible, each kind assumed to be consistent within itself. Inconsistencies introduced by an evolutionary change are
resolved by performing a revision operation over the entire specification. Defeasible information that is inconsistent with non-defeasible information is not used
in the reasoning process and thus does not trigger a revision. Similarly, in our
approach, requirements with lower priority that are inconsistent with requirements with higher priority are not considered in the computation of the revised
specification. However, in our approach, the use of different levels of priority enables the engineer to fine-tune the specification and reason with different levels
of defeasibility.
In [16], requirements are assumed to be defeasible, having an associated preference ordering relation. Conflicting defaults are resolved not by changing the
specification but by considering only scenarios or models of the inconsistent specification that satisfy as much of the preferable information as possible. Whereas
Ryan’s representation of priorities is similar to our own, we use classical logic
entailment as opposed to Ryan’s natural entailment and the priorities in our
framework are used only in the solution of conflicts. Moreover, the use of clusters in our approach provides the formalisation of requirements with additional
dimensions, enabling a more refined reasoning process about inconsistency.
In [4], a logic-based approach for reasoning about requirements specifications
based on the construction of goal tree structures is proposed. Analyses of the
consequences of alternative changes are carried out by investigating which goals
would be satisfied and which would not, after adding or removing facts from
a specification. In a similar fashion, our approach supports the evaluation of
consequences of evolutionary changes by checking which requirements are lost
and which are not after adding or deleting a requirement.
Moreover, other techniques have been proposed for managing inconsistency
in specifications. In [2], priorities are used but only in subsets of a knowledge
base which are responsible for inconsistency. Some inference mechanisms are
proposed for locally handling inconsistent information using these priorities. Our
approach differs from that work in that the priorities are defined independently of the inconsistency, thus facilitating a richer impact analysis on the overall
specification. Furthermore, in [2] priorities can only be specified at the same
level within the base, whereas we allow for more complex representations (e.g.,
between and within sub-bases).
Finally, a lot of work has focused on consistency checking, analysis and action
based on pre-defined inconsistency handling rules. For example, in [5], consistency checking rules are combined with pre-defined lists of possible actions, but
with no policy or heuristics on how to choose among alternative actions. The entire approach relies on taking decisions based on an analysis of the history of the
development process (e.g., past inconsistencies and past actions). Differently, our
approach provides formal support for analysing the impact of changes over the specification by allowing the engineer to pose "what if" questions on possible changes
and to check the effect that these changes would have in terms of requirements
that are lost or preserved.
5 Conclusions and Future Work
In this paper, we have shown how clustered belief revision can be used to analyse
the results of different prioritisations on requirements by reasoning classically, and
to evolve specifications that contain conflicting viewpoints in a principled way.
A simplified version of the light control case study was used to provide an early
validation of the framework. We believe that this approach gives the engineer
more freedom to make appropriate choices on the evolution of the requirements,
while at the same time offering rigorous means for evaluating the consequences
that such choices have on the specification.
Our approach provides not only a technique for revising requirements specifications using priorities, but also a methodology for handling evolving requirements. The emphasis of the work is on the use of priorities for reasoning about
potentially inconsistent specifications. The same technique can be used to check
the consequences of a given specification and to reason about “what if” questions
that arise during evolutionary changes.
A number of heuristics about the behaviour of the ordering
have been
investigated in [14]. The use of DNF greatly simplifies the reasoning, but the
conversion to DNF sometimes generates complex formulae making the reasoning
process computationally more expensive. To improve scalability of the approach,
these formulae should be as simple as possible. This simplification could be
achieved by using Karnaugh maps to find a “minimal” DNF of a sentence.
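As a rough illustration of the simplification suggested here (ours, with an invented formula), off-the-shelf Boolean minimisation already produces the kind of small DNF one would read off a Karnaugh map:

```python
# Illustrative only: a compact DNF via sympy's built-in minimiser rather
# than a hand-drawn Karnaugh map. The bloated formula is made up.
from sympy import symbols
from sympy.logic.boolalg import simplify_logic

a, b, c = symbols("a b c")
bloated = (a & b & c) | (a & b & ~c) | (a & ~b & c) | (a & ~b & ~c)
print(simplify_logic(bloated, form="dnf"))  # a
```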
References
1. C. A. Alchourrón and D. Makinson. On the logic of theory change: Contraction
functions and their associated revision functions. Theoria, 48:14–37, 1982.
2. S. Benferhat and L. Garcia, Handling Locally Stratified Inconsistent Knowledge
Bases, Studia Logica, 70:77–104, 2002.
3. N. C. A. da Costa, On the theory of inconsistent formal systems. Notre Dame
Journal of Formal Logic, 15(4):497–510, 1974.
4. D. Duffy et al., A Framework for Requirements Analysis Using Automated Reasoning, CAiSE95, LNCS 932, Springer, 68–81, 1995.
5. S. Easterbrook and B. Nuseibeh, Using ViewPoints for Inconsistency Management.
In Software Engineering Journal, 11(1): 31-43, BCS/IEE Press, January 1996.
6. A. Finkelstein et. al, Inconsistency handling in multi-perspective specifications,
IEEE Transactions on Software Engineering, 20(8), 569-578, 1994.
7. Peter Gärdenfors. Knowledge in Flux: Modeling the Dynamics of Epistemic States.
The MIT Press, Cambridge, Massachusetts, London, England, 1988.
8. P. Gärdenfors and D. Makinson. Revisions of knowledge systems using epistemic
entrenchment. TARK II, pages 83–95. Morgan Kaufmann, San Francisco, 1988.
9. C. Heitmeyer and R. Bharadwaj, Applying the SCR Requirements Method to the
Light Control Case Study, Journal of Universal Computer Science, Vol.6(7), 2000.
10. M. R. Huth and M. D. Ryan. Logic in Computer Science: Modelling and Reasoning
about Systems. Cambridge University Press, 2000.
11. B. Nebel. Syntax based approaches to belief revision. Belief Revision, 52–88, 1992.
12. B. Nuseibeh, J. Kramer and A. Finkelstein, A Framework for Expressing the Relationships Between Multiple Views in Requirements Specification, IEEE Transactions on Software Engineering, 20(10): 760-773, October 1994.
13. S. Queins et al, The Light Control Case Study: Problem Description. Journal of
Universal Computer Science, Special Issue on Requirements Engineering: the Light
Control Case Study, Vol.6(7), 2000.
14. Odinaldo Rodrigues. A methodology for iterated information change. PhD thesis,
Department of Computing, Imperial College, January, 1998.
15. O. Rodrigues, Structured Clusters: A Framework to Reason with Contradictory
Interests, Journal of Logic and Computation, 13(1):69–97, 2003.
16. M. D. Ryan. Default in Specification, IEEE International Symposium on Requirements Engineering (RE93), 266–272, San Diego, California, January 1993.
17. G. Spanoudakis and A. Zisman. Inconsistency Management in Software Engineering: Survey and Open Research Issues, Handbook of Software Engineering and
Knowledge Engineering, (ed.) S.K. Chang, pp. 329-380, 2001.
18. D. Zowghi and R. Offen, A Logical Framework for Modeling and Reasoning about
the Evolution of Requirements, Proc. 3rd IEEE International Symposium on Requirements Engineering RE’97, Annapolis, USA, January 1997.
Analysing AI Planning Problems in Linear Logic
– A Partial Deduction Approach
Peep Küngas
Norwegian University of Science and Technology
Department of Computer and Information Science
[email protected]
Abstract. This article presents a framework for analysing AI planning
problem specifications. We consider AI planning as linear logic (LL) theorem proving. Then the usage of partial deduction is proposed as a foundation of an analysis technique for AI planning problems, which are
described in LL. By applying this technique we are able to investigate
for instance why there is no solution for a particular planning problem.
We consider here the !-Horn fragment of LL, which is expressive enough for representing STRIPS-like planning problems. Nevertheless, by taking advantage of full LL, more expressive planning problems can be described. Therefore, the framework proposed here can be seen as a step towards analysing both STRIPS-like and more complex planning problems.
1 Introduction
Recent advances in the field of AI planning, together with the increase in computers' computational power, have established solid ground for applying AI planning in mainstream applications. Mainstream usage of AI planning is especially emphasised in the light of the Semantic Web initiative, which, besides other factors, assumes that computational entities on the Web embody a certain degree of intelligence and autonomy. AI planning could therefore have applications in automated Web service composition, personalised assistant agents and intelligent user interfaces, for instance.
However, there are issues which may become a bottleneck for a wider adoption of AI planning technologies. From the end-user's point of view, a planning problem has to be specified in the simplest way possible, omitting many details relevant at the AI planning level. Thus a planning system is expected to reason about the missing information and construct a problem specification which still provides the expected results.
Another issue is that there exist problems for which, quite often, a complete solution to a declaratively specified problem cannot be found. Still, system users would be satisfied with an approximate solution, which could later be modified manually. Thus, if there is no solution for a planning problem, a planner could modify the problem and notify the user of the situation. An example of such an application is automated Web service composition. It may happen that no service completely satisfying the user's requirements can be
composed. However, there might be a solution available which at least partially
satisfies user requirements.
Similar problems may arise in dynamically changing systems as well, since it is sometimes hard to foresee the exact planning problem specification that will really be needed. Therefore, while computational environments change, specifications should change as well. However, the planner should follow certain criteria while changing specifications; otherwise the planning process may easily lose its intended purpose.
Finally, humans tend to introduce errors even in small pieces of code. Hence a framework for debugging planning specifications and informing users about potential mistakes would be appreciated. One way of debugging could be runtime analysis of planning problems: if no solution to a problem is found, a reason may be a bug in the planning problem specification.
Masseron et al. [13], among others, demonstrated how to apply linear logic
(LL) theorem proving for AI planning. We have implemented an AI planner [8],
which applies LL theorem proving for planning. Experimental results indicate
that on certain problems the performance of our planner is quite close to the
current state-of-the-art planners like TALPlanner, SHOP2 and TLPlan.
In this paper we present a framework for applying partial deduction to LL
theorem proving in order to extend the applicability of AI planning. Our approach to
analysing AI planning problems provides a framework, which could assist users
while debugging planning specifications. Additionally, the framework allows autonomous systems, given predefined preferences, to adapt themselves to rapidly
changing environments.
The rest of the paper is organised as follows. In Section 2 we present an introduction to LL and PD. Additionally we show how to encode planning problems
in LL such that LL theorem proving could be used for AI planning. Section 3
describes a motivating example and illustrates how PD in LL could be applied
for AI planning. Section 4 sketches theorems about completeness and soundness
of PD in LL. Section 5 reviews related work. The last section concludes the paper
and discusses future work.
2 Formal Basics and Definitions
2.1 Linear Logic
LL is a refinement of classical logic introduced by J.-Y. Girard to provide means
for keeping track of “resources”. In LL two assumptions of a propositional constant A are distinguished from a single assumption of A. This does not apply in
classical logic, since there the truth value of a fact does not depend on the number of copies of the fact. Indeed, LL is not about truth, it is about computation.
In the following we consider the !-Horn fragment [5] of LL (HLL), consisting of multiplicative conjunction (⊗), linear implication (⊸) and the "of course" operator (!). In terms of resource acquisition, the sequent A ⊗ B ⊢ C ⊗ D means that resources C and D are obtainable only if both A and B are obtainable. After the sequent has been applied, A and B are consumed and C and D are produced.
While an implication A ⊸ B, as a computability statement clause in HLL, can be applied only once, !(A ⊸ B) may be used an unbounded number of times; the latter formula can therefore be represented with an extralogical LL axiom. When such a clause is applied, the literal A is deleted from, and B inserted into, the current set of literals. If there is no literal A available, the clause cannot be applied. In HLL, ! cannot be applied to formulae other than linear implications.
Since HLL can be encoded as a Petri net, the theorem proving complexity of HLL is equivalent to the complexity of Petri net reachability checking and is therefore decidable [5]. Complexities of many other LL fragments have been
summarised by Lincoln [11].
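Since HLL sequents behave like Petri-net transitions over multisets of resources, the reading can be made concrete with a few lines of Python. This is our sketch with invented resource names: a rule with left-hand side A ⊗ B and right-hand side C ⊗ D fires only if its left-hand side is available, consuming it and producing the right-hand side.

```python
# Sketch (ours): a !-Horn rule A (x) B |- C (x) D read as a
# Petri-net-like transition over a multiset of resources.
from collections import Counter

def applicable(state, lhs):
    return all(state[r] >= n for r, n in lhs.items())

def apply_rule(state, lhs, rhs):
    """Consume lhs and produce rhs; fail if the rule is not applicable."""
    if not applicable(state, lhs):
        raise ValueError("left-hand side not available")
    new = state.copy()
    new.subtract(lhs)   # consume A and B
    new.update(rhs)     # produce C and D
    return +new         # drop zero counts

state = Counter({"A": 1, "B": 1})
print(apply_rule(state, Counter({"A": 1, "B": 1}), Counter({"C": 1, "D": 1})))
# Counter({'C': 1, 'D': 1})
```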
2.2 Representing STRIPS-Like Planning Problems in LL
While considering AI planning within LL, one of the intriguing issues is how to represent planning domains and problems. This section presents a resource-conscious representation of STRIPS-like operators as adopted by several researchers [13,4,3,6] for the LL framework.
Since we do not use negation in our subset of LL, there is no notion of truth-value for literals. All reasoning is reduced to the notion of resource – the number of occurrences of a literal determines whether an operator can be applied or not. Moreover, it is crucial to understand that the absence or presence of a literal in a state of the world does not determine the literal's truth-value.
While LL may be viewed as a resource consumption/generation model, the notions of STRIPS pre- and delete-lists overlap if we translate STRIPS operators to LL. This means that an LL planning operator may be applied if the resources in its delete-list form a subset of the resources in a given state of the world. Then, if the operator is applied, all resources in the delete-list are deleted from that state of the world and the resources in the add-list are inserted into the resulting state. Therefore, all literals of the pre-list which have to be preserved should also be present in the add-list. For instance, consider the STRIPS operator in Figure 1 and the corresponding extralogical LL axiom representing its semantics.
Thus every element in the pre-list of a STRIPS operator is inserted into the left-hand side of the linear implication. Additionally, all elements of the delete-list which are not already there are inserted. Into the right-hand side of the implication the add-list elements are inserted, plus all elements from the pre-list which have to be preserved. This is due to the resource-consciousness of LL, meaning literally that everything on the left-hand side of the implication is consumed and the resources on its right-hand side are generated.
Definition 1. A planning operator is an extralogical axiom stating that D may be rewritten into A, where D is the delete-list, A is the add-list, and the variables occurring free in D and A are treated as parameters of the operator. D and A are multiplicative conjunctions.
Fig. 1. A STRIPS operator.
It should be noted that, due to resource consciousness, several instances of the same predicate may be involved in a state of the world. Thus, in contrast to classical logic and informal STRIPS semantics, in LL the formulae A and A ⊗ A are distinguished.
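A hedged sketch (ours) of the translation just described: the left side of the implication is assembled from the pre-list plus any delete-list literals not already in it, and the right side from the add-list plus the preserved preconditions. The blocks-world operator used below is invented for the illustration, not the one in Figure 1.

```python
# Sketch (ours) of the STRIPS -> LL translation described above.
def strips_to_ll(name, pre, add, delete):
    lhs = list(pre) + [d for d in delete if d not in pre]
    preserved = [p for p in pre if p not in delete]
    rhs = list(add) + [p for p in preserved if p not in add]
    return "{}: {} (x) -o {}".format(name, " (x) ".join(lhs), " (x) ".join(rhs)) \
        if False else "{}: {} -o {}".format(name, " (x) ".join(lhs), " (x) ".join(rhs))

# Invented operator for illustration only.
print(strips_to_ll("pickup(x)",
                   pre=["clear(x)", "ontable(x)", "handempty"],
                   add=["holding(x)"],
                   delete=["ontable(x)", "handempty"]))
# pickup(x): clear(x) (x) ontable(x) (x) handempty -o holding(x) (x) clear(x)
```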
Definition 2. A state is a multiplicative conjunction.
Definition 3. A planning problem is represented by an LL sequent whose left-hand side contains the initial state S and whose right-hand side contains the goal state G of the planning problem. Both S and G are multiplicative conjunctions consisting of ground literals, and the set of planning operators is given as extralogical LL axioms.
From the theorem proving point of view, this LL sequent represents a theorem which has to be proved. If the theorem is proved, a plan is extracted from the proof.
2.3 Partial Deduction and LL
Partial deduction (PD) (or partial evaluation of logic programs, first introduced in [7]) is known as one of the optimisation techniques in logic programming. Given a logic program, partial deduction derives a more specific program while preserving the meaning of the original program. Since the program is more specialised, it is usually more efficient than the original program when executed. For instance, let A, B, C and D be propositional variables linked by computability statements in LL. Several partial deductions can then be obtained: the first corresponds to forward chaining (from the initial towards the goal state), the second to backward chaining (from the goal towards the initial state), and the third could be either forward or backward chaining. Partial deduction in logic programming is often defined as unfolding of program clauses.
Although the original motivation behind PD was to deduce specialised logic
programs with respect to a given goal, our motivation for PD is a bit different. We
are applying PD for determining planning subtasks, which cannot be performed
by the planner, but still are possibly closer to a solution than an initial task.
This means that given a state S and a goal G of a planning problem we compute
a new state and a new goal. This information is used for planning problem adaptation or debugging. A similar approach has been applied by Matskin and
Komorowski [14] in automatic software synthesis. One of their motivations was
debugging of declarative software specifications.
PD steps for backward and forward chaining in our framework are defined by the following rules.
Definition 4. First-order forward chaining PD step: a planning operator is applied to the initial state, replacing the resources it consumes by the resources it produces.
Definition 5. First-order backward chaining PD step: a planning operator is applied to the goal state, replacing the resources it produces by the resources it consumes.
In both preceding definitions, A, B and C are first-order LL formulae. Additionally, we assume an ordered set of constants, an ordered set of variables and a substitution between them; when the substitution is applied, the elements of the two sets are mapped to each other in the order in which they appear, and the sets must have the same number of elements. The two PD steps, respectively, apply a planning operator to move the initial state towards the goal state or vice versa. In the backward chaining step, the formulae involved denote a goal state G and a modified goal state; the step encodes that, if there is an applicable planning operator, we can change the goal state into the modified one. Analogously, in the forward chaining inference figure the formulae denote an initial state S and its modification, and the rule encodes that, if there is an applicable planning operator, we can change the initial state into its modification.
3 A Motivating Example
To illustrate the usage of PD in AI planning, let us consider the following planning problem in the blocks world domain. We have a robot that has to collect two red blocks and place them into a box. The robot has two actions available: the first picks up a red block, while the other places two blocks into a box. The planning operators are defined as labelled linear implications, where the label indicates which planning operator a particular implication represents. The planning problem itself is defined as an LL sequent from the initial state to the goal state, relative to the set of available planning operators.
In this sequent, the set of available planning operators is the one defined above, and the goal state is encoded by the formula Filled. Unfortunately, the initial state makes only one red block available; therefore, there is no solution to the planning problem.
However, by applying PD we can find at least a partial plan and notify the user of the situation. Applying PD to this particular problem yields a derivation representing a partial plan in which X stands for the part of the plan that could not be computed. The derivation can be developed further through LL theorem proving, and the resulting residual sequent can then be sent to the user, who determines what to do next. In some domains, the partial plan and the resulting planning problem specification can be processed further automatically; in this light, one has to implement a selection function for determining the literals in the planning problem specification that may be modified by the system.
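The run just described can be imitated with a few lines of greedy forward chaining over resource multisets. This is our sketch: the operator names pickup_red and fill_box and their encodings are assumptions based on the prose above, not the paper's formulas. When the goal cannot be reached, the function returns the partial plan together with the residual state and goal, which is exactly the information PD hands back to the user.

```python
# Sketch (ours): greedy forward-chaining PD over resource multisets.
# pickup_red turns one red block on the table into one held block;
# fill_box turns two held blocks into a filled box. Greedy selection is
# enough for this tiny example.
from collections import Counter

OPERATORS = {
    "pickup_red": (Counter({"red_on_table": 1}), Counter({"holding_red": 1})),
    "fill_box":   (Counter({"holding_red": 2}), Counter({"filled": 1})),
}

def forward_pd(state, goal):
    plan, progress = [], True
    while progress and (state & goal) != goal:   # stop once goal is contained
        progress = False
        for name, (lhs, rhs) in OPERATORS.items():
            if all(state[r] >= n for r, n in lhs.items()):
                state = state - lhs + rhs        # consume lhs, produce rhs
                plan.append(name)
                progress = True
                break
    return plan, state, goal

# Only one red block available, so the goal "filled" is unreachable.
plan, residue_state, residue_goal = forward_pd(
    Counter({"red_on_table": 1}), Counter({"filled": 1}))
print(plan)           # ['pickup_red']
print(residue_state)  # Counter({'holding_red': 1})
print(residue_goal)   # Counter({'filled': 1})
```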
4 Formal Results for PD
Definition 6 (Partial plan). A partial plan of a planning problem leading from a state I to a state O is a sequence of planning operator instances such that state O is achieved from state I after applying the operator instances.
One should note that a partial plan from a state to itself is an empty plan while, symmetrically, a partial plan from the initial state S to the goal state G of a planning problem is a complete plan, since it encodes that the plan leads from the initial state S to the goal state G.
Definition 7 (Resultant). A resultant is a partial plan together with a term representing the function which generates O from I by applying potentially composite functions that represent the planning operators in the partial plan.
Definition 8 (Derivation of a resultant). Let any predefined PD step be given. A derivation of a resultant is a finite sequence of resultants in which each resultant is obtained from the previous one by an application of a PD step.
Definition 9 (Partial deduction). A partial deduction of a planning problem is the set of all possible derivations of a complete plan from any resultant. The result of PD is a multiset of resultants.
One can easily see that this definition of PD generates a whole proof tree for a planning problem.
Definition 10 (Executability). A planning problem is executable iff, given its set of operators, a derivation of a resultant can be constructed that ends with a trivially solved resultant, where the remaining state A is arbitrary.
Soundness and completeness are defined through executability of planning problems.
Definition 11 (Soundness of PD of a planning problem). A partial plan is executable if a complete plan is executable in the planning problem and there is a derivation relating the complete plan to the partial plan.
Completeness is the converse:
Definition 12 (Completeness of PD of a planning problem). A complete plan is executable if a partial plan is executable in the planning problem and there is a derivation relating the partial plan to the complete plan.
Our proofs of soundness and completeness are based on proving that the derivation of a partial plan is a derivation in a planning problem using the PD steps, which were defined as inference figures in HLL.
Proposition 1. The first-order forward chaining PD step is sound with respect to the first-order LL rules.
Proposition 2. The first-order backward chaining PD step is sound with respect to the first-order LL rules.
Both propositions are proved by exhibiting the corresponding derivations in LL.
Theorem 1 (Soundness). PD for LL in first-order HLL is sound.
Proof. Since all PD steps are sound, PD for LL in HLL is sound as well: whenever a derivation exists, it is constructed by PD in a formally correct manner.
Theorem 2 (Completeness). PD for LL in first-order HLL is not complete.
Proof. In the general case, first-order HLL is undecidable. Therefore, since PD applies HLL inference figures for derivation, PD in first-order HLL is not complete. In other words, a derivation may not be found in finite time even if such a derivation exists. Therefore PD for LL in the first-order HLL fragment is not complete.
Kanovich and Vauzeilles [6] determine certain constraints which help to reduce the complexity of theorem proving in first-order HLL. By applying those constraints, the theorem proving complexity can be reduced to PSPACE. However, in the general case theorem proving in first-order HLL remains undecidable.
5 Related Work
Several works have considered theoretical issues of LL planning. The multiplicative conjunction
and additive disjunction
have been employed in [13],
where a demonstration of a robot planning system has been given. The usage
of ? and !, whose importance to AI planning is emphasised in [1], is discussed
there, but not demonstrated.
Influenced by [13], LL theorem proving has been used by Jacopin [4] as an
AI planning kernel. Since only the multiplicative conjunction is used in the formulae there, the problem representation is almost equivalent to the representation in STRIPS-like planners – the left-hand side of an LL sequent represents a STRIPS delete-list and the right-hand side accordingly an add-list. In [2] a formalism has been proposed for deductively generating recursive plans in LL. This advancement is a step towards more general plans, which are capable of solving a class of problems instead of a single problem.
Although PD was first introduced by Komorowski [7], Lloyd and Shepherdson [12] were the first to formalise PD for normal logic programs. They showed
PD’s correctness with respect to Clark’s program completion semantics. Since
then several formalisations of PD for different logic formalisms have been developed. Lehmann and Leuschel [10] developed a PD method capable of solving
planning problems in the fluent calculus. A Petri net reachability checking algorithm is used there for proving completeness of the PD method. However, they
do not consider how to handle partial plans.
Matskin and Komorowski [14] applied PD to automated software synthesis.
One of their motivations was debugging of declarative software specification. The
idea of using PD for debugging is quite similar to the application of PD in symbolic agent negotiation [9]. In both cases PD helps to determine computability
statements, which cannot be solved by a system.
6 Conclusions
In this paper we described a PD approach for analysing AI planning problems.
Generally our method applies PD to the original planning problem until a solution (plan) is found. If no solution is found, one or many modified planning
problems are returned. User preferences could be applied for filtering out essential modifications.
We have implemented a planner called RAPS, which is based on a fragment
of linear logic (LL). RAPS applies constructive theorem proving in multiplicative intuitionistic LL (MILL). First a planning problem is described with LL
sequents. Then LL theorem proving is applied to determine whether the problem is solvable and, if it is, a plan is finally extracted from the proof.
By combining the planner with the PD approach we have implemented symbolic agent negotiation [9]. The main idea is that, if one agent fails to find a solution for a planning problem, it engages other agents who may help to develop the partial plan further. As a result, the system implements distributed AI planning. The main focus of the current paper, however, is on analysing planning problems, not on cooperative problem solving as presented in [9].
Acknowledgements
This work was partially supported by the Norwegian Research Foundation in the
framework of Information and Communication Technology (IKT-2010) program
– the ADIS project. The author would like to thank anonymous referees for their
comments.
References
1. S. Brüning, S. Hölldobler, J. Schneeberger, U. Sigmund, M. Thielscher. Disjunction
in Resource-Oriented Deductive Planning. Technical Report AIDA-93-03, Technische Hochschule Darmstadt, Germany, 1994.
2. S. Cresswell, A. Smaill, J. Richardson. Deductive Synthesis of Recursive Plans in
Linear Logic. In Proceedings of the Fifth European Conference on Planning, pp.
252–264, 1999.
3. G. Grosse, S. Hölldobler, J. Schneeberger. Linear Deductive Planning. Journal of
Logic and Computation, Vol. 6, pp. 232–262, 1996.
4. É. Jacopin. Classical AI planning as theorem proving: The case of a fragment of
Linear Logic. In AAAI Fall Symposium on Automated Deduction in Nonstandard
Logics, Palo Alto, California, AAAI Press, pp. 62–66, 1993.
5. M. I. Kanovich. Linear Logic as a Logic of Computations. Annals of Pure and
Applied Logic, Vol. 67, pp. 183–212, 1994.
6. M. I. Kanovich, J. Vauzeilles. The Classical AI Planning Problems in the Mirror
of Horn Linear Logic: Semantics, Expressibility, Complexity. Mathematical Structures in Computer Science, Vol. 11, No. 6, pp. 689–716, 2001.
7. J. Komorowski. A Specification of An Abstract Prolog Machine and Its Application to Partial Evaluation. PhD thesis, Technical Report LSST 69, Department
of Computer and Information Science, Linkoping University, Linkoping, Sweden,
1981.
8. P. Küngas. Resource-Conscious AI Planning with Conjunctions and Disjunctions.
Acta Cybernetica, Vol. 15, pp. 601–620, 2002.
9. P. Küngas, M. Matskin. Linear Logic, Partial Deduction and Cooperative Problem
Solving. In Proceedings of the First International Workshop on Declarative Agent
Languages and Technologies (in conjunction with AAMAS 2003), DALT’2003, Melbourne, Australia, July 15, 2003, Lecture Notes in Artificial Intelligence, Vol. 2990,
2004, Springer-Verlag.
10. H. Lehmann, M. Leuschel. Solving Planning Problems by Partial Deduction. In
Proceedings of the 7th International Conference on Logic for Programming and Automated Reasoning, LPAR’2000, Reunion Island, France, November 11–12, 2000,
Lecture Notes in Artificial Intelligence, Vol. 1955, pp. 451–467, 2000, SpringerVerlag.
11. P. Lincoln. Deciding Provability of Linear Logic Formulas. In J.-Y. Girard, Y. Lafont, L. Regnier (eds). Advances in Linear Logic, London Mathematical Society
Lecture Note Series, Vol. 222, pp. 109–122, 1995.
12. J. W. Lloyd, J. C. Shepherdson. Partial Evaluation in Logic Programming. Journal
of Logic Programming, Vol. 11, pp. 217–242, 1991.
13. M. Masseron, C. Tollu, J. Vauzeilles. Generating plans in Linear Logic I–II. Theoretical Computer Science, Vol. 113, pp. 349–375, 1993.
14. M. Matskin, J. Komorowski. Partial Structural Synthesis of Programs. Fundamenta
Informaticae, Vol. 30, pp. 23–41, 1997.
Planning with Abduction: A Logical Framework
to Explore Extensions to Classical Planning*
Silvio do Lago Pereira and Leliane Nunes de Barros
Institute of Mathematics and Statistics – University of São Paulo
{slago,leliane}@ime.usp.br
Abstract. In this work we show how a planner implemented as an abductive reasoning process can have the same performance and behavior
as classical planning algorithms. We demonstrate this result by considering three different versions of an abductive event calculus planner in reproducing some important comparative analyses of planning algorithms
found in the literature. We argue that a logic-based planner, defined as
the application of general purpose theorem proving techniques to a general purpose action formalism, can be a very solid base for the research
on extending the classical planning approach.
Keywords: abduction, event calculus, theorem proving, planning.
1 Introduction
In general, in order to cope with domain requirements, any extension to the STRIPS representation language would require the construction of complex planning algorithms, whose soundness cannot be easily proved. The so-called practical planners, which are said to be capable of solving complex planning problems, are constructed in an ad hoc fashion, making it difficult to explain why they work or why they exhibit successful behavior. The main motivation for the construction of logic-based planners is the possibility of specifying planning systems in terms of general theories of action and implementing them as general purpose theorem provers, with a guarantee of soundness. Another advantage is that a planning system defined in this way has a close correspondence between specification and implementation. There are several works aiming at the construction of sound and complete logic-based planning systems [1], [2], [3]. More recent research results [4] demonstrate that a good theoretical solution can coexist with a good practical solution, despite widespread belief to the contrary [5].
In this work, we report on the implementation and analysis of three different
versions of an abductive event calculus planner, a particular logic-based planner
which uses event calculus [6] as a formalism to reason about actions and change
and abduction [7] as an inference rule. By reproducing some important results on
comparative analyses of planning algorithms [8], [9], and including experiments with the corresponding versions of the abductive event calculus planner, we show that there is a close correspondence between well-known planning algorithms and
* This work has been supported by the Brazilian sponsoring agencies Capes and CNPq.
logic-based planners. We also show that the efficiency results observed with a
logic-based planner that adopts abductive event calculus and theorem proving
are comparable to those observed with some practical planners. We claim
that one should start from an efficient logical implementation in order to make
further extensions towards the specification of non-classical planners.
2 Abductive Reasoning in the Event Calculus
Abduction is an inference principle that extends deduction, providing hypothetical reasoning. As originally introduced by [10], it is an unsound inference rule that resembles a reverse modus ponens: if we observe a fact and we know a rule whose consequent is that fact, then we can accept the rule's antecedent as a possible explanation for it. Thus, abduction is a weak kind of inference in the sense that it only guarantees that the explanation is plausible, not that it is true.
Formally, given a set of sentences describing a domain (the background theory) and a sentence describing an observation, the abduction process consists of finding a set of sentences (the residue, or explanation) such that the union of the background theory and the residue is consistent and entails the observation. Clearly, depending on the background theory, for the same observed fact we can have multiple possible explanations. In general, the definition of best explanation depends on the context, but it is almost always related to some notion of minimality. In practice, we should prefer explanations that postulate the minimum number of causes [11]. Furthermore, abduction is, by definition, a kind of nonmonotonic reasoning, i.e. an explanation that is consistent w.r.t. a given knowledge state can become inconsistent when new information is taken into account [7].
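The definition above can be made concrete with a tiny propositional sketch (ours, with invented rules and atoms): an explanation is a set of abducible atoms whose addition to the background theory lets the observation be derived, and among the candidates we keep the subset-minimal ones. Consistency is trivial here because definite clauses cannot be contradictory.

```python
# Sketch (ours): abduction over propositional definite clauses.
# Rules are (head, [body atoms]); an explanation is a set of abducibles
# that, added to the theory, lets the observation be derived.
from itertools import chain, combinations

RULES = [("wet_grass", ["rain"]), ("wet_grass", ["sprinkler_on"])]
ABDUCIBLES = {"rain", "sprinkler_on"}

def derivable(goal, facts):
    """Forward-chain the definite clauses from `facts` and test `goal`."""
    known, changed = set(facts), True
    while changed:
        changed = False
        for head, body in RULES:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return goal in known

def explanations(observation):
    candidates = [set(c) for c in chain.from_iterable(
        combinations(sorted(ABDUCIBLES), r)
        for r in range(len(ABDUCIBLES) + 1))]
    good = [c for c in candidates if derivable(observation, c)]
    return [c for c in good if not any(d < c for d in good)]  # minimal ones

print(explanations("wet_grass"))  # e.g. [{'rain'}, {'sprinkler_on'}]
```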
Next, we present the event calculus as the formalism used to describe the
background theory on planning domains and we show how the planning task
can be understood as an abductive process in the event calculus.
2.1 The Event Calculus Formalism
The event calculus [12] is a formalism designed to model and reason about scenarios described as sets of events whose occurrences have the effect of starting or terminating the truth of determined properties (fluents) of the world. There are many versions of event calculi [13]. In this work, we use the version defined in [6] and adopt its axiomatization, whose individual axioms are referred to below by the names EC1, EC2, and so on.
In the event calculus, the frame problem is overcome through circumscription. Given a domain description expressed as a conjunction of formulae that does not include the predicates initially or happens; a narrative of actions expressed as a conjunction of formulae that does not include the predicates initiates, terminates or releases; a conjunction of uniqueness-of-names axioms for actions and fluents; and EC, a conjunction of the axioms of the event calculus, the background theory for abductive event calculus planning is the conjunction of the circumscription of the domain description with respect to initiates, terminates and releases, the circumscription of the narrative with respect to happens, the uniqueness-of-names axioms, and EC. By circumscribing initiates, terminates and releases we impose that the known effects of actions are the only effects of actions, and by circumscribing happens we assume that there are no unexpected event occurrences. An extended discussion about the frame problem and its solution through circumscription can be found in [14].
2.2
An Abductive Event Calculus Planner
Planning in the event calculus is naturally handled as an abduction process [2]. In this setting, given a domain description $\Sigma$, the task of planning a sequence of actions in order to satisfy a given goal $\Gamma$ corresponds to an abductive process expressed by:
$$\mathrm{CIRC}[\Sigma;\ initiates, terminates, releases] \wedge \mathrm{CIRC}[\Delta;\ happens] \wedge \Omega \wedge EC \models \Gamma$$
where $\Delta$ – the abductive explanation – is a plan for the goal $\Gamma$. In [4] a planning system based on this idea is presented as a PROLOG abductive meta-interpreter. This meta-interpreter is specialized for the event calculus by compiling the EC axioms into its meta-clauses. The main advantage of this compilation is that it allows an extra level of control in the planner. In particular, it allows us to define an ordering in which subgoals can be achieved, improving efficiency and giving special treatment to predicates that represent incomplete information. By incomplete information we mean predicates for which we do not assume closure, i.e. we
cannot use negation as failure to prove their negations, since they can be abduced. The solution to this problem is to give special treatment to negated literals with incomplete information at the meta-level. In the case of partial order planning, we have incomplete information about the predicate before, allowing the representation of partial order plans. Thus, when the meta-interpreter finds a literal ¬before(X,Y), it tries to prove it by adding before(Y, X) to the plan (abductive residue) and checking its consistency.
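To make this mechanism concrete, the following is a minimal, self-contained sketch in PROLOG (the predicate names demo_not_before/4, reachable/3,4 and consistent/1 are our own, not the AECP's) of how a negated before literal can be handled by abducing the opposite ordering and checking that the residue remains an acyclic partial order:

:- use_module(library(lists)).                % member/2

% reachable(X, Y, R): Y is reachable from X via before/2 facts in residue R.
reachable(X, Y, R) :- reachable(X, Y, R, [X]).
reachable(X, Y, R, _)       :- member(before(X, Y), R).
reachable(X, Z, R, Visited) :-
    member(before(X, Y), R),
    \+ member(Y, Visited),
    reachable(Y, Z, R, [Y | Visited]).

% consistent(R): the ordering induced by R has no cycle.
consistent(R) :- \+ (member(before(X, _), R), reachable(X, X, R)).

% demo_not_before(X, Y, R0, R): "prove" not(before(X, Y)) by adding
% before(Y, X) to the abductive residue, provided the result is consistent.
demo_not_before(X, Y, R, R) :-
    member(before(Y, X), R).
demo_not_before(X, Y, R0, [before(Y, X) | R0]) :-
    \+ member(before(Y, X), R0),
    \+ member(before(X, Y), R0),
    consistent([before(Y, X) | R0]).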
In the abductive event calculus planner – AECP – a planning problem is given by a domain description represented by a set of initiates, terminates and releases clauses, an initial state description represented by a set of clauses, and a goal description represented by a list of holdsAt literals. As a solution, the planner returns an abductive residue composed of happens and before literals (the partial order plan) and a negative residue composed of clipped and declipped literals (the causal links of the partial order plan).
3
Classical Planning in the Event Calculus
In order to perform a fair comparative analysis with STRIPS-like planning algorithms, some modifications have to be made to the AECP, related to the following assumptions of classical planning: (i) atomic time, (ii) deterministic effects and (iii) omniscience. From (i) it follows that we need to change the predicate happens(A,T1,T2) to a binary version: happens(A,T) means that the action A happens at time T and, with this change, the axiom EC7 is no longer necessary. From (ii) it follows that there is no need for the predicate releases and, finally, from (iii) (remembering that STRIPS's action representation does not allow negative preconditions), it follows that there is no need for the predicate declipped, nor for the axioms EC3, EC4 and EC6. With these changes, we specify a simplified axiomatization of the event calculus containing only the aspects relevant to classical planning:
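The displayed axioms were lost from the source; the reconstruction below is only a sketch of what such a simplified axiomatization typically looks like (the labels SEC1–SEC3 and the exact form are our assumptions, based on the description above and on the way happens, initiates, terminates and clipped are used in the rest of the paper):

SEC1: $holdsAt(F,T) \leftarrow initially(F) \wedge \neg clipped(0,F,T)$
SEC2: $holdsAt(F,T_2) \leftarrow happens(A,T_1) \wedge initiates(A,F,T_1) \wedge T_1 < T_2 \wedge \neg clipped(T_1,F,T_2)$
SEC3: $clipped(T_1,F,T_3) \leftrightarrow \exists A,T_2\,[happens(A,T_2) \wedge terminates(A,F,T_2) \wedge T_1 < T_2 \wedge T_2 < T_3]$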
3.1
The ABP Planning System
Based on this simplified axiomatization, we have implemented the ABP planning system. This planner uses iterative deepening search (IDS) and a first-in, first-out (FIFO) goal ordering, while the AECP uses depth-first search (DFS) and last-in, first-out (LIFO) strategies. Using IDS, we make the method complete and increase the possibility of finding minimal explanations. It is important to notice that in the original version of the AECP neither property held. Next, we explain the knowledge representation and control knowledge decisions made in our implementations that are relevant to the comparative analysis presented in the next section.
Action Representation. In the event calculus, the predicates initiates and
terminates are used to describe the effects of an action. For instance, consider
the predicate walk(X, Y) representing the act of walking from X to Y. The effects of this action can be described as:
In the AECP’s meta-level, the above clauses are represented by the predicate
axiom(H, B), where H is the head of the clause and B is its body, that is:
Similarly, the STRIPS representation of this action is:
Note that, in the STRIPS representation, the first parameter of the predicate
oper is the action’s name, while in the EC representation, the first parameter of the predicate axiom is initiates or terminates. Since PROLOG’s indexing method uses the first parameter as the searching key, finding an action with the predicate oper would take constant time, while a search with
the predicate axiom would take time proportional to the number of clauses
for this predicate included in the knowledge base. Thus, in order to establish a suitable correspondence between both approaches, in the implementation of the ABP the effect clauses are represented at the meta-level in a form whose first argument is the action itself, so that PROLOG indexing can retrieve them in constant time; the remaining clause types are represented in an analogous way.
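The displayed clauses of this subsection were lost from the source. The hedged PROLOG sketch below illustrates the encodings being contrasted; the fluent at/1 and the exact clause shapes are our assumptions, not the paper's:

% (a) Event calculus effect clauses for walk(X, Y):
initiates(walk(_From, To), at(To), _T).
terminates(walk(From, _To), at(From), _T).

% (b) AECP meta-level encoding axiom(Head, Body): PROLOG indexes on the
%     functor of the first argument (initiates/terminates), so retrieving the
%     effects of a given action scans the axiom/2 clauses linearly.
axiom(initiates(walk(_From, To), at(To), _T), []).
axiom(terminates(walk(From, _To), at(From), _T), []).

% (c) STRIPS-style encoding oper(Name, Preconditions, AddList, DeleteList):
%     here first-argument indexing retrieves the operator in constant time.
oper(walk(From, To), [at(From)], [at(To)], [at(From)]).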
Abducible and Executable Predicates. In the AECP [4], the meta-predicates abducible and executable are used to establish which predicates are abducible and which actions are executable, respectively. The declaration of the abducible predicates is important to the planner, as it needs to know the predicates
with incomplete information that can be added to the residue. By restricting
the facts that can be abduced, we make sure that only basic explanations are
computed (i.e. those explanations that cannot be formulated in terms of other effects). On the other hand, the declaration of executable actions only makes
sense in hierarchical task network planners (HTN), where it is important to distinguish between primitive and compound actions. Since in this work we only
want to compare the logical planner with partial order planners, we can assume
that all the actions in the knowledge base are executable and that the only abducible predicates are happens and before (the same assumption is made in
STRIPS-like partial order planners).
Codesignation Constraints. Since the AECP uses PROLOG's unification procedure as the method to add codesignation constraints to the plan, it is difficult to compare it with STRIPS-like planning algorithms (which have a special
procedure implemented for this purpose). So, we have implemented ABP as a
propositional planner, as is commonly done in most of the performance analyses
in the planning literature. As we will see, this change has positively affected the
verification of the consistency of the negative residue.
Consistency of the Negative Residue. In the AECP, the negative residue
(i.e. facts deduced through negation as failure) has to be checked for consistency every time the positive residue H (i.e. facts obtained through abduction)
is modified. This behavior corresponds to an interval protection strategy for the
predicate clipped (in a way equivalent to book-keeping in partial order planning). However, in the case of a propositional planner, we have only to check
for consistency a new literal clipped (added to the negative residue) with respect to the actions already in the positive residue, and a new literal happens
(added to the positive residue) with respect to the intervals already in the negative residue. Thus, in contrast with the performance presented by the AECP,
the conflict treatment in the ABP is incremental and has a time complexity of
In addition, when an action in the plan is selected as the establisher of
a subgoal, only the new added literal clipped has to be protected.
3.2
Systematicity and Redundancy
In order to analyse the performance of the abductive event calculus planner, we
have implemented three different planning strategies:
ABP: abductive planner (equivalent to POP [15]);
SABP: systematic version of ABP (equivalent to SNLP [16]);
RABP: redundant version of ABP (equivalent to TWEAK [17]).
Systematicity. A systematic version of the ABP, called SABP, can be obtained
by modifying the event calculus axiom SEC3 to consider as a “threat” to a fluent
F not only an action that terminates it, but also an action that initiates it:
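The modified axiom itself is missing from the source; under the reconstruction of SEC3 sketched earlier in this section, the change would read roughly as follows (our notation, not necessarily the authors'):

SEC3': $clipped(T_1,F,T_3) \leftrightarrow \exists A,T_2\,[happens(A,T_2) \wedge (terminates(A,F,T_2) \vee initiates(A,F,T_2)) \wedge T_1 < T_2 \wedge T_2 < T_3]$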
With this simple change, we expect SABP to have the same performance as systematic planners such as SNLP [16], and the same performance trade-off with respect to the corresponding redundant version of the ABP planner.
Redundancy. A redundant version of the ABP, called RABP, does not require any modification to the EC axioms. The only change that we have to make is in the goal selection strategy. In the ABP, as well as in the SABP, subgoals are selected and then eliminated from the list of subgoals as soon as they are satisfied. This can be safely done because those planners apply a causal link protection strategy.
An MTC (modal truth criterion) strategy for goal selection can be easily implemented in the RABP by performing a temporal projection. This is done by making the meta-interpreter “execute” the current plan, without allowing
any modification to it. This process returns as output the first subgoal that is not necessarily true.
Another modification concerns the negative residue: the RABP does not need to check the consistency of negative residues every time the plan is modified. So, in the RABP, the negative literals of the predicate clipped do not receive special treatment from the meta-interpreter. As in TWEAK [17], this may lead the RABP to select the same subgoal more than once but, on the other hand, it can plan for more than one subgoal with a single goal establishment.
4
The Comparative Analysis
In order to show the correspondence between abductive planning and partial
order planning, we have implemented the abductive planners (ABP, SABP and
RABP) and three well known partial order planning algorithms (POP, SNLP and
TWEAK). All these planners have been implemented in PROLOG and all necessary care has been taken to guarantee the validity of the comparisons (e.g. all the planners share common data structures and procedures). A complete analysis of these results is presented in [18] and [19].
We have performed two experiments with these six planners: (i) evaluation of
the correspondence between abductive planning in the event calculus and partial
order planning and (ii) evaluation of systematicity/redundancy obtained with
different goal protection strategies.
4.1
Experiment I: Correspondence Between POP and ABP
In order to evaluate the relative performance of the planners POP and ABP, we
have used the family of artificial domains proposed in [15]. With this, we ensure that the empirical results we obtained are independent of the idiosyncrasies of a particular domain.
Based on these domains, we have performed two tests: in the first, we observe how the size of the search space explored by the systems increases as we
increase the number of subgoals in the problems; in the second, we observe how
the average CPU-time consumed by the systems increases as we increase the
number of subgoals in the problems. In figure 1, we can observe that the ABP
and POP explore identical search spaces. Therefore, we can conclude that they
implement the same planning strategies (i.e. they examine the same number
of plans, independently of the fact that they implement different approaches).
This result extends the work presented in [4], which verifies the correspondence
between abductive planning in the event calculus (AECP) and partial order planning (POP) only in an informal way, by inspecting the code. In figure 2, we can
observe that, for all problems solved, the average CPU-time consumed by both
planners is approximately the same. This shows that the necessary inferences in
the logical planners do not increase the time complexity of the planning task.
Therefore, through this first experiment, we have corroborated the conjecture
that abductive planning in the event calculus is isomorphic to partial order
Fig. 1. Search space size to solve problems in
Fig. 2. Average CPU-time to solve problems in
planning [4]. Also, we have shown that, using abduction as the inference rule and the event calculus as the formalism for reasoning about actions and change, a logical planning system can be as efficient as a partial order planning system, with the advantage that its specification is “directly executable”.
4.2
Experiment II: Trade-Off Between Systematicity
and Redundancy
There was a belief that by decreasing redundancy it would be possible to improve planning efficiency. So, a systematic planner, which never visits the same
plan twice in its search space, would be more efficient than a redundant planner [16]. However, [20] has shown that there is a trade-off between redundancy
elimination and least commitment: redundancy is eliminated at the expense of
increasing commitment in the planner. Therefore, the performance of a partial
order planner is better predicted based on the way it deals with the trade-off
between redundancy and commitment than on the systematicity of its search.
In order to show the effects of this trade-off, Kambhampati chose two well
known planning algorithms: TWEAK and SNLP. TWEAK does not keep track of which goals were already achieved and which remain to be achieved. Therefore, TWEAK may achieve and clobber a subgoal arbitrarily many times, leading to a lot of redundancy in its search space. On the other hand, SNLP achieves systematicity by keeping track of the causal links of the plans generated during search, and
ensuring that each branch of the search space commits to and protects mutually
exclusive causal links for the partial plans, i.e. it protects already established
goals from negative or positive threats. Such protection corresponds to a strong
Fig. 3. Average CPU-time to solve problems in
form of premature commitment (by imposing ordering constraints on positive
threats) which can increase the amount of backtracking as well as the solution
depth, having an adverse effect on the performance of the planner.
Kambhampati’s experimental analyses show that there is a spectrum of solutions to the trade-off between redundancy and commitment in partial order
planning, in which the SNLP and TWEAK planners fall into opposite extremes.
To confirm this result, and to show that it is also valid for abductive planners, we created a new family of artificial domains [19], through which we can accurately control the ratio between the number of positive threats (i.e. distinct actions that contribute the same effect) and negative threats (i.e. distinct actions that contribute opposing effects) in each domain. To observe the behavior of the compared planners as we vary the ratio between the number of positive and negative threats in the domains, we keep the number of subgoals in the solved problems constant. Then, as a consequence of this fact and of the characteristics of the domains in this family, the number of steps in all solutions always stays the same.
The results of this second experiment (figure 3) show that the systematic and redundant versions of the abductive planner (SABP and RABP) have the same behavior as their corresponding algorithmic planners (SNLP and TWEAK). So, we have extended the results of the previous experiment and shown that the isomorphism between abductive reasoning in the event calculus and partial order planning is preserved for systematic and redundant methods of planning. Moreover, we also corroborate the conjecture that the performance of a systematic or redundant planner is strongly related to the ratio between the number of positive and negative threats in the considered domain [8], and that this conjecture remains valid for abductive planning in the event calculus.
5
Conclusion
The main contributions of this work are: (i) to propose a formal specification of different well-known algorithms of classical planning and (ii) to show how a
planner based on theorem proving can have behavior and performance similar to those observed in partial order planners based on STRIPS. One extra advantage of our formal specification is its close relationship with a PROLOG implementation, which can provide a good framework for testing extensions to the classical approach, as well as for the integration of knowledge-based approaches to planning.
It is important to note that the original version of the AECP proposed in [4] guarantees neither completeness nor minimal plan solutions. However, the abductive planners we have specified and implemented guarantee these properties by using IDS (iterative deepening search) and FIFO goal ordering strategies.
We are currently working on the idea proposed in [21], which aims to build, on top of our abductive planners, a high-level robot programming language for applications in cognitive robotics. First, we have implemented an HTN version of the abductive event calculus planner to cope with the idea of high-level specifications of robotic tasks. Further, we intend to work on planning and execution with incomplete information.
References
1. Green, C.: Application of theorem proving to problem solving. In: International
Joint Conference on Artificial Intelligence. Morgan Kaufmann (1969) 219–239
2. Eshghi, K.: Abductive planning with event calculus. In: Proc. of the 5th International Conference on Logic Programming. MIT Press (1988) 562–579
3. Missiaen, L., Bruynooghe, M., Denecker, M.: Chica, an abductive planning system
based on event calculus (1994)
4. Shanahan, M.P.: An abductive event calculus planner. The Journal of Logic Programming 44 (2000) 207–239
5. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach (second edition).
Prentice-Hall, Englewood Cliffs, NJ (2003)
6. Shanahan, M.: A circumscriptive calculus of events. Artificial Intelligence 77 (1995)
249–284
7. Kakas, A.C., Kowalski, R.A., Toni, F.: Abductive logic programming. Journal of
Logic and Computation 2 (1992) 719–770
8. Knoblock, C., Yang, Q.: Evaluating the tradeoffs in partial-order planning algorithms (1994)
9. Kambhampati, S., Knoblock, C.A., Yang, Q.: Planning as refinement search: A unified framework for evaluating design tradeoffs in partial-order planning. Artificial
Intelligence 76 (1995) 167–238
10. Peirce, C.S.: Collected Papers of Charles Sanders Peirce. Harvard University Press
(1931-1958)
11. Cox, P.T., Pietrzykowski, T.: Causes for events: their computation and applications. In: Proc. of the 8th international conference on Automated deduction,
Springer-Verlag New York, Inc. (1986) 608–621
12. Kowalski, R.A., Sergot, M.J.: A logic-based calculus of events. New Generation Computing 4 (1986) 67–95
13. Santos, P.E.: Formalising the common sense of a mobile robot (1998)
14. Shanahan, M.P.: Solving the Frame Problem: A Mathematical Investigation of the
Common Sense Law of Inertia. MIT Press (1997)
15. Barrett, A., Weld, D.S.: Partial-order planning: Evaluating possible efficiency
gains. Artificial Intelligence 67 (1994) 71–112
16. MacAllester, D., Rosenblitt, D.: Systematic nonlinear planning. In: Proc. 9th National Conference on Artificial Intelligence. MIT Press (1991) 634–639
17. Chapman, D.: Planning for conjunctive goals. Artificial Intelligence 32 (1987) 333–
377
18. Pereira, S.L., Barros, L.N.: Efficiency in abductive planning. In: Proceedings of
2nd Congress of Logic Applied to Technology. Senac, São Paulo (2001) 213–222
19. Pereira, S.L.: Abductive Planning in the Event Calculus. Master Thesis, Institute
of Mathematics and Statistics - University of Sao Paulo (2002)
20. Kambhampati, S.: On the utility of systematicity: Understanding tradeoffs between redundancy and commitment in partial-ordering planning. In: Foundations
of Automatic Planning: The Classical Approach and Beyond: Papers from the 1993
AAAI Spring Symposium, AAAI Press, Menlo Park, California (1993) 67–72
21. Barros, L.N., Pereira, S.L.: High-level robot programs based on abductive event
calculus. In: Proceedings of 3rd International Cognitive Robotics Workshop. (2002)
High-Level Robot Programming:
An Abductive Approach Using Event Calculus
Silvio do Lago Pereira and Leliane Nunes de Barros
Institute of Mathematics and Statistics – University of São Paulo
{slago,leliane}@ime.usp.br
Abstract. This paper proposes a new language that can be used to
build high-level robot controllers with high-level cognitive functions such
as plan specification, plan generation, plan execution, perception, goal
formulation, communication and collaboration. The proposed language is
based on GOLOG, a language that uses the situation calculus as a formalism to describe actions and deduction as an inference rule to synthesize
plans. Instead of the situation calculus and deduction, however, the new language uses the event calculus and abductive reasoning to synthesize plans. As we foresee, this change of paradigm allows the agent to reason about partial order plans, making possible a more flexible integration between deliberative and reactive behaviors.
Keywords: cognitive robotics, abduction, event calculus, planning.
1
Introduction
The area of cognitive robotics is concerned with the development of agents with the autonomy to solve complex tasks in dynamic environments. This autonomy requires high-level cognitive functions such as reasoning about actions, perceptions, goals, plans, communication, collaboration, etc. As one can guess, implementing these functions using a conventional programming language can be a very difficult task. On the other hand, by using a logical formalism to reason about
actions and change, we can have the necessary expressive power to provide these
capabilities.
A logical programming language designed to implement autonomous agents
should have two important characteristics: (i) to allow a programmer to specify a
robot control program, as easily as possible, using high-level actions as primitives
and (ii) to allow a user to specify goals and provide them to an agent with the
ability to plan a correct course of actions to achieve these goals. The GOLOG [1] programming language for agents, developed by the Cognitive Robotics Group of the University of Toronto, was designed to serve this purpose: (i) it is a high-level agent programming language, in which standard programming constructs
(e.g. sequence, choice and iteration) are used to write the agent control program
and (ii) it can effectively represent and reason about the actions performed by
agents in dynamic environments. The emerging success of GOLOG has shown
that, by using a logical approach, it is possible to solve complex robotic tasks
efficiently, despite the widespread belief to the contrary [2]. However, GOLOG uses a planning strategy based on the situation calculus, a logical formalism in which plans are represented as totally ordered sequences of actions and, therefore, it inherits the well known deficiencies of this approach [3].
In this work, we argue that a partial order plan representation can be better
adapted to different planning domains, being more useful in robotic applications
(notice that a least commitment strategy on plan step ordering can allow a more
flexible interleaving of reactive and deliberative behavior). We also propose a
new high-level robot programming language called ABGOLOG. This language is based on GOLOG (i.e. it has the same syntax and semantics), but it uses the event calculus as the formalism to describe actions and abductive reasoning to synthesize plans,
which corresponds to partial order planning [4]. So, based on our previous work
on implementation and analysis of abductive event calculus planning systems [5],
we show how it is possible to modify ABGOLOG’s implementation to improve its
efficiency, according to specific domain characteristics.
This paper is organized as follows: in Section 2, we briefly review the basics of the situation calculus and how it is used in the GOLOG language; in Section 3, we
present the event calculus and how it can be used to implement three versions of
an abductive event calculus planner that can serve as a kernel in ABGOLOG; we
also show how the different versions of the abductive planner can be used by this
language, depending on the characteristics of the robotics application; finally, in
Section 4, we discuss important aspects of the proposed language ABGOLOG.
2
Robot Programming with GOLOG
GOLOG [1] is an attempt to combine two different styles of knowledge representation – declarative and procedural – in the same programming language, allowing the programmer to cover the whole spectrum of possibilities from a purely reactive agent to a purely deliberative agent. In contrast to programs written in standard programming languages, when executed, GOLOG programs are
decomposed into primitives which correspond to the agent’s actions. Furthermore, since these primitives are described through situation calculus axioms, it
is possible to reason logically about their effects.
2.1
The Situation Calculus Formalism
The situation calculus [6] is a logical formalism whose ontology includes situations, which are like “snapshots” of the world; fluents, which describe properties of the world that can change their truth value from one situation to another; and actions, which are responsible for the actual change of a situation into another. In the situation calculus, which is a dialect of first order predicate logic, the constant $s_0$ denotes the initial situation; the function $do(a, s)$ denotes the resulting situation after the execution of the action $a$ in the situation $s$; the predicate $poss(a, s)$ means that it is possible to execute the action $a$ in the situation $s$; and, finally, the predicate $holds(f, s)$ means that the fluent $f$ holds in the situation $s$.
Given a specification of a planning domain in the situation calculus formalism, a solution to a planning problem in this domain can be found through theorem proving. Let $\Sigma$ be a set of axioms describing the agent's actions, $\Sigma_0$ a set of axioms describing the initial situation, and $\gamma(s)$ a logical sentence describing a planning goal. Then a constructive proof of $\Sigma \cup \Sigma_0 \models \exists s\, \gamma(s)$ causes the variable $s$ to be instantiated to a term of the form $do(a_n, do(a_{n-1}, \dots, do(a_1, s_0)\dots))$. Clearly, the sequence of actions $a_1, \dots, a_n$ corresponding to this term is a plan that, when executed by the agent from the initial situation $s_0$, leads to a situation that satisfies the planning goal.
2.2
The GOLOG Interpreter
GOLOG programs are executed by a specialized theorem prover (figure 1). The user has to provide an axiomatization describing the agent's actions (declarative knowledge), as well as a control program specifying the desired behavior of the agent (procedural knowledge). After that, executing the program corresponds to proving that there exists a situation resulting from the execution of the program in the initial situation. Thus, if the situation found by the theorem prover is a term of the form $do(a_n, \dots, do(a_1, s_0)\dots)$, the corresponding sequence of actions $a_1, \dots, a_n$ is executed by the agent.
Fig. 1. A very simplified implementation of GOLOG in PROLOG
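The code of figure 1 was lost from the source. The PROLOG sketch below follows the widely published style of simplified GOLOG interpreters and is only an approximation of what the figure showed; the program constructors (seq, test, choice, if, while) and the helper predicates primitive_action/1, poss/2 and holds/2, which must be supplied by the domain axiomatization, are our assumptions.

:- dynamic primitive_action/1, poss/2, holds/2.   % supplied by the domain axiomatization

% do(Program, S, S1): executing Program in situation S can lead to situation S1.
do(seq(E1, E2), S, S1)   :- do(E1, S, S2), do(E2, S2, S1).              % sequence
do(test(P), S, S)        :- holds(P, S).                                % test
do(choice(E1, _), S, S1) :- do(E1, S, S1).                              % nondeterministic choice
do(choice(_, E2), S, S1) :- do(E2, S, S1).
do(if(P, E1, E2), S, S1) :- ( holds(P, S) -> do(E1, S, S1) ; do(E2, S, S1) ).
do(while(P, E), S, S1)   :- ( holds(P, S) -> do(E, S, S2), do(while(P, E), S2, S1) ; S1 = S ).
do(A, S, do(A, S))       :- primitive_action(A), poss(A, S).            % primitive action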
For instance, consider the following situation calculus axiomatization for the elevator domain [1], where the agent’s actions are open, close, turnoff, up and down:
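The displayed axioms were lost from the source; the sketch below only illustrates the style of such an axiomatization, using the fluent and action names that appear in the event calculus version of this example later in the paper (the initial situation and the exact clauses are assumptions):

primitive_action(open).        primitive_action(close).
primitive_action(turnoff(_)).  primitive_action(up(_)).   primitive_action(down(_)).

% Precondition axioms: poss(Action, Situation).
poss(up(N), S)      :- holds(cur_floor(M), S), M < N.
poss(down(N), S)    :- holds(cur_floor(M), S), M > N.
poss(turnoff(N), S) :- holds(on(N), S).
poss(open, _).
poss(close, _).

% Assumed initial situation and effect/frame clauses: holds(Fluent, Situation).
holds(cur_floor(4), s0).
holds(on(3), s0).
holds(on(5), s0).
holds(cur_floor(N), do(up(N), _)).
holds(cur_floor(N), do(down(N), _)).
holds(on(N), do(A, S))        :- holds(on(N), S), A \= turnoff(N).
holds(cur_floor(M), do(A, S)) :- holds(cur_floor(M), S), A \= up(_), A \= down(_).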
In this domain, the agent’s goal is to attend all calls (there is no distinction between calls made from inside or outside the elevator), represented by the fluent on, and its behavior can be specified by the following GOLOG program:
Once the domain axiomatization and the control program are provided, we
can execute the GOLOG interpreter as follows:
3
Abductive Reasoning in the Event Calculus
Abduction is an inference principle that extends deduction, providing hypothetical reasoning. As originally introduced by [7], it is an unsound inference rule
that resembles a reverse modus ponens: if we observe a fact $\beta$ and we know $\alpha \rightarrow \beta$, then we can accept $\alpha$ as a possible explanation for $\beta$. Thus, abduction is a weak kind of inference in the sense that it only guarantees that the explanation is plausible, not that it is true.
Formally, given a set of sentences $\Delta$ describing a domain (the background theory) and a sentence $\gamma$ describing an observation, the abduction process consists of finding a set of sentences $\Delta'$ (the residue, or explanation) such that $\Delta \cup \Delta'$ is consistent and $\Delta \cup \Delta' \models \gamma$. Clearly, depending on the background theory, for the same observed fact we can have multiple possible explanations. In general, the definition of the best explanation depends on the context, but it is almost always related to some notion of minimality. In practice, we should prefer explanations which postulate the minimum number of causes [8]. Furthermore, by definition, abduction is a kind of nonmonotonic reasoning, i.e. an explanation that is consistent w.r.t. a given knowledge state can become inconsistent when new information is considered [9].
Next, we present the event calculus as the logical formalism used to describe
the background theory in ABGOLOG programs and we show how abduction can
be used to synthesize partial order plans in this new language.
3.1
The Event Calculus Formalism
The event calculus [10] is a temporal formalism designed to model and reason about scenarios described as a set of events whose occurrences in time have the effect of starting or terminating the validity of fluents, which denote properties of the world [11]. Note that the event calculus emphasizes the dynamics of the world rather than the statics of situations, as the situation calculus does. The basic idea is to establish that a fluent holds at a time point if it holds initially, or if it is initiated at some previous time point by the occurrence of an action and is not terminated by the occurrence of another action in between. A simplified axiomatization of this formalism is the following:
In the event calculus, the frame problem is overcome through circumscription. Given a domain description $\Sigma$ expressed as a conjunction of formulae that does not include the predicates initially or happens; a narrative of actions $\Delta$ expressed as a conjunction of formulae that does not include the predicates initiates or terminates; a conjunction $\Omega$ of uniqueness-of-names axioms for actions and fluents; and a conjunction EC of the axioms of the event calculus, we have to consider the following formula as the background theory for abductive event calculus planning:
$$\mathrm{CIRC}[\Sigma;\ initiates, terminates] \wedge \mathrm{CIRC}[\Delta;\ happens] \wedge \Omega \wedge EC$$
where $\mathrm{CIRC}[\Phi;\ \rho_1,\dots,\rho_n]$ means the circumscription of $\Phi$ w.r.t. the predicate symbols $\rho_1,\dots,\rho_n$. By circumscribing initiates and terminates we are imposing that the known effects of actions are the only effects of actions, and by circumscribing happens we assume that there are no unexpected event occurrences. An extended discussion about the frame problem and its solution through circumscription can be found in [11].
Besides the domain independent axioms [SEC1]–[SEC3], we also need axioms to describe the fluents that are initially true, specified by the predicate initially, as well as the positive and negative effects of the domain actions, specified by the predicates initiates and terminates, respectively. Recalling the elevator domain example, we can write:
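The example clauses themselves were lost from the source. A hedged sketch of what they could look like, with an assumed initial configuration, is the following (holdsAt/2 is provided by the axioms [SEC1]–[SEC3]; it is declared dynamic here only so that the fragment loads on its own):

:- dynamic holdsAt/2.

initially(cur_floor(4)).                 % assumed initial floor
initially(on(3)).                        % assumed pending calls
initially(on(5)).

initiates(up(N),   cur_floor(N), _T).
initiates(down(N), cur_floor(N), _T).
terminates(up(N),   cur_floor(M), T) :- holdsAt(cur_floor(M), T), M \= N.
terminates(down(N), cur_floor(M), T) :- holdsAt(cur_floor(M), T), M \= N.
terminates(turnoff(N), on(N), _T).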
In the event calculus, a partial order plan is represented by a set of happens facts, establishing the occurrence of actions in time, and by a set of temporal constraints (before facts), establishing a partial order over these actions. Given a set of happens and before facts representing a partial order plan, the axioms [SEC1]–[SEC3] and the domain description, we can determine the truth of the domain fluents at any time point. For instance, given a plan in which the action up(5) occurs at some instant, we can conclude that holdsAt(cur_floor(5), t) is true at any later instant t, which is the effect of the action up(5); and that holdsAt(on(3), t) remains true as long as no occurrence of turnoff(3) is included, a property that persists in time from the instant 0. In fact, the axioms [SEC1]–[SEC3] capture the temporal persistence of fluents and, therefore, the event calculus does not require persistence axioms.
3.2
The ABP Planning System
As [12] has shown, planning in the event calculus is naturally handled as an abductive process. In this setting, planning a sequence of actions that satisfies a given goal $\Gamma$ w.r.t. a domain description $\Sigma$ is equivalent to finding an abductive explanation $\Delta$ (narrative or plan) such that:
$$\mathrm{CIRC}[\Sigma;\ initiates, terminates] \wedge \mathrm{CIRC}[\Delta;\ happens] \wedge \Omega \wedge EC \models \Gamma$$
Based on this idea, we have implemented the ABP [4] planning system. This
planner is a PROLOG abductive interpreter specialized to the event calculus
formalism. An advantage of this specialized interpreter is that predicates with incomplete information can receive special treatment at the meta-level. For instance, in partial order planning, we have incomplete information about the predicate before, used for sequencing actions. Thus, when the interpreter finds a
negative literal ¬before(X, Y), it tries to prove it by showing that before(Y, X)
is consistent w.r.t. the plan being constructed.
3.3
Systematicity and Redundancy in the ABP Planning System
An interesting feature of the abductive planning system ABP is that we can modify its planning strategy according to the characteristics of the application domain [5]. By making a few modifications to the ABP specification, we have implemented two different planning strategies: SABP, a systematic partial order planner, and RABP, a redundant partial order planner.
A systematic planner is one that never visits the same plan more than once in
its search space (e.g. SNLP [13]). A systematic version of the ABP, called SABP,
can be obtained by modifying the axiom [SEC3] to consider as a threat to a fluent
F not only an action that terminates it, but also an action that initiates it:
A redundant planner is one that does not keep track of which goals were already achieved and which remain to be achieved and, therefore, may establish and clobber a subgoal arbitrarily many times (e.g. TWEAK [14]). A redundant version of the ABP, called RABP, does not require any modification to the EC axioms. The only change that we have to make is in the goal selection strategy. In the ABP, as well as in the SABP, subgoals are selected and then eliminated from the list of subgoals as soon as they are satisfied. This can be safely done because those planners apply a causal link protection strategy.
An MTC (modal truth criterion) strategy for goal selection can be easily implemented in the RABP by performing a temporal projection. This is done by making the meta-interpreter “execute” the current plan, without allowing any modification to it. This process returns as output the first subgoal that is not necessarily true.
Another modification concerns the negative residue: the RABP does not need to check the consistency of negative residues every time the plan is modified. So, in the RABP, the negative literals of the predicate clipped do not receive special treatment from the meta-interpreter. As in TWEAK [14], this may lead the RABP to select the same subgoal more than once but, on the other hand, it can plan for more than one subgoal with a single goal establishment.
3.4
Selecting the Best Planner: ABP, SABP or RABP
In our previous publication [4], we demonstrated that, by varying the ratio between positive and negative threats in a planning domain, the abductive
planners exhibit different behavior: for one range of this ratio the systematic version is dramatically better than the redundant version while, for the complementary range, the systematic version is dramatically worse than the redundant version. This result provides a foundation for predicting the conditions under which different planning systems will perform better, depending on the characteristics of the domain. In other words, this result allows someone building a planning system or a robotic control program to select the appropriate goal protection strategy, depending on the characteristics of the problem being solved.
By running the same experiment presented in [15], using our three implementations of the abductive planner, we can observe that these planners show behavior very similar to that of the well known STRIPS-based planners POP, SNLP and TWEAK (see figure 2). Therefore, we can conclude that the logical and the algorithmic approaches implement the same planning strategies and present the same performance.
Fig. 2. The performance of the planners, depending on the domain characteristics
4
The New Programming Language Proposed
In principle, GOLOG is a programming language that could be used to implement both deliberative and reactive agents. However, as we have verified, GOLOG computes (in off-line mode) a complete total order plan that is then submitted to an agent for execution (in on-line mode). The problem with this approach is that it does not work well in many real dynamic applications. In the example of the elevator domain, this can be noticed by the fact that the agent cannot modify its plan in
order to attend new serving requests. This is a case where we need to interleave
the theorem prover with the executive module that controls the actions of the
elevator, which is not possible with GOLOG.
Although our first implementation of ABGOLOG (omitted here due to space limitations) suffered from the same problem, we foresee many ways in which we can change both the ABGOLOG interpreter and the abductive planner in order to allow the agent to re-plan when relevant changes in the environment occur. These modifications are currently under construction but, to illustrate the idea, consider the following situation: the elevator is parked at some floor and there are pending calls from two other floors. The agent generates a plan to serve these floors and initiates its execution, going first to one of them. However, on the way, a new call from yet another floor is made by a user. Because of the occurrence of this new event, the agent should react and fix its execution plan in order to also take care of this new call. As we know, a partial order planner can modify its plan with relative ease, since it keeps track of the causal link information about the plan’s steps (the clipped predicate in the abductive planner).
5
Conclusions
Traditionally, the notion of agents in Artificial Intelligence has been closely related to the capability of reasoning about actions and their effects in a dynamic environment [6], [16]. In the last decade, however, the notion of a purely rational agent, which almost completely ignores its interaction with the environment, has given way to the notion of an agent that must be capable of reacting to the perceptions received from its environment [17].
In this work, we propose a new logical programming language, especially geared towards the programming of robotic agents, which aims to reconcile deliberation and reactivity. This language, based on GOLOG [1], uses the event calculus as the formalism to describe actions and to reason about their effects, and uses abduction as the mechanism for the synthesis of plans. Therefore, the main advantage of the ABGOLOG language is that it is based on a more flexible and expressive action formalism, compared to the situation calculus.
Our future work on ABGOLOG’s implementation includes exploring aspects
of compound actions (HTN planning), domain constraints involving the use of
metric resources, conditional effects and durative actions.
References
1. Levesque, H.J., Reiter, R., Lesperance, Y., Lin, F., Scherl, R.B.: GOLOG: A logic
programming language for dynamic domains. Journal of Logic Programming 31
(1997) 59–83
2. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach (second edition).
Prentice-Hall, Englewood Cliffs, NJ (2003)
3. Weld, D.S.: An introduction to least commitment planning. AI Magazine 15 (1994)
27–61
4. Pereira, S.L.: Abductive Planning in the Event Calculus. Master Thesis, Institute
of Mathematics and Statistics - University of Sao Paulo (2002)
5. Pereira, S.L., Barros, L.N.: Efficiency in abductive planning. In: Proceedings of
2nd Congress of Logic Applied to Technology. Senac, São Paulo (2001) 213–222
6. McCarthy, J.: Situations, actions and causal laws. Technical Report Memo 2 Stanford University Artificial Intelligence Laboratory (1963)
7. Peirce, C.S.: Collected Papers of Charles Sanders Peirce. Harvard University Press
(1931-1958)
8. Cox, P.T., Pietrzykowski, T.: Causes for events: their computation and applications. In: Proc. of the 8th international conference on Automated deduction,
Springer-Verlag New York, Inc. (1986) 608–621
9. Kakas, A.C., Kowalski, R.A., Toni, F.: Abductive logic programming. Journal of
Logic and Computation 2 (1992) 719–770
10. Kowalski, R.A., Sergot, M.J.: A logic-based calculus of events. In: New Generation
Computing 4. (1986) 67–95
11. Shanahan, M.P.: Solving the Frame Problem: A Mathematical Investigation of the
Common Sense Law of Inertia. MIT Press (1997)
12. Eshghi, K.: Abductive planning with event calculus. In: Proc. of the 5th International Conference on Logic Programming. MIT Press (1988) 562–579
13. MacAllester, D., Rosenblitt, D.: Systematic nonlinear planning. In: Proc. 9th
National Conference on Artificial Intelligence. MIT Press (1991) 634–639
14. Chapman, D.: Planning for conjunctive goals. Artificial Intelligence 32 (1987)
333–377
15. Knoblock, C., Yang, Q.: Evaluating the tradeoffs in partial-order planning algorithms (1994)
16. Green, C.: Application of theorem proving to problem solving. In: International
Joint Conference on Artificial Intelligence. Morgan Kaufmann (1969) 219–239
17. Brooks, R.A.: A robust layered control system for a mobile robot. IEEE Journal
of Robotics and Automation 2 (1986) 14–23
Word Equation Systems:
The Heuristic Approach
César Luis Alonso¹*, Fátima Drubi², Judith Gómez-García³, and José Luis Montaña³
¹ Centro de Inteligencia Artificial, Universidad de Oviedo, Campus de Viesques, 33271 Gijón, Spain
[email protected]
² Departamento de Informática, Universidad de Oviedo, Campus de Viesques, 33271 Gijón, Spain
³ Departamento de Matemáticas, Estadística y Computación, Universidad de Cantabria
[email protected]
Abstract. One of the most intricate algorithms related to words is Makanin’s algorithm for solving word equations. Even though Makanin’s algorithm is very complicated, the solvability problem for word equations remains NP-hard if one looks for short solutions, i.e. with length bounded by a linear function w.r.t. the size of the system ([2]), or even with constant bounded length ([1]). Word equations can be used to define various properties of strings, e.g. characterization of imprimitiveness, hardware specification and verification, and string unification in PROLOG-3 or unification in theories with associative non-commutative operators. This paper proposes a heuristic approach to the problem of solving word equation systems, provided that some upper bound for the length of the solutions is given. Up to this moment several heuristic strategies have been proposed for other NP-complete problems, like 3-SAT, with remarkable success. Following this direction, we compare here two genetic local search algorithms for solving word equation systems. The first one is an adapted version of the well known WSAT heuristics for 3-SAT instances (see [9]). The second one is an improved version of our genetic local search algorithm in ([1]). We present some empirical results which indicate that our approach to this problem is a promising strategy. Our experimental results also suggest that our local optimization technique outperforms the WSAT class of local search procedures for the word equation system problem.
Keywords: Evolutionary computation, genetic algorithms, local search
strategies, word equations.
1
Introduction
Checking if two strings are identical is a rather trivial problem: it corresponds to testing the equality of strings. Finding patterns in strings is slightly more complicated.
* Partially supported by the Spanish MCyT and FEDER grant TIC2003-04153.
It corresponds to solving word equations with a constant side, as in equation (1), where the unknowns are variable strings in {0, 1}*. Equations of this type are not difficult to solve; indeed, many cases of this problem have very efficient algorithms in the field of pattern matching.
In general, trying to find solutions to equations where both sides contain variable strings, like equation (2), where the unknowns are variables in {0, 1}*, or showing that no solution exists, is a surprisingly difficult problem.
The satisfiability problem for word equations has a simple formulation: find out whether or not an input word equation (like that in example (2)) has a solution. The decidability of the problem was proved by Makanin [6]. His decision
procedure is one of the most complicated algorithms in theoretical computer
science. The time complexity of this algorithm is
nondeterministic time,
where
is a single exponential function of the size of the equation ([5]).
In recent years several better complexity upper bounds have been obtained:
EXPSPACE ([4]), NEXPTIME ([8]) and PSPACE ([7]). A lower bound for the
problem is NP ([2]). The best algorithms for NP-hard problems run in single exponential deterministic time. Each algorithm in PSPACE can be implemented in
single exponential deterministic time, so exponential time is optimal in the context of deterministic algorithms solving word equations unless faster algorithms
are developed for NP-hard problems.
In the present paper we compare the performance of two new evolutionary
algorithms which incorporate some kind of local optimization for the problem of
solving systems of word equations provided that an upper bound for the length
of the solutions is given. The first strategy proposed here is inspired by the well known local search algorithms GSAT and WSAT for finding a satisfying assignment for a set of clauses (see [9]). The second one is an improved version, including
random walking in hypercubes of the kind
of the flipping genetic local
search algorithm announced in ([1]). As far as we know there are no references
in the literature for solving this problem in the framework of heuristic strategies involving local search. The paper is organized as follows: in section 2 we explicitly state the WES problem with bounds; section 3 describes the evolutionary algorithms with the local search procedures; in section 4 we present the experimental results, solving some randomly generated word equation systems with forced solvability; finally, section 5 contains some concluding remarks.
2
The Word Equation Systems Problem
Let A be an alphabet of constants and let $\Omega$ be an alphabet of variables. We assume that these alphabets are disjoint. As usual, we denote by $A^*$ the set of words over A; given a word $w \in A^*$, $|w|$ stands for the length of $w$, and $\varepsilon$ denotes the empty word.
Definition 1. A word equation over the alphabet A and variable set $\Omega$ is a pair $(L, R) \in (A \cup \Omega)^* \times (A \cup \Omega)^*$, usually denoted by L = R. A word equation system (WES) over the alphabet A and variable set $\Omega$ is a finite set of word equations $S = \{L_i = R_i : 1 \le i \le m\}$, where $L_i, R_i \in (A \cup \Omega)^*$ for each pair.
Definition 2. Given a WES S over the alphabet A and variable set $\Omega$, a solution of S is a morphism $\sigma : (A \cup \Omega)^* \to A^*$ such that $\sigma(a) = a$ for $a \in A$ and $\sigma(L_i) = \sigma(R_i)$ for $1 \le i \le m$.
The WES problem, in its general form, is stated as follows: given a word equation system as input, find a solution if one exists or determine that no solution exists otherwise. The problem we are going to study in this contribution is not as general as stated above, but it is still an NP-complete problem (see Theorem 5 below). In our formulation of the problem, an upper bound for the length of the variable values in a solution is also given. We name this variation the bounded version of the WES problem.
Problem: given a WES over the alphabet A with variable set $\Omega$ and an upper bound, find a solution $\sigma$ such that the length of $\sigma(x)$ does not exceed the bound for each variable $x \in \Omega$, or determine that no such solution exists.
Example 3. (see [1]) For each
let
and
be the
Fibonacci
number and the
Fibonacci word over the alphabet A = {0, 1}, respectively.
For any
let
be the word equation system over the alphabet A = {0, 1}
and variables set
defined as:
Then, for any
for
the morphism
defined by
is the only solution of the system
This solution satisfies
for each
Recall that
and
if
Remark 4. Example 3 is quite meaningful in itself. It shows that any exact deterministic algorithm which solves the WES problem in its general form (or any heuristic algorithm solving all of its instances) must have, at least, exponential worst-case complexity. This is due to the fact that the system of Example 3 has polynomial size while its only solution has exponential length, because it contains, as a part, a Fibonacci word whose size equals the corresponding Fibonacci number, which is exponential in the index of the system.
A problem which does not allow one to exhibit the exponential length argument for lower complexity bounds is the bounded problem stated above. But this problem remains NP-complete.
Theorem 5 (cf. [1]). For any fixed upper bound, the bounded WES problem is NP-complete.
3
The Evolutionary Algorithm
Given an alphabet A and a string $w$ over A, for any pair of positions $i \le j$ in the string, $w[i..j]$ denotes the substring of $w$ given by extracting the consecutive letters from position $i$ through position $j$ of the string $w$. In the case $i = j$ we denote by $w[i]$ the single-letter substring, which represents the $i$th symbol of the string $w$.
3.1
Individual Representation
Given an instance of the bounded problem, that is, a word equation system S with a certain number of equations and variables over the alphabet A = {0, 1}, if a morphism is a candidate solution for S, then the size of the value of any variable must be less than or equal to the given upper bound. This motivates the representation of a chromosome as a list of strings, one per variable, where each component is a word over the alphabet A = {0, 1} whose length does not exceed the upper bound, such that the value of each variable is represented in the chromosome by the corresponding string.
3.2
Fitness Function
First, we introduce a notion of distance between strings which extends the Hamming distance to the case of strings of unequal size. This is necessary because the chromosomes (representing candidate solutions for our problem instances) are variable size strings. Given two strings, the generalized Hamming distance between them is defined as follows:
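The defining formula was lost from the source. A natural reading, and the one assumed in the PROLOG sketch below (where strings are encoded as lists of bits), charges one unit for every position at which both strings are defined but differ, plus one unit for every position covered by only one of the two strings:

ghd([], [], 0).
ghd([], [_ | T], D) :- ghd([], T, D0), D is D0 + 1.
ghd([_ | T], [], D) :- ghd(T, [], D0), D is D0 + 1.
ghd([X | Xs], [Y | Ys], D) :-
    ghd(Xs, Ys, D0),
    ( X == Y -> D = D0 ; D is D0 + 1 ).

For instance, under this definition the query ghd([1,0,1], [1,1], D) yields D = 2.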
Given a word equation system S over the alphabet A = {0, 1} with its set of variables, and a chromosome representing a candidate solution for S, the fitness of the chromosome is computed as follows. First, in each equation we substitute every variable by the corresponding string of the chromosome; after this replacement, each equation yields a pair of words over A = {0, 1}.
Then, the fitness of the chromosome is defined as:
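The displayed formula was lost from the source. Consistent with Proposition 6 below, a natural reading is that the fitness is the sum, over all equations of the system, of the generalized Hamming distances between the two sides after the substitution induced by the chromosome (the symbols are ours, not the paper's):

$$E(c) = \sum_{i=1}^{m} h\bigl(\sigma_c(L_i),\, \sigma_c(R_i)\bigr)$$

where $m$ is the number of equations, $\sigma_c$ denotes the substitution induced by the chromosome $c$, and $h$ the generalized Hamming distance.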
Proposition 6. Let S be a word equation system over the alphabet A = {0, 1} with its set of variables, and let a chromosome represent a candidate solution for S. Define the associated morphism by mapping each variable to the corresponding string of the chromosome. Then this morphism is a solution of the system S if and only if the fitness of the chromosome is equal to zero.
Remark 7. According to Proposition 6, the goal of our evolutionary algorithm is to minimize the fitness function. By means of this fitness function, we obtain a measure of the quality of an individual which distinguishes between individuals that satisfy the same number of equations. This last objective cannot be reached by other fitness functions such as, for instance, the number of satisfied equations in the given system.
3.3
Genetic Operators
selection: We make use of the roulette wheel selection procedure (see [3]).
crossover: Given two chromosomes, the result of a crossover is a chromosome constructed by applying a local crossover to every pair of corresponding strings. The local crossover of two strings is given as follows. Assume one string is not longer than the other; then the first part of the child is the result of applying uniform crossover ([3]) to the two strings restricted to the length of the shorter one. Next, we randomly select a position in the longer string and define the second part of the child as the segment of the longer string between the end of that common part and the selected position. We clarify this local crossover by means of the following example:
Example 8. Suppose the two variable strings are such that the shorter one has two symbols. In this case, we apply uniform crossover to the first two symbols; let us suppose that 11 is the resulting substring. This substring is the first part of the resulting child. Then, if the selected position were, for instance, position 4, the second part of the child would be 00, and the complete child would be 1100.
mutation: We apply mutation with a given probability; the concrete value used in our algorithms is given in Section 4 below. Given a chromosome, the mutation operator consists in replacing each gene of each word with a probability that depends on the given upper bound.
3.4
Local Search Procedures
Given a word equation system over the alphabet A = {0, 1} with its set of variables, and a chromosome representing a candidate solution for S, for any radius we define the neighborhood of the chromosome with respect to the generalized Hamming distance as follows:
Local Search 1 (LS1). First, we present our adapted version of the local search procedure WSAT, sketched below. The local search procedure takes as input a chromosome and, at each step, yields a chromosome which satisfies the following properties: with a probability $p$, it is a random chromosome in the neighborhood of the current one and, with probability $1 - p$, it is a chromosome in that neighborhood with minimal fitness. In the latter case the chromosome cannot be improved by adding or flipping any single bit (because the components of neighboring chromosomes are at Hamming distance at most one). This process iterates until a specified maximum number of flips is reached. We call the parameter $p$ the probability of noise.
Below, we display the pseudo-code of this local search procedure, taking as input a chromosome with one string variable, of size bounded by the given upper bound, for each variable of the system.
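The pseudo-code itself was lost from the source. The PROLOG sketch below only illustrates the control structure just described; neighbourhood/2 and fitness/2 depend on the problem instance and are assumed to be defined elsewhere, and SWI-Prolog's library(random) is assumed for maybe/1 and random_member/2.

:- use_module(library(random)).            % maybe/1, random_member/2 (SWI-Prolog)
:- dynamic neighbourhood/2, fitness/2.     % assumed to be supplied elsewhere

ls1(Chrom, _Noise, 0, Chrom).
ls1(Chrom, Noise, MaxFlips, Result) :-
    MaxFlips > 0,
    neighbourhood(Chrom, Neighbours),
    (   maybe(Noise)                       % with probability Noise ...
    ->  random_member(Next, Neighbours)    % ... take a random neighbour
    ;   best_of(Neighbours, Next)          % ... otherwise a minimal-fitness one
    ),
    M is MaxFlips - 1,
    ls1(Next, Noise, M, Result).

best_of([C], C).
best_of([C | Cs], Best) :-
    best_of(Cs, B),
    fitness(C, F), fitness(B, FB),
    ( F =< FB -> Best = C ; Best = B ).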
Local Search 2 (LS2). Suppose we are given a chromosome. At each iteration step, the local search generates a random walk inside the
truncated hypercube determined by the chromosome and, at each newly generated chromosome, makes a flip (or modifies its length by one unit if possible) whenever there is a gain in the fitness. This process iterates until there is no gain. The relevant size parameter here is the number of genes of the chromosome.
For each chromosome and each gene position (a pair identifying a component of the chromosome and a position inside that component), we define a set of neighboring chromosomes through two properties: a neighbor agrees with the original chromosome on every other gene, and it differs from it only at the selected position. Note that any element of this set can be obtained in one of the following ways: by flipping the selected gene; by adding a new gene at the end of the corresponding component; or by deleting the selected gene of the component, possibly combined with flipping a gene. In the pseudo-code displayed below we associate a gene with such a pair and a chromosome cr with an element of the search space; the corresponding notation then denotes a subset of neighbors of this type.
Summarizing, the pseudo-code of our evolutionary algorithms is the following:
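The summarizing pseudo-code was lost from the source as well. The PROLOG sketch below captures only the overall loop described in Remark 9 below; all problem-specific operators (random_chromosome/1, fitness/2, roulette_select/3, crossover/3, mutate/2, local_search/2 and the steady-state replacement replace_worst/3) are assumptions about predicates defined elsewhere.

:- dynamic random_chromosome/1, fitness/2, roulette_select/3,
           crossover/3, mutate/2, local_search/2, replace_worst/3.

evolve(PopSize, MaxGenerations, Solution) :-
    length(Population, PopSize),
    maplist(random_chromosome, Population),      % random initial population
    loop(Population, MaxGenerations, Solution).

loop(Population, _, Solution) :-                 % stop when a solution is found
    member(Solution, Population),
    fitness(Solution, 0), !.
loop(Population, Gens, Solution) :-              % otherwise breed a new individual;
    Gens > 0,                                    % fails if the limit is reached
    roulette_select(Population, Parent1, Parent2),
    crossover(Parent1, Parent2, Child0),
    mutate(Child0, Child1),
    local_search(Child1, Child),                 % LS1 or LS2 (see Remark 9)
    replace_worst(Population, Child, NewPopulation),
    Gens1 is Gens - 1,
    loop(NewPopulation, Gens1, Solution).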
Remark 9. The initial population is randomly generated. The procedure evaluate(population) computes the fitness of all individuals in the population. The procedure local_search(Child) can be either LS1 or LS2. Finally, the termination condition is true when a solution is found (the fitness of some individual equals zero) or when the number of generations reaches a given value.
4
Experimental Results
We have performed our experiments over problem instances with a given number of equations and variables and a solution of known maximum variable length (the instances are named according to these three parameters), and we ran our program for various upper bounds on the variable length. Let us note that the number of variables and the upper bound for the length of a variable determine the size of the search space.
Since we have not found in the literature any benchmark instance for this problem, we have implemented a program that randomly generates word equation systems with solutions, and we have applied our algorithm to these systems¹. All runs were performed on an AMD Athlon XP 1900+ processor at 1.6 GHz with 512 MB of RAM. For a single run the execution time ranges from two seconds, for the simplest problems, to five minutes, for the most complex ones. The complexity of a problem is measured through the average number of evaluations to solution.
4.1
Probability of Mutation and Size of the Initial Population
After some preliminary experiments, we concluded that the best parameters for the LS2 program are a population size of 2 and a mutation probability of 0.9. This preliminary experimentation was reported in ([1]). For the LS1 program we concluded that the best parameters are Maxflips equal to 40 and a probability of noise equal to 0.2. We remark that these parameters correspond to the best results obtained in the problems reported in Table 1.
4.2 LS1 vs. LS2
We assess the efficiency of the local search by executing some experiments with both local search procedures. In all the executions, the algorithm stops if a solution is found
1 Available online at http://www.aic.uniovi.es/Tc/spanish/repository.htm
or the limit of 1,500,000 evaluations is reached. The results of our experiments are displayed in Table 1, based on 50 independent runs for each instance. As usual, the performance of the algorithm is measured first of all by the Success Rate (SR), which represents the proportion of runs where a solution has been found. Moreover, as a measure of the time complexity, we use the Average number of Evaluations to Solution (AES) index, which counts the average number of fitness evaluations performed until a solution is found in successful runs. Comparing the two local search procedures, we observe that the improved version of our local search algorithm (LS2) is significantly better than the version of the WSAT strategy adapted to our problem (LS1). This can be confirmed by looking at the respective Average number of Evaluations to Solution reported in our table of experiments. The comparison between the evolutionary local-search strategy and the pure genetic approach was already reported in [1], using a preliminary version of LS2 that does not use random walks. We observed there a very poor behavior of the pure genetic algorithm.
5 Conclusions, Summary and Future Research
The results of the experiments reported in Table 1 indicate that the use of evolutionary algorithms is a promising strategy for solving the word equation system problem, and that our algorithms also behave well when dealing with large search spaces.
Despite these promising results, there are some hard problems, such as p5-15-3, on which our algorithms have difficulties finding a solution, and others, such as p25-8-3, for which the program always finds the same solution. In both cases, the solution found agrees with the one proposed by the random problem generator. In this sense, we have no conclusion about the influence of the number of equations, or of the ratio between the size of the system and the number of variables, on the difficulty of the problem. For the two compared local search algorithms, we conclude that LS2 seems to outperform LS1, that is, the WSAT-style extension of local search procedures to the word equation system problem. Nevertheless, it would be convenient to run new experiments on problem instances with larger search spaces and to adjust, for each instance, the Maxflips and noise probability parameters of procedure LS1. The most important limitation of our approach is the use of upper bounds on the size of the variables when looking for solutions. In work in progress, we are developing an evolutionary algorithm for the general problem of solving systems of word equations (WES) that exploits the logarithmic compression of the size of a minimal solution of a word equation via Lempel–Ziv encodings of words. We think that this will allow us to explore much larger search spaces and to avoid the use of upper bounds on the size of the solutions.
References
1. Alonso, C. L., Drubi, F., Montana, J. L.: An evolutionary algorithm for solving Word Equation Systems. Proc. CAEPIA-TTIA 2003. To appear in Springer LNAI.
2. Angluin, D.: Finding patterns common to a set of strings. J. C. S. S. 21(1) (1980) 46–62
3. Goldberg, D. E.: Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley Longman, Inc. (1989)
4. Gutiérrez, C.: Satisfiability of word equations with constants is in exponential space. In Proc. FOCS'98, IEEE Computer Society Press, Palo Alto, California (1998)
5. Koscielski, A., Pacholski, L.: Complexity of Makanin's algorithm. J. ACM 43(4) (1996) 670–684
6. Makanin, G. S.: The Problem of Solvability of Equations in a Free Semigroup. Math. USSR Sbornik 32(2) (1977) 129–198
7. Plandowski, W.: Satisfiability of Word Equations with Constants is in PSPACE. Proc. FOCS'99 (1999) 495–500
8. Plandowski, W., Rytter, W.: Application of Lempel-Ziv encodings to the Solution of Word Equations. Larsen, K. G. et al. (Eds.) LNCS 1443 (1998) 731–742
9. Selman, B., Levesque, H., Mitchell, D.: A new method for solving hard satisfiability problems. Proc. of the Tenth National Conference on Artificial Intelligence, AAAI Press, California (1992) 440–446
A Cooperative Framework Based on Local
Search and Constraint Programming
for Solving Discrete Global Optimisation
Carlos Castro, Michael Moossen*, and María Cristina Riff**
Departamento de Informática
Universidad Técnica Federico Santa María
Valparaíso, Chile
{Carlos.Castro,Michael.Moossen,Maria-Cristina.Riff}@inf.utfsm.cl
Abstract. Our research has been focused on developing cooperation
techniques for solving large scale combinatorial optimisation problems
using Constraint Programming with Local Search. In this paper, we introduce a framework for designing cooperative strategies. It is inspired by recent research carried out by the Constraint Programming community. For the tests that we present in this work we have selected two well-known techniques: Forward Checking and Iterative Improvement. The set of benchmarks for the Capacity Vehicle Routing Problem shows the advantages of using this framework.
1 Introduction
Solving Constraint Satisfaction Optimisation Problems (CSOPs) consists in assigning values to variables in such a way that a set of constraints is satisfied and a goal function is optimised [15]. Nowadays, complete and incomplete techniques are available to solve this kind of problem. On one hand, Constraint Programming is an example of a complete technique, where a sequence of Constraint Satisfaction Problems (CSPs) [10] is solved by adding constraints that impose better bounds on the objective function until an unsatisfiable problem is reached. On the other hand, Local Search techniques are incomplete methods where an initial solution is repeatedly improved by considering neighboring solutions. The advantages and drawbacks of each of these techniques are well known: complete techniques can obtain, when possible, the global optimum, but they scale poorly on very large problems, for which they may fail to deliver an optimal solution. Incomplete methods give solutions very quickly, but these solutions remain local.
Recently, the integration of both complete and incomplete approaches to solve Constraint Satisfaction Problems has been studied, and it has been recognized that a cooperative approach should give good results when neither alone is able to solve a problem. In [13], Prestwich proposes a hybrid approach that sacrifices
* The first and the second authors have been partially supported by the Chilean
National Science Fund through the project FONDECYT 1010121
** She is supported by the project FONDECYT 1040364
completeness of backtracking methods to achieve the scalability of local search; this method outperforms the best Local Search algorithms. In [8], Jussien and Lhomme present a hybrid technique where Local Search operates over partial assignments instead of complete assignments, and uses constraint propagation and conflict-based heuristics to improve the search. They applied their approach to open-shop scheduling problems, obtaining encouraging results. In the
metaheuristics community various hybrid approaches combining Local Search
and Constraint Programming have been proposed. In [5], Focacci et al. present
a good state of the art of hybrid methods integrating both Local Search and
Constraint Programming techniques.
On the other hand, solver cooperation is a hot research topic that has been widely investigated in recent years [6, 11, 12]. Nowadays, very efficient constraint solvers are available, and the challenge is to integrate them in order to improve their efficiency or to solve problems that cannot be treated by an elementary constraint solver. In recent years, we have been interested in the definition of cooperation languages that allow one to define elementary solvers and to integrate several solvers in a flexible and efficient way [4, 3, 2].
In this work, we concentrate our attention on solving CSOPs instead of CSPs. We introduce a framework for designing cooperative strategies using both kinds of techniques: Constraint Programming and Local Search. The main difference of our framework with respect to existing hybrid methods is that, from a solver-cooperation point of view, we build a new solver based on elementary ones. In this case, elementary solvers implement Local Search and Constraint Programming techniques independently, each of them as a black box. Thus, in this sense, this work should not be considered a hybrid algorithm. In this framework, no cooperation scheme may lose completeness. The motivation for this work is that we strongly believe that local search carried out, for example, by a Hill-Climbing algorithm, should allow us to reduce the search space by adding more constraints on the bounds used by the Constraint Programming approach. Preliminary results on a classical hard combinatorial optimisation problem, the Capacity Vehicle Routing Problem, using simplifications of the Solomon benchmarks [14], show that in our approach Hill-Climbing really helps Forward Checking, which becomes able to find better solutions.
This paper is organised as follows: in section 2, we briefly present both complete and incomplete techniques. In section 3, we introduce the framework for designing cooperative hybrid strategies. In section 4, we present the tests solved using Forward Checking as the complete technique and Iterative Improvement as the incomplete technique, and we evaluate and compare the results. Finally, in section 5, we conclude the paper and give further research lines.
2 Constraint Programming and Local Search for CSOP
Constraint Programming evolved from research carried out during the last thirty
years on constraint solving. Techniques used for solving CSPs are generally classified into Searching, Problem Reduction, and Hybrid Techniques [9]. Searching
consists of techniques for systematic exploration of the space of all solutions.
Problem reduction techniques transform a CSP into an equivalent problem by
reducing the set of values that the variables can take while preserving the set
of solutions. Finally, hybrid techniques integrate problem reduction techniques
into an exhaustive search algorithm in the following way: whenever a variable is
instantiated, a new CSP is created; then a constraint propagation algorithm is
applied to remove local inconsistencies of these new CSPs [16]. Many algorithms
that essentially fit the previous format have been proposed. Forward Checking,
Partial Lookahead, and Full Lookahead, for example, primarily differ in the degree of local consistency verification performed at the nodes of the search tree [7,
9, 16].
Constraint Programming deals with optimisation problems, CSOPs, using
the same basic idea of verifying the satisfiability of a set of constraints that is
used for solving CSPs. Assuming that one is dealing with a minimisation problem, the idea is to use an upper bound that represents the best solution obtained so far. Then we solve a sequence of CSPs, each one giving a better solution with respect to the optimisation function. More precisely, we compute a solution to the original set of constraints C and we add a constraint stating that the objective value must be lower than the evaluation of the optimisation function in that solution. Adding this constraint restricts the set of possible solutions
to those that give better values for the optimisation function always satisfying
the original set of constraints. When, after adding such a constraint, the problem becomes unsatisfiable, the last feasible solution so far obtained represents
the global optimal solution [1]. Very efficient hybrid techniques, such as Forward Checking, Full Lookahead or even more specialised algorithms, are usually
applied for solving the sequence of CSPs. The next figure presents this basic
optimisation scheme.
Fig. 1. Basic Constraint Programming Algorithm for CSOP
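A compact Python sketch of this optimisation scheme could look as follows; the solve_csp routine stands for any complete CSP solver, and modelling constraints as simple predicates is an assumption made only for this illustration:

def cp_optimise(constraints, objective, solve_csp):
    best = None
    bound = float('inf')                       # no bound yet (minimisation)
    while True:
        # add the constraint "objective(solution) < current bound" to the original set C
        bounded = constraints + [lambda sol, b=bound: objective(sol) < b]
        solution = solve_csp(bounded)          # assumed to return None if unsatisfiable
        if solution is None:
            return best                        # last feasible solution is the global optimum
        best, bound = solution, objective(solution)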
Local Search is a general approach, widely used, to solve hard combinatorial
optimisation problems. Roughly speaking, a local search algorithm starts off with
an initial solution and then repeatedly tries to find better solutions by searching
neighborhoods; the algorithm is shown in Figure 2.
Fig. 2. Basic Local Search Algorithm for CSOP
A basic version of Local Search is the Iterative Improvement, or Hill-Climbing, procedure. Iterative Improvement starts with some initial solution that is constructed by some other algorithm, or just generated randomly, and from then on it keeps moving to a better neighbor, as long as there is one, until it finally finishes at a locally optimal solution, one that does not have a better neighbor.
Iterative Improvement can apply either first improvement, in which the current
solution is replaced by the first cost-improving solution found by the neighborhood search, or best improvement in which the current solution is replaced by
the best solution in its neighborhood. Empirically, local search heuristics appear
to converge usually rather quickly, within low-order polynomial time. However,
they are only able to find near-optimal solutions, i.e., in general, a local optimum might not coincide with a global optimum. In this paper, we analyse the
cooperation between Forward Checking, for solving the sequence of CSPs, and
Iterative Improvement using a best improvement strategy to carry out a local
search approach.
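For illustration, a best-improvement variant of Iterative Improvement can be sketched in Python as follows, assuming problem-specific neighbourhood and cost functions:

def iterative_improvement(initial, neighbourhood, cost):
    current = initial
    while True:
        neighbours = neighbourhood(current)
        if not neighbours:
            return current
        best = min(neighbours, key=cost)       # best improvement acceptance
        if cost(best) >= cost(current):        # local optimum reached
            return current
        current = best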
3 A Framework for Designing Cooperative Hybrid Strategies
Our idea for making an incomplete solver cooperate with a complete one is to take advantage of the efficiency of incomplete techniques to find a new bound and to give it to the complete approach, which then continues searching for the global optimum. In what follows, we call the incomplete solver i-solver and the complete one c-solver. In our approach, c-solver could begin by solving a CSP; after this, the solution so obtained gives a bound for the optimal value of the problem. In case c-solver gets stuck trying to find a solution, the collaborative algorithm detects this situation and gives this information to an i-solver method, which is in charge of quickly finding a new feasible solution.
The communication between c-solver and i-solver depends on the direction of the communication. Thus, when the algorithm passes control from c-solver to i-solver, i-solver receives the variable values previously found by c-solver and works trying to find a new, better solution by applying some heuristics. When control passes from i-solver to c-solver, i-solver gives information about the local optima that it found. This information modifies the bound of the objective function constraint, and c-solver works trying to find a solution for this new problem
configuration.
Roughly speaking, we expect that, using i-solver, c-solver will reduce its search tree, cutting some branches using the new bound for the objective function. On the other hand, i-solver focuses its search, when it uses an instantiation previously found by c-solver, on an area of the search space where it is more probable to obtain the optimal value.
In figure 3, we illustrate the general cooperation approach proposed in this work.
Fig. 3. Cooperating Solvers Strategy
The goal is to find the solution, named Global-Solution, of a constrained optimisation problem with an objective function to minimise and constraints represented by C. The cooperating strategy is an iterative process that begins by trying to solve the CSP associated with the optimisation problem using a complete c-solver. This algorithm has an associated stuck-condition criterion, i.e., it is triggered when the solver becomes unable to find a complete instantiation within a reasonable amount of time or number of iterations. The pre-solution-from-c-solver corresponds to the variable values instantiated so far. When c-solver is stopped because it reaches the stuck condition, another algorithm, i-solver, which does an incomplete search, continues, taking as input the pre-solution-from-c-solver. i-solver uses it to find a near-optimal-solution for the optimisation problem until it reaches a stuck condition of its own. A new CSP is then defined, including a new constraint indicating that the objective function value must be lower than the value found either by c-solver with a complete instantiation or by i-solver with the near-optimal solution. This framework is a general cooperation between complete and incomplete techniques.
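The following Python sketch illustrates one possible reading of this cooperation loop; the interfaces of c_solver and i_solver (returning a pre-solution together with a stuck flag, and improving a pre-solution, respectively) are assumptions made only for this illustration:

def cooperate(constraints, objective, c_solver, i_solver, budget, max_rounds=100):
    global_solution, bound = None, float('inf')
    for _ in range(max_rounds):
        pre_solution, stuck = c_solver(constraints, bound, budget)
        if not stuck and pre_solution is None:
            break                                   # CSP unsatisfiable: last bound is optimal
        if stuck:
            if pre_solution is None:
                break                               # nothing to hand over; give up
            pre_solution = i_solver(pre_solution)   # incomplete solver improves the pre-solution
        global_solution = pre_solution
        bound = objective(pre_solution)             # tighter bound for the next CSP
    return global_solution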
4 Evaluation and Comparison
In this section, we first explain the problems that we will use as benchmarks and
then we present results obtained using our cooperative approach.
4.1 Tested Problems
In order to test our schemes of cooperation, we use the classical Capacity Vehicle
Routing Problem (CVRP). In the basic Vehicle Routing Problem (VRP),
identical vehicles initially located at a depot are to deliver discrete quantities
of goods to customers, each one having a demand for goods. A vehicle has to
make only one tour starting at the depot, visiting a subset of customers, and
returning to the depot. In CVRP, each vehicle has a capacity, extending in this
way the VRP. A solution to a CVRP is a set of tours for a subset of vehicles
such that all customers are served only once and the capacity constraints are
respected. Our objective is to minimise the total distance travelled by a fixed number of vehicles to satisfy all customers.
Our problems are based on instances C101, R101 and RC101, proposed by Solomon [14], belonging to classes C1, R1 and RC1, respectively. Each class defines a different topology. Thus, in C1 the locations of customers are clustered. In R1, the locations of customers are generated randomly. In RC1, instances are generated considering clustered groups of randomly generated customer locations. These instances are modified to include capacity constraints. We name the problems so obtained c1, r1 and rc1. These problems are hard to solve for a complete approach. We remark that the goal of our tests is to evaluate and compare the search made by a complete algorithm with its behaviour when another algorithm, which does an incomplete search, is incorporated into the search process.
4.2 Evaluating Forward Checking with Iterative Improvement
For the tests we have selected two well-known techniques: Forward Checking (FC) from Constraint Programming and Hill-Climbing, or Iterative Improvement, from Local Search. Forward Checking is a technique specially designed to solve CSPs; it is based on a backtracking procedure but includes filtering to eliminate values that the variables cannot take in any solution to the set of constraints. Some heuristics have been proposed in the literature to improve the search of FC. For example, in our tests we include the minimum-domain criterion to select variables.
On the other hand, local search works with complete instantiations. We select an iterative improvement procedure that is specific to the CVRP. The characteristics of our iterative improvement algorithm are:
The initial solution is obtained from FC.
The moves are the 2-opt moves proposed by Kernighan.
The acceptance criterion is best improvement.
It works only with feasible neighbours.
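As an illustration of the kind of move involved, the following Python sketch applies best-improvement 2-opt moves inside a single route; the distance function and the handling of feasibility between routes are assumptions left outside the sketch:

def two_opt_route(route, dist):
    """Repeatedly apply the best improving 2-opt move (segment reversal) to one route."""
    improved = True
    while improved:
        improved = False
        best_delta, best_move = 0.0, None
        for i in range(1, len(route) - 2):
            for j in range(i + 1, len(route) - 1):
                # gain of replacing edges (i-1,i) and (j,j+1) by (i-1,j) and (i,j+1)
                delta = (dist(route[i - 1], route[j]) + dist(route[i], route[j + 1])
                         - dist(route[i - 1], route[i]) - dist(route[j], route[j + 1]))
                if delta < best_delta:
                    best_delta, best_move = delta, (i, j)
        if best_move:                                   # best improvement acceptance
            i, j = best_move
            route[i:j + 1] = reversed(route[i:j + 1])
            improved = True
    return route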
The first step in our research was to verify the performance of applying a standard FC algorithm to solve problems c1, r1 and rc1 as defined previously. Table 1 presents the obtained results, where, for each instance, we show all partial solutions found during the execution, the time, in milliseconds, at which each partial solution has been found, and the value of the objective function evaluated in the corresponding partial solution.
Thus, reading the last row of these columns for the first instance, we can see that the best value of the objective function is obtained after 15 instantiations in 106854 milliseconds. In the same way, we can see that for instance rc1 the best value obtained by the application of FC is reached after 5 instantiations in 111180 milliseconds, and for the remaining instance the best value is also obtained after 5 instantiations, but in 40719 milliseconds. For all applications of FC in this work, we consider a limit of 100 minutes to find the optimal solution and carry out optimality proofs. This table only shows the results of applying FC to each instance; we cannot draw comparisons from these results because we are solving three different problems.
Our idea for making these two solvers cooperate is to help Forward Checking when the problem becomes too hard for this algorithm, and to take advantage of Hill-Climbing, which may be able to find a new bound for the search for the optimal solution. In our approach, Forward Checking could begin by solving a CSP; after this, the solution gives a bound for the optimal value of the problem. In case Forward Checking gets stuck trying to find a solution, the collaborative algorithm detects this situation and gives this information to a Hill-Climbing method that is in charge of quickly finding a new feasible solution. The communication between
Forward Checking and Hill-Climbing depends on the direction of communication. Thus, when the algorithm passes control from Forward Checking to Hill-Climbing, Hill-Climbing receives the variable values previously found by Forward Checking, and it works trying to find a new, better solution by applying some heuristics and accepting moves using a strong criterion, that is, selecting the best feasible solution in the neighborhood defined by the move.
When the control is from Hill-Climbing to Forward Checking, Hill-Climbing
gives information about the local optima that it found. This information modifies
the bound of the objective function constraint, and Forward Checking works
trying to find a solution for this new problem configuration.
Roughly speaking, we expect that, using Hill-Climbing, Forward Checking will reduce its search tree, cutting some branches using the new bound for the objective function. On the other hand, Hill-Climbing focuses its search, when it uses an instantiation previously found by Forward Checking, on an area of the search space where it is more probable to obtain the optimal value.
The first scheme of cooperation that we have tried consists of the following steps:
1. We first try to apply FC looking for an initial solution.
2. Once a solution has been obtained, we try to apply HC until it cannot be
applied any more, i.e., a local optimum has been reached.
3. Then, we try again both algorithms in the same order until the problem
becomes unsatisfiable or a time limit is reached.
The results of applying this scheme are presented in table 2.
In order to verify the effect of applying HC immediately after the application of FC, we try the same cooperation scheme but give FC the possibility of being applied several times before trying HC. The idea was to analyse the possibility of improving bounds just by the application of FC. As we know that FC can need too much time to get a new solution, we establish a limit of two seconds; if this limit is reached and FC has not yet returned a solution, we try to apply HC.
The results of this second scheme of cooperation are presented in table 3.
We can make the following remarks concerning these results:
Surprisingly, when solving each instance, both cooperation schemes found the
same best value.
The first scheme of cooperation (table 2) always takes less time than the second one (table 3). In fact, the total time is mainly due to the time spent by FC.
In general, applying both cooperation schemes, the results are better, in terms of the objective function value, than applying FC in isolation.
5 Conclusions
The main contribution of this work is that we have presented a framework for designing cooperative hybrid strategies integrating complete with incomplete methods for solving combinatorial optimisation problems. The tests show that Hill-Climbing can help Forward Checking by adding bounds during the search procedure. This is based on the well-known idea that adding constraints, in general, can improve the performance of Constraint Programming. We are currently working on using other complete and incomplete methods. If the results are encouraging, we plan to try solving other combinatorial optimisation problems to validate this cooperation scheme.
It is important to note that the communication between the methods used for testing in this paper has been carried out by exchanging information about bounds. If we were interested in communicating more information, we would have to address the problem of representation, because complete and incomplete methods generally do not use the same encoding. An incomplete technique is not useful for proving optimality, so, as further work, we are interested in using other techniques to improve optimality proofs. We think that the research already done on overconstrained CSPs could be useful because, once an optimal solution has been found, the only remaining task is to prove that the resulting problem has become unsatisfiable, i.e., an overconstrained CSP.
Nowadays, considering that the research carried out by each community separately has produced good results, we strongly believe that future work will lie in the integration of both approaches.
References
1. Alexander Bockmayr and Thomas Kasper. Branch-and-Infer: A unifying framework for integer and finite domain constraint programming. INFORMS J. Computing, 10(3):287–300, 1998. Also available as Technical Report MPI-I-97-2-008 of
the Max Planck Institut für Informatik, Saarbrücken, Germany.
2. C. Castro and E. Monfroy. A Control Language for Designing Constraint Solvers.
In Proceedings of Andrei Ershov Third International Conference Perspective of
System Informatics, PSI’99, volume 1755 of Lecture Notes in Computer Science,
pages 402–415, Novosibirsk, Akademgorodok, Russia, 2000. Springer-Verlag.
3. Carlos Castro and Eric Monfroy. Basic Operators for Solving Constraints via Collaboration of Solvers. In Proceedings of The Fifth International Conference on
Artificial Intelligence and Symbolic Computation, Theory, Implementations and
Applications, AISC 2000, volume 1930 of Lecture Notes in Artificial Intelligence,
pages 142–156, Madrid, Spain, July 2000. Springer-Verlag.
4. Carlos Castro and Eric Monfroy. Towards a framework for designing constraint
solvers and solver collaborations. Joint Bulletin of the Novosibirsk Computing Center (NCC) and the A. P. Ershov Institute of Informatics Systems (IIS). Series:
Computer Science. Russian Academy of Sciences, Siberian Branch., 16:1–28, December 2001.
5. Filippo Focacci, François Laburthe, and Andrea Lodi. Constraint and Integer Programming: Toward a Unified Methodology, chapter 9, Local Search and Constraint
Programming. Kluwer, November 2003.
6. L. Granvilliers, E. Monfroy, and F. Benhamou. Symbolic-Interval Cooperation in
Constraint Programming. In Proceedings of the 26th International Symposium on
Symbolic and Algebraic Computation (ISSAC’2001), pages 150–166, University of
Western Ontario, London, Ontario, Canada, 2001. ACM Press.
7. Robert M. Haralick and Gordon L. Elliot. Increasing Tree Search Efficiency for
Constraint Satisfaction Problems. Artificial Intelligence, 14:263–313, 1980.
8. Jussien and Lhomme. Local search with constraint propagation and conflict-based
heuristics. Artificial Intelligence, 139:21–45, 2002.
9. Vipin Kumar. Algorithms for Constraint-Satisfaction Problems: A Survey. Artificial Intelligence Magazine, 13(1):32–44, Spring 1992.
10. Alan K. Mackworth. Consistency in Networks of Relations. Artificial Intelligence,
8:99–118, 1977.
11. Philippe Marti and Michel Rueher. A Distributed Cooperating Constraints Solving
System. International Journal of Artificial Intelligence Tools, 4(1-2):93–113, 1995.
12. E. Monfroy, M. Rusinowitch, and R. Schott. Implementing Non-Linear Constraints
with Cooperative Solvers. In K. M. George, J. H. Carroll, D. Oppenheim, and
J. Hightower, editors, Proceedings of ACM Symposium on Applied Computing
(SAC’96), Philadelphia, PA, USA, pages 63–72. ACM Press, February 1996.
13. Prestwich. Combining the scalability of local search with the pruning techniques
of systematic search. Annals of Operations Research, 115:51–72, 2002.
14. M. Solomon. Algorithms for the vehicle routing and scheduling problem with time
window constraints. Operations Research, pages 254–365, 1987.
15. Edward Tsang. Foundations of Constraint Satisfaction. Academic Press, 1993.
16. Martin Zahn and Walter Hower. Backtracking along with constraint processing
and their time complexities. Journal of Experimental and Theoretical Artificial
Intelligence, 8:63–74, 1996.
Machine Learned Heuristics
to Improve Constraint Satisfaction*
Marco Correia and Pedro Barahona
Centro de Inteligência Artificial, Departamento de Informática
Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
{mvc,pb}@di.fct.unl.pt
Abstract. Although propagation techniques are very important for solving constraint satisfaction problems, heuristics are still necessary to handle non-trivial problems efficiently. General principles may be defined for such heuristics (e.g. first-fail and best-promise), but problems arise in their implementation, except for some limited sources of information (e.g. the cardinality of variable domains). Other possibly relevant features are ignored due to the difficulty in understanding their interaction and in finding a convenient way of integrating them. In this paper we illustrate such difficulties in a specific problem, the determination of protein structure from Nuclear Magnetic Resonance (NMR) data. We show that machine learning techniques can be used to define better heuristics than heuristics based on single features, or even than their combination in simple forms (e.g. majority vote). The technique is quite general and, with the necessary adaptations, may be applied to many other constraint satisfaction problems.
Keywords: constraint programming, machine learning, bioinformatics
1 Introduction
In constraint programming, the main role is usually given to constraint propagation techniques (e.g. [1,2]) that, by effectively narrowing the domains of the variables, significantly decrease the search effort.
However, non-trivial problems still need the adoption of heuristics to speed up search and find solutions with an acceptable use of resources (time and/or space). Most heuristics follow certain general principles, namely the first-fail principle for variable selection (enumerate difficult variables first) and the best-promise heuristic for value selection (choose the value that most likely belongs to a solution) [3].
The implementation of these principles usually depends on the specificities of
the problems or even problem instances, and is more often an art than a science.
A more global view of the problem can be used, by measuring its potential by
means of global indicators (e.g. the kappa indicator [4]), but such techniques
* This work was supported by Fundação para a Ciência e Tecnologia, under project
PROTEINA, POSI/33794/SRI/2000
do not take into account all the possible relevant features. Many other specific
features could possibly be taken into account, but they interact in such unpredictable ways that it is often difficult to specify an adequate form of combining
them.
This paper illustrates one such problem, determination of protein structure
given a set of distance constraints between atom pairs extracted from Nuclear
Magnetic Resonance (NMR) data. Previous work on the problem led us to adopt
first-fail, best promise heuristics, selecting from the variables with smaller domains those halfs that interact least with other variables. However, many other
interesting features can be specified (various forms of volumes, distances, constraint satisfaction promise, etc.) but their use was not tried before.
In this paper we report on the first experiments that exploit the rich information that was being ignored. We show that none of the many heuristics that
can be considered always outperforms the others. Hence, a combination of the
various features would be desirable, and we report on various possibilities of such
integration. Eventually, we show that a neural network based machine learning
approach is the one that produces the best results.
The paper is organised as follows. In section 2 we briefly describe PSICO, developed to predict protein structure from NMR data. The next section discusses
profiling techniques to measure the performance of PSICO. Section 4 presents a
number of features that can be exploited in search heuristics, and shows the potential of machine learning techniques to integrate them. Section 5 briefly shows
a preliminary evaluation of the improvements obtained with such heuristics. Last
section presents the main conclusions and directions for further research.
2 The PSICO Algorithm
PSICO (Processing Structural Information with Constraint programming and
Optimisation) [5,6] is a constraint-based algorithm to predict the tri-dimensional structure of proteins from the set of atom distances found by NMR, a technique that can only estimate lower and upper bounds for the distance between atoms
not too far apart. The goal of the algorithm is to take the list of bounded
distances as input and produce valid 3D positions for all atoms in the protein.
2.1 Definition as a CSP
This problem can be modeled as a constraint satisfaction problem assuming
the positions of the atoms as the (tri-dimensional) variables and the allowed
distances among them as the constraints. The domain of each variable is represented by one good cuboid (allowed region) resulting from the intersection of
several cubic in constraints (see below), containing a number (possibly zero) of
no-good cuboids (forbidden regions) representing the union of one or more cubic
out constraints. Spherical distance constraints are implemented based on the
following relaxation:
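One common way to realise such a relaxation (a Python sketch under our own assumptions, not necessarily the exact formula used by PSICO) is to replace each spherical constraint by an axis-aligned cuboid: a maximum-distance (in) constraint by the bounding cube of the sphere, and a minimum-distance (out) constraint by the cube inscribed in the sphere.

import math

def in_cuboid(center, d):
    """Axis-aligned cuboid containing the sphere of radius d around center."""
    (cx, cy, cz) = center
    return ((cx - d, cy - d, cz - d), (cx + d, cy + d, cz + d))

def out_cuboid(center, d):
    """Axis-aligned no-good cuboid contained in the sphere of radius d around center."""
    h = d / math.sqrt(3)                     # half-side of the inscribed cube
    (cx, cy, cz) = center
    return ((cx - h, cy - h, cz - h), (cx + h, cy + h, cz + h))

def intersect(c1, c2):
    """Intersection of two cuboids given as (lo, hi) corners, or None if disjoint."""
    (lo1, hi1), (lo2, hi2) = c1, c2
    lo = tuple(max(a, b) for a, b in zip(lo1, lo2))
    hi = tuple(min(a, b) for a, b in zip(hi1, hi2))
    return (lo, hi) if all(l <= h for l, h in zip(lo, hi)) else None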
2.2 Algorithm
We will focus on the first phase of the PSICO algorithm, which is a depth first
search with full look ahead using an AC-3 variant for consistency enforcement.
Variables are pruned in round robin by eliminating one half of the cuboid at
each enumeration. Two heuristics are used for variable and value selection. After
variable enumeration, arc-consistency is enforced by constraint propagation.
Whenever a domain becomes empty the algorithm backtracks, but backtracking is seldom used. The number of enumerations between an incorrect enumeration and the enumeration where the inconsistency is found is prohibitively large
and recovering from bad decisions has a profound impact on execution time.
While better alternatives to standard backtracking are being pursued, the first
phase of the PSICO algorithm is used only for providing good approximations
of the solution, relying on the promise of both variable and value enumeration
heuristics to do so. When domains are small enough, search terminates with success and the PSICO algorithm moves on to its second phase (see [5]).
3 Profiling
3.1 Sample Set
The set of proteins used for profiling the impact of heuristics on search is composed of seven samples (proteins) ordered from the smallest (sample 1) to the
largest (sample 7) chosen sparsely over the range of NMR solved structures in
the BMRB database [7]. The corresponding structural data was retrieved from
the PDB database [8]. The number of constraints per variable and the constraint
tightness does not vary significantly with the size of the problems [9].
3.2 Profiling Techniques
Common performance measures like the number of nodes visited, average path
length, and many others, are not adequate for this problem, given the limited use
of backtracking. Instead, in this paper, the algorithm performance is estimated
by the RMS (Root Mean Square) distance between the solution found by the
algorithm (molecule) and a known solution for the problem (the oracle), once
properly aligned [5].
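For concreteness, the RMS distance between two already aligned structures can be computed as in the following Python sketch (the alignment step itself, mentioned in [5], is omitted):

import math

def rmsd(coords_a, coords_b):
    """coords_a, coords_b: equal-length lists of (x, y, z) atom positions, already aligned."""
    assert len(coords_a) == len(coords_b)
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))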
3.3 Probabilistic Analysis of Value Ordering
Tests were made to assess the required performance of a value enumeration heuristic. The final RMS distance was averaged over several runs of all test samples by using a probabilistic function for selecting the correct half of the domain at each enumeration. Fig. 1 shows that a heuristic must make correct predictions 80% of the time to achieve a good solution for the first phase of the algorithm (4Å or less).
Fig. 1. Final RMS distance at the end of search using a value enumeration with a given probability of making the correct choice. Results are displayed for all samples.
4 Domain Region Features
A number of features that characterize different regions of the domain of the
variable to enumerate given the current state of the problem may help in suggesting which region is most likely to contain the solution. They were grouped
according to their source of information: (a) Volumes, (b) Distances, (c) No-good
information and (d) Constraint minimization vectors.
Fig. 2. Illustration of two sources of features. The spotted cuboid is the domain of the variable to enumerate. Geometrical properties of the other domains (exterior cuboids) can help in choosing which half R of the domain will be selected. The set of no-goods (interior cuboids) is another source of features being used.
In the following functions, N is the number of atoms, one cuboid is the domain of the variable to enumerate, the remaining cuboids are the domains of the other variables in the problem, and R is a region inside the former, typically one half of the cubic domain (see fig. 2). For the first group, the following measures were considered:
The first function is the sum of the volumes of all domains intersected with
the considered region. Function loc-a2 accumulates the fraction of each domain
inside region R, to account for the small cuboid intersections largely ignored by
the previous function. The third function assigns more weight to the intersection
value. The last function simply counts the number of domains that intersect the
considered region.
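A possible Python rendering of these four volume-based features, for axis-aligned cuboids given by their corner points, is sketched below; the extra weighting used by loc-a3 is not spelled out above, so squaring the volume fraction is used here purely as a stand-in assumption:

def volume(c):
    (lo, hi) = c
    return max(0.0, hi[0] - lo[0]) * max(0.0, hi[1] - lo[1]) * max(0.0, hi[2] - lo[2])

def inter_volume(c1, c2):
    lo = tuple(max(a, b) for a, b in zip(c1[0], c2[0]))
    hi = tuple(min(a, b) for a, b in zip(c1[1], c2[1]))
    return volume((lo, hi)) if all(l < h for l, h in zip(lo, hi)) else 0.0

def loc_a1(region, domains):                 # total volume intersected with R
    return sum(inter_volume(d, region) for d in domains)

def loc_a2(region, domains):                 # fraction of each domain inside R
    return sum(inter_volume(d, region) / volume(d) for d in domains if volume(d) > 0)

def loc_a3(region, domains):                 # heavier weight on the intersection (assumed squaring)
    return sum((inter_volume(d, region) / volume(d)) ** 2
               for d in domains if volume(d) > 0)

def loc_a4(region, domains):                 # number of domains intersecting R
    return sum(1 for d in domains if inter_volume(d, region) > 0)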
The second group of features considers distances between the center of the
cubic domains:
Function loc-b1
accumulates the distances between the center of
region R and the center of all the other domains. Function loc-b2
sums the distances between the center of the considered region and the center
of the intersections between the region and the other domains. Function loc-b3
is similar but considers the center of the intersection with the
entire domain
instead.
The third group of features is based on the set of NG no-goods of the domain
of the variable to enumerate, each represented by
(see fig. 2):
They represent respectively the sum of the no-good volume inside region R
and the sum of distances between the center of each no-good and the region R.
The last set considers constraint information features, where
represents
a constraint from a set of
constraints involving variable
to enumerate:
The first function counts the number of constraints violated considering the center of the region R and the centers of the other domains involved. Feature loc-d2 has three variations. For each violated constraint between two variables, a vector is applied from the center of one domain to the center of the other, with magnitude and direction defined by the sign of the distance between the two centers. Features loc-d2-1 and loc-d2-3 are respectively the sums of these vectors for in and out constraints over the variable to enumerate, while feature loc-d2-2 averages the vectors for all constraints affecting it. Feature loc-d3 is a special case of loc-d2-2, as it does not consider vectors for constraints which are not being violated (by the domain centers).
4.1 Isolated Features as Value Enumeration Heuristics
An interval enumeration heuristic can generally be seen as a function which selects, from several candidate domain regions1, the domain region R most likely to contain the solution(s). The most straightforward method to incorporate each feature presented above in a value enumeration heuristic is to use a function that selects R based on a simple relation (> or <) among the outputs of the feature for each candidate region.
Fig. 3. Value ordering heuristic performance averaged over all samples. Each point represents the percent of correct choices for an average domain side length and was averaged over 100 independent runs.
Figure 3 shows the percentage of correct choices of each isolated feature, averaged over all test samples. As can be seen, some features are best suited for early stages of search (e.g. features based on volumes) and others for final stages (features based on constraint minimization vectors). This did not come as a complete surprise, since at the beginning of search most domains are very large and overlapping, making volume-based features meaningful and constraint vectors useless. At the end of search, domains are sparsely distributed, uniformly sized and smaller, thus making constraint minimization vectors more informative. All these value heuristics based on isolated features clearly fall short of the lower bound estimated in Section 3.3 for obtaining good approximate solutions.
4.2 Feature Combination
Since none of the heuristics dominates the others, we considered their combination, by the following methods:
Ad-Hoc Selection of Best Feature. Analysis of the charts of figure 3 suggests using feature loc-a2 for average side lengths above 10Å and loc-d2-2 for those below. The resulting ad-hoc heuristic is used in the comparisons below.
1 In this approach only two regions (halves) of the domain are considered at each enumeration. For an explanation of why this is a better option refer to [9].
Majority Voting. In this case an odd number of heuristics based on isolated features vote on the region most likely to contain the solutions, and the region with more votes is selected. The majority-voting heuristic is based on three of the presented heuristics, taken from three different classes (volumes, distances, and constraints).
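A minimal Python sketch of this voting rule, assuming each single-feature heuristic is a callable returning 0 or 1 for the two candidate halves, is:

def majority_vote(heuristics, half0, half1, state):
    """Each heuristic votes for half0 (0) or half1 (1); the majority wins."""
    votes = sum(h(half0, half1, state) for h in heuristics)   # votes for half1
    return half1 if votes > len(heuristics) / 2 else half0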
Neural Network. Features were also combined using a two-layer, feed-forward
neural network [10], representing a function that combines all features evaluated in both domain halves
plus the average domain side length. Training (with the usual backpropagation) and testing data were collected by doing several runs of the test samples using a random variable enumeration heuristic and recording, at each enumeration, the feature vector plus a boolean indicating the correct half, as given by the optimal value enumeration heuristic. The data were then arranged in seven partitions, where each partition included a training set made of data collected from runs of all samples except one, and a test set containing only data collected from runs of that held-out sample, to ensure the generalization ability of the learned function. For more details concerning the training and the network see [9].
The output of the learned function is filtered and used in a heuristic parameterized by a constant that denotes the risk associated with the prediction. Note that this heuristic may be undefined for a given feature vector if the prediction does not meet the required risk level.
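The following Python sketch illustrates the general shape of such a combination: a small two-layer feed-forward network scores the feature vector and a prediction is only issued when the score is far enough from the indifference point. The network size, the initialisation, and the exact filtering rule are illustrative assumptions of ours; training by backpropagation is not shown.

import math, random

class TwoLayerNet:
    def __init__(self, n_inputs, n_hidden=8):
        rnd = lambda: random.uniform(-0.5, 0.5)
        self.w1 = [[rnd() for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.w2 = [rnd() for _ in range(n_hidden)]

    def forward(self, x):
        sigm = lambda v: 1.0 / (1.0 + math.exp(-v))
        hidden = [sigm(sum(w * xi for w, xi in zip(row, x))) for row in self.w1]
        return sigm(sum(w * h for w, h in zip(self.w2, hidden)))

def nn_heuristic(net, features, risk):
    """Return 1 (second half), 0 (first half), or None when the prediction
    is considered too risky (the heuristic is then undefined)."""
    score = net.forward(features)
    if abs(score - 0.5) < risk:
        return None
    return 1 if score > 0.5 else 0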
Training and test performance of the neural network, together with test results of the learned function, are displayed in Table 1. Figure 4 shows a comparison of the performance of the resulting heuristics.
These results show that, as expected, predictions with a lower associated risk occur fewer times than those made with a higher risk. They also show that heuristics based on neural networks with a smaller associated risk perform better than the others, and the least risky one can actually guess the correct half of the domain 80% of the time, which has been shown to be a lower bound for acceptable final solution quality (see Section 3.3).
Fig. 4. Runtime performance comparison of the methods presented for feature combination, averaged over all partitions.
Neural Network Trained with Noisy Data. Since in this problem backtracking is not an option, the value enumeration heuristic must be robust and
account for early mistakes so that a good approximation to the solution may
still be found. It is therefore important that inconsistent states be included in
the training data of the neural networks.
The
neural network was then trained with data collected from several
runs driven by a probabilistic value ordering heuristic with 90% probability
of making the correct choice. For enumerations where the solution was already
outside the domain because of earlier mistakes, the correct choice was considered
the half whose center was closer to the solution.
Figure 5 shows the performance of both networks when classifying noisy data.
As expected, the neural network trained with noisy data outperforms the neural
network trained with clean data, if not by much. The chart on the right shows
that safe decisions with noisy data are much rarer than with clean data.
5 Application
In this section the enumeration heuristics presented above were integrated with
search and the final results produced were compared. The results for the heuristics based on neural networks were obtained by using networks trained with
noisy data.
The variable enumeration heuristic used with the ad-hoc and majority-voting heuristics always selects the variable with the smallest domain, which has been shown to maximize overall
Fig. 5. The graphic on the left shows the percentage of correct choices over the total number of choices considered “safe” by the heuristics trained with normal and noisy data. The graphic on the right shows the percentage of decisions considered “safe” over all decisions made by the heuristics. Results were averaged over all partitions.
search promise (see [9]). Since the neural-network heuristic may be undefined for a given enumeration with a given risk level, a modified version, which is always defined, was used instead. This version gives a hint on the correct region plus the risk associated with the prediction, valuable information that was used to define the variable enumeration order. This was done by evaluating the learned function for all domains at each enumeration and choosing the variable for which the prediction with the smallest risk can be made.
As errors accumulate, the information provided by the features degrades,
since they are measured from an already inconsistent state of the problem. To estimate the influence of this on the overall performance of the heuristics described above, tests were made where the first 10 and 20 value selection errors were corrected (fig. 6), with a view to exploiting a limited form of backtracking (e.g. limited discrepancy search [11]), since full exploitation of backtracking is unfeasible due
to the sheer size of the search space.
The solutions obtained with the neural network are consistently much better than those obtained with the ad-hoc heuristic selection or the majority vote, which justifies the use of this technique. Moreover, the quality of the solutions provided is quite promising. The RMSD above 5Å for the smallest proteins was reduced to less than 4Å when the first 10 wrong value choices are corrected. For the larger proteins the effect is more visible with the correction of 20 wrong choices, where the RMSD decreases from around 10Å to less than 6Å. Even before further improvements, these results already provide quite acceptable starting points for the second phase of PSICO.
Fig. 6. Final RMS distance between the solution found and a known solution using
the value heuristics described. The three charts show the results when correcting the
first 0 (left), 10 (center) and 20 (right) mistakes of the heuristics.
6 Conclusion
In this paper we show that machine learning techniques can be used to integrate
various features, and that they outperform heuristics based on single features or
on simple feature combination. Notwithstanding the specificity of the problem
under consideration, the approach should be easily adapted to handle other
difficult problems (notice that none of the domain features considered conveys
any specific biochemical information).
In the determination of protein structure from NMR data, these heuristics made the constraint satisfaction phase of our algorithm reach results with a much lower RMSD than previously achievable.
We are now considering the tuning of the heuristic selection, not only by including biochemical information (e.g. amino-acid hydrophobicity), but also by incorporating other advanced propagation techniques for rigid (sub-)structures, as well as developing a controlled form of backtracking (e.g. limited discrepancy search) that may efficiently exploit the correction of the first wrong value choice decisions.
References
1. Beldiceanu, N., Contejean, E.: Introducing global constraints in CHIP. Mathl.
Comput. Modelling 20 (1994) 97–123
2. Krippahl, L., Barahona, P.: Propagating N-ary rigid-body constraints. In: ICCP:
International Conference on Constraint Programming (CP), LNCS. (2003)
3. Beck, J., Prosser, P., Wallace, R.: Toward understanding variable ordering heuristics for constraint satisfaction problems. In: Procs. of the Fourteenth Irish Artificial
Intelligence and Cognitive Science Conference (AICS03). (2003)
4. Gent, I.P., MacIntyre, E., Prosser, P., Walsh, T.: The constrainedness of search.
In: AAAI/IAAI, Vol. 1. (1996) 246–252
5. Krippahl, L., Barahona, P.: PSICO: Solving protein structures with constraint
programming and optimization. Constraints 7 (2002) 317–331
6. Krippahl, L.: Integrating Protein Structural Information. PhD thesis, FCT/UNL
(2003)
7. Seavey, B., Farr, E., Westler, W., Markley, J.: A relational database for sequence-specific protein NMR data. J. Biomolecular NMR 1 (1991) 217–236
8. Noguchi, T., Onizuka, K., Akiyama, Y., Saito, M.: PDB-REPRDB: A database
of representative protein chains in PDB (Protein Data Bank). In: Procs. of the
5th International Conference on Intelligent Systems for Molecular Biology, Menlo
Park, AAAI Press (1997) 214–217
9. Correia, M.: Heuristic search for protein structure determination. Master’s thesis,
FCT/UNL (Submitted March/2004)
10. Haykin, S.: Neural Networks: A comprehensive Foundation. Macmillan College
Publishing Company, Inc. (1994)
11. Harvey, W., Ginsberg, M.: Limited discrepancy search. In Mellish, C., ed.: IJCAI’95: Procs. Int. Joint Conference on Artificial Intelligence, Montreal (1995)
Towards a Natural Way of Reasoning
José Carlos Loureiro Ralha and Célia Ghedini Ralha
Departamento de Ciência da Computação
Instituto de Ciências Exatas
Universidade de Brasília
Campus Universitário Darcy Ribeiro
Asa Norte Brasília DF 70.910–900
{ralha,ghedini}@cic.unb.br
Abstract. It is well known that the traditional quantifiers ∀ and ∃ are not suitable for
expressing common sense rules such as ‘birds fly.’ This sentence expresses the defeasible idea that ‘most birds fly,’ and not ‘all birds fly.’ Another defeasible rule is
exemplified by ‘many Americans like American football.’ Noun phrases such as
‘most birds, many birds,’ or even ‘some birds’ are recognized by semanticists as
natural language generalized quantifiers. From a non-monotonic reasoning perspective, one can divide the class of linguistic generalized quantifiers into two
categories. The first partition includes categorical quantifiers such as ‘all birds.’
The other one includes the defeasible quantifiers such as ‘most birds’ and ‘many
birds.’ It is clear that the semantics of defeasible quantifiers leaves room for exceptions. The exceptional elements licensed by defeasible generalized quantifiers
are usually the ‘non flying birds’ that non-monotonic logics deal with.
Keywords: generalized quantifiers, non-monotonic reasoning, defeasible reasoning, argumentative systems.
1 Generalized Quantifiers and Non-monotonic Reasoning
It is common knowledge that mammals don’t lay eggs. However, there are some recalcitrant mammals that do lay eggs. Actually, there are only three species of monotreme1
in the world – the platypus and two species of echidna known as spiny anteaters. This
knowledge, which is properly and easily expressed through natural language sentences
as in (1), cannot be formalized so easily. Only through the use of sophisticated logic systems can this common sense knowledge be grasped by formal systems based on
universal and existential quantifiers ([1], [12], [18], [19]).
(1) a – Most mammals don’t lay eggs.
b – Few mammals lay eggs.
In English and other natural languages, quantifying determiners like all, no, every,
some, most, many, few, all but two, are always accompanied by nominal expressions that
seem to restrict the universe of discourse to individuals to which the nominal applies.
Although quantification in ordinary logic bears a connection with quantification in English, such a connection is not straightforward. Nominals like man in (2) are usually
represented by a predicate in ordinary logic.
1 Monotreme is the order of mammals that lay eggs.
(2) a – Every man snores.
b – ∀x (man(x) → snore(x))
c – Some man snores.
d – ∃x (man(x) ∧ snore(x))
However, such representations do not emphasize the intuition that nominals like man
do play, indeed, quite a different role from predicates like snore. Moreover, ordinary
logic’s formulas change not only the quantifier but also the connective in a complex
formula over which the quantifier has scope. The companion connective for is
for
is In contrast, the English sentences differ only in the quantifying expression used.
(3) shows a way to make the dependence of the quantifier on the nominal explicit with
the further advantage of making clear the need for no connectives when considering
simple sentences as those in (2)2.
(3) a – [∀x : man(x)] snore(x)
b – [∃x : man(x)] snore(x)
Logics using this kind of quantification impose that the range of quantifiers be restricted to those individuals satisfying the formula immediately following the quantifying expression. The quantifiers are then interpreted as usual requiring that all (3.a) or
some (3.b) of the assignments of values to x satisfying the restricting formula must also
satisfy what follows.
Both approaches work equally well for traditional quantifiers such as those in (2). However, quantifiers like most and few, which cannot be represented in ordinary logic, are easily accounted for in the restricted quantification approach. Example (4.c) is true iff more than half of the assignments from the restricted domain of mammals are also assignments for which lay_eggs is false3. On the other hand, if one takes few as the opposite of most, then example (4.f) is true iff less than half of the assignments from the restricted domain of mammals are also assignments for which lay_eggs is true. (4.b) and (4.e) are dead ends since there is no combination of most assignments and few assignments, respectively, with the connectives →, ∧ and ¬ capable of expressing (4.a) and (4.d) in ordinary logic.
(4) a – Most mammals don’t lay eggs.
b – ?
c – [most x : mammal(x)] ¬lay_eggs(x)
d – Few mammals lay eggs.
e – ?
f – [few x : mammal(x)] lay_eggs(x)
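Spelled out in set notation (our own rendering, not the authors' typography), the truth conditions just described for (4.c) and (4.f) read:

% Truth conditions for (4.c) and (4.f) under restricted quantification.
\[
  [\mathit{most}\; x : \mathit{mammal}(x)]\, \neg\mathit{lay\_eggs}(x)
  \;\text{ is true iff }\;
  |\{x : \mathit{mammal}(x) \wedge \neg\mathit{lay\_eggs}(x)\}| > \tfrac{1}{2}\,|\{x : \mathit{mammal}(x)\}|
\]
\[
  [\mathit{few}\; x : \mathit{mammal}(x)]\, \mathit{lay\_eggs}(x)
  \;\text{ is true iff }\;
  |\{x : \mathit{mammal}(x) \wedge \mathit{lay\_eggs}(x)\}| < \tfrac{1}{2}\,|\{x : \mathit{mammal}(x)\}|
\]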
2 Notation in (3) was first seen by the authors at the Fifth European Summer School in Logic, Language and Information, in Jan Tore Lønning’s reader on Generalized Quantifiers. The same kind of notation can be found in ([8, page 42]).
3 The semantics presented for most as more than half of the successful assignments is oversimplified. One can argue, as [9] does, for other readings of most As are Bs: (i) relative to some specified threshold, (ii) by direct comparison, or (iii) in terms of a measure.
Ideas employed in (4) can be applied to other quantification structures as long as we
take (i) full noun phrases, NPs, (determiner + nominal), as logical quantifiers and (ii)
sets of sets as the semantic objects interpreting NPs.
The previous discussion made a point toward moving from determiners as quantifiers, as occurs in ordinary logic, to full NPs as quantifiers; full NPs as quantifiers are also known as generalized quantifiers, GQ.
One remarkable feature of GQ is the notational uniformity of its formulas. As (3.a), (3.b), (4.c), and (4.f) exemplify, GQ formulas can be expressed by the (Det x : A(x)) B(x) schema. This uniformity makes the development of formal systems and automatic theorem provers easier. It is worth noting that the parallel between natural language and formal language induced by the compositional principle – the uniformity referred to before – makes the translation between them even easier.
2 GQ Notation
Although GQ notation, conforming to the schema (Det x : A(x)) B(x), has some advantages when compared to traditional notations, it can be pushed even further when considering theorem proving implementation issues.
From now on, a general schema in which the quantified variable is indexed by its determiner replaces the previous GQ notation. There are many reasons to stick to the new notation introduced by the authors in this paper. First, it makes clear the nature of the quantified variable: the index says, on a naïve basis, that the variable belongs to the class quantified by det. This nature could be used to determine what kind of inference ought to be (or has been) used. Determiners such as all and any allow categorical deductions, while most, almost, fewer... than, etc. allow only defeasible deductions. So, if there exists a relationship amongst GQs, the formulas implicitly carry such a relationship into them. Therefore, we don't have to assign degrees to the common sense rules as many non-monotonic systems usually do4. Second, GQ notation sticks to the compositional criteria. The same is not true for Fregean systems, as pointed out in Sect. 1. Last, but not least, it seems not difficult to develop deduction systems using such notation. And this is a desirable feature, since it allows the development of efficient natural deduction automatic theorem provers as well as resolution-based ones ([14], [5], [12]).
3 A Few Comments on Inference Rules
It seems clear that non-monotonic systems ought to rely on non-monotonic inference rules in order to keep inconsistency away.
For traditional5 systems, the most common inference rules are the defeasible modus ponens, the Penguin Principle, and the Nixon Diamond. Basically, the defeasible modus ponens says that one is granted to (defeasibly) infer B from A and the defeasible rule that As are normally Bs, but only in the absence of contrary evidence. The Penguin Principle expresses a specificity relationship
4 This is akin in spirit to [5], where the authors say "... We believe that common sense reasoning should be defeasible in a way that is not explicitly programmed."
5 Understand as traditional the logic systems, monotonic or not, built upon ∀ and ∃.
between conditions; it says that one should pick the more specific rule. The Nixon Diamond rule states that opposite conclusions should not be allowed in a system if they were drawn from inference chains of the same specificity. These three principles try to overcome the problem brought to traditional logics by the choice of ∀ and ∃ as the only quantifiers.
Suppose we adopt NPs as quantifying expressions. For each NP, if one looks at the
core determiner, one can recognize specific (common sense) inference patterns which
can be divided into two categories. Some inference patterns are categorical as exemplified by all. But most patterns are defeasible as exemplified by most6.
Defeasible inference patterns induced by “defeasible NPs” get their strength from
human communication behavior. People involved in a communicative process work in
a collaborative way. If someone says “most birds fly”, and “Tweety is a bird”, the conversation partners usually take as granted that “Tweety flies”. The inferred knowledge
comes from the use of (defeasible) modus ponens. Only when someone has better knowledge about Tweety's flying abilities is that person supposed to introduce it to the partners. The better (or more specific) knowledge thus acquired defeats the previous knowledge. The remarkable point about getting better knowledge is the way it is done: in a dialectical, argumentative way resembling game-theoretical systems ([10], [16], [4], [7]).
4 Argumentative Approach
The naïve idea behind argumentation could be summarized by the motto "contrary evidence should be provided by the opponent". Therefore, argumentative theories can be seen as a game between two challengers in which both players try to rebut the opponent's conclusions.
The game is played in turns by the two challengers in the following fashion: if one player comes to a defeasible conclusion, the opponent takes a turn trying to defeat that result. The easiest way to do so is to assume the opposite of the other player's argument, hoping to arrive at the opposite conclusion. If the opponent succeeds and the derivation path does not include any defeasible quantifier, then the defeasible conclusion of the first player has been successfully defeated. If the opponent succeeds but the derivation path includes a defeasible argument, then both players lose the game.
The literature presents different argumentative strategies, which can be seen as polite or greedy. Examples in the present paper conform to a polite approach, since the adversary keeps quiet until the end of his opponent's deduction. A greedy strategy is based on monitoring the adversary's deduction and stopping him as soon as he uses a defeasible argument or rule.
Figure 1 and Fig. 2 show the argumentation process at work: in the first example one challenger wins, while in the second both players lose. To understand these figures, one has to know (i) that, for these and all subsequent figures in the paper, the first column enumerates each formula in a deduction chain, the second column is the GQ-formula proper, and the third column is a retro-referential system, i.e., the explanation for deriving such a formula; and (ii) how the inference rules work.
6 These patterns correspond to [5]'s strict and defeasible rules, respectively.
The next section presents GQ inference rules in a setup suitable for the development of dialectical argumentative systems.
5 GQ Inference Rules
Robinson’s Resolution rule [17] can be adapted, in a similar way found in [3], to become
the GQ-resolution rule. GQ-resolution is shown in (5).
where
defeasible and
if either both
is categorical.
and
belongs to the same category7 or
is
GQ-resolution means that one can mix categorical and defeasible knowledge. It also means that defeasible knowledge "weakens" the GQ-resolvent. This is accomplished by making the weakest determiner between det1 and det2 the determiner of the GQ-resolvent. Notice also that the weakening process gets propagated through the inference chain, making possible the conclusion's rebuttal as presented in Section 4. Examples (6) and (7) shed light on the subject.
The first line of (6) says that most birds fly; this clause is defeasible, as emphasized by the determiner most. The second line says that Tux doesn't fly. This categorical clause is characterized by the proper name 'Tux', taken as a categorical determiner denoted cte. Both clauses GQ-resolve, delivering a GQ-resolvent in which, however, most weakens cte: the GQ-resolvent is a defeasible clause, and this defeasibility is stressed by keeping most as its determiner. Notice also that the resolvent is GQ-satisfiable iff there is evidence supporting that Tux is a bird.
7 Recall that all, some and cte are taken as categorical, while most is taken as defeasible.
(7) shows the case for categorical determiners: now the GQ-resolvent is a categorical clause, which is GQ-satisfiable iff there is evidence supporting that Tux is a penguin.
Traditional resolution systems define the empty clause □. GQ-clauses such as the resolvents above can be seen as GQ-quasi_empty clauses. As opposed to □, which is unsatisfiable, no one can take for sure the unsatisfiability of GQ-quasi_empty clauses. In order to show the unsatisfiability of a GQ-quasi_empty clause, one has to make sure that its "left hand side", i.e., what comes before □, is unsatisfiable.
This is accomplished through the shift (admissible) rule informally stated in (8).
(8) Let φ be an arbitrary GQ-formula and Δ an arbitrary derivation tree. If φ occurs in Δ as a restriction, then ¬φ can be adjoined to Δ.
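As a rough sketch of the determiner bookkeeping behind GQ-resolution (our own schematic rendering, not the authors' calculus: the clause representation, the function names and the 'Tux' encoding are illustrative assumptions), the code below propagates the weakest determiner to the resolvent and accumulates the restrictions on its left hand side.

```python
# Illustrative sketch of GQ-resolution with 'weakest determiner' propagation.
# A clause is (determiner, restrictions, literals); literals are strings, '~' negates.

CATEGORICAL = {"all", "some", "cte"}   # cf. footnote 7
DEFEASIBLE = {"most", "few"}

def category(det):
    return "categorical" if det in CATEGORICAL else "defeasible"

def weakest(det1, det2):
    """Defeasible determiners weaken the GQ-resolvent."""
    if category(det1) == category(det2):
        return det1
    return det1 if category(det1) == "defeasible" else det2

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def gq_resolve(clause1, clause2, lit):
    """Resolve clause1 (containing lit) against clause2 (containing ~lit)."""
    det1, restr1, lits1 = clause1
    det2, restr2, lits2 = clause2
    assert lit in lits1 and negate(lit) in lits2
    new_lits = (lits1 - {lit}) | (lits2 - {negate(lit)})
    new_restr = restr1 | restr2            # restrictions accumulate on the left hand side
    return (weakest(det1, det2), new_restr, new_lits)

# 'Most birds fly' resolved against the categorical fact that Tux does not fly:
c1 = ("most", frozenset({"bird(x)"}), frozenset({"fly(x)"}))
c2 = ("cte", frozenset({"x=Tux"}), frozenset({"~fly(x)"}))
print(gq_resolve(c1, c2, "fly(x)"))
# -> determiner 'most', restrictions {bird(x), x=Tux}, no literals left (a GQ-quasi_empty clause)
```

The empty literal set with a non-empty restriction corresponds to the GQ-quasi_empty clause discussed above: it is refuted only if the accumulated restrictions can be shown unsatisfiable, which is what the shift rule (8) is for.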
6 GQ Reasoner at Work
It should not be difficult to develop an argumentative, refutation-style, defeasible theorem prover based on generalized quantifiers. For the sake of space and simplicity, we explain GQ-reasoning in the context of a polite argumentation approach.
Deductions using GQ-resolution rest on the idea of "marking" and "propagating" the kind of determiner being used. This seems easily achievable from inspection of the notation adopted for variables. For defeasible determiners, the new derived clause is defeasible. Therefore, all subsequent clauses, including GQ-quasi_empty clauses, derived from defeasible clauses are defeasible. The main point here is that combinations of defeasible and categorical clauses lead to defeasible clauses. This point is exemplified by entry 7 in Fig. 1. Clause 7 was inferred by using rule 2a, which is categorical, over argument 6, which was arrived at on a defeasible inference chain. Therefore, the argument on entry 7 must be marked as defeasible; the mark goes to the det entry of the quantified variable.
Entry 5 in Fig. 1 deserves special attention. Note that the system inferred a GQ-quasi_empty clause rather than the classical empty clause □.
Fig. 1. Defeasible argumentation game for Penguin Principle
Fig. 2. Defeasible argumentation game for Nixon diamond
The reason is that there is a restriction still to be achieved. Therefore, the system must verify whether the restrictions can be met (cf. [3], [12], [2]). The easiest way to do so is to push the restrictions to the opposite side, negating the material being pushed over. This move is based on rule (8) and makes it possible to arrive at the GQ-empty clause □ when there is a refutation for the argument under dispute.
Suppose now that two players engage in a dispute trying to answer the question "Does Silvester lay eggs?" Suppose also that they share the following knowledge: (i) all cats are not monotremes; (ii) all cats are mammals; (iii) few mammals lay eggs; (iv) only8 monotremes are mammals that lay eggs; (v) Silvester is a cat. The winner is the player sustaining that Silvester does not lay eggs, i.e., ¬lay_eggs(Silvester). The dispute is shown in Fig. 3, where the third-column abbreviations stand for categorical axiom and defeasible axiom.
It is worth noticing that the winning player would lose the game in the absence of the very specific knowledge expressed by 'only monotremes are mammals that lay eggs'. In that situation the opponent would win the game, but his argument could be defeated as soon as new knowledge was brought in.
Rebuttals are started whenever a GQ-empty clause □ is drawn. The winner, if any, is the one who has arrived at the GQ-empty clause through a categorical inference chain. If no player wins the game, both arguments go to a controversial list ([12]'s control set). This is the case for the Nixon Diamond (see Fig. 2): pacifist and ¬pacifist go to the controversial list and cannot be used in any other deduction chain. This is the dead-end entry in Fig. 4 and is known as ambiguity propagation.
Since "Nixon diamonds" will be put on the controversial list, the system should consult the list before trying to unify a literal. If the literal is there, the system should try another path. If there are no options available, the current player gives up and the opponent tries to prove his/her rebutting argument. This process goes on in a recursive fashion, and the winner will be, as already pointed out, the one who has arrived at the empty clause through (i) a categorical inference chain, or (ii) a defeasible inference chain, provided the opponent cannot rebut the argument under dispute. The latter situation occurs when one player has a "weak argument" – a defeasible argument – but the opponent has none. "Nixon diamonds" are covered by neither (i) nor (ii). In
8 Only is not a determiner; it is an adverb, and therefore not in the scope of the present work. However, it seems reasonable to accept the translation given with respect to the example above.
Fig. 3. The monotreme dispute
this case, both players arrive at mutually defeasible conclusions; therefore both players lose the game.
The strategy described is clearly algorithmic and its implementation is straightforward; however, its complexity measures are not dealt with in the present paper.
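A minimal sketch of the polite argumentation loop just described, under our own simplifying assumptions: derive is an assumed interface to a GQ-resolution refutation procedure that reports whether a GQ-empty clause was reached and whether the chain used a defeasible step; it is not implemented here and is not the authors' prover.

```python
# Schematic polite argumentation game; 'derive(kb, goal)' is an assumed black box
# returning (refutation_found, used_defeasible_step).

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def polite_game(kb, goal, derive, controversial=None):
    """Return 'proponent', 'opponent' or 'draw' for the literal under dispute."""
    controversial = set() if controversial is None else controversial
    if goal in controversial or negate(goal) in controversial:
        return "draw"                                  # blocked by ambiguity propagation

    # The proponent argues for the goal; the opponent keeps quiet until the end.
    found, defeasible = derive(kb, goal)
    if not found:
        return "opponent"

    # The opponent then assumes the opposite and tries to rebut.
    r_found, r_defeasible = derive(kb, negate(goal))
    if not r_found:
        return "proponent"                             # no rebuttal at all
    if not r_defeasible:
        return "opponent"                              # categorical rebuttal defeats the goal
    if not defeasible:
        return "proponent"                             # categorical chain beats a defeasible rebuttal

    # Both chains are defeasible: a Nixon diamond; both conclusions become controversial.
    controversial.update({goal, negate(goal)})
    return "draw"
```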
7 Conclusions and Further Developments
In this paper we claimed that natural language generalized quantifiers such as most and all can be used as natural devices for dealing with non-monotonic, automated defeasible reasoning systems.
Non-monotonicity could be achieved through (i) defeasible logics, exemplified by [11], [15], and [1], or (ii) defeasible argumentative systems such as [4], [20], and [16]. As Simari states in [5], "... in most of these formalisms, a priority relation among rules must be explicitly given with the program in order to decide between rules with contradictory consequents." Anyway, what all non-monotonic formalisms try to explain (and deal with) is the meaning of vague concepts exemplified by 'generally'. A different approach can be found in the Velosos' work ([18], [19]). In these papers, the authors characterize 'generally' in terms of filter logics (FL) being faithfully embedded into a first-order theory of certain predicates. In this way, they provide a framework where the semantic intuitions of filter logics can capture the naïve meaning of 'most', for instance. Their framework supports theorem proving in FL via proof procedures and theorem provers for classical first-order logic (via faithful embedding). In this way, the Velosos' work deals with defeasibility in a monotonic fashion.
Fig. 4. Ambiguity propagation
However, in order to develop a theorem prover for 'generally', they have to embed FL into first-order logic (of certain predicates).
The main advantage of the GQ approach proposed in this paper is the clear separation between categorical and defeasible knowledge, and their interaction given by, for example, most, few, and all. Of course, such a separation improves the understanding of what makes common sense knowledge processing so hard. Most importantly, the approach introduced in the paper could be further expanded by introducing new natural language quantifiers. The natural rank amongst quantifiers would be used to control their interactions in a way almost impossible to achieve in traditional non-monotonic systems.
As a future development, logical properties of generalized quantifiers ([13], [6]) should be used in order to set up a GQ-framework dealing with a bigger class of defeasible determiners. This should improve the inference machinery of future GQ-theorem provers.
Acknowledgments
The authors would like to thank anonymous referees for suggestions and comments that
helped to improve the structure of the first version of this paper.
References
1. G. Antoniou, D. Billington, G. Governatori, and M. Maher. Representation results for defeasible logic. ACM Transactions on Computational Logic, 2(2):255–287, April 2001.
2. G. Antoniou, D. Billington, G. Governatori, M. J. Maher, and A. Rock. A family of defeasible reasoning logics and its implementation. In Proceedings of European Conference on
Artificial Intelligence, pages 459–463, 2000.
3. Hans-Jürgen Bürckert, Bernard Hollunder, and Armin Laux. On skolemization in constrained
logics. Technical Report RR-93-06, DFKI, March 1993.
4. C. I. Chesñevar, A. Maguitman, and R. Loui. Logical models of argument. ACM Computing
Surveys, 32(4):343–387, 2000.
5. Alejandro J. Garcia and Guillermo R. Simari. Defeasible logic programming: An argumentative approach. Article downloaded in May 2004 from http://cs.uns.edu.ar/~grs/Publications/index-publications.html. To appear in Theory and Practice of Logic Programming.
6. Peter Gärdenfors, editor. Generalized Quantifiers, Linguistic and Logical Approaches, volume 31 of Studies in Linguistics and Philosophy. Reidel, Dordrecht, 1987.
7. Jaakko Hintikka. Quantifiers in Natural Language: Game-Theoretical Semantics. D. Reidel,
1979. pp. 81–117.
8. Edward L. Keenan. The semantics of determiners. In Shalom Lappin, editor, The Handbook of Contemporary Semantic Theory, pages 41–63. Blackwell Publishers Inc., 1996.
9. Jan Tore Lønning. Natural language determiners and binary quantifiers. Handout, August
1993. Handout on Generalized Quantifiers presented at the fifth European Summer School
on Logic, Language and Information.
10. Paul Lorenzen. Metamatemática. Madrid: Tecnos, 1971, 1962. [Spanish translator: Jacobo
Muñoz].
11. Donald Nute. Defeasible logic. In D. M. Gabbay, C. J. Hogger, and J. A. Robinson, editors,
Handbook of Logic in Artificial Intelligence and Logic Programming, volume 3, pages 355–
395. Oxford University Press, 1994.
12. Sven Eric Panitz. Default reasoning with a constraint resolution principle. Ps file downloaded
on January 2003 from
http://www.ki.informatik.uni-frankfurt.de/persons/panitz/paper/bbt.ps.gz. The article was
presented at the LPAR 1993 in St Petersburg.
13. Stanley Peters and Dag Westerståhl. Quantifiers, 2002. Pdf file downloadable on May 2003
from http://www.stanford.edu/group/nasslli/courses/peter-wes/PWbookdraft2-3.pdf.
14. John L. Pollock. Natural deduction. Pdf file downloadable on December 2002 from Oscar’s
home page http://oscarhome.soc-sci.arizona.edu/ftp/OSCAR-web-page/oscar.html.
15. H. Prakken. Dialectical proof theory for defeasible argumentation with defeasible priorities.
In Proceedings of the ModelAge Workshop ‘Formal Models of Agents’, Lecture Notes in
Artificial Intelligence, Berlin, 1999. Springer Verlag.
16. Henry Prakken and Gerard Vreeswijk. Logics for defeasible argumentation. In D. Gabbay
and F. Guenthner, editors, Handbook of Philosophical Logic, volume 4, pages 218–319.
Kluwer Academic Publishers, Dordrecht, 2002.
17. J. A. Robinson. A machine-oriented logic based on the resolution principle. J. ACM,
12(1):23–41, 1965.
18. P. A. S. Veloso and W. A. Carnielli. Logics for qualitative reasoning. CLE e-prints, Vol. 1(3),
2001 (Section Logic) available at http://www.cle.unicamp.br/e-prints/abstract_3.htm.
To appear in “Logic, Epistemology and the Unity of Science” edited by Shahid Rahman and
John Symons, Kluwer, 2003.
19. P. A. S. Veloso and S. R. M. Veloso. On filter logics for ‘most’ and special predicates. Article
downloaded on May 2004 from
http://www.cs.math.ist.utl.pt/comblog04/abstracts/veloso.pdf.
20. G. A. W. Vreeswijk. Abstract argumentation systems. Artificial Intelligence, 90:225–279,
1997.
Is Plausible Reasoning a Sensible Alternative
for Inductive-Statistical Reasoning?*
Ricardo S. Silvestre1 and Tarcísio H.C. Pequeno2
1
Department of Philosophy, University of Montreal
2910 Édouard-Montpetit, Montréal, QC, H3T 1J7, Canada
(Doctoral Fellow, CNPq, Brazil)
[email protected]
2
Department of Computer Science, Federal University of Ceará
Bloco 910, Campus do Pici, Fortaleza-Ceará, 60455-760, Brazil
[email protected]
Abstract. The general purpose of this paper is to show a practical instance of
how philosophy can benefit from some ideas, methods and techniques developed in the field of Artificial Intelligence (AI). It has to do with some recent
claims [4] that some of the most traditional philosophical problems have been
raised and, in some sense, solved by AI researchers. The philosophical problem
we will deal with here is the representation of non-deductive intra-theoretic scientific inferences. We start by showing the flaws of the most traditional solution to this problem found in philosophy: Hempel's Inductive-Statistical (I-S) model [5]. Afterwards we present a new formal model, based on previous works motivated by reasoning needs in Artificial Intelligence [11], and show that, since it does not suffer from the problems identified in the I-S model, it has a good chance of being successful in the task of satisfactorily representing non-deductive intra-theoretic scientific inferences.
1 Introduction
In the introduction of a somewhat philosophical book of essays on Artificial Intelligence [4], the editors espouse the thesis that in the field of AI "traditional philosophical questions have received sharper formulations and surprising answers", adding that
“... important problems that the philosophical tradition overlooked have been raised
and solved [in AI]”. They go as far as claiming that “Were they reborn into a modern
university, Plato and Aristotle and Leibniz would most suitably take up appointments
in the department of computer science." Even recognizing a certain amount of over-enthusiasm and exaggeration in those affirmations, the fact is that there are evident similarities and parallels between some problems concretely faced in AI practice and some classic ones dealt with within philosophical investigation. However, although there is some contact between AI and philosophy in fields like philosophy of mind and philosophy of language, the effective contribution of ideas, methods and
* This work is partially supported by CNPq through the LOCIA (Logic, Science and Artificial Intelligence) Project.
techniques from AI to philosophy is still something hard to see. In this paper we
continue a project started in a previous work [14] and present what we believe to be a
bridge between these two areas of knowledge that, in addition to its own interest, can also serve as an example and an illustration of the many connections we hope will emerge.
The study of non-deductive inferences has played a fundamental role in both artificial intelligence (AI) and philosophy of science. While in the former it has given rise to the development of nonmonotonic logics [9], [10], [13], as AI theorists have named them, in the latter it has attracted philosophers, for over half a century, in the pursuit of a so-called logic of induction [2], [6], [7]. Perhaps because the technical devices used by these areas were quite different, the obvious fact that both AI researchers and philosophers were dealing with the same problem has remained almost unnoticed during all these years of formal investigation of non-deductive reasoning. The first observations about the similarities between these two domains appeared in print only around the end of the eighties [8], [11], [12], [15]. More than a surprising curiosity, the mentioned connection is important because, being concerned with the same problem of formalizing non-deductive patterns of inference, computer scientists and philosophers can, at least in principle, benefit from the results achieved by each other. It is our purpose here to lay down what we believe to be an instance of such a contribution from the side of AI to philosophy of science.
One of the problems that have motivated philosophers of science to engage in the
project of developing a logic of induction was the investigation of what we can call
intra-theoretic scientific inference, that is to say, the inferences performed inside a
scientific theory already existent and formalized in its basic principles. This kind of
inference thus goes from the theory’s basic principles to the derived ones, in opposition to the inductive inferences which go from particulars facts in order to establish
general principles. Intra-theoretic inferences play an important role, for example, in
the explanation of scientific laws and singular facts as well as in the prediction of
non-observed facts.
The traditional view concerning intra-theoretic scientific inferences states that scientific arguments are of two types: deductive and inductive/probabilistic. This deductive/inductive-probabilistic view of intra-theoretic scientific inferences was put forward in its most precise form by Carl Hempel’s models of scientific explanation [5].
In order to represent the non-deductive intra-theoretic scientific inferences, Hempel
proposed a probabilistic-based model of scientific explanation named by him Inductive-Statistical (I-S) model. However, in spite of its intuitive appeal, this model was
unable to solve satisfactorily the so-called problem of inductive ambiguities, which is
surprisingly similar to the problem of anomalous extensions that AI theorists working
with nonmonotonic logics are so familiar with.
Our purpose in this paper is to show how a logic which combines nonmonotonicity
(in the style of Reiter’s default logic [13]) with paraconsistency [3] regarding nonmonotonically derived formulae is able to satisfactorily express reasoning under these
circumstances, dealing properly with the mentioned inconsistency problems that undermined Hempel’s project. The structure of the paper is as follows. First of all we
introduce Hempel’s I-S model and show, through some classical examples, how it
fails in treating properly some very simple cases. This is done in Section 2. Our
nonmonotonic and paraconsistent logical system is presented in Section 3, where we
also show how it is able to avoid the problems that plagued Hempel’s proposal.
2 Hempel’s I-S Model and the Problem
of Inductive Inconsistencies
According to most historiographers of philosophy, the history of the philosophical
analysis of scientific explanation began with the publication of ‘Studies in the Logic
of Explanation’ in 1948 by Carl Hempel and Paul Oppenheim. In this work, Hempel
and Oppenheim propose their deductive-nomological (D-N) model of scientific explanation where scientific explanations are considered as being deductive arguments
that contain essentially at least one general law in the premises. Later, in 1962, Hempel presented his inductive-statistical (I-S) model by which he proposed to analyze
the statistical scientific explanations that clearly could not be fitted into the D-N
model. (These papers were reprinted in [5].)
Because of his emphasis on the idea that explanations are arguments and his commitment to a numerical approach, Hempel’s models perfectly exemplify the deductive-inductive/probabilistic view of intra-theoretic scientific inferences. According to
Hempel’s I-S model, the general schema of non-deductive scientific explanations is
the following:

    P(G, F) = r
    Fb
    ============ [r]
    Gb
Here the first premise is a statistical law asserting that the relative frequency of Gs
among Fs is r, r being close to 1, the second stands for b having the property F, and
the expression ‘[r]’ next to the double line represents the degree of inductive probability conferred on the conclusion by the premises. Since the law represented by the
first premise is not a universal but a statistical law, the argument above is inductive
(in Carnap’s sense) rather than deductive.
If we ask, for instance, why John Jones (to use Hempel's preferred example) recovered quickly from a streptococcus infection, we would have the following argument, (1), as the answer:

    P(G, F ∧ H) = r
    Fb ∧ Hb
    ============ [r]
    Gb

where F stands for having a streptococcus infection, H for the administration of penicillin, G for quick recovery, b is John Jones, and r is a number close to 1. Given that penicillin was administered in John Jones's case (Hb) and that most (but not all) streptococcus infections clear up quickly when treated with penicillin (P(G, F ∧ H) = r, with r close to 1), the argument above constitutes the explanation for John Jones's quick recovery.
However, it is known that certain strains of streptococcus bacilli are resistant to
penicillin. If it turns out that John Jones is infected with such a strain of bacilli, then
the probability of his quick recovery after treatment with penicillin is low. In that case, we could set up the following inductive argument, (2):

    P(G, F ∧ H ∧ J) = r'
    Fb ∧ Hb ∧ Jb
    ============ [r']
    Gb

where J stands for the penicillin-resistant character of the streptococcus infection and r' is a number close to zero (consequently 1 – r' is a number close to 1).
This situation exemplifies what Hempel calls the problem of explanatory or inductive ambiguities. In the case of John Jones’s penicillin-resistant infection, we have
two inductive arguments where the premises of each argument are logically compatible and the conclusion is the same. Nevertheless, in one argument the conclusion is
strongly supported by the premises, whereas in the other the premises strongly undermine the same conclusion.
In order to solve this sort of problem, Hempel proposed his requirement of maximal specificity, or RMS. It can be explained as follows. Let s be the conjunction of the premises of the argument and k the conjunction of all statements accepted at the given time (called the knowledge situation). Then, according to Hempel, "to be rationally acceptable" in that knowledge situation the explanation must meet the following condition: if s ∧ k implies that b belongs to a class F1, and that F1 is a subclass of F, then s ∧ k must also imply a statement specifying the statistical probability of G in F1, say, P(G, F1) = r'. Here, r' must equal r unless the probability statement just cited is a theorem of mathematical probability theory.
The RMS basically intends to prevent the property or class F used in the explanation of Gb from having a subclass whose relative frequency of Gs is different from P(G, F). In order to explain Gb through Fb and a statistical law such as P(G, F) = 0.9, we need to be sure that, for all classes F1 such that F1 ⊆ F, the relative frequency of Gs among the F1s is the same as that among the Fs, that is to say, P(G, F1) = P(G, F). In other words, in order to be used in an explanation, the class F must be homogeneous with respect to G. (All these observations are valid for the new version of the RMS proposed in 1968, called RMS* [5].)
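As a rough illustration of what the requirement demands (a toy rendering of ours, not Hempel's formalism), the check below takes a finite knowledge situation listing the known statistical laws P(G, C) for various classes C, together with the known subclass relations, and reports whether a proposed explanatory class F is maximally specific with respect to G for the individual b.

```python
# Toy check of a finite fragment of the RMS: every known subclass of F to which
# b belongs must carry the same statistical probability of G as F itself.

def rms_satisfied(f_class, b_classes, superclasses_of, prob_g, tol=1e-9):
    """
    f_class         : class used in the explanation of Gb
    b_classes       : classes the individual b is known to belong to
    superclasses_of : dict mapping a class to the set of its known superclasses
    prob_g          : dict mapping a class C to the known value of P(G, C)
    """
    r = prob_g[f_class]
    for c in b_classes:
        if f_class in superclasses_of.get(c, set()):   # c is a known subclass of F containing b
            if c not in prob_g:
                return False                           # no statistical law known for c
            if abs(prob_g[c] - r) > tol:
                return False                           # subclass frequency differs from r
    return True

# John Jones: streptococcus infection treated with penicillin (F), where the
# infection belongs to the penicillin-resistant subclass with a different law.
superclasses_of = {"resistant_strep_penicillin": {"strep_penicillin"}}
prob_g = {"strep_penicillin": 0.9, "resistant_strep_penicillin": 0.05}
print(rms_satisfied("strep_penicillin",
                    {"strep_penicillin", "resistant_strep_penicillin"},
                    superclasses_of, prob_g))          # False: the RMS blocks the explanation
```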
The RMS was proposed of course because of I-S model’s inability to solve the
problem of ambiguities. Since the I-S model allows the appearance of ambiguities
and gives no adequate treatment for them, without RMS it is simply useless as a
model of intra-theoretical scientific inferences. But we can wonder: Is the situation
different with the RMS?
First of all, in its new version the I-S model allows us to classify arguments as authentic scientific inferences able to be used for explaining or predicting only if they
satisfy the RMS. It is not difficult to see that this restriction is too strong to be satisfied in practical circumstances. Suppose that we know that most streptococcus infections clear up quickly when treated with penicillin, but we do not know whether this
statistical law is applicable to all kinds of streptococcus bacillus taken separately (that
is, we do not know if the class in question is a homogeneous one). Because of this
incompleteness of our knowledge, we are not entitled to use argument (1) to explain
(or predict) the fact that John Jones had (or will have) a quick recovery. Since when
making scientific prediction, for example, we have nothing but imprecise and incomplete knowledge, the degree of knowledge required by the RMS is clearly incompatible with actual scientific practice.
Second, the only cases that the RMS succeeds in solving are those that involve
class specificity. In other words, the only kind of ambiguity that the RMS prevents
consists of that that comes from a conflict arising inside a certain class (that is, a
conflict taking place between the class and one of its subclasses.) Suppose that John
Jones has contracted HIV. As such, the probability of his quick recovery (from any
kind of infection) will be low. But given that he took penicillin and that most streptococcus infections clear up quickly when treated with penicillin, we will still have the
conclusion that he will recover quickly. Thus an ambiguity will arise. However, as
the class of HIV infected people who have an infection does not belong to the class of
individuals having a streptococcus infection which were treated with penicillin (and
nor vice-versa), the RMS will not be able to solve the conflict.
Third, sometimes the policy of preventing all kinds of contradictions may not be the best one. Suppose that the antibiotic that John Jones used in his treatment belongs to a recently developed kind of antibiotic whose creators guarantee that it cures even the known penicillin-resistant infections. The initial statistics showed 90% of successful cases. Even though this result cannot be considered definitive (due to the always-small number of cases considered in initial tests), it must be taken into account. Now, given argument (2), the same contradiction will arise. But here we do not yet know which of the two 'laws' has priority over the other: maybe the penicillin-resistant bacillus will prove to be resistant even to the new antibiotic, or maybe not. Anyway, if we reject the contradiction, as the I-S model does, and do not allow the use of these inferences, we will lose a possibly relevant part of the total set of information that could be useful or even necessary for other inferences.
3 A Nonmonotonic and Paraconsistent Solution to the Problem of Inductive Inconsistencies
Compared to the traditional probabilistic-statistical view of non-deductive intra-theoretical scientific inferences, our proposal's main shift can be summarized as follows. First, we import from AI some techniques often used there in connection with nonmonotonic reasoning to express non-deductive scientific inferences. Second, in order to prevent the appearance of ambiguities, we provide a mechanism by which exceptions to laws can be represented. This mechanism has two main advantages over Hempel's RMS: it can prevent the class-specificity ambiguities without rejecting both arguments (as Hempel's does), and it is also able to treat properly those cases of ambiguity that do not involve class specificity (which remain uncovered by Hempel's system). Finally, in order to handle the cases where the ambiguities are due to the very nature of the knowledge to be formalized and, as such, cannot be prevented, we supply a paraconsistent apparatus by which those ambiguities can be tolerated and sensibly reasoned out, without trivializing the whole set of conclusions.
Consequently, even in the presence of contradictions we can make use of all available information, achieving just the reasonable conclusions.
Our proposal takes the form of a logical system consisting of two different logics, organically connected, which are intended to operate in two different levels of reasoning. At one level, the nonmonotonic one, there is a logic able to perform non-deductive inferences. This logic is conceived in a style which resembles Reiter's default logic [13], but incorporates a very important distinction: it is able to generate extensions including contradictory conclusions obtained through the use of default rules. For this reason, it is called Inconsistent Default Logic (IDL) [1], [11]. The conclusions achieved by means of nonmonotonic inferences do not have the same epistemic status as the ones deductively derived from known facts and assumed principles. They are taken as just plausible (in opposition to certain, as far as the theory and the observations are themselves so taken). In order to explicitly recognize this epistemic fact, and thus make it formally treatable in the reasoning system, they are marked with a modal sign (?), where α? means "α is plausible". In this way, differently from traditional nonmonotonic logics, IDL is able to distinguish revisable formulae, obtained through nonmonotonic inferences, from non-refutable ones, deductively obtained.
At the second level operates a deductive logic. But here again, not a classic one, but one able to properly treat and make sensible deductions in the theory that comes out of the first level, even if it is inconsistent, as it may very well be. This feature makes this logic a paraconsistent one, but, again, not one of the already existing paraconsistent logics, such as the ones proposed by da Costa and his collaborators [3], but one specially conceived to reason properly under the given circumstances. It is called the Logic of Epistemic Inconsistencies (LEI) [1], [11]. In this logic, a distinction is made between strong contradictions, involving at least one occurrence of deductive, non-revisable knowledge, and weak contradictions, of the form α? ∧ (¬α)?, which involve just plausible conclusions. This second kind of contradiction is well tolerated and does not trivialize the theory, as the first kind still does, just as in classical logic.
The general schema of an IDL default rule can be read as "β can be nonmonotonically inferred from α unless ε". Adopting Reiter's terminology, α represents the prerequisite, β the consequent, and ε the negation of the justification, here called the exception. One important difference between Reiter's logic and ours is that the consistency of the consequent does not need to be explicitly stated in the rule: it is internally guaranteed by the definition of extension. Reiter's normal and semi-normal defaults have direct translations into this notation. Another difference is that the consequent is added to the extension necessarily with the plausibility mark ? attached to it. The definition of an IDL theory is identical to default logic's one. The definition of an IDL extension follows.
Let S be a set of closed formulae and <W, D> a closed IDL theory. Γ(S) is the smallest set satisfying the following conditions:
(i) W ⊆ Γ(S);
(ii) if Γ(S) ⊢ φ, then φ ∈ Γ(S);
(iii) if D contains a default with prerequisite α, consequent β and exception ε such that α ∈ Γ(S), ε ∉ S and S ∪ {β?} is ?-consistent, then β? ∈ Γ(S).
A set of formulae E is an extension of <W, D> iff Γ(E) = E, that is, iff E is a fixed point of the operator Γ.
The symbol ⊢ refers to the inferential relation defined by the deductive and paraconsistent logic LEI, according to which weak inconsistencies do not trivialize the theory. Similarly, the expression "?-consistent" refers to consistency, or non-trivialization, under that relation. The axiomatics of LEI is listed below; Latin letters represent ?-free formulae, and ~ is a derived negation operator defined in terms of implication and an arbitrary atomic ?-free formula.
1.–33. (The thirty-three axiom schemas of LEI, several of the quantificational ones carrying the usual provisos that t is free for x in the formula concerned, that x is not free in it, or that x – respectively ? – is a varying object.)
Axiom schema 24, which is a weaker version of the reductio ad absurdum axiom, is the key to LEI's paraconsistency. By restricting its use to situations where B is ?-free, it prevents weak contradictions from yielding everything, while at the same time allowing ?-free formulae to behave classically. Axiom schema 27 is another important one in LEI's axiomatics. It allows for the internalization and externalization of ? and ¬ with respect to each other and represents, in our view, one of the main differences between the notions of possibility and plausibility. The varying-object restriction present in some axiom schemas is needed to guarantee the universality of the deduction theorem. For more details about LEI's axiomatics (and semantics), see [11].
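A highly simplified sketch of the fixed-point flavour of IDL extensions for a finite set of literals follows; it is not the authors' system: LEI is replaced by a trivial closure, defaults are plain triples (prerequisite, consequent, exception), the forward-chaining approximation checks exceptions only against the current set, and the encodings of laws (3), (5) and (6) below are our own illustrative guesses.

```python
# Naive, illustrative IDL-like closure over literal strings; 'lit?' marks a
# plausible conclusion. Weak contradictions (p? together with ~p?) are tolerated.

def plausible(lit):
    return lit + "?"

def extension(facts, defaults):
    """defaults: iterable of (prerequisite, consequent, exception) literals."""
    e = set(facts)
    changed = True
    while changed:
        changed = False
        for pre, cons, exc in defaults:
            if pre in e and exc not in e and plausible(cons) not in e:
                e.add(plausible(cons))   # consequent enters with the plausibility mark
                changed = True
    return e

# John Jones treated with the new antibiotic: both plausible conclusions coexist.
facts = {"F_b", "J_b", "Hnew_b"}   # infection, resistant strain, new antibiotic (treatment facts folded in)
defaults = [
    ("F_b", "G_b", "J_b"),        # cf. (3): infection (treated) -> plausibly quick recovery, unless resistant
    ("J_b", "~G_b", "G_b"),       # cf. (5): resistant strain -> plausibly no quick recovery, unless recovery is known
    ("Hnew_b", "G_b", "~G_b"),    # cf. (6): new antibiotic -> plausibly quick recovery, unless the contrary is known
]
print(sorted(extension(facts, defaults)))   # contains both 'G_b?' and '~G_b?'
```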
Turning back to the problem of inductive inconsistencies: as Hempel himself acknowledged [5], the appearance of ambiguities is an inevitable phenomenon when we deal with non-deductive inferences. Surprisingly enough, all the cases of inductive ambiguity identified by Hempel are due not to this suggested connection between
ambiguity and induction, but to the incapacity of his probabilistic approach to represent properly the situations in question.
Consider again John Jones’s example. The situation exposed in section 2 can be
formalized in IDL as follows:
Here (3) is a default schema saying that if someone has a streptococcus infection and was treated with penicillin, then it is plausible that he will have a quick recovery, unless it is verified that the streptococcus is penicillin-resistant. (4) states that John Jones has a streptococcus infection and that he took penicillin. Given W = {(4)} and D = {(3)} as the IDL theory, the only extension of <W, D> is the set of all formulae that can be inferred through ⊢ from (4) together with Gb?.
Suppose now that we have got the new information that John Jones's streptococcus is penicillin-resistant. We represent this through the formula Jb. As in Hempel's formalism, if someone is infected with a penicillin-resistant bacillus, it is not plausible that he will have a quick recovery after the treatment with penicillin (unless we know that he will recover quickly). This can be represented by a second default schema, (5). Given W' = W ∪ {Jb} and D' = {(3), (5)} as our new IDL theory, the only extension of <W', D'> is the set of formulae inferable from W' together with ¬Gb?.
Since in Hempel’s approach there is no connection between laws (1) and (2), the
conclusion of
has no effect upon the old conclusion Gb. Here however it is
being represented the priority that we know law (5) must have over law (3): the
clauses Jx in the exception of (3) and Jx in the prerequisite of (5) taken together mean
that if (5) can be used for inferring, for example
(3) cannot be used for inferring Gb?. So, if after using law (3) we get new information that enable us to use law
(5), since in the light of the new state of knowledge law (3)’s utilization is not possible, we have to give up the previous conclusion got from this law. So, since
we are no longer entitled to infer Gb? from (3). The only plausible fact that we can
conclude from (3) and (5) is
As such, in contrast to Hempel’s approach, we do
not have the undesirable consequence that it is plausible (or in Hempel’s approach,
high probable) that John will quickly recover and it is plausible that he will not.
As we have said, in this specific case we know that law (5) has a kind of priority over law (3), in the sense that if (5) holds, (3) does not hold. As we did in Section 2, suppose now that the antibiotic that John Jones used in his treatment belongs to a recently developed kind of antibiotic whose creators guarantee that it cures even the known penicillin-resistant infections. The initial statistics showed 90% of successful
cases, but due to the always-small number of initial cases this result cannot be considered definitive. Even so, we can set up the following tentative law, (6), where H' stands for the administration of the new kind of antibiotic. To complete the formalization we add the corresponding factual formulae about John Jones's treatment (in particular H'b). Given W'' and D'' (the latter containing laws (5) and (6)) as our new IDL theory, the extension of <W'', D''> now contains both Gb? and ¬Gb?.
In this case, we do not know which of the two 'laws' has priority over the other. Maybe the penicillin-resistant bacillus will prove to be resistant even to the new antibiotic, or maybe not. Instead of rejecting both conclusions, as the I-S model with its RMS would do, we defend that a better solution is to keep reasoning even in the presence of such an ambiguity, but without allowing everything to be deduced from it. Formally this is possible because of the restriction imposed by LEI's non-contradiction axiom already shown. However, if a modification that resolves the conflict is made in the set of facts (a change in (5) representing the definitive success of the new kind of penicillin, for example), IDL's nonmonotonic inferential mechanism will update the extension and exclude one of the two contradictory conclusions.
Finally, the HIV example can be easily solved in the following way.
Here A stands for having contracted HIV and I for having an infection. The solution is similar to our first example. Since (7) has priority over (3’), we will be able to
conclude only ¬Gb? and consequently the ambiguity will not arise.
We have shown, therefore, that our formalism solves the three problems identified in Hempel's I-S model. One consideration remains to be made. Hempel's main intention with the introduction of the I-S model was to analyze the scientific inferences which contain statistical laws. At first glance, it is quite fair to conclude that in those cases where something akin to a relative frequency is involved, a qualitative approach like ours will not have the same representative power as a quantitative one. However, there are several ways we can "turn" our qualitative approach into a quantitative one so as to represent how plausible a formula is. For instance, we could drop axiom 29 so as to allow the weakening of the "degree of plausibility" of formulae: α? would represent the highest plausibility status a formula may have, which could be weakened by additional ?'s. In this way, a default could represent the statistical probability of a law by changing the quantity of ?'s attached to its conclusion. A somewhat inverse path could also be undertaken. In LEI's semantics, a Kripke possible-worlds structure is used (in our case we call them plausible worlds)
to evaluate sentences, in such a way that α? is true iff α is true in at least one plausible world [11]. We could then define a family of abbreviations, built from α, the sign ? and distinct ?-free atomic formulae p and q, in such a way that the index n of the n-th abbreviation says in how many plausible worlds α is true.
References
1. Buchsbaum, A., Pequeno, T., Pequeno, M.: The Logical Expression of Reasoning. To appear in: Béziau, J., Krause, D. (eds.): New Trends in Foundations of Science. Papers Dedicated to the Eightieth Birthday of Patrick Suppes. Kluwer, Dordrecht (2004).
2. Carnap, R.: Logical Foundations of Probability. U. of Chicago Press, Chicago (1950)
3. da Costa, N. C. A.: On the Theory of Inconsistent Formal Systems. Notre Dame Journal of
Formal Logic 15 (1974) 497–510.
4. Ford, M., Glymour, C., Hayes, P. (eds.): Android Epistemology. The MIT Press (1995).
5. Hempel, C. G.: Aspects of Scientific Explanation and Other Essays in the Philosophy of
Science. Free Press, New York (1965)
6. Hintikka, J.: A Two-Dimensional Continuum of Inductive Methods. In: Hintikka, J., Suppes P. (eds.): Aspects of Inductive Logic. North Holland, Amsterdam (1966).
7. Kemeny, J.: Fair Bets and Inductive Probabilities. Journal of Symbolic Logic 20 (1955)
263–273.
8. Kyburg, H.: Uncertain Logics. In: Gabbay, D., Hogger, C., Robinson, J. (eds.): Handbook of
Logic in Artificial Intelligence and Logic Programming, Vol. 3, Nonmonotonic Reasoning
and Uncertain Reasoning. Oxford University Press, Oxford (1994).
9. McCarthy, J.: Applications of Circumscription to Formalizing Commonsense Knowledge.
Artificial Intelligence 26 (1986) 89–116.
10. Moore, R.: Semantic Considerations on Nonmonotonic Logic. Artificial Intelligence 25
(1985) 75–94.
11. Pequeno, T., Buchsbaum, A.: The Logic of Epistemic Inconsistency. In: Allen, J., Fikes, R.,
Sandewall, E. (eds.): Principles of Knowledge Representation and Reasoning: Proceedings
of Second International Conference. Morgan Kaufmann, San Mateo (1991) 453-460.
12. Pollock, J. L: The Building of Oscar. Philosophical Perspectives 2 (1988) 315–344
13. Reiter, R.: A Logic for Default Reasoning. Artificial Intelligence 13 (1980) 81–132
14. Silvestre, R., Pequeno, T: A Logical Treatment of Scientific Anomalies. In: Arabnia, H,
Joshua, R., Mun, Y. (eds.): Proceedings of the 2003 International Conference on Artificial
Intelligence, CSREA Press, Las Vegas (2003) 669-675.
15. Tan, Y. H.: Is Default Logic a Reinvention of I-S Reasoning? Synthese 110 (1997) 357–
379.
Paraconsistent Sensitivity Analysis
for Bayesian Significance Tests
Julio Michael Stern
BIOINFO and Computer Science Dept., University of São Paulo
[email protected]
Abstract. In this paper, the notion of degree of inconsistency is introduced as a tool to evaluate the sensitivity of the Full Bayesian Significance Test (FBST) value of evidence with respect to changes in the prior
or reference density. For that, both the definition of the FBST, a possibilistic approach to hypothesis testing based on Bayesian probability
procedures, and the use of bilattice structures, as introduced by Ginsberg
and Fitting, in paraconsistent logics, are reviewed. The computational
and theoretical advantages of using the proposed degree-of-inconsistency-based sensitivity evaluation as an alternative to traditional statistical power analysis are also discussed.
Keywords: Hybrid probability / possibility analysis; Hypothesis test;
Paraconsistent logic; Uncertainty representation.
1 Introduction and Summary
The Full Bayesian Significance Test (FBST), first presented in [25], is a coherent
Bayesian significance test for sharp hypotheses. As explained in [25], [23], [24]
and [29], the FBST is based on a possibilistic value of evidence, defined by coherent Bayesian probability procedures. To evaluate the sensitivity of the FBST
value of evidence with respect to changes in the prior density, a notion of degree
of inconsistency is introduced and used. Despite the possibilistic nature of the
uncertainty given by the degree of inconsistency defined herein, its interpretation is similar to standard probabilistic error bars used in statistics. Formally,
however, this is given in the framework of the bilattice structure, used to represent inconsistency in paraconsistent logics. Furthermore, it is also proposed
that, in some situations, the degree of inconsistency based sensitivity evaluation
of the FBST value of evidence, with respect to changes in the prior density, be
used as an alternative to traditional statistical power analysis, with significant
computational and theoretical advantages. The definition of the FBST and its
use are reviewed in Section 2. In Section 3, the notion of degree of inconsistency
is defined, interpreted and used to evaluate the sensitivity of the FBST value of
evidence, with respect to changes in the prior density. In Section 4, two illustrative numerical examples are given. Final comments and directions for further
research are presented in Section 5. The bilattice structure, used to represent
inconsistency in paraconsistent logics is reviewed in the appendix.
2 The FBST Value of Evidence
Let θ ∈ Θ ⊆ R^p be a vector parameter of interest, and L(θ | x) the likelihood associated to the observed data x in a standard statistical model. Under the Bayesian paradigm the posterior density, π_n(θ), is proportional to the product of the likelihood and a prior density π_0(θ), that is,

    π_n(θ) ∝ L(θ | x) π_0(θ).

The (null) hypothesis H states that the parameter lies in the null set Θ_H, defined by

    Θ_H = {θ ∈ Θ : g(θ) ≤ 0 ∧ h(θ) = 0},

where g and h are (vector) functions defined on the parameter space. Herein, however, interest will rest particularly upon sharp (precise) hypotheses, i.e., those for which dim(Θ_H) < dim(Θ).
The posterior surprise, s(θ), relative to a given reference density r(θ), is given by

    s(θ) = π_n(θ) / r(θ).

The relative surprise function was used by several other statisticians, see [19], [20] and [13]. The supremum of the relative surprise function over a given subset Θ_H of the parameter space will be denoted by s*(Θ_H), that is,

    s*(Θ_H) = sup {s(θ) : θ ∈ Θ_H}.
Despite the importance of making a conceptual distinction between the statement of a statistical hypothesis, H, and the corresponding null set, Θ_H, one often relaxes the formalism and refers simply to the hypothesis H instead of the null set Θ_H. In the same manner, when some or all of the argument functions g and h are clear from the context, they may be omitted in a simplified notation, and s* or s*(H) are acceptable alternatives for s*(Θ_H).
The contour or level sets of the relative surprise function, and in particular the Highest Relative Surprise Set (HRSS) at a given level v, are given by

    T(v) = {θ ∈ Θ : s(θ) > v}.

The FBST value of evidence against a hypothesis H, Ev(H), is defined by

    Ev(H) = Pr(θ ∈ T(H) | x),  where T(H) = T(s*(Θ_H)).
The tangential HRSS, T(H) = T(s*(Θ_H)), contains the points in the parameter space whose surprise, relative to the reference density, is higher than that of
any other point in the null set Θ_H. When the uniform reference density, r(θ) ∝ 1, is used, T(H) is the posterior's Highest Density Probability Set (HDPS) tangential to the null set Θ_H.
The role of the reference density in the FBST is to make Ev(H) implicitly
invariant under suitable transformations of the coordinate system. Invariance,
as used in statistics, is a metric concept. The reference density is just a compact
and interpretable representation for the reference metric in the original parameter space. This metric is given by the geodesic distance on the density surface,
see [7] and [24]. The natural choice of reference density is an uninformative prior,
interpreted as a representation of no information in the parameter space, or the
limit prior for no observations, or the neutral ground state for the Bayesian operation. Standard (possibly improper) uninformative priors include the uniform
and maximum entropy densities, see [11], [18] and [21] for a detailed discussion.
The value of evidence against a hypothesis H has the following interpretation: "small" values of Ev(H) indicate that the posterior density puts low probability mass on values of θ with high relative surprise as compared to the values in Θ_H, thus providing weak evidence against hypothesis H. On the other hand, if the posterior probability of T(H) is "large", that is, for "large" values of Ev(H), values of θ with high relative surprise as compared to the values in Θ_H have high posterior density, and the data thus provide strong evidence against the hypothesis H. Furthermore, the FBST is "fully" coherent with the Bayesian likelihood principle, that is, the principle that the information gathered from observations is represented by (and only by) the likelihood function.
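As a numerical illustration (not from the paper), Ev(H) can be approximated by plain Monte Carlo: estimate the supremum of the surprise over the null set, then measure the posterior mass of the tangential set. The sketch below assumes user-supplied posterior draws, log-densities and a grid (or optimizer output) covering Θ_H; all names are ours.

```python
import numpy as np

def fbst_evidence_against(posterior_sample, log_post, log_ref, null_grid):
    """
    Monte Carlo approximation of Ev(H), the FBST value of evidence against H.

    posterior_sample : (N, p) array of draws from the posterior
    log_post, log_ref: callables giving the log posterior and log reference densities
    null_grid        : (M, p) array of points covering the null set Theta_H
    """
    log_s = lambda th: log_post(th) - log_ref(th)        # log relative surprise s(theta)
    s_star = max(log_s(th) for th in null_grid)          # sup of the surprise over Theta_H
    in_tangential = [log_s(th) > s_star for th in posterior_sample]
    return float(np.mean(in_tangential))                 # posterior mass of T(H)
```

For a model such as the Hardy-Weinberg example of Section 4, for instance, the posterior draws could come from the conjugate Dirichlet posterior, with the null set parametrized by a fine grid over one of the homozygote parameters.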
3 Prior Sensitivity and Inconsistency
For a given likelihood and reference density, let Ev_i(H) denote the value of evidence against a hypothesis H with respect to the prior π_i, for priors π_1, ..., π_k. The degree of inconsistency of the value of evidence against H induced by this set of priors can be defined by the index

    max_i Ev_i(H) − min_i Ev_i(H).

This intuitive measure of inconsistency can be made rigorous in the context of paraconsistent logic and bilattice structures, see the appendix. If Ev(H) is the value of evidence against H, the value of evidence in favor of H is defined as 1 − Ev(H). The pair formed by the evidence in favor and the evidence against, (1 − Ev(H), Ev(H)), is a point in the unit square bilattice and represents herein a single evidence. Since its coordinates sum to one, such a point is consistent. It is also easy to verify that, for multiple evidence values, the degree of inconsistency defined above is exactly the degree of inconsistency of the knowledge join of all the single evidence points in the unit square bilattice.
As shown in [29], the value of evidence in favor of a composite hypothesis is the most favorable value of evidence in favor of each of its terms, i.e., the evidence in favor of a disjunction is the maximum of the evidences in favor of its disjuncts. This makes the evidence a possibilistic (partial) support structure coexisting with the probabilistic support structure given by the posterior probability measure in the parameter space, see [10] and [29]. The degree of inconsistency of the evidence against H induced by multiple changes of the prior can be used as an index of imprecision, or fuzziness, of the value of evidence Ev(H). Moreover, it can also be interpreted within the possibilistic context of the partial support structure given by the evidence. Some of the alternative ways of measuring the uncertainty of the value of evidence Ev(H), such as the empirical power analysis, have a dual possibilistic / probabilistic interpretation, see [28] and [22]. The degree of inconsistency also has the practical advantage of being "inexpensive": given a few changes of prior, the calculation of the resulting inconsistency requires about the same work as computing Ev(H). In contrast, an empirical power analysis requires much more computational work than is required to compute a single evidence.
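The sketch below (illustrative; the coordinate convention for the unit square bilattice is an assumption of ours) computes the index either directly, as the range of the evidence values obtained under the different priors, or as the degree of inconsistency of their knowledge join.

```python
def knowledge_join(points):
    """Knowledge join in the unit square bilattice: componentwise maximum."""
    return (max(t for t, _ in points), max(f for _, f in points))

def inconsistency_degree(point):
    """Degree of inconsistency of a bilattice point (zero for consistent points)."""
    favor, against = point
    return favor + against - 1.0

def prior_inconsistency(ev_against):
    """Index induced by the evidences against H computed under several priors."""
    points = [(1.0 - ev, ev) for ev in ev_against]   # (in favor, against), each consistent
    direct = max(ev_against) - min(ev_against)
    assert abs(inconsistency_degree(knowledge_join(points)) - direct) < 1e-12
    return direct

print(prior_inconsistency([0.12, 0.17, 0.09]))       # ~0.08
```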
4 Numerical Examples
In this paper we concentrate on two simple model examples: the Hardy-Weinberg (HW) Equilibrium Law model and the Coefficient of Variation model.
The HW equilibrium is a genetic model for a sample of n individuals, where x1 and x3 are the two homozygote sample counts and x2 = n − x1 − x3 is the heterozygote sample count. The parameter vector for this trinomial model is θ = (θ1, θ2, θ3), with parameter space Θ = {θ ≥ 0 : θ1 + θ2 + θ3 = 1}, null hypothesis set Θ_H = {θ ∈ Θ : θ3 = (1 − √θ1)²}, likelihood L(θ | x) ∝ θ1^x1 θ2^x2 θ3^x3, and prior and reference densities of the same Dirichlet-type form, represented below by their "observation counts".
For the Coefficient of Variation model, a test for the coefficient of variation of a normal variable with given mean and precision, the parameter space, the null hypothesis set, the maximum entropy prior, the reference density, and the likelihood density are given below.
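One common noninformative formulation of this test is sketched here; the exact prior and reference used in the paper may differ, so everything below should be read as an assumption. Writing c for the hypothesized coefficient of variation (the ratio of standard deviation to mean), with mean mu and precision rho:

$$ \Theta = \{(\mu,\rho) : \mu \in \mathbb{R},\ \rho > 0\}, \qquad \Theta_H = \{(\mu,\rho) \in \Theta : 1/(\mu\sqrt{\rho}) = c\}, $$
$$ p_0(\mu,\rho) \propto 1/\rho, \qquad r(\mu,\rho) \propto 1/\rho, \qquad L(\mu,\rho \mid x) \propto \rho^{n/2}\exp\Bigl(-\tfrac{\rho}{2}\sum_{i=1}^{n}(x_i-\mu)^2\Bigr). $$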
Figure 1 displays the elements of a value of evidence against the hypothesis,
computed for the HW (Left) and Coefficient of Variation (Right) models. The
null set is represented by a dashed line. The contour line of the posterior, delimiting the tangential set, is represented by a solid line. The posterior unconstrained maximum is represented by “o”, and the posterior maximum constrained to the null set is marked by a second symbol.
Fig. 1. FBST for Hardy-Weinberg (L) and Coefficient of Variation (R)
In order to perform the sensitivity analysis several priors have to be used.
Uninformative priors are used to represent no previous observations, see [16],
[21] and [31] for a detailed discussion.
For the HW model we use as uninformative priors the uniform density, which can be represented as [0, 0, 0] observation counts, and also the standard maximum entropy density, which can be represented as [–1, –1, –1] observation counts. Between these two uninformative priors, we also consider perturbation priors corresponding to [–1, 0, 0], [0, –1, 0] and [0, 0, –1] observation counts. Each of these priors can be interpreted as the exclusion of a single observation of the corresponding type from the data set.
Finally, we consider the dual perturbation priors corresponding to [1, 0, 0],
[0, 1, 0] and [0, 0, 1] observation counts. The term dual is used meaning that, instead of an exclusion, these priors can be interpreted as the inclusion of a single artificial observation of the corresponding type in the data set.
The examples in the top part of Table 1 are given by the sample size and proportions,
where the HW hypothesis is true.
For the Coefficient of Variation model we use as uninformative priors the
uniform density for the mean, and either the standard maximum entropy density or the uniform density for the precision. We also consider (with the uniform prior) perturbations given by the inclusion in the data set of an artificial observation at fixed quantiles of the predictive posterior, in this case, at three standard deviations below or above the mean.
The examples in the bottom part of Table 2 are given by cv = 0.1, the sample size, and the sufficient statistics, with std = 1.2, where the hypothesis is false.
In order to get a feeling for the asymptotic behavior of the evidence and the inconsistency, the calculations are repeated for the same sufficient statistics but for sample sizes taking values in a convenient range. In Figure 2, the maximum and minimum values of evidence against the hypothesis H, among all choices of priors used in the sensitivity analysis, are given by the interpolated dashed lines. For the HW model (Table 1 and Figure 2, top) and for the Coefficient of Variation model (Table 1 and Figure 2, bottom), the sample sizes range over convenient intervals.
In Figure 2, the induced degree
of inconsistency is given by the vertical distance between the dashed lines. The
interpretation of the vertical interval between the lines in Figure 2 (solid bars) is
similar to that of the usual statistical error bars. However, in contrast with the
empirical power analysis developed in [28] and [22], the uncertainty represented
by these bars does not have a probabilistic nature, being rather a possibilistic
measure of inconsistency, defined in the partial support structure given by the
FBST evidence, see [29].
Fig. 2. Sensitivity Analysis for Ev(H)
5 Directions for Further Research and Acknowledgements
For complex models, the sensitivity analysis in the last section can be generalized
using perturbations generated by the inclusion of single artificial observations
created at (or the exclusion of single observations near) fixed quantiles of some convenient statistics of the predictive posterior. Perturbations generated by the exclusion of the most extreme observations, according to some convenient criteria, could also be considered. For the consistency of the sensitivity analysis when the model allows the data set to be summarized by sufficient statistics in the form of L-estimators, see [4], Section 8.6. The asymptotic behavior of the
sensitivity analysis for several classes of models and perturbations is the subject
of forthcoming articles.
Finally, perturbations to the reference density, instead of to the prior, could
be considered. One advantage of this approach is that, when computing the
evidence, only the integration limit, i.e. the threshold, is changed, while the
integrand, i.e. the posterior density, remains the same. Hence, when computing
Ev(H), only little additional work is required for the inconsistency analysis.
The author has benefited from the support of FAPESP, CNPq, BIOINFO, the
Computer Science Department of the University of São Paulo, Brazil, and the Mathematical Sciences Department at SUNY-Binghamton, USA. The author is grateful to many of his colleagues, most especially Jair Minoro Abe, Wagner Borges,
Joseph Kadane, Marcelo Lauretto, Fabio Nakano, Carlos Alberto de Bragança
Pereira, Sergio Wechsler, and Shelemyahu Zacks. The author can be reached at
[email protected] .
References
1. Abe,J.M. Avila,B.C. Prado,J.P.A. (1998). Multi-Agents and Inconsistence. ICCIMA’98. 2nd International Conference on Computational Intelligence and Multimedia Applications. Traralgon, Australia.
2. Alcantara,J. Damasio,C.V. Pereira,L.M. (2002). Paraconsistent Logic Programs.
JELIA-02. 8th European Conference on Logics in Artificial Intelligence. Lecture
Notes in Computer Science, 2424, 345–356.
3. Arieli,O. Avron,A. (1996). Reasoning with Logical Bilattices. Journal of Logic,
Language and Information, 5, 25–63.
4. Arnold,B.C. Balakrishnan,N. Nagaraja.H.N. (1992). A First Course in Order
Statistics. NY: Wiley.
5. C.M.Barros, N.C.A.Costa, J.M.Abe (1995). Tópicos de Teoria dos Sistemas Ordenados. Lógica e Teoria da Ciência, 17,18,19. IEA, Univ. São Paulo.
6. N.D.Belnap (1977). A useful four-valued logic, pp 8–37 in G.Epstein, J.M.Dunn,
Modern uses of Multiple Valued Logics. Dordrecht: Reidel.
7. Boothby,W. (2002). An Introduction to Differential Manifolds and Riemannian
Geometry. Academic Press, NY.
8. N.C.A.Costa, V.S.Subrahmanian (1989). Paraconsistent Logics as a Formalism for
Reasoning about Inconsistent Knowledge Bases. Artificial Intelligence in Medicine,
1, 167–174.
9. Costa,N.C.A. Abe,J.M. Subrahmanian,V.S. (1991). Remarks on Annotated Logic.
Zeitschrift für Mathematische Logik und Grundlagen der Mathematik, 37, 561–570.
10. Darwiche,A.Y. Ginsberg,M.L. (1992). A Symbolic Generalization of Probability
Theory. AAAI-92. 10th Natnl. Conf. on Artificial Intelligence. San Jose, USA.
11. Dugdale,J.S. (1996). Entropy and its Physical Meaning. Taylor-Francis, London.
12. Epstein,G. (1993). Multiple-Valued Logic Design. Inst.of Physics, Bristol.
13. M.Evans (1997). Bayesian Inference Procedures Derived via the Concept of Relative Surprise. Communications in Statistics, 26, 1125–1143.
14. M. Fitting (1988). Logic Programming on a Topological Bilattice. Fundamenta
Informaticae, 11, 209–218.
15. Fitting,M. (1989). Bilattices and Theory of Truth. J. Phil. Logic, 18, 225–256.
16. M.H.DeGroot (1970). Optimal Statistical Decisions. NY: McGraw-Hill.
17. Ginsberg,M.L. (1988). Multivalued Logics. Computat. Intelligence, 4, 265–316.
18. Gokhale,D.V. (1999). On Joint and Conditional Entropies. Entropy Journal, 1, 21–24.
19. Good,I.J. (1983). Good Thinking. Univ. of Minnesota.
20. Good,I.J. (1989). Surprise indices and p-values. J. Statistical Computation and
Simulation, 32, 90–92.
21. Kapur,J.N. (1989). Maximum Entropy Models in Science and Engineering. Wiley, NY.
22. Lauretto,M. Pereira,C.A.B. Stern,J.M. Zacks,S. (2004). Comparing Parameters
of Two Bivariate Normal Distributions Using the Invariant FBST. To appear,
Brazilian Journal of Probability and Statistics.
23. Madruga,M.R. Esteves,L.G. Wechsler,S. (2001). On the Bayesianity of Pereira-Stern Tests. Test, 10, 291–299.
24. Madruga,M.R. Pereira,C.A.B. Stern,J.M. (2003). Bayesian Evidence Test for Precise Hypotheses. Journal of Statistical Planning and Inference, 117,185–198.
25. Pereira,C.A.B. Stern,J.M. (1999). Evidence and Credibility: Full Bayesian Significance Test for Precise Hypotheses. Entropy Journal, 1, 69–80.
26. Pereira,C.A.B. Stern,J.M. (2001). Model Selection: Full Bayesian Approach. Environmetrics, 12, 559–568.
27. Perny,P. Tsoukias,A. (1998). On the Continuous Extension of a Four Valued Logic
for Preference Modelling. IPMU-98. 7th Conf. on Information Processing and
Management of Uncertainty in Knowledge Based Systems. Paris, France.
28. Stern,J.M. Zacks,S. (2002). Testing the Independence of Poisson Variates under
the Holgate Bivariate Distribution, The Power of a New Evidence Test. Statistics and Probability Letters, 60, 313–320.
29. Stern,J.M. (2003). Significance Tests, Belief Calculi, and Burden of Proof in Legal
and Scientific Discourse. Laptec-2003, 4th Cong. Logic Applied to Technology.
Frontiers in Artificial Intelligence and its Applications, 101, 139–147.
30. Zadeh,L.A. (1987). Fuzzy Sets and Applications. Wiley, NY.
31. Zellner,A. (1971). Introduction to Bayesian Inference in Econometrics. NY:Wiley.
Appendix: Bilattices
Several formalisms for reasoning under uncertainty rely on ordered and lattice
structures, see [5], [6], [8], [9], [14], [15], [17], [30] and others. In this section we
recall the basic bilattice structure, and give an important example. Herein, the
presentation in [2] and [3] is followed.
Given two complete lattices, C and D, the bilattice B(C, D) has two orders, the knowledge order and the truth order, given below.
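Writing elements of B(C, D) as pairs ⟨c, d⟩, with c in C and d in D, the standard definitions (as in Ginsberg (1988) and [3]; the notation is assumed here) are

$$ \langle c_1, d_1\rangle \le_k \langle c_2, d_2\rangle \iff c_1 \le_C c_2 \ \text{and}\ d_1 \le_D d_2, $$
$$ \langle c_1, d_1\rangle \le_t \langle c_2, d_2\rangle \iff c_1 \le_C c_2 \ \text{and}\ d_2 \le_D d_1 . $$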
The standard interpretation is that C provides the “credibility” or “evidence
in favor” of a hypothesis (or statement) H, and D provides the “doubt” or
“evidence against” H. If ⟨c1, d1⟩ ≤k ⟨c2, d2⟩, then we have more information (even if inconsistent) about situation 2 than about situation 1. Analogously, if ⟨c1, d1⟩ ≤t ⟨c2, d2⟩, then we have more reason to trust (or believe) situation 2 than situation 1 (even if with less information).
For each of the bilattice orders we define a join and a meet operator, based on the join and the meet operators of the single lattice orders. More precisely, the join and meet for the truth order, and the join and meet for the knowledge order, are defined by the following equations.
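Using the symbols ∧, ∨ for the truth order and ⊗, ⊕ for the knowledge order (the symbols themselves are an assumption, but the construction is the standard one):

$$ \langle c_1,d_1\rangle \wedge \langle c_2,d_2\rangle = \langle c_1\wedge_C c_2,\ d_1\vee_D d_2\rangle, \qquad \langle c_1,d_1\rangle \vee \langle c_2,d_2\rangle = \langle c_1\vee_C c_2,\ d_1\wedge_D d_2\rangle, $$
$$ \langle c_1,d_1\rangle \otimes \langle c_2,d_2\rangle = \langle c_1\wedge_C c_2,\ d_1\wedge_D d_2\rangle, \qquad \langle c_1,d_1\rangle \oplus \langle c_2,d_2\rangle = \langle c_1\vee_C c_2,\ d_1\vee_D d_2\rangle . $$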
Negation-type operators are not an integral part of the basic bilattice structure. Ginsberg (1988) and Fitting (1989) require possible “negation”, ¬, and “conflation”, –, operators to be compatible with the bilattice orders, and to satisfy the double negation property:
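These requirements are usually stated as follows; the grouping into the labels Ng1–Ng3 and Cf1–Cf3 matches the discussion in the next sentences, but the exact numbering of the first two properties in each group is an assumption:

$$ \text{(Ng1)}\ x \le_k y \Rightarrow \lnot x \le_k \lnot y, \qquad \text{(Ng2)}\ x \le_t y \Rightarrow \lnot y \le_t \lnot x, \qquad \text{(Ng3)}\ \lnot\lnot x = x, $$
$$ \text{(Cf1)}\ x \le_t y \Rightarrow -x \le_t -y, \qquad \text{(Cf2)}\ x \le_k y \Rightarrow -y \le_k -x, \qquad \text{(Cf3)}\ --x = x . $$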
Hence, negation should reverse trust, but preserve knowledge, and conflation
should reverse knowledge, but preserve trust. If the double negation property is
not satisfied (Ng3 or Cf3) the operators are called weak (negation or conflation).
The “unit square” bilattice, built over the unit interval, has been routinely used to represent fuzzy or rough pertinence relations, logical probabilistic annotations, etc. Examples can be found in [1], [9], [12], [27], [30] and others. The underlying lattice is the standard unit interval, where the join and meet coincide with the max and min operators. The standard negation and conflation operators, and the “truth”, “false”, “inconsistency” and “indetermination” extremes of the unit square bilattice, whose coordinates are given in Table 3, are recalled below.
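In the unit square these operators and extremes take the usual form (assumed here, consistent with the reading of the first coordinate as credibility and the second as doubt):

$$ \lnot\langle c,d\rangle = \langle d,c\rangle, \qquad -\langle c,d\rangle = \langle 1-d,\ 1-c\rangle, $$
$$ t=\langle 1,0\rangle, \qquad f=\langle 0,1\rangle, \qquad \top=\langle 1,1\rangle, \qquad \bot=\langle 0,0\rangle . $$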
As a simple example, let region R be the convex hull of the four vertices given in Table 3. Points kj, km, tj and tm are the knowledge and truth join and meet over R.
In the unit square bilattice, the degree of trust and the degree of inconsistency for a point are given by a convenient linear reparameterization of the unit square, defined below.
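A form consistent with the (BT, BI) coordinates of Figure 3 and with the consistency of single evidence points discussed in Section 3 is assumed here:

$$ BT(c,d) = c - d, \qquad BI(c,d) = c + d - 1, $$

which maps the unit square onto the square [-1, 1] × [-1, 1], with BI = 0 on the consistent diagonal c + d = 1.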
Fig. 3. Points in Table 3, using the original and the (BT, BI) coordinates
Figure 3 shows the points in Table 3 in the unit square bilattice, also using the
trust-inconsistency reparameterization.
An Ontology for Quantities in Ecology
Virgínia Brilhante
Computing Science Department, Federal University of Amazonas
Av. Gen. Rodrigo O. J. Ramos, 3000, Manaus – AM, 69060-020, Brazil
[email protected]
Abstract. Ecolingua is an ontology for ecological quantitative data,
which has been designed through reuse of a conceptualisation of quantities and their physical dimensions provided by the EngMath family of
ontologies. A hierarchy of ecological quantity classes is presented together
with their definition axioms in first-order logic. An implementation-level
application of the ontology is discussed, where conceptual ecological models can be synthesised from data descriptions in Ecolingua through reuse
of existing model structures.
Keywords: Ontology reuse, engineering and application; ecological data;
model synthesis.
1 Introduction
The Ecolingua ontology brings a contribution towards a conceptualisation of the
Ecology domain by formalising properties of ecological quantitative data that
typically feed simulation models. Building on the EngMath family of ontologies
[6], data classes are characterised in terms of physical dimensions, which are a
fundamental property of physical quantities in general. The ontology has been
developed as part of a research project on model synthesis based on metadata
and ontology-enabled reuse of model designs [1].
We start by briefly referring to other works on ontologies in the environmental
sciences domain in Sect. 2, followed by a discussion on the reuse of the EngMath
ontology in the development of Ecolingua in Sect. 3. Section 4 is the core of the
paper, presenting the concepts in upper-level Ecolingua. In Section 5 we give a
summary description of an application of Ecolingua in the synthesis of conceptual
ecological models with the desirable feature of consistency with respect to the
properties of their supporting data. Finally, conclusions and considerations on
future work appear in Sect. 6.
2 Environmental Ontologies
There has been little research on the intersection between ontologies and environmental sciences despite the need for a unifying conceptualisation to reconcile
conflicts of meaning amongst the many fields – biology, geology, law, computing
science, etc. – that draw on environmental concepts. The work by B.N. Niven
[9,10] proposes a formalisation of general concepts in animal and plant ecology,
such as environment, niche and community. Although taxonomies have been in
use in the field for a long time, this work is the earliest we are aware of where
concepts are defined in the shape of what we call today a formal ontology. Other
developments have been EDEN, an informal ontology of general environmental
concepts designed to give support to environmental information retrieval [7], and
an ontology of environmental pollutants [13], built in part through reuse of a
chemical elements ontology. These more recent ontologies have a low degree of
formality, lacking axiomatic definitions.
3 Quantities in Ecology and EngMath Reuse
The bulk of ecological data consists of numeric values representing measurements of attributes of entities and processes in ecological systems. The most
intrinsic property of such a measurement value lies in the physical nature, or
dimension, of what the value quantifies [3]. For example, a measure of weight is
intrinsically different from a measure of distance because they belong to different
physical dimensions, mass1 and length respectively. The understanding of this
fundamental relation between ecological measurements and physical dimensions
drew our attention towards the EngMath family of ontologies, which is publicly
available in the Ontolingua Server [11]. All defined properties in EngMath’s conceptualisation of constant, scalar, physical quantities are applicable to ecological
measurements:
1. Every ecological measurement has an intrinsic physical dimension – e.g. vegetation biomass is of the mass dimension, the height of a tree is of the length
dimension;
2. The physical dimension of an ecological measurement can be a composition of
other dimensions through multiplication and exponentiation to a real power
– e.g. the amount of a fertiliser applied to soil every month has the composite
dimension mass/time;
3. Ecological measurements can be dimensionless – e.g. number of individuals
in a population; and can be non-physical – e.g. profit from a fishing harvest;
4. Comparisons and algebraic operations (including unit conversion) can be
meaningfully applied to ecological measurements, provided that their dimensions are homogeneous – e.g. you could add or compare an amount of
some chemical to an amount of biomass (both of the mass dimension).
Also relevant to Ecolingua is the EngMath conceptualisation of units of measure,
which are also physical quantities, but established by convention as an absolute
amount of something to be used as a standard reference for quantities of the
same dimension. Therefore, one can identify the physical dimension of a quantity from the unit of measure in which it is expressed [8]. Since Ecolingua is an ontology for the description of ecological data, instantiation of its terms, as we shall
1 Or force, if rigorously interpreted (see Sect. 4.4).
see in Sect. 4, requires the specification of a quantity’s unit of measure. In this
way, describing a data set in Ecolingua does not demand additional effort in the
sense that it is of course commonplace to have the units of measure of the data
on hand, whereas the data’s physical dimensions are not part of the everyday
vocabulary of ecologists and modellers. Lower-level Ecolingua, available in [1],
includes a detailed axiomatisation of units and scales of measurement, including
their dimensions, base and derived units, and the SI system (Système International d’Unités), which allows for automatic elicitation of a quantity’s physical
dimension from the unit or scale in which it is specified.
4 Quantity Classes
Ecolingua’s class hierarchy can be seen in Fig. 1. The hierarchy comprises Ecolingua classes and external classes defined in other ontologies, namely, PhysicalQuantities and Standard-Dimensions of the EngMath family of ontologies, and
Okbc-Ontology and Hpkb-Upper-Level, all of which are part of the Ontolingua
Server’s library of ontologies. External classes are denoted [email protected], a
notation we borrow from the Ontolingua Server. The arcs in the hierarchy represent subclass relations, bottom up, e.g. the ‘Weight of’ class is a subclass of
the ‘Quantity’ class. We distinguish two different types of subclass relations,
indicated by the bold and dashed arcs in Fig. 1. Bold arcs correspond to full,
formal subclass relations. Dashed arcs correspond to relations we call referential
between Ecolingua classes and external classes, mostly of the EngMath family of
ontologies, in that they do not hold beyond the conceptual level, i.e., definitions
that the (external) class involves are not directly incorporated by the (Ecolingua) subclass. In the forthcoming definitions of Ecolingua classes, textual and
axiomatic KIF [5] definitions of their referential classes are given as they appear
in the Ontolingua Server. Ecolingua axioms defining quantity classes incorporate the physical dimension, when specified, of their referential classes in EngMath through the unit of measure concept, as explained in Sect. 3, and contextualise
the quantity in the ecology domain through concepts such as ecological entity,
compatibility between materials and entities, etc.
Forms of Ecolingua Concept Definitions. Ecolingua axioms are represented
as first-order logic, well-formed formulae of the form Cpt → Ctt. That is, if
Cpt holds then Ctt must hold, where Cpt is an atomic sentence representing an
Ecolingua concept and Ctt is a logical sentence that constrains the interpretation
of Cpt. The Cpt sentences make up Ecolingua vocabulary. One describes an
ecological data set by instantiating these sentences.
4.1 Amount Quantity
Many quantities in ecology represent an amount of something contained in a
thing or place, for example, carbon content in leaves, water in a lake, energy
stored in an animal’s body.
Fig. 1. Ecolingua class hierarchy
Material Quantity. Quantities that represent an amount of material things
are of the mass dimension (intuitively a ‘quantity of matter’ [8]). For such quantities the amount of material class is defined as a referential subclass of the
[email protected] class defined in the Ontolingua Server.
Amount of Material class – If A identifies a measure of amount of material
Mt in E specified in U then Mt is a material, E is an entity which is compatible
with Mt, and U is a unit of mass:
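A first-order sketch of this axiom, in the style of the Cpt → Ctt formulae above and matching the metadata term amt_of_mat(t, timber, tree, kg) used in Sect. 5; the right-hand-side predicate names are assumptions:

$$ amt\_of\_mat(A, Mt, E, U) \rightarrow material(Mt) \wedge entity(E) \wedge compatible(Mt, E) \wedge unit\_of\_mass(U). $$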
A material Mt is anything that has mass and can be contained in an ecological entity (e.g. biomass, chemicals, timber). An ecological entity E is any distinguishable thing, natural or artificial, with attributes of interest in an ecological
system (e.g. vegetation, water, an animal, a population, a piece of machinery),
the system itself (e.g. a forest, a lake), or its boundaries (e.g. atmosphere). Ecological quantities usually consist of measurements of attributes of such entities
(e.g. carbon content of vegetation, temperature of an animal’s body, birth rate
of a population, volume of water in a lake). A material and an entity are compatible if it occurs in nature that the entity contains the material. For example,
biomass is only thought of in relation to living entities (plants and animals), not
in relation to inorganic things.
Other quantities represent measurements of amount of material in relation
to space, e.g. amount of biomass in a crop acre, or of timber harvested from a
hectare of forest. The dimension of such quantities is mass over some power of
length. For these quantities, we define the material density class as a referential subclass of the [email protected] class defined in the
Ontolingua Server.
Material Density class – If A identifies a measure of density of Mt in E
specified in U then Mt is a material, E is an entity which is compatible with Mt,
and U is equivalent to an expression Um/ Ul, where Um is a unit of mass and Ul
is a unit of some power of length:
Amount of Time. Quantities can also represent amounts of immaterial things,
time being a common example. The duration of a sampling campaign and
the gestation period of females of a species are examples of ecological quantities of the amount of time class. The class is a referential subclass of the
[email protected] class defined in the Ontolingua Server.
Amount of Time class – If A identifies a measure of an amount of time of
Ev specified in U then Ev is an event and U is a unit of time:
where an event Ev is any happening of ecological interest with a time duration
(e.g. seasons, sampling campaigns, harvest events, etc.).
Non-physical Quantity. Despite the name, the ‘physical quantity’ concept in
EngMath allows for so-called non-physical quantities. These are quantities of new
or non-standard dimensions, such as the monetary dimension, which can be defined preserving all the properties of physical quantities, as already defined in the
ontology (Sect. 3). The class of non-physical quantities is a referential subclass of
[email protected] class defined in the Ontolingua Server
as: “A Constant-Quantity is a constant value of some Physical-Quantity, like
3 meters or 55 miles per hour. . . . ” Ecological data often involve measurements
of money concerning some economical aspect of the system-of-interest, e.g. profit
given by a managed natural system.
Amount of Money class – If A identifies a measure of amount of money in
E specified in U then E is an entity and U is a unit of money:
4.2 Time-Related Rate Quantity
In general, rates express a quantity in relation to another. In ecology, rates
commonly refer to instantaneous measures of processes of movement or transformation of something occurring over time, for example, decay of vegetation
biomass every year, consumption of food by an animal each day. Ecolingua defines a class of rate quantities of composite physical dimension including time,
as a referential subclass of the [email protected] class in
the Ontolingua Server, which is a generalisation of dimension-specific quantity
classes (see Fig. 1). The absolute rate class is for measures of processes where an
amount of some material is processed over time. These quantities have a composite dimension of mass, mass over some power of length, or money (the dimensions of amount quantities, with the exception of time) over the time dimension. To these dimensions will correspond adequate units of measure (e.g. kg, ton/ha), which we call units of material.
Absolute Rate class – If R identifies a measure of the rate of processing Mt from E1 to E2 specified in U, then Mt is a material, E1 and E2 are entities which are different from each other and compatible with Mt, and U is equivalent to an expression Ua/Ut, where Ua is a unit of material and Ut is a unit of time:
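A sketch in the same style as the axiom given for the Amount of Material class; the predicate and variable names are assumptions:

$$ abs\_rate(R, Mt, E_1, E_2, U) \rightarrow material(Mt) \wedge entity(E_1) \wedge entity(E_2) \wedge E_1 \neq E_2 \wedge {} $$
$$ compatible(Mt, E_1) \wedge compatible(Mt, E_2) \wedge \exists U_a, U_t\ (U \equiv U_a/U_t \wedge unit\_of\_material(U_a) \wedge unit\_of\_time(U_t)). $$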
Sometimes, processes are measured in relation to an entity involved in the
process. We call these measures specific rates. For example, a measure given in,
say, g/g/day is a specific rate meaning how much food in grams per gram of the
animal’s weight is consumed per day.
Specific Rate class – If R identifies a measure of a specific rate, related to one of the entities involved, of processing Mt specified in U, then: a corresponding quantity measures the absolute rate of processing Mt from one entity to the other, specified in a unit equivalent to an expression Ua/Ut, where Ua is a unit of measure of material; and U is equivalent to an expression Ub/Uc/Ut, where both Ub and Uc are units of measure of material and are of the same dimension D:
4.3 Temperature Quantity
Another fundamental physical dimension is temperature, which has measurement
scales rather than units [8]. Temperature in a greenhouse or of water in a
pond, are two examples of temperature quantities in ecological data sets. The
referential superclass of the class below is [email protected], defined in the Ontolingua Server.
Temperature of class – If T identifies a measure of the temperature of E
specified in S then E is an entity and S is a scale of temperature:
4.4 Weight Quantity
Strictly speaking weight is a force, a composite physical dimension of the form mass × length / time². But in ecology, as in many other contexts, people colloquially refer to ‘weight’ meaning a quantity of mass. For example, the weight of
an animal, the weight of a fishing harvest. It is in this everyday sense of weight
that we define a class of weight quantities. It has [email protected] as referential superclass, defined in the Ontolingua Server.
Weight of class – If W identifies a measure of the weight of E specified in U
then E is an entity and U is a unit of mass:
Note that for quantities of both this class and the Amount of Material class
the specified unit must be a unit of mass. But the intuition of a measure of weight
does not bear a containment relationship between a material and an entity like
the intuition of an amount of material does.
4.5 Dimensionless Quantity
Another paradoxically named concept in the EngMath ontology is that of dimensionless quantities. They do have a physical dimension but it is the identity dimension. Real numbers are an example. The class of dimensionless quantities has
a referential superclass of the same name, [email protected], defined in the Ontolingua Server.
This concept applies to quantities in ecology that represent counts of things,
such as number of individuals in a population or age group.
Number of class – If N measures the number of E specified in U then E is an
entity and N is a dimensionless quantity specified in U:
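A sketch in the same style as the earlier axioms (predicate names assumed):

$$ number\_of(N, E, U) \rightarrow entity(E) \wedge dimensionless\_quantity(N, U). $$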
Percentages can also be defined as dimensionless quantities. Food assimilation
efficiency of a population, mortality and birth rates are examples of ecological
quantities expressed as percentages.
Percentage class – If P is a percentage that quantifies an attribute of E specified
in U then E is an entity and P is a dimensionless quantity specified in U:
5 A Practical Application of Ecolingua
In ecological modelling, as in other domains, using data derived from observation
to inform model design adds credibility to model simulation results. Also, a common methodological approach that facilitates understanding of complex systems
is to firstly design a conceptual (or qualitative) model which is later used as a
framework for specification of a quantitative model. However, data sets given to
support modelling of ecological systems contain mainly quantitative data which,
at their low representational level, do not directly connect to high-level model conceptualisation. In this context, an ontology of properties of domain data can
play the role of a conceptual vocabulary for representation of data sets, by way
of which the data’s level of abstraction is raised to facilitate connections with
conceptual models. Ecolingua was initially built to support an application of synthesis of conceptual system dynamics models [4] stemming from data described
in the ontology, where existing models are reused to guide the synthesis process.
The application is depicted in Fig. 2 and is briefly discussed in the sequel; a
complete description including an evaluation of the synthesis system on the run
time efficiency criterion and examples of syntheses of actual and fictitious models
can be found in [1].
Fig. 2. Application of Ecolingua in model synthesis through reuse
Figure 2 shows the synthesis process starting with a given modelling data
set to support the design of a conceptual ecological model. Ecolingua vocabulary is then manually employed to describe the data set yielding metadata (e.g.
amt_of_mat(t, timber, tree, kg) is an instance of metadata). The synthesis mechanism tries to match the structure (or topology) of the existing model with
the metadata set, whose content marks up the structure to give a new model
that is consistent with the given metadata. This is done by solving constraint
rules that represent modelling knowledge in the mechanism. Matching the existing model with metadata means to reduce its structure to the concepts in
Ecolingua. It all comes down to how similar the two data sets – the new model’s
described in Ecolingua, and the data set that once backed up the existing model
– are with respect to ontological properties. The more properties they share, the
more of the existing model’s structure will be successfully matched with the new
model’s metadata.
5.1 Automatically Checking for Ecolingua-Compliant Metadata
Besides providing a vocabulary for description of ecological data by users, Ecolingua is employed by the synthesis system to check compliance of the manually
specified metadata with the underlying ontology axioms, ensuring that only compliant metadata are carried forward into the model synthesis process. Since, in the synthesis system, the Cpt → Ctt axioms are only reasoned upon when a metadata term with logical value true unifies with Cpt, the use of the axiom can be reduced to solving Ctt, as its logical value alone will correspond to the logical value of the whole expression. Therefore, each Ecolingua axiom can be represented in the synthesis system as a clause of the form c_ctt(Cpt, Ctt).
The following Ecolingua compliance checking mechanism is thus defined.
Let a metadata term be an instance of an Ecolingua concept Cpt. As defined by the Ecolingua axiom formula, this term being true and unified with Cpt implies that the consequent constraint Ctt must be true. If, however, the concept in question is one that lacks an axiomatic definition, it suffices to verify that the term unifies with an Ecolingua concept.
6 Concluding Remarks
We have defined classes of quantitative data in ecology, using the well-known
formalism of first-order logic. The definitions draw on the EngMath ontology
to characterise quantity classes with respect to their physical dimension, which
can be captured through the unit of measure in which instances of the quantity
classes are expressed in. The ontology has been employed to enable a technique
of synthesis of conceptual ecological models from metadata and reuse of existing models. The synthesis mechanism that implements the technique involves
proofs over the ontology axioms written in Prolog in order to validate metadata that is given to substantiate the models. This is an application where an
ontology is not used at a conceptual level only, as we commonly see, but at a
practical, implementational level, adding value to a knowledge reuse technique.
As the ontology is founded on the universal concept of physical dimensions, its
range of application can be widened. However, while the definitions presented
here have been validated by an ecological modelling expert at the Institute of
Ecology and Resource Management, University of Edinburgh, Ecolingua’s concepts and axioms are not yet fully developed. Quantities of space, energy and
frequency dimensions, for example, as well as precise definitions, with axioms
where possible, of contextual ecological concepts such as ecological entity, event,
the compatibility relation between entities and materials, are not covered and
will be added as the ontology evolves. We would also like to specify Ecolingua
using state-of-the-art ontology languages, such as DAML+OIL [2] or OWL [12],
and make it publicly available so as to allow its cooperative development and
diverse applications over the World Wide Web.
Acknowledgements
The author wishes to thank FAPEAM (Fundação de Amparo a Pesquisa do
Estado do Amazonas) for its partial sponsorship through the research project
Metadata, Ontologies and Sustainability Indicators integrated to Environmental
Modelling.
References
1. Brilhante, V.: Ontology and Reuse in Model Synthesis. PhD thesis, School of Informatics, University of Edinburgh (2003)
2. DARPA Agent Markup Language. http://www.daml.org/2001/03/, Defense Advanced Research Projects Agency (2001) (last accessed on 10 Mar 2004)
3. Ellis, B.: Basic Concepts of Measurement (1966) Cambridge University Press, London
4. Ford, A.: Modeling the Environment: an Introduction to System Dynamics Modeling of Environmental Systems (1999) Island Press
5. Genesereth, M., Fikes, R.: Knowledge Interchange Format, Version 3.0, Reference
Manual, Logic-92-1 (1992) Logic Group, Computer Science Department, Stanford
University, Stanford, California
6. Gruber, T., Olsen, G.: An Ontology for Engineering Mathematics. In Proceedings
of the Fourth International Conference on Principles of Knowledge Representation
and Reasoning (1994) Bonn, Germany, Morgan Kaufmann
7. Kashyap, V.: Design and Creation of Ontologies for Environmental Information
Retrieval. In Proceedings of the Twelfth International Conference on Knowledge
Acquisition, Modeling and Management (1999) Banff, Canada
8. Massey, B.: Measures in Science and Engineering: their Expression, Relation and
Interpretation (1986) Ellis Horwood Limited
9. Niven, B.: Formalization of the Basic Concepts of Animal Ecology. Erkenntnis 17
(1982) 307–320
10. Niven, B.: Formalization of Some Concepts of Plant Ecology. Coenoses 7(2) (1992)
103–113
11. Ontolingua Server. http://ontolingua.stanford.edu, Knowledge Systems Laboratory, Department of Computer Science, Stanford University (2002) (last accessed
on 10 Mar 2004)
12. Ontology Web Language. http://www.w3.org/2001/sw/WebOnt, WebOnt Working Group, W3C (2004) (last accessed on 10 Mar 2004)
13. Pinto, H.: Towards Ontology Reuse. Papers from the AAAI-99 Workshop on Ontology Management, WS-99-13, (1999) 67–73. Orlando, Florida, AAAI Press
Using Color to Help
in the Interactive Concept Formation
Vasco Furtado and Alexandre Cavalcante
University of Fortaleza – UNIFOR, Av. Washington Soares 1321, Fortaleza – CE, Brazil
[email protected], [email protected]
Abstract. This article describes a technique that aims at qualifying a concept
hierarchy with colors, in such a way that it can be feasible to promote the interactivity between the user and an incremental probabilistic concept formation algorithm. The main idea behind this technique is to use colors to map the concept properties being generated, to combine them, and to provide a resulting
color that will represent a specific concept. The intention is to assign similar colors to similar concepts, thereby making it possible for the user to interact with
the algorithm and to intervene in the concept formation process by identifying
which approximate concepts are being separately formed. An operator for
interactive merge has been used to allow the user to combine concepts he/she
considers similar. Preliminary evaluation on concepts generated after interaction has demonstrated improved accuracy.
1 Introduction
Incremental concept formation algorithms accomplish the concept hierarchy construction process from a set of observations – usually an attribute/value paired list – that
characterizes an observed entity. By using these algorithms, learning occurs gradually
over a period of time.
Different from non-incremental learning (where all observations are presented at
the same time), incremental systems are capable of changing the hierarchical structure
constructed as new observations become available for processing. These systems,
besides closely representing the manner in which humans learn, present the disadvantage that the quality of the generated concepts depends on the presentation order
of the observations.
This work proposes a strategy to help in the identification of bad concept formation, making it possible to initiate an interaction process. The resource that makes this
interaction possible is a color-based data visualization technique. The idea is to help
users recognize similarities or differences in the conceptual hierarchies being formed.
The basic assumption of this strategy is to match up human strengths with those of
computers. In particular, by using the human visual perceptive capacity in identifying
color patterns, it seeks to aid in the identification of poor concept formation.
The proposed solution consists of mapping colors to concept properties and then
mixing them to obtain a concept color. The probability of an entity, represented by a
concept, having a particular property assumes a fundamental role in the mixing process mentioned above, thereby directly influencing the final color of the concept. At
the end of the process, each concept of the hierarchy will be qualified by a color. An
operator for the interactive merge has been defined to allow the user to combine concepts he/she considers similar. Preliminary evaluations on generated concepts, after
such an interaction, have demonstrated that the conceptual hierarchy accuracy has
improved considerably.
2 Incremental Probabilistic Concept Formation
Incremental probabilistic concept formation systems accomplish a process of concept
hierarchy formation that generalizes the observations contained in each node in terms of
the conditional probability of their characteristics.
The task that these systems accomplish does not require a “teacher” to pre-classify
objects, but such systems use an evaluation function to discover classes with “good”
conceptual descriptions. Generally, the most common criterion for the quality of a concept is its capacity to make inferences about unknown properties of new entities.
Most of the recent work on this topic is built on the work of Fisher (1987) and the
COBWEB algorithm, which forms probabilistic concepts (CP) defined in the following manner. Let A = {a1, ..., an} be the set of all attributes and Vi be the set of all values of an attribute ai that describes a concept CP, where P(ai = vij | C) indicates the probability of an entity possessing the attribute ai with the value vij, given that this entity is a member of the class C (the extent of CP). Then, consider the pair (ai = vij, P(ai = vij | C)) as being a property of the concept CP.
The incremental character of processing observations causes the presentational order of these observations to influence the concept formation process. Consider the set
of observations in the domain of animals in Table 1:
When processing the observations with COBWEB in the order of 1,3,4,5, and 2, it
may be noticed that the concept hierarchy formed does not reflect an ideal hierarchy
for the domain, since the two mammals are not in the same class, as can be seen in
Figure 1.
Fig. 1. Hierarchical structure generated in a bad order by Cobweb.
3 Modeling Colors
In 1931, the CIE (Commission Internationale de l’Eclairage) defined its first model of colors. An evolution of this first CIE color model led the CIE to define two other models, the CIELuv and the CIELab, in which the Euclidean distance between the coordinates of two colors corresponds to their difference in visual perception, so that pairs of colors separated by the same distance are perceived as equally different (Wyszecki, 1982).
The CIELab standard began to influence new ways for the measurement of visual
differences between colors, such as the CIELab94 and the CMC (Sharma & Trussell,
1997). For this work, similarity between colors is the first key point for the solution,
and it will be used extensively in the color mixture algorithm.
On the other hand, the color models may also be used to define colors according to
their properties. These properties are luminosity (L); hue (H), which is the color an object is perceived to have (green, blue, etc.); and saturation (S) or chroma (C), indicating the degree to which the color is either vivid or diluted (Fortner & Meyer, 1997). These
color spaces are denominated HLS or HLC, and they will be further applied in this
work to map colors to properties.
4 Mixing Color in Different Proportions
We defined a color-mixing algorithm following the assumption that the resulting
mixture of two colors must be perceptually closer to whichever of the colors being mixed carries the higher weight in the mixing process.
The proposed algorithm considers the CIELab94 metric as a measure of similarity/dissimilarity between colors. We define a function which measures the extent to which a color R1 and a color R2 resemble each other in accordance with the CIELab94 model: the smaller the calculated value of the CIELab94 metric, the greater the similarity between the colors involved.
4.1 Mixing Two Colors
When mixing two colors, consider that the set CIELab contains the colors whose coordinates L, a and b are coordinates of the CIELab model. The two colors to be mixed and the weights associated with them are such that the weights sum to one. The color resulting from the mixture is then calculated in the following manner:
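A minimal sketch of this calculation, under two assumptions that are not the paper's: plain Euclidean distance in Lab space stands in for the CIELab94 metric, and the mixture point is taken on the straight segment between the two colors (all names are illustrative):

from math import sqrt

def delta_e(c1, c2):
    # Perceptual distance between two Lab colors; plain Euclidean distance
    # is used here as a stand-in for the CIELab94 metric of the paper.
    return sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))

def mix2colors(c1, c2, w1):
    # Mix Lab colors c1 and c2, where w1 is the weight of c1 and the weight
    # of c2 is 1 - w1.  The result lies between the two colors and is
    # perceptually closer to the color with the higher weight: its distance
    # to c1 is (1 - w1) times delta_e(c1, c2).
    w2 = 1.0 - w1
    return tuple(w1 * x + w2 * y for x, y in zip(c1, c2))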
1 L stands for luminosity, a stands for the red/green aspect of the color, and b stands for the yellow/blue aspect.
It should be underscored that a color is a point in a three-dimensional space. That is why it is necessary to compute an intermediate point between two colors. The function Mix2colors searches for the color represented by this intermediate point, so that its similarity with the first color that participated in the mixture is equal to the weight
multiplied by the similarity between the colors that are being mixed.
4.2 Mixing n Colors
To mix n colors, first, it is necessary to mix two colors, and the result obtained is
mixed with the third color. This procedure is extended for n colors. The weight by
which each color influences the result of the mixture is proportional to the order in
which the color participates in the mixing process. For instance, the first two colors,
when mixed, participate with 0.5 each. The third color participates with 0.33, since the first two will already have influenced 0.66 of the final color. Generalizing, let i be the order in which a color is considered in the process of mixing n colors; the influence of each color in the process is then given by 1/i.
The function to mix n colors has the following steps: let RT be the set of colors to be mixed, where each color belongs to CIELab; the mixture of the colors of the set RT will be accomplished by the following function:
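A minimal sketch of this function, reusing the mix2colors sketch above and the 1/i weighting described in the previous paragraph (names are illustrative):

def mixncolors(colors):
    # Mix a list of Lab colors pairwise: the i-th color (i >= 2) enters the
    # running mixture with weight 1/i, so the first two colors contribute
    # 0.5 each, the third 1/3, and so on.
    mixed = colors[0]
    for i, color in enumerate(colors[1:], start=2):
        # the incoming color weighs 1/i; the partial mixture weighs (i-1)/i
        mixed = mix2colors(color, mixed, 1.0 / i)
    return mixed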
The Mixncolor function accepts, as a parameter, a set of colors to be mixed (set
RT), and returns a single color belonging to the set CIELab. It calls the Mix2colors
function with three parameters: (i) the current color of the set RT, (ii) the result of the mixture (M) obtained from the previously processed colors, and (iii) a weight 1/i for the current color.
5 Aiding the Identification
of Poor Concept Hierarchy Formations Using Colors
The strategy developed to aid in the identification of poor concept hierarchy formation is done in two phases. The first one maps the initial colors to concept properties
and the second phase mixes these colors, concluding with the resulting color of the
concept.
2 Different orders of color mixture can produce different results, but this won’t be a problem, since the same process will be applied to every concept.
5.1 Initial Color Mapping of Probabilistic Concept Properties
The initial color mapping attributes, to each property of a probabilistic concept, a color, so that, at the end of this procedure, a set denominated RM will be obtained, made up of all these mapped colors. To carry out this task, we have as parameters: (i) the set of properties formed by all of the properties of CP, (ii) the value for minimum luminosity, and (iii) the value for maximum luminosity. In this work, we used values between 50 and 98 for these latter parameters in order not to generate excessively dark color patterns.
The procedure starts by going through all the attributes of the set A, each of which will receive a coordinate H of the color being mapped. Since coordinate H of the HLC model varies from 0° to 360°, we have, for the set of observations in Table 1, the following values for H: 72, 144, 216, and 288.
The second step seeks to attribute the coordinates L and C. First, for each value of each attribute, a value of L is calculated. The third column in Table 2 shows the coordinates L calculated for the set of observations in Table 1. Finally, coordinate C is calculated to be the biggest possible value for which the transformation of all the values of H, given the same L, returns only valid RGB values3 (0 <= R <= 255, 0 <= G <= 255, 0 <= B <= 255).
Table 2 describes the mapping of H, L and C for the first two attributes of the example in Table 1.
5.2 Processing Mapped Colors
In order to complete the color qualification process of the hierarchical structure, we will consider the following parameters: (i) the RM set of the initial mapping, and (ii) the conditional probabilities of the properties. The conditional probability of each property will function as the weight that the function Mix2colors needs. As a final product, we will have the set RT (the input of the Mixncolor function). The algorithm to generate RT and its explanation follows:
3 That heuristic aims at having valid RGB values for all lines, for the H, L and C being chosen.
To calculate the color of an attribute, we mix, two by two, the colors of each value of this attribute. For such, a partial-mixture variable is initialized with the color of the first value of attribute a, and an accumulator is set up with the conditional probability of that first value given class C. After this, the procedure enters a loop that treats each color of the values of attribute a, using the Mix2colors function. It uses the partial result and the color that is being processed. The weight parameter normalizes the accumulated weight of the values so far considered and the weight of the current property, so that the two weights sum to one, as the Mix2colors function requires.
Finally, the generated set RT feeds the Mixncolor function resulting in the final color of the concept. This process is repeated for each concept of the hierarchy.
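A minimal sketch of this step under the reading given above, with one color produced per attribute and each value color weighted by its conditional probability; the exact normalisation and all names are assumptions:

def attribute_color(value_colors, value_probs):
    # Mix, two by two, the colors of the values of one attribute, weighting
    # each value color by its conditional probability P(a = v | C).
    # Assumes the probabilities are positive.
    mixed = value_colors[0]
    accumulated = value_probs[0]
    for color, prob in zip(value_colors[1:], value_probs[1:]):
        accumulated += prob
        # the two weights passed to mix2colors sum to one
        mixed = mix2colors(color, mixed, prob / accumulated)
    return mixed

def concept_color(properties):
    # properties maps each attribute to a list of (color, probability) pairs.
    # Builds the set RT (one color per attribute) and mixes it with mixncolors.
    rt = [attribute_color([c for c, _ in pairs], [p for _, p in pairs])
          for pairs in properties.values()]
    return mixncolors(rt)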
Figure 2 shows the colored concept hierarchy for the example described in section
2. Note that the colors aid the user in perceiving the need to restructure the hierarchy
since the two mammals, which did not form a single class, have a similar resulting
color in the eyes of the user:
Fig. 2. Hierarchical structure qualified with colors.
6 Evaluation of the Color Heuristic
We have defined an evaluation method, which seeks to prove that two things will take
place:
1. Highly similar concepts will result in highly similar colors;
2. Concepts of low similarity will result in colors also of low similarity.
It is important to state that the proposed method will qualify each pair of equal
concepts with equal colors. However, it cannot guarantee that similar concepts will
receive similar colors, but in most cases, this will be true4. The evaluation process will
use two basic functions. One function aims to measure the similarity between two
probabilistic concepts and the other one aims to measure the similarity between two
colors that represent these same concepts. The first was defined in (Talavera & Béjar,
4 The proposed method doesn’t guarantee this because the color space is not linear and sometimes a small variation produces a big variation in color perception.
1998), which considers two probabilistic concepts as similar if their probabilistic
distributions5 are highly intersecting. The second is the CIELab94 similarity function already seen in this article.
The basic idea is to generate a concept hierarchy with associated colors and to evaluate the similarity between the concepts, two by two, in terms of similarity of content
and of color. The ranges of concept similarity were defined in bands of ten, and for
each band the average of the similarity values among the colors of the compared concepts was calculated. Three databases were considered in the tests. The first two databases are composed of animal observations, with 105 and 305 observations, and the
third is the Mushrooms base (UCI, 2003), composed of 1000 observations.
Fig. 3. Evolution of probabilistic similarity versus similarity among colors.
Figure 3 shows the analysis defined in the previous paragraphs for the three bases
considered. Note that for all bases there is a decrease of the CIE94 metric as the measure of probabilistic similarity among the concepts increases. This evidence reveals that the heuristic strategy we have defined for qualifying concepts with colors reaches its main goal, which is to generate similar colors for similar concepts in the greatest number
of cases possible.
7 Interacting with the Concept Formation Process
The main goal of the strategy developed to qualify a hierarchical structure with colors
is to make interaction between the structure and the user feasible. Thus, it is possible
to improve the quality of the concept hierarchy easily because instead of accessing the
probabilistic values of each concept in order to compare them one by one, the user
can use his/her visual ability to have a global view of the conceptual structure and to
identify similarities. The interaction is simple and intuitive: the user only has to identify two similar colors, compare the probabilistic distributions of the concepts, and proceed with the merge of the two colored concepts if he/she considers it worthwhile. To do that, we define an operator called I-merge, similar to COBWEB’s original
merge operator. Unlike COBWEB’s merge that only combines concepts in the same
5 The probabilistic distribution of a concept is the set of its properties associated with their conditional probabilities.
hierarchical level, with I-merge it is possible to merge concepts, which are in different
levels of the hierarchy. The algorithm below explains the steps of I-merge:
The counters of the original node are subtracted from the hierarchical nodes, starting at the parent node of the original node until the root node is reached. Once that is accomplished, a merge node, resulting from the juxtaposition of the
original and destination nodes, is created, and it will be hierarchically superior to the
destination node, accumulating the counters of the two clustered nodes. Finally, the
node counters will be updated beginning with the parent node of this node cluster,
until the root node is reached.
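A compact sketch of these steps, assuming each hierarchy node keeps a dictionary of property counters, a parent pointer and a list of children; the data structure and names are assumptions, not the paper's implementation:

class Node:
    def __init__(self, counts=None, parent=None):
        self.counts = dict(counts or {})   # property -> accumulated count
        self.parent = parent
        self.children = []

def update_ancestors(node, deltas, sign):
    # Add (sign = +1) or subtract (sign = -1) the deltas from node up to the root.
    while node is not None:
        for key, value in deltas.items():
            node.counts[key] = node.counts.get(key, 0) + sign * value
        node = node.parent

def i_merge(original, destination):
    # Remove 'original' from its branch, create a merged node just above
    # 'destination' accumulating both counters, and update the counters of
    # the merged node's ancestors up to the root.
    update_ancestors(original.parent, original.counts, -1)
    original.parent.children.remove(original)

    merged = Node(destination.counts, parent=destination.parent)
    destination.parent.children.remove(destination)
    merged.parent.children.append(merged)
    merged.children = [original, destination]
    original.parent = destination.parent = merged
    for key, value in original.counts.items():
        merged.counts[key] = merged.counts.get(key, 0) + value

    update_ancestors(merged.parent, original.counts, +1)
    return merged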
8 Accuracy Evaluation
To evaluate whether the method proposed here has improved the accuracy of the
probabilistic concept formation, an animal database with 105 observations was used.
This set of observations was divided into 80 training observations and 25 test observations. The accuracy test consists in modifying a test observation by ignoring an attribute and classifying this observation in a previously built concept hierarchy. From the
concept found, the algorithm must suggest a value for the attribute, based on the attribute value with the highest predictability. This process is performed for each attribute of
each test observation. The higher the number of correct suggestions, the better the
concept hierarchy is, in terms of prediction.
The procedure begins with the application of COBWEB and the visualization of
the hierarchical structure formed by means of a tool that we developed to visualize
colored concept hierarchies called SmartTree. The initial shape of the hierarchical
structure is shown in figure 4 where each colored square represents a concept.
Fig. 4. Conceptual structure for ANIMALS database.
For that initial structure, the inference test is carried out using the test set of 25 observations, with 47% of errors observed. The user's intervention begins at this
moment. He/She observes that node 29 has a color similar to node 50. The user then
asks SmartTree about the probabilistic similarity between them to finally decide to
merge them. In this example, 3 mergers were carried out, linking the following
nodes: 29 to 50, 76 to 96, and the result of the latter to node 79. The application of the
inference tests on the structure, after each merge, indicates the following evolution in
the accuracy of the hierarchical structure:
After the 1st merge: 43% of errors;
After the 2nd merge: 38% of errors;
After the 3rd merge: 32% of errors;
Figure 5 shows the format of the resulting tree. Besides a substantial increase in
terms of accuracy, it may be verified that the resultant tree presents more uniformity
with more clustered concepts.
Fig. 5. Resultant tree after the interactive merge process.
A second test was done with the Mushrooms database composed of 1000 observations. We use 900 training observations and 100 for testing. The initial accuracy indicated 28.2% of errors. After the first I-merge, that rate decreased to 25.9% and after a
second I-merge, that rate still reduced to 25.2%.
9 Related Work
Proposals to solve the order problem are based on algorithmic alternatives implemented in the original concept formation model. The original proposal of Fisher
(1987) has already considered two operators (merge and split) in an attempt to minimize the problem. Along the same lines, ARACHNE (McKusick & Langley, 1991)
included two others in an attempt to adjust the hierarchy generated. The problem with
these alternatives is that restructuring the tree is only done at the local partition level.
Further, to keep the time complexity manageable, the operators only act upon
the two best nodes of the partition.
Fisher et al. (1992) showed that a database that contains consecutive dissimilar observations, based on Euclidean distance, tends to form a good hierarchy. Biswas et al.
(1994) adapted that study in the ITERATE algorithm. Later, Fisher (1996) suggested
minimizing the effects of the order through an interactive optimization process running in the background.
Another line of study is based on the mingling of non-incremental techniques with
those of the incremental approach. This work is exemplified in (Altintas, 1995), where
instances already added to the hierarchy are reprocessed together with the new observation until a measure of structural stability is attained. Upon concluding this phase,
the process continues incrementally following the models already commented on.
10 Conclusion
We have defined a heuristic method to assign colors to probabilistic concepts in order to allow a
user to interact with the conceptual structure. Moreover, we have defined a means of
user interaction via the I-merge operator, and we showed that improvements in the
accuracy of the concepts could be easily obtained as a result of this interaction.
This work is innovative and multidisciplinary, since it combines resources from Computer Graphics, in this case color technology, with concept formation from
Artificial Intelligence. It has demonstrated that issues related to the concept formation
process, such as the dependence on the order of presentation of observations, can be
addressed from this perspective.
Other uses of this approach are being investigated, such as combining
concepts produced via distributed data mining in grid computing architectures. Improvements to SmartTree that support different strategies for concept
visualization are also under development.
References
1. Altintas, I. N.: Incremental Conceptual Clustering without Order Dependency. Master's Thesis, Middle East Technical University (1995)
2. Biswas, G., Weinberg, J., Li, C.: ITERATE: A Conceptual Clustering Method for Knowledge Discovery in Databases. In: Innovative Applications of Artificial Intelligence in the Oil and Gas Industry, Editions Technip (1994)
3. Fisher, D. H.: Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning, 2 (1987) 139–172
4. Fisher, D., Xu, L., Zard, N.: Order Effects in Clustering. In: Proceedings of the Ninth International Conference on Machine Learning, Aberdeen, UK. Morgan Kaufmann (1992) 163–168
5. Fisher, D.: Iterative Optimization and Simplification of Hierarchical Clusterings. Journal of Artificial Intelligence Research, 4 (1996) 147–179
6. Fortner, B., Meyer, T. E.: Number by Colors: A Guide to Using Color to Understand Technical Data. Springer, ISBN 0-387-94685-3 (1997)
7. McKusick, K., Langley, P.: Constraints on Tree Structure in Concept Formation. In: Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, Australia (1991) 810–816
8. Sharma, G., Trussell, H. J.: Digital Color Imaging. IEEE Transactions on Image Processing, Vol. 6, No. 7 (1997)
9. Talavera, L., Béjar, J.: Efficient and Comprehensible Hierarchical Clusterings. In: Proceedings of the First Catalan Conference on Artificial Intelligence, CCIA98, Tarragona, Spain. ACIA Bulletin, No. 14-15 (1998) 273–281
10. UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLSummary.html (2003)
11. Wyszecki, G., Stiles, W. S.: Color Science: Concepts and Methods, Quantitative Data and Formulae, 2nd Ed. New York, Wiley (1982)
Propositional Reasoning
for an Embodied Cognitive Model
Jerusa Marchi and Guilherme Bittencourt
Departamento de Automação e Sistemas
Universidade Federal de Santa Catarina
88040-900 - Florianópolis - SC - Brazil
{jerusa,gb}@das.ufsc.br
Abstract. In this paper we describe the learning and reasoning mechanisms of
a cognitive model based on the systemic approach and on the autopoiesis theory. These mechanisms assume perception and action capabilities that can be
captured through propositional symbols, and they use logic for representing environment knowledge. The logical theories are represented by their conjunctive and
disjunctive normal forms. These representations are enriched to contain annotations that explicitly store the relationship among the literals and (dual) clauses
in both forms. Based on this representation, algorithms are presented that learn a
theory from the agent’s experiences in the environment and that are able to determine the robustness degree of the theories given an assignment representing the
environment state.
Keywords: cognitive modeling, automated reasoning, knowledge representation.
1 Introduction
In recent years the interest in logical models applied to practical problems such as planning [1] and robotics [21] has been increasing. Although the limitations of the sense-model-plan-act paradigm have been largely overcome, the gap between the practical ad hoc
path to “behavior-based artificial creatures situated in the world” [6] and the logical
approach is yet to be filled. A promising way to build such a unified approach is the
autopoiesis and enaction theory of Humberto Maturana and Francisco Varela [15] that
connects cognition and action by stating that “all knowing is doing and all doing is knowing”. A cognitive autopoietic system is a system whose organization defines a domain
of interactions in which it can act with relevance to the maintenance of itself, and the
process of cognition is the actual acting or behaving in this domain.
In this paper we define the learning and reasoning mechanisms of a generic model
for a cognitive agent that is based on the systemic approach [16] and on the cognitive
autopoiesis theory [25]. These mechanisms belong to the cognitive level of a three level
architecture presented by Bittencourt in [2].
2 Framework
In the proposed model, the cognitive agent is immersed in an unknown environment,
its domain according to the autopoiesis theory nomenclature. The agent interaction
with this environment is only possible through a set of primitive propositional symbols.
Therefore, the states of the world, from the agent point of view, are defined as the possible truth assignments to this set of propositional symbols. We also suppose that, as time
goes by, the environment drifts along the possible states (i.e., assignments) through flips
of the primitive propositional symbols truth values. The primitive propositional symbols
can be of three kinds: controllable, uncontrollable and emotional. Roughly, uncontrollable symbols correspond to perceptions and controllable ones to actions. Controllable
and uncontrollable symbols are “neutral”, in the sense that, a priori, they have no semantic value from the agent point of view.
Emotional symbols correspond to internal perceptions, i.e. properties of the agent
that are not directly controllable but can be “felt”, such as pleasure, hunger or cold1. In
a first approximation, we assume that emotional symbols are either “good” or “bad”,
in the sense that the agent has the intention that good emotional symbols be true and
bad ones false. From the agent point of view, all semantic value is directly or indirectly
derived from primitive emotional symbols.
The goal of the agent's cognitive capability is to recognize, memorize and predict
“objects” or “situations” in the world, i.e., propositional symbol assignments that relate, in a relevant way, these three kinds of symbols.
To apply the proposed cognitive model to some experimental situation, the first step
would be to define the emotional symbols and build the non-cognitive part of the agent
in such a way that the adopted emotional symbols suitably represent the articulation
between the agent and the external environment, in terms of agent structure maintenance
and functional goals. Emotional symbols may include trustful peer communication,
i.e., symbols whose truth value in a given situation (as described by controllable and
uncontrollable symbols) is determined by an external entity (e.g., another agent) that
meaningfully communicates with the agent.
Example 1. Consider a simple agent-environment setting that consists of floor and
walls. The agent is a robot that interacts with the environment through three uncontrollable propositional symbols, associated with its left, front and right sensors, and two controllable symbols, associated with its left and right motors. A possible emotional symbol would be Move, which is true when the robot is not blocked by
some obstacle in the environment. The goal of the cognitive agent is to discover the
relation between its actions (movements) and their consequences (collisions or non-collisions), in order to connect the symbols and to find a semantic meaning for
them.
The working hypothesis is that the agent’s cognitive capabilities are supported by a
set of non-contradictory logical theories that represent its knowledge about these relations. These theories are the agent's structure, according to the autopoiesis theory, and the
cognitive organization is such that it constructs and maintains this structure according to
the interaction with the environment. The goal of this paper is to describe two aspects
of this organization: (i) the learning mechanism that determines how logical theories
constructed with controllable and uncontrollable propositional symbols are related with
emotional propositional symbols and (ii) a robustness [10] verification mechanism that
determines what would be the effect, on the validity of one of these theories, of any
change in the assignments to propositional symbols, i.e., what is the minimal set of
flips in propositional symbol truth values that should be made to maintain the satisfiability of the theory when the present assignment is modified.
1 The name emotional is derived from Damasio’s notion of “somatic marker”, presented in [7].
3 Theory Representation
Let P = {p1, ..., pn} be a set of propositional symbols and L = {p1, ¬p1, ..., pn, ¬pn} the set of their associated literals, where each literal is either a propositional symbol or its negation. A clause C is a generalized disjunction [9] of literals, C = l1 ∨ ... ∨ lk, and a dual clause D is a generalized conjunction of literals, D = l1 ∧ ... ∧ lk.
Given a propositional theory represented by an ordinary formula W, there are algorithms for converting it into a conjunctive normal form (CNF), defined as a generalized conjunction of clauses, or into a disjunctive normal form (DNF), defined as a generalized disjunction of dual clauses, such that the three forms are logically equivalent (see, e.g., [23]).
Alternatively, special cases of CNF and DNF formulas are the prime implicates
and prime implicants, which consist of the smallest sets of clauses (or terms) closed for
inference, without any subsumed clauses (or terms), and not containing a literal and its
negation. In the sequel, conjunctions and disjunctions of literals, clauses or terms are
treated as sets.
A clause C is an implicate [12] of a formula W iff W ⊨ C, and it is a prime implicate iff for all implicates C′ of W such that C′ ⊨ C we have C ⊨ C′ or, syntactically [20], iff for all literals l ∈ C the clause C − {l} is not an implicate of W. We define the minimal CNF of W as a conjunction of prime implicates of W that is logically equivalent to W. A term D is an implicant of a formula W iff D ⊨ W, and it is a prime implicant iff for all implicants D′ of W such that D ⊨ D′ we have D′ ⊨ D or, syntactically, iff for all literals l ∈ D the term D − {l} is not an implicant of W. We define the minimal DNF of W as a disjunction of prime implicants of W that is logically equivalent to W.
To transform a formula from one clause form into the other, what we call the dual transformation (DT), only the distributivity of the logical operators ∧ and ∨ is needed. In
propositional logic, implicates and implicants are dual notions, in particular, an algorithm that calculates one of them can also be used to calculate the other [5,24].
To represent these normal forms, we introduce the concept of a quantum, defined
as a pair formed by a literal and its set of coordinates, which contains the subset of the (dual) clauses of the other normal form to which the literal belongs. A quantum is noted as the literal annotated with its coordinates, to remind that the coordinates can be seen as a function F from literals to sets of (dual) clauses. The rationale behind the choice
of the name quantum is to emphasize that the minimal semantic unit in the proposed
model is not the value of a propositional symbol, but the value of a propositional symbol
with respect to the theory in which it occurs.
Any dual clause D in the DNF can be represented by a set of quanta such that the union of their coordinates covers all the clauses of the CNF, i.e., D contains at least one literal that belongs to each clause of the CNF, spanning a path through the CNF, and no pair of
contradictory literals, i.e., if a literal belongs to D, its negation is excluded. A dual
clause D is minimal if the following condition is also satisfied: each quantum in D has at least one coordinate that belongs to no other quantum of D.
This condition states that each literal in D should represent
alone at least one clause of the CNF,
otherwise it would be redundant and could be deleted.
The notation is symmetric, i.e., a clause C in the CNF can be associated with a set of quanta whose coordinates now refer to the dual clauses of the DNF, such that the union of these coordinates covers all dual clauses of the DNF, with no
tautological literals allowed. Again, the minimality condition for C is expressed by requiring that each quantum in C alone represent at least one dual clause of the DNF.
The quantum notation is an enriched representation of the minimal normal forms,
in the sense that the quantum representation explicitly contains the relation between
literals in one form and the (dual) clauses in the other form. The CNF and DNF, from
a syntactical point of view, are totally symmetric and each one of them contains all the
information about the theory, but we propose that the agent should store its theories
in both minimal normal forms. We believe that this ‘holographic’ representation can be
used in other tasks of the agent, such as verification (presented in Section 5) and
belief changes [4], among others2.
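As an illustration of this representation, a minimal sketch follows; the class and method names are ours (assumed for the example, not the authors'), and literals are encoded as signed integers.

    import java.util.*;

    // A quantum: a literal together with its set of coordinates, i.e. the indices
    // of the (dual) clauses of the other normal form to which the literal belongs.
    class Quantum {
        final int literal;              // +p for a positive literal, -p for its negation
        final Set<Integer> coordinates; // indices of the CNF clauses containing this literal
        Quantum(int literal, Set<Integer> coordinates) {
            this.literal = literal;
            this.coordinates = coordinates;
        }
    }

    class DualClauseCheck {
        // A dual clause D (a set of quanta) is admissible if its coordinates span
        // all CNF clauses (a path through the CNF) and it contains no pair of
        // contradictory literals.
        static boolean isDualClause(Collection<Quantum> d, int numCnfClauses) {
            Set<Integer> covered = new HashSet<>();
            Set<Integer> literals = new HashSet<>();
            for (Quantum q : d) {
                if (literals.contains(-q.literal)) return false; // contradictory pair
                literals.add(q.literal);
                covered.addAll(q.coordinates);
            }
            for (int c = 0; c < numCnfClauses; c++)
                if (!covered.contains(c)) return false;          // some clause not represented
            return true;
        }
    }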
4 Learning
Theories can be learned by perceiving and acting in the environment, while keeping
track of the truth value of a specific emotional propositional symbol. This symbol can
be either a primitive emotional symbol or an abstract emotional symbol represented
by a theory that also contains controllable and uncontrollable symbols, but ultimately
depends on some set of primitive emotional symbols. The primitive emotional symbols
may also depend on a communication from another agent that can be trustfully used as
an oracle to identify its truth value.
The proposed learning mechanism has some analogy with the reinforcement learning method [11], where the agent acts in the environment monitoring a given utility
function. Directly learning the relevant assignments can be thought of as practical
learning.
Example 2. Consider the robot of Example 1. To learn the relation between the primitive
emotional symbol Move and the controllable and uncontrollable symbols, it may randomly act in the world, memorizing the situations in which the Move
symbol is assigned the value true. After trying all 2^5 = 32 possible truth assignments, it
concludes that the propositional symbol Move is satisfied by only 12 of these assignments3.
The dual transformation (DT), applied to the dual clauses associated with the good
assignments, returns the clauses of the minimal CNF.
A further application of the dual transformation to this CNF returns the minimal DNF4. The minimal forms and their relation can be represented by the corresponding sets of quanta. It should be noted that the minimal DNF contains fewer dual clauses than the original number of assignments; nevertheless, each assignment satisfies at least one of these dual clauses.
2 The authors presently investigate other properties of the normal forms.
3 To simplify the notation, an assignment is noted as a set of n literals, where n is the number of propositional symbols that appear in the theory: a positive literal indicates that the semantic function, which maps propositional symbols into truth values, assigns true to the corresponding symbol, and a negative literal indicates that it assigns false.
The application of the dual transformation provides a conjunctive characterization of
the theory that, because of the local character of the clauses, can be used as a set of
rules for decision making.
To formalize the proposed learning mechanism, we define an entailment relation
that connects semantically neutral propositional symbols (controllable and uncontrollable) to emotional symbols. Given a neutral propositional formula and an
emotional symbol P, this entailment relation has the following properties:
If
If
then
and
then
In practice, learning is always incremental, that is, the agent begins with an empty
theory and incrementally constructs a sequence of theories whose final element correctly captures the intended emotional propositional symbol P, in accordance with the properties above.
The algorithm to obtain the next theory in the sequence, represented by its CNF and DNF, given P, the current theory and the current assignment, is the following: whenever P holds in an assignment that the current theory does not yet cover, the dual clause formed by the literals of that assignment is added to the DNF and the CNF is recomputed through DT, the dual transformation5.
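A minimal sketch of this incremental step is given below, under the assumption that a theory is stored as a set of dual clauses (its DNF) and that the accompanying CNF is recomputed by the dual transformation; all names, and the simplification of the triggering condition, are ours.

    import java.util.*;

    // Sketch of the incremental ("practical") learning step. A theory is assumed
    // to be stored as its set of dual clauses (DNF) plus the corresponding CNF.
    class IncrementalLearning {
        // Each dual clause / assignment is a set of literals (signed integers).
        static void learnStep(Set<Set<Integer>> dnf, Set<Set<Integer>> cnf,
                              Set<Integer> assignment, boolean emotionalSymbolTrue) {
            if (emotionalSymbolTrue && !satisfies(dnf, assignment)) {
                // Add the dual clause corresponding to the new good assignment...
                dnf.add(new HashSet<>(assignment));
                // ...and recompute the other normal form (the dual transformation).
                cnf.clear();
                cnf.addAll(dualTransformation(dnf));
            }
        }

        static boolean satisfies(Set<Set<Integer>> dnf, Set<Integer> assignment) {
            for (Set<Integer> dualClause : dnf)
                if (assignment.containsAll(dualClause)) return true;
            return false;
        }

        // Placeholder for the dual transformation, whose cost is discussed in Section 5.
        static Set<Set<Integer>> dualTransformation(Set<Set<Integer>> form) {
            throw new UnsupportedOperationException("see Bittencourt & Tonin [5]");
        }
    }

The triggering condition in this sketch simply checks that the new assignment is not yet covered; the exact condition used by the authors is determined by the entailment properties above.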
A similar algorithm may be used to incrementally compute the sequence of theories describing those situations that do not entail the emotional symbol P. During learning, when the agent has already tried theories entailing P and theories not entailing it, the theory covering the remaining assignments captures those situations that were not yet experienced by the agent and can be used in the choice of future interactions. Its DNF can be computed by flipping all literals in the corresponding CNF. If learning is complete, this remaining theory becomes empty.
4 In fact, this second application is not necessary, because, once the prime implicants are known, there are polynomial time algorithms to calculate the prime implicates [8].
5 As specified in Section 3.
Although nothing directly associated with the CNF occurs in the environment, if
its contents can be communicated by another agent, then a theory can be taught by
stating a CNF that represents it. In this case, the trusted oracle would communicate
all the relevant rules that define the theory. This transmission of rules can be thought
of as intellectual learning, because it does not involve any direct experience in the
environment.
5 Verification and Robustness
As stated above, we assume that the agent stores, for each theory, both normal forms.
5.1 Conjunctive Memory
With the CNF, the agent can verify whether an assignment satisfies a theory using the
following method: given an assignment, the agent, using the DNF coordinates of the quanta (which specify in which clauses of the CNF each literal occurs),
constructs the set of quanta associated with the literals of the assignment. If the union of their coordinates covers all clauses of the CNF, then the assignment satisfies the theory; otherwise it does not
satisfy it. In case the assignment satisfies the theory, the number of times a given
coordinate appears in the associated set of quanta indicates how robust the assignment is
with respect to changes in the truth value of the propositional symbol associated with it.
The smaller this number, the more critical is the clause denoted by the coordinate. If a given
coordinate appears only once, then flipping the truth value of the propositional symbol
associated with it will cause the assignment not to satisfy the theory anymore. In this
case, the other literals in the critical rule represent additional changes in the assignment
that could lead to a new satisfying assignment.
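The check just described can be sketched as follows; the data-structure names are assumptions, literals are encoded as signed integers and CNF clauses are identified by their indices.

    import java.util.*;

    // Sketch of the conjunctive-memory check: the assignment satisfies the theory
    // iff the CNF coordinates of its literals cover every clause; the number of
    // times a clause index appears measures how robust the assignment is with
    // respect to that clause.
    class ConjunctiveMemory {
        // coordinates.get(literal): indices of the CNF clauses containing that literal.
        static Map<Integer, Integer> clauseSupport(Map<Integer, Set<Integer>> coordinates,
                                                   Set<Integer> assignment) {
            Map<Integer, Integer> support = new HashMap<>();
            for (int literal : assignment)
                for (int clause : coordinates.getOrDefault(literal, Set.of()))
                    support.merge(clause, 1, Integer::sum);
            return support;
        }

        // The theory is satisfied when every clause index 0..numClauses-1 is supported.
        static boolean satisfies(Map<Integer, Integer> support, int numClauses) {
            for (int c = 0; c < numClauses; c++)
                if (!support.containsKey(c)) return false;
            return true;
        }

        // Clauses supported by exactly one literal are critical: flipping that
        // literal makes the assignment no longer satisfy the theory.
        static Set<Integer> criticalClauses(Map<Integer, Integer> support) {
            Set<Integer> critical = new HashSet<>();
            support.forEach((clause, count) -> { if (count == 1) critical.add(clause); });
            return critical;
        }
    }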
Example 3. Consider the theory of Example 2 and an assignment in which the robot is moving forward (both motor symbols true) and the three sensors are off (all sensor symbols false). The union of the DNF coordinates (which refer to the CNF) of the literals in this assignment is equal to the complete clause set {0,1,2,3,4,5,6} and, therefore, the assignment satisfies the theory. The only coordinate that appears only once is 2. This means that, if the truth value of the propositional symbol that alone supports clause 2 is changed, the resulting assignment will no longer satisfy clause 2 and therefore will no longer satisfy the theory. On the other hand, the truth values of the other propositional symbols can be changed and the resulting assignment would still satisfy the theory. This agrees with the intuition: in this situation the only event that would affect the possibility of moving is the frontal sensor becoming on, and in this case, in order to satisfy clause 2 again, one of the two motors should be turned off.
5.2 Disjunctive Memory
With the DNF, the agent can verify whether an assignment satisfies a theory using the
following method: given an assignment, the agent should determine
whether one of the dual clauses in the DNF is included in the assignment. To facilitate the search for such a dual clause, it constructs the set of quanta associated with the literals of the assignment, where the coordinates are now the CNF coordinates (which specify in which dual
clauses of the DNF each literal occurs). The number of times a given coordinate appears in this set of quanta indicates how many literals the dual clause denoted by the
coordinate shares with the assignment. If this number is equal to the number of literals
in the dual clause, then it is satisfied by the assignment. Dual clauses whose coordinates do not appear
in this set need not be checked for inclusion. If a dual clause is not satisfied by the
assignment, it is possible to determine the set of literals that should be flipped, in the
assignment, to satisfy it.
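A corresponding sketch for the disjunctive check, again with assumed names and signed-integer literals:

    import java.util.*;

    // Sketch of the disjunctive-memory check: count, for each dual clause of the
    // DNF, how many of its literals are shared with the assignment; a dual clause
    // whose count equals its size is satisfied, otherwise the missing literals are
    // the flips needed to satisfy it.
    class DisjunctiveMemory {
        // dnf: list of dual clauses (sets of literals); assignment: set of literals.
        static int satisfiedDualClause(List<Set<Integer>> dnf, Set<Integer> assignment) {
            for (int i = 0; i < dnf.size(); i++) {
                int shared = 0;
                for (int literal : dnf.get(i))
                    if (assignment.contains(literal)) shared++;
                if (shared == dnf.get(i).size()) return i; // assignment includes this dual clause
            }
            return -1; // no dual clause is satisfied
        }

        // Literals of a dual clause that would have to be flipped in the assignment.
        static Set<Integer> flipsNeeded(Set<Integer> dualClause, Set<Integer> assignment) {
            Set<Integer> flips = new HashSet<>(dualClause);
            flips.removeAll(assignment);
            return flips;
        }
    }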
Example 4. Consider the theory of Example 2 and an assignment in which the robot is turning right and the right sensor is off. The CNF coordinates of the literals determine which dual clauses share which literals with the assignment. In this case, except for the literals associated with the left and frontal sensors, any change will affect the satisfiability of the theory: the robot is turning right and the right sensor is off, so the states of the left and frontal sensors are clearly irrelevant.
5.3 Models and Supermodels
In the proposed framework, robustness is the main concern because the agent should
know how to modify its controllable symbols in order to maintain the satisfiability of
its theories, given any possible change in the uncontrollable symbols. In [10], Ginsberg
et al. introduce the concept of supermodels to measure the inherent degree of robustness
associated with a model. This concept is defined as follows: an (S1, S2, a, b)-supermodel is a model such that, if we modify the values taken by the variables in a subset of S1 of size at most a, another model can be obtained by modifying the values of the variables in a disjoint subset of S2 of size at most b.
They also show that deciding whether or not a propositional theory has an (a, b)-supermodel6 is NP-complete, and they provide an encoding for the more specific notion of
(1, 1)-supermodel that allows one to find out whether a given theory has such a supermodel using
standard SAT solvers. In our case, we are interested in the more general (S1, S2, a, b) case,
because we have controllable and uncontrollable symbols. We formalize the intuitive
notions of the previous sections in the algorithms below.
Although each algorithm uses just one normal form, they require the minimal form,
which implies, whether the theory is obtained through practical or intellectual learning,
the calculation of the dual transformation. The algorithms receive as input a literal7 to
be flipped, a satisfying assignment represented as a dual clause, and one of the normal forms (either the minimal CNF or the minimal DNF). They return either “Prime implicate”,
if the flipped literal is a unary prime implicate (UPI) of the theory, or the set of literals that should
be flipped in order to restore satisfiability after the input literal is flipped. The algorithms are non-deterministic and each choice would produce a different flip set. Any flip set returned
by one of the algorithms is minimal, because that algorithm always chooses one of
the flip sets of minimal size. The other algorithm only returns a minimal flip set if the additional literals are chosen such that they form one of the dual clauses of
minimal size associated with the small theory formed by the clauses that must be re-covered; these minimal dual clauses can be obtained by the application of the dual
transformation to this (small) theory.
6 An (a, b)-supermodel is an (S1, S2, a, b)-supermodel in which the sets S1 and S2 are the set of all propositional symbols.
The dual transformation, i.e. finding one minimal normal form given its dual non
minimal form, is NP-complete and is as hard as the SAT problem [26], but the fact
that the number of minimal dual clauses is always less than (or, in the worst case, equal to)
the number of models indicates that searching only for minimal dual clauses can
be a good heuristic for a SAT solver [3]. Once the minimal normal form is available,
both supermodel algorithms are polynomial: for a theory with n symbols and m (dual)
clauses, both algorithms are O(nm).
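To illustrate the CNF-based repair idea (not the authors' exact algorithm), the following greedy sketch re-covers the clauses that lose their only supporting literal after a flip; it does not guarantee a minimal flip set.

    import java.util.*;

    // Greedy illustration of a CNF-based repair step: after a literal of a total
    // satisfying assignment is flipped, every clause it alone supported must be
    // re-covered by flipping one of its other literals.
    class RobustnessRepair {
        // cnf.get(c): literals of clause c; assignment: total satisfying assignment.
        // Returns null if the flipped literal is a unary prime implicate (no repair exists).
        static Set<Integer> flipSet(int flipped, Set<Integer> assignment, List<Set<Integer>> cnf) {
            Set<Integer> repairs = new HashSet<>();
            for (Set<Integer> clause : cnf) {
                if (!clause.contains(flipped)) continue;
                boolean covered = false;
                for (int l : clause)
                    if (l != flipped && (assignment.contains(l) || repairs.contains(l))) covered = true;
                if (!covered) {
                    if (clause.size() == 1) return null;   // unary prime implicate: cannot repair
                    for (int l : clause) {                 // flip the first alternative literal
                        if (l != flipped) { repairs.add(l); break; }
                    }
                }
            }
            return repairs;
        }
    }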
The dual transformation has been implemented for first-order and propositional
logic, and the results have been reported elsewhere. The algorithms presented above have been
implemented in Common Lisp and applied to the theories in the SATLIB benchmark
(http://www.satlib.org/). Some results for supermodels, obtained with the random 3SAT theories with 50 propositional symbols and 218 clauses,
are shown in the table below. For theories in the critical region, the sizes of the flip sets for those literals in the assignment that are neither UPIs nor pure are usually quite big, but some of them can be as small as 1.
7 The flipped form of a literal is its complementary literal.
6 Related Work
This work is rooted in the logicist school [18] and inscribes itself in the Cognitive
Robotics domain [13, 14, 22]. We try to apply the seemingly underexplored properties
of the minimal normal forms of logical theories to the several challenges of the domain:
environment learning and modeling, reasoning about change, planning, abstraction and
generalization. Because of its focus on normal forms, this work is also related with the
SAT research concerned with the syntactical properties of the theories [17,19]. A particularity of this work is that it searches for semantic grounds for logical theories in the
autopoiesis theory [15], instead of in a pure model theoretical account.
7 Conclusion
The paper describes learning and robustness verification of logical theories that represent the knowledge of a cognitive agent. The semantics of these theories, instead of
being a mapping from syntactic expressions to an outside world reality, is represented
by the holographic relation between the two syntactic normal forms of theories that
represent relevant interaction properties with the environment. This paper is part of a
cognitive representation project.
Acknowledgments
The authors express their thanks to the Brazilian research support agency “Fundação
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Capes)” (project number 400/02) for the partial support of this work.
References
1. W. Bibel. Let’s plan it deductively. In Proceedings of IJCAI 15, Nagoya, Japan, August 23–29, pages 1549–1562. Morgan Kaufmann (ISBN 1-55860-480-4), 1997.
2. G. Bittencourt. In the quest of the missing link. In Proceedings of IJCAI 15, Nagoya, Japan,
August 23-29, pages 310–315. Morgan Kaufmann (ISBN 1-55860-480-4), 1997.
3. G. Bittencourt and J. Marchi. A syntactic approach to satisfaction. In Boris Konev and Renate
Schmidt, editors, Proceedings of the 4th International Workshop on the Implementation of
Logics, pages 18–32. University of Liverpool and University of Manchester, 2003.
4. G. Bittencourt, L. Perrussel, and J. Marchi. A syntactical approach to revision. Accepted to
ECAI’04.
5. G. Bittencourt and I. Tonin. An algorithm for dual transformation in first-order logic. Journal
of Automated Reasoning, 27(4):353–389, 2001.
6. R. A. Brooks. Intelligence without representation. Artificial Intelligence (Special Volume
Foundations of Artificial Intelligence), 47(1-3): 139–159, January 1991.
7. A. R. Damasio. Descartes’ Error: Emotion, Reason, and the Human Brain. G.P. Putnam’s
Sons, New York, NY, 1994.
8. A. Darwiche and P. Marquis. A perspective on knowledge compilation. In IJCAI, pages 175–
182, 2001.
9. M. Fitting. First-Order Logic and Automated Theorem Proving. Springer Verlag, New York,
1990.
10. M. L. Ginsberg, A. J. Parkes, and A. Roy. Supermodels and robustness. In Proceedings of
AAAI-98, pages 334–339, 1998.
11. L. P. Kaelbling, M. L. Littman, and A.W. Moore. Reinforcement learning: A survey. Journal
of Artificial Intelligence Research, 4:237–285, 1996.
12. A. Kean and G. Tsiknis. An incremental method for generating prime implicants/implicates.
Journal of Symbolic Computation, 9:185–206, 1990.
13. Y. Lespérance, H. J. Levesque, F. Lin, D. Marcu, R. Reiter, and R. B. Scherl. A logical
approach to high level robot programming – a progress report. In B. Kuipers, editor, Working notes of the 1994 AAAI fall symposium on Control of the Physical World by Intelligent
Systems, New Orleans, LA, November 1994.
14. H. Levesque, F. Pirri, and R. Reiter. Foundations for the situation calculus, 1998.
15. H. R. Maturana and F. J. Varela. Autopoiesis and cognition: The realization of the living.
In Robert S. Cohen and Marx W. Wartofsky, editors, Boston Studies in the Philosophy of
Science, volume 42. Dordrecht (Holland): D. Reidel Publishing Co., 1980.
16. E. Morin. La Méthode 4, Les Idées. Editions du Seuil, Paris, 1991.
17. N. Murray, A. Ramesh, and E. Rosenthal. The semi-resolution inference rule and prime
implicate computations. In Proc. Fourth Golden West International Conference on Intelligent
Systems, San Francisco, CA, USA, pages 153–158, 1995.
18. A. Newell. The knowledge level. Artificial Intelligence, 18:87–127, 1982.
19. A. J. Parkes. Clustering at the phase transition. In AAAI/IAAI, pages 340–345, 1997.
20. A. Ramesh, G. Becker, and N. V. Murray. CNF and DNF considered harmful for computing
prime implicants/implicates. Journal of Automated Reasoning, 18(3):337–356, 1997.
21. R. Scherl and H. J. Levesque. Knowledge, action, and the frame problem. Artificial Intelligence, 1(144): 1–39, March 2003.
22. M. Shanahan. Explanation in the situation calculus. In Ruzena Bajcsy, editor, Proceedings of
the Thirteenth International Joint Conference on Artificial Intelligence, pages 160–165, San
Mateo, California, 1993. Morgan Kaufmann.
23. J.R. Slagle, C.L. Chang, and R.C.T. Lee. A new algorithm for generating prime implicants.
IEEE Transactions on Computing, 19(4):304–310, 1970.
24. R. Socher. Optimizing the clausal normal form transformation. Journal of Automated Reasoning, 7(3):325–336, 1991.
25. F. J. Varela. Autonomie et Connaissance: Essai sur le Vivant. Editions du Seuil, Paris, 1989.
26. L. Zhang and S. Malik. The quest for efficient boolean satisfiability solvers. In Proceedings
of the 8th International Conference on Automated Deduction (CADE 2002), 2002. Invited
Paper.
A Unified Architecture
to Develop Interactive Knowledge Based Systems
Vládia Pinheiro, Elizabeth Furtado, and Vasco Furtado
Universidade de Fortaleza – UNIFOR
Av. Washington Soares 1321 – Edson Queiroz - Fortaleza – CE - Brasil - Cep: 60811-905
[email protected], {elizabet,vasco}@unifor.br
Abstract. A growing need related to the use of knowledge-based systems
(KBSs) is that these systems provide ways of adaptive interaction with the user.
A comparative analysis of approaches to develop KBSs allowed us to identify a
high functional quality level and a lack of integration of human factors in their
frameworks. In this article, we propose an approach to develop adaptive and interactive KBSs that integrate works from the Knowledge Engineering and HCI
areas, through the definition of a unified software architecture. A contribution
of this work is the use of interaction patterns in order to define the interaction
flow according to the user profile. These interaction patterns are defined for different kinds of interaction, such as, explanation, cooperation, argumentation or
criticism. The reusable architecture components were implemented using Java
and Protégé-2000, and they were used in a KBS for assessment of installments
of tax debts.
Keywords: Knowledge-based systems, reusable components, interaction patterns.
1 Introduction
The Knowledge Engineering area has evolved considerably since the early days of building Expert Systems, providing methods, technologies, and patterns for the
development of Knowledge-Based Systems (KBS). These systems are used in various
domains to solve problems that involve the human reasoning process.
Some Knowledge Engineering works concentrate on providing problem-solving method (PSM) libraries. A PSM describes the reasoning steps and the
knowledge roles used during the problem solving process, independent of the domain, allowing its reuse in many applications [1].
A growing need related to the use of KBSs is that these systems provide ways of
interaction with the user. Moulin et al. [2] argue that this kind of user-KBS interaction,
such as explanation, cooperation, argumentation, or criticism, allows a better level of
acceptance by users of the solutions proposed by the system.
McGraw [3] observes that novice users do not understand complex reasoning
strategies. This happens mainly because KBSs are built upon PSMs designed
according to the vision that experts have of the problem. That is, the
development of KBSs considers neither the end users' knowledge level nor their
point of view of the problem. Therefore, it is important that the user-KBS interaction
is adaptive according to users and to the context of use.
The Human-Computer Interaction (HCI) area develops methods and techniques to
build adaptive interactive systems. The focus of HCI research is on the people who
use the system, which tasks they execute, their ability level, preferences, and external
factors, such as organizational and environmental factors.
A comparative analysis of approaches to develop KBSs allowed us to identify a
high functional quality level and, on the other hand, a lack of integration of
human factors in their frameworks.
As a solution, we propose an approach to develop adaptive and interactive KBSs
that integrates work from the Knowledge Engineering and HCI areas, through the
definition of a unified software architecture. In this article, we show the implementation of the proposed architecture components, describing how they were used in a
KBS to evaluate the concession of installments of tax debts.
2 HCI Aspects for KBS Development
We studied some approaches on KBS development verifying how they treat aspects
related to HCI. The HCI aspects used in model-based user interface design are: user
modeling, context of use modeling, user tasks modeling, and adaptability.
These aspects are particularly relevant in the interactive KBS context. Kay [4] affirms that user modeling allows adapting the presentation of the information according to users, and facilitates the definition of the type of intervention that can be made
during the user-system collaborative processes. User task modeling, i.e., modeling which tasks are
performed through the system interface, allows the analysis of the interaction based
on the users’ point of view and identifies the information they need, as well as their
goals.
CommonKADS [5] defines phases in its methodology that consist in the construction of its models: Organization Model, Task Model, Knowledge Model, Agent
Model, and Communication Model. Specifically, the Agent Model, which describes
the abilities of the stakeholders when executing tasks, and the Communication Model,
which models how the agents communicate, already consider user modeling. However, CommonKADS does not use models for user-interaction design, nor for
the adaptation of the user interaction.
Sengès [6] proposes an extension of CommonKADS to allow the user-KBS cooperation during the system execution. She proposes a new model: the cooperation
model, which structures the sequence of resolution steps and the exchange of information according to the users’ knowledge level and to the organizational context.
However, the adaptation of the cooperation is defined by generating a cooperation
model for each kind of user during the KBS development. Therefore, this adaptation
is static, that is, the system is not capable of dynamically adapting itself to a new kind
of user.
Unified Problem-Solving Method Description Language (UPML) [7] describes
different KBS software components by integrating two important research lines in
Knowledge Engineering: component reusability and ontologies. UPML, besides
being an architecture, is also a KBS development framework because it describes
components, adaptors, architecture restrictions, development guidelines, and tools.
The architecture components are: (i) Task, that defines the problem that should be
solved by the KBS; (ii) PSM, that defines the reasoning process of a KBS; (iii) Domain Model, that describes the domain knowledge of the KBS; (iv) Ontologies, that
provide the terminology used in the other elements; (v) Bridges, which model the relationships between two UPML components; and (vi) Refiners, which can be used to
specialize a component. Each component in UPML is described independently
to enable reusability. For instance, problem-solving methods can be reused for different tasks and domains. This is possible because of the fifth element – bridges.
The comparative analysis of how HCI aspects are considered in the studied KBS
development approaches demonstrated that the model-based user interface design is
not taken into account by any of the approaches. However, CommonKADS and
Sengès already consider aspects such as user modeling, user task modeling, and context of use modeling, although, with some disadvantages. UPML does not consider
any HCI aspect and only the approach proposed by Sengès for cooperative KBSs uses
interaction modeling through the cooperation model.
3 A Unified Architecture for Interactive KBSs
The analysis of the integration of HCI aspects in the KBS development approaches
lead us to the definition of a software architecture that integrates works from the
Knowledge Engineering and HCI areas, aiming to meet the following requirements for interactive KBSs: knowledge modeling from reusable components for
problem-solving, user modeling, context of use modeling, and user task modeling for
adaptability.
This unified architecture integrates components of a KBS architecture, such as
UPML, and of the interactive systems architecture defined in [8], thus providing
components that consider the user's point of view during the KBS development.
A major contribution of our approach is the use of interaction patterns in order to
define the interaction flow according to the user profile. The interactive tasks performed by the users, such as requesting and receiving explanations or cooperating with the
KBS, are defined by means of design patterns [9], aiming at reducing the development
effort [10].
3.1 Architecture Description
The architecture components, presented in Figure 1, are separately described according to their responsibilities.
User-KBS Dialogue Control
Functional Core: it contains the PSM functionalities and the domain knowledge.
Dialogue Controller, Interface Toolkit, and Adaptors: these components are
responsible for controlling the interaction flow and presenting the information
to the user. The Dialogue Patterns Model, which is part of the Dialogue Controller, implements the dialogue patterns identified during the system user interface design [11]. Dialogue patterns are ways to present information and to
allow interaction according to the task to be performed, the user profile, data
types, etc.
Construction of the Models
PSMs Library: library that contains problem-solving methods.
Interaction Patterns Library: library that contains patterns for various forms of
user-KBS interaction, such as, explanation, cooperation, argument, and criticism.
Organization Model: it models the functional staff of the organization, in
which each function is associated to rules that define the behavior of users
who perform such function.
User Model: it represents characteristics of users, being individuals or grouped
in stereotypes. These characteristics can be: expertise level, domain concepts
known by the user, goals, etc.
Fig. 1. A Unified Architecture for Interactive KBSs.
Adaptability
Adaptability requires a dynamic user and context of use modeling, as well as the
choice of appropriate dialogue patterns. In order to provide adaptation during the
system execution, two servers work in the acquisition of dynamic information. They
are: the User and Organization Information Server and the Environment Parameters
Server.
The User and Organization Information Server contains the logic to infer information about the user model and the organization model, such as identifying which
concepts are known by users, allowing the adaptation of the type of explanation to be
provided. The Environment Parameters Server contains the logic to infer data about
the context of use, before and during the execution.
The information inferred by these servers is provided to the Decision Component,
which selects the appropriate dialogue patterns. For example, a dialogue pattern of
the plain text type is more appropriate to present the explanation content to novice
users.
Table 1 presents the activities supported by the architecture components, aiming to
meet the requirements for interactive KBSs.
4 Implementation of the Architecture Components
The generic architecture components were implemented using Java, with Protégé-2000 [12] as the UPML editor. This implementation focuses on reusability and,
therefore, these components can be reused in other applications. In the following, we detail
the implementation of each component:
Functional Core
For this component, we implemented the elements of the UPML architecture in the
Java classes: BridgeComponent, PSMComponent, TaskComponent, and DomainComponent. The reason for using the UPML framework is that this approach makes
the reasoning process explicit by implementing the PSM as part of the application.
This enhances, for instance, the quality of the explanation to be given to the user
because it allows a greater control of the reasoning steps. This implementation was
done according to the design patterns for translation of the UPML in Java defined in
[7].
The PSMComponent Java class contains the generic methods that execute the
mapping of domain-PSM and task-PSM, and that are responsible for the communication of the knowledge roles among the other UPML elements, and for the execution
of the sub-tasks associated to the PSM. These methods are defined in the Java interface BridgeComponent. For example, its method executeSubTask receives the name of
a sub-task as parameter; searches for the object related to that sub-task, and calls the
execute() method of this object. Figure 2 shows the Java implementation of this
method.
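Since Figure 2 is not reproduced here, the following fragment is only an illustrative reconstruction of what such a method could look like; apart from the class and method names cited in the text, everything (the subTasks map, the return types) is an assumption.

    import java.util.Map;

    // Illustrative sketch of the PSMComponent superclass described in the text;
    // only executeSubTask and the names cited in the paper are taken from the
    // description, the rest is assumed for the example.
    abstract class PSMComponent implements BridgeComponent {
        // Sub-task objects registered for this PSM, indexed by name (assumed field).
        protected Map<String, TaskComponent> subTasks;

        public Object executeSubTask(String subTaskName) {
            // Search for the object related to the given sub-task name...
            TaskComponent subTask = subTasks.get(subTaskName);
            // ...and delegate to its execute() method.
            return subTask.execute();
        }
    }

    interface BridgeComponent { }              // maps domain-PSM and task-PSM roles

    abstract class TaskComponent {             // each PSM sub-task implements execute()
        public abstract Object execute();
    }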
According to the definition of the design pattern for the UPML implementation,
each PSM can be implemented as a subclass of the PSMComponent class and the
subtasks of the PSM as methods.
The TaskComponent Java class is an abstract class responsible for providing the
knowledge roles necessary in each subclass. The PSM subtasks are implemented as
subclasses of this class and each subclass implements the abstract execute( ) method.
The DomainComponent Java class is responsible for defining the properties and
methods common to the various PSM knowledge roles. The ontology of the problem-solving method is implemented as subclasses of this class.
Fig. 2. Java implementation of the executeSubTask method from the PSMComponent superclass.
Interaction Patterns Library
This component contains design patterns, called interaction patterns, which define
how the interaction functionalities should be implemented in a KBS. In this article, we describe the implementation of an interaction pattern for explanation, composed of
two classes: Explanation and PSMLog. The Explanation class represents the explanation to be provided to the user that is defined by operations that answer the questions:
What (is this)?, How (did this happen)?, Why (did this happen)?. The PSMLog class
represents the KBS reasoning steps during the search for a solution to the problem.
The operations in this class are responsible for associating values to the attributes that
characterize each reasoning step.
Figure 3 presents the sequence diagram in the Unified Modeling Language (UML) representing the implementation of the interaction pattern for explanation. The interaction
flow is the following one: (i) the User requests an explanation and the DialogueController receives the object to be explained and the explanation type; (ii) the DialogueController requests the user and organization profiles to the UserOrganizationServer; (iii) the
UserOrganizationServer infers about UserModel and OrganizationModel and answers to the
DialogueController and to the DecisionComponent; (iv) the DialogueController requests the
explanation sending the object to be explained, the explanation type and the user and
organization information; (v) the Explanation defines the explanation adapted to the
UserModel and to the OrganizationModel. The Explanation executes a method according
to the explanation type; (vi) the DialogueController requests a dialogue pattern to the
DecisionComponent and shows the explanation using the dialogue pattern.
Figure 3 also presents an example of an algorithm in English that represents an
implementation of the explainWhat() method of the Explanation class, responsible for
defining the explanation of the type What (is this)? This method defines the explanation according to the object type (a Method, a Field, a Class or an Instance). For instance, when the object to be explained is an instance of a class, this method defines
the description from the domain concepts known by the user, which are modeled in
the UserModel.
Fig. 3. UML sequence diagram of the interaction pattern for explanation and an algorithm in
English of the implementation of the explainWhat() method.
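Again as an illustration (Figure 3 gives the algorithm in English), a possible Java rendering of explainWhat() could be the following; the ObjectKind enumeration and the helper methods are assumptions, not the authors' code.

    // Illustrative sketch of the explainWhat() operation of the Explanation class.
    class Explanation {
        enum ObjectKind { METHOD, FIELD, CLASS, INSTANCE }

        String explainWhat(Object target, ObjectKind kind, UserModel userModel) {
            if (kind == ObjectKind.INSTANCE) {
                // For an instance, build the description from the domain concepts
                // known by the user, as modeled in the UserModel.
                return userModel.describeWithKnownConcepts(target);
            }
            if (kind == ObjectKind.CLASS) {
                return "Concept: " + describeClass(target);
            }
            // Methods and fields: explain from their definition.
            return describeDefinition(target, kind);
        }

        private String describeClass(Object target) {                     // assumed helper
            return target.toString();
        }
        private String describeDefinition(Object target, ObjectKind kind) { // assumed helper
            return kind + ": " + target;
        }
    }

    class UserModel {
        String describeWithKnownConcepts(Object target) {                 // assumed helper
            return target.toString();
        }
    }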
Organization Model
The Java classes that implement this component are: (i) Organization Model, which
represents the generic rules applied for all users in the organization; (ii) OrganizationFunction, which represents the specific rules of each function in the organization
applied for the users who perform such function; (iii) OrganizationRule, which represents the organizational rules that are associated to the other two classes as generic or
specific rules.
User Model
The implemented user model represents the users’ stereotypes. According to Sengès
[6], we identified that KBS users can be classified as: domain expert users, expert
users in other knowledge domains, and general public users. The UserModel Java
class implements the user model, which is composed of three other classes: FunctionUser, ObjectiveUser, and DomainComponentUser. These classes represent parts
of the user model and contain, respectively, the expertise level according to the user
function in the organization, users’ goals, and domain concepts known by users.
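A minimal sketch of this composition, with assumed field names and types, could be:

    import java.util.List;

    // Sketch of the user-model classes named in the text; field names and types
    // are assumptions used only to illustrate the composition.
    class UserModel {
        FunctionUser function;                    // expertise level tied to the organizational function
        List<ObjectiveUser> objectives;           // the user's goals
        List<DomainComponentUser> knownConcepts;  // domain concepts known by the user
    }

    class FunctionUser {
        String organizationalFunction;
        String expertiseLevel;                    // e.g. "domain expert" or "general public"
    }

    class ObjectiveUser {
        String goalDescription;
    }

    class DomainComponentUser {
        String conceptName;                       // a domain concept known by this user
        String userTerminology;                   // the term by which the user knows it
    }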
Adaptability Components and Dialogue Controller
The components responsible for Adaptability and Dialogue Controller were implemented in the following Java classes: UserOrganizationServer, EnvironmentServer,
DecisionComponent, and DialogueController.
5 An Example of Adaptive Interaction in a KBS
In order to demonstrate how the use of the generic architecture components facilitates
the development of adaptive interactions, we used the example of a KBS that evaluates the concession of installment plans for tax debts, in which there is an explanation dialogue about the evaluation process.
One requirement is that this KBS provides explanations adapted to the users.
This knowledge-based application evaluates a set of criteria based on the taxpayer
data and on the installment request. After the criteria evaluation, the system must
decide whether or not to grant the installment plan request.
The functional core of this KBS was implemented as subclasses of the UPML generic classes. The abstract-and-match PSM for assessment tasks was implemented as
subclasses of the PSMComponent class. The tax installment plan domain model was
implemented as subclasses of the DomainComponent class.
The users of this KBS are tax auditors or directors, experts on the tax domain, or
the actual taxpayers who request installments of their debts through the Internet.
Therefore, we identified two user stereotypes: domain experts and general public
users. The User Model of this application is mapped to the domain model of the
UPML Architecture through a bridge. This way, the domain concepts known by users
and the domain concepts known by experts are related.
In this KBS, the heuristic used to adapt the explanation is the following one: general public users receive simple explanations with the terminology known by them,
and the domain expert users receive contextual explanations that show the hierarchy
of the knowledge involved.
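This heuristic, combined with the dialogue patterns mentioned in Section 3.1, can be sketched as a simple decision rule; the enumeration and method names below are ours, not the application's actual code.

    // Sketch of the adaptation heuristic applied by the Decision Component in this
    // application: the stereotype stored in the user model selects the dialogue
    // pattern used to present the explanation.
    class DecisionComponent {
        enum Stereotype { GENERAL_PUBLIC, DOMAIN_EXPERT }
        enum DialoguePattern { PLAIN_TEXT, INTERACTIVE_TREE }

        DialoguePattern selectExplanationPattern(Stereotype stereotype) {
            if (stereotype == Stereotype.DOMAIN_EXPERT) {
                // Experts receive a contextual explanation organized as a knowledge hierarchy.
                return DialoguePattern.INTERACTIVE_TREE;
            }
            // General public users receive a simple explanation in their own terminology.
            return DialoguePattern.PLAIN_TEXT;
        }
    }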
Figure 4 presents the explanation dialogue during the evaluation process with a
general public user (a) and a domain expert user (b). The question is: What is the tax
evasion level? This expert concept, “tax evasion level”, is mapped to the “tax fraud level”
concept from the user model for general public users. Notice that the explanation
given about the same concept to a domain expert user is presented in an interactive tree dialogue
pattern, which facilitates the organization of the knowledge hierarchy. Besides this, the description presented for the concept is different because it was retrieved from the user model for domain expert users.
Fig. 4. An example of adaptive explanation for a general public user (a) and for a domain expert user (b).
6 Conclusion
In this article, based on the growing need for knowledge-based systems to allow
interaction with their end users, we evaluated how some KBS development approaches
consider HCI aspects. This analysis pointed out the lack of an approach that completely considers aspects such as: knowledge modeling from reusable components for
problem-solving, user modeling, context of use modeling, user task modeling, use of
usability patterns, and adaptability.
Therefore, we defined components of a software architecture for interactive KBSs
that unifies a KBS development architecture, such as UPML, and an architecture for
interactive systems. Two characteristics stand out in this architecture: interaction adaptation
based on user modeling and organizational context modeling, and the construction of
the user-KBS dialogue based on interaction patterns. Interaction patterns provide a
solution to implement interaction adaptation to various users, independent of the
domain.
Another contribution of this work was the implementation of generic components
of the architecture in Java. This way, the architecture components are available to be
reused in other interactive knowledge-based applications. In this article, we exemplified the use of the architecture in adapting user-KBS interaction in a KBS that evaluates the concession of installment plans for tax debts. Specifically, the interaction consists of dialogues
for explanations to various kinds of KBS users about the installment evaluation process and about the domain concepts.
As future work, we intend to apply this architecture in the development of other interactive applications, as a way to enhance its validation and maturity. An important
extension for this work is the development of plug-ins in Protégé for the architecture
generic components. Thus, the architecture can be integrated with a powerful modeling and knowledge acquisition tool.
References
1. Fensel, D. and Benjamins, V.R., Key Issues for Automated Problem-Solving Methods Reuse. 13th European Conference on Artificial Intelligence, ECAI98, Wiley & Sons Pub,
1998.
2. Moulin, B., et al. Explanation and Argumentation Capabilities: Towards the Creation of
More Persuasive Agents. Artificial Intelligence Review, Kluwer Academic Publishers, 17:
169-222, 2002.
3. McGraw, K.L., Designing and evaluating User Interface for Knowledge-Based Systems.
Ellis Hordwood series in Interactive Information Systems, 1993.
4. Kay, J. User Modeling for Adaptation. User Interfaces for All – Concepts, Methods and
Tools, LEA Publishers. London. 271-294,2001.
5. Schreiber et al., Knowledge Engineering and Management: The CommonKADS Methodology. The MIT Press. Cambridge, MA, 2000.
6. Senges, V., Coopération Homme-Machine dans les Systèmes à Base de Connaissances.
Thèse de l'Université Toulouse, 1994.
7. Fensel, D. et al., The Unified Problem-Solving Method Development Language UPML.
Knowledge and Information Systems, An International Journal, 5, 83-127, 2003.
8. Savidis, A. and Stephanidis, C., The Unified User Interface Software Architecture. User Interfaces for All – Concepts, Methods and Tools, LEA Publishers. London. 389-415, 2001.
9. Gamma, E., Helm, R., Johnson, R., and Vlissides, J., Design Patterns: Elements of Reusable
Object-Oriented Software. Reading, MA, Addison-Wesley, 1995.
10. Pinheiro, V. Furtado, V. An Architecture for Interactive Knowledge-Based Systems. ACM
International Conference Proceeding Series, Proceedings of the Latin American conference
on Human-computer interaction, Rio de Janeiro, Brazil, 2003.
11. Savidis, A., Akoumianakis, D., Stephanidis, C., The Unified User Interface Design Method.
User Interfaces for All – Concepts, Methods and Tools, LEA Publishers. London. 417-440,
2001.
12. Eriksson H, Fergerson R.W., Shahar Y, Musen M. A. Automatic generation of ontology editors. In Proceedings of the
Banff Knowledge Acquisition for Knowledge-based Systems Workshop. Banff, Alberta, Canada. 1999.
Evaluation of Methods for Sentence
and Lexical Alignment of Brazilian Portuguese
and English Parallel Texts
Helena de Medeiros Caseli,
Aline Maria da Paz Silva, and Maria das Graças Volpe Nunes
NILC-ICMC-USP, CP 668P, 13560-970 São Carlos, SP, Brazil
{helename,alinepaz,gracan}@icmc.usp.br
http://www.nilc.icmc.usp.br
Abstract. Parallel texts, i.e., texts in one language and their translations to other languages, are very useful nowadays for many applications
such as machine translation and multilingual information retrieval. If
these texts are aligned at the sentence or lexical level, their relevance increases considerably. In this paper we describe some experiments that
have been carried out with Brazilian Portuguese and English parallel
texts by the use of well known alignment methods: five methods for sentence alignment and two methods for lexical alignment. Some linguistic
resources were built for these tasks and they are also described here. The
results have shown that sentence alignment methods achieved 85.89% to
100% precision and word alignment methods, 51.84% to 95.61% on corpora from different genres.
Keywords: Sentence alignment, Lexical alignment, Brazilian Portuguese
1 Introduction
Parallel texts – texts with the same content written in different languages – are
becoming more and more available nowadays, mainly on the Web. These texts
are useful for applications such as machine translation, bilingual lexicography
and multilingual information retrieval. Furthermore, their relevance increases
considerably when correspondences between the source and the target (source’s
translation) parts are tagged.
One way of identifying these correspondences is by means of alignment. Aligning two (or more) texts means to find correspondences (translations) between
segments of the source text and segments of its translation (the target text).
These segments can be the whole text or its parts: chapters, sections, paragraphs,
sentences, words or even characters. In this paper, the focus is on sentence and
lexical (or word) alignment methods.
The importance of sentence and word aligned corpora has increased mainly
due to their use in Example Based Machine Translation (EBMT) systems. In
this case, parallel texts can be used by machine learning algorithms to extract
translation rules or templates ([1], [2]).
The purpose of this paper is to report the results of experiments carried out
on sentence and lexical alignment methods for Brazilian Portuguese (BP) and
English parallel texts. As far as we know, this is the first work on aligners involving BP. Previous work on sentence alignment involving European Portuguese has
shown values similar to those of the experiments for BP described in this paper. In [3],
for example, the Translation Corpus Aligner (TCA) has shown 97.1% precision
on texts written in English and European Portuguese.
In a project carried out to evaluate sentence and lexical alignment systems,
the ARCADE project, twelve sentence alignment methods were evaluated and achieved
over 95% precision, while the five lexical alignment methods
achieved 75% precision ([4]).
The lower precision for lexical alignment is due to its hard nature and it still
remains problematic as shown in previous evaluation tasks, such as ARCADE.
Most alignment systems deal with the stability of the order of translated segments, but this property does not stand to lexical alignment due to the syntactic
difference between languages1.
This paper is organized as follows: Section 2 presents an overview of alignment methods, with special attention to the five sentence alignment methods and
the two lexical alignment methods considered in this paper. Section 3 describes
the linguistic resources developed to support these experiments and Section 4
reports the results of the seven alignment methods evaluated on BP-English
parallel corpora. Finally, in Section 5 some concluding remarks are presented.
2
Alignment Methods
Parallel text alignment can be done on different levels: from the whole text to its
parts (paragraphs, sentences, words, etc.). At the sentence level, given two parallel texts, a sentence alignment method tries to find the best correspondences between source and target sentences. In this process, the methods can use information about sentence length, cognate and anchor words, POS tags and other clues. This information constitutes the alignment criteria of these methods.
At the lexical level, the alignment can be divided into two steps: a) the identification of word units in the source and in the target texts; b) the establishment of correspondences between the identified units. However, in practice the modularization of these tasks is not so simple, considering that a single unit can correspond to a multiword unit. A multiword unit is a word group that expresses ideas and concepts that cannot be explained or defined by a single word, such as
phrasal verbs (e.g., “turn on”) and nominal compounds (e.g., “telephone box”).
In both sentence and lexical alignments the most frequent alignment category
is 1-1, in which one unit (sentence or word) in the source text is translated
exactly to one unit (sentence or word) in the target text. However, there are
other alignment categories, such as omissions (1-0 or 0-1), expansions (n-m,
with n < m; n, m >= 1), contractions (n-m, with n > m; n, m >= 1) or unions
1
Gaussier, E., Langé, J.-M.: Modèles statistiques pour l’extraction de lexiques
bilingues. T.A.L. 36 (1–2) (1995) 133–155 apud [5].
(n-n, with n > 1). At the lexical level, categories other than 1-1 are more frequent than at the sentence level, as exemplified by multiword units.
2.1
Sentence Alignment Methods
The sentence alignment methods evaluated here are: GC ([6], [7]), GMA
and GSA+ ([8], [9]), Piperidis et al. ([10]) and TCA ([11]).
GC (its authors’ initials) is a sentence alignment method based on a simple
statistical model of sentence lengths, in characters. The main idea is that longer
sentences in the source language tend to have longer translations in the target
language and that shorter sentences tend to be translated into shorter ones.
GC is the most referenced method in the literature and it presents the best
performance considering its simplicity.
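As an illustration of this length-based idea, the sketch below (our own hedged reconstruction, not the authors' code) scores a candidate pairing of sentence lengths with the commonly cited Gale and Church parameters; the constants and category priors are assumptions taken from [7], and a full aligner would embed this cost in a dynamic program over alignment categories.

```python
import math

C, S2 = 1.0, 6.8     # assumed parameters: target chars per source char, variance
PRIOR = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
         (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}   # approximate category priors

def length_cost(l1, l2, category=(1, 1)):
    """-log probability that segments of l1 and l2 characters are mutual translations."""
    delta = (l2 - l1 * C) / math.sqrt(max(l1, 1) * S2)
    # two-tailed probability of a deviation at least this large under N(0, 1)
    p_delta = 2 * (1 - 0.5 * (1 + math.erf(abs(delta) / math.sqrt(2))))
    return -math.log(max(p_delta, 1e-12)) - math.log(PRIOR[category])

# A 40-char sentence paired with a 42-char translation should cost far less
# than the same sentence paired with a 90-char one.
print(length_cost(40, 42), length_cost(40, 90))
```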
GMA and GSA+ methods use a pattern recognition technique to find the
alignments between sentences. The main idea is that the two halves of a bitext
– source and target sentences – are the axes of a rectangular bitext space where
each token is associated with the position of its middle character. When a token at position x in the source text and a token at position y in the target text correspond to each other, the pair (x, y) is said to be a point of correspondence.
These methods use two algorithms for aligning sentences: SIMR (Smooth Injective Map Recognizer) and GSA (Geometric Segment Alignment). The SIMR
algorithm produces points of correspondence (lexical alignments) that are the
best approximation of the correct translations (bitext maps) and GSA aligns the
segments based on these resultant bitext maps and information about segment
boundaries. The difference between GMA and GSA+ methods is that, in the
former, SIMR considers only cognate words to find out the points of correspondence, while in the latter a bilingual anchor word list2 is also considered.
Piperidis et al.'s method is based on a critical issue in translation: meaning preservation. Traditionally, the four major classes of content words (or open
class words) – verb, noun, adjective and adverb – carry the most significant
amount of meaning. So, the alignment criterion used by this method is based on
the semantic load of a sentence3, i.e., two sentences are aligned if, and only if,
the semantic loads of source and target sentences are similar.
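The sketch below illustrates one possible reading of the semantic-load criterion; the toy POS tags, the overlap measure and the threshold are our assumptions, since the text only states that sentences are aligned when their semantic loads are similar.

```python
from collections import Counter

OPEN_CLASSES = {"verb", "noun", "adjective", "adverb"}   # content-word classes

def semantic_load(tagged_sentence):
    """tagged_sentence: list of (word, pos) pairs; keep only open-class tags."""
    return Counter(pos for _, pos in tagged_sentence if pos in OPEN_CLASSES)

def similar_load(s1, s2, threshold=0.7):       # the threshold is an assumption
    l1, l2 = semantic_load(s1), semantic_load(s2)
    overlap = sum((l1 & l2).values())
    return overlap / max(sum(l1.values()), sum(l2.values()), 1) >= threshold

src = [("o", "article"), ("menino", "noun"), ("corre", "verb")]
tgt = [("the", "article"), ("boy", "noun"), ("runs", "verb")]
print(similar_load(src, tgt))    # True: both carry one noun and one verb
```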
Finally, TCA (Translation Corpus Aligner) relies on several alignment criteria
to find out the correspondence between source and target sentences, such as a
bilingual anchor word list, words with an initial capital (candidates for proper
nouns), special characters (such as question and exclamation marks), cognate
words and sentence lengths.
2 An anchor word list is a list of words in the source language and their translations in the target language. If a pair source_word/target_word from this list appears in the source and target sentences respectively, it is taken as a point of correspondence between these sentences.
3 The semantic load of a sentence is defined, in this case, as the union of all open classes that can be assigned to the words of this sentence ([10]).
2.2
Lexical Alignment Methods
The lexical alignment methods evaluated here were: SIMR ([12], [9], [13]) and
LWA ([14], [15], [16]).
The SIMR method is the same one used in the sentence alignment task (see Section 2.1). This method considers only single words (not multiword units) in its
alignment process.
The LWA (Linköping Word Aligner) is based on co-occurrence information
and some linguistic modules to find correspondences between source and target
lexical units (words and multiwords). Three linguistic modules were used by
this method: the first one is responsible for the categorization of the units, the
second one deals with multiword units using multiword unit lists and the last
one establishes an area (a correspondence window) within which the correspondences
will be looked for.
3
Linguistic Resources
3.1
Linguistic Resources for Sentence Alignment
The required linguistic resources for sentence alignment methods can be divided
into two groups: corpora and anchor word lists ([17]). For testing and evaluation
purposes, three BP-English parallel corpora of different genres – scientific, law
and journalistic – were built: CorpusPE, CorpusALCA and CorpusNYT.
CorpusPE is composed of 130 authentic (non-revised) academic parallel texts
(65 abstracts in BP and 65 in English) on Computer Science. A version of this corpus revised by a human translator was also generated. The two versions were named authentic CorpusPE and pre-edited CorpusPE, respectively.
Authentic CorpusPE has 855 sentences, 21432 words and 7 sentences per
text on average. Pre-edited CorpusPE has 849 sentences, 21492 words and also
7 sentences per text on average. These two corpora were used to investigate the
methods’ performance on texts with (authentic) and without (pre-edited) noise
(grammatical and translation errors).
CorpusALCA is composed of 4 official documents of the Free Trade Area of the Americas (FTAA)4 written in BP and in English, with 725 sentences, 22069 words and 91 sentences per text on average.
Finally, CorpusNYT is composed of 8 articles in English and their translations into BP from the newspaper "The New York Times"5. It has 492 sentences, 11516 words and 30 sentences per text on average.
To test and evaluate the methods, two corpora were built (test and reference)
based on the four previous corpora. Texts in the test corpora were given as
input for the five sentence alignment methods. Reference corpora – composed of
correctly aligned parallel texts – were built in order to calculate precision and
recall metrics for the test texts.
4 Available at http://www.ftaa-alca.org/alca_e.asp.
5 Available at http://www.nytimes.com (English version) and http://ultimosegundo.ig.com.br/useg/nytimes (BP version).
The texts of test and reference corpora have been tagged to distinguish paragraphs and sentences. Tags for aligned sentences were also manually introduced
in the reference corpora. A tool for aiding this pre-processing was especially
implemented [18].
Most of the alignments in the reference corpora (94%), as expected, are of
type 1-1 while omissions, expansions, contractions and unions are quite rare.
Other linguistic resources developed include an anchor word list for each
corpus genre: scientific, law and journalistic. Examples of BP/English anchor
words found in these lists are: “abordagem/approach”, “algoritmo/algorithm”
(in scientific list); “adoção/adoption”, “afetado/affected” (in law list) and “armas/weapons”, “ataque/attack” (in journalistic list).
3.2
Linguistic Resources for Lexical Alignment
The linguistic resources for lexical alignment methods can be divided into two
groups: corpora and multiword unit lists.
For testing and evaluation purposes, three corpora were used: pre-edited CorpusPE6, CorpusALCA and CorpusNYT, the same corpora built for the sentence
alignment task (see Section 3.1). Texts in the test corpora were automatically
tagged with word boundaries and reference corpora were also built with alignments of words and multiwords.
Multiword unit lists contain the multiwords that have to be considered during
the lexical alignment process. For the extraction of these lists, the following corpora were used: texts on Computer Science from the ACM Journals (704915
English words); academic texts from Brazilian Universities (809708 BP words);
journalistic texts from the journal “The New York Times” (48430 English words
and 17133 BP words) and official texts from ALCA documentation (251609
English words and 254018 BP words).
The multiword unit lists were built using automatic extraction algorithms
followed by a manual analysis done by a human expert. The algorithms used for
automatic extraction of multiword units were NSP (N-gram Statistic Package)7
and another which was implemented based on the Mutual Expectation technique
[19]. Through this process, three lists (for each language) were generated by each
algorithm and the final English and BP multiword lists have 240 and 222 units
respectively.
Some examples of multiwords in these lists are: “além disso”, “nações unidas”
and “ou seja” for BP; “as well as”, “there are” and “carry out” for English8.
4
Evaluation and Results
The experiments described in this paper used the precision, recall and F-measure
metrics to evaluate the alignment methods. Precision stands for the number of
6 Note that CorpusPE was evaluated with 64 pairs rather than 65 because we noted that one of the pairs was not parallel at the lexical level.
7 Available at http://www.d.umn.edu/~tdeperse/code.html.
8 For more details on the automatic extraction of multiword unit lists see [20].
correct alignments divided by the number of proposed alignments; recall stands for the number of correct alignments divided by the number of alignments in the reference corpus; and F-measure is the combination of these two metrics [4].
The values for these metrics range between 0 and 1 where a value close to
0 indicates a bad performance of the method while a value close to 1 indicates
that the method performed very well.
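A minimal sketch of these metrics, treating each alignment as a hashable pair of sentence-index tuples (this representation is our choice, not the evaluation scripts actually used):

```python
def evaluate(proposed, reference):
    """proposed, reference: sets of alignments (pairs of index tuples)."""
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# 3 of the 4 proposed alignments appear among the 5 reference alignments.
proposed = {((1,), (1,)), ((2,), (2,)), ((3,), (3, 4)), ((4,), (5,))}
reference = {((1,), (1,)), ((2,), (2,)), ((3,), (3, 4)), ((4,), (5, 6)), ((5,), (7,))}
print(evaluate(proposed, reference))   # (0.75, 0.6, 0.666...)
```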
4.1
Evaluation and Results of Sentence Alignment Methods
Precision, recall and F-measure for each corpus of test corpora (see Section 3.1)
are shown in Table 1.
Note that only the GMA, GSA+ and TCA methods were evaluated on CorpusNYT, because this corpus was evaluated later and only the methods which had shown better performance were considered in this last experiment.
It can be noticed that precision ranges between 85.89% and 100% and recall is between 85.71% and 100%. The best methods considering these metrics
were GMA/GSA+ for CorpusPE (authentic and pre-edited) and TCA for CorpusALCA and CorpusNYT.
Taking into account these results, it is possible to notice that all methods
performed better on pre-edited CorpusPE than on the authentic one, as already evidenced by other experiments [21]. These two corpora have some features which distinguish them from the other two. Firstly, the average text length
(in words) in the former two is much smaller than in the latter two (BP=175,
E=155 on authentic CorpusPE and BP=173, E=156 on pre-edited CorpusPE
versus BP=2804, E=2713 on CorpusALCA and BP=772, E=740 on CorpusNYT). Secondly, texts in CorpusPE have more complex alignments than those
in law and journalistic corpora. For example, CorpusPE contains six 2-2 alignments while 99.7% and 96% of all alignments in CorpusALCA and CorpusNYT,
respectively, are 1-1.
These differences between authentic/pre-edited CorpusPE and CorpusALCA/CorpusNYT probably cause the differences in the methods' performance on these corpora. Text length also affected the alignment task, since the greater the number of sentences, the greater the number of sentence combinations to be tried during alignment.
Besides the three metrics, the methods were also evaluated by considering
the error rate per alignment category. The highest error rates were in the 2-3, 2-2 and omission (0-1 and 1-0) categories. The error rate for 2-3 alignments was 100% for all methods (i.e., none of them correctly aligned the single 2-3 alignment in
authentic CorpusPE). In 2-2 alignments, the error rate for GC and GMA was
83.33% while for the remaining methods it was 100%.
TCA had the lowest error rate in omissions (40%), followed by GMA and
GSA+ (80% each), while the other methods had a 100% error rate in this category. It can be noticed that only the methods that consider cognate words as
an alignment criterion had success in omissions. In [7], Gale and Church had already mentioned the necessity of considering language-specific methods to deal
adequately with this alignment category and this point was confirmed by the
results reported in this paper.
As expected, all methods performed better on 1-1 alignments, and their error rate in this category was between 2.88% and 5.52%.
4.2
Evaluation and Results of Lexical Alignment Methods
Precision, recall and F-measure for each corpus of test corpora (see Section 3.2)
are shown in Table 2.
The SIMR method had better precision (91.01% to 95.61%) than LWA (51.84% to 62.15%), but its recall was very low (16.79% to 20%), which can be a problem for many applications such as bilingual lexicography. The high precision, on the other hand, can be explained by its very accurate alignment criterion, based only on cognate words.
LWA had a better balance between precision and recall: 51.84% to 62.15% and 59.38% to 65.14%, respectively. These values are quite different from those obtained in an experiment carried out on the English-Swedish pair, in which LWA achieved 83.9% to 96.7% precision and 50.9% to 67.1% recall ([15]), but are close to those obtained in another experiment carried out on the English-French pair, in which LWA achieved 60% precision and 57% recall ([4]). So, for languages of a similar nature, like French and BP, the values were very close.
LWA's partially correct link proposals were also evaluated, using the metrics proposed in [22]. With these metrics, precision improved by 12% to 16% (from 51.84%–62.15%, considering only totally correct alignments, to 66.87%–74.86%, considering also partially correct alignments), while recall improved by almost 1% (from 59.38%–65.14% to 59.81%–65.82%, considering totally and partially correct alignments respectively).
5
Some Conclusions
This paper has described some experiments carried out on five sentence alignment methods and two lexical alignment methods for BP-English parallel texts.
The precision and recall values obtained by all sentence alignment methods in almost all corpora are above 95%, which is the average value reported in the literature [4]. However, due to the very similar performances of the methods, at this moment it is not possible to choose one of them as the best sentence alignment method for BP-English parallel texts. More tests are necessary (and will be done) to determine the influence of alignment categories, text length and genre on the methods' performance.
For lexical alignment, SIMR was the method that presented the best precision, but its recall was very low and it does not deal with multiwords. LWA, on
the other hand, achieved a better recall and it is able to deal with multiwords,
but its precision was not as good as SIMR's. Considering multiword units, the literature has not yet established an average value for precision and recall, but it is clear, and this work has stressed, that corpus size and language pair have a great influence on the aligners' performance ([15], [4]).
The results for the sentence alignment methods have confirmed the values reported in the literature, while the results for the lexical alignment methods have demonstrated that there is still room for improvement.
In spite of this, this work has especially contributed to research on computational linguistics involving Brazilian Portuguese by implementing, evaluating and distributing a great number of potentially useful resources for important applications such as machine translation and information retrieval.
Acknowledgments
We would like to thank FAPESP, CAPES and CNPq for financial support.
References
1. Carl, M.: Inducing probabilistic invertible translation grammars from aligned texts.
In: Proceedings of CoNLL-2001, Toulouse, France (2001) 145–151
2. Menezes, A., Richardson, S.D.: A best-first alignment algorithm for automatic
extraction of transfer mappings from bilingual corpora. In: Proceedings of the
Workshop on Data-driven Machine Translation at 39th Annual Meeting of the
Association for Computational Linguistics (ACL'01), Toulouse, France (2001) 39–
46
3. Santos, D., Oksefjell, S.: An evaluation of the translation corpus aligner, with special reference to the language pair English-Portuguese. In: Proceedings of the 12th "Nordisk datalingvistikkdager", Trondheim, Departamento de Lingüística, NTNU
(2000) 191–205
4. Véronis, J., Langlais, P.: Evaluation of parallel text alignment systems: The ARCADE project. In Véronis, J., ed.: Parallel text processing: Alignment and use of
translation corpora, Kluwer Academic Publishers (2000) 369–388
5. Kraif, O.: From translation data to contrastive knowledge: Using bi-text for bilingual lexicon extraction. International Journal of Corpus Linguistics 8:1 (2003)
1–29
6. Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora.
In: Proceedings of the 29th Annual Meeting of the Association for Computational
Linguistics (ACL), Berkeley (1991) 177–184
7. Gale, W.A., Church, K.W.: A program for aligning sentences in bilingual corpora.
Computational Linguistics 19 (1993) 75–102
8. Melamed, I.D.: A geometric approach to mapping bitext correspondence. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing,
Philadelphia, Pennsylvania (1996) 1–12
9. Melamed, I.D.: Pattern recognition for mapping bitext correspondence. In Véronis,
J., ed.: Parallel text processing: Alignment and use of translation corpora, Kluwer
Academic Publishers (2000) 25–47
10. Piperidis, S., Papageorgiou, H., Boutsis, S.: From sentences to words and clauses. In
Véronis, J., ed.: Parallel text processing: Alignment and use of translation corpora,
Kluwer Academic Publishers (2000) 117–138
11. Hofland, K.: A program for aligning English and Norwegian sentences. In Hockey,
S., Ide, N., Perissinotto, G., eds.: Research in Humanities Computing, Oxford,
Oxford University Press (1996) 165–178
12. Melamed, I.D.: A portable algorithm for mapping bitext correspondence. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. (1997) 305–312
13. Melamed, I.D., Al-Adhaileh, M.H., Kong, T.E.: Malay-English bitext mapping
and alignment using SIMR/GSA algorithms. In: Malaysian National Conference
on Research and Development in Computer Science (REDECS’01), Selangor Darul
Ehsan, Malaysia (2001)
14. Ahrenberg, L., Andersson, M., Merkel, M.: A simple hybrid aligner for generating lexical correspondences in parallel texts. In: Proceedings of Association for
Computational Linguistics. (1998) 29–35
15. Ahrenberg, L., Andersson, M., Merkel, M.: A knowledge-lite approach to word
alignment. In Véronis, J., ed.: Parallel text processing: Alignment and use of translation corpora. (2000) 97–116
16. Ahrenberg, L., Andersson, M., Merkel, M.: A system for incremental and interactive word linking. In: Third International Conference on Language Resources and
Evaluation (LREC 2002), Las Palmas (2002) 485–490
17. Caseli, H.M., Nunes, M.G.V.: A construção dos recursos lingüísticos do projeto
PESA. Série de Relatórios do NILC NILC-TR-02-07, NILC,
http://www.nilc.icmc.usp.br/nilc/download/NILC-TR-02-07.zip (2002)
18. Caseli, H.M., Feltrim, V.D., Nunes, M.G.V.: TagAlign: Uma ferramenta de préprocessamento de textos. Série de Relatórios do NILC NILC-TR-02-09, NILC,
http://www.nilc.icmc.usp.br/nilc/download/NILC-TR-02-09.zip (2002)
19. Dias, G., Kaalep, H.: Automatic extraction of multiword units for Estonian: Phrasal
verbs. In Metslang, H., Rannut, M., eds.: Languages in Development. Number 41
in Linguistic Edition, Lincom-Europa, München (2002)
20. Silva, A.M.P., Nunes, M.G.V.: Extração automática de multipalavras. Série de
Relatórios do NILC NILC-TR-03-11, NILC,
http://www.nilc.icmc.usp.br/nilc/download/NILC-TR-03-11.zip (2003)
21. Gaussier, E., Hull, D., Aït-Mokthar, S.: Term alignment in use: Machine-aided
human translation. In Véronis, J., ed.: Parallel text processing: Alignment and use
of translation corpora, Kluwer Academic Publishers (2000) 253–274
22. Ahrenberg, L., Merkel, M., Hein, A.S., Tiedemann, J.: Evaluation of word alignment systems. In: Proceedings of 2nd International Conference on Language Resources & Evaluation (LREC 2000). (2000) 1255–1261
Applying a Lexical Similarity Measure
to Compare Portuguese Term Collections
Marcirio Silveira Chaves and Vera Lúcia Strube de Lima
Pontifícia Universidade Católica do Rio Grande do Sul - PUCRS
Faculdade de Informática - FACIN
Programa de Pós-Graduação em Ciência da Computação - PPGCC
Av. Ipiranga, 6681 - Partenon - Porto Alegre - RS
CEP 90619-900
{mchaves,vera}@inf.pucrs.br
Abstract. The number of ontologies publicly available and accessible
through the web has increased in the last years, so that the task of
finding similar terms1 among these structures becomes mandatory. We
depict the application and the evaluation of a new similarity measure
for comparing Portuguese Ontological Structures (OSs) called Lexical
Similarity (LS). This paper describes contributions to the study and
application of mapping between terms present in multidomain OSs. In
order to approach this mapping we combine preliminary similarity measures and heuristics. Our measure uses a stemmer; it builds upon the String Matching (SM) measure proposed in [1] and it was evaluated by means of a comparison to human evaluation. Finally, we concentrate on the application of the LS measure to terms belonging to same-domain thesauri and
discuss the results obtained.
Keywords: Lexical Similarity Measure, Mapping, Ontological Structures
1
Introduction
The automatic mapping between Ontological Structures (OSs) has been a continuous concern in the integration and reuse of knowledge. However, the manual execution of such a task is quite tedious and slow, so it is important to
automate it, at least partially.
In this work, OSs are understood as sets of pre-defined terms explicitly connected by semantic relations in a format, which is readable by humans and
machines. This notion is suitable for collections of vocabularies as well as for
collections of concepts.
Several efforts to map different OSs have been reported in the literature for English [2–4] and German [1]. However, no works dealing with Portuguese OSs have been found. We concentrate our efforts on
1
The words “terms” and “concepts” will be used with the same meaning in this
article.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 194–203, 2004.
© Springer-Verlag Berlin Heidelberg 2004
Portuguese OSs, developing, testing, validating and evaluating our own measure to help detect similar terms between OSs that were designed independently, building on previous studies [1,3].
This paper is organized as follows. Section 2 describes the SM measure [1]. Section 3 details the similarity measure proposed in this paper. The experiments carried out on multidomain Portuguese OSs are presented in
Section 4. Section 5 presents the experiments with thesauri belonging to the
same domain. Finally, Section 6 gives an outlook on future work.
2
Maedche and Staab Measure
Maedche and Staab [1] present a two layer approach, first lexical and then conceptual, to measure the similarity between terms of different OSs. At the lexical
level, they consider the Edit Distance (ED) formulated by Levenshtein [5]. This
distance contemplates the minimum number of insertions, deletions or substitutions (reversals) necessary to transform one string into another using a dynamic
programming algorithm. The contribution of Maedche and Staab consists of the
String Matching (SM) measure, given by:
SM(L_i, L_j) = max(0, (min(|L_i|, |L_j|) - ED(L_i, L_j)) / min(|L_i|, |L_j|))    (1)
The SM measure calculates the similarity between two terms L_i and L_j. The length in characters of the shortest term is represented by min(|L_i|, |L_j|). For example, to obtain the similarity between the terms (comerciario, comerciante), the minimum length is 11 and ED is 3 (changing "r" to "n" and inserting "t" and "e"). Thus, the resulting value for SM(comerciario, comerciante) is (11 - 3)/11 = 0.73.
This measure always returns a value between 0 and 1, where 1 stands for
perfect match and zero indicates absence of match. Maedche and Staab worked
with German OSs from the tourism domain. However, while applying the SM measure to Portuguese OSs, many terms were mapped inconsistently. In order to get better results we developed our own measure, which was validated and evaluated2.
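For concreteness, the following sketch (our own illustration, not the authors' code) computes the SM measure with a standard dynamic-programming edit distance; it reproduces the comerciario/comerciante example above.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion from a
                           cur[j - 1] + 1,               # insertion into a
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def sm(t1, t2):
    """String Matching: edit distance normalised by the shorter term's length."""
    shortest = min(len(t1), len(t2))
    return max(0.0, (shortest - edit_distance(t1, t2)) / shortest)

print(round(sm("comerciario", "comerciante"), 2))   # 0.73
```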
3
Lexical Similarity Measure
We propose an alternative to SM measure which is based on the radicals3 of the
words. Generally, these radicals are the most representative part of a word in
Portuguese, and they can be extracted with the help of a stemmer. We used a
stemmer that was specifically developed for Portuguese by Orengo and Huyck,
2 Detailed results, experiments, validation and evaluation can be found in [6].
3 The term radical as used in this article represents the initial character string of a word and not necessarily the linguistic concept of radical.
which presented good performance when compared [7] to the Porter algorithm and another one [8]. Our proposal is named Lexical Similarity (LS) and it is expressed by equation 2, where the two terms being compared are represented by T_i (a term of OS_1) and T_j (a term of OS_2):
LS(T_i, T_j) = min_k SM'(R_k(T_i), R_k(T_j))    (2)
Terms can be formed by a single word or by more than one word. The LS measure, in contrast to the SM measure, considers only the radical of each word, instead of the complete string of characters. The symbol SM' represents the value obtained by the SM measure under the following conditions:
SM'(r_1, r_2) = SM(r_1, r_2), if ED(r_1, r_2) = 0; SM(r_1, r_2) - 0.1, if ED(r_1, r_2) = 1; SM(r_1, r_2) - 0.2, if ED(r_1, r_2) = 2; 0, if ED(r_1, r_2) >= 3    (3)
The radical of the word at position k of a term T is represented by R_k(T). When T_i and T_j are multiword terms, the index k ranges up to the number of words of the term with the minimum number of words, so that the LS measure calculates the similarity between the first pairs of radicals in the terms being compared.
The result returned by the LS measure is the minimum value produced by equation 3, which depends on the Edit Distance. As the radical of a term carries a strong semantic weight, the value obtained by SM is decremented according to the conditions stated in equation 3: the higher the ED, the higher the penalty. The penalty values (0.1 and 0.2) were obtained from empirical studies with the SM measure. We assume that, if ED >= 3, the value returned by SM is zero and, consequently, LS is zero too. That is, three or more changes in the radical of a word suggest a low degree of similarity.
For example, in order to check the similarity between the terms areaEstrategica and armaEstrategica, the words of each term are processed by the stemming algorithm, which produces the stems "are" and "arm", "estrateg" and "estrateg", so that LS(areaEstrategica, armaEstrategica) = min(SM'(are, arm), SM'(estrateg, estrateg)).
To calculate SM(are, arm), we obtain the length of the shortest term, in this case 3. Then ED(are, arm) is calculated, which gives 1, since the letter "e" is changed to "m" to transform the string "are" into "arm". So, SM(are, arm) is solved as SM(are, arm) = (3 - 1)/3 = 0.67. As in this case ED = 1, the penalty to be applied is 0.1, so the resulting similarity SM'(are, arm) is 0.67 - 0.1 = 0.57.
The next result to be obtained is SM(estrateg, estrateg), which is 1. In this case ED(estrateg, estrateg) is zero (the strings are a perfect match), so no penalty applies. Thus, LS(areaEstrategica, armaEstrategica) = min(0.57, 1) = 0.57.
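The sketch below is our own illustrative implementation of the LS computation as we read it from the text; the toy stem function merely stands in for the Orengo and Huyck stemmer, and splitting terms on spaces is a simplification of the term format used above.

```python
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sm(a, b):
    shortest = min(len(a), len(b))
    return max(0.0, (shortest - edit_distance(a, b)) / shortest)

def penalized_sm(r1, r2):
    """SM on radicals, penalised by 0.1 (ED=1) or 0.2 (ED=2); zero if ED >= 3."""
    ed = edit_distance(r1, r2)
    return 0.0 if ed >= 3 else sm(r1, r2) - (0.0, 0.1, 0.2)[ed]

def stem(word):
    """Placeholder stemmer; a real system would use Orengo & Huyck's stemmer."""
    return {"area": "are", "arma": "arm", "estrategica": "estrateg"}.get(word, word)

def ls(term1, term2):
    """Minimum penalised SM over the first pairs of word radicals."""
    w1, w2 = term1.split(), term2.split()
    return min(penalized_sm(stem(a), stem(b))
               for a, b in zip(w1, w2))   # zip stops at the shorter term

print(round(ls("area estrategica", "arma estrategica"), 2))   # 0.57
```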
We did not find other works in the literature that provide a study on semantic weighting for each single word in a multiword term, which would be suitable for the Portuguese language as well as for several other languages such as Spanish, French and so on. In our proposal, as the reader can observe, the words with the lowest lexical similarity value may play an important role in similarity detection.
4
Multidomain Experiment
The OSs we used in this experiment come from two distinct sources4. Their
terms belong to one of two groups: single-word terms or multiword terms5.
The experiments were organized in two steps: testing and validation6 of LS
measure, followed by its evaluation. The terms in the Senate OS were categorized into two sets for each phase, while the terms in the USP OS remained without categorization during both the validation and evaluation phases. The terms were placed in alphabetical
order and an algorithm was developed to randomly distribute them through
validation and evaluation experiment groups.
We also devised a heuristic to tune the mappings generated by the LS measure. In the Portuguese language, the semantic weight of the first characters of a term is apparently strong, which gives rise to the following heuristic: according to the LS measure (equation 2), if the two radicals being compared have a different first letter, the value returned by the SM measure will be zero. Consequently, LS will be zero, too.
For the evaluation phase, we used 1,823 single-word terms of Senate OS, while
the USP OS remained with its original 7,039 single-word terms. We selected 4,701
multiword terms of Senate OS and kept 16,986 multiword terms of USP. The
aim of the experiments in this phase was to check the agreement between the LS and SM measures and the results given by a human analysis of similarity.
4 Namely: the Brazilian Senate Thesaurus and the São Paulo University (USP) Thesaurus.
5 For the experiments with multiword terms, OSs were first preprocessed in order to eliminate blanks. Moreover, the first character of each word was capitalized, except for the first word in a term. This procedure is necessary to compare results with those of the English [3] and German [1] experiments.
6 Details on the experiments carried out in testing and validation can be found in [9].
In order to examine in detail the 2,887 pairs of terms and the corresponding
system-computed or human confirmed analysis, we split them into seven groups.
These groups are presented in Table 1, where G1 to G7 stand for the respective
group7.
Human analysts marked the pairs of terms as "similar", "unlike" or "doubtful". This result was compared with the automatically processed combinations. We chose Group G5 in Table 1, deemed the most representative, to be described in detail in the next section.
4.1
Analysis of Group G5
This group contains terms which are deemed similar by the SM measure and unlike by the LS measure as well as by the human analysis. Moreover, G5 contains most of the pairs analyzed during the evaluation phase, that is, about 73%, which corresponds to 907 single-word terms and 1,211 multiword terms. We show an
extract of these terms in Table 2.
Table 2 contains single-word (first five lines) and multiword (next five lines) terms. First, let us analyze the single-word terms. Most of those belonging to this group have the same suffix, that is, the final string is a perfect match of characters. As SM weights the strings belonging to the radical and to the suffix equally, a high similarity value was observed between terms having the same suffix. However, this policy has not been confirmed for Portuguese.
In the multiword terms, on the other hand, at least one word of the term has the same suffix. As the reader may note, all terms in Table 2 seem to be unlike, although the SM measure detects them as similar. We could increase the threshold from 0.75 to 0.8 in order to get a more consistent mapping by SM. However, this higher threshold is not enough to deem the terms belonging to G5 as dissimilar, since only some pairs of terms have a similarity value under 0.8.
As this group represents most of the terms analyzed in the evaluation phase, and taking into account the results generated by the SM measure, it is possible to question whether this measure is really suitable for treating Portuguese terms. Specifically for multiword terms, we believe that the better performance of the LS measure is due to the fact that it considers each constituent word individually.
7
We used the threshold 0.75 in our experiments. This value is also used in [1].
As a following step, we concentrate our efforts on the mapping of terms belonging to the same domain. We apply the SM and LS measures to these terms in the experiment described in the next section.
5
Same Domain Experiment
In this experiment we verify the similarity among 2,083 terms from GEODESC
Thesaurus8 and 429 terms from USP Thesaurus, which belong to the Geosciences
domain. In order to carry out this experiment, we do not consider the cases
where there is a perfect matching of characters, because such cases do not help to evaluate either of the measures. Moreover, we use the first-letter heuristic to
help us obtain better results.
After running the algorithm with the two measures, 91 mappings were found
between the two thesauri representing 4.36% of the terms of GEODESC Thesaurus and 21.21% of the terms of USP Thesaurus. In order to analyze these
mappings, we split them into two groups. Group A (GA) contains the terms considered similar by the LS measure, while Group B (GB) includes the terms deemed similar by SM and dissimilar by LS. Table 3 shows these groups, considering as similar the terms with similarity >= 0.75.
Table 3 presents the combinations between SM and LS similarity measures.
These cases are explained as follows:
8
Available by ftp://ftp.cprm.gov.br/pub/pdf/didote/geodesc.pdf
5.1
Analysis of Group A
This group contains those terms which are considered similar by LS measure. The
analysis was broken into two tables, comparing our LS measure with Maedche
and Staab's SM measure. Only 4 mappings were detected when considering SM < 0.75 and LS >= 0.75, as shown in Table 4. In our point of view, just the first mapping (between the terms sais and sal) can be considered correct by LS. In
order to evaluate the remaining mappings it is necessary to know the semantic
relations among the terms and to take into account the meaning of each term.
In Group A, when both measures consider the terms being compared as similar (SM >= 0.75 and LS >= 0.75), we have the terms presented in Table 5.
Lines 1 to 5 show terms with number variation and they are correctly deemed
as similar by both measures. The remaining pairs of terms, such as those in
Table 4, do not present a unique characteristic and it is difficult to perform an
evaluation of the results generated.
5.2
Analysis of Group B
This group presents most of the mappings found in our experiment. We split
these pairs of terms into two tables, the former composed of single-word terms and the latter of multiword terms.
The single-word terms are shown in Table 6. Although all these pairs of terms have high lexical similarity, their meanings are different. So, in the context of mapping similar terms between OSs, we consider that they should not be mapped.
At this point it is important to stress a contribution of our measure. According to the literature studied, only the SM measure has been used to map terms among OSs. In this work, when we apply the SM measure to single-word terms the reader can note its low performance, while our measure seems to attribute a more suitable similarity value to the same pairs of terms. So, the LS measure helps to avoid detecting dissimilar terms as similar.
Still in this group, we analyze the multiword terms. The pairs of terms in
this case are depicted in Table 7.
The reader may note that these pairs are considered similar by the SM measure mainly because it deals with them as a single string. As opposed to the LS measure, SM does not verify the similarity among individual words. The multiword terms belonging to the Geosciences domain are generally composed of more than 10 characters, so the value returned by ED does not have sufficient impact to reduce the final SM similarity value of the full term. Our measure, on the other hand, considers the words belonging to the terms individually. This helps to reduce the final similarity value, since the shortest-string length (now that of a single radical) is lower than the one used by SM; thus, the result of ED has a greater impact in the equation, decreasing the value of the LS measure.
It is important to observe in Table 7 that most of the values generated by the LS measure are zero. This occurs because those pairs have 3 or more distinct characters in the radicals of the words.
Finally, it is worth noting the contribution of the penalties introduced in
equation 2, as expressed in Table 8.
These penalties decrease the value of the LS measure and, consequently, allow terms to be considered dissimilar (maintaining the 0.75 threshold), in contrast to the SM measure. For example, the similarity between the pair of terms bioestratigrafia and litoestratigrafia computed by the LS measure without penalties would be 0.86. This value would lead us to consider the pair similar; however, introducing the penalties (in this case 0.2) we obtain a final similarity value of 0.66, which is under the established threshold. In fact, this pair is not really similar, like the remaining ones in Table 8. Thus, they should not be mapped in the context of our analysis.
6
Final Remarks and Future Work
This work is the first effort towards the detection of similar terms between Portuguese OSs. The LS measure was evaluated against human judgments of similarity, even though it is difficult to evaluate similarity measures in agreement with a human point of view. A full description and analysis of the results obtained
with LS measure are given in [6]. We believe that our measure contributes to
help the ontology engineers reuse the information contained in the ontological
structures, since the reuse is one of the main concerns in the context of the
semantic web.
We carried out experiments with terms belonging to multidomain as well as same-domain structures, and we commented on the main results obtained. Although these are preliminary results, they are encouraging.
The next step is the application of LS measure to other languages, such
as English or Spanish. In this situation a proper stemming algorithm, suitable
for each different language, should be used. Besides, the similarity measures presented in this article can be used to aid in the task of merging or aligning ontological structures. They could also be connected to a specific interface to help ontologists detect terms suggested as similar.
Acknowledgements
Marcirio Silveira Chaves was supported by the research center HP-CPAD (Centro de Processamento de Alto Desempenho HP Brasil-PUCRS).
References
1. Alexander Maedche and Steffen Staab. Measuring Similarity between Ontologies. In
Proceedings of the European Conference on Knowledge Acquisition and Management
- (EKAW-2002). Madrid, Spain, October 1-4, pages 251–263, 2002.
2. AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy. Learning to
Map between Ontologies on the Semantic Web. In Proceedings of the World- Wide
Web Conference (WWW-2002), Honolulu, Hawaii, USA, May 2002.
3. Natalya Fridman Noy and Mark A. Musen. Anchor-PROMPT: Using Non-Local
Context for Semantic Matching. In Proceedings of the Workshop on Ontologies and
Information Sharing at the Seventeenth International Joint Conference on Artificial
Intelligence (IJCAI-2001), Seattle, WA, August 2001.
4. Sushama Prasad, Yun Peng, and Timothy Finin. Using Explicit Information To Map
Between Two Ontologies. In Proceedings of the
International Joint Conference
on Autonomous Agents and Multi-Agent Systems - Workshop on Ontologies in Agent
Systems (OAS) - Bologna, Italy. 15-19 July, 2002.
5. Vladimir Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions
and Reversals. Cybernetics and Control Theory, 10(8):707–710, 1966.
6. Marcirio Silveira Chaves. Comparação e Mapeamento de Similaridade entre Estruturas Ontológicas. Master’s thesis, PUCRS-FACIN-PPGCC, 2004.
7. Viviane Moreira Orengo and Christian Huyck. A Stemming Algorithm for Portuguese Language. In Proceedings of the Eighth Symposium on String Processing and
Information Retrieval (SPIRE-2001), pages 186–193, 2001.
8. Marcirio Silveira Chaves. Um Estudo e Apreciação sobre Dois Algoritmos de Stemming para a Língua Portuguesa. Jornadas Iberoamericanas de Informática. Cartagena de Indias - Colômbia (CD-ROM), August 11-15, 2003.
9. Marcirio Silveira Chaves and Vera Lúcia Strube de Lima. Looking for Similarity
between Portuguese Ontological Structures. In: António Branco, Amália Mendes,
Ricardo Ribeiro (editors). Edições Colibri, Lisboa, 2004 (to appear).
Dialog with a Personal Assistant
Fabrício Enembreck1 and Jean-Paul Barthès2
1
PUCPR, Pontifícia Universidade Católica do Paraná
PPGIA, Programa de Pós-Graduação em Informática Aplicada
Rua Imaculada Conceição, 1155, Curitiba PR, Brasil
[email protected]
2
UTC – Université de Technologie de Compiègne
HEUDIASYC – Centre de Recherches Royallieu
60205 Compiègne, France
[email protected]
Abstract. This paper describes a new generic architecture for dialog systems
enabling communication between a human user and a personal assistant based
on speech acts. Dialog systems are often domain-related applications. That is,
the system is developed for specific applications and cannot be reused in other
domains. A major problem concerns the development of scalable dialog systems that can be extended with new tasks without much effort. In this paper
we discuss a generic dialog architecture for a personal assistant. The assistant
uses explicit task representation and knowledge to achieve an “intelligent” dialog. The independence of the dialog architecture from knowledge and from
tasks allows the agent to be extended without needing to modify the dialog
structure. The system has been implemented in a collaborative environment in
order to personalize services and to facilitate the interaction with collaborative
applications like e-mail clients, document managers or design tools.
Keywords: Dialog Systems, Natural Language, Personal Assistants
1 Introduction
While using our computers to work or to communicate, we observe three major
trends: (i) the user’s environment becomes increasingly complex; (ii) cooperative
work is growing; (iii) knowledge management is spreading rapidly. Because of the
increasing complexity of their environment, users are frequently overwhelmed with
tasks that they must accomplish through many different tools (e-mail managers, web
browsers, word processors, etc.). The resulting cognitive overload leads to some disorganization, which has negative impacts, in particular when the information is
shared among different people. A major issue is thus to develop better and more
intuitive interfaces.
We are currently developing a Personal Assistant Agent in a project called
AACC1, for supporting collaboration between French and American groups of
1
The AACC (Agents d’Aide à la Conception Coopérative) project is a collaborative project
involving the CNRS HEUDIASYC laboratory of UTC, and the LaRIA laboratory of UPJV in
France.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 204–213, 2004.
© Springer-Verlag Berlin Heidelberg 2004
students, located at UTC (Université de Technologie de Compiègne) and at ISU
(Iowa State University). The students must design electro-mechanical devices using
assistant agents. In this paper, we focus on the Personal Assistant (PA), discussing
how a Natural Language interface allows the user to interact with the Assistant efficiently, and how this interaction can be used to increase the agent's knowledge of the user. We developed a generic dialog system using several models: a dialog model, task models, a domain knowledge model and a user model. In this paper we focus on the
construction of the dialog model and show how speech acts can be used to make the
dialog model independent of domain data (tasks and knowledge).
The paper is organized as follows: Section 2 presents some theory on natural language and dialog systems; Section 3 describes the architecture of our system. The
deployment and evaluation are discussed in section 4. We discuss related work in
section 5. Finally, Section 6 concludes with our observations.
2 Natural Language and Dialog Systems
Communications using natural language (NL) have been proposed in the past. Early
attempts were made in the late sixties and early seventies. Szolovits et al. [1] and
Goldstein and Roberts [6] developed formalisms and languages for representing the
knowledge contained in English utterances. The internal language would support
inferences in order to produce answers. In the first project (OWL language) the application was to draw inferences on an object database. In the second one (FRL-0 language), the goal was to schedule meetings. The internal language was used to represent knowledge and to translate utterances. Such an approach can simplify the
representations and inferences, because only very specific applications are considered. However, for new domains, a major part of the application must be rewritten,
which is unfortunate in an environment involving several tasks (like collaborative
work) when part of the dialog must be recoded each time a new task is added to the
system. Later, sophisticated knowledge representation techniques were proposed by
several researchers, including Schank and Abelson [13], Sowa [16] or Riesbeck and
Martin [12], for handling natural language and representing meaning. They allowed
expressing complex relationships between objects. The main difficulty with such
techniques however, is to define the right level of granularity for the representation,
because even very simple utterances can produce very complex structures. Moreover,
modeling concepts and utterances is a very time consuming non-trivial task. The field
of NL and machine understanding has expanded since the early attempts, however,
the techniques being used are fairly complex and most of the time unnecessary for the
purpose of conducting dialogs, in particular goal-oriented dialogs, since “it is not
necessary to understand in order to act.”
Like NL techniques, dialog systems generally use internal but simpler structures to
represent knowledge, e.g., ontologies, semantic nets, or frame systems. The emphasis, however, is not on the adequacy of the knowledge representation, but rather on the
dialog coordination by a dialog manager. In addition, the dialog systems are designed
so that they can be used in other domains without the necessity of changing the dialog
structure, in order to save development time.
Many dialog systems implementing NL interfaces have been developed for applications like speech-to-speech translation [8], meeting scheduling, travel booking [2] [15],
telephone information systems, transportation and traffic information, tutorial systems, etc. Flycht-Eriksson [5] has classified dialog systems into query/answer systems and task-oriented systems. Query/answer systems include consultation systems
like tourist information, time information, traveling, etc. Task-oriented systems guide
the user through a dialog to execute a task. Tasks range from very simple tasks like
“find a document” to complex tasks decomposed into several subtasks. We argue that
a dialog system for supporting collaborative work must be of both the query/answer and the task-oriented type, because user problems can involve questions ("Where does Robert work?", "What does electrostatic mean?") and tasks ("Find a document for me", "Send a message to the project leader"). We present our approach in the next section.
3 A Personal Assistant That Participates in Dialogs
We discuss the different models that compose our dialog system, paying special
attention to the dialog model.
Fig. 1. Open Dialog.
3.1 Dialog Model
Our approach uses a speech act system. According to Searle [14], the speech act is the
basic unit of language used to express meaning through an utterance that expresses an
intention of doing something (to act). In our system, the users’ utterances express
questions and requests. Then, a PA starts a dialog to reach a state where an action is
triggered according to the intention of the user. The dialog states are nodes of a dialog
graph in which most speech acts are available at all times. For instance, consider the
dialog in Fig. 1. In lines 1-5, the user requests the task "send mail" and the system asks for additional information. The user enters a new question during the task dialog (lines 3-4), the system answers it and returns to the previous dialog context. To accomplish this, the system keeps a stack of states. When a new task is requested, the system pushes onto the stack a number of states equal to the number of slots required to accomplish the task. When a slot is successfully filled, the system marks it as "popped." This strategy also allows the user to return to previous states (Fig. 1, lines 5-10).
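A minimal sketch of the slot-stack idea described above; the class and method names are our own and only illustrate the push/mark-as-popped/go-back behaviour, not the actual implementation.

```python
class DialogStack:
    """Toy slot stack: one state per required slot; filled slots are only
    marked as "popped" so that a Go-back act can reopen them."""

    def __init__(self):
        self.states = []                      # each state: [task, slot, value, done]

    def push_task(self, task, slots):
        for slot in reversed(slots):          # push so the first slot ends on top
            self.states.append([task, slot, None, False])

    def current(self):
        for state in reversed(self.states):   # topmost unfilled slot
            if not state[3]:
                return state
        return None

    def fill(self, value):
        state = self.current()
        state[2], state[3] = value, True      # mark as "popped"

    def go_back(self):
        for state in self.states:             # lowest filled slot = last one filled
            if state[3]:
                state[3] = False              # reopen it
                return state
        return None

stack = DialogStack()
stack.push_task("send-mail", ["recipient", "subject", "body"])
stack.fill("Robert")                          # answers the "recipient" question
stack.go_back()                               # user made a mistake: reopen the slot
print(stack.current())                        # ['send-mail', 'recipient', 'Robert', False]
```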
Fig. 2. Model-Based Architecture.
Our system (architecture in Fig. 2) has been developed for dialog-based and question/answer interaction. In a task-oriented dialog the system asks the user to fill the slots of a given task (like sending e-mails or locating documents). Then the system runs the task
and presents the result. In question/answer interaction, the user asks the system for
information. In this case the assistant uses its knowledge base for providing correct
answers.
Fig. 2 shows that when the system receives a simple question or information the
syntactic analyzer produces a syntactic representation. The representation gives the
grammatical structure of the sentence (verbal phrase, nominal phrase, prepositional
phrase, etc). We developed a grammatical rule base, where each rule refers to a single
dialog act. The semantic analyzer uses this structure to build requests. The role of the
semantic analyzer is to identify objects, properties, values and actions in the syntactic
structure using the object hierarchy and relations defined in the domain model. The
information is used to create a formal query. During the semantic analysis the system
can ask the user for confirmation or request additional information to resolve conflicts. Finally, the inference engine uses the resulting formal query to retrieve the
required information and the system presents it to user.
Whenever a task-oriented dialog starts, the semantic analyzer first tries to determine whether a known task is concerned. If this is the case, it checks the slots initially filled with information and continues the dialog to acquire the information needed to execute the task. To identify the task and the concerned slots, the analyzer retrieves
information from task models (see Section 3.3). The recursive stack strategy allows
the user to use relations and concepts defined in the domain model (see Fig. 1, line
10) at all times.
In our system, the dialog coordination depends on the type of utterances denoted
by speech acts. Schank and Abelson [13] proposed a categorization of messages, of
which we keep the following:
Assertive: message that affirms something or gives an answer (e.g., “Paul is professor of AI at UTC.”, “Mary’s husband”);
Directive: gives a directive (e.g., “Find a document for me.”);
Explicative: asks for an explanation (e.g., "Why?");
Interrogative: asks for a solution (e.g., "Where does Paul work?").
To maintain terminological coherence, the previous categories will be referred to
by speech acts: Assert act for Assertive; Directive act for Directive; Explain act for
Explicative and WH/Question (where, what or which) or Y-N/Question (yes-no) for
Interrogative. Speech acts are used to classify nodes of the dialog state graph. The
dialog graph represents a discussion between the user and the system where nodes are
the user’s utterances and arcs are the classification given to the node.
To improve communications we introduced new specialized speech acts:
Confirm: used by the system to ask the user to confirm a given value;
Go-back: used by the user to go to the previous node of the dialog, for instance
when the user made a mistake;
Abort: used by the user to terminate the dialog;
Propose: used by the system to propose a value to a question. This act can be followed by a Confirm act.
Fig. 3 shows the functional architecture of the dialog coordination and how we implemented the semantic interpretation for each speech act. The interaction with the user always starts with the "Ask" system act. The default question is "What can I do for you?" Then, the user can ask for information or start a task. Based on the user's phrase, the Task Recognizer classifies it as a "General Utterance" or a "Task-Related Utterance." The task recognizer compares the verbs and nouns of the verbal and nominal parts of the utterance with the linguistic information previously stored in task templates (Section 3.3).
A general utterance is simply analyzed by the semantic analyzer taking into account the speech act recognized during syntactic analysis. Four types of speech acts
are possible: Assertive (assert act), Explicative (explain act), Directive (directive act)
and Interrogative (wh/question or y-n/question acts). Finally, the Inference Engine can ask the knowledge base for the answer. The inference engine does a top-down search in the concept hierarchy, identifying classes and subclasses of concepts, properties and values, and filtering the concepts that satisfy the constraints specified in the queries.
The interpretation of a task-related utterance is more complex. First, the Task Recognizer locates the correct task based on the terms present in the nominal phrases, using the terminological representation of the tasks (Task Template in Section 3.3). Next, the task recognizer matches the modifiers of these nominal phrases with information about the parameters, in order to fill the slots referred to in the phrase. Then, the
Task Engine will ask the user about other parameters sequentially. For each parameter an Ask act is executed by the system. At this point the user can: simply answer the
question (in this case the task engine fills the slot and passes to the next one), ask for
Explanation, Go-back to the last slot or Abort the dialog. When a user asks for Explanation, the Task Explainer presents the information coded on the task description
concerning the current parameter (params-explains on Section 3.3) and the task engine restarts the dialog concerning the current parameter. The Go-back act simply
makes the task engine roll back the dialog flow to the last parameter filled. When the
user enters an Abort act the Task Eraser reinitializes variables concerning the current
task and the system goes back to the default prompt or to the top task of the task
stack.
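The slot-filling behaviour of the Task Engine described above can be sketched as follows. This is a simplified Python reading of the text, not the authors’ implementation: the callables ask, explain and reinitialize, and the reply object with its act and value fields, are hypothetical stand-ins for the actual components.

def run_task_dialog(task, ask, explain, reinitialize):
    # 'task' is assumed to be a dictionary with the fields of Section 3.3:
    # "params" (ordered list), "params-values", "params-labels", "params-explains".
    i = 0
    while i < len(task["params"]):
        slot = task["params"][i]
        reply = ask(task["params-labels"][slot])       # one Ask act per parameter
        if reply.act == "ANSWER":
            task["params-values"][slot] = reply.value  # fill the slot, go to the next one
            i += 1
        elif reply.act == "EXPLAIN":
            explain(task["params-explains"][slot])     # Task Explainer, same parameter again
        elif reply.act == "GO_BACK":
            i = max(i - 1, 0)                          # roll back to the last parameter filled
        elif reply.act == "ABORT":
            reinitialize(task)                         # Task Eraser resets the task variables
            return None                                # back to the default prompt / task stack
    return task["params-values"]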
Fig. 3. Functional Architecture.
The system can also ask for confirmation and propose values with the Confirm and Propose speech acts, respectively. To confirm a given value, the system shows a default question such as “Confirm the value?” and waits for a valid answer. If a positive answer is given, the system confirms the value and the dialog continues. Otherwise, the task engine asks the question for the current parameter again. The Propose act is executed before the Ask act. The User Profile Manager looks in the user model for a value to propose to the user. If a value is found, it is presented to the user and the system asks for confirmation by executing a Confirm act.
Finally, when no more information is needed, the Task Executor executes the task and presents the solution or feedback to the user. It also sends information to the User Profile Manager, which saves the current task in the user model.
3.2 How to Interpret the User’s Utterances
In our approach, we use a simple English regular grammar extended from Allen [1]. We divided the syntactic and semantic processing into two steps. The algorithm uses nominal and prepositional phrases to locate known objects and properties. We implemented an algorithm that analyzes the syntactic representation and the domain ontology and generates well-formed requests. The semantic analysis is complemented by a linguistic analysis of the phrase, in which we try to identify whether an action (e.g., “leave”) or some general modifier (e.g., time (when), quantity (how many)) is being asked about, using a list of verbs denoting actions and modifiers. Finally, the inference engine takes the resulting formal query and does the filtering. The query is a conjunction of atomic queries. The format of each atomic query can be “(:Object O :slot S :value V)” for object selection or “(:Object O :slot S)” for slot-value verification. “O” and “V” can be complex recursive structures.
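To make the query format more concrete, the sketch below shows one way such conjunctive queries could be evaluated over a knowledge base stored as plain Python dictionaries. It is an illustration under our own assumptions and does not model MOSS (Section 3.5) or the actual inference engine.

def satisfies(obj, query, kb):
    # query is an atomic query: {"slot": S} or {"slot": S, "value": V},
    # where V may itself be a nested query (a complex recursive structure)
    slot, value = query["slot"], query.get("value")
    candidates = obj.get(slot, [])
    if value is None:                       # (:Object O :slot S): slot-value verification
        return bool(candidates)
    if isinstance(value, dict):             # recursive sub-query on the slot's values
        return any(satisfies(kb[v], value, kb) for v in candidates if v in kb)
    return value in candidates              # (:Object O :slot S :value V): direct match

def select(queries, kb):
    # the query is a conjunction of atomic queries: keep the objects satisfying all of them
    return [name for name, obj in kb.items()
            if all(satisfies(obj, q, kb) for q in queries)]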
3.3 Task Model
We divide a task into two parts: a template and a description. To identify the task requested by the user and the information related to its parameters, the semantic analyzer uses the template part of the task. The template contains the linguistic terms related to the parameters and the verbs used to start the task. The task description contains all the information required for the task execution. The data required in the task structure definition are listed below (a schematic example follows the list):
Params: the parameters of the task;
Params-values: the values given by the user as parameters;
Semantic-value: the specification of a function that must be executed on the value given by the user. For instance, the function “e-mail” can give the value “[email protected]” for the term “carlos” given by the user;
Params-confirm: true if a confirmation of the value given by the user is necessary;
Params-labels: the question presented to the user;
Params-save: specifies whether the values of the parameters are used to generate the user model (see the next section);
Params-explains: if true (for a parameter), an explanation is given to the user;
Global-confirm: if true, a global confirmation is requested before the task execution.
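For illustration only, a hypothetical task description with these fields could look like the following Python dictionary; the task name, parameter names and values are invented and do not come from the actual system.

SEND_MESSAGE_TASK = {
    "params":          ["recipient", "subject"],          # parameters of the task
    "params-values":   {},                                 # filled in during the dialog
    "semantic-value":  {"recipient": "e-mail"},            # function applied to the given value
    "params-confirm":  {"recipient": True, "subject": False},
    "params-labels":   {"recipient": "Who is the recipient?",
                        "subject": "What is the subject?"},
    "params-save":     {"recipient": True, "subject": False},
    "params-explains": {"recipient": "The person who will receive the message.",
                        "subject": "A short title for the message."},
    "global-confirm":  True,                               # confirm before executing the task
}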
3.4 User Model (UM)
We use a dynamic UM generation process. All task and query executions are saved in the user model. Values are predicted with a weighted frequency-based technique. We use dynamic UM generation to avoid the manual modeling of users. The main idea is to minimize the user’s work during the execution of repetitive dialogs by predicting values and decreasing the need for feedback. A more elaborate discussion of user models in dialog systems is beyond the scope of this paper.
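One possible reading of the weighted frequency-based technique is sketched below in Python: values observed in recent task executions weigh more than older ones when a value is proposed for a slot. The exponential decay factor is our own assumption and is not specified in the paper.

def propose_value(history, slot, decay=0.9):
    # 'history' is the list of past task executions saved by the User Profile
    # Manager, most recent last; each entry maps slot names to the values used
    scores = {}
    for age, past_task in enumerate(reversed(history)):
        if slot in past_task:
            value = past_task[slot]
            scores[value] = scores.get(value, 0.0) + decay ** age
    return max(scores, key=scores.get) if scores else None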
3.5 Domain Model
To allow the system to identify users’ problems and provide answers to particular questions, it is necessary to keep a knowledge base within the assistant. The knowledge of the agent is used to identify the objects, relations and values required by the user. Such objects can represent instances of various object classes (People, Task, Design, etc.) and have a number of synonyms. Therefore, it is quite important to use efficient tools to represent objects, synonyms and a hierarchical structure of concepts. In our approach, we use the MOSS system proposed by Barthès [4] to represent knowledge. MOSS allows objects to be indexed by terms and synonyms, and several objects can share the same index. MOSS was developed at the end of the seventies for representing and manipulating LISP objects. Objects can be versioned and modified simultaneously by several users. MOSS concepts were later used in object-oriented databases.
Fig. 4. Intelligent Dialog.
Knowledge is important because it increases the capability of the system to produce rational answers. Consider the dialog reproduced in Fig. 4. Initially, the system has no information on Joe’s occupation. The user starts the dialog with an “Assert” dialog act stating Joe’s occupation. Afterwards, the user asks several questions related to the initial utterance and the system is able to answer them. The system can correctly identify and interpret very different questions related to the same concepts (lines 3 and 5) and answer questions about them. This is possible because the semantic meanings of the slots are exploited in the queries. Thus, a slot can play a role that is referenced in different ways.
4 Deployment and Evaluation
We are currently developing a personal assistant (PA) in the AACC project. We hope to use the mechanisms discussed in this paper to improve the interaction with the current assistant prototype. Students will then have an assistant for executing services and for helping them with mechanical engineering tasks, capable of answering questions in natural language.
The current state of the prototype did not allow its immediate application because the interface was not good enough. The interface is being redesigned so that our dialogue approach can be tested during the mechanical engineering courses given to students. During the spring semester of 2004 we will compare the results of students using and not using the assistant, and we will measure the quality of the information provided by the assistant. A formal evaluation of the system could be carried out, for instance, with the criteria presented by Allen et al. [2]; however, for us, the main criterion is the acceptance or non-acceptance of the system by the students.
5 Related Work
Grosz and Sidner [7] discuss the importance of an explicit task representation for the understanding of a task-oriented dialogue. According to the authors, the discourse is a composite of three elements: (i) linguistics (utterances); (ii) intentions and (iii) attentional states (objects, properties, relations and intentions salient at any given point of the discourse). Our system presents some very close elements, such as linguistic information (task templates), intentions (given by speech acts) and specific information about tasks and task properties. Very often, assistants communicate using ACLs (Agent Communication Languages) like KQML or FIPA ACL (FIPA – Foundation for Intelligent Physical Agents, http://www.fipa.org). However, such messages are based on performatives rather than speech acts. A basic difference between performatives and speech acts is that performatives tell what to do when something is said (action) and do not express the meaning of what is said (intention). In other words, ACL messages cannot express a speech act like Go-back because there is no explicit action in the utterance. Unlike most dialog systems, the dialog flow implemented in Section 3.1 is completely generic. Thereby, new tasks and knowledge can be added to the system (the Assistant) without changing or extending the dialog structure.
Generic dialog systems are relatively rare. Usually the developer specifies state transition graphs in which the dialog flow must be coded entirely, as in the dialog model discussed by McRoy and Ali [10]. Kölzer [9] discusses a generic dialog system generator. In Kölzer’s system, the developer must specify the dialog flow using state charts. Such techniques make the development of real applications quite hard. In contrast, in our approach we only need to specify the task structures and the related domain knowledge. Rich and Sidner [11] also used the concept of generic dialog systems. The authors used the core of the COLLAGEN system for developing very different applications. COLLAGEN is based on a plan recognition algorithm and a complex model of collaborative discourse. The problem is that most of the collaborative discourse must be coded using a language for modeling the semantics of communicative acts. The representation includes knowledge concerning the application, so the knowledge of the system is intermixed with the dialog discourse, which makes the application domain-dependent. Allen et al. [3] used speech acts for modeling the behavior and the reasoning of a deliberative autonomous agent. Speech acts are separated into three groups: Interaction, Collaborative Problem Solving (CPS) and Problem Solving (PS). Since we do not intend to model interaction with the user as a problem-solving process, the PS and CPS speech acts proposed by Allen are not relevant to our work, because they are domain-related. However, the interaction acts are very similar to the speech acts that we propose.
6 Conclusions
In this paper we addressed the problem of communication between a user and a personal assistant agent (PA). In the AACC project, users need to communicate with a PA to do collaborative work. We argued that natural language should be used to provide better interaction. A user–assistant communication module was developed as a modular dialog system. To execute services and to ask for knowledge, the user enters a dialog with her PA. In this application, the dialog coordination model should be generic so that the system can scale to the addition of new tasks
without much effort. We therefore introduced a new generic dialog model based on speech acts. Simple tasks and questions have been used to highlight the effectiveness of the system and its advantages over traditional collaborative work tools.
References
1. Allen, J. F.: Natural Language Understanding. The Benjamin/Cummings Publishing Company, Inc., Menlo Park, California, 1986. ISBN 0-8053-0330-8
2. Allen, J. F.; Miller, B. W. et al.: Robust Understanding in a Dialogue System. Proc. 34th Meeting of the Association for Computational Linguistics, June 1996.
3. Allen, J.; Blaylock, N.; Ferguson, G.: A Problem Solving Model for Collaborative Agents. Proc. of AAMAS’02, pp. 774–781, ACM Press, New York, NY, USA, 2002. ISBN 1-58113-480-0
4. Barthès, J-P. A.: MOSS 3.2. Memo UTC/GI/DI/N 111, Université de Technologie de Compiègne, March 1994.
5. Flycht-Eriksson, A.: A Survey of Knowledge Sources in Dialogue Systems. Proceedings of the IJCAI-99 Workshop on Knowledge and Reasoning in Practical Dialogue Systems, International Joint Conference on Artificial Intelligence, Murray Hill, New Jersey, Jan Alexandersson (ed.), pp. 41–48, 1999.
6. Goldstein, I. P.; Roberts, R. B.: Nudge, a Knowledge-Based Scheduling Program. MIT AI Memo 405, February 1977, 23 pages.
7. Grosz, B. J.; Sidner, C. L.: Attention, Intentions, and the Structure of Discourse. Computational Linguistics, 12(3): 175–204, 1986.
8. Kipp, M.; Alexandersson, J.; Reithinger, N.: Understanding Spontaneous Negotiation Dialogue. Linköping University Electronic Press: Electronic Articles in Computer and Information Science, ISSN 1401-9841, vol. 4, no. 027, 1999.
9. Kölzer, A.: Universal Dialogue Specification for Conversational Systems. Linköping University Electronic Press: Electronic Articles in Computer and Information Science, ISSN 1401-9841, vol. 4, no. 028, 1999.
10. McRoy, S.; Ali, S. S.: A Practical, Declarative Theory of Dialog. Electronic Transactions on Artificial Intelligence, vol. 3, Section D, 1999, 18 pp.
11. Rich, C.; Sidner, C. L.; Lesh, N.: COLLAGEN: Applying Collaborative Discourse Theory to Human-Computer Interaction. AI Magazine, Special Issue on Intelligent User Interfaces, vol. 22, issue 4, pp. 15–25, Winter 2001.
12. Riesbeck, C.; Martin, C.: Direct Memory Access Parsing. Yale University Report 354, 1985.
13. Schank, R. C.; Abelson, R. P.: Scripts, Plans, Goals and Understanding. Lawrence Erlbaum Associates, Hillsdale, NJ, 1977.
14. Searle, J.: Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, 1969.
15. Seneff, S.; Polifroni, J.: Formal and Natural Language Generation in the Mercury Conversational System. Proc. Int. Conf. on Spoken Language Processing, Beijing, China, October 2000.
16. Sowa, J. F.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading, Mass., 1984.
17. Szolovits, P.; Hawkinson, L. B.; Martin, W. A.: An Overview of OWL, a Language for Knowledge Representation. Technical Memo TM-86, Laboratory for Computer Science, MIT, 1977.
Applying Argumentative Zoning in an
Automatic Critiquer of Academic Writing
Valéria D. Feltrim1, Jorge M. Pelizzoni1, Simone Teufel2,
Maria das Graças Volpe Nunes1, and Sandra M. Aluísio1
1 University of São Paulo - ICMC/NILC
Av. do Trabalhador São Carlense, 400
13560-970, São Carlos - SP, Brazil
{vfeltrim,jorgemp,gracan,sandra}@icmc.usp.br
2 University of Cambridge - Computer Laboratory
JJ Thomson Avenue, Cambridge CB3 0FD, UK
[email protected]
Abstract. This paper presents an automatic critiquer of Computer Science Abstracts in Portuguese, which formulates critiques and/or suggestions for improvement based on automatic argumentative structure recognition. The recognition is performed by a statistical classifier, similar
to Teufel and Moens’s Argumentative Zoning (AZ) [1], but ported to
work on Portuguese abstracts. The critiques and suggestions made by
the system come from a set of fixed critiquing rules based on corpus observations and guidelines for good writing from the literature. Here we
describe the overall system and report on the AZ porting exercise, its
intrinsic evaluation and application in the critiquer.
Keywords: Academic Writing Support Tools, Argumentative Zoning,
Machine Learning
1 Introduction
It is well known that producing a “good” argumentative structure in academic
writing is not an easy job, even for experienced writers. Besides dealing with
the inherent complexities of any writing task, the writer has also to deal with
those specific to the academic genre. More specifically, the academic audience
expects to find in papers a certain kind of information presented in a certain
way. However, novice writers are usually not quite aware of these expectations
or demands and are believed to benefit a lot from established structure models.
Many such models have indeed been proposed for academic writing in various
areas of Science [2–4], which one can in principle use as a guide when preparing
or correcting one’s own text. Notwithstanding, there is a major pitfall to that:
as these models view text as a sequence of “moves” or categories ascribed to
textual segments, the burden falls upon the writer of having to identify these
categories within their own text, which tends to be harder the less experienced
the writer is. In consequence, “manual” application of such structure models by
novices is prone to inefficiency. One significant improvement to that scenario
would be having a computer aid, in the sense that there could be an artificial
collaborator able to recognize the argumentative structure of an evolving text
automatically, on which to base critiquing and suggestions.
In this paper, we present such an automatic critiquer of Computer Science
Abstracts in Portuguese. As a reference structure model, we use a seven-category
fixed-order scheme, as illustrated in Table 1¹ and discussed in Section 2. The automatic category recognition is performed by a statistical classifier similar to Teufel and Moens’s Argumentative Zoning (AZ) [1], but ported to work on Portuguese abstracts, for which reason it is called AZPort. The critiques/suggestions
made by the system come from a fixed set of critiquing rules generated by corpus
observations [6] and guidelines for good writing from the literature.
The critiquer was conceived to be part of a bigger system called SciPo,
whose ultimate goal is to support novice writers in producing academic writing in Portuguese. SciPo was inspired by the Amadeus system [7] and its current
functionality can be summarized thus: (a) a base of authentic thesis abstracts
and introductions annotated according to our structure scheme; (b) browse and
search facilities for this base; (c) support for building a structure that the writer
can use as a starting point for the text; (d) critiquing rules that can be applied
to such a structure; and (e) recovery of authentic cases that are similar to the
writer’s structure. Also, the existing lexical patterns (i.e. highly reusable segments) in the recovered cases are highlighted so that the writer can easily add
these patterns to a previously built structure. Examples of lexical patterns are
underlined in Table 1.
1 The sentences in Table 1 (except the one for OUTLINE) were collected from [5]. Note that the texts in our corpus are in Portuguese, in contrast to this paper.
The major shortcoming of SciPo before the work described in this paper was that the writer was expected to state a schematic structure explicitly. Not only is this usually unnatural to many writers, but it also implies that they must master a common artificial language, i.e. they need to understand the meaning of all categories or else they will fail to communicate their intentions to the system. Our structure-sensitive critiquer is intended to overcome this by inverting the flow of interaction: now the writer may just input a draft and benefit from all of SciPo’s original features, because the schematic structure is elicited automatically.
In the following section we present our reference scheme and report on a human annotation experiment to verify its reproducibility and stability. In Section 3 we report on the AZ porting exercise for Brazilian Portuguese. In Section 4 we comment on our critiquing rules and demonstrate the usage of the system.
2 Manual Annotation of Abstracts
As a starting point for our annotation scheme, we used three models: Swales’
CARS [2] and those by Weissberg and Buker [3] and by Aluísio and Oliveira
Jr. [8]. Although these works deal with introduction sections, we have found
that the basic structure of their models could also be applied to abstracts.
Thus, after some preliminary analysis, the scheme was modified to accommodate all the argumentative roles found in our corpus. Finally, in order to make
it more reproducible, we simplified it, ending up with a scheme close to that
presented by Teufel et al. [9]. It comprises the following categories: BACKGROUND (B), GAP (G), PURPOSE (P), METHODOLOGY (M), RESULT (R), CONCLUSION (C) and OUTLINE (O).
One of the main difficulties faced by the annotators was the high number of sentences with overlapping argumentative roles, which led to doubts about the correct category to assign. Anthony [10] also reported on category assignment conflicts when dealing with introductions of Software Engineering
papers. We have tried to minimize this difficulty by stating specific strategies
in the written guidelines to deal with frequent conflicts, such as, for example,
PURPOSE vs. RESULT.
Experiments performed on the basis of our scheme and specific annotation
guidelines (similar to AZ’s) showed it to be reproducible and stable. To check
reproducibility, we performed an annotation experiment with 3 human annotators who were already knowledgeable of the corpus domain and familiar with
scientific writing. To check stability, i.e. the extent to which one annotator will
produce the same classifications at different times, we repeated the annotation
experiment with one annotator with a time gap of 3 months. We used the Kappa
coefficient K [11] to measure reproducibility between k annotators on N items
and stability for one annotator. In our experiment, items are sentences and the
number of categories is n=7. The formula for the computation of Kappa is

K = (P(A) - P(E)) / (1 - P(E)),

where P(A) is pairwise agreement and P(E) is random agreement. Kappa varies between -1 and 1. It is -1 for maximal disagreement, 0 if agreement is only what would be expected by chance annotation following the same distribution as the observed one, and 1 for perfect agreement.
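The computation can be made concrete with the short Python sketch below, which follows the usual pairwise-agreement formulation for k annotators and N sentences; it is an illustration of the formula above rather than the code used in the experiments.

def kappa(labels):
    # labels[i] is the list of the k category labels assigned to sentence i
    N, k = len(labels), len(labels[0])
    categories = {c for row in labels for c in row}
    # P(A): proportion of agreeing annotator pairs, averaged over sentences
    pa = sum(sum(row.count(c) * (row.count(c) - 1) for c in categories)
             for row in labels) / (N * k * (k - 1))
    # P(E): chance agreement derived from the pooled category distribution
    pe = sum((sum(row.count(c) for row in labels) / (N * k)) ** 2
             for c in categories)
    return (pa - pe) / (1 - pe)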
For the reproducibility experiment, we used 6 abstracts in the training stage,
which was performed in three rounds, each round consisting of explanation,
annotation, and discussion. After training, the annotators were asked to annotate
46 abstracts sentence by sentence, assigning exactly one category per sentence.
The results show our scheme to be reproducible (K=0.69, N=320, k=3), considering the subjectiveness of this kind of annotation and the recommendations in the literature. In a similar experiment, Teufel et al. [9] reported the reproducibility of their scheme as slightly higher (K=0.71, N=4261, k=3). However, collapsing our categories METHODOLOGY, RESULT and CONCLUSION into a single one (similar to Teufel et al.’s category OWN) increases our agreement significantly
(K=0.82, N=320, k=3). We also found our scheme to be stable, as the same
annotator produced very similar annotation at different times (K=0.79, N=320,
k=2).
From this we conclude that trained humans can distinguish our set of categories and thus the data resulting from these experiments are reliable enough to
be used as training material for an automatic classifier.
3 Automatic Annotation of Abstracts
AZ [1] – and thus AZPort – is a Naive Bayesian classifier that assigns to each input sentence a set of possible rhetorical roles with their respective estimated probabilities. As usual with machine learning algorithms, instead of dealing directly
with the object to be classified (i.e. sentences), AZ receives sentences as feature
vectors. Feature extraction is thus a crucial design step in such scenarios and
hopefully will yield a set of features that captures the target categories, i.e., that
correlates with them in patterns that the learning algorithm is able to identify.
Here we report on AZPort’s redesign of AZ’s feature extraction.
3.1 Description of the Used Features
Our first step was to select the set of features to be applied in our experiment.
We implemented a set of 8 features, derived from the 16 used by Teufel and
Moens [1]: sentence length, sentence location, presence of citations, presence of
formulaic expressions, verb tense, verb voice, presence of modal auxiliary and
history.
The Length feature classifies a sentence as short, medium or long,
based on two thresholds (20 and 40 words) that were estimated using the average
sentence length present in our corpus.
The Location feature identifies the position occupied by a sentence within
the abstract. We use four values for this feature: first, medium, 2ndlast and last.
Experiments showed that these values characterize common sentence locations
for some specific categories of our scheme.
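The two simplest features can be sketched directly in Python as follows; the thresholds (20 and 40 words) and position labels are the ones quoted above, while the exact boundary handling is our own assumption.

def length_feature(sentence):
    n = len(sentence.split())
    if n < 20:
        return "short"
    return "medium" if n <= 40 else "long"

def location_feature(index, n_sentences):
    # index is the 0-based position of the sentence within the abstract
    if index == 0:
        return "first"
    if index == n_sentences - 1:
        return "last"
    if index == n_sentences - 2:
        return "2ndlast"
    return "medium"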
The Citation feature flags the presence or absence of citations in a sentence.
As we are not working with full texts, it is not possible to parse the reference
list and identify self-citations.
The Formulaic feature identifies the presence of a formulaic expression in
a sentence and the scheme category to which an expression belongs. Examples
of formulaic expressions are underlined in Table 1. In order to recognize
these expressions, we built a set of 377 regular expressions estimated to generate as many as 80,000 strings. The sources for these regular expressions were
phrases mentioned in the literature, and corpus observations. We then performed
a manual generalization to cover similar constructs.
Due to the productive inflectional morphology of Portuguese, much of the
porting effort went into adapting verb-syntactic features. The Tense, Voice
and Modal features report syntactic properties of the first finite verb phrase in
indicative or imperative mood. Tense may assume 14 values, including noverb
for verbless sentences. As verb inflection in Portuguese has a wide range of simple
tenses – many of which are rather rare in general and even absent in our corpus
– we collapsed some of them. As a result, we use a single past value and a single future value, instead of the three morphological past tenses and the two future tenses. In addition,
mood distinction is neutralized. The Voice feature may assume noverb, passive
or active. Passive voice is understood here in a broader sense, collapsing some
Portuguese verb forms and constructs that are usually used to omit an agent,
namely (i) regular passive voice (analogous to English, by means of auxiliary
“ser” plus past participle), (ii) synthetic passive voice (by means of the passivizing
particle “se”) and (iii) a special form of indeterminate subject (also by means
of particle “se”). The Modal feature flags the presence of a modal auxiliary (if
no verb is present, it assumes the value noverb).
The History feature takes into account the category of the previous sentence
in the classification process. It is known that some argumentative zones tend to
follow other particular zones [1,5]. This property is even more apparent in self-contained texts such as abstracts [6]. In our corpus, some particular sequences of
argumentative zones are very frequent. For example, the pattern BACKGROUND
followed by GAP, with repetition or not, and then followed by PURPOSE, i.e.
((BG) (GB)+)P, occurs in 30.7% of the corpus. To determine the value of
History for unseen sentences, we calculate it as a second pass process during
testing, performing a beam search with width three among the candidate categories for the previous sentence to reach the most likely classification.
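Our reading of this second pass is sketched below in Python: the classifier is re-queried with each candidate category for the previous sentence, and only the three best partial sequences are kept. The classify interface is hypothetical; only the beam width comes from the text.

def beam_decode(sentences, classify, beam_width=3):
    # classify(sentence, prev_category) is assumed to return a dictionary
    # {category: probability} computed by the Naive Bayes classifier
    beam = [([], 1.0)]                       # partial (category sequence, probability) pairs
    for sentence in sentences:
        expanded = []
        for sequence, prob in beam:
            prev = sequence[-1] if sequence else None
            for category, p in classify(sentence, prev).items():
                expanded.append((sequence + [category], prob * p))
        beam = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam_width]
    return beam[0][0]                        # most likely sequence of categories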
3.2 Automatic Annotation Results
Our training material was a collection of 52 abstracts from theses in Computer
Science, containing 366 sentences (10,936 words). The abstracts were automatically segmented into sentences using XML tags. Citations in running text were
also marked with an XML tag. The sentences were POS-tagged according to the
partial NILC tagset2. The target categories for our experiment were provided by
one of the subjects of the annotation experiment described in Section 2.
2 http://www.nilc.icmc.usp.br/nilc/tools/nilctaggers.html
We implemented a simple Naive Bayesian classifier to estimate the probability that a sentence S has category C given the values of its features. The category
with the highest probability is chosen as the output for the sentence.
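A minimal version of such a classifier for categorical feature vectors is sketched below in Python; the Laplace smoothing is an assumption added here so that the sketch runs on unseen feature values, and the code is not the implementation used in the experiments.

import math
from collections import Counter, defaultdict

class NaiveBayes:
    # each training example is a tuple of categorical feature values, e.g.
    # (length, location, citation, formulaic, tense, voice, modal, history)
    def fit(self, feature_vectors, categories):
        self.prior = Counter(categories)
        self.total = len(categories)
        self.counts = defaultdict(Counter)   # (category, feature index) -> value counts
        self.values = defaultdict(set)       # feature index -> observed values
        for vector, cat in zip(feature_vectors, categories):
            for i, value in enumerate(vector):
                self.counts[(cat, i)][value] += 1
                self.values[i].add(value)

    def predict(self, vector):
        def log_prob(cat):                   # log P(C) + sum_i log P(f_i | C)
            score = math.log(self.prior[cat] / self.total)
            for i, value in enumerate(vector):
                count = self.counts[(cat, i)][value]
                score += math.log((count + 1) / (self.prior[cat] + len(self.values[i])))
            return score
        return max(self.prior, key=log_prob)  # category with the highest probability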
The results of classification were compiled by applying 13-fold cross-validation to our 52 abstracts (training sets of 48 texts and testing sets of 4 texts).
As Baseline 1, we considered a random choice of categories weighted by their
distribution in the corpus. As Baseline 2, we considered classification as the most frequent category. The category distribution in our corpus is BACKGROUND
(21%), GAP (10%), PURPOSE (18%), METHODOLOGY (12%), RESULT (32%),
CONCLUSION (5%) and OUTLINE (2%).
Comparing our Naive Bayesian classifier (trained with the full pool of features) to one human annotator, the agreement reaches K=0.65 (system accuracy
of 74%). This is an encouragingly high amount of agreement when compared to
Teufel and Moens’ [1] figure of K=0.45. Our good result might be in part due to
the fact that we are dealing with abstracts (instead of full papers) and that all
of them fall into the same domain (Computer Science). This result is also much
better than Baseline 1 (K=0 and accuracy of 20%) and Baseline 2 (K=0.26 and
accuracy of 32%).
Further analysis of our results shows that, except for category OUTLINE,
the classifier performs well on all other categories, cf. the confusion matrix in
Table 2. We use the F-measure, defined as F = 2PR / (P + R), as a convenient way of reporting precision (P) and recall (R) in one value.
The classifier performs worst for OUTLINE sentences (F-measure=0). This is
no wonder, since we are dealing with an abstract corpus and thus there is
not much OUTLINE-type training material3 (total of 6 sentences in the whole
corpus). Regarding the other categories, the best performance of the classifier
is for PURPOSE sentences (F-measure=0.845), followed by RESULT sentences
(F-measure=0.769), cf. Table 3. We attribute the high performance for PURPOSE
to the presence of strong discourse markers in this kind of sentence (modelled
by the Formulaic feature). As for RESULT, we ascribe the good performance to
the high frequency of this kind of sentence in our corpus and to the presence of
specific discourse markers as well.
Looking at the contribution of single features, we found the strongest feature
to be Formulaic. We also observed that taking the context into account (History feature) is a helpful heuristic and improves the result significantly, by 12%.
Syntactic features – Tense, Voice and Modal – and Citation are the weakest
ones. We believe that the Citation feature would perform better on other kinds of text than abstracts (e.g. introductions).
In Table 4, the second column gives the predictiveness of the feature on
its own, in terms of Kappa between the classifier and one annotator. Apart
from Formulaic and History, all other features are outperformed by both
3 Many machine learning algorithms, including the Naive Bayes classifier, perform badly on infrequent categories due to the lack of sufficient training material.
baselines. The third column gives Kappa coefficients for experiments using all
features except the given one. As shown, all features apart from the syntactic
ones contribute some predictiveness in combination with others.
The results for automatic classification agree reasonably well with our
previous experimental results for human classification. We also observed that
the confusion classes of the automatic classification are similar to the confusion
classes of our human annotators. As can be observed in Table 2, the classifier
has some problems in distinguishing the categories METHODOLOGY, RESULT
and CONCLUSION and so do our human annotators. As mentioned in Section 2,
collapsing these three categories into one raises the human agreement considerably,
which suggests distinction problems amongst these categories even for humans.
We can conclude that the performance of the classifier, although much lower than human performance, is promising and acceptable for use as part of our automatic
critiquer. In the next section, we describe the critiquer and how it works on
unseen abstracts.
4 Automatic Critiquing of Abstracts
Once the schematic structure of an input has been recognized, it is checked
against a fixed set of critiquing rules, which ultimately refer to our seven-category
fixed-order model scheme. We focus on two kinds of possible deviations: (i) lack of categories and (ii) bad flow (i.e. ordering) of categories.
Naturally, an abstract does not have to present all the categories predicted by this model, nor does their strict order have to be observed. However, some
categories are considered obligatory (e.g. PURPOSE) and the lack of those and/or
the unbalanced use of (optional and obligatory) categories may lead to very poor
abstracts. As one major idea underlying our rules, we argue that a good abstract
must provide factual and specific information about a work. Thus, our aim is
to help writers to produce more “informative” abstracts, in which the reader is
likely to learn quickly what is most characteristic of and novel about the work
at hand.
Taking this into account, we find it reasonable to treat categories PURPOSE,
METHODOLOGY and RESULT as obligatory. On the other hand, categories BACKGROUND, GAP and CONCLUSION are treated as optional and, in the event of their
absence, the system only suggests their use to the writer. We consider OUTLINE
an inappropriate category for abstracts and, when detected, the critiquer will
recommend its removal. In fact, this category only appears in our scheme to
reflect our corpus observations.
Regarding the flow of categories, the critiquer tries to avoid error-prone sequences, such as RESULT before PURPOSE, and awkward sequences, such as the
use of BACKGROUND information separating two PURPOSEs, which is likely to
confuse the reader.
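The following Python sketch expresses these checks as rules over the recognized category sequence; the grouping of categories follows the text, but the wording of the messages and the exact rule set are ours, not SciPo’s.

OBLIGATORY = {"PURPOSE", "METHODOLOGY", "RESULT"}
OPTIONAL = {"BACKGROUND", "GAP", "CONCLUSION"}

def critique(categories):
    # 'categories' is the sequence of labels that AZPort assigned to the sentences
    remarks, present = [], set(categories)
    for cat in OBLIGATORY - present:
        remarks.append("Missing obligatory component: %s." % cat)
    for cat in OPTIONAL - present:
        remarks.append("Consider adding a %s component." % cat)
    if "OUTLINE" in present:
        remarks.append("OUTLINE is inappropriate in abstracts; consider removing it.")
    # flow checks: error-prone and awkward orderings
    if {"RESULT", "PURPOSE"} <= present and \
       categories.index("RESULT") < categories.index("PURPOSE"):
        remarks.append("RESULT appears before PURPOSE.")
    for i in range(1, len(categories) - 1):
        if categories[i] == "BACKGROUND" and \
           categories[i - 1] == "PURPOSE" and categories[i + 1] == "PURPOSE":
            remarks.append("BACKGROUND separates two PURPOSE sentences.")
    return remarks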
Table 5 exemplifies the AZPort output for one of the abstracts used in our
previous experiments4. We present the original English abstract, which is the direct translation of the Portuguese abstract. For illustration purposes, we include in parentheses the (correct) manual annotation in those cases in which the system disagreed; in agreement cases, we show a tick mark.
Note that the classifier made some mistakes (BACKGROUND vs. GAP), but
that does not affect the resulting critiques in this specific example. Sometimes it may also confuse very dissimilar categories, e.g. PURPOSE with BACKGROUND.
However, we believe the latter to be a lesser problem because the writer is likely
to perceive such mistakes and is encouraged to correct AZPort’s output before
submitting it to critiquing. A major problem is confusion between METHODOLOGY and RESULT, which is reflected directly in the critiquing stage. Although
our experiments with human annotators showed that these categories are hard
to distinguish in general texts, we believe that it is easier to make this distinction
for authors in their own writing, after they have received a critique of this writing
from our system.
4 Extracted from Simão, A.S.: “Proteum-RS/PN: Uma Ferramenta para a Validação de Redes de Petri Baseada na Análise de Mutantes”. Master’s Thesis, University of São Paulo (2000). Translation into English by the author.
In Table 6, we present the critiquer’s output for the previously classified abstract (Table 5). It is in accordance with the critiquing rules commented on above and alerts the writer to the fact that no explicit methodology or result was found in the abstract. One might argue that the PURPOSE sentence already indicates both methodology and result. However, as this system was designed for writers of dissertations and theses (whose abstracts are longer than journal/conference paper abstracts, which are usually written in English), it would be interesting to have more detailed abstracts, in which the methodology and results/contributions of the research are properly emphasized. Finally, the critiquer suggests the addition of the CONCLUSION component as a way to make the abstract more self-contained.
It is important to say that the system does not ensure that the final abstract
will be a good one, as the system focuses only on the argumentative structure
and there are other factors involved in the writing task. However, the system
has been informally tested and offers potentially useful guidance towards more
informative and genre-compliant abstracts.
5 Conclusion
We have reported on the experiment of porting Argumentative Zoning [1] from
English to Portuguese, including its adaptation to a new purpose. We call this
new classifier AZPort. The results showed that AZPort is suitable to be used in
the context of an automatic abstract critiquer, despite some limitations. As future work, we intend to evaluate the critiquer inside a supportive writing system,
called SciPo.
Acknowledgements
We would like to thank CAPES, CNPq and FAPESP for the financial support as
well as the annotators for their precious work. Special thanks to Lucas Antiqueira
for his invaluable help implementing SciPo.
References
1. Teufel, S., Moens, M.: Summarising scientific articles — experiments with relevance
and rhetorical status. Computational Linguistics 28 (2002) 409–446
2. Swales, J. In: Genre Analysis: English in Academic and Research Settings. Chapter
7: Research articles in English. Cambridge University Press, Cambridge, UK (1990)
110–176
3. Weissberg, R., Buker, S.: Writing up Research: Experimental Research Report
Writing for Students of English. Prentice Hall (1990)
4. Santos, M.B.d.: The textual organisation of research paper abstracts. Text 16
(1996) 481–499
5. Anthony, L., Lashkia, G.V.: Mover: A machine learning tool to assist in the reading
and writing of technical papers. IEEE Transactions on Professional Communication
46 (2003) 185–193
6. Feltrim, V., Aluisio, S.M., Nunes, M.d.G.V.: Analysis of the rhetorical structure
of computer science abstracts in Portuguese. In Archer, D., Rayson, P., Wilson,
A., McEnery, T., eds.: Proceedings of Corpus Linguistics 2003, UCREL Technical
Papers, Vol. 16, Part 1, Special Issue. (2003) 212–218
7. Aluisio, S.M., Barcelos, I., Sampaio, J., Oliveira Jr., O.N.: How to learn the many
unwritten “Rules of the Game” of the Academic Discourse: A Hybrid Approach
Based on Critiques and Cases. In: Proceedings of the IEEE International Conference on Advanced Learning Technologies. (2001) 257–260
8. Aluisio, S.M., Oliveira Jr., O.N.: A detailed schematic structure of research papers
introductions: An application in support-writing tools. Revista de la Sociedad Española para el Procesamiento del Lenguaje Natural (1996) 141–147
9. Teufel, S., Carletta, J., Moens, M.: An annotation scheme for discourse-level argumentation in research articles. In: Proceedings of the Ninth Meeting of the European Chapter of the Association for Computational Linguistics (EACL-99). (1999)
110–117
10. Anthony, L.: Writing research article introductions in software engineering: How
accurate is a standard model? IEEE Transactions on Professional Communication
42 (1999) 38–46
11. Siegel, S., Castellan, N.J.J.: Nonparametric Statistics for the Behavioral Sciences.
2nd edn. McGraw-Hill, Berkeley, CA (1988)
DiZer: An Automatic Discourse Analyzer
for Brazilian Portuguese
Thiago Alexandre Salgueiro Pardo, Maria das Graças Volpe Nunes, and
Lucia Helena Machado Rino
Núcleo Interinstitucional de Lingüística Computacional (NILC)
CP 668 – ICMC-USP, 13.560-970 São Carlos, SP, Brasil
[email protected], [email protected]
[email protected]
http://www.nilc.icmc.usp.br
Abstract. This paper presents DiZer, an automatic DIscourse analyZER for
Brazilian Portuguese. Given a source text, the system automatically produces its
corresponding rhetorical analysis, following Rhetorical Structure Theory – RST
[1]. A rhetorical repository, which is DiZer’s main component, makes the automatic analysis possible. This repository, produced by means of a corpus analysis, includes discourse analysis patterns that capture knowledge about discourse markers, indicative phrases and word usages. When applicable,
potential rhetorical relations are indicated. A preliminary evaluation of the system is also presented.
Keywords: Automatic Discourse Analysis, Rhetorical Structure Theory
1 Introduction
Research in Linguistics and Computational Linguistics has shown that a text is more than just a simple sequence of juxtaposed sentences. Indeed, it has a highly elaborate underlying discourse structure. In general, this structure represents how the information conveyed by the text’s propositional units (that is, the meaning of the text segments) correlates and makes sense together.
There are several discourse theories that try to represent different aspects of discourse. The Rhetorical Structure Theory (RST) [1] is one of the most used theories
nowadays. According to it, all propositional units in a text must be connected by
rhetorical relations in some way for the text to be coherent. As an example of a rhetorical analysis of a text, consider Text 1 (adapted from [2]) in Figure 1 (with segments that express basic propositional units numbered) and its rhetorical structure in
Figure 2. The symbols N and S indicate the nucleus and satellite of each rhetorical
relation: in RST, the nucleus indicates the most important information in the relation,
while the satellite provides complementary information to the nucleus. In this structure, propositions 1 and 2 are in a CONTRAST relation, that is, they are opposing
facts that may not happen at the same time; proposition 3 is the direct (non-volitional) RESULT of the opposition between 1 and 2. In some cases, relations are multinuclear (e.g., the CONTRAST relation), that is, they have no satellites and the connected
propositions are considered to have the same importance.
Fig. 1. Text 1.
Fig. 2. Text 1 rhetorical structure.
The ability to automatically derive discourse structures of texts is of great importance to many applications in Computational Linguistics. For instance, it may be very
useful for automatic text summarization (to identify the most important information
of a text to produce its summary) (see, for instance, [2] and [3]), co-reference resolution (determining the context of reference in the discourse may help determine the referred term) (see, for instance, [4] and [5]), and for other natural language understanding applications as well. Some discourse analyzers are already available for the English and Japanese languages (see, for example, [2], [6], [7], [8], [9], [22] and
[23]).
This paper describes DiZer, an automatic DIscourse analyZER for Brazilian Portuguese. To our knowledge, it is the first proposal for this language. It follows the existing analyzers for English and Japanese, having as its main process a rhetorical analyzer, in accordance with RST. DiZer’s main resource is a rhetorical repository, which comprises knowledge about discourse markers, indicative phrases and word usages, and the rhetorical relations they may indicate, in the form of discourse analysis patterns. Such patterns were produced by means of a corpus analysis. When applied to an unseen text, they may identify the rhetorical relations between the propositional units. The rhetorical repository also comprises heuristics that help determine
some rhetorical relations, mainly those that are usually not superficially signaled in
the text.
The next section presents some relevant aspects of other research on discourse analysis. Section 3 describes the corpus analysis and the repository of rhetorical information used in DiZer. Section 4 outlines DiZer’s architecture and describes its main processes. Section 5 shows some preliminary results concerning DiZer’s performance, while concluding remarks are given in Section 6.
2 Related Work
Automatic rhetorical analysis has become a hot topic lately. Significant research efforts have arisen that focus on different methodologies and techniques. This section sketches some of them.
Based on the assumption that cue phrases and discourse markers are direct hints of a text’s underlying discourse structure, Marcu [6] was the first to develop a cue-phrase-based rhetorical analyzer for free-domain texts in English. He used a corpus-driven methodology to identify discourse markers and information on their contextual occurrences and possible rhetorical relations. Marcu also proposed a complete formalization of RST in order to enable its computational manipulation according to his purposes. Later on, Marcu [2], Marcu and Echihabi [7] and Soricut and Marcu [8] proposed, respectively, a decision-based rhetorical analyzer, a Bayesian machine
learning-based rhetorical analyzer and a sentence-level rhetorical analyzer using statistical models. In the first one, Marcu applied a shift-reduce parsing model to build
rhetorical structures. He achieved better results than with the cue-phrase-based analyzer. In the second one, Marcu and Echihabi trained a Bayesian classifier only with
the words of texts to identify four basic rhetorical relations. They achieved a high
accuracy in their analysis. Finally, Soricut and Marcu made use of syntactic and lexical information extracted from discourse-annotated lexicalized syntactic trees to train
statistical models. With this method, in the sentence-level analysis, they achieved
results near human performance.
Also based on Marcu’s RST formalization, Corston-Oliver [9] developed a rhetorical analyzer for encyclopedic texts based on the occurrence of discourse markers
in texts and syntactic realizations relating text segments. He investigated which syntactic features could help determine rhetorical relations, focusing on features like
subordination and coordination, active and passive voices, the morphosyntactic categorization of words and the syntactic heads of constituents.
Following Marcu’s analyzer [6], DiZer may also be classified as a cue-phrase-based rhetorical analyzer. However, unlike Marcu’s analyzer, DiZer is genre-specific. For this reason, it makes use of other knowledge sources (indicative phrases and words, heuristics) and adopts an incremental analysis method, as will be discussed later in this paper. The next section describes the corpus analysis conducted for DiZer’s development.
3 Corpus Analysis and Knowledge Extraction
3.1 Annotating the Corpus
The corpus was composed of 100 scientific texts on Computer Science taken from the
introduction sections of MSc dissertations (ca. 53,000 words and 1,350 sentences). The scientific genre was chosen for the following reasons: a) scientific texts are supposedly well written; b) they usually present more discourse markers and indicative phrases and words than other text genres; c) other works on discourse analysis
for Brazilian Portuguese ([10], [11], [12], [13], [14]) have used the same sort of texts.
The corpus was rhetorically annotated following Carlson and Marcu’s discourse annotation manual [15]. Although this manual focuses on the English language, it may also be applied to Brazilian Portuguese, since RST rhetorical relations are theoretically language-independent. The use of this manual has allowed a more systematic and mistake-free annotation. For annotating the texts, Marcu’s adaptation of O’Donnell’s RSTTool [16] was used. To guarantee consistency during the annotation process, the corpus was annotated by only one expert in RST.
Initially, the original RST relations set has been used to annotate the corpus. When
necessary, more relations have been added to the set. In the end, the full set amounts
to 32 relations, as shown in Figure 3. The added ones are in bold face. Some of them
(PARENTHETICAL and SAME-UNIT) are only used for organizing the discourse
structure. Figure 3 also shows the frequency (in %) of each relation in the analyzed
corpus.
Fig. 3. DiZer rhetorical relations set.
The annotation strategy for each text was incremental, step by step, in the following way: initially, all propositions of each sentence were related by rhetorical relations; then, the sentences of each paragraph were related; finally, the paragraphs of
the text were related. This annotation scheme takes advantage of the fact that the
writer tends to put together (i.e., at the same level in the hierarchical organization of
the text) the related propositions. For instance, if two propositions are directly related
(e.g., a cause and its consequence), it is probable that they will be expressed in the
same sentence or in adjacent sentences. This very same reasoning is used in DiZer for
analyzing texts. More details about the corpus and its annotation may be found in
[17] and [18].
3.2 Knowledge Extraction
Once completely annotated, the corpus was manually analyzed in order to identify discourse markers, indicative phrases and words, and heuristics that might indicate rhetorical relations. Based on this, discourse analysis patterns were derived for each rhetorical relation, currently amounting to 840 patterns. These patterns constitute the main information repository of the system.
As an example, consider the discourse analysis pattern for the OTHERWISE rhetorical relation in Figure 4. According to it, an OTHERWISE relation connects two
propositional units 1 and 2, with 1 being the satellite and 2 the nucleus, and with the segment that expresses 1 appearing before the segment that expresses 2 in the text, if the discourse marker ou, alternativamente, (in English, ‘or, alternatively,’) is present at the beginning of the segment that expresses propositional unit 2.
The idea is that, when a new text is given as input to DiZer, a pattern matching
process is carried out. If one of the discourse analysis patterns matches some portion
of the text being processed, the corresponding rhetorical relation is assumed to hold
between the appropriate segments.
The discourse analysis patterns may also convey morphosyntactic information, lemmas and genre-specific information. For instance, consider the pattern in Figure 5, which hypothesizes a PURPOSE relation. This pattern specifies that a PURPOSE rhetorical relation is found if there is in the text an indicative phrase composed of (1) a word whose lemma is cujo (‘which’, in English1), (2) followed by any word that indicates purpose (represented by the ‘purWord’ class, whose possible values are defined separately by the user), (3) followed by any adjective, (4) followed by a word whose lemma is ser (the verb ‘to be’, in English). Based on similar features, any pattern may be represented. Complex patterns, possibly involving long-distance dependencies, may also be represented by using a special character (*) to indicate jumps in the pattern matching process.
Fig. 4. Discourse analysis pattern for the OTHERWISE rhetorical relation.
Fig. 5. Discourse analysis pattern for the PURPOSE rhetorical relation.
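A much simplified version of how such a pattern could be applied to two adjacent segments is sketched below in Python; the encoding covers only a literal marker and the nucleus/satellite assignment, leaving out the lemmas, word classes and ‘*’ jumps of the real patterns.

OTHERWISE_PATTERN = {
    "relation": "OTHERWISE",
    "marker": "ou, alternativamente,",   # discourse marker opening the second segment
    "nucleus": 2,                         # unit 2 is the nucleus, unit 1 the satellite
}

def match(pattern, segment1, segment2):
    if segment2.lower().startswith(pattern["marker"]):
        if pattern["nucleus"] == 2:
            satellite, nucleus = segment1, segment2
        else:
            satellite, nucleus = segment2, segment1
        return {"relation": pattern["relation"],
                "nucleus": nucleus, "satellite": satellite}
    return None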
For relations that are not explicitly signaled in the text, like EVALUATION and
SOLUTIONHOOD, it has been possible to define some heuristics to enable the discourse analysis, given the specific text genre under focus. For the SOLUTIONHOOD
relation, for example, the following heuristic holds:
if in a segment X, ‘negative’ words like ‘cost’ and ‘problem’ appear more than once
and, in segment Y, which follows X, ‘positive’ words like ‘solution’ and ‘development’
appear more than once too, then a SOLUTIONHOOD relation holds between propositions expressed by segments X and Y, with X being the satellite and Y the nucleus
of the relation
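Written out directly, the heuristic could look like the Python sketch below; the word lists are illustrative stand-ins for whatever lexicon DiZer actually uses.

NEGATIVE_WORDS = {"cost", "problem", "difficulty", "limitation"}
POSITIVE_WORDS = {"solution", "development", "improvement", "proposal"}

def solutionhood(segment_x, segment_y):
    # segment_y is assumed to immediately follow segment_x in the text
    neg = sum(word in NEGATIVE_WORDS for word in segment_x.lower().split())
    pos = sum(word in POSITIVE_WORDS for word in segment_y.lower().split())
    if neg > 1 and pos > 1:               # "more than once" in both segments
        return {"relation": "SOLUTIONHOOD",
                "satellite": segment_x, "nucleus": segment_y}
    return None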
The next section describes DiZer and its processes, showing how and where the rhetorical repository is used.
1 Although ‘which’ is invariable in English, its counterpart in Portuguese, cujo, may vary in gender and number.
4 DiZer Architecture
DiZer comprises three main processes: (1) the segmentation of the text into propositional units, (2) the detection of occurrences of rhetorical relations between propositional units and (3) the building of the valid rhetorical structures. In what follows,
each process is explained. Figure 6 presents the system architecture.
Fig. 6. DiZer architecture.
4.1 Text Segmentation
In this process, DiZer tries to determine the simple clauses in the source text, since
simple clauses usually express simple propositional units, which are assumed to be
the minimal units in a rhetorical structure. For doing this, DiZer initially attributes
morphosyntactic categories to each word in the text using a Brazilian Portuguese
tagger [19]. Then, the segmentation process is carried out, segmenting the text whenever a punctuation sign (comma, period, exclamation or question mark, etc.) or a strong discourse marker or indicative phrase is found. By strong discourse marker or indicative phrase we mean those word groups that unambiguously have a function in discourse. According to this, words like e and se (in English, ‘and’ and ‘if’2, respectively) are ignored, while words like portanto and por exemplo (in English, ‘therefore’ and ‘for instance’, respectively) are not. DiZer also verifies whether the identified segments are clauses by looking for occurrences of verbs in them.
Although this process is very simple, it produces reasonable results (see Figure 7
for an example of segmentation). In some cases, the system cannot distinguish embedded clauses, causing inaccurate segmentation, but this may be overcome in the
future by using a syntactic parser.
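A very rough Python sketch of this segmentation step is given below: the text is split after punctuation signs and before strong markers, and only segments containing a verb are kept. The marker list and the has_verb interface to the tagger are assumptions made here for illustration.

import re

STRONG_MARKERS = ["portanto", "por exemplo", "entretanto"]  # illustrative only

def segment(text, has_verb):
    # has_verb(segment) is assumed to consult the POS tagger and report
    # whether the segment contains a verb, as DiZer does
    marker_alt = "|".join(re.escape(m) for m in STRONG_MARKERS)
    boundary = r"(?<=[,.;:!?])|(?=\b(?:%s)\b)" % marker_alt
    pieces = [p.strip() for p in re.split(boundary, text) if p.strip()]
    return [p for p in pieces if has_verb(p)]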
4.2 Detection of Rhetorical Relations
DiZer tries to determine at least one rhetorical relation for every two adjacent text segments representing the corresponding underlying propositions. In order to do so, it uses both discourse analysis patterns and heuristics. Initially, it looks for a relation between every two adjacent segments of each sentence; then, it considers every two adjacent sentences of a paragraph; finally, it considers every two adjacent paragraphs. This processing order is supported by the premise that a writer organizes related information at the same organization level, as already discussed in this paper.
2 Although ‘if’ is rarely ambiguous in English, its counterpart in Portuguese, se, may assume many roles in a text. See a comprehensive discussion of the possible roles of se in [20].
When more than one discourse analysis pattern applies, usually in occurrences of
ambiguous discourse markers, all the possible patterns are considered. In this case,
several rhetorical relations may be hypothesized for the same propositions. Because
of this, multiple discourse structures may be derived for the same text.
In the worst case, when no rhetorical relation can be found between two segments,
DiZer assumes a default heuristic: it adopts an ELABORATION relation between
them, with the segment that appears first in the text being its nucleus. This is in accordance with what has been observed in the corpus analysis, in that the first segment
is usually elaborated by the following ones. Although this may cause some underspecification or, perhaps, inadequacy in the discourse structure, it is a plausible solution and it may even be the case that such a relation really applies. ELABORATION was
chosen as the default relation for being the most frequent relation in the corpus analyzed.
The system also keeps a record of the applied discourse analysis patterns and heuristics, so that problematic or ambiguous cases in the discourse structure can later be identified manually and/or computationally. In this way, it is possible to
reengineer and improve the resulting discourse analysis.
4.3 Building the Rhetorical Structure
This step consists of determining the complete rhetorical structure of the text from the individual rhetorical relations between its segments. For this, the system makes use of the
rule-based algorithm proposed in [6]. This algorithm produces grammar rules for
each possible combination of segments by a rhetorical relation, in the form of a DCG
(Definite-Clause Grammar) rule [21]. When the final grammar is executed, all possible valid rhetorical structures are built.
As a complete example of DiZer processing, Figures 7 and 8 present, respectively,
a text (translated from Portuguese) already segmented by DiZer and one of the valid
rhetorical structures built. One may verify that the structure is totally plausible. It is
also worth noticing that paragraphs and sentences form complete substructures in the
overall structure, given the adopted processing strategy.
The next section presents some preliminary results concerning DiZer’s performance.
5 Preliminary Evaluation
A preliminary evaluation of DiZer has been carried out taking into account five scientific texts on Computer Science (which are not part of the corpus analyzed for producing the rhetorical repository). These have been randomly selected from introductions of MSc dissertations in the NILC Corpus3, currently the biggest corpus of texts for Brazilian Portuguese. Each text had, on average, 225 words, 7 sentences, 17
propositional units and 16 rhetorical relations.
3 www.nilc.icmc.usp.br/nilc/tools/corpora.htm
Once discourse-analyzed by DiZer, the resulting rhetorical structures have been
verified in order to assess two main points: (I) the performance of the segmentation
process and (II) the plausibility of the hypothesized rhetorical relations. Such features
have been chosen for being the core of DiZer's main processes. A single RST expert analyzed those structures, using as reference one manually generated discourse structure for each text, which incorporated all plausible relations between the propositions. Table 1 presents the resulting average recall and precision figures for
DiZer. It also shows the results for a baseline method, which considers complete
sentences as segments and always hypothesizes ELABORATION relations between
them (since it is the most common and generic relation).
Fig. 7. Text 2.
Fig. 8. Text 2 rhetorical structure.
For text segmentation, recall indicates how many segments of the reference discourse structure were correctly identified and precision indicates how many of the
identified segments were correct; for rhetorical relations hypotheses, recall indicates
how many relations of the reference discourse structure were correctly hypothesized
(taking into account the related segments and their nuclearity – which segments were
nuclei and satellites) and precision indicates how many of the hypothesized relations
were correct. It is possible to see that the baseline method performed very poorly and
that DiZer outperformed it.
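The recall and precision figures above amount to simple set comparisons between the system output and the reference structure; a minimal sketch of this scoring (our own formulation, not the evaluation script actually used) is shown below.

```python
# Segments can be represented as (start, end) offsets and relations as
# (relation, nucleus_span, satellite_span) triples; both are scored the same way.
def precision_recall(system_items, reference_items):
    system, reference = set(system_items), set(reference_items)
    correct = len(system & reference)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

p, r = precision_recall({(0, 12), (13, 30)}, {(0, 12), (13, 25), (26, 30)})
print(p, r)   # 0.5 precision (1 of 2 correct), 0.33 recall (1 of 3 recovered)
```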
Some problematic issues might interfere with the evaluation, namely, the tagger performance and the quality of the source texts. If the tagger fails to identify the morphosyntactic classes of the words, discourse analysis may be compromised. Also, if
the source texts present a significant misuse of discourse markers, inadequate rhetorical structures may be produced. These problems have not been observed in the current evaluation, but they should be taken into account in future evaluations.
It is worth noticing that Marcu's cue-phrase-based rhetorical analyzer (which is presently the analyzer most similar to DiZer) achieved worse recall in both cases (51% and 47%), but better precision (96% and 78%) than DiZer. Although this direct comparison is unfair, given that the languages, test corpora and even the analysis methods differ, it gives an idea of the state-of-the-art results in cue-phrase-based
automatic discourse analysis.
6 Concluding Remarks
This paper presented DiZer, a knowledge-intensive discourse analyzer for Brazilian Portuguese that produces rhetorical structures of scientific texts based upon Rhetorical Structure Theory. To our knowledge, DiZer is the first discourse analyzer for this language and, once available, it should serve as the basis for the development and improvement of other NLP tasks, like automatic summarization and co-reference resolution.
Although DiZer was developed for the analysis of scientific texts, it is worth noticing that it may also be applied to free-domain texts, since, in general, discourse markers are consistently used across domains.
In a preliminary evaluation, DiZer has achieved very good performance. However, there is still room for improvement. The use of a parser and the development of new specialized analysis patterns and heuristics should improve its performance. In the near future, a statistical module should be introduced into the system, enabling it to determine the most probable discourse structure among the possible structures built, as well as to hypothesize rhetorical relations when no discourse markers, indicative phrases or indicative words are present in a segment of the source text.
Acknowledgments
The authors are grateful to the Brazilian agencies FAPESP, CAPES and CNPq, and to the Fulbright Commission for supporting this work.
References
1. Mann, W.C. and Thompson, S.A.: Rhetorical Structure Theory: A Theory of Text Organization. Technical Report ISI/RS-87-190 (1987).
2. Marcu, D.: The Theory and Practice of Discourse Parsing and Summarization. The MIT
Press. Cambridge, Massachusetts (2000).
3. O’Donnell, M.: Variable-Length On-Line Document Generation. In the Proceedings of the
6th European Workshop on Natural Language Generation. Duisburg, Germany (1997).
4. Cristea, D.; Ide, N.; Romary, L.: Veins Theory. An Approach to Global Cohesion and Coherence. In the Proceedings of Coling/ACL. Montreal (1998).
5. Schauer, H.: Referential Structure and Coherence Structure. In the Proceedings of TALN.
Lausanne, Switzerland (2000).
6. Marcu, D.: The Rhetorical Parsing, Summarization, and Generation of Natural Language
Texts. PhD Thesis, Department of Computer Science, University of Toronto (1997).
7. Marcu, D. and Echihabi, A.: An Unsupervised Approach to Recognizing Discourse Relations. In the Proceedings of the 40th Annual Meeting of the Association for Computational
Linguistics (ACL’02), Philadelphia, PA (2002).
8. Soricut, R. and Marcu, D.: Sentence Level Discourse Parsing using Syntactic and Lexical
Information. In the Proceedings of the Human Language Technology and North American
Association for Computational Linguistics Conference (HLT/NAACL), Edmonton, Canada
(2003).
9. Corston-Oliver, S.: Computing Representations of the Structure of Written Discourse. PhD
Thesis, University of California, Santa Barbara, CA, USA (1998).
10. Feltrim, V.D.; Aluísio, S.M.; Nunes, M.G.V.: Analysis of the Rhetorical Structure of Computer Science Abstracts in Portuguese. In the Proceedings of Corpus Linguistics (2003).
11. Pardo, T.A.S. and Rino, L.H.M.: DMSumm: Review and Assessment. In E. Ranchhod and
N. J. Mamede (eds.), Advances in Natural Language Processing, (2002) pp. 263-273 (Lecture Notes in Artificial Intelligence 2389). Springer-Verlag, Germany.
12. Aluísio, S.M. and Oliveira Jr., O.N.: A Case-Based Approach for Developing Writing
Tools Aimed at Non-native English Users. Lecture Notes in Computer Science, Vol. 1010,
(1995) pp. 121-132.
13. Aluísio, S.M.; Barcelos, I.; Sampaio, J.; Oliveira Jr., O.N.: How to Learn the Many Unwritten 'Rules of the Game' of the Academic Discourse: A Hybrid Approach Based on Critiques and Cases to Support Scientific Writing. In the Proceedings of the IEEE International Conference on Advanced Learning Technologies. Madison, Wisconsin. Los
Alamitos, CA: IEEE Computer Society, Vol. 1, (2001) pp. 257-260.
14. Rino, L.H.M. and Scott, D.: A Discourse Model for Gist Preservation. In the Proceedings
of the XIII Brazilian Symposium on Artificial Intelligence (SBIA’96). Curitiba - PR, Brasil
(1996).
15. Carlson, L. and Marcu, D.: Discourse Tagging Reference Manual. ISI Technical Report
ISI-TR-545 (2001).
16. O’Donnell, M.: RST-Tool: An RST Analysis Tool. In the Proceedings of the 6th European
Workshop on Natural Language Generation. Gerhard-Mercator University, Duisburg,
Germany (1997).
17. Pardo, T.A.S. and Nunes, M.G.V.: A Construção de um Corpus de Textos Científicos em
Português do Brasil e sua Marcação Retórica. Série de Relatórios Técnicos do Instituto de
Ciências Matemáticas e de Computação - ICMC, Universidade de São Paulo, no. 212
(2003).
18. Pardo, T.A.S. and Nunes, M.G.V.: Relações Retóricas e seus Marcadores Superficiais:
Análise de um Corpus de Textos Científicos em Português do Brasil. Relatório Técnico
NILC-TR-04-03. Série de Relatórios do NILC, ICMC-USP (2004).
19. Aires, R.V.X.; Aluísio, S.M.; Kuhn, D.C.S.; Andreeta, M.L.B.; Oliveira Jr., O.N.: Combining Multiple Classifiers to Improve Part of Speech Tagging: A Case Study for Brazilian
Portuguese. In the Proceedings of the Brazilian AI Symposium (SBIA’2000), (2000) pp.
20-22.
20. Martins, R.T.; Montilha, G.; Rino, L.H.M.; Nunes, M.G.V.: Dos Modelos de Resolução da
Ambigüidade Categorial: O Problema do SE. In the Proceedings of the IV Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada, PROPOR'99, (1999)
pp. 115-128. Évora, Portugal.
21. Pereira, F.C.N. and Warren, D.H.D.: Definite Clause Grammars for Language Analysis – A
Survey of the Formalism and Comparison with Augmented Transition Networks. Artificial
Intelligence, N. 13, (1980) pp. 231-278.
22. Schilder, F.: Robust discourse parsing via discourse markers, topicality and position. In J.
Tait, B.K. Boguraev and C. Jacquemin (eds.), Natural Language Engineering, Vol. 8. Cambridge University Press (2002).
23. Sumita, K.; Ono, K.; Chino, T.; Ukita, T.; Amano, S.: A discourse structure analyzer for
Japanese text. In the Proceedings of the International Conference on Fifth Generation
Computer Systems, Vol. 2, (1992) pp. 1133-1140. Tokyo, Japan.
A Comparison of Automatic Summarizers
of Texts in Brazilian Portuguese*
Lucia Helena Machado Rino1, Thiago Alexandre Salgueiro Pardo1,
Carlos Nascimento Silla Jr.2, Celso Antônio Alves Kaestner2, and Michael Pombo1
1
Núcleo Interinstitucional de Lingüística Computacional (NILC/São Carlos)
DC/UFSCar – CP 676, 13565-905 São Carlos, SP, Brazil
[email protected], {michaelp,lucia}@dc.ufscar.br
http://www.nilc.icmc.usp.br
2
Pontifícia Universidade Católica do Paraná (PUC-PR)
Av. Imaculada Conceição 1155, 80215-901 Curitiba, PR, Brazil
{silla,kaestner}@ppgia.pucpr.br
Abstract. Automatic Summarization (AS) in Brazil has only recently become a
significant research topic. When compared to initiatives for other languages, such a delay can be explained by the lack of specific resources, such as expressive lexicons and corpora that could provide adequate foundations for deep or shallow approaches to AS. Taking advantage of commonalities with respect to resources and a corpus of texts and summaries written in Brazilian Portuguese, two NLP research groups have decided to start a common task to assess and compare their AS systems. In the experiment five distinct extractive AS systems have been assessed. Some of them incorporate techniques that have already been used to
summarize texts in English; others propose novel approaches to AS. Two baseline
systems have also been considered. An overall performance comparison has been
carried out, and its outcomes are discussed in this paper.
1 Introduction
We definitely live in the information explosion era. A recent study from Berkeley
[12] indicates there were 5 million terabytes of new information created via print,
film, magnetic, and optical storage media in 2002, and the www alone contains about
170 terabytes of information on its surface. This is about twice the data generated in
1999, corresponding to a growth rate of about 30% per year. However, making use of this information is very hard. Problems like information retrieval and extraction, and text summarization have thus become important areas of Computer Science research.
Especially concerning Automatic Summarization (AS), we focus on extractive
methods in order to produce extracts of texts written in Brazilian Portuguese. Extracts, in this context, are summaries produced automatically on the basis of superficial (empirical or statistical) techniques, broadly known as extractive methods [15].
These actually aim at producing summaries that consist entirely of material copied,
* The Brazilian Agencies FAPESP and PIBIC-CNPq supported this research.
usually sentences, from the source texts. Typically, extracts or summaries automatically generated have 10 to 30% of the original text length – being faster to read – but
must contain enough information to satisfy the user’s needs [13].
Five AS systems were assessed, all of them sharing the same linguistic resources,
when applicable. Only precision (P) and recall (R) have been considered, for practical
reasons: being extractive, all the summarizers under consideration could be automatically assessed to calculate P and R. The performance of those AS systems could thus
be compared, in order to identify the features that apply better to a genre-specific text
corpus in Brazilian Portuguese.
To calculate P and R, ideal summaries – extractive versions of the manual summaries – have been used, which have been automatically produced by a specific tool, a
generator of ideal extracts (available in http://www.nilc.icmc.usp.br/~thiago). This
tool is based upon the widely known vector space model and the cosine similarity
measure [25], and works as follows: 1) for each sentence in the manual summary the
most similar sentence in the text is obtained (through the cosine measure); 2) the most
representative sentences are selected, yielding the corresponding ideal, extractive,
summary. This procedure works as suggested by [14], i.e., it is based on the premise
that ideal extracts should be composed of as many sentences (the most similar ones)
as the ones in the corresponding manual summary.
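A minimal sketch of this generation procedure is given below, assuming plain bag-of-words vectors and raw term frequencies; the actual tool also applies the pre-processing steps described in Section 2.

```python
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def ideal_extract(text_sentences, manual_summary_sentences):
    text_vectors = [Counter(s.lower().split()) for s in text_sentences]
    chosen = set()
    for summary_sentence in manual_summary_sentences:
        vector = Counter(summary_sentence.lower().split())
        # step 1: the most similar source sentence for each manual-summary sentence
        best = max(range(len(text_vectors)), key=lambda i: cosine(vector, text_vectors[i]))
        chosen.add(best)
    # step 2: the selected sentences, in text order, form the ideal extract
    return [text_sentences[i] for i in sorted(chosen)]
```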
As we shall see, some of the systems being assessed had to be trained. In this case,
the very same pre-processing tools and data have been used by all of them. We chose
TeMário [19] (available at http://www.linguateca.pt/Repositorio/TeMario), a corpus of 100 newspaper texts (ca. 613 words, or 1 to 2½ pages, each) that has been built for AS purposes, as the only input for the assessment reported here. The texts have been taken from regular online Brazilian newspapers, the Folha de São Paulo (60 texts) and the Jornal do Brasil (40 texts). They are equally distributed amongst distinct domains, namely free author views, critiques, world news, politics, and foreign affairs. The summaries that come along with this corpus are those hand-produced by the consultant on the Brazilian Portuguese language.
Details of the considered systems and their assessment are given below. In Section
2, we outline the main features of each system under focus. In Section 3 we describe the experiment itself and thoroughly discuss the systems' overall rating. Finally, in Section 4 we address the outcomes of the reported assessment, concerning the potential of applying AS to Brazilian Portuguese texts of a particular genre.
2 Extractive AS Systems Under Focus
Each of the assessed AS systems tackles a particular AS strategy. In particular, three of
them suggest novel approaches, as follows: (a) Gist Summarizer (GistSumm) [20],
focuses upon the matching of lexical items of the source text against lexical items of a
gist sentence, supposed to be the sentence of the source text that best expresses its
main idea, which is previously determined by means of a word frequency distribution; (b) the Term Frequency-Inverse Sentence Frequency-based Summarizer (TF-ISF-Summ) [9] adapts Salton's TF-IDF information retrieval measure [25] in that, instead of signaling the documents to retrieve, it pinpoints those sentences of a source
text that must be included in a summary; (c) Neural Summarizer (NeuralSumm) [21]
is based upon a neural network that, after training, is capable of identifying relevant
sentences in a source text for producing the extract. Added to those, we employ a
classification system (ClassSumm) that produces extracts based on a Machine Learning (ML) approach, in which summarization is considered as a classification task.
Finally, we use Text Summarization in Portuguese (SuPor) [17], a system aiming at
exploring alternative methodologies that have been previously suggested to summarize texts in English. Based on a ML technique, it allows the user to customize surface and/or linguistic features to be handled during summarization, permitting one to
generate diverse AS engines. In the assessment reported in this paper, SuPor has been
customized to just one AS system.
All the systems consistently incorporate language-specific resources, aiming at ensuring the accuracy of the assessment. The most significant tools already available for
Brazilian Portuguese are a part-of-speech tagger [1], a parser [16], and a stemmer
based upon Porter’s algorithm [3]. Linguistic repositories include a lexicon [18], and
a list of discourse markers, which is derived from the DiZer system [22]. Additionally, a stoplist (i.e., a list of stopwords, which are too common and, therefore, irrelevant to summarization) and a list of the commonest lexical items that signal anaphors
are also used. Apart from the discourse markers and the lexical items lists, which are
used only by ClassSumm, and the tagger and parser, which are not used by
GistSumm and NeuralSumm, the other resources are shared amongst all the systems.
Text pre-processing is also common to all the systems. It involves text segmentation, delimiting sentences through simple punctuation-based rules, case folding, stemming, and stopword removal; a minimal sketch of this shared pipeline is given below. In the following subsections we briefly describe each AS system.
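In the sketch, the tiny stoplist and the identity stemmer are placeholders for the actual Portuguese stoplist and the Porter-based stemmer [3].

```python
import re

STOPWORDS = {"a", "o", "de", "que", "e", "em"}          # placeholder stoplist

def split_sentences(text):
    # simple punctuation-based sentence delimiting
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def preprocess(sentence, stemmer=lambda w: w):
    tokens = re.findall(r"\w+", sentence.lower())        # case folding
    return [stemmer(t) for t in tokens if t not in STOPWORDS]
```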
2.1 The GistSumm System
GistSumm is an automatic summarizer based on a novel extractive method, called
gist-based method. For GistSumm to work, the following premises must hold: (a)
every text is built around a main idea, namely, its gist; (b) it is possible to identify in
a text just one sentence that best expresses its main idea, namely, the gist sentence.
Based on them, the following hypotheses underlie the GistSumm methodology: (I) through simple statistics, the gist sentence or an approximation of it can be determined; (II)
by means of the gist sentence, it is possible to build coherent extracts conveying the
gist sentence itself and extra sentences from the source text that complement it.
GistSumm comprises three main processes: text segmentation, sentence ranking,
and extract production. Sentence ranking is based on the keywords method [11]: it
scores each sentence of the source text by summing up the frequency of its words and
the gist sentence is chosen as the most highly scored one. Extract production focuses
on selecting other sentences from the source text to include in the extract, based on:
(a) gist correlation and (b) relevance to the overall content of the source text. Criterion (a) is fulfilled by simply verifying co-occurring words in the candidate sentences
and the gist sentence, ensuring lexical cohesion. Criterion (b) is fulfilled by sentences
whose score is above a threshold, computed as the average of all the sentence scores,
to guarantee that only relevant-to-content sentences are chosen. All the selected sentences above the cutoff are thus juxtaposed to compose the final extract.
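The sketch below follows our reading of this description, reusing a tokenizing helper like the `preprocess` function sketched earlier; thresholds and scoring details are simplified.

```python
from collections import Counter

def gistsumm(sentences, preprocess):
    tokenized = [preprocess(s) for s in sentences]
    freq = Counter(w for tokens in tokenized for w in tokens)
    scores = [sum(freq[w] for w in tokens) for tokens in tokenized]
    gist_idx = max(range(len(sentences)), key=lambda i: scores[i])   # gist sentence
    threshold = sum(scores) / len(scores)                            # average score
    gist_words = set(tokenized[gist_idx])
    extract = [
        s for i, (s, tokens) in enumerate(zip(sentences, tokenized))
        if i == gist_idx
        or (scores[i] > threshold and gist_words & set(tokens))      # relevance + gist correlation
    ]
    return extract, sentences[gist_idx]
```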
GistSumm has already undergone several evaluations, the main one being
DUC’2003 (Document Understanding Conference). According to this, Hypothesis I
above has been proved to hold. Methods other than the keywords one were also tried for sentence ranking; the keywords method outperformed all of them.
2.2 The TF-ISF-Summ System
TF-ISF-Summ is an automatic summarizer that makes use of the TF-ISF (Term-Frequency Inverse-Sentence-Frequency) metric to rank sentences in a given text and then extract the most relevant ones. Similarly to GistSumm, the approach used by this system also has three main steps: (1) text pre-processing, (2) sentence ranking, and (3) extract generation. Differently from GistSumm, in order to rank the sentences, it calculates
the mean TF-ISF of each sentence, as proposed in [9]: (1) each sentence is considered
as a fragment of the text; (2) given a sentence, the TF-ISF metric for each term (similar to the TF-IDF metric [25]) is calculated: TF is the frequency of the term in the
document and ISF is a function of the number of sentences in which the term appears;
(3) finally, the TF-ISF for the whole sentence is computed as the arithmetic mean of
all the TF-ISF values of its terms. Sentences with the highest mean-TF-ISF score and
above the cutoff are selected to compose the output extract.
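A minimal sketch of the mean-TF-ISF ranking is given below; the exact ISF formula (here, the logarithm of the number of sentences over the sentence frequency) is our assumption based on the analogy with TF-IDF [25].

```python
import math
from collections import Counter

def mean_tf_isf_scores(tokenized_sentences):
    n = len(tokenized_sentences)
    sentence_freq = Counter()                    # number of sentences containing each term
    for tokens in tokenized_sentences:
        sentence_freq.update(set(tokens))
    scores = []
    for tokens in tokenized_sentences:
        tf = Counter(tokens)
        values = [tf[t] * math.log(n / sentence_freq[t]) for t in tf]
        scores.append(sum(values) / len(values) if values else 0.0)
    return scores    # the top-scoring sentences, up to the compression rate, form the extract
```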
The method proved to be only as good as the random-sentence approach in the experiments carried out by Larocca Neto [8] for documents in English.
2.3 The NeuralSumm System
The NeuralSumm system makes use of a ML technique and runs four processes: (1)
text segmentation, (2) features extraction, (3) classification, and (4) extract production. It is primarily unsupervised, since it is based on a self-organizing map (SOM)
[6], which clusters information from the training texts. NeuralSumm produces two
clusters: one that represents the important sentences of the training texts (and, thus,
should be included in the extract) and another that represents the non-important sentences (and, thus, should be discarded). To our knowledge, it is the first time a SOM
has been used to help determine relevant sentences in AS.
During AS, after analyzing the source text, features extraction focuses on each sentence, in order to collect the following features: (i) sentence length, (ii) sentence position in the source text, (iii) sentence position in the paragraph it belongs to, (iv) presence of keywords in the sentence, (v) presence of gist words in the sentence, (vi)
sentence score by means of its word frequencies, (vii) sentence score by means of TF-ISF, and (viii) presence of indicative words in the sentence. It is worth noticing that
keywords are limited to the two most frequent words in the text, gist words are the
composing words of the gist sentence, and indicative words are genre-dependent and
could correspond to, e.g., ‘problem’, ‘solution’, ‘conclusion’, or ‘purpose’, in
scientific texts. Both feature (vi) and the gist sentence are calculated in the same way
as they are in GistSumm. The rationale behind incorporating these features in NeuralSumm may be found in [21]. Sentence classification is carried out by considering
every feature of each sentence, which is given as input to the SOM. This finally classifies the sentences as important or non-important, the important ones being selected
and juxtaposed to compose the final extract.
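The sketch below is only a stand-in for the SOM-based classification: the trained map is reduced to two prototype feature vectors (one per cluster), and each sentence is assigned to the nearest one; the actual system trains a self-organizing map as in [6].

```python
import numpy as np

def classify_sentences(feature_vectors, proto_important, proto_other):
    """Return the indices of the sentences classified as important."""
    extract_indices = []
    for i, v in enumerate(feature_vectors):
        v = np.asarray(v, dtype=float)
        # nearest-prototype decision stands in for finding the winning SOM unit
        if np.linalg.norm(v - proto_important) <= np.linalg.norm(v - proto_other):
            extract_indices.append(i)
    return extract_indices
```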
The NeuralSumm SOM has already been compared to other ML techniques. It proved to be better than Naïve Bayes, decision tree and decision rule methods, with an error decrease of ca. 10% relative to the worst case [21].
2.4 The ClassSumm System
The Classification System was proposed by Larocca Neto et al. [10] and uses a ML
approach to determine relevant segments to extract from source texts. Actually, it is
based on a Naïve Bayes classifier.
To summarize a source text, the system performs the same four processes as
NeuralSumm, as previously explained. Text pre-processing is similar to the one performed by TF-ISF-Summ. Features extracted from each sentence are of two kinds:
statistical, i.e., based on measures and counting taken directly from the text components, and linguistic, in which case they are extracted from a simplified argumentative structure of the text, produced by a hierarchical text agglomerative clustering
algorithm. A total of 16 features are associated with each sentence, namely: (a) mean-TF-ISF, (b) sentence length, (c) sentence position in the source text, (d) similarity to title, (e) similarity to keywords, (f) sentence-to-sentence cohesion, (g) sentence-to-centroid cohesion, (h) main concepts – the most frequent nouns that appear in the
text, (i) occurrence of proper nouns, (j) occurrence of anaphors, (k) occurrence of
non-essential information. Features (d), (e), (f) and (g) use the cosine measure to
calculate similarity; features (h) and (i) use the POS tagger; finally, features (j) and (k)
use fixed lists, as mentioned before. The remaining are linguistic features, based on
the binary tree that represents the argumentative structure of the text, where each leaf
is associated to a sentence and the internal nodes are associated to partial clusters of
sentences. These features are: (l) the depth of each sentence in the tree, and (m) four
features that represent specific directions in a given level of the tree (height 1,2,3,4)
that indicate, for each depth level, the direction taken by the path from the root to the
leaf associated with the sentence.
Extract generation is considered as a two-valued classification problem: sentences
should be classified as relevant-to-extract or not. According to the values of the features for each sentence, the classification algorithm must “learn” which ones must
belong to the summary. Finally, the sentences to include in the extract will be those
above the cutoff and, thus, those with the highest probabilities of belonging to it.
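A minimal sketch of this classification step is shown below, using scikit-learn's Gaussian Naïve Bayes as a stand-in for the classifier of [10]; feature extraction and the argumentative-tree features are assumed to be available as numeric vectors, and the number of selected sentences stands in for the compression-rate cutoff.

```python
from sklearn.naive_bayes import GaussianNB

def train_and_extract(train_X, train_y, test_X, test_sentences, n_select):
    # train_y: 1 if the sentence belongs to the reference extract, 0 otherwise
    clf = GaussianNB().fit(train_X, train_y)
    probabilities = clf.predict_proba(test_X)[:, 1]        # assumes labels {0, 1}
    ranked = sorted(range(len(test_sentences)), key=lambda i: probabilities[i], reverse=True)
    chosen = sorted(ranked[:n_select])                     # cutoff given by the compression rate
    return [test_sentences[i] for i in chosen]             # kept in original text order
```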
In the experiment reported in this article, the only unused feature was the keyword similarity, because the TeMário corpus does not provide a list of keywords.
Compared to the other systems, ClassSumm uses two extra lists: one with indicators
of main concepts and another with the commonest anaphors. Although there are no
such fixed lists for Brazilian Portuguese, we followed Larocca Neto's [8] suggestions, incorporating into the current version of the system the counterparts of the common English pronoun anaphors, such as ‘this’, ‘that’, ‘those’, etc.
ClassSumm was evaluated on a TIPSTER corpus of 100 news stories for training,
and two test procedures, namely, one that has used 100 automatic summaries and
another that has used 30 manual extracts [10], in which it outperformed the “from-top” baseline (sentences taken from the beginning of the source text) and the random-order baseline.
2.5 The SuPor System
Similarly to some of the above systems, SuPor also comprises two distinct processes: training and extracting, based on a Bayesian method, following [7]. Unlike them, it embeds a flexible way to combine linguistic and non-linguistic constraints for extract production. AS options include distinct suggestions originally aimed at texts in
English, which have been adapted to Brazilian Portuguese. To configure an AS strategy, SuPor must thus be customized by an expert user [17].
In SuPor, relevant features for classification are (a) sentence length (minimum of 5
words); (b) word frequency [11]; (c) signaling phrases; (d) sentence location in the texts; and (e) occurrence of proper nouns. As a result of training, a probabilistic distribution is produced, which enables extraction in SuPor. For this, only features (a), (b), (d) and (e) are used, along with lexical chaining [2]. Adaptations from the originals have been made for Portuguese, namely: (i) for lexical chaining computation, a
thesaurus [4] for Brazilian Portuguese is used; (ii) sentence location (10% of the first
and 5% of the last sentences of a source text are considered); (iii) proper nouns are
those that are not abbreviations, occur more than once in the source text and do not
appear at the beginning of a sentence; (iv) a minimum threshold has been set for the
selection of the most frequent words: each term of the source text is frequency-weighted, and the total weight of the text is produced; then the average weight, along with its standard deviation, is taken as the cutoff for frequent words.
SuPor works in the following way: firstly, the set of features of each sentence is extracted. Secondly, for each of these sets, the Bayesian classifier provides its probability, which enables the top-ranked sentences to be included in the output extract.
SuPor performance has been previously assessed through two distinct experiments
that also focused on newspaper articles and their ideal extracts, produced by the generator of ideal extracts already referred to. However, the test texts were unrelated to TeMário. Both experiments addressed the representativeness of distinct groupings of features. Overall, the most significant feature grouping included lexical chaining, sentence length and proper nouns (avg. F-measure = 40%).
3 Experiments and Results
We proceeded to a black-box evaluation, i.e., only comparing the systems' outputs. The main limitation imposed on the experiment was making it efficient: to compare the performance of the five systems, evaluation should be entirely automatic. As a result, only co-selection measures [23], more specifically P, R, and F-measure, were used. We could not compare the automatic extracts with the TeMário manual summaries either, because the latter are hand-built and do not allow for a viable automatic evaluation.
For this reason, the corresponding ideal extracts were used, as described in Section 1.
In relation to the systems that need training, to avoid bias, a 10-fold cross
validation has been used (each fold comprising 10 texts).
We also included in the evaluation two baseline methods: the one based just upon
the selection of from-top sentences and the other that chooses them at random (hereafter, From-top and Random order methods, respectively). Following the same approach, the extracts contain as many sentences as the cutoff allows in this case.
In the AS context, the metrics under focus here are defined as follows: (a) the compression rate is 30%; it has been chosen to conform to the sizes of both the manual summaries (length ranging from 25 to 30%) and the ideal extracts; (b) let N be the total number of sentences in the output extract, M be the total number of sentences in the ideal extract, and NR be the number of relevant sentences included in the output extract, i.e., the number of coinciding sentences between the output and its corresponding ideal extract; (c) precision and recall are defined by P = NR/N and R = NR/M, and the F-measure is the balance metric between P and R, F = 2*P*R/(P+R).
All the systems were independently run. Table 1 shows the averaged precision, recall and F-measure metrics of each system obtained in the experiments, with the last
column indicating the relative performance of each system as the percentage over the
random order baseline method, i.e. (F-measure/F-measure-random-baseline - 1).
Overall, the combination of features that led to SuPor's performance is [location, word frequency, length, proper nouns, lexical chaining]. SuPor's performance may well be due to the inclusion of lexical chaining, since this is its most distinctive feature. Meaningfully, training has also counted on signaling phrases, which have been considered only in SuPor. This, added to lexical chaining, may well be one of the reasons for SuPor's outperformance. Lexical chaining also has a close relationship to the innovative features added to ClassSumm, the second topmost system. Especially, SuPor focuses on the strongest lexical chains, whilst ClassSumm focuses on sentence-to-sentence and sentence-to-centroid cohesion.
Close performance between SuPor and ClassSumm can also be explained through
the relationship between the following feature combinations, respectively: [word frequency, signaling phrases] and [mean TF-ISF, indicator of main concepts, similarity to title]. This is justified by acknowledging that the mean TF-ISF is based on word frequency, and that main concepts and titles may signal phrases that lead to decision patterns.
Both topmost systems include features that have formerly been indicated as yielding good performance when individually taken (see the generalization of Edmundson's [5] paradigm in [13]): sentence location and cue phrases (i.e., the aforementioned signaling phrases). Additionally, both have been trained through a Bayesian classifier, with a considerable overlap of features. Keywords, which have been considered the poorest feature in Edmundson's model [5], have not been considered in either of them. In all, they substantially differ only in the anaphor and non-essential information features, although location, in ClassSumm, addresses the argumentative tree of a source text instead of its surface structure, as in SuPor.
TF-ISF-Summ, which performs worse than ClassSumm, coincides with it in the combination [word frequency, mean TF-ISF], for the same reasons given above. Although its performance is not substantially far from that of SuPor, its upper bound is a baseline. This may also suggest that what distinguishes SuPor is not word frequency, nor is it the mean TF-ISF measure in ClassSumm.
Not surprisingly, GistSumm's performance is farther from that of the other systems referred to, for it is based mainly upon word distribution, which has repeatedly been evidenced as a non-expressive feature. However, evidence provided by the DUC'2003 evaluation shows that GistSumm is effective in determining the gist sentence. In that evaluation, GistSumm scored 3.12 on a 0-4 scale for usefulness. This metric was presented to the DUC judges in the following way: their score of any given summary should indicate how useful the summary was in retrieving the corresponding source text (0 indicating no use at all and 4, totally useful, i.e., as good as having the full text instead). So, the problem must be in the extraction module instead. Although this system
achieved the best P, its R is the worst, even worse than the baselines. Recall could be
improved, for example, if gist words were spread over the whole source text, which
does not seem to be the case in newspaper texts, where the gist is usually in the lead
sentences.
Although NeuralSumm is based on a combination of most of the features embedded in SuPor and ClassSumm, its performance is much worse. This may be due to its SOM training, to the way training has been carried out (e.g., a non-significant corpus) or, ultimately, to the features themselves, which also include
word frequency.
The From-top method occupies, as expected, the
position in the F-measure
scale. Being composed of newspaper texts of varied domains, the test corpus has an
expressive feature: lead sentences usually are the most relevant ones. The distinction between this method and the two topmost systems may be due to the sophistication of combining distinctive features. Since most of these features coincide, except for the cohesive indicators,
lexical chaining (SuPor) and sentence-to-sentence or sentence-to-centroid cohesion
(ClassSumm) seem to be the key parameters for our outperforming systems.
It is important to notice that the described evaluation is not noise-free. The way ideal extracts are generated brings about a problem for our evaluation: since the generator relies on the cosine similarity measure, and this does not take into account the sentence size, there is no way to guarantee that the compression rate is uniformly observed. Actually, there are ideal extracts in our reference corpus that are considerably
longer than the extracts automatically generated. This poses an evaluation problem in
that the comparison between both penalizes recall, whilst increasing precision.
These results are relatively similar to the ones obtained in the literature for texts in
English, such as those of Teufel and Moens [26] (P=65% and R=44%), Kupiec et al. [7] (P=R=42%) and Saggion and Lapalme [24] (P=20% and R=23%). Although the
direct comparison between the results is not fair, due to different training, test corpora, and even language, it may indicate the general state of the art in extractive AS.
4 Final Remarks
Clearly, considering linguistic features and, thus, knowledge-based decisions indicates a way of improving extractive AS. It is also worth considering that the topmost evaluated systems are based on training, which means that, with more substantial training data, performance may be improved. Limitations usually addressed in the literature refer to the impossibility of, e.g., aggregating or generalizing information. The SuPor and ClassSumm evaluations suggest that, although those procedures remain nonexistent in extractive approaches, a way of surpassing those difficulties is still to address the semantic level through surface manipulation of text components. Another significant way of improving SuPor and ClassSumm is to make the input reference lists (e.g., stoplists and discourse markers) more expressive, by adding more terms to them. Also, substituting the language-dependent repositories that have currently been adapted (e.g., the thesaurus in SuPor), or building the argumentative tree in ClassSumm by other means, may improve performance, since that will likely tune the systems better to Brazilian Portuguese.
After all, the common evaluation presented here made it possible to compare different systems, fostering AS research especially concerning texts in Brazilian Portuguese and, more importantly, delineating future goals to pursue.
References
1. Aires, R.V.X., Aluísio, S.M., Kuhn, D.C.e.S., Andreeta, M.L.B., Oliveira Jr., O.N.: Combining classifiers to improve part of speech tagging: A case study for Brazilian Portuguese.
In: Open Discussion Track Proceedings of the 15th Brazilian Symposium on AI. (2000)
227–236
2. Barzilay, R., Elhadad, M.: Using Lexical Chains for Text Summarization. In: Advances in
Automatic Text Summarization. MIT Press (1999) 111–121
3. Caldas Jr., J., Imamura, C.Y.M., Rezende, S.O.: Evaluation of a stemming algorithm for the
Portuguese language (in Portuguese). In: Proceedings of the 2nd Congress of Logic Applied to Technology. Volume 2. (2001) 267–274
4. Dias-da Silva, B., Oliveira, M.F., Moraes, H.R., Paschoalino, C., Hasegawa, R., Amorin,
D., Nascimento, A.C.: The Building of an Electronic thesaurus for Brazilian Portuguese (in
Portuguese). In: Proceedings of the V Encontro para o Processamento Computacional da
Língua Portuguesa Escrita e Falada. (2000) 1–11
5. Edmundson, H.P.: New methods in automatic extracting. Journal of the Association for
Computing Machinery 16 (1969) 264–285
6. Kohonen, T.: Self organized formation of topologically correct feature maps. Biological
Cybernetics 43 (1982) 59–69
7. Kupiec, J., Pedersen, J., Chen, F.: A trainable document summarizer. In: Proc. of the 18th
ACM-SIGIR Conference on Research & Development in Information Retrieval. (1995)
68–73
8. Larocca Neto, J.: Contribution to the study of automatic text summarization techniques (in
Portuguese). Master’s thesis, Pontifícia Universidade Católica do Paraná (PUC-PR),
Graduate Program in Applied Computer Science (2002)
9. Larocca Neto, J., Santos, A.D., Kaestner, C.A.A., Freitas, A.A.: Document clustering and
text summarization. In: Proc. 4th Int. Conf. Practical Applications of Knowledge Discovery
and Data Mining. (2000) 41–55
10. Larocca Neto, J., Freitas, A.A., Kaestner, C.A.A.: Automatic text summarization using a
machine learning approach. In: XVI Brazilian Symp. on Artificial Intelligence. Number
2057 in Lecture Notes in Artificial Intelligence (2002) 205–215
11. Luhn, H.: The automatic creation of literature abstracts. IBM Journal of Research and Development 2 (1958) 159–165
12. Lyman, P., Varian, H.R.: How much information. Retrieved from
http://www.sims.berkeley.edu/how-much-info-2003 on [01/19/2004] (2003)
13. Mani, I.: Automatic Summarization. John Benjamin’s Publishing Company (2001)
14. Mani, I., Bloedorn, E.: Machine learning of generic and user-focused summarization. In:
Proc. of the 15th National Conf. on Artificial Intelligence (AAAI 98). (1998) 821–826
15. Mani, I., Maybury, M.T.: Advances in Automatic Text Summarization. MIT Press (1999)
16. Martins, R.T., Hasegawa, R., Nunes, M.G.V.: Curupira: a functional parser for Portuguese
(in Portuguese). NILC Tech. Report NILC-TR-02-26 (2002)
17. Módolo, M.: Supor: an environment for exploration of extractive methods for automatic
text summarization for portuguese (in Portuguese). Master’s thesis, Departamento de
Computação, UFSCar (2003)
18. Nunes, M.G.V., Vieira, F.M.V., Zavaglia, C., Sossolete, C.R.C., Hernandez, J.: The building of a Brazilian Portuguese lexicon for supporting automatic grammar checking (in Portuguese). ICMC-USP Tech. Report 42 (1996)
19. Pardo, T.A.S., Rino, L.H.M.: TeMário: A corpus for automatic text summarization (in Portuguese). NILC Tech. Report NILC-TR-03-09 (2003)
20. Pardo, T.A.S., Rino, L.H.M., Nunes, M.G.V.: GistSumm: A summarization tool based on a
new extractive method. In: 6th Workshop on Computational Processing of the Portuguese
Language – Written and Spoken. Number 2721 in Lecture Notes in Artificial Intelligence,
Springer (2003) 210–218
21. Pardo, T.A.S., Rino, L.H.M., Nunes, M.G.V.: NeuralSumm: A connexionist approach to
automatic text summarization (in Portuguese). In: Proceedings of the IV Encontro Nacional
de Inteligência Artificial. (2003)
22. Pardo, T.A.S., Rino, L.H.M., Nunes, M.G.V.: DiZer: An automatic discourse analysis proposal to brazilian portuguese (in Portuguese). In: Proc. of the I Workshop em Tecnologia
da Informação e da Linguagem Humana. (2003)
23. Radev, D.R., Teufel, S., Saggion, H., Lam, W., Blitzer, J., Qi, H., Çelebi, A., Liu, D., Drabek, E.: Evaluation challenges in large-scale document summarization. In: Proc. of the 41st
Annual Meeting of the Association for Computational Linguistics. (2003) 375–382
24. Saggion, H., Lapalme, G.: Generating indicative-informative summaries with sumUM.
Computational Linguistics 28 (2002) 497–526
25. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24 (1988) 513–523
26. Teufel, S., Moens, M.: Summarizing scientific articles: Experiments with relevance and
rhetorical status. Computational Linguistics 28 (2002) 409–445
Heuristically Accelerated Q–Learning: A New
Approach to Speed Up Reinforcement Learning
Reinaldo A.C. Bianchi1,2, Carlos H.C. Ribeiro3, and Anna H.R. Costa1
1
Laboratório de Técnicas Inteligentes
Escola Politécnica da Universidade de São Paulo
Av. Prof. Luciano Gualberto, trav. 3, 158. 05508-900, São Paulo, SP, Brazil
[email protected], [email protected]
2
Centro Universitário da FEI
Av. Humberto A. C. Branco, 3972. 09850-901, São Bernardo do Campo, SP, Brazil
3
Instituto Tecnológico de Aeronáutica
Praça Mal. Eduardo Gomes, 50. 12228-900, São José dos Campos, SP, Brazil
[email protected]
Abstract. This work presents a new algorithm, called Heuristically
Accelerated Q–Learning (HAQL), that allows the use of heuristics to
speed up the well-known Reinforcement Learning algorithm Q–learning.
A heuristic function that influences the choice of the actions characterizes the HAQL algorithm. The heuristic function is strongly associated
with the policy: it indicates that an action must be taken instead of another. This work also proposes an automatic method for the extraction
of the heuristic function Hₜ(sₜ, aₜ) from the learning process, called Heuristic from Exploration. Finally, experimental results show that even a very simple heuristic results in a significant enhancement of the performance of
the reinforcement learning algorithm.
Keywords: Reinforcement Learning, Cognitive Robotics
1 Introduction
The main problem approached in this paper is the speedup of Reinforcement Learning (RL), aiming at its use in mobile and autonomous robotic agents acting in complex environments. RL algorithms are notoriously slow to converge, making it difficult to use them in real-time applications. The goal of this work is to
it difficult to use them in real time applications. The goal of this work is to
propose an algorithm that preserves RL advantages, such as the convergence to
an optimal policy and the free choice of actions to be taken, minimizing its main
disadvantage: the learning time.
Being the most popular RL algorithm, and because of the large amount of data available in the literature for a comparative evaluation, the Q–learning algorithm [11] was chosen as the first algorithm to be extended by the use of heuristic acceleration. The resulting new algorithm is named the Heuristically Accelerated Q–Learning (HAQL) algorithm.
In order to describe this proposal in depth, this paper is organized as follows.
Section 2 describes the Q–learning algorithm. Section 3 describes the HAQL and
its formalization using a heuristic function Hₜ(sₜ, aₜ), and Section 4 describes the algorithm used to define the heuristic function, namely Heuristic from Exploration.
Section 5 describes the domain where this proposal has been evaluated and the
results obtained. Finally, Section 6 summarizes some important points learned
from this research and outlines future work.
2 Reinforcement Learning and the Q–Learning Algorithm
Consider an autonomous agent interacting with its environment via perception
and action. On each interaction step the agent senses the current state of the
environment, and chooses an action to perform. The action alters the state
of the environment, and a scalar reinforcement signal (a reward or penalty)
is provided to the agent to indicate the desirability of the resulting state.
The goal of the agent in a RL problem is to learn an action policy that
maximizes the expected long term sum of values of the reinforcement signal,
from any starting state. A policy π : S → A is some function that tells the agent
which actions should be chosen, under which circumstances [8]. This problem
can be formulated as a discrete time, finite state, finite action Markov Decision
Process (MDP), since problems with delayed reinforcement are well modeled
as MDPs. The learner's environment can be modeled (see [7, 9]) by a 4-tuple ⟨S, A, T, R⟩, where:
– S is a finite set of states;
– A is a finite set of actions that the agent can perform;
– T : S × A → Π(S) is a state transition function, where Π(S) is a probability distribution over S, and T(s, a, s') represents the probability of moving from state s to s' by performing action a;
– R : S × A → ℝ is a scalar reward function.
The task of a RL agent is to learn an optimal policy π* : S → A that maps the current state s into a desirable action (or actions) to be performed in s. In RL, the
policy should be learned through trial-and-error interactions of the agent with
its environment, that is, the RL learner must explicitly explore its environment.
The Q–learning algorithm was proposed by Watkins [11] as a strategy to
learn an optimal policy π* when the model (T and R) is not known in advance.
Let Q*(s, a) be the reward received upon performing action a in state s, plus the discounted value of following the optimal policy thereafter:

    Q*(s, a) = R(s, a) + γ V*(s'),                              (1)

where s' is the state reached by performing action a in state s. The optimal policy is π*(s) = arg max_a Q*(s, a). Rewriting Q*(s, a) in a recursive form:

    Q*(s, a) = R(s, a) + γ max_a' Q*(s', a').                   (2)

Let Q̂ be the learner's estimate of Q*(s, a). The Q–learning algorithm iteratively approximates Q̂, i.e., the Q̂ values will converge with probability 1 to Q*,
provided the system can be modeled as an MDP, the reward function is bounded, and actions are chosen so that every state-action pair is visited an infinite number of times. The Q–learning update rule is:

    Q̂(s, a) ← Q̂(s, a) + α [ r + γ max_a' Q̂(s', a') − Q̂(s, a) ],          (3)

where s is the current state, a is the action performed in s, r is the reward received, s' is the new state, γ is the discount factor (0 ≤ γ < 1), and α is the learning rate, given by α = 1 / (1 + visits(s, a)), where visits(s, a) is the total number of times this state-action pair has been visited up to and including the current iteration.
An interesting property of Q–learning is that, although the exploration-exploitation tradeoff must be addressed, the Q̂ values will converge to Q* independently of the exploration strategy employed (provided all state-action pairs
are visited often enough) [9].
3 The Heuristically Accelerated Q–Learning Algorithm
The Heuristically Accelerated Q–Learning algorithm can be defined as a way of solving the RL problem which makes explicit use of a heuristic function H : S × A → ℝ to influence the choice of actions during the learning process. Hₜ(sₜ, aₜ) defines the heuristic, which indicates the importance of performing the action aₜ when in state sₜ.
The heuristic function is strongly associated with the policy: every heuristic
indicates that an action must be taken regardless of the others. This way, it can be said that the heuristic function defines a “Heuristic Policy”, that is, a tentative
policy used to accelerate the learning process. It appears in the context of this
paper as a way to use the knowledge about the policy of an agent to accelerate
the learning process. This knowledge can be derived directly from the domain
(prior knowledge) or from existing clues in the learning process itself.
The heuristic function is used only in the action choice rule, which defines which action must be executed when the agent is in state sₜ. The action choice rule used in the HAQL is a modification of the standard ε-greedy rule used in Q–learning, but with the heuristic function included:

    π(sₜ) = arg max_a [ Q̂(sₜ, a) + ξ Hₜ(sₜ, a) ]   if q ≤ p,
    π(sₜ) = a_random                                otherwise,                 (4)

where:
– Hₜ(sₜ, aₜ) is the heuristic function, which influences the action choice. The subscript t indicates that it can be non-stationary.
– ξ is a real variable used to weight the influence of the heuristic function.
– q is a random value with uniform probability in [0, 1] and p (0 ≤ p ≤ 1) is the parameter which defines the exploration/exploitation trade-off: the greater the value of p, the smaller is the probability of a random choice.
– a_random is a random action selected among the possible actions in state sₜ.
As a general rule, the value of the heuristic Hₜ(sₜ, aₜ) used in the HAQL must be higher than the variation among the Q̂(sₜ, a) values for the same sₜ, so that it can influence the choice of actions, and it must be as low as possible in order to minimize the error. It can be defined as:

    Hₜ(sₜ, aₜ) = max_a Q̂(sₜ, a) − Q̂(sₜ, aₜ) + η   if aₜ = π^H(sₜ),
    Hₜ(sₜ, aₜ) = 0                                  otherwise,                 (5)

where η is a small real value and π^H(sₜ) is the action suggested by the heuristic. For instance, if the agent can execute 4 different actions when in state sₜ, the values of Q̂(sₜ, a) for these actions are [1.0 1.1 1.2 0.9], and the action that the heuristic suggests is the first one, then, for η = 0.01, the value to be used is Hₜ(sₜ, a₁) = 1.2 − 1.0 + 0.01 = 0.21 for the first action and zero for the other actions.
As the heuristic is used only in the choice of the action to be taken, the proposed algorithm differs from the original Q–learning only in the way exploration is carried out. Since the RL algorithm operation is not modified (i.e., updates of the function Q are done exactly as in Q–learning), this proposal allows many of the conclusions obtained for Q–learning to remain valid for HAQL.
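A minimal sketch of the HAQL loop components, following our reading of equations 3, 4 and 5, is given below; the constant learning rate and all parameter values are illustrative simplifications, not the settings used in the experiments.

```python
import random
from collections import defaultdict

class HAQL:
    def __init__(self, actions, alpha=0.1, gamma=0.9, p=0.9, xi=1.0, eta=0.01):
        self.Q = defaultdict(float)         # Q[(state, action)]
        self.H = defaultdict(float)         # heuristic H[(state, action)]
        self.actions = actions
        self.alpha, self.gamma, self.p, self.xi, self.eta = alpha, gamma, p, xi, eta

    def set_heuristic(self, state, suggested_action):
        # Equation 5: just enough to make the suggested action win the argmax.
        best_q = max(self.Q[(state, a)] for a in self.actions)
        for a in self.actions:
            self.H[(state, a)] = (best_q - self.Q[(state, a)] + self.eta
                                  if a == suggested_action else 0.0)

    def choose(self, state):
        # Equation 4: exploit Q + xi * H with probability p, otherwise explore.
        if random.random() <= self.p:
            return max(self.actions,
                       key=lambda a: self.Q[(state, a)] + self.xi * self.H[(state, a)])
        return random.choice(self.actions)

    def update(self, state, action, reward, next_state):
        # Standard Q-learning update (equation 3); the heuristic never enters here.
        best_next = max(self.Q[(next_state, a)] for a in self.actions)
        self.Q[(state, action)] += self.alpha * (
            reward + self.gamma * best_next - self.Q[(state, action)])
```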
Theorem 1. Consider a HAQL agent learning in a deterministic MDP, with finite sets of states and actions, bounded rewards, a discount factor γ such that 0 ≤ γ < 1, and where the values used on the heuristic function are bounded. For this agent, the Q̂ values will converge to Q* with probability one, uniformly over all states, if each state-action pair is visited infinitely often (i.e., it obeys the Q–learning infinite visitation condition).
Proof: In HAQL, the update of the value function approximation does not depend explicitly on the value of the heuristic. The necessary conditions for the convergence of Q–learning that could be affected by the use of the heuristic in HAQL are the ones that depend on the choice of the action. Of the conditions presented in [8, 9], the only one that depends on the action choice is the necessity of infinite visitation to each state-action pair. As equation 4 implements an ε-greedy exploration strategy, regardless of the fact that the value function is influenced by the heuristic Hₜ(sₜ, aₜ), the infinite visitation condition is guaranteed and the algorithm converges. q.e.d.
The condition of infinite visitation of each state-action pair can be considered valid in practice – in the same way that it is for Q–learning – also by using other visitation strategies:
– using a Boltzmann exploration strategy [7];
– intercalating steps where the algorithm makes alternate use of the heuristic and exploration steps;
– using the heuristic only during a period of time smaller than the total learning time for Q–learning.
The use of a heuristic function by HAQL exploits an important characteristic of some RL algorithms: the free choice of training actions. The consequence of this is that a suitable heuristic speeds up the learning process, and if
the heuristic is not suitable, the result is a delay which does not stop the system
from converging to an optimal value.
The idea of using heuristics with a learning algorithm has already been considered by other authors, as in the Ant Colony Optimization presented in [5, 2]. However, the possibilities of this use have not been properly explored yet. The complete HAQL algorithm is presented in Table 1. It can be noticed that the only difference from the Q–learning algorithm is the action choice rule and the existence of a step for updating the function Hₜ(sₜ, aₜ).
Although any function which works over real numbers and produces values
belonging to an ordered set may be used in equation 4, the use of addition is
particularly interesting because it allows an analysis of the influence of the values
of Hₜ(sₜ, aₜ) in a way similar to the one made in informed search algorithms (such as A* [6]).
Finally, the function Hₜ(sₜ, aₜ) can be derived by any method, but a good
one increases the speedup and generality of this algorithm. In the next section,
the method Heuristic from Exploration is presented.
4 The Method Heuristic from Exploration
One of the main questions addressed in this paper is how to find out, in an initial learning stage, the policy which must be used for learning speedup. For the HAQL algorithm, this question means how to define the heuristic function. The definition of an initial situation depends on the domain of the system application. For instance, in the domain of robotic navigation, we can extract a useful heuristic from the moment when the robot starts receiving environment reinforcements: after hitting a wall, use as heuristic the policy which leads the robot
away from it.
A method named Heuristic from Exploration is proposed in order to estimate
a policy-based heuristic. This method was inspired by [3], which proposed a
system that accelerates RL by composing an approximation of the value function,
adapting parts of previously learned solutions. The Heuristic from Exploration
is composed of two phases: the first one, which extracts information about the
structure of the environment through exploration and the second one, which
defines the heuristic for the policy, using the information extracted. These stages
were called Structure Extraction and Heuristic Backpropagation, respectively.
Structure Extraction iteratively estimates a map sketch, keeping track of
the result from all the actions executed by the agent. In the case of a mobile
robot, when the agent tries to move from one position to the next, the result of
the action is recorded. When an action does not result in a move, it indicates
the existence of an obstacle in the environment. Over time, this method generates a map sketch of the environment, represented as the possible actions
in each state.
From the map sketch of the environment, Heuristic Backpropagation composes the heuristic, described by a sub-optimal policy, by backpropagating the
possible actions over the map sketch. It propagates – from a final state – the
policies which lead to that state. For instance, the heuristic of the states immediately preceding a terminal state is defined by the actions that lead to the terminal state. In the following iterations, this heuristic is propagated to the predecessors of the states which already have a defined heuristic, and so on.
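For the grid-world case discussed in the next section, Heuristic Backpropagation reduces to a breadth-first sweep from the goal over the mapped free cells; a minimal sketch of that special case (our simplification) is shown below.

```python
from collections import deque

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def backpropagate_policy(known_free_cells, goal):
    """Map each mapped cell to the action that moves it one step closer to the goal."""
    policy = {}
    frontier = deque([goal])
    while frontier:
        cell = frontier.popleft()
        for action, (dx, dy) in MOVES.items():
            prev = (cell[0] - dx, cell[1] - dy)   # state from which `action` reaches `cell`
            if prev in known_free_cells and prev not in policy and prev != goal:
                policy[prev] = action             # heuristic: suggested action for `prev`
                frontier.append(prev)
    return policy
```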
Theorem 2. For a deterministic MDP whose model is known, the Heuristic
Backpropagation algorithm generates an optimal policy.
Proof Sketch: This algorithm is a simple application of the Dynamic Programming algorithm [1]. In the case where the environment is completely known, both of them work in the same way. In the case where only part of the environment is known, the backpropagation is done only for the known states. In the example
of robotic mapping, where the model of the environment is gradually built, the
backpropagation can be done only on the parts of the environment which are
already mapped.
Results for a complete implementation of this algorithm will be presented in
the next section.
5 Experiments in the Grid-World Domain
In these experiments, a grid-world agent that can move in four directions has to find a specific state, the target. The environment is discretized in a grid with N x
find a specific state, the target. The environment is discretized in a grid with N x
M positions the agent can occupy. The environment in which the agent moves can
have walls (figure 1), represented by states to which the agent cannot move. The
agent can execute four actions: move north, south, east or west. This domain,
called grid-world, is well-known and was studied by several researchers [3,4,7,
9]. Two experiments were done using HAQL with Heuristic from Exploration
in this domain: navigation with goal relocation and navigation in a new and
unknown environment.
Fig. 1. Room with walls (represented by dark lines) discretized in a grid of states.
The value of the heuristic used in HAQL is defined using equation 5. This value is computed only once, at the beginning of the acceleration. In all the following episodes, the value of the heuristic is kept fixed, allowing the learning to overcome bad indications. If the heuristic were recalculated at each episode, a bad heuristic would be difficult to overcome.
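The text above only states that the heuristic values are computed once and then kept fixed; the sketch below additionally assumes the usual HAQL-style action choice, in which the (fixed) heuristic is added to the learned Q-values during greedy selection, with an illustrative weighting factor xi and an epsilon-greedy wrapper.

```python
import random

def choose_action(Q, H, state, actions, xi=1.0, epsilon=0.1):
    """Pick an action greedily over Q(s, a) + xi * H(s, a).

    Q and H are dictionaries keyed by (state, action); H is filled once by the
    Heuristic from Exploration method and kept fixed afterwards."""
    if random.random() < epsilon:          # exploration step
        return random.choice(actions)
    return max(actions,
               key=lambda a: Q.get((state, a), 0.0) + xi * H.get((state, a), 0.0))
```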
For comparison, the same experiments were also executed using Q–learning. The parameters used in Q–learning and HAQL were the same: the same learning rate and discount factor, and an exploitation/exploration rate of 0.9. The rewards used were +10 when the agent arrives at the goal state and -1 when it executes any action. All the experiments presented were coded in C++ and executed on a Pentium III 500 MHz with 256 MB of RAM, running the Linux operating system.
The results presented in the next sub-sections show the average of 30 training sessions in nine different configurations of the navigation environment – a room with several walls – similar to the one in figure 1. The size of the environment is 55 × 55 positions and the goal is initially at the upper right corner. The agent always starts at a random position.
5.1 Goal Relocation During the Learning Process
In this experiment the robot must learn to reach the goal, which is initially located at the upper right corner (figure 1) and, after a certain time, is moved to the lower left corner of the grid.
The HAQL initially only extracts the structure of the problem (using the
structure extraction method described in section 4), behaving as the Q–learning.
At the end of a given episode, the goal is relocated. With this, both algorithms
must find the new position of the goal. As the algorithms are following the
policies learned until then, the performance worsens and both algorithms execute
a large number of steps to reach the new position of the goal.
When the robot controlled by the HAQL arrives at the new goal position (at the end of that episode), the heuristic to be used is constructed using the Heuristic Backpropagation (described in section 4) with information from the structure of the environment (which was not modified) and the new position of the goal, and the values of the heuristic function are defined. This heuristic is then used, resulting in a better performance in relation to Q–learning, as shown in figure 2.
Fig. 2. Result for the goal relocation experiment (log y).
It can be observed that the HAQL has a performance similar to Q–learning until the episode in which the goal is relocated. In this episode, the robot controlled by either algorithm takes more than 1 million steps to find the new position of the goal (since the known policy takes the robot to the wrong position).
After this episode, while Q–learning needs to learn the policy from scratch, the HAQL will always execute the minimum number of steps necessary to arrive at the goal. This happens because the heuristic function allows the HAQL to use the information about the environment it already possessed.
5.2 Learning a Policy in a New Environment
In the second experiment the robot must learn to reach the goal located at the upper right corner (figure 1) when inserted in an unknown environment, at a random position.
Again, the HAQL initially only extracts the structure of the problem, without making use of the heuristic, behaving as the Q–learning. At the end of the ninth episode, the heuristic to be used is constructed using the Heuristic Backpropagation with the information from the structure of the environment extracted during the first nine episodes, and the values of the heuristic function are defined. This heuristic is then used in all the following episodes.
The result (figure 3) shows that, while the Q–learning continues to learn the action policy, the HAQL converges to the optimal policy after the speed-up.
Fig. 3. Result for the acceleration at the end of the ninth episode (log y).
The ninth episode was chosen for the beginning of the acceleration because this allows the agent to explore the environment before using the heuristic.
As the robot starts every episode at a random position and the environment
is small, the Heuristic from Exploration method will probably define a good
heuristic.
Finally, Student’s t test [10] was used to verify the hypothesis that the use of heuristics speeds up the learning process. For both experiments described in this section – goal relocation and navigation in a new environment – the absolute value of T was calculated for each episode using the same data presented in figures 2 and 3. The results confirm that after the speed-up the algorithms are significantly different, with a level of confidence greater than 0.01%.
6 Conclusion and Future Work
This work presented a new algorithm, called Heuristically Accelerated Q–Learning (HAQL), which allows the use of heuristics to speed up the well-known Reinforcement Learning algorithm Q–learning.
The experimental results obtained using the automatic method for the extraction of the heuristic function from the learning process, called Heuristic from Exploration, showed that the HAQL attained better results than Q–learning in the domain of mobile robots.
Heuristics allow the use of RL algorithms to solve problems where convergence time is critical, as in real-time applications. This approach can also be incorporated into other well-known RL algorithms, such as SARSA, QS and Minimax-Q [8].
Among the actions that need to be taken for a better evaluation of this proposal, the most important ones are:
- Validate the HAQL by applying it to other domains, such as the “car on the hill” [3] and the “cart-pole” [4].
- During this study, several indications were found that there must be a large number of methods which can be used to extract the heuristic function. Therefore, the study of other methods for heuristic composition is needed.
References
1. D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models.
Prentice-Hall, Upper Saddle River, NJ, 1987.
2. E. Bonabeau, M. Dorigo, and G. Theraulaz. Inspiration for optimization from
social insect behaviour. Nature 406 [6791], 2000.
3. C. Drummond. Accelerating reinforcement learning by composing solutions of automatically identified subtasks. Journal of Artificial Intelligence Research, 16:59–
104, 2002.
4. D. Foster and P. Dayan. Structure in the space of value functions. Machine Learning, 49(2/3):325–346, 2002.
5. L. Gambardella and M. Dorigo. Ant–Q: A reinforcement learning approach to
the traveling salesman problem. Proceedings of the ML-95 – Twelfth International
Conference on Machine Learning, pages 252–260, 1995.
6. P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and
Cybernetics, 4(2): 100–107, 1968.
7. L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A
survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
8. M. L. Littman and C. Szepesvári. A generalized reinforcement learning model:
Convergence and applications. In Procs. of the Thirteenth International Conf. on
Machine Learning (ICML’96), pages 310–318, 1996.
9. T. Mitchell. Machine Learning. McGraw Hill, New York, 1997.
10. U. Nehmzow. Mobile Robotics: A Practical Introduction. Springer-Verlag, Berlin,
Heidelberg, 2000.
11. C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, University of
Cambridge, 1989.
Using Concept Hierarchies
in Knowledge Discovery
Marco Eugênio Madeira Di Beneditto1 and Leliane Nunes de Barros2
1
Centro de Análises de Sistemas Navais - CASNAV
Pr. Barão de Ladário s/n - Ilha das Cobras - Ed 8 do AMRJ, 3° andar
Centro – 20091-000, Rio de Janeiro, RJ, Brasil
[email protected]
2
Instituto de Matemática e Estatística da Universidade de São Paulo - IME–USP
Rua do Matão, 1010, Cidade Universitária – 05508-090, São Paulo, SP, Brasil
[email protected]
Abstract. In Data Mining, one of the steps of the Knowledge Discovery in Databases (KDD) process, the use of concept hierarchies as background knowledge allows the discovered knowledge to be expressed at a higher abstraction level, in a more concise and usually more interesting format. However, data mining for high-level concepts is more complex because the search space is generally too big. Some data mining systems require the database to be pre-generalized to reduce the space, which makes it difficult to discover knowledge at arbitrary levels of abstraction. To efficiently induce high-level rules at different levels of generality, without pre-generalizing databases, fast access to concept hierarchies and fast query evaluation methods are needed.
This work presents the NETUNO-HC system, which performs induction of classification rules using concept hierarchies for the attribute values of a relational database, without pre-generalizing them or even using another tool to represent the hierarchies. It is shown how the abstraction level of the discovered rules can be affected by the adopted search strategy and by the relevance measures considered during the data mining step. Moreover, it is demonstrated by a series of experiments that the NETUNO-HC system is efficient in the data mining process, due to the implementation of the following techniques: (i) a SQL primitive to efficiently execute database queries using hierarchies; (ii) the construction and encoding of numerical hierarchies; (iii) the use of the Beam Search strategy; and (iv) the indexing and encoding of rules in a hash table in order to avoid mining already discovered rules.
Keywords: Knowledge Discovery, Data Mining, Machine Learning
1 Introduction
This paper describes a KDD (Knowledge Discovery in Databases) system named NETUNO-HC [1], which uses concept hierarchies to discover knowledge at a higher abstraction level than that existing in the database (DB). The search for
this kind of knowledge requires the construction of SQL queries to a Database
Management System (DBMS), considering that the attribute values belong to a
concept hierarchy, not directly represented in the DB.
We argue that this kind of task can be achieved by providing fast access to concept hierarchies and fast query evaluation through: (i) an efficient search strategy, and (ii) the use of a SQL primitive to allow fast evaluation of high-level hypotheses. Unlike in [2], the system proposed in this paper does not require the DB to be pre-generalized. Without pre-generalizing databases, fast access to concept hierarchies and fast query evaluation methods are needed to efficiently induce high-level rules at different levels of generality. Finally, the proposed representation of hierarchies, followed by the use of SQL primitives, makes NETUNO-HC independent of other inference systems [3].
2 Concept Hierarchies
A concept hierarchy can be defined as a partially ordered set. Given two concepts C1 and C2 belonging to a partial order relation R, i.e., (C1, C2) ∈ R (described as C1 precedes C2), we say that concept C1 is more specific than concept C2, or that C2 is more general than C1. Usually, the partial order relation in a concept hierarchy represents the specialization-generalization relationship between concepts, also called the subset-superset relation. So, a concept hierarchy is defined as:
Definition: A Concept Hierarchy is a partially ordered set (HC, R), where HC is a finite set of concepts and R is a partial order relation on HC.
A tree is a special type of concept hierarchy, where each concept precedes only one concept and the notion of a greatest concept exists, i.e., a concept that does not precede any other. The tree root is the most general concept, called ANY, and the leaves are the attribute values in the DB, that is, the lowest abstraction level of the hierarchy. In this work, we will use concept hierarchies that can be represented as a tree.
2.1 Representing Hierarchies
The use of concept hierarchies during data mining to generate and evaluate hypotheses is computationally more demanding than the creation of generalized tables. The representation of a hierarchy in memory using a tree data structure gives some speed and efficiency when traversing it. Nevertheless, the number of queries necessary to verify the relationship between concepts in a hierarchy can be too high. Our approach to decrease this complexity is to encode each hierarchy concept in such a way that the code itself indicates the partial order relation between the concepts. Thus the relation verification is made by only checking the codes.
The concept encoding algorithm we propose is based on a post-fixed order traversal of the hierarchy, with complexity O(n), where n is the number of concepts in the hierarchy. The verification of the relationship between two concepts is performed by shifting one of the codes – in this case, the bigger one.
Fig. 1. Two concept codes where the code 18731 represents a concept that is a descendant of the concept with code 18
Figure 1 shows two concept codes where the code 18731 represents a concept that is a
descendant of the concept with code 18. Since the difference between the lengths of the codes corresponds to ten bits, the bigger code has to be shifted to the right by this number of bits, and if this new value is equal to the smaller code, then the concepts belong to the relation, i.e., the concept with the smaller code is an ascendant of the concept with the bigger code.
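The shift-based test can be written in a few lines. The sketch below only illustrates the verification step with the 18 / 18731 example from Figure 1; how the codes are assigned during the post-fixed traversal is not reproduced here.

```python
def is_ascendant(code_a, code_b):
    """True when the concept with code_a is an ascendant of the concept with code_b.

    The bigger code is shifted right by the difference between the code lengths
    (in bits); the concepts are related when the result equals the smaller code."""
    shift = code_b.bit_length() - code_a.bit_length()
    return shift > 0 and (code_b >> shift) == code_a

assert is_ascendant(18, 18731)      # 18731 >> 10 == 18
assert not is_ascendant(19, 18731)  # shifting never yields 19
```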
In the NETUNO-HC the hierarchies are stored in relational tables in the
DB and loaded before the data mining step. More than one hierarchy for each attribute can be stored, leaving to the user the possibility of choosing one.
2.2 Generating Numerical Hierarchies
In this work, we assume that a concept hierarchy related to categorical data is background knowledge given by an expert in the field. However, for numerical or continuous attributes, the hierarchies can be automatically generated (from the DB) and stored in relational tables before the data mining step. There are many ways to do this and any choice will affect the results of the data mining.
In the NETUNO-HC we propose an algorithm to generate a numerical hierarchy considering the class distribution. This algorithm is based on the InfoMerge algorithm [4], used for discretization of continuous attributes. The idea underlying the InfoMerge algorithm is to group values into the intervals which cause the smallest information loss (a dual operation to the information gain in the C4.5 learning algorithm [5]).
In the NETUNO-HC, the same idea is applied to the bottom-up generation of a numerical concept hierarchy, where the nodes of the hierarchy represent numerical intervals, closed on the left. After the leaf-level intervals are generated, they are merged into bigger intervals until the root is reached, which corresponds to an interval that includes all the existing values in the DB.
3 The NETUNO-HC Algorithm
The search space is organized in a general-to-specific ordering of hypotheses, beginning with the empty hypothesis. A hypothesis is transformed (the node expansion search operation) by specialization operations, i.e., by the addition of an attribute or by doing a hierarchy specialization, to generate more specific hypotheses. A hypothesis can be considered discovered knowledge if it satisfies the relevance measures. The node expansion operation is made in two steps. First, an attribute is added to a hypothesis. Second, using the SQL query, the algorithm checks, in a top-down fashion, which values in the hierarchy of the attribute satisfy the relevance measures.
The search strategy employed by the NETUNO-HC is Beam Search. For
each level of the search space, which corresponds to hypotheses with the same
number of attribute-value pairs, the algorithm selects only a fixed number of
them. This number corresponds to the beam width, i.e., the number of hypotheses
that will be specialized.
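A level-wise beam search of this kind can be sketched as follows. The helper names (specialize, score, satisfies_measures) and the way rules are separated from hypotheses are illustrative assumptions, not the NETUNO-HC code.

```python
def beam_search(empty_hypothesis, specialize, score, satisfies_measures,
                beam_width, max_levels):
    """Expand, level by level, only the beam_width best hypotheses.

    specialize(h) yields the more specific hypotheses derived from h (adding an
    attribute or descending one level in a hierarchy); score(h) is the selection
    criterion; satisfies_measures(h) tests the relevance measures."""
    beam = [empty_hypothesis]
    rules = []
    for _ in range(max_levels):
        candidates = []
        for hypothesis in beam:
            for child in specialize(hypothesis):
                if satisfies_measures(child):
                    rules.append(child)          # stored as discovered knowledge
                else:
                    candidates.append(child)     # may still be specialized further
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]           # the beam width limits the expansion
    return rules
```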
3.1 NETUNO-HC Knowledge Description Language
The power of a symbolic algorithm for data mining resides in the expressiveness of the knowledge description language used. The language specifies what the algorithm is capable of discovering or learning. NETUNO-HC uses a propositional-like language, extending the attribute values with concept hierarchies in order to achieve higher expressiveness.
Rules induced by NETUNO-HC take the form IF < A > THEN < class >, where < A > is a conjunction of one or more attribute-value pairs. An attribute-value pair is a condition between an attribute and a value from the concept hierarchy. For categorical attributes this condition is an equality (e.g., spore_print_color = dark), and for continuous attributes this condition is an interval inclusion (closed on the left) or an equality.
3.2 Specializing Hypotheses
In the progressive specialization, or top-down approach, the data mining algorithm generates hypotheses that have to be specialized. The specialization of a hypothesis generates a new hypothesis that covers a number of tuples less than or equal to the number covered by the original one. Specialization can be realized either by adding an attribute or by replacing the value of an attribute with any of its descendants according to a concept hierarchy. In NETUNO-HC, both forms of hypothesis specialization are considered.
If a hypothesis does not satisfy the relevance measures then it has to be specialized. After the addition of an attribute, the algorithm has to check which of the values form valid hypotheses, i.e., hypotheses that satisfy the relevance measures. With the use of hierarchies, the values have to be checked in a top-down way, i.e., from the most general concept to the most specific.
3.3 Rules Subsumption
The NETUNO-HC avoids the generation of two rules R1 and R2 in which R1 is subsumed by R2. This occurs when:
1. the rules have the same size and, for each attribute-value pair of R1, there exists a pair in R2 with the same attribute whose value is equal to or more general than the value in R1;
2. the rules have different sizes, R2 is the smaller rule, and for each attribute-value pair of R2 there exists a pair in R1 with the same attribute whose value is equal to or more specific than the value in R2.
This kind of verification is done in two different phases. The first phase is done when the data mining algorithm checks for an attribute value in the hierarchy. If the value generates a rule, the descendant values that could also generate rules for the same class are not stored as valid rules, even though they satisfy the relevance measures. Second, if a discovered rule subsumes other rules previously discovered, these last ones are deleted from the list of discovered rules. Conversely, if a discovered rule is subsumed by one or more previously discovered rules, this rule is not added to the list.
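Using the concept encoding of Section 2.1, the subsumption test and the maintenance of the discovered-rule list can be sketched as below. Rules are represented here as dictionaries mapping attributes to concept codes, and is_ascendant is assumed to be a directional ancestor test such as the one sketched earlier; both choices are illustrative.

```python
def subsumes(general, specific, is_ascendant):
    """True if every condition of the (more general) rule is matched by a condition
    of the other rule on the same attribute, with an equal or more specific value."""
    for attribute, value in general.items():
        other = specific.get(attribute)
        if other is None:
            return False
        if other != value and not is_ascendant(value, other):
            return False
    return True

def add_discovered_rule(rules, new_rule, is_ascendant):
    """Keep the list of discovered rules (of one class) free of subsumed rules."""
    if any(subsumes(old, new_rule, is_ascendant) for old in rules):
        return rules                                  # new rule is already subsumed
    kept = [old for old in rules if not subsumes(new_rule, old, is_ascendant)]
    kept.append(new_rule)
    return kept
```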
3.4 Relevance Measures and Selection Criteria
In the NETUNO-HC system, the rule hypotheses are evaluated by two conditions: completeness and consistency. Let P denote the total number of positive examples of a given class in the training data. Let R be a rule hypothesis intended to cover tuples of that class, and let p and n be the number of positive and negative tuples covered by R, respectively.
The completeness is measured by the ratio p/P, which is called support in this work (also known in the literature as positive coverage). The consistency is measured by the ratio p/(p + n), which is called confidence in this work (also known as training accuracy).
NETUNO-HC calculates the support and confidence values using the SQL primitive described in Section 4.
The criterion for the selection of the best hypotheses to be expanded is based on the product support × confidence. The hypotheses in the open-list are stored in decreasing order according to that product, and only the best hypotheses (as many as the beam width) are selected.
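The two measures and the selection criterion translate directly into code; the sketch below is a simplified illustration in which the counts p, n and P are assumed to have been obtained from the SQL primitive of Section 4.

```python
def relevance(p, n, P):
    """Support (p/P) and confidence (p/(p+n)) of a rule hypothesis for one class."""
    support = p / P if P else 0.0
    confidence = p / (p + n) if (p + n) else 0.0
    return support, confidence

def select_for_expansion(scored_hypotheses, beam_width):
    """Keep the open-list ordered by support * confidence and cut it at the beam width.

    scored_hypotheses: list of (support, confidence, hypothesis) tuples."""
    ordered = sorted(scored_hypotheses,
                     key=lambda t: t[0] * t[1], reverse=True)
    return [h for _, _, h in ordered[:beam_width]]
```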
3.5 Interpretation of the Induced Rules
The induced rules can be interpreted as classification rules. Thus, to use the induced rules to classify new examples, NETUNO-HC employs an interpretation in which all rules are tried and only those that cover the example are collected. If a collision occurs (i.e., the example belongs to more than one class) the decision is to classify the example in the class given by the rule with the greatest value for the product support × confidence. If an example is not covered by any rule, then the number of non-classified examples is incremented (as a measure of quality for the set of discovered rules). Section 5.3 shows the result of applying a default rule in this case.
4 SQL Primitive for Evaluation of High Level Hypotheses
In [6] a generic KDD primitive in SQL was proposed which underlies the candidate rule evaluation procedure. This primitive consists of counting the number of tuples in each partition formed by a SQL group by statement. The primitive has three input parameters: a tuple-set descriptor, a candidate attribute, and
the class attribute. The output is a v × c matrix, where v is the number of different values of the new (candidate) attribute and c is the number of different values of the class attribute.
In order to use this primitive and the output matrix for the evaluation of high-level hypotheses (i.e., building a SQL primitive considering a concept hierarchy), some extensions were made to the original proposal [6]. In the primitive, the tuple-set descriptor has to be expressed by values in the DB, i.e., the leaf concepts in the hierarchy. So, for each high-level value the descriptor has to be expressed by the leaf values that precede it. This is done by NETUNO-HC, during the data mining, using the hierarchy when building the SQL primitive.
An example of the use of the extended SQL primitive is shown in Figure 2. Let {black, brown} precede dark, where {black, brown} are leaf concepts in a color domain hierarchy. If the antecedent of a hypothesis has the attribute-value pair spore_print_color = dark, this has to be expressed in the tuple-set descriptor by leaf values, i.e., spore_print_color = brown OR spore_print_color = black. Figure 2 shows the output matrix, where the lines are leaf concepts of the hierarchy. Adding the lines whose concepts are leaves and precede a high-level concept is equivalent to having a high-level line, which can be used to evaluate the high-level hypotheses (see Figure 2).
A condition between an attribute and its value may also be an inequality. In this case, e.g. spore_print_color <> dark, the tuple-set descriptor will be translated to spore_print_color <> brown AND spore_print_color <> black. To calculate the relevance measures for this condition, the same matrix can be used: the line for this condition is the difference between the Total line and the line that corresponds to the attribute value.
Fig. 2. The lines of the matrix represent the leaf concepts of the hierarchy
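The construction of the extended primitive can be sketched as SQL text generated from the hierarchy. The table and column names (mushroom, spore_print_color, class) and the helper leaves_of are hypothetical; only the overall shape of the group-by counting query follows the description above.

```python
def leaf_descriptor(attribute, concept, leaves_of):
    """Expand a (possibly high-level) concept into a disjunction over its leaf values."""
    leaves = leaves_of(concept)                   # e.g. 'dark' -> ['brown', 'black']
    return "(" + " OR ".join(f"{attribute} = '{v}'" for v in leaves) + ")"

def counting_primitive(table, descriptor_terms, candidate_attr, class_attr):
    """One group-by query returning the (candidate value x class value) counting matrix."""
    where = " AND ".join(descriptor_terms) if descriptor_terms else "1 = 1"
    return (f"SELECT {candidate_attr}, {class_attr}, COUNT(*) "
            f"FROM {table} WHERE {where} "
            f"GROUP BY {candidate_attr}, {class_attr}")

# Example: hypotheses whose antecedent contains spore_print_color = dark.
# term = leaf_descriptor('spore_print_color', 'dark', leaves_of)
# sql = counting_primitive('mushroom', [term], 'odor', 'class')
# The lines of leaf values that precede a high-level concept are then added up.
```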
5 Experiments
In order to evaluate the NETUNO-HC algorithm we used two DBs from the UCI repository: Mushroom and Adult. First, we tested how the size of the search space changes when performing data mining with and without the use of concept hierarchies. This was done using a simplified implemented version of the NETUNO-HC algorithm that uses a complete search method.
and without the use of concept hierarchies, with respect to the following aspects: efficiency of DB access, concept hierarchy access and rules subsumption verification; results on the accuracy of the discovered rule set; the capability of discovering high-level rules; and finally, the semantic evaluation of high-level rules.
5.1 The Size of the Search Space
We have first analyzed how the use of concept hierarchies in data mining can
affect the size of the search space considering a complete search method, such
as Breadth-First Search.
Figure 3 shows, as expected, that the search space for high-level rules increases with the size of the concept hierarchies considered in the data mining process.
Fig. 3. Breadth-First Search algorithm execution on the Mushroom DB with and without hierarchies, with sup = 20% and conf = 90%. The graph shows the open-list size (list of candidate rules or rule hypotheses) versus the number of open-list removals (number of hypothesis specializations)
We can also see in Figure 3 that pruning techniques, based on relevance measures and rules subsumption, can eventually make the list of open nodes (open-list) empty, i.e., end the search task. This occurs for the Mushroom DB after 15000 hypothesis specializations in data mining WITHOUT concept hierarchies, and after 59000 hypothesis specializations in data mining WITH concept hierarchies.
Another observation we can make from Figure 3 is that the size of the open-list is approximately four times bigger when using concept hierarchy evaluation for the Mushroom DB. Therefore, it is important to improve the performance of the hypotheses evaluation through efficient DB access, concept hierarchy access and rules subsumption verification.
5.2 Efficiency in High Level SQL Primitive and Hypotheses Generation
In order to evaluate the use of the high-level SQL primitive, a version of ParDRI [3] was implemented. In ParDRI, the high-level queries are made in a different way: it uses the direct descendants of the hierarchy root. So, if the root concept has three descendants, the system will issue one query for each concept, i.e., three queries, while with the SQL primitive only one query is necessary.
For the Mushroom DB, without the SQL primitive, the implemented ParDRI algorithm generated 117 queries and discovered 26 rules. By using the primitive, the same algorithm issued only 70 queries and discovered the same 26 rules, showing a reduction of 40% in the number of queries.
To evaluate the time spent on hypotheses generation, the following times were measured during the executions: (a) the time spent on DB queries, and (b) the time spent by the data mining algorithm.
The ratio between the difference of these two times and the time spent by the data mining algorithm is the percentage spent on the generation and evaluation of the hypotheses. This value is 1.87%, showing that the execution time is dominated by queries issued to the DBMS. Therefore, the use of the high-level SQL primitive, combined with efficient techniques for encoding and evaluation of hypotheses in the NETUNO-HC, makes it a more efficient algorithm for high-level data mining than ParDRI [3].
5.3 Accuracy
In Table 1, the accuracy results of the NETUNO-HC with and without hierarchies are compared with two other algorithms, C4.5 [5] and CN2 [7], which did not use concept hierarchies.
In order to compare similar classification schemes, the NETUNO-HC results were obtained using a default class (the majority class in this case) to label examples not covered, as in the two other algorithms. For the other experiments, the default class was not used.
The next experiments show the results obtained through ten-fold stratified cross validation. Table 2 shows the accuracy of the discovered rule set. For both DBs we can observe that, by decreasing the minimum support value, the accuracy tends to increase (in both situations: with or without hierarchies). This happens because some tuples are covered by rules with small coverage, and these rules can only be discovered by defining a small support.
As expected, the use of hierarchies does not directly affect the accuracy of the discovered rules. That can be explained as follows. On one hand, a more general concept has greater inconsistency, which decreases the accuracy. On the other hand, with high support values an increase in the minimum confidence value tends to increase the accuracy. In this case, the high-level concept can cover more examples (i.e., decreasing the number of non-covered examples, as can be seen in Table 3), and the number of non-classified examples is very small (considering a small beam width).
Intuitively, we might think that a larger beam width would discover a rule set with better accuracy, since the search would become closer to a complete search. However, in the Mushroom DB with hierarchies, an increase in the beam width did not result in better accuracy, as can be seen in Table 3.
5.4 High Level Rules and Semantic Evaluation
The most important result we have to guarantee in this work, besides efficiency, is the discovery of high-level rules at different levels of generality, without a prior choice of the abstraction level, which is the deficiency of other systems that use concept hierarchies only to pre-generalize the database, such as [2]. In the NETUNO-HC system we found out that changes in the relevance measures affect the discovered rule set: with a minimum confidence value of 90%, in the
two DBs it can be seen that high minimum support values tend to lead to more high-level rules in the discovered rule set (see Table 4).
The use of hierarchies introduces more general concepts and can reduce the
discovered rule set. In fact, for the Mushroom DB, with support=20%, confidence=98% and beam width = 256, 66 rules were discovered without hierarchies
against 58 rules discovered with hierarchies and the accuracy was 0.9596 and
0.9845, respectively.
For the Adult DB, with support=4%, confidence=98% and beam width =
256, 30 rules were discovered without hierarchies against 27 rules discovered with
hierarchies and the accuracy was 0.7229 and 0.7235, respectively.
As can be seen, the discovered rule set is more concise and, sometimes, more
accurate.
A more concise concept description can be explained because more general concepts can cause low-level rules to be subsumed by high-level ones. For example, in the Mushroom DB, given the high-level concept BAD ({CREOSOTE, FOUL, MUSTY, FISHY, PUNGENT} precede BAD), the rule odor = BAD -> POISONOUS is discovered. This rule is more general than the following two rules, odor = CREOSOTE -> POISONOUS and odor = FOUL -> POISONOUS, discovered without the use of hierarchies:
odor = BAD -> POISONOUS - Supp: 0.822 Conf: 1.0
odor = CREOSOTE -> POISONOUS - Supp: 0.048 Conf: 1.0
odor = FOUL -> POISONOUS - Supp: 0.549 Conf: 1.0
6 Conclusions
The use of concept hierarchies in data mining results in a trade-off between the discovery of more interesting rules (expressed at a higher abstraction level) and, sometimes, a more concise concept description, versus a higher computational cost. In this work, we presented the NETUNO-HC algorithm and its implementation, proposing ways to solve the efficiency problems of data mining with concept hierarchies, namely: the use of the Beam Search strategy, the encoding and evaluation techniques for the concept hierarchies, and the high-level SQL primitive.
The main contribution of this work is to specify a high-level SQL primitive as an efficient way to analyze rules considering concept hierarchies, and an encoding method that reduces the impact of the hierarchy size during the generation and evaluation of the hypotheses. This made feasible the discovery of high-level rules without pre-generalizing the DB.
We also performed some experiments to show how the mining parameters affect the discovered rule set:
Variation of the Minimum Support Value. On one hand, a decrease in the minimum support value tends to increase the accuracy, with or without hierarchies, also increasing the rule set size. On the other hand, a high minimum support value tends to lead to a more interesting rule set, i.e., a set with more high-level rules.
Variation of the Minimum Confidence Value. The effect of this kind of variation depends on the DB domain. For the databases analyzed, a higher confidence value did not always result in a higher accuracy.
Alterations of the Beam Width. A higher beam width tends to increase the accuracy. However, depending on the DB domain, a better accuracy can be obtained with a lower beam width, with or without hierarchies. The hierarchy also affects the discovered rule set: a higher accuracy can be obtained with a lower beam width.
References
1. Beneditto, M.E.M.D.: Descoberta de regras de classificação com hierarquias conceituais. Master’s thesis, Instituto de Matemática e Estatística, Universidade de
São Paulo, Brasil (2004)
2. Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y.,
Rajan, A., Stefanovic, N., Xia, B., Zaiane, O.R.: DBMiner: A system for mining
knowledge in large relational databases. In Simoudis, E., Han, J.W., Fayyad, U.,
eds.: Proceedings of the Second International Conference on Knowledge Discovery
and Data Mining (KDD-96), AAAI Press (1996) 250–263
3. Taylor, M.G.: Finding High Level Discriminant Rules in Parallel. PhD thesis, Faculty of the Graduate School of the University of Maryland, College Park, USA
(1999)
4. Freitas, A., Lavington, S.: Speeding up knowledge discovery in large relational
databases by means of a new discretization algorithm. In: Proc. 14th British Nat.
Conf. on Databases (BNCOD-14), Edinburgh, Scotland (1996) 124–133
5. Quinlan, J.R.: C4.5: Programs for machine learning. 1 edn. Morgan Kaufmann
(1993)
6. Freitas, A., Lavington, S.: Using SQL primitives and parallel DB servers to speed
up knowledge discovery in large relational databases. In Trappl., R., ed.: Cybernetics and Systems’96: Proc. 13th European Meeting on Cybernetics and Systems
Research, Viena, Austria (1996) 955–960
7. Clark, P., Niblett, T.: The CN2 induction algorithm. Machine Learning 3 (1989)
261–283
A Clustering Method for Symbolic Interval-Type
Data Using Adaptive Chebyshev Distances
Francisco de A.T. de Carvalho, Renata M.C.R. de Souza, and Fabio C.D. Silva
Centro de Informatica - CIn / UFPE, Av. Prof. Luiz Freire, s/n
Cidade Universitaria, CEP: 50740-540, Recife-PE, Brasil
{fatc,rmcrs}@cin.ufpe.br
Abstract. This work presents a partitioning method for clustering symbolic interval-type data using a dynamic cluster algorithm with adaptive
Chebyshev distances. This method furnishes a partition and a prototype for each cluster by optimizing an adequacy criterion that measures
the fitting between the clusters and their representatives. To compare
interval-type data, the method uses an adaptive Chebyshev distance that
changes for each cluster according to its intra-class structure at each iteration of the algorithm. Experiments with real and artificial interval-type
data sets demonstrate the usefulness of the proposed method.
1 Introduction
Recently, clustering has become a subject of great interest, mainly due to the explosive growth in the use of databases and the huge volume of data stored in
them. Due to this growth, interval data is now widely used in real applications.
Symbolic Data Analysis (SDA) [2] is a new domain in the area of knowledge
discovery and data management. It is related to multivariate analysis, pattern
recognition and artificial intelligence and seeks to provide suitable methods (clustering, factorial techniques, decision tree, etc.) for managing aggregated data
described by multi-valued variables, where data table cells contain sets of categories, intervals, or weight (probability) distributions (for more details on SDA,
see www.jsda.unina2.it).
Concerning partitioning clustering methods, SDA has provided suitable tools
for clustering symbolic interval-type data. Verde et al [10] introduced a dynamic
cluster algorithm for interval-type data considering context dependent proximity
functions. Chavent and Lechevalier [3] proposed a dynamic cluster algorithm for
interval-type data using an adequacy criterion based on the Hausdorff distance.
Souza and De Carvalho [9] presented dynamic cluster algorithms for interval-type data based on adaptive and non-adaptive City-Block distances.
The main contribution of this paper is to introduce a partitioning clustering
method for interval-type data using the dynamic cluster algorithm with adaptive
Chebyshev distances. The standard dynamic cluster algorithm [5] is a two-step
relocation algorithm involving the construction of clusters and the identification of a representation or prototype of each cluster by locally minimizing an
adequacy criterion between the clusters and their representatives. The adaptive
version of this algorithm [4] uses a separate distance to compare each cluster
with its representation. The advantage of these adaptive distances lies in the
fact that the clustering algorithm is able to find clusters of different shapes and
sizes for a given set of objects.
In this paper, we present a dynamic cluster method with adaptive Chebyshev
distances for partitioning a set of symbolic interval-type data. This method is
an extension of the use of adaptive distances of a dynamic cluster algorithm
proposed in [3]. In section 2, a dynamic cluster with an adaptive Chebyshev
distance for interval-type data is presented. In order to validate this new method,
section 3 presents experiments with real and artificial symbolic interval-type
data sets. Section 4 shows an evaluation of the clustering results based on the
computation of an external cluster validity index ([7]) in the framework of the
Monte Carlo experiment. In section 5, the concluding remarks are given.
2 Adaptive Dynamic Cluster
Let E = {x_1, ..., x_n} be a set of symbolic objects described by p interval variables. Each object x_i is represented as a vector of intervals x_i = (x_i^1, ..., x_i^p), where x_i^j = [a_i^j, b_i^j]. Let P be a partition of E into K clusters C_1, ..., C_K, where each cluster C_k has a prototype g_k that is also represented as a vector of intervals g_k = (g_k^1, ..., g_k^p), with g_k^j = [α_k^j, β_k^j].
According to the standard adaptive dynamic cluster algorithm [4], at each iteration there is a different distance associated with each cluster, i.e., the distance is not determined once and for all, and is different from one class to another.
Our algorithm searches for a partition P = (C_1, ..., C_K) of E in K classes, the corresponding set of K class prototypes L = (g_1, ..., g_K) and a set of K different distances (d_1, ..., d_K) associated with the clusters, by locally minimizing an adequacy criterion, which is usually stated as
   W(P, L) = Σ_{k=1..K} Σ_{x_i ∈ C_k} d_k(x_i, g_k),    (1)
where d_k(x_i, g_k) is an adaptive dissimilarity measure between an object x_i and the class prototype g_k of C_k.
2.1 Adaptive Distances Between Two Vectors of Intervals
In [4] an adaptive distance is defined according to the structure of a cluster and is parameterized by a vector of coefficients λ_k = (λ_k^1, ..., λ_k^p), with λ_k^j > 0 and λ_k^1 × ... × λ_k^p = 1. In this paper, we define the adaptive Chebyshev distance between the two vectors of intervals x_i and g_k as
   d_k(x_i, g_k) = Σ_{j=1..p} λ_k^j φ(x_i^j, g_k^j),    (2)
where
   φ(x_i^j, g_k^j) = max(|a_i^j − α_k^j|, |b_i^j − β_k^j|)    (3)
is the maximum between the absolute values of the differences among the lower bounds and the upper bounds of the intervals x_i^j and g_k^j.
The concept behind the distance function in equation (3) is to represent an interval [a, b] as a point (a, b), where the lower bounds of the intervals are represented on the x-axis and the upper bounds on the y-axis, and then compute the L∞ (Chebyshev) distance between the points (a_i^j, b_i^j) and (α_k^j, β_k^j). Therefore, the distance function in equation (2) is a weighted version of the L∞ (Chebyshev) metric for interval-type data.
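A small sketch of the reconstructed distance of equations (2) and (3): intervals are pairs (lower, upper), a cluster's weight vector multiplies the per-variable Chebyshev terms, and the terms are combined additively over the variables, which is consistent with the additive criterion used in Section 2.2. The concrete numbers in the comment are only an illustration.

```python
def interval_chebyshev(x_j, g_j):
    """Equation (3): Chebyshev distance between (lower, upper) representations."""
    (a, b), (alpha, beta) = x_j, g_j
    return max(abs(a - alpha), abs(b - beta))

def adaptive_distance(x, g, weights):
    """Equation (2): weighted combination of the per-variable Chebyshev terms."""
    return sum(w * interval_chebyshev(x_j, g_j)
               for x_j, g_j, w in zip(x, g, weights))

# x = [(1.0, 3.0), (2.0, 5.0)], g = [(0.5, 2.5), (3.0, 6.0)], weights = [2.0, 0.5]
# adaptive_distance(x, g, weights) == 2.0 * 0.5 + 0.5 * 1.0 == 1.5
```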
2.2 The Optimization Problem
The optimization problem is stated as follows: find the class prototype g_k of the class C_k and the adaptive Chebyshev distance associated with C_k that minimize the adequacy criterion measuring the dissimilarity between this class prototype and the class C_k. The optimization problem has two stages:
a) The class C_k and the distance d_k are fixed. We look for the vector of intervals of the prototype g_k of the class C_k which locally minimizes the criterion. The criterion being additive, the problem becomes finding, for each variable, the interval g_k^j = [α_k^j, β_k^j] that minimizes the corresponding term of the criterion.
Proposition 1. This problem has an analytical solution, which is α_k^j = m_k^j − h_k^j and β_k^j = m_k^j + h_k^j, where m_k^j is the median of the midpoints of the intervals x_i^j of the objects belonging to the cluster C_k and h_k^j is the median of their half-lengths.
The proof of Proposition 1 can be found in [3].
b) The class C_k and the prototype g_k are fixed. We look for the vector of weights λ_k = (λ_k^1, ..., λ_k^p), with λ_k^j > 0 and λ_k^1 × ... × λ_k^p = 1, that minimizes the criterion.
Proposition 2. The coefficients λ_k^j that minimize the criterion are:
   λ_k^j = { Π_{h=1..p} [ Σ_{x_i ∈ C_k} φ(x_i^h, g_k^h) ] }^{1/p} / Σ_{x_i ∈ C_k} φ(x_i^j, g_k^j).
The proof of proposition 2 is based on the Lagrange multipliers method and
can be found in [6].
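The two representation sub-steps can be sketched as below, reusing interval_chebyshev from the previous sketch. The prototype follows Proposition 1 (medians of midpoints and half-lengths); the weight update follows the closed form stated in Proposition 2 and assumes every per-variable dispersion is non-zero (the degenerate case is handled by the re-start remark in Section 2.3).

```python
import statistics

def prototype(cluster):
    """Proposition 1: per variable, median of midpoints minus/plus median of half-lengths."""
    p = len(cluster[0])
    proto = []
    for j in range(p):
        mids = [(a + b) / 2 for a, b in (x[j] for x in cluster)]
        halves = [(b - a) / 2 for a, b in (x[j] for x in cluster)]
        m, h = statistics.median(mids), statistics.median(halves)
        proto.append((m - h, m + h))
    return proto

def weights(cluster, proto):
    """Proposition 2: adaptive weights whose product over the variables equals one."""
    p = len(proto)
    dispersion = [sum(interval_chebyshev(x[j], proto[j]) for x in cluster)
                  for j in range(p)]
    product = 1.0
    for d in dispersion:
        product *= d
    return [(product ** (1.0 / p)) / d for d in dispersion]
```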
2.3 The Adaptive Dynamic Cluster Algorithm
The adaptive dynamic cluster algorithm performs a representation step where
the class prototypes and the adaptive distances are updated. This is followed
by an allocation step in order to assign the individuals to the classes, until the
convergence of the algorithm, when the adequacy criterion reaches a stationary
value.
If a single quantitative value is considered as an interval where the lower and upper bounds are the same (i.e., when only usual data are present), this symbolic-oriented algorithm corresponds to the standard numerical one with the adaptive distances introduced by Diday and Govaert [4].
The algorithm schema is the following:
1. Initialization
Construct the initial partition: choose a partition of E randomly, or choose K distinct objects of E as initial prototypes and assign each object to the cluster of its closest prototype.
2. Representation step
a) (The partition P and the set of distances are fixed.) For k = 1 to K, compute the vector of intervals which represents the prototype g_k: for each variable j, the interval [m_k^j − h_k^j, m_k^j + h_k^j], where m_k^j is the median of the midpoints of the intervals of the objects belonging to the cluster C_k and h_k^j is the median of their half-lengths.
b) (The partition P and the set of prototypes L are fixed.) For k = 1 to K and j = 1 to p, compute the weights λ_k^j according to Proposition 2.
3. Allocation step
Set test ← 0. For i = 1 to n, assign x_i to the cluster whose prototype is the closest according to the corresponding adaptive distance; if this changes the cluster of x_i, set test ← 1.
4. Stopping criterion
If test = 0 then STOP, otherwise go to (2).
Remark: In sub-step 2.b) (computation of the weights), if the sum of distances is equal to zero for at least one variable, stop the current iteration and re-start a new one (go to step 1).
3 Experiments
To show the usefulness of these methods, experiments with two artificial interval-type data sets with different degrees of clustering difficulty (clusters of different
shapes and sizes, linearly non-separable clusters, etc) are considered in this section, along with a fish interval-type data set.
3.1 Artificial Symbolic Data Sets
Initially, we considered two standard quantitative data sets in R^2. Each data set has 450 points scattered among four clusters of unequal sizes and shapes: two clusters with ellipse shapes of size 150 each and two clusters with spherical shapes of sizes 50 and 100. The data points of each cluster in each data set were drawn according to a bi-variate normal distribution with non-correlated components.
Data set 1 (Fig. 1), showing well-separated clusters, is generated according to a distinct set of bi-variate normal parameters (mean vector and variances) for each of the four classes.
Fig. 1. Data set 1 showing well-separated classes
Data set 2 (Fig. 2), showing overlapping clusters, is likewise generated according to a distinct set of bi-variate normal parameters for each of the four classes.
Each data point of data sets 1 and 2 is a seed of a vector of intervals (a rectangle): each interval is built around the corresponding coordinate of the seed point, with a width randomly selected from the same predefined interval. The intervals considered in this paper are: [1, 8], [1, 16], [1, 24], [1, 32], and [1, 40]. Figure 3 shows artificial interval-type data set 1 (obtained from data set 1) with well-separated clusters and Figure 4 shows artificial interval-type data set 2 (obtained from data set 2) with overlapping clusters.
Fig. 2. Data set 2 showing overlapping classes
Fig. 3. Interval-type data set 1 showing well-separated classes
Fig. 4. Interval-type data set 2 showing overlapping classes
3.2 Eco-toxicology Data Set
A number of studies carried out in French Guyana demonstrated abnormal levels
of mercury contamination in some Amerindian populations. This contamination
has been connected to their high consumption of contaminated freshwater fish
[1]. In order to obtain better knowledge on this phenomenon, a data set was
collected by researchers from the LEESA (Laboratoire d’Ecophysi- ologie et
d’Ecotoxicologie des Systèmes Aquatiques) laboratory. This data set concerns
12 fish species, each species being described by 13 interval variables and 1 categorical variable. These species are grouped into four a priori clusters of unequal sizes according to the categorical variable: two clusters (Carnivorous and Detritivorous) of size 4 and two clusters (Omnivorous and Herbivorous) of size 2.
Table 1 shows part of the fish data set.
4 Evaluation of Clustering Results
In order to compare the adaptive dynamic cluster algorithm proposed in the
present paper with the non-adaptive version of this algorithm, this section
presents the clustering results furnished by these methods according to artificial interval-type data sets 1 and 2 and the fish data set (see section 3).
The non-adaptive dynamic cluster algorithm uses a suitable extension of the L∞ (Chebyshev) metric to compare the vectors of intervals x_i and g_k, namely d(x_i, g_k) = Σ_{j=1..p} φ(x_i^j, g_k^j), where φ is given by equation (3).
The evaluation of the clustering results is based on the corrected Rand (CR)
index [7]. The CR index assesses the degree of agreement (similarity) between
an a priori partition (i.e., the partition defined by the seed points of data sets 1
and 2) and a partition furnished by the clustering algorithm. We used the CR
index because it is neither sensitive to the number of classes in the partitions
nor to the distributions of the items in the clusters [8].
For the artificial data sets, the CR index is estimated in the framework of a Monte Carlo experiment with 100 replications for each interval-type data set, as well as for each predefined interval from which the rectangle widths are selected. For each replication a clustering method is run 50 times and the best result according to the corresponding adequacy criterion is selected. The average of the corrected Rand (CR) index over these 100 replications is calculated.
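A minimal sketch of this evaluation, using scikit-learn's adjusted_rand_score as the corrected Rand index; selecting the best of the 50 runs by the adequacy criterion is assumed to have been done beforehand.

```python
from sklearn.metrics import adjusted_rand_score

def average_cr(replications):
    """Average corrected Rand index over Monte Carlo replications.

    replications: iterable of (a_priori_labels, best_run_labels) pairs, where the
    best run is the one, among the 50, with the lowest adequacy criterion."""
    scores = [adjusted_rand_score(true, found) for true, found in replications]
    return sum(scores) / len(scores)
```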
Table 2 shows the values of the average CR index according to adaptive and
non-adaptive methods, as well as artificial interval-type data sets 1 and 2. From
these results it can be seen that the average CR indices for the adaptive method
are greater than those for the non-adaptive method.
The comparison between the proposed clustering methods is achieved by a paired Student’s t-test at a significance level of 5%. Table 3 shows the suitable (null and alternative) hypotheses and the observed values of the test statistics, which follow a Student’s t distribution with 99 degrees of freedom. In this table, the compared quantities are, respectively, the average of the CR index for the non-adaptive and for the adaptive method. From these results, we can reject the hypothesis that the average performance (measured by the CR index) of the adaptive method is inferior or equal to that of the non-adaptive method.
Concerning the fish interval-type data set, Table 4 shows the clusters (individual labels) given by the a priori partition according to the categorical variable,
as well as the clusters obtained by the non-adaptive and adaptive methods.
The CR indices obtained from the comparison between the a priori partition
and the partitions given by the adaptive and non-adaptive methods (see Table
4) are, respectively, 0.49 and -0.02. Therefore, the performance of the adaptive
method is superior to the non-adaptive method for this data set also.
5 Concluding Remarks
In this paper, a clustering method for interval-type data using a dynamic cluster algorithm with adaptive Chebyshev distances was presented. The algorithm locally optimizes an adequacy criterion that measures the fitting between the classes and their representatives (prototypes). To compare classes and prototypes, adaptive distances based on a weighted version of the Chebyshev metric for interval data are introduced.
With this method, the prototype of each class is represented by a vector of intervals where, for each variable, the lower bound is the median of the midpoints of the intervals of the objects belonging to the class minus the median of their half-lengths, and the upper bound is the median of the midpoints plus the median of the half-lengths.
Experiments with real and artificial symbolic interval-type data sets showed
the usefulness of this clustering method. The accuracy of the results furnished
by the adaptive clustering method is assessed by the CR index and compared
with results furnished by the non-adaptive version of this method. Concerning
the artificial symbolic interval-type data sets, the CR index is calculated in the
framework of a Monte Carlo experiment with 100 replications. Statistical tests support the evidence that this index for the adaptive method is superior to that of the non-adaptive method. Regarding the fish interval-type data set, it is also
observed that the adaptive method outperforms the non-adaptive method.
Acknowledgments
The authors would like to thank CNPq (Brazilian Agency) for its financial support.
References
1. Bobou, A. and Ribeyre, F. Mercury in the food web: accumulation and transfer
mechanisms, in Sigrel A. and Sigrel H. Eds., Metal Ions in Biological Systems. M.
Dekker, New York, (1988) 289–319
2. Bock, H.H. and Diday, E.: Analysis of Symbolic Data: Exploratory Methods for
Extracting Statistical Information from Complex Data. Springer, Berlin Heidelberg
(2000)
3. Chavent, M. and Lechevallier, Y.: Dynamical Clustering Algorithm of Interval
Data: Optimization of an Adequacy Criterion Based on Hausdorff Distance. In:
Jajuga, K., Sokolowski, A. and Bock, H.H. (eds) Classification, Clustering and Data Analysis (IFCS 2002). Springer, Berlin et al. (2002) 53–59
4. Diday, E. and Govaert, G.: Classification Automatique avec Distances Adaptatives.
R.A.I.R.O. Informatique Computer Science, 11 (4) (1977) 329–349
5. Diday, E. and Simon, J.C.: Clustering analysis. In: K.S. Fu (ed) Digital Pattern
Classification. Springer, Berlin et al, (1976) 47–94
6. Govaert, G.: Classification automatique et distances adaptatives. Thèse de 3ème
cycle, Mathématique appliquée, Université Paris VI (1975)
7. Hubert, L. and Arabie, P.: Comparing Partitions. Journal of Classification, 2 (1985)
193–218
8. Milligan, G. W.:Clustering Validation: results and implications for applied analysis
In: Arabie, P., Hubert, L. J. and De Soete, G. (eds) Clustering and Classification,
World Scientific, Singapore, (1996) 341–375
9. Souza, R.M.C.R. and De Carvalho, F. A. T.: Clustering of interval data based on
city-block distances. Pattern Recognition Letters, 25 (3) (2004) 353–365
10. Verde, R., De Carvalho, F.A.T. and Lechevallier, Y.: A dynamical clustering algorithm for symbolic data. In: Diday, E., Lechevallier, Y. (eds) Tutorial on Symbolic
Data Analysis (Gfkl2001), (2001) 59–72
An Efficient Clustering Method
for High-Dimensional Data Mining
Jae-Woo Chang and Yong-Ki Kim
Dept. of Computer Engineering
Research Center for Advanced LBS Technology
Chonbuk National University, Chonju, Chonbuk 561-756, South Korea
{jwchang,ykkim}@dblab.chonbuk.ac.kr
Abstract. Most clustering methods for data mining applications do not work efficiently when dealing with large, high-dimensional data. This is caused by the so-called ‘curse of dimensionality’ and the limitation of available memory. In this
paper, we propose an efficient clustering method for handling of large amounts
of high-dimensional data. Our clustering method provides both an efficient cell
creation and a cell insertion algorithm. To achieve good retrieval performance
on clusters, we also propose a filtering-based index structure using an approximation technique. We compare the performance of our clustering method with
the CLIQUE method. The experimental results show that our clustering method
achieves better performance on cluster construction time and retrieval time.
1 Introduction
Data mining is concerned with extraction of information of interest from large
amounts of data, i.e. rules, regularities, patterns, constraints. Data mining is a data
analysis technique that has been developed from other research areas such as Machine Learning, Statistics, and Artificial Intelligence. However, data mining has three
differences from the conventional analysis techniques. First, while the existing techniques are mostly applied to a static dataset, data mining is applied to a dynamic dataset with continuous insertions and deletions. Next, the existing techniques manage
only errorless data, but data mining can manage data containing some errors. Finally,
unlike the conventional techniques, data mining generally deals with large amounts of
data.
The typical research topics in data mining are classification, clustering, association rules, and trend analysis. Among them, one of the most important topics is clustering. The conventional clustering methods have a critical drawback: they are not suitable for handling large data sets containing millions of data units, because the data set is restricted to be resident in main memory. They do not work well for clustering
high-dimensional data because their retrieval performance is generally degraded as
the number of dimensions increases. In this paper, we propose an efficient clustering
method for dealing with a large amount of high-dimensional data. Our clustering
method provides an efficient cell creation algorithm, which makes cells by splitting
each dimension into a set of partitions using a split index. It also provides a cell insertion algorithm to construct clusters of cells with more density than a given threshold
as well as to insert the clusters into an index structure. By using an approximation
technique, we also propose a new filtering-based index structure to achieve good
retrieval performance on clusters.
The rest of this paper is organized as follows. The next section discusses related
work on clustering methods. In Section 3, we propose an efficient clustering method
to make cells and insert them into our index structure. In Section 4, we analyze the performance of our clustering method. Finally, we draw our conclusion in Section 5.
2 Related Work
Clustering is the process of grouping data into classes or clusters, in such a way that
objects within a cluster have high similarity to one another, but are very dissimilar to
objects in other clusters [1]. In data mining applications, there have been several
existing clustering methods, such as CLARA(Clustering LARge Applications) [2],
CLARANS(Clustering Large Applications based on RANdomized Search) [3],
BIRCH(Balanced Iterative Reducing and Clustering using Hierarchies) [4],
DBSCAN(Density Based Spatial Clustering of Applications with Noise) [5],
STING(STatistical INformation Grid) [6], and CLIQUE(CLustering In QUEst) [7].
In this section, we discuss a couple of the existing clustering methods appropriate for
high dimensional data. We also examine their potential for clustering of large
amounts of high dimensional data.
The first method is STING(STatistical INformation Grid) [6]. It is a method which
relies on a hierarchical division of the data space into rectangular cells. Each cell is
recursively partitioned into smaller cells. STING can be used to answer efficiently
different kinds of region-oriented queries. The algorithm for answering such queries
first determines all bottom-level cells relevant to the query, and constructs regions of
those cells using statistical information. Then, the algorithm goes down the hierarchy
by one level. However, when the number of bottom-level cells is very large, both the
quality of cell approximations of clusters and the runtime for finding them deteriorate.
The second method is CLIQUE(CLustering In QUEst) [7]. It was proposed for
high-dimensional data as a density-based clustering method. CLIQUE automatically
finds subspaces(grids) with high-density clusters. CLIQUE produces identical results
irrespective of the order in which input records are presented, and it does not presume
any canonical distribution of input data. Input parameters are the size of the grid and
a global density threshold for clusters. CLIQUE scales linearly with the number of
input records, and it scales well as the number of dimensions in the data increases.
3 An Efficient Clustering Method
Since the conventional clustering methods assume that a data set is resident in main
memory, they are not efficient in handling large amounts of data. As the dimensionality of the data increases, the number of cells increases exponentially, causing a
dramatic performance degradation. To remedy that effect, we propose an efficient
clustering method for handling large amounts of high-dimensional data. Our clustering method uses a cell creation algorithm which makes cells by splitting each dimension into a set of partitions using a split index. It also uses a cell insertion algorithm,
which constructs clusters of cells with density higher than a given threshold, and stores the constructed clusters into the index structure. For fast retrieval, we propose a filtering-based index structure by applying an approximation technique to our clustering method. Figure 1 shows the overall architecture of our clustering method.
Fig. 1. Overall architecture of our clustering method.
3.1 Cell Creation Algorithm
Our cell creation algorithm makes cells by splitting each dimension into a group of sections using a split index. A density-based split index is used for creating split sections and is efficient for splitting multi-group data. Our cell creation algorithm first finds the optimal split section by repeatedly examining values between the maximum and the minimum in each dimension. That is, it searches for the optimal value while the difference between the maximum and the minimum is greater than one and the value of the split index after splitting is greater than the previous value. The split index value
is calculated by Eq. (1) before splitting and Eq. (2) after splitting.
Using Eq. (1), we can determine the split index value for a data set S in three steps: i) divide S into C classes, ii) compute the square of the relative density of each class, and iii) subtract the sum of these squared densities from one. Using Eq. (2), we compute the split index value for S after S is divided into two subsets, S1 and S2. If the split index value after splitting is larger than the value before splitting, we actually divide S into S1 and S2; otherwise, we stop splitting. Secondly, our cell creation algorithm creates the cells defined by the optimal split sections for the n-dimensional data.
As a result, our cell creation algorithm creates fewer cells than the existing clustering
methods using equivalent intervals. Figure 2 shows our cell creation algorithm. Here,
the subprogram called ‘Partition’ partitions the input data sets according to their attributes. The subprogram is omitted because it can easily be constructed by slightly modifying the procedure ‘Make_Cell’.
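The exact forms of Eq. (1) and Eq. (2) are not reproduced here, but the three-step description above matches a Gini-style impurity computed from relative class densities. The following Python sketch illustrates that reading for a single dimension; the function names, the gain-based comparison, and the use of interval upper limits as candidate cut points are illustrative assumptions rather than the authors' exact formulation.

from collections import Counter

def split_index(labels):
    # Eq. (1)-style index: one minus the sum of squared relative class densities.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels, candidate_cuts):
    # Examine each candidate cut point in one dimension and keep the one whose
    # split most improves the index (Eq. (2)-style weighted index of the two parts).
    before = split_index(labels)
    best = None
    for cut in candidate_cuts:
        left = [lab for v, lab in zip(values, labels) if v < cut]
        right = [lab for v, lab in zip(values, labels) if v >= cut]
        if not left or not right:
            continue
        after = (len(left) * split_index(left) +
                 len(right) * split_index(right)) / len(labels)
        gain = before - after
        if best is None or gain > best[1]:
            best = (cut, gain)
    return best  # (cut point, gain); split only if the gain is positive

Applied repeatedly in each dimension, such a routine yields the split sections from which the cells are then formed.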
In Figure 3, we show an example of our cell creation algorithm for twenty records with two classes in two-dimensional data. The split index value for S before splitting is computed using Eq. (1). A bold line represents a split of the twenty records on the X-axis. First, we calculate the split index values for all ten intervals. Secondly, we choose the interval with the maximum value among them. Finally, we regard the upper limit of that interval as the split axis; for example, the split index values for the interval between 0.3 and 0.4 and for the interval between 0.4 and 0.5 are computed using Eq. (2).
Fig. 2. Cell creation algorithm.
Fig. 3. Example of cell creation algorithm.
We determine the upper limit of the interval (=0.5) as the split axis, because the split index value after splitting is greater than the value before splitting. Thus, the X axis can
be divided into two sections: the first one from 0 to 0.5 and the second one from 0.5 to 1.0. If a data set has n dimensions and the number of initial split sections in each dimension is m, the conventional cell creation algorithms make m^n cells, but our cell creation algorithm makes only as many cells as the product of the numbers of split sections actually obtained in each dimension, which is generally far fewer.
3.2 Cell Insertion Algorithm
Using our cell creation algorithm, we obtain the cells created from the input data set.
Figure 4 shows the insertion algorithm used to store the created cells. First, we construct clusters of cells with density higher than a given cell threshold and store them into a cluster information file. In addition, we store all the sections with density higher than a given section threshold into an approximation information file.
Fig. 4. Cell insertion algorithm.
The insertion algorithm to store the data is as follows. First, we calculate the frequency of each section in all dimensions. Secondly, in the approximation information file, we set to ‘1’ the bits corresponding to sections whose frequencies are greater than a given section threshold, and we set the bits of the remaining sections to ‘0’. Thirdly, we calculate the frequency of data in each cell. Finally, we store the cell id and cell frequency into the cluster information file for cells whose frequency is greater than a given cell threshold. The cell threshold and the section threshold are defined in Eq. (3).
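A minimal sketch of this insertion step is given below, with the approximation information and cluster information files represented by in-memory structures; the variable and function names are hypothetical, and the thresholds of Eq. (3) are simply passed in as parameters.

from collections import Counter

def build_structures(records, sections_per_dim, cell_threshold, section_threshold):
    # `records` holds, for each data item, the tuple of section indices it falls
    # into (one index per dimension), as produced by the cell creation algorithm.
    n_dims = len(sections_per_dim)

    # Per-dimension section frequencies -> approximation information (one bit per section).
    section_freq = [Counter(rec[d] for rec in records) for d in range(n_dims)]
    approximation = [
        [1 if section_freq[d][s] > section_threshold else 0
         for s in range(sections_per_dim[d])]
        for d in range(n_dims)
    ]

    # Cell frequencies -> cluster information (only cells denser than the cell threshold).
    cell_freq = Counter(records)
    cluster_info = {cell: freq for cell, freq in cell_freq.items()
                    if freq > cell_threshold}
    return approximation, cluster_info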
3.3 Filtering-Based Index Scheme
In order to reduce the number of I/O accesses to the cluster information file, we construct a new filtering-based index scheme using the approximation information file. Figure 5 shows this two-level filtering-based index scheme, which contains both the approximation information file and the cluster information file.
Fig. 5. Two-level filtering-based index scheme.
Let us assume that K clusters are created by our cell-based clustering method and that the numbers of split sections in the X axis and Y axis are m and n, respectively. Eq. (4) gives the retrieval time (C) with and without the use of the approximation information file, in terms of the average filtering ratio of the approximation information file, the number of dimensions of the input data (D), the number of records per page (P), and the average number of records in each dimension (R). When the approximation information file is used, the retrieval time decreases as the filtering ratio decreases. For high-dimensional data, our two-level index scheme using the approximation information file is efficient because the K value increases exponentially with the dimension D.
i) Retrieval time without the use of an approximation information file
ii) Retrieval time with the use of an approximation information file
When a query is entered, we first obtain the sections to be examined in all the dimensions. If all the bits corresponding to those sections in the approximation information file are set to ‘1’, we calculate a cell number and obtain its cell frequency by accessing the cluster information file. Otherwise, the query can be discarded without accessing the cluster information file, which improves retrieval performance. As the dimensionality increases, so does the probability that at least one of the corresponding bits in the approximation information file is zero.
Figure 5 also illustrates the procedure used to answer a user query in our two-level index structure when the cell threshold and the section threshold are both 1. For a query Q1, the value 0.6 in the X axis falls into the third section and the value 0.8 in the Y axis falls into the fourth section. In the approximation information file, the bit for the third section in the X axis is ‘1’ and the bit for the fourth section in the Y axis is ‘0’. If one or more sections have a ‘0’ bit in the approximation information file, the query is discarded without
searching the corresponding cluster information file. So, Q1 is discarded in the first
phase. For a query Q2, the value 0.55 in the X axis and the value 0.7 in the Y axis both fall into the third section. In the approximation information file, the third bit for the X axis and the third bit for the Y axis are ‘1’, so we can calculate a cell number and obtain its cell frequency by accessing the corresponding entry of the cluster information file. As a result, in the case of Q2, we obtain the cell number 11 and its frequency 3 from the cluster information file.
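The two-phase query procedure described above can be sketched as follows, reusing the structures from the previous sketch; the function name and return convention are illustrative.

def query(point_sections, approximation, cluster_info):
    # Phase 1: if any dimension maps to a '0' bit, discard the query (case Q1)
    # without ever touching the cluster information file.
    for dim, section in enumerate(point_sections):
        if approximation[dim][section] == 0:
            return None
    # Phase 2: all bits are '1' (case Q2), so compute the cell and look up its frequency.
    return cluster_info.get(tuple(point_sections), 0)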
4 Performance Analysis
For our performance analysis, we implemented our clustering method on a Linux server with dual 650 MHz processors and 512 MB of main memory. We use one million 16-dimensional records created by the Synthetic Data Generation Code for Classification of the IBM Quest Data Mining Project [8]. A record in our experiment is composed of numeric attributes, like salary, commission, age, hvalue, hyears, loan, tax, interest, cyear, and balance, and categorical attributes, like level, zipcode, area, children, ctype, and job. The factors of our performance analysis are cluster construction time, precision, and retrieval time. We compare our clustering method (CBCM) with the CLIQUE method, which is one of the most efficient conventional clustering methods for handling high-dimensional data. For our experiments, we use three data sets: one with a random distribution, one with a standard normal distribution (variation = 1), and one with a normal distribution of variation 0.5. We also use 5 and 10 for the interval of numeric attributes. Table 1 shows the methods used for performance comparison in our experiment.
Figure 6 shows the cluster construction time when the interval of numeric attributes equals 10. It is shown that the cluster construction time increases linearly in
proportion to the amount of data. This result is applicable to large amounts of data.
The experimental result shows that the CLIQUE requires about 700 seconds for one
million items of data, while our CBCM needs only 100 seconds. Because our method
creates a smaller number of cells than the CLIQUE, our CBCM method leads to an 85% decrease in cluster construction time. The experimental result with the maximal interval (MI) = 5 is similar to that with MI = 10.
Fig. 6. Cluster Construction Time.
Figure 7 shows average retrieval time for a given user query after clusters were
constructed. When the interval of numeric attributes equals 10, the CLIQUE needs
about 17-32 seconds, while our CBCM needs about 2 seconds. When the interval
equals 5, the CLIQUE and our CBCM need about 8-13 seconds and 1 second, respectively. It is shown that our CBCM is much better on retrieval performance than the
CLIQUE. This is because our method creates a small number of cells by using our
cell creation algorithm, and achieves good filtering effect by using the approximation
information file. It is also shown that the CLIQUE and our CMCM require long retrieval time when using a data set with random distribution , compared with normal
distribution of variation 0.5. This is because as the variation of a data set decreases,
the number of clusters decreases, leading to better retrieval performance.
Fig. 7. Retrieval Time.
Figure 8 shows the precision of the CLIQUE and that of our CBCM when the section threshold is set to 0. The result shows that the CLIQUE achieves
about 95% precision when the interval equals 10, and it achieves about 92% precision
when the interval equals 5. Meanwhile, our CBCM achieves over 90% precision when
the interval of numeric attributes equals 10 while it achieves about 80% precision
when the interval equals 5. This is because the precision decreases as the number of
clusters constructed increases.
Because there is a trade-off between retrieval time and precision, we define a measure that combines both. To do this, we define a system efficiency measure in Eq. (5). Here, the system efficiency of a method (MD) shown in Table 1 is computed from the weight of precision and the weight of retrieval time, from the precision and the retrieval time of the method (MD), and from the maximum precision and the minimum retrieval time over all methods.
Fig. 8. Precision.
Fig. 9. System efficiency.
Figure 9 depicts the performance results of the methods in terms of their system efficiency when the weight of precision is three times greater than that of retrieval time. The performance results show that our CBCM outperforms the CLIQUE with respect to system efficiency, regardless of the distribution of the data sets. In particular, the performance of our CBCM with MI = 10 is the best.
5 Conclusion
The conventional clustering methods are not efficient for large, high-dimensional
data. In order to overcome the difficulty, we proposed an efficient clustering method
with two features. The first one allows us to create a small number of cells for large, high-dimensional data. To do this, we calculate the sections of each dimension using a split index and create cells according to the overlapped area of each fixed section. The second one allows us to apply an approximation technique to our clustering method for fast clustering. For this, we use a two-level index structure which consists of both an approximation information file and a cluster information file. For the performance analysis, we compared our clustering method with the CLIQUE method. The results show that our clustering method has slightly lower precision, but it achieves good performance on retrieval time as well as on cluster construction time. Finally, our clustering method shows good performance on system efficiency, a measure that combines both precision and retrieval time.
Acknowledgement
This research was supported by University IT Research Center Project.
References
1. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2000)
2. Ng R.T., Han J.: Efficient and Effective Clustering Methods for Spatial Data Mining. Proc.
of Int. Conf. on Very Large Data Bases (1994) 144-155
3. Kaufman L., Rousseeuw P.J.: Finding Groups in Data: An Introduction to Cluster Analysis.
John Wiley & Sons (1990)
4. Zhang T., Ramakrishnan R., Livny M.: BIRCH: An Efficient Data Clustering Method for
Very Large Databases. Proc. of ACM Int. Conf. on Management of Data (1996) 103-114
5. Ester M., Kriegel H.-P., Sander J., Xu X.: A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. of Int. Conf. on Knowledge Discovery and
Data Mining (1996) 226-231
6. Wang W., Yang J., Muntz R.: STING: A Statistical Information Grid Approach to Spatial
Data Mining. Proc. of Int. Conf. on Very Large Data Bases (1997) 186-195
7. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proc. of ACM Int. Conf. on Management of
Data (1998) 94-105
8. http://www.almaden.ibm.com/cs/quest
Learning with Drift Detection
João Gama1,2, Pedro Medas1, Gladys Castillo1,3, and Pedro Rodrigues1
1 LIACC - University of Porto
Rua Campo Alegre 823, 4150 Porto, Portugal
{jgama,pmedas}@liacc.up.pt, [email protected]
2 Fac. Economics, University of Porto
3 University of Aveiro
[email protected]
Abstract. Most of the work in machine learning assumes that examples are generated at random according to some stationary probability distribution. In this work we study the problem of learning when the distribution that generates the examples changes over time. We present a method for detecting changes in the probability distribution of examples. The idea behind the drift detection method is to control the online error rate of the algorithm. The training examples are presented in sequence. When a new training example is available, it is classified using the current model. Statistical theory guarantees that, while the distribution is stationary, the error will decrease; when the distribution changes, the error will increase. The method controls the trace of the online error of the algorithm. For the current context we define a warning level and a drift level. A new context is declared if, in a sequence of examples, the error increases, reaching first the warning level and later the drift level. This is an indication of a change in the distribution of the examples. The algorithm then learns a new model using only the examples seen since the warning level was reached. The method was tested with a set of eight artificial datasets and a real-world dataset. We used three learning algorithms: a perceptron, a neural network, and a decision tree. The experimental results show good performance in detecting drift and in learning the new concept. We also observe that the method is independent of the learning algorithm.
Keywords: Concept Drift, Incremental Supervised Learning, Machine
Learning
1 Introduction
In many applications, learning algorithms act in dynamic environments where the data flows continuously. If the process is not strictly stationary (as in most real-world applications), the target concept can change over time. Nevertheless, most of the work in machine learning assumes that training examples are generated at random according to some stationary probability distribution.
Examples of real problems where change detection is relevant include user modeling, monitoring in biomedicine and industrial processes, fault detection and
diagnosis, safety of complex systems, etc [1].
In this work we present a direct method to detect changes in the distribution
of the training examples. The method will be presented in the on-line learning
model, where learning takes place in a sequence of trials. On each trial, the
learner makes some kind of prediction and then receives some kind of feedback.
An important concept throughout this work is the concept of context. We define a context as a set of examples where the function generating the examples is stationary. We assume that the data stream is composed of a set of contexts. Changes between contexts can be gradual - when there is a smooth transition between
the distributions; or abrupt - when the distribution changes quickly. The aim of
this work is to present a straightforward and direct method to detect the several
moments when there is a change of context. If we can identify contexts, we can
identify which information is outdated and re-learn the model only with relevant
information to the present context.
The paper is organized as follows. The next section presents related work in detecting concept drift. In Section 3 we present the theoretical basis of the proposed method. In Section 4 we evaluate the method using several algorithms on artificial and real datasets. Section 5 concludes the paper and presents future work.
2 Tracking Drifting Concepts
There are several methods in machine learning to deal with changing concepts [7,
6,5,12]. In machine learning drifting concepts are often handled by time windows
or weighted examples according to their age or utility. In general, approaches
to cope with concept drift can be classified into two categories: i) approaches
that adapt a learner at regular intervals without considering whether changes
have really occurred; ii) approaches that first detect concept changes, and next,
the learner is adapted to these changes. Examples of the former approaches are
weighted examples and time windows of fixed size. Weighted examples are based
on the simple idea that the importance of an example should decrease with time
(references about this approach can be found in [7,6,9,10,12]). When a time
window is used, at each time step the learner is induced only from the examples that are included in the window. Here, the key difficulty is how to select
the appropriate window size: a small window can assure a fast adaptability in
phases with concept changes but in more stable phases it can affect the learner
performance, while a large window would produce good and stable learning results in stable phases but can not react quickly to concept changes. In the latter
approaches,with the aim of detecting concept changes, some indicators (e.g. performance measures, properties of the data, etc.) are monitored over time (see [7]
for a good classification of these indicators). If during the monitoring process a
concept drift is detected, some actions to adapt the learner to these changes can
be taken. When a time window of adaptive size is used these actions usually lead
to adjusting the window size according to the extent of concept drift [7]. As a
general rule, if a concept drift is detected the window size decreases, otherwise
the window size increases. An example of work relevant to this approach is the
FLORA family of algorithms developed by Widmer and Kubat [12]. For instance,
FLORA2 includes a window adjustment heuristic for a rule-based classifier. To
detect concept changes the accuracy and the coverage of the current learner are
monitored over time and the window size is adapted accordingly.
Other relevant works are those of Klinkenberg and Lanquillon, both in information filtering. For instance, to detect concept drift, Klinkenberg [7] proposes monitoring the values of three performance indicators - accuracy, recall and precision - over time, and then comparing them to a confidence interval of standard sample errors for a moving average value (using the last M batches) of each particular indicator. Although these heuristics seem to work well in their
particular domain, they have to deal with two main problems: i) to compute
performance measures, user feedback about the true class is required, but in
some real applications only partial user feedback is available; ii) a considerable number of parameters need to be tuned. Afterwards, in [6] Klinkenberg
and Joachims present a theoretically well-founded method to recognize and handle concept changes using support vector machines. The key idea is to select
the window size so that the estimated generalization error on new examples is
minimized. This approach uses unlabeled data to reduce the need for labeled
data, it doesn’t require complicated parameterization and it works effectively
and efficiently in practice.
3 The Drift Detection Method
In most real-world applications of machine learning, data is collected over time. For large time periods, it is hard to assume that the examples are independent and identically distributed. At least in complex environments, it is highly probable that class distributions change over time.
In this work we assume that examples arrive one at a time. The framework could be easily extended to situations where data comes in batches of examples. We consider the online learning framework: when an example becomes available, the decision model must take a decision (e.g. an action), and only after the decision has been taken does the environment react, providing feedback to the decision model (e.g. the class label of the example).
Suppose a sequence of examples in the form of pairs (example, label). For each example, the current decision model makes a prediction that can be either True (correct) or False (incorrect). For a set of examples, the error is a random variable from Bernoulli trials. The Binomial distribution gives the general form of the probability for the random variable that represents the number of errors in a sample of i examples. For each point i in the sequence, the error rate is the probability p_i of observing False, with standard deviation given by s_i = sqrt(p_i (1 - p_i) / i).
In the PAC learning model [11] it is assumed that, if the distribution of the examples is stationary, the error rate of the learning algorithm will decrease as the number of examples (i) increases (for an infinite number of examples, the error rate will tend to the Bayes error). A significant increase in the error of the algorithm suggests a change in the class distribution, and that the current decision model is no longer appropriate. For a sufficiently large number of examples, the
Binomial distribution is closely approximated by a Normal distribution with the same mean and variance. Considering that the probability distribution is unchanged while the context is static, the confidence interval for p_i with i examples is approximately p_i ± α s_i, where the parameter α depends on the confidence level.
The drift detection method manages two registers during the training of the learning algorithm, p_min and s_min. Every time a new example is processed, those values are updated whenever p_i + s_i is lower than p_min + s_min. We use a warning level to define the optimal size of the context window. The context window will contain the old examples that belong to the new context and a minimal number of examples from the old context. Suppose that, in the sequence of examples, there is an example with corresponding error rate p_i and standard deviation s_i. In the experiments described below the confidence level for warning has been set to 95%, that is, the warning level is reached if p_i + s_i >= p_min + 2 * s_min. The confidence level for drift has been set to 99%, that is, the drift level is reached if p_i + s_i >= p_min + 3 * s_min.
Suppose a sequence of examples where the error of the current model increases, reaching the warning level at a given example and the drift level at a later example. This is an indication of a change in the distribution of the examples. A new context is declared starting at the example where the warning level was reached, and a new decision model is induced using only the examples from that point up to the point where the drift level was reached. It is possible to observe an increase of the error reaching the warning level, followed by a decrease; we assume that such situations correspond to a false alarm, and the context is not changed. Figure 1 details the dynamic window structure. With this method of learning and forgetting we ensure a way to continuously keep a model better adapted to the present context.
This method can be applied with any learning algorithm: it can be directly implemented inside online and incremental algorithms, or implemented as a wrapper for batch learners. The goal of the proposed method is to detect sequences of examples with a stationary distribution, which we denote as contexts. From a practical point of view, what the method does is to choose the training set most appropriate to the current class distribution of the examples.
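As a wrapper, the detection rule described in this section can be sketched as follows, assuming the standard deviation s_i = sqrt(p_i (1 - p_i) / i) and the two- and three-standard-deviation thresholds quoted above; the class name, the 30-example warm-up, and the reset policy are illustrative choices rather than details fixed by the paper.

import math

class DriftDetector:
    # Tracks the online error rate p_i and its standard deviation s_i, remembers
    # the point where p_i + s_i was minimal, and signals warning/drift when the
    # current value exceeds that minimum by 2 or 3 minimal standard deviations.
    def __init__(self):
        self.n = 0          # examples seen in the current context
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, prediction_was_wrong):
        self.n += 1
        self.errors += int(prediction_was_wrong)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < 30:
            return "in-control"   # wait for a sufficiently large sample (assumed warm-up)
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3.0 * self.s_min:
            return "drift"        # caller retrains on the examples buffered since the warning and resets the detector
        if p + s >= self.p_min + 2.0 * self.s_min:
            return "warning"      # caller starts buffering examples for the possible new context
        return "in-control"

In the wrapper setting, each incoming example is first classified by the current model, the outcome (correct or not) is passed to update, and the returned status drives the buffering and retraining described above.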
4 Experimental Evaluation
In this section we describe the evaluation of the proposed method. We used three
distinct learning algorithms with the drift detection algorithm: a Perceptron, a
neural network and a decision tree [4]. These learning algorithms use different
representations to generalize examples. The simplest representation, used by the Perceptron, is linear. The neural network represents examples as a non-linear combination of attributes. The decision tree uses DNF to represent generalizations of the examples.
We have used eight artificial datasets, previously used in concept drift detection [8] and a real-world problem [3]. The artificial datasets have several different
characteristics that allow us to assess the performance of the method in various
conditions - abrupt and gradual drift, presence and absence of noise, presence of
irrelevant and symbolic attributes, numerical and mixed data descriptions.
Fig. 1. Dynamically constructed Time Window. The vertical line marks the change of
concept.
4.1 Artificial Datasets
The eight artificial datasets used are briefly described. All the problems have
two classes. Each class is represented by 50% of the examples in each context.
To ensure a stable learning environment within each context, the positive and
negative examples in the training set are interchanged. Each dataset embodies at
least two different versions of a target concept. Each context defines the strategy
to classify the examples. Each dataset is composed of 1000 randomly generated
examples in each context.
1. SINE1. Abrupt concept drift, noise-free examples. The dataset has two relevant attributes. Each attribute has values uniformly distributed in [0,1]. In the first context all points below the curve y = sin(x) are classified as positive. After the context change the classification is reversed.
2. SINE2. The same two relevant attributes. The classification function is y < 0.5 + 0.3 sin(3 * pi * x). After the context change the classification is reversed.
3. SINIRREL1. Presence of irrelevant attributes. The same classification function of SINE1 but the examples have two more random attributes
with no influence on the classification function.
4. SINIRREL2. The same classification function of SINE2 but the examples
have two more random attributes with no influence on the classification
function.
5. CIRCLES. Gradual concept drift, noise-free examples. The same two relevant attributes are used with four new classification functions. This dataset has four contexts defined by four circles, with centers [0.2,0.5], [0.4,0.5], [0.6,0.5], [0.8,0.5] and radii 0.15, 0.2, 0.25, 0.3, respectively.
6. GAUSS. Abrupt concept drift, noisy examples. Positive examples
with two relevant attributes from the domain R×R are normally distributed
around the center [0,0] with standard deviation 1. The negative examples
are normally distributed around center [2,0] with standard deviation 4. After
each context change, the classification is reversed.
7. STAGGER. Abrupt concept drift, symbolic noise-free examples. The examples have three symbolic attributes - size (small, medium, large), color (red, green), shape (circular, non-circular). In the first context only the examples satisfying the description size = small and color = red are classified positive. In the second context, the concept description is defined by two attributes, color = green or shape = circular. With the third context, the examples are classified positive if size = medium or size = large.
8. MIXED. Abrupt concept drift, boolean noise-free examples. Four relevant attributes: two boolean attributes, v and w, and two numeric attributes, x and y, from [0,1]. The examples are classified positive if at least two of the three conditions v, w, y < 0.5 + 0.3 sin(3 * pi * x) are satisfied. After each context change the classification is reversed.
4.2 Results on Artificial Domains
The purpose of these experiments is to study the effect of the proposed drift detection method on the generalization capacity of each learning algorithm. We also show that the method is independent of the learning algorithm. The results of different learning algorithms are not comparable.
Figure 2 compares the results of applying the drift detection method with the results obtained without detection, for the three learning algorithms used and two artificial datasets. The use of artificial datasets allows us to control the points where the concept drifts; these points are signaled by a vertical line. We can observe the performance curve of the learning algorithm without drift detection: during the first concept the error systematically decreases, but after the first concept drift the error strongly increases and never drops to the level of the first concept. When the concept drift is detected, the error rate has grown sharply, in contrast to the gradual growth of the model without drift detection. But the drift detection method overcomes this and, with few examples, achieves a much better performance level than the method without drift detection, as can be seen in Figure 2. While the error rate keeps growing with the non-detection algorithm, the drift detection curve falls to a lower error rate. Both with the neural network and with the decision tree, applying the detection method clearly improves learning efficiency over the plain application of the learning algorithm.
Table 1 shows the final values for the error rate by dataset and learning
algorithm. There is a significant difference of results when the drift detection is
used. We can observe that the method is effective with all learning algorithms.
Nevertheless, the differences are more significant with the neural network and
the decision tree.
4.3 The Electricity Market Dataset
The data used in these experiments was first described by M. Harries [3]. It was collected from the Australian New South Wales Electricity Market. In
Fig. 2. Abrupt Concept Drift, noise-free examples. Left column: STAGGER dataset,
right column: MIXED dataset.
this market, the prices are not fixed and are affected by demand and supply
of the market. The prices in this market are set every five minutes. Harries [3]
shows the seasonality of the price construction and the sensitivity to short-term
events such as weather fluctuations. Another factor in the price evolution was the time evolution of the electricity market itself. During the time period described in the data, the electricity market was expanded with the inclusion of adjacent areas. This allowed for a more elaborate management of the supply: the excess production of one region could be sold in an adjacent region. A consequence of this expansion was a dampening of the extreme prices. The ELEC2 dataset contains 45312 instances dated from 7 May 1996 to 5 December 1998. Each example of the dataset refers to a period of 30 minutes, i.e. there are 48 instances for each day. Each example in the dataset has 5 fields: the day of week, the time stamp, the NSW electricity demand, the Vic electricity
demand, the scheduled electricity transfer between states, and the class label. The class label identifies the change of the price relative to a moving average of the last 24 hours. The class label thus only reflects deviations of the price from a one-day average and removes the impact of longer-term price trends. The interest of this dataset is that it is a real-world dataset: we do not know when drift occurs or whether there is drift at all.
Experiments with ELEC2 Data. We have considered two problems. The
first problem consists in short term prediction: predict the changes in the prices
relative to the last day. The other problem consists in predicting the changes
in the prices relative to the last week of examples recorded. In both problems
the learning algorithm, the implementation of CART available in R, learns a
model from the training data. We have used the proposed method as a wrapper
over the learning algorithm. After seeing all the training data, the final model
classifies the test data.
As we have pointed out, we do not know if and when drift occurs. In a first set of experiments we ran a decision tree using two different training sets: all the available data (except the test data), and the examples relative to the last year. These choices correspond to ad-hoc heuristics, whereas our method makes an intelligent search for the appropriate training sets. These heuristics have been used to define upper bounds on the generalization ability of the learning algorithm.
A second set of experiments was designed to find a lower bound for the
predictive accuracy. We made an extensive search to look for the segment of
the training dataset with the best prediction performance on the test set. It should be noted that this is not feasible in practice, because it requires looking at the class labels in the test set. This result can only be seen as a lower bound. Starting with
all the training data, the learning algorithm generates a model that classifies the
test set. Each new experiment uses a subset of the last dataset which excludes
the data of the oldest week, that is, it removes the first 336 examples of the
previous experiment. In each experiment, a decision tree is generated from the
training set and evaluated on the test set. The smallest test set error is chosen
as a lower bound for comparative purposes. We made 134 experiments with the
1-day test set problem, and 133 with the 1-week test set problem, using in each
a different partition of the training dataset. Figure 3 presents the trace of the
error rate of the drift detection method using the full ELEC2 dataset. The figure
also presents the trace of the decision tree without drift detection. The third set
of experiments was the application of the drift detection method with a decision tree to the training dataset defined for each of the test datasets, the 1-day and the 1-week test dataset. With the 1-day dataset the trees are built using only the last 3836 examples of the training dataset. With the 1-week dataset the trees are built with the 3548 most recent examples, i.e. the data collected since 1998/09/16. Table 2 shows the error rate obtained with the 1-day and 1-week prediction for the three sets of experiments. We can see that the 1-day prediction error rate of the Drift Detection Method is equal to the lower bound and the 1-week prediction is very close to the lower bound. This is an excellent indicator of the drift detection method's performance.
Fig. 3. Trace of the on-line error using the Drift Detection Method applied with a
Decision Tree on ELEC2 dataset.
We have also tested the method using the ADULT dataset [2]. This dataset was created using census data from a specific point in time, so the concept should be stable. Using a decision tree as the inducer, the method never detects drift. This is an important aspect, because it provides evidence that the method is robust to false alarms.
5 Conclusions
We present a method for detection of concept drift in the distribution of the
examples. The method is simple, with direct application and is computationally efficient. The Drift Detection Method can be applied to problems where the
information is available sequentially over time. The method is independent of
the learning algorithm. It is more efficient when used with learning algorithms
with greater capacity to represent generalizations of the examples. This method
improves the learning capability of the algorithm when modeling non-stationary
problems. We intend to proceed with this research line with other learning algorithms and real-world problems. We have already started working to include the drift detection method in an incremental decision tree, and preliminary results are very promising. The algorithm can be applied with any loss function, given appropriate values for the warning and drift levels; preliminary results in the regression domain using the mean-squared error loss function confirm the results presented here.
Acknowledgments
The authors gratefully acknowledge the financial support of the project ALES (POSI/SRI/39770/2001), RETINAE, and FEDER through the plurianual support to LIACC.
References
1. Michele Basseville and Igor Nikiforov. Detection of Abrupt Changes: Theory and
Applications. Prentice-Hall Inc, 1993.
2. C. Blake, E. Keogh, and C.J. Merz. UCI repository of Machine Learning databases,
1999.
3. Michael Harries. Splice-2 comparative evaluation: Electricity pricing. Technical report, The University of New South Wales, 1999.
4. Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics.
Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.
5. R. Klinkenberg. Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 2004.
6. R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. In Pat Langley, editor, Proceedings of ICML-00, 17th International Conference on Machine Learning, pages 487–494, Stanford, US, 2000. Morgan Kaufmann
Publishers.
7. R. Klinkenberg and I. Renz. Adaptive information filtering: Learning in the presence of concept drifts. In Learning for Text Categorization, pages 33–40. AAAI
Press, 1998.
8. M. Kubat and G. Widmer. Adapting to drift in continuous domains. In Proceedings
of the 8th European Conference on Machine Learning, pages 307–310. Springer
Verlag, 1995.
9. C. Lanquillon. Enhancing Text Classification to Improve Information Filtering.
PhD thesis, University of Magdeburg, Germany, 2001.
10. M. Maloof and R. Michalski. Selecting examples for partial memory learning. Machine Learning, 41:27–52, 2000.
11. Tom Mitchell. Machine Learning. McGraw Hill, 1997.
12. Gerhard Widmer and Miroslav Kubat. Learning in the presence of concept drift
and hidden contexts. Machine Learning, 23:69–101, 1996.
Learning with Class Skews and Small Disjuncts
Ronaldo C. Prati, Gustavo E.A.P.A. Batista, and Maria Carolina Monard
Institute of Mathematics and Computer Science at University of São Paulo
P. O. Box 668, ZIP Code 13560-970, São Carlos, SP, Brazil
{prati,gbatista,mcmonard}@icmc.usp.br
Abstract. One of the main objectives of a Machine Learning – ML –
system is to induce a classifier that minimizes classification errors. Two relevant topics in ML are understanding which domain characteristics and which inducer limitations might cause an increase in misclassification. In this sense, this work analyzes two important issues that might influence the performance of ML systems: class imbalance and error-prone small disjuncts. Our main objective is to investigate how these
two important aspects are related to each other. Aiming at overcoming
both problems we analyzed the behavior of two over-sampling methods
we have proposed, namely Smote + Tomek links and Smote + ENN.
Our results suggest that these methods are effective for dealing with
class imbalance and, in some cases, might help in ruling out some undesirable disjuncts. However, in some cases a simpler method, Random
over-sampling, provides compatible results requiring less computational
resources.
1 Introduction
This paper aims to investigate the relationship between two important topics
in recent ML research: learning with class imbalance (class skews) and small
disjuncts. Symbolic ML algorithms usually express the induced concept as a
set of rules. Besides a small overlap within some rules, a set of rules might be
understood as a disjunctive concept definition. The size of a disjunct is defined
as the number of training examples it correctly classifies. Small disjuncts are
those disjuncts that correctly cover only few training cases. In addition, class
imbalance occurs in domains where the number of examples belonging to some
classes heavily outnumber the number of examples in the other classes. Class
imbalance has often been reported in the ML literature as an obstacle for the
induction of good classifiers, due to the poor representation of the minority class.
On the other hand, small disjuncts have often been reported as having higher
misclassification rates than large disjuncts. These problems frequently arise in
applications of learning algorithms in real world data, and several research papers
have been published aiming to overcome such problems. However, these efforts
have produced only marginal improvements and both problems still remain open.
A better understanding of how class imbalance influences small disjuncts (and
of course, the inverse problem) may be required before meaningful results might
be obtained.
Weiss [1] suggests that there is a relation between the problem of small disjuncts and class imbalance, stating that one of the reasons why small disjuncts have a higher error rate than large disjuncts is class imbalance. Furthermore, Japkowicz [2] reinforces this hypothesis, stating that the problem of learning with class imbalance is aggravated when it yields small disjuncts. Even though these papers point out a connection between such problems, the true relationship between them is not yet well established. In this work, we aim to further investigate this relationship.
This work is organized as follows: Section 2 reports some related work and
points out some connections between class imbalance and small disjuncts. Section 3 describes some metrics for measuring the performance of ML algorithms
regarding small disjuncts and class skews. Section 4 discusses the experimental
results of our work and, finally, Section 5 presents our concluding remarks and
outlines future research directions.
2 Related Work
Holte et al. [3] report two main problems when small disjuncts arise in a concept
definition: (a) the difficulty in reliably eliminating the error-prone small disjuncts
without producing an undesirable net effect on larger disjuncts and; (b) the
algorithm maximum generality bias that tends to favor the induction of good
large disjuncts and poor small disjuncts.
Several research papers have been published in the ML literature aiming to
overcome such problems. Those papers often advocate the use of pruning to draw
small disjuncts off the concept definition [3,4] or the use of alternative learning
bias, generally using hybrid approaches, for coping with the problem of small
disjuncts [5]. Similarly, class imbalance has been often reported as an obstacle for
the induction of good classifiers, and several approaches have been reported in
the literature with the purpose of dealing with skewed class distributions. These
papers often use sampling schemas, where examples of the majority class are
removed from the training set [6] or examples of the minority class are added
to the training set [7] in order to obtain a more balanced class distribution.
However, in some domains standard ML algorithms induce good classifiers even
using highly imbalanced training sets. This indicates that class imbalance is not
solely accountable for the decrease in performance of learning algorithms. In [8]
we conjecture that the problem is not only caused by class skews, but is also
related to the degree of data overlapping among the classes.
A straightforward connection between both themes can be traced by observing that minority classes may lead to small disjuncts, since there are fewer
examples in these classes than in the others, and the rules induced from them
tend to cover fewer examples. Moreover, disjuncts induced to cover rare cases are
likely to have higher error rates than disjuncts that cover common cases, as rare
cases are less likely to be found in the test set. Conversely, as the algorithm tries
to generalize from the data, minority classes may yield some small disjuncts to
be ruled out from the set of rules. When the algorithm is generalizing, common
cases can “overwhelm” a rare case, favoring the induction of larger disjuncts.
Nevertheless, it is worth noticing the differences between class imbalance
and small disjuncts. Rare cases exist in the underlying population from which
training examples are drawn, while small disjuncts might also be a consequence
of the learning algorithm bias. In fact, as we stated before, rare cases might have a dual role regarding small disjuncts, either leading to undesirable small disjuncts or not allowing the formation of desirable ones; moreover, small disjuncts might be formed even when the number of examples in each class is naturally balanced. In a nutshell, class imbalance is a characteristic of a domain while small disjuncts are not [9].
As we mentioned before, Weiss [1] and Japkowicz [2] have suggested that
there is a relation between both problems. However, Japkowicz performed her
analysis on artificially generated data sets and Weiss only considers one aspect
of the interaction between small disjuncts and class imbalances.
3 Evaluating Classifiers with Small Disjuncts and Imbalanced Domains
From here on, in order to facilitate our analysis, we constrain our discussion to binary class problems where, by convention, the minority class is called positive and the majority class is called negative. The most straightforward way to evaluate the performance of classifiers is based on confusion matrix analysis. Table 1 illustrates a confusion matrix for a two-class problem. A number of
widely used metrics for measuring the performance of learning systems can be
extracted from such a matrix, such as error rate and accuracy. However, when
the prior class probabilities are very different, the use of such measures might
produce misleading conclusions since those measures do not take into consideration misclassification costs, are strongly biased to favor the majority class and
are sensitive to class skews.
Thus, it is more interesting to use performance metrics that disassociate the errors (or hits) that occur in each class. Four performance metrics that directly measure the classification performance on the positive and negative classes independently can be derived from Table 1, namely the true positive rate (the percentage of correctly classified positive examples), the false positive rate (the percentage of negative examples incorrectly classified as positive), the true negative rate (the percentage of correctly classified negative examples) and the false negative rate (the percentage of positive examples incorrectly classified as negative). These four performance metrics have the advantage of being independent of class
costs and prior probabilities. The aim of a classifier is to minimize the false positive and false negative rates or, similarly, to maximize the true negative and true positive rates. Unfortunately, for most real-world applications there is a trade-off between the true positive and false positive rates and, similarly, between the true negative and false negative rates.
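For concreteness, the four rates can be computed directly from the confusion matrix counts; the sketch below follows the standard definitions and is an illustration, not code from the paper.

def rates(tp, fp, fn, tn):
    # Per-class rates from a two-class confusion matrix; assumes both classes
    # have at least one example.
    return {
        "tp_rate": tp / (tp + fn),   # proportion of positive examples correctly classified
        "fp_rate": fp / (fp + tn),   # proportion of negative examples classified as positive
        "tn_rate": tn / (fp + tn),
        "fn_rate": fn / (tp + fn),
    }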
ROC (Receiver Operating Characteristic) analysis enables one to compare
different classifiers regarding their true positive rate and false positive rate. The
basic idea is to plot the classifiers performance in a two-dimensional space, one
dimension for each of these two measurements. Some classifiers, such as the
Naïve Bayes classifier and some Neural Networks, yield a score that represents
the degree to which an example is a member of a class. For decision trees, the
class distributions on each leaf can be used as a score. Such ranking can be used to
produce several classifiers by varying the threshold above which an example is classified
into a class. Each threshold value produces a different point in the ROC space.
These points are linked by tracing straight lines through two consecutive points
to produce a ROC curve. The area under the ROC curve (AUC) represents the
expected performance as a single scalar. In this work, we use a decision tree
inducer and the method proposed in [10] with Laplace correction for measuring
the leaf accuracy to produce ROC curves.
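A minimal sketch of this construction is shown below: scores (e.g. Laplace-corrected leaf accuracies) are swept with a decreasing threshold, each threshold yields one ROC point, and the area is accumulated with the trapezoid rule. It is a simplification of the procedure in [10]; in particular, tied scores are not grouped, and the function name is illustrative.

def roc_auc(scores, labels):
    # `labels` are 0/1 with 1 for the positive class; both classes must be present.
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (false positive rate, true positive rate)
    auc = sum((x2 - x1) * (y1 + y2) / 2.0
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc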
In order to measure the degree to which errors are concentrated towards
smaller disjuncts, Weiss [1] introduced the Error Concentration (EC) curve. The
EC curve is plotted starting with the smallest disjunct from the classifier and
progressively adding larger disjuncts. For each iteration where a larger disjunct is
added, the percentage of test errors versus the percentage of correctly classified
examples is plotted. The line Y = X corresponds to classifiers having errors
equally distributed towards all disjuncts. Error Concentration is defined as the
percentage of the total area above the line Y = X that falls under the EC
curve. EC may take values between 100%, which indicates that the smallest disjunct(s) covers all test errors before even a single correctly classified test example is covered, and -100%, which indicates that the largest disjunct(s) covers
In order to illustrate these two metrics Figure 1 shows the ROC (Fig. 1(a))
and the EC (Fig. 1(b)) graphs for the pima data set and pruned trees – see
Table 3. The AUC for the ROC graph is 81.53% and the EC measure from the
EC graph is 42.03%. The graphs might be interpreted as follows: from the ROC
graph, considering for instance a false positive rate of 20%, one might expect a
true positive rate of nearly 65%; and from the EC graph, the smaller disjuncts
that correctly cover 20% of the examples are responsible for more than 55% of
the misclassifications.
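The EC measure can likewise be computed from per-disjunct test statistics. The sketch below assumes that, for each disjunct, its size and its numbers of correctly and incorrectly classified test examples are available; the function name and the triple layout are illustrative assumptions.

def error_concentration(disjuncts):
    # `disjuncts` is a list of (size, correct, errors) triples for the test set.
    disjuncts = sorted(disjuncts, key=lambda d: d[0])      # smallest disjuncts first
    total_correct = sum(c for _, c, _ in disjuncts)
    total_errors = sum(e for _, _, e in disjuncts)
    if total_correct == 0 or total_errors == 0:
        return 0.0
    x = y = area = 0.0
    for _, correct, errors in disjuncts:
        nx = x + correct / total_correct                   # cumulative % of correct examples
        ny = y + errors / total_errors                     # cumulative % of test errors
        area += (nx - x) * (y + ny) / 2.0                  # area under the EC curve
        x, y = nx, ny
    return 100.0 * (area - 0.5) / 0.5                      # % of the area above the line Y = X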
4 Experimental Evaluation
The aim of our research is to provide some insights into the relationship between class imbalances and small disjuncts. To this end, we performed a broad
experimental evaluation using ten data sets from UCI [11] having minority class
distribution spanning from 46.37% to 7.94%, i.e., from nearly balanced to skewed
Fig. 1. ROC and EC graphs for the pima data set and pruned trees.
distributions. Table 2 summarizes the data sets employed in this study. It shows,
for each data set, the number of examples (#Examples), number of attributes
(#Attributes), number of quantitative and qualitative attributes, and class distribution. For data sets having more than two classes, we chose the class with the fewest examples as the positive class, and collapsed the remaining classes into the
negative class.
In our experiments we used release 8 of the C4.5 symbolic learning algorithm to induce decision trees [12]. Firstly, we ran C4.5 over the data sets and calculated the AUC and EC for pruned (default parameter settings) and
unpruned trees induced for each data set using 10-fold stratified cross-validation.
Table 3 summarizes these results, reporting mean value results and their respective standard deviations. It should be observed that for two data sets, Sonar and
Glass, C4.5 was not able to prune the induced trees. Furthermore, for data set
Flag and pruned trees, the default model was induced.
We consider the results obtained for both pruned and unpruned trees because
we aim to analyze whether pruning is effective for coping with small disjuncts
in the presence of class skews. Pruning is often reported in the ML literature as
a rule of thumb for dealing with the small disjuncts problem. The conventional wisdom behind pruning is to perform significance and/or error rate tests aiming
to reliably eliminate undesirable disjuncts. The main reason for verifying the
effectiveness of pruning is that several research papers indicate that pruning
should be avoided when target misclassification costs or class distributions are
unknown [13,14]. One reason to avoid pruning is that most pruning schemes,
including the one used by C4.5, attempt to minimize the overall error rate.
These pruning schemes can be detrimental to the minority class, since reducing
the error rate on the majority class, which stands for most of the examples,
would result in a greater impact over the overall error rate. Another fact is
that significance tests are mainly based on coverage estimation. As skewed class
distributions are more likely to include rare or exceptional cases, it is desirable
for the induced concepts to cover these cases, even if they can only be covered
by augmenting the number of small disjuncts in a concept.
The results in Table 3 indicate that the decision of not pruning the decision trees
systematically increases the AUC values. For all data sets in which the algorithm
was able to prune the induced trees, there is an increase in the AUC values. However, the EC values also increase in almost all unpruned trees. As stated before,
this increase in EC values generally means that the errors are more concentrated
towards small disjuncts. Furthermore, pruning removes most branches responsible for covering the minority class, thus not pruning is beneficial for learning
with imbalanced classes. However, the decision of not pruning also leaves these
small disjuncts in the learned concept. As these disjuncts are error-prone, since
pruning would remove them, the overall error tends to concentrate on these disjuncts, increasing the EC values. Thus, concerning the problem of pruning or not
pruning, a trade-off between the increase we are looking for in the AUC values
and the undesirable rise in the EC values seems to exist.
We have also investigated how sampling strategies behave with respect to
small disjuncts and class imbalances. We decided to apply the sampling methods until a balanced distribution was reached. This decision is motivated by the
results presented in [15], in which it is shown that when AUC is used as performance measure, the best class distribution for learning tends to be near the
balanced class distribution. Moreover, Weiss [1] also investigates the relationship
between sampling strategies and small disjuncts using a Random under-sampling
method to artificially balance training sets. Weiss’ results show that the trees
induced using balanced data sets seem to systematically outperform the trees
induced using the original stratified class distribution from the data sets, not
only increasing the AUC values but also decreasing the EC values. In our view,
the decrease in the EC values might be explained by the reduction in the number of induced disjuncts in the concept description, which is a characteristic of
under-sampling methods. We believe this approach might rule out some interesting disjuncts from the concept. Moreover, in previous work [16] we showed that
over-sampling methods seem to perform better than under-sampling methods,
resulting in classifiers with higher AUC values. Table 4 shows the AUC and EC
values for two over-sampling methods proposed in the literature: Random over-sampling and Smote [7]. Random over-sampling randomly duplicates examples
from the minority class, while Smote introduces artificially generated examples
by interpolating pairs of minority-class examples that lie close together.
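As an illustration of the two over-sampling ideas, the sketch below duplicates minority examples at random and generates Smote-style synthetic examples by interpolation. It is written directly in NumPy; the neighborhood size k and the toy data are arbitrary choices, not settings taken from our experiments.

# Sketch of Random over-sampling and SMOTE-style interpolation (NumPy only).
import numpy as np

def random_oversample(X_min, n_new, rng):
    # duplicate randomly chosen minority-class examples
    idx = rng.integers(0, len(X_min), size=n_new)
    return X_min[idx]

def smote_like(X_min, n_new, k, rng):
    # interpolate between a minority example and one of its k nearest
    # minority neighbors, in the spirit of Smote [7]
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(0, len(X_min))
        j = neighbors[i, rng.integers(0, k)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 4))  # toy minority-class examples
print(random_oversample(X_min, 5, rng).shape)  # (5, 4)
print(smote_like(X_min, 5, k=5, rng=rng).shape)  # (5, 4)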
Table 4 reports results regarding unpruned trees. Besides our previous comments concerning pruning and class imbalance, whether pruning can lead to a
performance improvement for decision trees grown over artificially balanced data sets still seems to be an open question. Another argument against pruning
is that if pruning is allowed to execute under such conditions, the learning system would prune based on a false assumption, i.e., that the test set distribution
matches the training set distribution.
The results in Table 4 show that, in general, the best AUC result obtained by
an unpruned over-sampled data set is similar to (less than 1% difference) or higher
than those obtained by pruned and unpruned trees grown over the original data
sets. Moreover, unpruned over-sampled data sets also tend to produce higher
EC values than pruned and unpruned trees grown over the original data sets.
It is also worth noticing that Random over-sampling, which can be considered
the simplest method, produced similar results to Smote (with a difference of less
than 1% in AUC) in six data sets (Sonar, Pima, German, New-thyroid, Satimage
and Glass); Random over-sampling beats Smote (with a difference greater than
1%) in two data sets (Bupa and Flag) and Smote beats Random over-sampling
in the other two (Haberman and E-coli). Another interesting point is that both
over-sampling methods produced lower EC values than unpruned trees grown
over the original data for four data sets (Sonar, Bupa, German and New-thyroid),
and Smote itself produced lower EC values for another one (Flag). Moreover, in
three data sets (Sonar, Bupa and New-thyroid) Smote produced lower EC values
even when compared with pruned trees grown over the original data.
These results might be explained by observing that, by using an interpolation
method, Smote might help in the definition of the decision border of each class.
However, as a side effect, by introducing artificially generated examples Smote
might introduce noise into the training set. Although Smote might help in overcoming the class imbalance problem, in some cases it might be detrimental with regard to the problem of small disjuncts. This observation, combined with the results we
obtained in a previous study that poses class overlapping as a complicating factor
for dealing with class imbalance [8], motivated us to propose two new methods
to deal with the problem of learning in the presence of class imbalance [16].
These methods combine Smote [7] with two data cleaning methods: Tomek links [17]
and Wilson's Edited Nearest Neighbor Rule (ENN) [18]. The main motivation
behind these methods is to get the best of both worlds. We not only
balance the training data aiming at increasing the AUC values, but also remove
noisy examples lying on the wrong side of the decision border. The removal of
noisy examples might aid in finding better-defined class clusters, allowing the
creation of simpler models with better generalization capabilities. As a net effect,
these methods might also remove some undesirable small disjuncts, improving
the classifier performance. In this sense, these data cleaning methods can be
understood as an alternative to pruning.
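The cleaning step can be illustrated by the detection of Tomek links, sketched below with a brute-force nearest-neighbor search; this is a simplified, hypothetical reimplementation of the idea, not the code used to produce the reported results.

# Sketch: find Tomek links (opposite-class mutual nearest neighbors).
# Removing the majority-class member of each link (or both) is the cleaning
# step that is combined with Smote in the proposed methods.
import numpy as np

def tomek_links(X, y):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)  # index of each example's 1-nearest neighbor
    links = []
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j] and i < j:  # mutual NNs of different classes
            links.append((i, j))
    return links

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = (rng.random(30) < 0.3).astype(int)  # toy, imbalanced labels
print(tomek_links(X, y))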
Table 5 shows the results of our proposed methods on the same data sets.
Comparing these two methods, it can be observed that Smote + Tomek produced
higher AUC values for four data sets (Sonar, Pima, German and Haberman),
while Smote + ENN is better in two data sets (Bupa and Glass). For the other
four data sets they produced compatible AUC results (with a difference lower
than 1%). However, it should be observed that for three data sets (New-thyroid,
Satimage and Glass) Smote + Tomek obtained results identical to Smote (Table 4). This occurs when no Tomek links, or just a few of them, are found in the
data sets.
Table 6 shows a ranking of the AUC and EC results obtained in all experiments for unpruned decision trees, where: O indicates the original data set
(Table 3), R and S stand respectively for Random and Smote over-sampling
(Table 4), and S+E and S+T stand for Smote + ENN and Smote + Tomek
(Table 5). One symbol marks the methods ranked among the best and another those ranked among
the second best for the corresponding data set. Observe that results having a difference lower than 1% are ranked together. Although the proposed conjugated
over-sampling methods obtained just one EC value ranked in the first place
(Smote + ENN on data set German) these methods provided the highest AUC
values in seven data sets. Smote + Tomek produced the highest AUC values
in four data sets (Sonar, Haberman, Ecoli and Flag), and the Smote + ENN
method produced the highest AUC values in another three data sets (Satimage,
New-thyroid and Glass). If we analyze both measures together, in the four data sets
where Smote + Tomek produced results among the top-ranked AUC values, it
is also in second place with regard to lower EC values (Sonar, Pima, Haberman and New-thyroid). However, it is worth noticing in Table 6 that simpler
methods, such as the Random over-sampling approach (R) or taking only the
unpruned tree (O), have also produced interesting results in some data sets. In
the New-thyroid data set, Random over-sampling produced one of the highest
AUC values and the lowest EC value. In the German data set, the unpruned
tree produced the highest AUC value, and the EC value is almost the same as
in the other methods that produced high AUC values. Nevertheless, the results
we report suggest that the methods we propose in [16] might be useful, especially
if we aim to further analyze the induced disjuncts that compose the concept
description.
5 Conclusion
In this work we discuss results related to some aspects of the interaction between learning with class imbalances and small disjuncts. Our results suggest
that pruning might not be effective for dealing with small disjuncts in the presence of class skews. Moreover, artificially balancing class distributions with oversampling methods seems to increase the number of error-prone small disjuncts.
Our proposed methods, which combine over-sampling with data cleaning methods,
produced meaningful results in some cases. Conversely, in some cases, Random
over-sampling, a very simple over-sampling method, also achieved compatible results. Although our results are not conclusive with respect to a general approach
for dealing with both problems, further investigation into this relationship might
help to produce insights on how ML algorithms behave in the presence of such
conditions. In order to investigate this relationship in more depth, several further
approaches might be taken. A natural extension of this work is to individually
analyze the disjuncts that compose each description, assessing their quality
according to some objective or subjective criterion. Another interesting topic is
to analyze the ROC and EC graphs obtained for each data set and method.
This might provide us with a more in-depth understanding of the behavior of
pruning and balancing methods. Last but not least, another interesting point
to investigate is how alternative learning biases behave in the presence of class
skews.
Acknowledgements
We wish to thank the anonymous reviewers for their helpful comments. This
research was partially supported by the Brazilian Research Councils CAPES
and FAPESP.
References
1. Weiss, G.M.: The Effect of Small Disjuncts and Class Distribution on Decision
Tree Learning. PhD thesis, Rutgers University (2003)
2. Japkowicz, N.: Class Imbalances: Are we Focusing on the Right Issue? In: ICML
Workshop on Learning from Imbalanced Data Sets. (2003)
3. Holte, R.C., Acker, L.E., Porter, B.W.: Concept Learning and the Problem of Small
Disjuncts. In: IJCAI. (1989) 813–818
4. Weiss, G.M.: The Problem with Noise and Small Disjuncts. In: ICML. (1998) 574–578
5. Carvalho, D.R., Freitas, A.A.: A Hybrid Decision Tree/Genetic Algorithm for Coping with the Problem of Small Disjuncts in Data Mining. In: Genetic and Evolutionary Computation Conference. (2000) 1061–1068
6. Kubat, M., Matwin, S.: Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In: ICML. (1997) 179–186
7. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic
Minority Over-sampling Technique. JAIR 16 (2002) 321–357
8. Prati, R.C., Batista, G.E.A.P.A., Monard, M.C.: Class Imbalances versus Class
Overlapping: an Analysis of a Learning System Behavior. In: MICAI. (2004) 312–
321 Springer-Verlag, LNAI 2972.
9. Weiss, G.M.: Learning with Rare Cases and Small Disjuncts. In: ICML. (1995) 558–
565
10. Ferri, C., Flach, P., Hernández-Orallo, J.: Learning Decision Trees Using the Area
Under the ROC Curve. In: ICML. (2002) 139–146
11. Blake, C., Merz, C.: UCI Repository of Machine Learning Databases (1998)
http://www.ics.uci.edu/~mlearn/MLRepository.html.
12. Quinlan, J.R.: C4.5 Programs for Machine Learning. Morgan Kaufmann (1993)
13. Zadrozny, B., Elkan, C.: Learning and Making Decisions When Costs and Probabilities are Both Unknown. In: KDD. (2001) 204–213
14. Bauer, E., Kohavi, R.: An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning 36 (1999) 105–139
15. Weiss, G.M., Provost, F.: Learning When Training Data are Costly: The Effect of
Class Distribution on Tree Induction. JAIR 19 (2003) 315–354
16. Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A Study of the Behavior of Several
Methods for Balancing Machine Learning Training Data. SIGKDD Explorations 6
(2004) (to appear).
17. Tomek, I.: Two Modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (1976) 769–772
18. Wilson, D.L.: Asymptotic Properties of Nearest Neighbor Rules Using Edited Data.
IEEE Transactions on Systems, Man, and Cybernetics 2 (1972) 408–421
Making Collaborative Group Recommendations
Based on Modal Symbolic Data*
Sérgio R. de M. Queiroz and Francisco de A.T. de Carvalho
Centro de Informática (CIn/UFPE) Cx. Postal 7851, CEP 50732-970, Recife, Brazil
{srmq,fatc}@cin.ufpe.br
Abstract. In recent years, recommender systems have achieved great
success. Popular sites give thousands of recommendations every day.
However, despite the fact that many activities are carried out in groups,
like going to the theater with friends, these systems focus on recommending items for individual users. This brings out the need for systems capable
of performing recommendations for groups of people, a domain that has
received little attention in the literature. In this article we introduce
a novel method of making collaborative recommendations for groups,
based on models built using techniques from symbolic data analysis. We then empirically evaluate the proposed method to see its behaviour for
groups of different sizes and degrees of homogeneity, and compare the
achieved results with both an aggregation-based methodology previously
proposed and a baseline methodology.
1 Introduction
You arrive at home and turn on your cable TV. There are 150 channels to choose
from. How can you quickly find a program that will likely interest you? When
one has to make a choice without full knowledge of the alternatives, a common
approach is to rely on the recommendations of trusted individuals: a TV guide,
a friend, a consulting agency. In the 1990s, computational recommender systems appeared to automatize the recommendation process. Nowadays, we have
(mostly in the Web) various recommender systems. Popular sites, like Amazon.com, have recommendation areas where users can see which items would be
of their interest.
One of the most successful technologies used by these systems has been
collaborative filtering (CF) (see e.g. [1]). The CF technique is based on the assumption that the best recommendations for an individual are those given by
people with preferences similar to his/her preferences.
However, until now, these systems have focused only on making recommendations for individuals, despite the fact that many day-to-day activities are performed in groups (e.g. watching TV at home). This highlights the need to develop recommender systems for groups, which are able to capture the preferences
of whole groups and make recommendations for them.
* The authors would like to thank CNPq and CAPES (Brazilian Agencies) for their financial support.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 307–316, 2004.
© Springer-Verlag Berlin Heidelberg 2004
When recommending for groups, the utmost goal is that the recommendations should be the best possible for the group. Thus, two prime questions are
raised: What is the best suggestion for a group? How can this suggestion be reached?
The concept of making recommendations for groups has received little attention in the literature of recommender systems. A few works have developed
recommender systems capable of recommending for groups ([2–4]), but none of
them have delved into the difficulties involved in achieving good recommendations for groups (i.e., the two fundamental questions previously cited).
Although little about this topic has been studied in the literature of recommender systems, how to achieve good group results from individual preferences
is an important topic in many research areas, with different roots. Beginning in
the 18th century, motivated by the problem of voting, and extending to modern research areas
like operational research, social choice, multicriteria decision making and social
psychology, this topic has been treated by diverse research communities.
Developments in these research fields are important for a better understanding of the problem and the identification of the limitations of proposed solutions,
as well as for the development of recommender systems that achieve results similar to the ones groups of people would reach during a discussion.
A conclusion that can be drawn from these areas is that there is no “perfect” way to aggregate individual preferences in order to achieve a group result.
Arrow’s impossibility theorem [5] which showed that it is impossible for any procedure (termed a social function in social choice parlance) to achieve at the same
time a set of simple desirable properties is but one of the most known results in
social choice to show that an ideal social function is unattainable. Furthermore,
many empirical studies in social psychology have noted that the adequacy of
a decision scheme (the mechanism used by a group of people to combine the
individual preferences of its members into the group result) to the group decision process is very dependent to the group’s intrinsic characteristics and the
problem’s nature (see e.g. [6]). Multi-criteria decision making strengthens the
view that the achievement of an “ideal configuration” is not the most important
feature when working with decisions (in fact, this ideal may not exist in most
of the times) and highlights the importance of giving the users interactivity and
permit the analysis of different possibilities.
However, the nonexistence of an ideal does not mean that we cannot compare
different possibilities. Based on good properties that a preference aggregation
scheme should have, we can define meaningful metrics to quantify the goodness
of group recommendations. These metrics will not be completely free of value judgments,
but the judgments will reflect desirable properties.
In this article we introduce a novel method of making recommendations for
groups, based on the ideas of collaborative filtering and symbolic data analysis [7]. To be used to recommend for groups, the CF methodology has to be
adapted. We can think of two different ways to modify it with this goal. The
first is to use CF to recommend to the individual members of the group, and
then aggregate the recommendations in order to obtain the recommendation for
the group as a whole (we will call these approaches "aggregation-based
ologies”). The second is to modify the CF process so that it directly generates
a recommendation for the group. This involves the modeling of the group as a
single entity, a meta-user (we will call this approaches “model-based methodologies”). Here we take the second approach, using techniques from symbolic data
analysis to model the users. After, we experimentally evaluate the proposed
method to see its behaviour under groups of different sizes and degrees of homogeneity. For each group configuration the behaviour of the proposed method
is compared with both an aggregation-based methodology we have previously
proposed (see [8]) and a baseline methodology. The metric used reflects good
social characteristics for the group recommendations.
2 Recommending for Groups
2.1 The Problem
The problem of recommendations for groups can be posed as follows: how to
suggest (new) items that will be liked by the group as a whole, given that we have
a set of historical individual preferences from the members of this group as well
as preferences from other individuals (who are not in the group).
Thinking collaboratively, we want to know how to use the preferences (evaluations over items) of the individuals in the system to predict how one group
of individuals (a subset of the community) will like the items available. From this,
we would be able to suggest items that will be valuable for this group.
2.2 Symbolic Model-Based Approach
In this section we develop a model-based recommendation strategy for groups.
During the recommendation process, it uses models for the items – which can be
pre-computed – and does not require the computation of on-line user neighborhoods, thus avoiding a scalability problem present in many collaborative filtering
algorithms (for individuals). To create and compare the models, techniques
from symbolic data analysis are used.
The intuition behind our approach is that for each item we can identify the
group of people who like it and the group of people that do not like it. We
assume that the group for which we will make a recommendation will appreciate
an item if the group has similar preferences to the group of people who like the
item and is dissimilar to the group of people who do not like it.
To implement this, first the group of users for whom the recommendations
will be computed is represented by a prototype that contains the histogram
of rates for each item evaluated by the group. The target items (items that
can be recommended) are also represented in a similar way, but now we create
two prototypes for each target item: a positive prototype, that contains the
histogram of rates for (other) items evaluated by individuals who liked the target
item; and a negative prototype that is analogous to the positive one, but the
individuals chosen are those who did not like the target item. Next we compute
the similarity between the group prototype and the two prototypes of each target
item. The final similarity between a target item and a group is given by a simple
linear combination of these two values: the final similarity grows with the similarity
between the group prototype and the positive item prototype and decreases with
the similarity between the group prototype and the negative one. Finally,
we order the target items by decreasing order of final similarity. If we want to
recommend a given number of items to the users, we can take that many items from the top of this ordering.
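A minimal sketch of this ranking step is given below. The exact linear combination is not reproduced here; the weight alpha and the subtraction of the negative-prototype similarity are illustrative assumptions that merely respect the intuition stated above.

# Sketch of the final ranking step; alpha and the subtraction are assumptions.
def rank_targets(group_proto, targets, similarity, alpha=0.5):
    """targets: dict mapping item -> (positive_prototype, negative_prototype)."""
    scored = []
    for item, (proto_pos, proto_neg) in targets.items():
        s_pos = similarity(group_proto, proto_pos)
        s_neg = similarity(group_proto, proto_neg)
        # high when the group resembles those who liked the item and
        # does not resemble those who disliked it
        scored.append((alpha * s_pos - (1.0 - alpha) * s_neg, item))
    return [item for _, item in sorted(scored, reverse=True)]

Taking the first entries of the returned list yields the items to recommend.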
Figure 1 depicts the recommendation process. Its two main aspects, the creation
of prototypes and the similarity computation will be described in the following
subsections.
Fig. 1. The recommendation process
Prototype Generation. A fundamental step of this method is the prototype
generation. The group and the target items are represented by the histograms
of rates for items. Different weights can be attributed to each histogram that
make up the prototypes. In other words, each prototype is described by a set of
symbolic variables, one per item. Each item corresponds to a categorical modal variable
that may also have an associated weight, and the modalities of this variable are the different
rates that can be given to items. In our case, we have six modalities.
Group Prototype. In the group prototype we have the rate histograms for every
item that has been evaluated by at least one member of the group. The rate histogram is built by computing the frequency of each modality in the ratings of the
group members for the item being considered. The used data has a discrete set of
6 rates: {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}, where 0.0 is the worst and 1.0 is the best rate.
For example, if an item was evaluated by 2 users in a group of 3 individuals
and they gave the ratings 0.4 and 0.6 for the item, the row in the symbolic data
table corresponding to the item would have weight 2/3 and rate histogram
(0, 0, 1/2, 1/2, 0, 0), assuming the weight is the fraction of the group that has evaluated the item.
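The construction of such a row can be sketched as follows; the representation of a row as a (weight, histogram) pair is an implementation choice made only for illustration.

# Sketch: one symbolic-data-table row (weight + rate histogram) per item.
RATES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def prototype_row(item_ratings, group_size):
    """item_ratings: rates given by the group members who evaluated the item."""
    hist = [item_ratings.count(r) / len(item_ratings) for r in RATES]
    weight = len(item_ratings) / group_size  # fraction of the group that rated it
    return weight, hist

# the example from the text: 2 of 3 members rated the item with 0.4 and 0.6
print(prototype_row([0.4, 0.6], 3))  # (0.666..., [0.0, 0.0, 0.5, 0.5, 0.0, 0.0])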
Item Prototypes. To build a prototype for a target item, the first step is to decide
which users will be selected to have their evaluations in the prototype. These users
have the role of characterizing the profile of those who like the target item, for
the positive profile; and of characterizing the profile of those who do not like the
target item, for the negative profile. Accordingly, for the positive prototype only
the users that evaluated the target item highly are chosen. Users that have given
rates 0.8 or 1.0 were chosen as the “positive representatives” for the group. For
the negative prototype the users that have given 0.0 or 0.2 for the target item
were chosen. One parameter for the building of the models is how many users
will be chosen for each target item. We have chosen 300 users for each prototype,
after experimenting with 30, 50, 100, 200 and 300 users.
Similarity Calculation. To compute the similarity between the prototype of
a group and the prototype of a target item, we only consider the items that are
in both prototypes. As similarity measure we tried Bacelar-Nicolau’s weighted
affinity coefficient (presented in [7]) and two measures based on the Euclidean
distance and the Pearson correlation, respectively. At the end we used the affinity
coefficient, as it achieved slightly better results. The similarity between two
prototypes a and b based on the weighted affinity coefficient is given by

sim(a, b) = \sum_{j=1}^{n} w_j \sum_{k=1}^{m} \sqrt{ f^{a}_{jk} \, f^{b}_{jk} }

where: n is the number of items present in both prototypes; w_j is the weight attributed to item j; m is the number of modalities (six, for the six different rates); and f^{a}_{jk} and f^{b}_{jk} are the relative frequencies obtained by rate k in the prototypes a and b for the item j, respectively.
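Assuming the reconstruction of the weighted affinity coefficient given above, and representing a prototype as a mapping from items to (weight, histogram) pairs (an implementation choice, not the paper's data structure), the similarity computation can be sketched as:

# Sketch of the weighted affinity coefficient between two prototypes.
# A prototype is assumed to be {item_id: (weight, histogram_over_6_rates)}.
from math import sqrt

def affinity(proto_a, proto_b):
    sim = 0.0
    for item in set(proto_a) & set(proto_b):  # only items in both prototypes
        w_a, hist_a = proto_a[item]
        _, hist_b = proto_b[item]
        # which prototype's weight to use is an assumption (here, the first one)
        sim += w_a * sum(sqrt(fa * fb) for fa, fb in zip(hist_a, hist_b))
    return sim

g = {"movie1": (1.0, [0, 0, 0.5, 0.5, 0, 0])}
p = {"movie1": (1.0, [0, 0, 0.0, 0.5, 0.5, 0])}
print(affinity(g, p))  # 0.5 for this toy pair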
3 Experimental Evaluation
We carried out the experiments with the same groups that were used in [8]. To
make this article more self-contained, we describe in the next subsections how
these groups were generated.
3.1 The EachMovie Dataset
To run our experiments, we used the EachMovie dataset. EachMovie was a recommender service that ran as part of a research project at the Compaq Systems
Research Center. During that period, 72,916 users gave 2,811,983 evaluations to
1,628 different movies. Users’ evaluations were registered using a 6-level numerical scale (0.0, 0.2, 0.4, 0.6, 0.8, 1.0). The dataset can be obtained from Compaq
Computer Corporation1. The Eachmovie dataset has been used in various experiments involving recommender systems.
We restricted our experiments to users that have evaluated at least 150
movies (2,551 users). This was adopted to allow an intersection (of evaluated
movies) of reasonable size between each pair of users, so that more credit can be
given to the comparisons related to the homogeneity degree of a group.
1 Available at the URL: http://www.research.compaq.com/SRC/eachmovie/
3.2 Data Preparation: The Creation of Groups
To conduct the experiments, groups of users with varying sizes and homogeneity
degrees were needed. The EachMovie dataset contains only individual evaluations, therefore
the groups had to be built first.
Four group sizes were defined: 3, 6, 12 and 24 individuals. We believe that
this range of sizes includes the majority of scenarios where recommendation for
groups can be used. For the degree of homogeneity factor, 3 levels were used:
high, medium and low homogeneity. The groups do not need to be a partition of
the set of users, i.e. the same user can be in more than one different group. The
next subsections describe the methodology used to build the groups.
Obtaining a Dissimilarity Matrix. The first step in the group definition was
to build a dissimilarity matrix for the users, that is, a matrix of size n × n (where n
is the number of users) in which each entry contains the dissimilarity value between
a pair of users. To obtain this matrix, the dissimilarity of each user against all the
others was calculated.
The dissimilarities between users will be subsequently used to construct the
groups with the three desired homogeneity degrees. To obtain the dissimilarity
between two users a and b, we calculated the Pearson correlation coefficient r(a, b)
between them (which is in the interval [–1, 1]) and transformed this value into
a dissimilarity, so that the higher the correlation, the lower the dissimilarity. The Pearson
correlation coefficient is the most common measure of similarity between users
used in collaborative filtering algorithms (see e.g. [1]). To compute r(a, b) between
two users we consider only the items that both users have rated and use
the formula

r(a, b) = \frac{\sum_{i}(r_{a,i} - \bar{r}_a)(r_{b,i} - \bar{r}_b)}{\sqrt{\sum_{i}(r_{a,i} - \bar{r}_a)^2 \sum_{i}(r_{b,i} - \bar{r}_b)^2}}

where r_{a,i} is the rate that user a has given for item i and \bar{r}_a is the average rate over those items for user a (analogously for user b).
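A sketch of this computation over co-rated items follows; mapping the correlation r in [–1, 1] to a dissimilarity in [0, 1] via (1 – r)/2 is a natural choice but is stated here as an assumption, since the exact transformation did not survive in this copy.

# Sketch: Pearson correlation over co-rated items, turned into a dissimilarity.
import numpy as np

def dissimilarity(ratings_a, ratings_b):
    """ratings_*: dict movie_id -> rate in {0.0, 0.2, ..., 1.0}."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 1.0  # no meaningful basis for comparison
    a = np.array([ratings_a[m] for m in common])
    b = np.array([ratings_b[m] for m in common])
    a_c, b_c = a - a.mean(), b - b.mean()
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    r = (a_c * b_c).sum() / denom if denom > 0 else 0.0
    return (1.0 - r) / 2.0  # assumption: map [-1, 1] onto [0, 1]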
For our experiments, the movies were randomly separated into three sets: a
profile set with 50% of the movies, a training set with 25% and a test set with
25% of the movies. Only the users' evaluations which refer to elements of the
first set were used to obtain the dissimilarity matrix. The evaluations that refer
to movies of the other sets were not used at this stage. The rationale behind
this procedure is that the movies from the test set will be the ones used to
evaluate the behavior of the model (Section 3.3). That is, it will be assumed
that the members of the group did not know them previously. The movies from
the training set were used to adjust the model parameters.
Group Formation
High Homogeneity Groups. We wanted to obtain 100 groups with high homogeneity degree for each of the desired sizes. To this end, we first randomly generated 100 groups of 200 users each. Then the hierarchical clustering algorithm
divisive analysis (diana) was run for each of these 100 groups. To extract a high
homogeneity group of the desired size from each tree, we took the "lowest" branch with
at least that many elements. If the number of elements of this branch was larger than the group size,
we tested all combinations of that size and selected the one with the lowest total
dissimilarity (sum of all dissimilarities between the users). For groups of size
24, the number of combinations was too big. In this case we used a heuristic
method, selecting the users which have the lowest sum of dissimilarities in the
branch (sum of dissimilarities between the user in consideration and all others
in the branch).
Low Homogeneity Groups. To select a group of the desired size with low homogeneity
from one of the groups with 200 users, we first calculated for each user its sum of
dissimilarities (between this user and all the other 199). The selected elements
were the ones with the largest sums of dissimilarities.
Medium Homogeneity Groups. To select a group of the desired size with medium homogeneity degree, that many elements were randomly selected from the total population of
users. To avoid surprises due to randomness, after a group was generated, a test
comparing a single mean (that of the extracted group) to a specified value
(the mean of the population) was performed.
3.3 Experimental Methodology
For each of the 1200 generated groups (4 sizes × 3 homogeneities × 100 repetitions) recommendations for items from the test set were generated. We also
generated recommendations using two other strategies: a baseline model, inspired
by a “null model” used in group experiments in social psychology (e.g. [6]); and
an aggregation-based method using fuzzy majority we have previously proposed
in [8].
Null Model. The null model takes the opinion of one randomly chosen group
member as the group decision (random dictator). Taking this to the domain of
recommender systems, we randomly select one group member and make recommendations for this individual (using traditional neighbourhood-based collaborative filtering). These recommendations are taken as the group recommendations.
Aggregation-Based Method Using Fuzzy Majority. This method works
in two steps: first individual recommendations are generated for the members
of the group, and then the individual recommendations are aggregated to make
the group recommendation.
For the first step, a traditional neighborhood-based collaborative filtering
algorithm was used (see [8] for the details). For the second one, a classification
method of alternatives using fuzzy majority (introduced in [9]) was adopted.
The rationale for using a method based on fuzzy majority for the aggregation
of recommendations was that given the impossibility of having an ideal method
for the aggregation, one that offered some degree of “human meaning” was a
good choice. The kind of human meaning of the fuzzy majority aggregation is
provided by the use of fuzzy linguistic operators that model the human discourse
(like as many as possible, most and at least half). This could make it possible for
users to specify in what general terms they would like the aggregation, for example:
“show me the alternatives that are ‘better’ than most of the others according to
the recommendations for as many as possible persons in the group”.
The fuzzy majority procedure follows two phases to achieve the classification of alternatives: aggregation and exploitation. The aggregation phase defines
an outranking relation which indicates the global preference (in a fuzzy majority sense) between every pair of alternatives, taking into consideration different
points of view. Exploitation compares the alternatives, transforming the global
preference information into a global ranking, thus supplying a selection set of
alternatives. Each phase uses a fuzzy linguistic operator, resulting in a classification of alternatives with an interpretation like the one cited in the previous
paragraph (assuming that the operator as many as possible was used in the
aggregation phase and the operator most in the exploitation phase).
Evaluating the Strategies. To evaluate the behaviour of the strategies for
the various sizes and degrees of homogeneity of the groups, a metric is needed.
As we have a set of rankings as the input and a ranking as the output, a ranking correlation method was considered a good candidate. We used Kendall's
rank correlation coefficient τ with ties (see [10]). For each generated recommendation, we calculated τ between the final ranking generated for the group
and each of the users' individual rankings (obtained from the users' rates available in
the test set). Then we calculated the average τ for the recommendation. The average τ
has a good social characteristic: a ranking with the largest average τ is a Kemeny optimal
aggregation (it is not necessarily unique). Kemeny optimal aggregations are the
only ones that fulfill at the same time the principles of neutrality and consistency of the social choice literature and the extended Condorcet criterion [11],
which is: if a majority of the individuals prefer one alternative to another, then the former should
have a higher ranking than the latter in the aggregation. Kemeny optimal aggregations
are NP-hard to obtain when the number of rankings to aggregate is four or more [11].
Therefore, it is not feasible to implement a strategy that is optimal with regard to τ,
which makes τ a good reference for comparison.
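The metric itself can be sketched with SciPy's tie-aware implementation of Kendall's tau; the score-vector representation of the rankings below is assumed for illustration.

# Sketch: average Kendall tau between the group ranking and individual rankings.
import numpy as np
from scipy.stats import kendalltau

def average_tau(group_scores, member_scores_list):
    """All arguments score the same list of items; higher score = better ranked."""
    taus = [kendalltau(group_scores, m)[0] for m in member_scores_list]  # tie-aware
    return float(np.mean(taus))

group = [0.9, 0.4, 0.7, 0.1]
members = [[1.0, 0.2, 0.8, 0.0], [0.6, 0.6, 0.9, 0.3]]
print(average_tau(group, members))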
The goal of the experiment was to evaluate how τ is affected by the variation
in the size and homogeneity of the groups, as well as by the strategy used
(symbolic approach versus null model versus fuzzy aggregation-based approach).
To verify the influence of each factor, we did a three-way (as we have 3 factors)
analysis of variance (ANOVA). After verifying the significance of the factors, a
comparison of means for the levels of each factor was done. To this end we used
Tukey's Honest Significant Difference test at the 95% confidence level.
4 Results and Discussion
Figure 2 shows the observed τ values for the three approaches.
For low homogeneity groups, the symbolic approach outperformed the other two
by a large margin in groups of 3 and 6 people (in these configurations
Fig. 2. Observed τ by homogeneity degree for the null, symbolic and fuzzy approaches. Fuzzy results refer to the use of the linguistic quantifiers as many as possible
followed by most. Other combinations of quantifiers achieved similar results.
the null model was statistically equivalent to the fuzzy approach). This shows
that for highly heterogeneous groups, trying to aggregate individual preferences
is not a good approach. All results were statistically equivalent for groups of
12 people, and the fuzzy approach had a better result for groups of 24 people,
followed by the null model and the symbolic approach. It is not clear if the symbolic model is inadequate for larger heterogeneous groups, or if this result is due
to the biases present in the data used. Due to the process of group formation,
larger heterogeneous groups (even in the same homogeneity degree) are more
homogeneous than smaller groups, as it is much more difficult to find a large
strongly heterogeneous group than it is to find a smaller one. Experiments using synthetic data where the homogeneity degree was more carefully controlled
would be more useful for making these comparisons.
Under medium and high homogeneity levels, the null model shows that for
more homogeneous groups it may be a good alternative. Under medium homogeneity, it was statistically equivalent to the other two for groups of 3 people
and second-placed after the fuzzy approach for the other group sizes. Under high
homogeneity, the null model was statistically equivalent to the fuzzy approach
for all group sizes (indicating that taking the opinion of just one member of a
highly homogeneous group is good enough) and the symbolic approach lagged
behind (by a small margin) in these cases. This suggests that the symbolic
strategy should be improved to better accommodate these cases, as well that
aggregation-based approaches have a good performance for more homogeneous
groups.
Comparing the levels of the homogeneity factor, in all cases the averages of
the levels differed significantly. Moreover, we had: average tau under high homogeneity > avg. tau under medium homogeneity > avg. tau under low homogeneity, i.e. the compatibility degree between the group recommendation and
the individual preferences was proportional to the group’s homogeneity degree.
These facts were to be expected if the strategies were coherent.
For the group size, in many cases the differences between its levels were
not significant, indicating that the size of a group is less important than its
homogeneity degree for the performance of recommendations.
References
1. Herlocker, J., Konstan, J., Borchers, A., Riedl, J.: An algorithm framework for
performing collaborative filtering. In: Proc. of the 22nd ACM SIGIR Conference,
Berkeley (1999) 230–237
2. Hill, W., Stead, L., Rosenstein, M., Furnas, G.: Recommending and evaluating
choices in a virtual community of use. In: Proc. of the ACM CHI’95 Conference,
Denver (1995) 194–201
3. Lieberman, H., Van Dyke, N., Vivacqua, A.: Let’s browse: A collaborative web
browsing agent. In: Proc. of the IUI-99, L.A. (1999) 65–68
4. O’Connor, M., Cosley, D., Konstan, J., Riedl, J.: Polylens: A recommender system
for groups of users. In: Proc. of the 7th ECSCW conference, Bonn (2001) 199–218
5. Arrow, K.J.: Social Choice and Individual Values. Wiley, New York (1963)
6. Hinsz, V.: Group decision making with responses of a quantitative nature: The theory of social decision schemes for quantities. Organizational Behavior and Human Decision
Processes 80 (1999) 28–49
7. Bock, H., Diday, E.: Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Berlin Heidelberg (2000)
8. Queiroz, S., De Carvalho, F., Ramalho, G., Corruble, V.: Making recommendations
for groups using collaborative filtering and fuzzy majority. In: Proc. of the 16th
Brazilian Symposium on Artificial Intelligence (SBIA), LNAI 2507, Recife (2002)
248–258
9. Chiclana, F., Herrera, F., Herrera-Viedma, E., Poyatos, M.: A classification method
of alternatives for multiple preference ordering criteria based on fuzzy majority.
Journal of Fuzzy Mathematics 4 (1996) 801–813
10. Kendall, M.: Rank Correlation Methods. 4th edn. Charles Griffin & Company,
London (1975)
11. Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for
the web. In: Proc. of the WWW10 Conference, Hong Kong (2001) 613–622
Search-Based Class Discretization
for Hidden Markov Model for Regression
Kate Revoredo and Gerson Zaverucha
Programa de Engenharia de Sistemas e Computação (COPPE)
Universidade Federal do Rio de Janeiro
Caixa Postal 68511, 21945-970, Rio de Janeiro, RJ, Brasil
{kate,gerson}@cos.ufrj.br
Abstract. The regression-by-discretization approach allows the use of
a classification algorithm in a regression task. It works as a pre-processing
step in which the numeric target value is discretized into a set of intervals.
We have applied this approach to the Hidden Markov Model for Regression (HMMR), which was successfully compared to Naive Bayes for
Regression and two traditional forecasting methods, Box-Jenkins and
Winters. In this work, to further improve these results, we apply three
discretization methods to HMMR using ten time series data sets. The
experimental results showed that one of the discretization methods improved the results in most of the data sets, although each method improved the results in at least one data set. Therefore, it would be better
to have a search algorithm to automatically find the optimal number and
width of the intervals.
Keywords: Hidden Markov Models, regression-by-discretization, time-series forecasting, machine learning
1 Introduction
As discussed in [5], the effective handling of continuous variables is a central
problem in machine learning and pattern recognition. In statistics and pattern
recognition the typical approach is to use a parametric family of distributions,
which makes strong assumptions about the nature of the data; the induced model
can be a good approximation of the data, if these assumptions are warranted.
Machine learning, on the other hand, deals with continuous variables by discretizing them, which can lead to information loss. When the continuous variable is
the target this approach is known as regression-by-discretization [13, 4, 15, 14],
which allows the use of more comprehensible models.
Naive Bayes for Regression (NBR) [4] uses the regression-by-discretization
approach in order to apply the Naive Bayes Classifier (NBC) [3] to predict numerical
values. In [4], it was pointed out that NBR “...performed comparably to well
known methods for time series predictions and sometime even slightly better.”. In
[2], it was argued that although in the theory of supervised learning the training
examples are assumed to be independent and identically distributed (i.i.d.), this is
not the case in applications where a temporal dependence among the examples
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 317–325, 2004.
© Springer-Verlag Berlin Heidelberg 2004
exists. An example was given in [2] for a classification task comparing NBC
and the Hidden Markov Model (HMM). While NBC ignored temporal dependence,
HMM took it into account. Consequently, HMM performed better than NBC.
Similar results were found in [10] when Hidden Markov Model for Regression
(HMMR), using the regression-by-discretization approach, was applied to the
task of monthly electric load forecasting of real world data from Brazilian utilities
and successfully compared to NBR and two traditional forecasting methods,
Box-Jenkins [1] and Winters [7].
In this work, to further improve these results we apply to HMMR the three
alternative ways of transforming a set of continuous values into a set of intervals
described in [13] using ten time series data sets.
The paper is organized as follows. In section 2 HMM, HMMR and misclassification cost are reviewed and the methods used for discretizing the numeric
target value are described. In section 3 the experimental results are presented.
Finally, in section 4 our work is concluded.
2 Background Knowledge
Throughout this paper, we use capital letters, such as Y and Z, for random
variable names and lowercase letters, such as y and z, to denote specific values
assumed by them. Sets of variables are denoted by boldface capital letters such
as Y and Z, and assignments of values to the variables in these sets are denoted
by boldface lowercase letters such as y and z. The probability of a possible value y
of a random variable Y is denoted by P(y), and the probability distribution of a
random variable is denoted by p(.); this can be generalized for sets of variables.
2.1 Discretization Methods
A discretization method divides a set of numerical values into a set of intervals.
Three discretization methods are described as follows:
Equal width intervals (EW): the set of numerical values is divided into equal
width intervals.
Equal probable intervals (EP): the intervals are created so that each contains the same
number of elements. It can be said that this method focuses on class
frequencies and assumes that equal class frequencies are
best for a classification problem.
K-means clustering (KM): this method starts with the EW approximation
and then moves the elements of each interval to contiguous intervals if these
changes reduce the sum of the distances of each element of an interval to its
gravity center1. Each interval must have at least one element.
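A rough sketch of the three methods for a one-dimensional set of target values follows. The KM variant is a simplified reading of the description above (start from the EW partition, then reassign values to the interval with the closest gravity center, taken as the median); it is not any particular published implementation.

# Sketch of the three discretization methods (EW, EP, KM) for 1-D target values.
import numpy as np

def equal_width(values, k):
    return np.linspace(values.min(), values.max(), k + 1)  # interval edges

def equal_probable(values, k):
    return np.quantile(values, np.linspace(0, 1, k + 1))   # equal-frequency edges

def k_means_1d(values, k, iters=100):
    # start from the EW partition, then repeatedly move each value to the
    # interval whose gravity center (median) is closest; with sorted 1-D data
    # and sorted centers the resulting intervals stay contiguous
    v = np.sort(values)
    edges = equal_width(v, k)
    labels = np.clip(np.searchsorted(edges, v, side="right") - 1, 0, k - 1)
    for _ in range(iters):
        centers = np.array([np.median(v[labels == j]) if np.any(labels == j)
                            else edges[j] for j in range(k)])
        centers = np.sort(centers)
        new_labels = np.abs(v[:, None] - centers[None, :]).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return [v[labels == j] for j in range(k)]

vals = np.random.default_rng(0).normal(size=200)
print(equal_width(vals, 5))
print(equal_probable(vals, 5))
print([len(g) for g in k_means_1d(vals, 5)])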
Table 1 shows the intervals found for these three methods considering that
the best number of intervals is 5 when applied to the task of monthly electric
load forecasting using real world data from Brazilian utilities (Serie 1).
1 We used the median of the elements in each interval as the gravity center.
Fig. 1. First-order Dynamic Bayesian Network
2.2 Hidden Markov Model
For a classification task, as discussed in section 1, if the training examples have a
temporal dependence then HMM performs better than NBC. HMM is a particular Dynamic Bayesian Network (DBN) [6,8,9]. A DBN is a Bayesian Network
(BN) that represents a temporal probability model like the one seen in figure 1:
in each slice t, X_t is a set of hidden state variables (discrete or continuous) and
Y_t is a set of evidence variables (discrete or continuous). Two important inference
tasks in a DBN are filtering, which computes p(X_t | y_{1:t}), where p(.) is a probability
distribution of the random variables and y_{1:t} denotes y_1, ..., y_t, and smoothing,
which computes p(X_k | y_{1:t}) for k < t. Normally, it is assumed that the parameters
do not change, that is, the model is time-invariant (stationary): p(X_t | X_{t-1})
and p(Y_t | X_t) are the same for all t. In the HMM, each X_t is a single discrete
random variable.
For a classification task, if the training examples are fully observable, have
a temporal dependence such that each X_t is observed in the training data and
hidden (and hence predicted) in the test data, and the model structure is known,
we can use Maximum Likelihood (ML) estimation (we do not need to use EM
[6]) for learning.
In order to use this approach in the HMM, each example is given by a class
(representing X_t) and a conjunction of attributes (representing Y_t) (see figure 2).
Additionally, it is also assumed that the attributes are
conditionally independent given the class. The ML estimation for the HMM
computes the probabilities by counting the discrete values in the training examples:
for each class c and attribute value a we compute

P(c) = N_c / N and P(a | c) = N_{a,c} / N_c,
where N is the total number of training examples, N_c is the number of training
examples with the class c, and N_{a,c} is the number of training examples
with the attribute value a and the class c.
Fig. 2. Hidden Markov Model
Let y_{1:t} be the representation of y_1, ..., y_t. At any time t the HMM can
choose a class by argmax_c p(X_t = c | y_{1:t}), where (filtering, computing p(X_t | y_{1:t})):

if t = 1 then p(X_1 | y_1) = α p(y_1 | X_1) p(X_1);
if t > 1 then p(X_t | y_{1:t}) = α p(y_t | X_t) Σ_{x_{t-1}} p(X_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}),

where α is a normalization constant.
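The filtering recursion reconstructed above can be sketched as follows; the prior, transition and emission tables are toy values, and a single discrete attribute is assumed for simplicity.

# Sketch of HMM filtering (forward recursion) for discrete classes.
import numpy as np

def filter_step(belief, transition, emission, observation):
    """belief: p(X_{t-1} | y_{1:t-1}); returns p(X_t | y_{1:t})."""
    predicted = transition.T @ belief            # sum_x p(X_t | x) p(x | y_{1:t-1})
    unnorm = emission[:, observation] * predicted
    return unnorm / unnorm.sum()                 # alpha-normalization

# toy model with 3 classes and 2 observable attribute values (assumed numbers)
prior = np.array([0.5, 0.3, 0.2])
transition = np.array([[0.8, 0.1, 0.1],
                       [0.2, 0.6, 0.2],
                       [0.1, 0.3, 0.6]])         # rows: x_{t-1}, cols: x_t
emission = np.array([[0.9, 0.1],
                     [0.5, 0.5],
                     [0.2, 0.8]])                # rows: class, cols: observation

belief = prior * emission[:, 0]                  # t = 1 base case
belief /= belief.sum()
for obs in [1, 1, 0]:                            # subsequent observations
    belief = filter_step(belief, transition, emission, obs)
print(belief, belief.argmax())                   # class chosen at the last step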
2.3 Hidden Markov Model for Regression
Hidden Markov Model for Regression (HMMR) [10] uses the regression-by-discretization approach in order to apply HMM (see figure 2) to predict a numerical value given a conjunction of attributes, which
can also be numerical.
In this approach, for each target value there is a corresponding discrete value (pseudo-class) representing the interval that
contains the numerical value. In this way the HMM can be applied to the discretized data. The numerical value predicted by HMMR is the sum of the means
of the pseudo-classes output by the HMM, weighted according to
the pseudo-class probabilities assigned by the HMM:

ŷ_t = Σ_c p(X_t = c | y_{1:t}) μ_c,

where μ_c is the mean of the pseudo-class c.
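Given the filtered pseudo-class distribution, the regression step reduces to a weighted average of the pseudo-class means, as sketched below with toy values.

# Sketch: HMMR's numeric prediction from the filtered pseudo-class distribution.
import numpy as np

def hmmr_predict(posterior, class_means):
    """posterior: p(X_t = c | y_{1:t}); class_means: mean of each pseudo-class."""
    return float(np.dot(posterior, class_means))

posterior = np.array([0.1, 0.7, 0.2])          # toy filtered distribution
class_means = np.array([0.15, 0.45, 0.80])     # toy interval means
print(hmmr_predict(posterior, class_means))    # 0.1*0.15 + 0.7*0.45 + 0.2*0.80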
Figure 3 sketches HMMR's forecasting of a numerical value. First, the
discretization of a new input is done, producing a conjunction of discrete
attributes. Then this conjunction is combined with the prior distribution of the
pseudo-classes to produce a posterior distribution over pseudo-classes. Finally, the prediction of
Fig. 3. HMMR’s prediction of a numerical value
the numerical value is calculated as the weighted average of the means of the pseudo-classes, where the weights are the probabilities from the posterior distribution.
This posterior distribution will be the prior distribution for the next input.
2.4 Misclassification Costs
Decreasing the classification error does not necessarily decrease the regression
error [13]. In order to ensure that it does, the absolute difference between the pseudo-class that was output by NBC and the true pseudo-class should be minimized.
Towards this objective, [13] has shown the accuracy benefits of using misclassification costs. Considering m(w) as the median of the values that were discretized
into the interval w, the cost of classifying a pseudo-class v instance as pseudo-class w is defined by cost(v, w) = |m(v) − m(w)|.
Using this approach, the predicted numerical value by a classifier C is the median m(ŵ) of the pseudo-class ŵ that minimizes the expected misclassification cost under the class probabilities estimated by the classifier.
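Under this reading (cost equal to the absolute difference of interval medians, and prediction given by the pseudo-class of minimum expected cost), the cost-sensitive prediction can be sketched as:

# Sketch: cost-sensitive pseudo-class choice using |m(v) - m(w)| as the cost.
import numpy as np

def predict_with_costs(posterior, medians):
    """posterior: class probabilities; medians: m(.) of each pseudo-class."""
    medians = np.asarray(medians)
    # expected cost of predicting each pseudo-class w
    expected_cost = [np.dot(posterior, np.abs(medians - m_w)) for m_w in medians]
    w_best = int(np.argmin(expected_cost))
    return medians[w_best]

posterior = np.array([0.1, 0.7, 0.2])
medians = np.array([0.15, 0.45, 0.80])
print(predict_with_costs(posterior, medians))   # 0.45 for this toy posterior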
3 Experimental Results
For each discretization method HMMR is applied to ten time series data sets,
including two well-known benchmarks, the Wölfer sunspot number and the
Mackey-Glass chaotic time series, and two real world data of monthly electric
load forecasting from Brazilian utilities (Serie 1 and Serie 2). These series are
differentiated and then the values are rescaled linearly to between 0.1 and 0.9.
Using past measurements of these time series, a forecast model needs
to be constructed in order to predict the value immediately posterior, x(t+1).
For HMMR, the target value is set to x(t+1) and the attribute values are set to the
preceding measurements. A different version of HMMR, considering misclassification
costs with m(.) as the median of each of the pseudo-classes, is also used (HMMR_mc). To select the best model, forward validation [16] is applied, considering
as parameters the number of discretized regions and the number of
attributes considered.
Forward validation begins with n_0 training examples (n_0 is considered
a sufficient number of training examples) and uses the example immediately
after them as the validation set. In the next step, that example is included in the
training set and the following example becomes the validation set. This procedure continues until the size of the training set is equal to
N. The decision measure is defined as a weighted average of the losses
obtained on the successive validation examples, and the chosen model is the one that
minimizes this measure; the weights are defined taking into account the number of
parameters used for the model associated with each step.
The error metric used is MAPE (Mean Absolute Percentage Error):

MAPE = (100/N) Σ_{i=1}^{N} |x_i − x̂_i| / x_i,

where N in this case is the number of examples in the test set and x̂_i is the value predicted for x_i.
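The selection loop and the error metric can be sketched as follows. The loss and the weighting scheme used inside forward validation are not reproduced from the paper; a plain, unweighted absolute loss and a naive last-value predictor are used purely for illustration.

# Sketch of forward validation and MAPE (illustrative loss, no weighting).
import numpy as np

def mape(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 * np.mean(np.abs(actual - predicted) / actual)

def forward_validation(series, n0, fit, predict):
    """Grow the training window one example at a time and average the losses."""
    losses = []
    for t in range(n0, len(series)):
        model = fit(series[:t])                      # first t examples
        losses.append(abs(predict(model, series[:t]) - series[t]))
    return float(np.mean(losses))

series = np.sin(np.linspace(0, 10, 60)) + 1.5        # toy positive series
score = forward_validation(series, n0=30,
                           fit=lambda hist: None,    # trivial "model"
                           predict=lambda model, hist: hist[-1])
print(score)
print(mape([100, 200], [110, 190]))                  # 7.5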
For the two time series of monthly electric load forecasting (Serie 1 and
Serie 2) we consider 12 months in the test set and the measured load values of the
previous 10 years as the training set.
In the Wölfer sunspot time series, the values for the years 1770-1869 are used
as the training set and the years 1870-1889 as the test set.
The data set for the Mackey-Glass chaotic time series is a solution of the
Mackey-Glass delay-differential equation

dx(t)/dt = a x(t − τ) / (1 + x(t − τ)^{10}) − b x(t),

for fixed parameters a and b, delay τ, initial conditions and sampling rate.
This series is obtained by integrating the equation
with the 4th-order Runge-Kutta method at a step size of 1, and then downsampling by 6. The training set consists of the first 500 samples
and the test set of the next 100 samples.
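The generation of this benchmark can be sketched as below; the parameter values (a = 0.2, b = 0.1, τ = 17) and the constant initial history are the commonly used ones and are assumptions here, as the exact settings did not survive in this copy. The delayed term is held fixed within each Runge-Kutta step, a simplification.

# Sketch: generate a Mackey-Glass series with a simplified RK4 step (the
# delayed term is held fixed within each step), then downsample by 6.
# Parameter values below are the commonly used ones, assumed here.
import numpy as np

def mackey_glass(n_steps, a=0.2, b=0.1, tau=17, x0=1.2):
    x = np.zeros(n_steps + tau + 1)
    x[:tau + 1] = x0                            # constant initial history
    f = lambda xt, xd: a * xd / (1.0 + xd ** 10) - b * xt
    for t in range(tau, n_steps + tau):
        xd = x[t - tau]                         # delayed value, fixed over the step
        k1 = f(x[t], xd)
        k2 = f(x[t] + 0.5 * k1, xd)
        k3 = f(x[t] + 0.5 * k2, xd)
        k4 = f(x[t] + k3, xd)
        x[t + 1] = x[t] + (k1 + 2 * k2 + 2 * k3 + k4) / 6.0   # step size 1
    return x[tau + 1:]

series = mackey_glass(6000)[::6]                # downsample by 6
print(series[:5], len(series))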
The other time series were mentioned in [17], except the last two, which were
used in a competition sponsored by the Santa Fe Institute (time series A [18])
and in the K.U. Leuven competition (time series Leuven [19]). For all these time
series the training set consists of the first 600 samples and the test
set of the next 200 samples.
Table 2 indicates the MAPE for the 3 discretization methods when applying
HMMR and HMMR_mc to the ten time series. The boldface numbers indicate
the discretization method that provides the lowest error and the italic numbers
indicate that the difference between each of them and the corresponding lowest
error is statistically significant (paired t-test at 95% confidence level). Table 3
shows the parameters chosen.
4 Conclusion and Future Work
To further improve the successful results already obtained with the Hidden
Markov Model for Regression [10] we applied the three discretization methods
described in [13] to it and to a version of HMMR considering misclassification
costs using ten time series data sets. A summary of the wins and losses of the
three methods can be seen in table 4.
The experimental results (see table 2) showed that the KM discretization
method improved the results in most of the data sets considered, confirming our
expectation that better results can be found when a better discretization method
is used.
Since each discretization method improved the results in at least one data set,
if time allows, it is better to have a search-based system to automatically find
the optimal number and width of the intervals.
As future work, we intend to extend these experiments to the Fuzzy Bayes and
Fuzzy Markov Predictors [11], since they used the EW discretization method.
Furthermore, HMMR will be applied to multi-step forecasting [12].
Acknowledgments
The authors would like to thank João Gama and Luis Torgo for giving us the
Recla code, Marcelo Andrade Teixeira for useful discussions and Ana Luisa de
Cerqueira Leite Duboc for her help in the implementation. We are all partially
financially supported by the Brazilian Research Council CNPq.
References
1. Box G.E.P. , Jenkins G.M. and Reinsel G.C.. Time Series Analysis: Forecasting &
Control. Prentice Hall, 1994.
2. Dietterich T.G.. The Divide-and-Conquer Manifesto. Proceedings of the Eleventh
International Conference on Algorithmic Learning Theory. pp. 13-26, 2000.
3. Domingos P. and Pazzani M.. On the Optimality of the Simple Bayesian Classifier
under Zero-One Loss. Machine Learning Vol.29(2/3), pp.103-130, November 1997.
4. Frank E., Trigg L., Holmes G. and Witten I.H.. Naive Bayes for Regression. Machine Learning. Vol.41, No.1, pp.5-25, 1999.
5. Friedman N., Goldszmidt M. and Lee T.J.. Bayesian network classification with
continuous attributes: Getting the best of both discretization and parametric fitting. In 15th Inter. Conf. on Machine Learning (ICML), pp.179-187, 1998.
6. Ghahramani Z.. Learning Dynamic Bayesian Networks. In C.L.Giles and M.Gori
(eds.). Adaptive Processing of Sequences and Data Structures, Lecture Notes in
Artificial Intelligence. pp.168-197, Berlin, Springer-Verlag, 1998.
7. Montgomery D.C., Johnson L.A. and Gardiner J.S.. Forecasting and Time Series
Analysis. McGraw-Hill Companies, 1990.
8. Roweis S. and Ghahramani Z.. A Unifying Review of Linear Gaussian Models.
Neural Computation Vol.11, No.2, pp.305-345, 1999.
9. Russell S. and Norvig P.. Artificial Intelligence: A Modern Approach, Prentice Hall,
2nd edition, 2002.
10. Teixeira M.A. and Revoredo K. and Zaverucha G.. Hidden Markov Model for
Regression in Electric Load Forecasting. In Proceedings of the ICANN/ICONIP 2003, Turkey, v.1, pp.374-377.
11. Teixeira M.A. and Zaverucha G.. Fuzzy Bayes and Fuzzy Markov Predictors. Journal of Intelligent and Fuzzy Systems, Amsterdam, The Netherlands, V.13, n.2-4, pp. 155-165, 2003.
12. Teixeira M.A. and Zaverucha G.. Fuzzy Markov Predictor in Multi-Step Electric Load Forecasting. In the Proceedings of the IEEE/INNS International Joint Conference on Neural Networks (IJCNN'2003), Portland, Oregon, v.1, pp.3065-3070.
13. Torgo L., Gama J.. Regression Using Classification Algorithms. Intelligent Data
Analysis. Vol.1, pp. 275-292, 1997.
14. Weiss S. and Indurkhya N.. Rule-based Regression. In Proceedings of the 13th International Joint Conference on Artificial Intelligence. pp. 1072-1078. 1993.
15. Weiss S. and Indurkhya N.. Rule-based Machine Learning Methods for Functional Prediction. Journal of Artificial Intelligence Research (JAIR). Vol. 3, pp. 383-403. 1995.
16. Urban Hjorth J.S.. Computer Intensive Statistical Methods: Validation, Model Selection and Bootstrap. Chapman & Hall. 1994.
17. Keogh E. and Kasetty S.. On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. Data Mining and Knowledge Discovery 7, 349-371, 2003.
18. http://www-psych.stanford.edu/%7Eandreas/Time-Series/SantaFe
19. ftp://ftp.esat.kuleuven.ac.be/pub/sista/suykens/workshop/datacomp.dat
SKDQL: A Structured Language
to Specify Knowledge Discovery Processes and Queries
Marcelino Pereira dos Santos Silva1 and Jacques Robin2
1
Universidade do Estado do Rio Grande do Norte
BR 110, Km 48, 59610-090, Mossoró, RN, Brasil
[email protected]
2
Universidade Federal de Pernambuco, Centro de Informática, 50670-901, Recife, PE, Brasil
[email protected]
Abstract. Tools and techniques for the automatic and intelligent analysis of the huge
data repositories of industries, governments, corporations and scientific institutes are the subject of the field of Knowledge Discovery in Databases
(KDD). In the context of MATRIKS, a framework for KDD, SKDQL (Structured
Knowledge Discovery Query Language) is a proposal for a structured language for specifying KDD processes, following SQL patterns within an open and extensible architecture, supporting the heterogeneity, interactivity and incremental nature of the
KDD process, with resources for accessing, cleaning, transforming, deriving
and mining data, as well as for knowledge manipulation.
1 Introduction
The high availability of huge databases, and the pressing need to transform
such data into information and knowledge, have demanded valuable efforts from the
scientific community and the software industry. The tools and techniques used for the intelligent
analysis of large repositories are the subject of Knowledge Discovery in Databases (KDD). However, the KDD process poses challenges related to the specification
of queries and processes, since several tools are often used to extract knowledge. This is
generally a problem because the complexity of the process itself is augmented by the
heterogeneity of the tools employed.
An approach to face the problems that arise in such context must provide resources
to specify queries and processes, avoiding common bottlenecks and respecting KDD
requirements. This article presents as contribution SKDQL (Structured Knowledge
Discovery Query Language) [12], which contains specific clauses for KDD tasks. In
order to use the language and validate the concepts of this work, part of the SKDQL
specification was implemented. This prototype of SKDQL was effectively tested on
the log database of the RoboCup domain. Such domain contains the real problems
that arise in KDD tasks, once its logs offer a wide and detailed data repository about
the teams behavior.
The paper is organized as follows: the next section presents the KDD process and its bottlenecks; the third section presents a case study; the fourth section describes the SKDQL specification; in the fifth part a prototype of the language is presented; the following section discusses related work, and the paper finishes with conclusions.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 326–335, 2004.
© Springer-Verlag Berlin Heidelberg 2004
2 Knowledge Discovery in Databases
Knowledge Discovery in Databases is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data, aiming to improve the understanding of a problem or of a decision-making procedure. The KDD process is interactive, iterative, cognitive and exploratory, involving many steps (Figure 1) and many decisions taken by the analyst, according to the following description [3]:
1. Definition of the kind of knowledge to be discovered, which demands a good comprehension of the domain and of the kind of decision such knowledge can improve.
2. Selection – in order to create a target dataset where discovery will be performed.
3. Preprocessing – including noise removal, handling of null/absent data fields, and data formatting.
4. Transformation – data reduction and projection, aiming to find useful features to represent the data and to reduce the variables or instances considered in the process.
5. Data Mining – selection of the methods that will be used to find patterns in the data, followed by the effective search for patterns of interest, in a particular representation or set of representations.
6. Interpretation/Evaluation of the mined patterns, with possible returns to steps 2–6.
7. Deployment of the discovered knowledge, incorporating it into the system or reporting it to interested parties.
Fig. 1. KDD steps [3].
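To make the flow of these steps concrete, the following is a deliberately tiny, self-contained Python sketch that walks a toy in-memory table through selection, preprocessing, transformation, mining and evaluation. It is purely illustrative: the data, the threshold "classifier" and all names are made up for this example and are not part of SKDQL, MATRIKS or any tool discussed in this paper.

```python
# A tiny, self-contained illustration of the KDD steps above on a toy table.
# Every value and rule here is made up for the example.

def kdd_toy_example():
    # Step 2 (selection): a toy target dataset of (feature, label) rows.
    rows = [(1.0, "a"), (2.0, "a"), (None, "b"), (3.5, "b"), (4.0, "b")]
    # Step 3 (preprocessing): drop rows with missing values.
    clean = [(x, y) for x, y in rows if x is not None]
    # Step 4 (transformation): rescale the numeric feature to [0, 1].
    lo, hi = min(x for x, _ in clean), max(x for x, _ in clean)
    scaled = [((x - lo) / (hi - lo), y) for x, y in clean]
    # Step 5 (data mining): "mine" a threshold rule separating the two classes.
    threshold = sum(x for x, _ in scaled) / len(scaled)

    def predict(x):
        return "a" if x < threshold else "b"

    # Step 6 (interpretation/evaluation): accuracy of the mined rule on the data.
    accuracy = sum(predict(x) == y for x, y in scaled) / len(scaled)
    # Step 7 (deployment): return the rule and its evaluation for reporting.
    return threshold, accuracy

print(kdd_toy_example())
```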
2.1 Bottlenecks in KDD Process
In KDD systems, bottlenecks are generally characterized by the absence of:
Support for heterogeneous platforms: wrappers to integrate legacy systems, implemented in a platform-independent language; the lack of this resource hinders the reuse of components.
Efficiency and performance: basic requirements, since KDD deals with huge amounts of data for pattern extraction.
Modularity and integration: KDD systems must be modular in their components, in order to facilitate the addition, removal or update of resources.
Thus, an interesting feature in KDD systems is interactive and ad hoc support for data mining tasks, providing flexibility and efficiency in knowledge discovery through an open and extensible environment. An intuitive and declarative query and process definition language moves in this direction, and that is the SKDQL proposal.
3 Case Study
The Robot World Cup Soccer (RoboCup) [10] is an international initiative to stimulate research in artificial intelligence, robotics and multi-agent systems. The environment models a hypothetical robot system, combining features of different systems and simulating human soccer players. This simulator, acting as a server, provides a domain and supports users who want to construct their own agents/players.
In order to obtain relevant knowledge related to behavior (play, attitude and peculiarities of the players and the teams), logs were extracted from RoboCup games through Soccer Monitor, a software tool that uses binary logs to replay the matches in the simulated environment, allowing the context of the players and their movements to be visualized. This software also converts binary logs into ASCII code. Processed logs of the Soccer Server gave rise to two important behavior data tables (Figure 2).
Primitive flat table – constituted by minimal-granularity statistics: information about each player's action and position at each cycle of the simulator.
Derived flat table – constituted by higher-granularity statistics, describing different actions (pass, goal, kickoff, offside, and so on) and relevant data about them (the moment each action started and finished, the players involved, relative positions, and so on).
Fig. 2. RoboCup data model.
Among the experiments performed, it was verified that in classification tasks the Id3 and J48 algorithms mutually confirm their results, revealing a tendency of the game to concentrate continuous activity in specific areas of the field. It was also observed that filtering by attribute relevance improves the quality of the information, avoiding mistakes related, for example, to the area hierarchy. It was further verified that in many cases no immediately comprehensible pattern is generated, which indicates that the data, its format or the mining algorithm must be modified in the KDD task. Further results and experiments may be found in [13].
This RoboCup case study provided relevant experience in the use of different tools and paradigms for data mining, and highlighted the need for open and integrated KDD environments with languages that expose a set of integrated resources and functionalities.
4 SKDQL Specification
The MATRIKS project (Multidimensional Analysis and Textual Summarizing for Insight Knowledge Search) [2, 4, 6] aims at the creation of an open and integrated environment for decision support and KDD. This project intends to fill gaps in KDD environments related to tool integration, knowledge management of the mined models, languages for query/process specification, and the variety of input data, models and mining algorithms.
In the MATRIKS environment, a set of resources will be accessed through a declarative language for the specification of KDD queries and processes which, in a transparent way, will provide all the tools in an integrated manner, exploiting the open, multi-platform and distributed nature of this KDSE (Knowledge Discovery Support Environment) proposal. Based on the natural flow of data and result manipulation, SKDQL (Structured Knowledge Discovery Query Language) is the language proposed for KDD, with clauses that access the resources in MATRIKS in an integrated and transparent way.
Knowledge discovery in databases demands tasks for data manipulation and analysis. Each task includes sequences of steps that perform selection, cleaning, transformation and mining, as well as presentation and storage of results and knowledge. Considering the manipulation of data and results, SKDQL has four kinds of clauses:
Resources to access, load and store data during the knowledge discovery process.
Clauses to preprocess the selected data, including cleaning, transformation, deduction and enrichment of these data.
Commands to visualize, store and present the knowledge.
Data mining algorithms for classification, association and clustering.
4.1 SKDQL High Level Grammar
The initial symbol of the language is the non-terminal <SKDQLtask>, which recursively defines a task as a sequence of data treatment steps (<SKDQLstep>), where <Conj> is the terminal "and" or "then", depending on the semantics that must be represented between the proposed tasks (serial or parallel).
<SKDQLstep> is defined as a step of data preparation (Prepare), followed by an optional activity of prior knowledge (PriorKnowledge). A data mining step follows it, with a subsequent result presentation (Present) for interpretation and evaluation.
<Present> allows the visualization of previously stored information through the clause <Display>. The junction of different files of this type can be performed with <JoinDisplay>.
4.1.1 Clause for Data Access and Storage Task Specification
The clause <Prepare> has two sub-clauses to be considered, <Pick> and <Preprocess>. The initial step in a KDD task is the specification of the dataset to be explored during the whole process, which demands the indication of the data source (server, database, and so on). <Pick> can also perform a specific data selection (with sampling, for example).
4.1.2 Clause for Data Preprocessing Task Specification
In the clause <Prepare>, right after <Pick>, follows <Preprocess>, which deals with cleaning, transformation, derivation, randomization and data recovery; its clauses can refer to an attribute by its position in the dataset (e.g., "4").
4.1.3 Clause for Knowledge Presentation Task Specification
Prior knowledge of a domain may indicate approaches and solutions that effectively improve the KDD process in terms of quality and speed, and it can completely change the approach chosen for a dataset. Therefore, SKDQL has clauses to specify prior knowledge and to present the knowledge discovered in the task.
<PriorKnowledge> has resources to access a database, to inspect a dataset sample, and to define in advance the layout of association rules. <AssociationPriorKnowledge> allows the definition of a meta-rule that will determine the layout of the association rules to be created. <PriorKnowledge> uses the structure previously described to connect to a database and query it. Moreover, it is also possible to visualize a previously stored dataset sample using the clause <ViewSampleOfDataset>, informing the percentage of the sample to be visualized.
4.1.4 Clause for Data Mining Task Specification
SKDQL has resources to mine many kinds of knowledge (or models) through different methods and algorithms, using validation, testing and other options. The relevant-attribute, classification, association-rule and clustering mining tasks are
present in this specification [17], grouped under the <Mine> clause.
In a dataset, some attributes have a higher influence on data mining tasks. For example, to classify a loan client as good or bad, the client's income will certainly be much more relevant than his birthplace. The income and other attributes with high influence on the task are called relevant attributes, which can be mined through the <MineRelevantAttributes> clause.
Classification methods in data mining are used to determine and evaluate models that classify or predict an object or event. The syntax of the classification task is
specified in <MineClassification>.
The classification, as well as the attribute relevance task, must be performed with respect to an attribute called the class; in the prototype, the default class is the last attribute of the dataset.
Association rules are similar to classification rules, except that the former can predict any attribute, not only one that must be previously determined (the class), thus allowing different combinations of attributes to occur in the rules, through <MineAssociations>.
Cluster mining (<MineClusters>) presents, according to given criteria, a diagram that reveals how the instances of a dataset are distributed into groups/clusters.
The entire specification of the language is available at [13]. As an example of the usage of SKDQL, consider a task that connects to the RoboCup database through a SQL query and selects a dataset; right after, a 50% sampling of this dataset is performed, followed by the application of the Naïve Bayes classification algorithm. This task aims to acquire general knowledge about the context of the dataset using a basic classifier.
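For illustration only, the sketch below performs the same kind of task in plain Python with scikit-learn: sample half of a dataset and fit a Naïve Bayes classifier to get a first, rough picture of it. The feature matrix X and class vector y are random stand-ins for the dataset that would be selected from the RoboCup database; nothing here reflects SKDQL syntax or the prototype's Java/WEKA implementation.

```python
# Rough Python analogue of the task described above, assuming the dataset
# selected from the RoboCup database is already available as a numeric
# feature matrix X and a class vector y (both hypothetical stand-ins here).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # stand-in for the selected dataset
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in class attribute

# 50% sampling of the selected dataset.
X_sample, _, y_sample, _ = train_test_split(X, y, train_size=0.5, random_state=1)

# Naive Bayes as a quick baseline classifier over the sample.
model = GaussianNB().fit(X_sample, y_sample)
print("training accuracy on the sample:", model.score(X_sample, y_sample))
```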
5 SKDQL Implementation
SKDQL was implemented through a prototype that covers the main functionalities of the language. In this implementation, Java [14] was chosen because MATRIKS already adopts Java as its standard development platform. Moreover, the developed interfaces support heterogeneous platforms, which are very common in KDD. In this way, the distribution, extensibility, interoperability and modularity adopted for MATRIKS are supported via Java. Different software, components and APIs were used for the implementation of the SKDQL functionalities (Figure 3 and Table 1).
Fig. 3. System Architecture.
To access relational databases, JDBC [15] was used, an API that supports connections to tables and datasets from Java programs. In this implementation, the DBMS (Database Management System) Microsoft SQL Server [8] was used.
For the preprocessing and data mining functionalities, WEKA [17] components were used: a collection of algorithms for data manipulation and data mining (filtering, normalization, classification, association, clustering, and others), written in Java and modularized in components called by SKDQL.
XSB Prolog [19] is a programming language and deductive database used in the derivation tasks of SKDQL, since WEKA has many resources for preprocessing and mining but none for deduction. It is accessed through JB2P [11], an API that allows Java programs to make calls to and receive results from XSB.
For the code generation of SKDQL, functionalities for access, preprocessing, data mining and data presentation were selected. Via JDBC, the SKDQL code generated with JavaCC [18] performs the relational data access, for which it requires the URL, database, login and password. The preprocessing and data mining tasks are performed via calls to Prolog and to the WEKA components, which are also implemented in Java.
6 Related Work
6.1 DMQL
At Simon Fraser University (Canada), DMQL (Data Mining Query Language) [5] was developed, with clauses for the relevant dataset, the kind of knowledge to be mined, the prior knowledge used in the KDD process, interest measures and limits to evaluate patterns, and the visualization representation of the discovered patterns.
However, the language does not have resources for data preprocessing, and its specification presents a limited and invariable set of mining algorithms. DMQL is not implemented as such; its single practical application is found in DBMiner [1], where it is used as a task description resource.
6.2 OLE DB for DM
OLE DB for Data Mining (OLE DB for DM) [7] is an extension of Microsoft OLE DB that supports data mining operations on OLE DB providers. As an OLE DB extension, it introduces a new virtual object called the Data Mining Model (DMM), as well as commands to manipulate this virtual object. A DMM is like a relational table, except that it contains special columns that can be used for pattern training and discovery, allowing, for example, the creation of a prediction model and the generation of predictions.
While a relational table stores data, a DMM stores the patterns discovered by the mining algorithm. The manipulation of a DMM is similar to the manipulation of a table. However, this approach is not adequate, since tables are not flexible enough to represent data mining models (for example, decision tables or Bayesian networks).
6.3 CWM
The Common Warehouse Metamodel (CWM) [9] is a recent standard defined by the Object Management Group (OMG) for data interchange in different environments: data warehousing, KDD and business analysis. CWM provides a common language to describe metadata (based on a generic but semantically complete metamodel) and facilitates data interchange and the specification of KDD classes and processes.
The scope of the CWM specification includes metamodel definitions for different domains, which gives CWM a high complexity and demands knowledge of its principles. Moreover, there is a lack of tutorials, documents and cases sufficient for a wide comprehension of the specification and of the techniques needed to use it in practical problems.
6.4 KDDML-MQL
KDDML-MQL [16] is an environment that supports the specification and execution of complex KDD processes in the form of high-level queries.
KDDML (KDD Markup Language) is XML-based, i.e. data, metadata, mining models and queries are all represented as XML documents. Query tags specify data acquisition, preprocessing, mining and post-processing algorithms taken from possibly distinct suites of tools. MQL (Mining Query Language) is an algebraic language in the style of SQL. The MQL system compiles an MQL query into KDDML queries.
7 Conclusions
The SKDQL proposal provides a specification language and a prototype for the application of different knowledge discovery tasks in databases, taking into account the features of the KDD process and overcoming most of the KDD bottlenecks and the limitations of alternative approaches. This work contributes on the following points:
Iteration – application and reapplication of resources and tools in the process.
Interaction – SKDQL allows the analyst to perform tasks in an interactive manner, requesting tasks and chaining operations onto previously reached results.
Systematization of KDD tasks – resources are available to users in a style very similar to SQL, with specific clauses for each task, freeing the "miner" from implementation details of the tools.
Support for heterogeneous resources – the SKDQL proposal supports access to different, widely used data models, increasing the user's autonomy regarding the manipulation of different databases in the same environment.
Integration – owing to the KDD requirements in the points above, the language integrates resources of different tools.
Considering the wide scope, evolution and dynamics of KDD, the extension of the language is a natural continuation of this work, since the present specification supports resources for a limited, though open, set of steps in a knowledge discovery process. Although sequences of tests have been performed, it is necessary to apply SKDQL to a wider set of tasks in order to improve its resources and validate its functionalities.
Acknowledgments
The support of UERN and CAPES is gratefully acknowledged. We also thank Alexandre Luiz, João Batista and Rodrigo Galvão for their valuable contributions.
References
1. Dbminer Technology Inc. DBMiner Enterprise 2.0. Available at DBMiner Technology site
(2000).URL: http://www.dbminer.com/
2. Favero, E. HYSSOP - Hypertext Summary System for Olap. Doctorate Thesis. UFPE, 2000.
3. Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P. From Data Mining to Knowledge Discovery: An Overview. Advances in KDD and Data Mining, AAAI, 1996.
4. Fidalgo, R. N. JODI: A Java API for OLAP Systems and OLE DB for OLAP Interoperability. Master Thesis. UFPE, 2000.
5. Han, J. et al. DMQL: A Data Mining Query Language for Relational Databases. Simon
Fraser University, 1996.
6. Lino, N. C. Q. DOODCI: An API for Multidimensional Databases and Deductive Systems
Integration. Master Thesis. UFPE, 2000.
7. Microsoft Corporation. OLE DB for Data Mining. Available at Microsoft Corporation site
(2000). URL: http://www.microsoft.com/data/oledb/dm.htm
8. Microsoft Corporation. Microsoft SQL Server. Available at Microsoft Corporation site
(2002). URL: http://www.microsoft.com/sql/default.asp
9. Poole, J. et al. Common Warehouse Metamodel–An Introduction to the Standard for Data
Warehouse Integration. OMG Press, 2002.
10. The RoboCup Federation. The Robot World Cup Initiative (RoboCup). Available at RoboCup site (2002). URL: http://www.robocup.org
11. Rocha, J.B. Java Bridge to Prolog – JB2P (2001). Available at Rocha site (2001). URL:
http://www.cin.ufpe.br/~jbrj/msc/courses/taias/jb2p
12. Silva, M. P. S.; Robin, J. R. SKDQL – A Declarative Language for Queries and Process
Specification for KDD and its Implementation (2002). Master Thesis. UFPE, 2002.
13. Silva, M. P. S.; Robin, J. R. SKDQL Grammar Specification. Available at Silva site (2002).
URL: http://www.dpi.inpe.br/~mpss/skdql
14. Sun Microsystems, Inc. Java Developer Connection: Documentation and Training. Available at Sun Microsystems site (2001). URL: http://developer.java.sun.com
15. Sun Microsystems Inc. Java Database Connection - JDBC. Available at Sun Microsystems
site (2002). URL: http://java.sun.com/products/jdbc
16. Turini, F. et al. KDD Markup Language (2003). Available at Universita’ di Pisa site
(2003). URL: http://kdd.di.unipi.it/kddml/
17. University of Waikato. Weka 3 – Machine Learning Software in Java. Available at University of Waikato site (2001). URL: http://www.cs.waikato.ac.nz/ml/weka
18. Webgain Inc. Java Compiler Compiler – JavaCC. Available at WebGain Inc. site (2002).
URL: http://www.webgain.com/products/java_cc/
19. The XSB Research Group. XSB Prolog. Available at XSB Research Group site (2001).
URL: http://xsb.sourceforge.net/
Symbolic Communication in Artificial Creatures:
An Experiment in Artificial Life
Angelo Loula, Ricardo Gudwin, and João Queiroz
Dept. Computer Engineering and Industrial Automation
School of Electrical and Computer Engineering, State University of Campinas, Brasil
{angelocl,gudwin,queirozj}@dca.fee.unicamp.br
Abstract. This is a project on Artificial Life where we simulate an
ecosystem that allows cooperative interaction between agents, including
intra-specific predator-warning communication in a virtual environment
of predatory events. We propose, based on Peircean semiotics and informed by neuroethological constraints, an experiment to simulate the
emergence of symbolic communication among artificial creatures. Here
we describe the simulation environment and the creatures' control architectures, and briefly present the results obtained.
Keywords: symbol, communication, artificial life, semiotics, C.S.Peirce
1 Introduction
According to the semiotics of C.S.Peirce, there are three fundamental kinds of
signs underlying meaning processes: icons, indexes and symbols (CP 2.275)1.
Icons are signs that stand for their objects through similarity or resemblance (CP 2.276, 2.247, 8.335, 5.73); indexes are signs that have a spatio-temporal physical correlation with their objects (CP 2.248, see 2.304); symbols are signs connected to their objects (O) through the mediation of an interpretant (I). For Peirce (CP 2.307), a symbol is “A
Sign which is constituted a sign merely or mainly by the fact that it is used and
understood as such, whether the habit is natural or conventional, and without
regard to the motives which originally governed its selection.”
Based on this framework, Queiroz and Ribeiro [2] performed a neurosemiotic
analysis of vervet monkeys’ intra-specific communication. These primates use
vocal signs for intra-specific social interactions, as well as for general alarm
purposes regarding imminent predation on the group [3]. They vocalize basically
three predator-specific alarm calls which produce specific escape responses: alarm
calls for terrestrial predators (such as leopards) are followed by an escape to the
top of trees, alarm calls for aerial raptors (such as eagles) cause vervets to hide
under bushes, and alarm calls for ground predators (such as snakes) elicit careful
scrutiny of the surrounding terrain. Queiroz and Ribeiro [2] identified the different
signs and the possible neuroanatomical substrates involved. Icons correspond to
neural responses to the physical properties of the visual image of the predator
1 The work of C.S. Peirce [1] is quoted as CP, followed by the volume and paragraph.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 336–345, 2004.
© Springer-Verlag Berlin Heidelberg 2004
and the alarm-call, and exist within two independent primary representational
domains (visual and auditory). Indexes occur in the absence of a previously
established relationship between call and predator, when the call simply arouses
the receiver’s attention to any concomitant event of interest, generating a sensory
scan response. If the alarm-call operates in a sign-specific way in the absence
of an external referent, then it is a symbol of a specific predator class. This
symbolic relationship implies the association of at least two representations of a
lower order in a higher-order representation domain.
2 Simulating Artificial Semiotic Creatures
The framework above guided our experiments on simulating the emergence of symbolic alarm calls2. The environment is two-dimensional, with approximately 1000 by 1300 positions. The creatures are autonomous agents, divided into preys and predators. There are objects such as trees (climbable objects) and bushes (used to hide), and three types of predators: terrestrial, aerial and ground predators. Predators differ in their visual limitations: terrestrial predators cannot see preys up in trees, aerial predators cannot see preys under bushes, and ground predators have no such limitations. The preys can be teachers, which vocalize pre-defined alarms for predators, or learners, which try to learn these associations. There is also the self-organizer prey, which is a teacher and a learner at the same time, able to create, vocalize and learn alarms simultaneously.
The sensory apparatus of the preys includes hearing and vision; predators have only a visual sensor. The sensors have parameters that define sensory areas in the environment, used to determine the stimuli the creatures receive. Vision has a range, a direction and an aperture defining a circular section, and hearing has just a range defining a circular area. These parameters are fixed, with the exception of the visual direction, which is changed by the creature, and the visual range, which is increased during scanning. Each received stimulus corresponds to a number that identifies the creature or object sensed, associated with its direction and distance from the receiver.
The creatures have interactive abilities in the form of high-level motor actions: adjust the visual sensor, move, attack, climb a tree, hide under a bush, and vocalize. The last three actions are specific to preys, while attacks are only performed by predators. The creatures can perform actions concurrently, except for the displacement actions (move, attack, climb and hide), which are mutually exclusive. The move action changes the creature's position in the environment and takes two parameters: a velocity (in positions/iteration, limited to a maximum velocity) and a direction (0-360 degrees). The visual sensor adjustment modifies the direction of the visual sensor (and, during scanning, doubles the range) and takes one parameter, the new direction (0-360 degrees). The attack action has one parameter that indicates the creature to be attacked, which must be within the action range.
2 The simulator is called Symbolic Creatures Simulation. For more technical details,
see http://www.dca.fee.unicamp.br/projects/artcog/symbcreatures
If successful, the attack increases an internal variable, the number of attacks suffered, of the attacked creature. The climb action takes as a parameter the tree to be climbed, which must be within the action range. When up in a tree, an internal variable called 'climbed' is set to true; when the creature moves, it is set to false and the creature goes down the tree. Analogously, the hide action has as a parameter the bush to be used to hide, and it uses an internal variable called 'hidden'. The vocalize action has one parameter, the alarm to be emitted, a number between 0 and 99; it creates a new element in the environment that lasts just one iteration and is perceptible by creatures with hearing sensors.
To control their actions after receiving the sensory input, the creatures have a behavior-based architecture [4] dedicated to action selection [5]. Our control mechanism is composed of various behaviors and drives. Behaviors are independent and parallel modules that are activated at different moments depending on the sensory input and the creature's internal state. At each iteration, behaviors provide their motivation value (between 0 and 1), and the one with the highest value is activated and provides the creature's actions at that instant. Drives define basic needs, or 'instincts', such as 'fear' or 'hunger'; they are represented by numeric values between 0 and 1, updated based on the sensory input or on the passage of time. This mechanism is not learned by the creature, but rather designed, providing basic responses to general situations.
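A minimal sketch of this action-selection scheme is shown below: each behavior maps the current percept and the drives to a motivation in [0, 1], and the behavior with the highest motivation wins. The behavior set, the drive values and the motivation functions are illustrative assumptions; only the constant wandering motivation of 0.4 is taken from the predator description that follows.

```python
# Sketch of behavior-based action selection: highest motivation wins.
# Names, drive values and motivation functions (except the constant 0.4
# for wandering) are illustrative assumptions, not the paper's expressions.

def select_behavior(behaviors, percept, drives):
    motivations = {name: f(percept, drives) for name, f in behaviors.items()}
    return max(motivations, key=motivations.get)

predator_behaviors = {
    "wander": lambda p, d: 0.4,                                    # constant motivation
    "rest":   lambda p, d: d["tiredness"],                         # assumed drive link
    "chase":  lambda p, d: d["hunger"] if p.get("prey_seen") else 0.0,
}
drives = {"hunger": 0.7, "tiredness": 0.2}
print(select_behavior(predator_behaviors, {"prey_seen": True}, drives))  # -> chase
```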
Predators’ Cognitive Architecture
The predators have a simple control architecture with basic behaviors and drives. The drives are hunger and tiredness, and the behaviors are wandering, resting and prey chasing. The drives are updated at every iteration by simple expressions involving, among other things, the creature's velocity at the current instant (t).
The wandering behavior has a constant motivation value of 0.4 and makes the creature move in a random direction at a random velocity, directing its vision toward the movement direction. The resting behavior makes the creature stop moving; its motivation grows with the tiredness drive.
The chasing behavior makes the predator move towards the prey if it is out of attack range, or attack it otherwise; its motivation is driven by the hunger drive when a prey is seen.
Preys’ Cognitive Architecture
Preys have two sets of behaviors: communication-related behaviors and general behaviors. The communication-related behaviors are vocalizing, scanning, associative learning and following; the general ones are wandering, resting and fleeing. Associated with these behaviors there are different drives: boredom, tiredness, solitude, fear and curiosity. The learner and the teacher do not have the same architecture: only teachers have the vocalize behavior, and only learners have the associative learning behavior, the scanning behavior and the curiosity drive (figure 1). On the other hand, the self-organizer prey has all behaviors and drives.
The prey’s drives are specified by the expressions
The tiredness drive is computed by the same expression used by predators.
The vocalize behavior and the associative learning behavior can run in parallel with all other behaviors, so they do not undergo behavior selection. The vocalize behavior makes the prey emit an alarm when a predator is seen. The teacher has a fixed alarm set, using alarm number 1 for the terrestrial predator, 2 for the aerial predator and 3 for the ground predator. The self-organizer uses the alarm with the highest association value in the associative memory (next section), or randomly chooses an alarm from 0 to 99 and places it in the associative memory if none is known. (The associative learning behavior is described in the next section.)
Fig. 1. Preys’ cognitive architecture: (a) learners have scanning and associative learning capabilities and (b) teachers have vocalizing capability. The self-organizer prey is
a teacher and a learner at the same time and has all these behaviors.
The scanning behavior makes the prey turn towards the direction of the alarm emitter and move in that direction if an alarm is heard; turn to the same vision direction as the emitter, while still moving towards it, if the emitter is seen; or keep the same vision and movement direction if the alarm is no longer heard. Its motivation is driven by the curiosity drive, raised when an alarm is heard or while the emitter remains in sight. This behavior also doubles the vision range, simulating a wide sensory scanning process.
To keep preys near each other and not spread out in the environment, the following behavior makes the prey keep itself between a maximum and a minimum distance from another prey, by moving towards or away from it. This was inspired by experiments on the simulation of flocks, schools and herds. The motivation for following is driven by the solitude drive when another prey is seen.
The fleeing behavior has its motivation driven by the fear drive. It makes the prey move away from the predator at maximum velocity or, in some situations, perform specific actions depending on the type of predator. If a terrestrial predator is or was just seen and there is a tree not near the predator (the difference between the predator direction and the tree direction is more than 60 degrees), the prey moves toward the tree and climbs it. If it is an aerial predator and there is a bush not near it, the prey moves toward the bush and hides under it. If the predator is not seen anymore, and the prey is not up in a tree or under a bush, it keeps moving in the same direction as before, slightly changing its direction at random.
The wandering behavior makes the prey move in a random direction at a random velocity, slightly changing them at random. The vision direction is alternately turned left, forward and right. Its motivation is driven by the boredom drive when the prey is not moving, and is zero otherwise. The resting behavior makes the prey stop moving, with a motivation computed as for the predators.
Associative Learning
The associative learning allows the prey to generalize spatio-temporal relations between external stimuli from particular instances. The mechanism is inspired by the neuroethological and semiotic constraints described previously, implementing
Fig. 2. (a) Associative learning architecture. (b) Association adjustment rules.
a lower-order sensory domain through work memories and a higher-order multimodal domain through an associative memory (figure 2a).
The work memories are temporary repositories of stimuli: when a sensory stimulus is received from either sensor (auditory or visual), it is placed in the respective work memory with maximum strength; at every subsequent iteration its strength is lowered, and when it reaches zero the stimulus is removed. The strength of a stimulus in the work memory (WM) thus decays over time according to a fixed expression.
The items in the work memory are used by the associative memory to produce and update associations between stimuli, following basic Hebbian learning (figure 2b). When an item is received in the visual WM and another in the auditory WM, an association between them is created or reinforced in the associative memory, and further changes to its associative strength are inhibited. Inhibition avoids multiple adjustments to the same association caused by persisting items in the work memory. When an item is dropped from the work memory, its associations that are not inhibited, i.e. not already reinforced, are weakened, and its inhibited associations have their inhibition partially removed. When both items of an inhibited association have been removed, the association ends its inhibition, becoming subject again to changes in its strength. The reinforcement and weakening adjustments for non-inhibited associations, with strengths limited to the interval [0.0, 1.0], are done as follows:
reinforcement, given a visual stimulus and a hearing stimulus present in the work memories;
weakening, for every association related to a dropped visual stimulus;
weakening, for every association related to a dropped hearing stimulus.
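The sketch below captures the mechanism just described: work memories that decay, Hebbian-style reinforcement of co-occurring visual/auditory items, inhibition released in two stages as the items leave the work memories, and weakening of the non-inhibited associations of dropped items. The decay, reinforcement and weakening constants are assumptions for illustration; the paper's actual adjustment expressions are not reproduced here.

```python
# Sketch of the associative learning mechanism. DECAY, REINFORCE and WEAKEN
# are assumed constants, not the paper's actual values or expressions.
DECAY, REINFORCE, WEAKEN = 0.1, 0.2, 0.05

class AssociativeLearner:
    def __init__(self):
        self.visual_wm = {}    # stimulus -> strength in [0, 1]
        self.auditory_wm = {}
        self.assoc = {}        # (visual, auditory) -> [strength, inhibition counter]

    def perceive(self, visual=None, auditory=None):
        # New stimuli enter their work memory with maximum strength.
        if visual is not None:
            self.visual_wm[visual] = 1.0
        if auditory is not None:
            self.auditory_wm[auditory] = 1.0
        # Co-occurring visual/auditory items create or reinforce an association,
        # which is then inhibited until both items leave the work memories.
        for v in self.visual_wm:
            for a in self.auditory_wm:
                strength, inhibition = self.assoc.get((v, a), [0.0, 0])
                if inhibition == 0:
                    self.assoc[(v, a)] = [min(1.0, strength + REINFORCE), 2]

    def tick(self):
        # Decay work-memory strengths; when an item is dropped, partially release
        # the inhibition of its inhibited associations and weaken the others.
        for wm, side in ((self.visual_wm, 0), (self.auditory_wm, 1)):
            for item in list(wm):
                wm[item] -= DECAY
                if wm[item] <= 0.0:
                    del wm[item]
                    for key, entry in self.assoc.items():
                        if key[side] == item:
                            if entry[1] > 0:
                                entry[1] -= 1          # partial inhibition release
                            else:
                                entry[0] = max(0.0, entry[0] - WEAKEN)

learner = AssociativeLearner()
learner.perceive(visual="terrestrial_predator", auditory="alarm_1")
for _ in range(12):
    learner.tick()
print(learner.assoc)   # the reinforced association persists after the stimuli fade
```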
As shown in figure 1, the associative learning can produce feedback that indirectly affects drives and other behaviors. When an alarm is heard and it is associated with a predator, a new internal stimulus is created, composed of the associated predator, the association strength, and the direction and distance of the alarm, which is used as an approximate location of the predator. This new stimulus affects the fear drive and the fleeing behavior: the fear drive is changed to account for the new information, which gradually changes the fear value.
This allows the associative learning to produce an escape response even if the predator is not seen. This response is gradually learned, and it describes a new action rule associating the alarm with the predator and a subsequent fleeing behavior.
The initial response to an alarm is a scanning behavior, typically indexical. If the
alarm produces an escape response due to its mental association with a predator,
our creature is using a symbol.
3 Creatures in Operation
The virtual environment inhabited by creatures works as a laboratory to study
the conditions for symbol emergence. In order to evaluate our simulation architecture, we performed different experiments to observe the creatures during
associative learning of stimuli. We simulate the communicative interactions between preys in an environment with the different predators and objects, varying
the quantity of creatures present in the environment.
Initially, we used teacher and learner preys and changed the number of teachers, predators and learners (figure 3). Results show that learners are always able to establish the correct associations between alarms and predators (alarm 1 – terrestrial predator, alarm 2 – aerial predator, alarm 3 – ground predator). The number of interactions needed decreased, whereas the amount of competition among associations increased, as the number of teachers or predators increased. This is due to an increase in the number of vocalizing events from teachers, which corresponds to more reinforcement events and fewer weakening events. Placing two learners in the environment, we could also notice that the trajectories described by the association values in each prey are quite different, partially because of random properties in their behavior.
Fig. 3. Evolution of association strength values using Teachers and Learners (association value x iteration). Exp. A (1 learner (L), 5 teachers (T) and 3 predators (P)):
associations with (a) alarm 1, (b) alarm 2 and (c) alarm 3. Exp. B (1 L, 5 T, 6 P): (d)
winning associations for alarms. Exp. C (1 L, 10 T, 3 P): (e) winning assoc. for alarms.
Exp. D (2 L, 5 T, 3 P): (d) winning associations in each creature.
Using self-organizers, all preys can vocalize and learn alarms. Therefore, the number of alarms present in the simulations is not limited to three as before. Each prey can create a different alarm for the same predator, and the one most used tends to dominate the preys' repertoire in the end (figure 4). Increasing the number of preys tends to increase the number of alarms, the number of interactions and also the amount of competition, since there are more preys creating alarms and the alarms have to disseminate among more preys.
In a final experiment, we wanted to evaluate the adaptive advantage of using symbols instead of just indexes (figure 5). We adjusted our simulations by modelling an environment where visual cues are not always available, as predators, for instance, can hide in the vegetation to approach preys unseen. This was done by including a probability of predators being actually seen even when they are within the sensory area. We then placed learner preys that responded to alarms by just performing scanning (indexical response) and preys that could respond to alarms using their learned associations (symbolic response). Results show that the symbolic response to alarms provides an adaptive advantage, as the number of attacks suffered is consistently lower than otherwise.
4 Conclusion
Here we presented a methodology to simulate the emergence of symbols through
communicative interactions among artificial creatures. We propose that symbols
Fig. 4. Evolution of association strength values for Self Organizers (mean value in
the preys population). Exp. A (4 self-organizers (S) and 3 predators (P)): associations
with (a) terrestrial predator, (b) aerial predator and (c) ground predator. Exp. B (8 S,
3 P): associations with (d) terrestrial pred., (e) aerial pred. and (f) ground
pred.
Fig. 5. Number of attacks suffered by preys responding indexically or symbolically to
alarms. We simulated an environment where preys cannot easily see predators, introducing a 25% probability of a predator being seen even when it is within the sensory area.
can result from the operation of simple associative learning mechanisms between
external stimuli. Experiments show that learner preys are able to establish the
correct associations between alarms and predators, after being exposed to vocalization
events. Self-organizers are also able to converge to a common repertoire, even
though there were no pre-defined alarm associations to be learned. Symbol learning and use also provide an adaptive advantage to creatures when compared to the indexical use of alarm calls.
Although there have been other synthetic experiments simulating the development and evolution of sign systems, e.g. [4, 6], this work is the first to deal with multiple distributed agents performing autonomous (self-controlled) communicative interactions. Differently from others, we do not establish a pre-defined 'script' of what happens in communicative acts, stating a fixed sequence of tasks to be performed by one speaker and one hearer. In our work, creatures can be speakers and/or hearers, vocalizing and hearing from many others at the same time, in various situations.
Acknowledgments
A.L. was funded by CAPES; J.Q. is funded by FAPESP.
References
1. Peirce, C.S.: The Collected Papers of Charles Sanders Peirce. Harvard University
Press (1931-1958) vols.I-VI. Hartshorne, C., Weiss, P., eds. vols.VII-VIII. Burks,
A.W., ed.
2. Queiroz, J., Ribeiro, S.: The biological substrate of icons, indexes, and symbols in
animal communication: A neurosemiotic analysis of vervet monkey alarm calls. In
Shapiro, M., ed.: The Peirce Seminar Papers 5. Berghahn Books, New York (2002)
69–78
3. Seyfarth, R., Cheney, D., Marler, P.: Monkey responses to three different alarm
calls: Evidence of predator classification and semantic communication. Science 210
(1980) 801–803
4. Cangelosi, A., Greco, A., Harnad, S.: Symbol grounding and the symbolic theft
hypothesis. In Cangelosi, A., Parisi, D., eds.: Simulating the Evolution of Language.
Springer, London (2002)
5. Franklin, S.: Autonomous agents as embodied AI. Cybernetics and Systems 28(6)
(1997) 499–520
6. Steels, L.: The Talking Heads Experiment: Volume I. Words and Meanings. VUB
Artificial Intelligence Laboratory, Brussels, Belgium (1999) Special pre-edition.
What Makes a Successful Society?
Experiments with Population Topologies
in Particle Swarms
Rui Mendes* and José Neves
Departamento de Informática, Universidade do Minho, Portugal
Abstract. Previous studies in Particle Swarm Optimization (PSO)
have emphasized the role of population topologies in particle swarms.
These studies have shown that there is a relationship between the way individuals in a population are organized and their ability to find global optima. A study of which graph statistics are relevant is therefore of paramount importance. This work presents such a study, which will provide guidelines
that can be used by researchers in the field of PSO in particular and in
the Evolutionary Computation arena in general.
Keywords: Particle Swarm Optimization, Swarm Intelligence, Evolutionary Computation
1 Introduction
The field of Particle Swarm Optimization (PSO) is evolving fast. Since its creation in 1995 [1, 2], researchers have proposed important contributions to the paradigm in the field of parameter selection [3, 4]. Lately, population topologies have also been an object of study, as their importance has been demonstrated [5, 6]. The study of topologies has also triggered the development of a very successful algorithm, the Fully Informed Particle Swarm (FIPS), which has been shown to perform better than the canonical particle swarm (widely accepted by researchers as the state-of-the-art algorithm) on a well-known benchmark of hard functions [7, 8].
Because FIPS has demonstrated superior results and is closely tied to the structure of the population, a study was conducted to understand the relationship between the population structure and the algorithm.
2 Canonical Particle Swarm
The standard algorithm is given in some form resembling the following:
* The work of Rui Mendes is sponsored by the grant POSI/ROBO/43904/2002.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 346–355, 2004.
© Springer-Verlag Berlin Heidelberg 2004
v_{t+1} = ω v_t + U[0, φ_1] ⊗ (p_i − x_t) + U[0, φ_2] ⊗ (p_g − x_t)    (1)
x_{t+1} = x_t + v_{t+1}    (2)

where ⊗ denotes point-wise vector multiplication, U[min, max] is a function that returns a vector whose positions are randomly generated, following the uniform distribution between min and max, ω is called the inertia weight and is less than 1, v_t and x_t represent the speed and position of the particle at time t, p_i refers to the best position found by the particle, and p_g refers to the position found by the member of its neighborhood that has had the best performance so far. The Type 1'' constriction coefficient is often used [4]:

χ = 2 / | 2 − φ − sqrt(φ^2 − 4φ) |,  where φ = φ_1 + φ_2 > 4    (3)

v_{t+1} = χ ( v_t + U[0, φ_1] ⊗ (p_i − x_t) + U[0, φ_2] ⊗ (p_g − x_t) )    (4)
x_{t+1} = x_t + v_{t+1}
The two versions are equivalent, but are simply implemented differently. The
second form is used in the present investigations. Other versions exist, but all
are fairly close to the models given above.
A particle searches through its neighbors in order to identify the one with
the best result so far, and uses information from that source to bias its search in
a promising direction. There is no assumption, however, that the best neighbor
at time t actually found a better region than the second- or third-best neighbors. Important information about the search space may be neglected through
overemphasis on the single best neighbor.
When constriction is implemented as in the second version above, scaling the whole right-hand side of the velocity formula by χ, the constriction coefficient is calculated from the values of the acceleration coefficient limits φ_1 and φ_2; importantly, it is the sum of these two coefficients that determines which χ to use. This fact implies that the particle's velocity can be adjusted by any number of terms, as long as the acceleration coefficients sum to an appropriate value. For instance, the algorithm given above is often used with χ = 0.729 and φ_1 = φ_2 = 2.05; the coefficients must sum, for that value of χ, to 4.1.
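As a concrete reading of the constricted update, the sketch below runs the "second version" of the velocity formula for a single particle on the sphere function, with χ = 0.729 and φ_1 = φ_2 = 2.05 as in the text. The single random "neighbor best" and the loop length are simplifications for illustration, not the benchmark protocol used in the experiments.

```python
# Minimal sketch of the constricted velocity/position update for one particle
# on the 30-dimensional sphere function. Only the update rule follows the text;
# the surrounding setup is a simplified illustration.
import numpy as np

rng = np.random.default_rng(0)
DIM, CHI, PHI1, PHI2 = 30, 0.729, 2.05, 2.05

def sphere(x):
    return float(np.sum(x * x))

x = rng.uniform(-100, 100, DIM)      # particle position
v = np.zeros(DIM)                    # particle velocity
p_i = x.copy()                       # particle's own previous best
p_g = rng.uniform(-100, 100, DIM)    # stand-in for the best neighbor's previous best

for _ in range(1000):
    v = CHI * (v
               + rng.uniform(0, PHI1, DIM) * (p_i - x)
               + rng.uniform(0, PHI2, DIM) * (p_g - x))
    x = x + v
    if sphere(x) < sphere(p_i):
        p_i = x.copy()
    if sphere(p_i) < sphere(p_g):    # a single "neighbor", for illustration only
        p_g = p_i.copy()

print("best value found:", sphere(p_g))
```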
3 Fully Informed Particle Swarm
The idea behind FIPS is that social influence comes from the group norm, i.e.,
the center of gravity of the individual’s neighborhood. Contrary to canonical
particle swarm, there is no individualism. That is, the particle’s previous best
position takes no part in the velocity update.
In the canonical particle swarm, each particle explores around a region defined by its previous best success and the success of the best particle in its
neighborhood. The difference in FIPS is that the individual gathers information from the whole neighborhood. For that, let us define N_i as the set of neighbors of particle i and p_k as the best position found by individual k:

v_{t+1} = χ ( v_t + Σ_{k ∈ N_i} U[0, φ/|N_i|] ⊗ (p_k − x_t) )    (5)
x_{t+1} = x_t + v_{t+1}
This formula is a generalization of the canonical version. In fact, if N_i is defined to contain only the particle itself and its best neighbor, this formula is equivalent to
the one presented in equation 4. Thus, in FIPS the velocity update is performed
according to a stochastically weighted average of the difference between the
particle’s current position and each of its neighbors’ previous best.
As can be concluded from equation 5, the algorithm uses neither information
about the relative quality of each of the solutions found by its neighbors nor
about the particle’s previous best position. The particle simply oscillates around
the stochastic center of gravity of its neighbors’ previous findings.
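The fully informed update of equation 5 can be sketched as follows, with the acceleration coefficient φ split evenly among the neighbors' previous bests. The ring neighborhood and the random initial positions are only illustrative choices; whether the particle itself is counted in its own neighborhood is a design decision that varies between FIPS variants.

```python
# Sketch of the fully informed velocity update: each particle is attracted to
# the previous bests of all its neighbors, with phi split evenly among them.
# The ring topology and random data below are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
N_PARTICLES, DIM, CHI, PHI = 20, 30, 0.729, 4.1

def fips_velocity(v, x, neighbor_bests):
    k = len(neighbor_bests)
    update = np.zeros_like(v)
    for p_k in neighbor_bests:
        update += rng.uniform(0, PHI / k, size=x.shape) * (p_k - x)
    return CHI * (v + update)

# Example: particle 0 in a ring of 20 particles (neighbors 19 and 1).
positions = rng.uniform(-10, 10, (N_PARTICLES, DIM))
velocities = np.zeros((N_PARTICLES, DIM))
prev_bests = positions.copy()
neighbors_of_0 = [prev_bests[19], prev_bests[1]]
velocities[0] = fips_velocity(velocities[0], positions[0], neighbors_of_0)
positions[0] = positions[0] + velocities[0]
print(velocities[0][:3])
```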
4 Population Structures and Graph Statistics
In particle swarms, individuals strive to improve themselves by imitating traits
found in their successful peers. Thus, “social norms” emerge because individuals
are influenced by their neighbors. The definition of the social neighborhood of
an individual, i.e., which individuals influence it, is very important. As practice
demonstrates, the topology that is most widely used – gbest, where all individuals
influence one another – is vulnerable to local optima.
Social influence is dictated by the information found in the neighborhood of
each individual, which is only a subset of the population. The relationship of
influence is defined by a social network – represented as a graph – that we call
population topology or sociometry.
The choice of sociometry controls how soon the algorithm converges. The goal here is to find which aspects of the graph structure are responsible for the “spread” of information. It does not make sense to study topologies where there are
isolated subgroups, as they would not communicate among themselves. Therefore, all graphs studied are connected, i.e., there is a path between any two
vertices. Results reported by researchers confirm that PSO performs well with
small populations of 20 individuals.
4.1 Degree and Distribution Sequence
Degree determines the scale of socialization: an individual without neighbors is an outsider; an individual with few neighbors cannot gather information from, nor influence, many others in the population; an individual with many neighbors is both well informed and possesses a large sphere of influence.
One of the most interesting measures of the spread of information seems to be the distribution sequence. In fact, it can be seen as an extension of the degree. In short, this sequence, denoted d_i, gives the number of individuals that can only be reached through a path of i edges:
d_1: the degree of the vertex. It represents the number of individuals immediately influenced by it.
d_2: the number of neighbors' neighbors. To influence these individuals, the vertex must influence its neighbors for a sufficiently long period of time.
d_3: the number of individuals three steps away from the vertex. To influence these individuals, the vertex has to transitively influence its neighbors and its neighbors' neighbors.
Besides the degree, this study also investigates the effects of d_2; d_3 is not used because it is not defined on most of the graphs used.
4.2 Average Distance, Radius and Diameter
In a sparsely connected population, information takes a long time to travel. The
spreading of information is an important object of study. Scientists study this
effect in many different fields, from social sciences to epidemiology. A measure
of this is path length. Path length presents a compromise between exploration
and exploitation: If it is too small, it means that information spreads too fast,
which implies a higher probability of premature convergence. If it is large, it
means that information takes a long time to travel through the graph and thus
the population is more resilient and not so eager to exploit earlier on.
However, robustness comes at a price: speed of convergence. It seems important to find an equilibrium. This statistic correlates highly with degree: a high
degree means a low path length and vice-versa. The radius of a graph is the smallest maximal distance from a vertex to any other vertex. The diameter of a graph is the largest distance between any two vertices.
4.3 Clustering
Clustering measures the percentage of a vertex’s neighbors that are neighbors
to one another. It measures the degree of “cliquishness” of a graph. Overlapping plays an important part in social networks. We move in several circles of
friends. In these, almost everyone knows each other. In fact we act as bridges
or shortcuts between the various circles we frequent. Clustering influences the
information spread in a graph. However, its influence is more subtle. The degree of homogenization forces the cluster to follow a social norm. If most of the
connections are inside the cluster; all individuals in it will tend to share their
knowledge fairly quickly. Good regions discovered by one of them are quickly
passed on to the other members of the group. Even a partial degree of clustering
helps to disseminate information. It is easier to influence an individual if we
influence most of its neighbors.
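The graph statistics discussed in this section can be computed directly from a candidate topology, for example with networkx as sketched below. The small-world graph used here is only a stand-in for the topologies generated in this study.

```python
# Sketch: compute degree, the second-order distribution sequence, average
# distance, radius, diameter and clustering for one candidate topology.
# The generated graph is a placeholder, not one of the paper's topologies.
import networkx as nx

g = nx.connected_watts_strogatz_graph(n=20, k=4, p=0.2, seed=42)

degrees = [d for _, d in g.degree()]
# d2(v): number of individuals exactly two edges away from v.
d2 = [sum(1 for _, dist in nx.single_source_shortest_path_length(g, v).items()
          if dist == 2)
      for v in g.nodes()]

stats = {
    "average degree": sum(degrees) / len(degrees),
    "average d2": sum(d2) / len(d2),
    "average distance": nx.average_shortest_path_length(g),
    "radius": nx.radius(g),
    "diameter": nx.diameter(g),
    "average clustering": nx.average_clustering(g),
}
print(stats)
```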
5 Parallel Coordinates and Visual Data Analysis
Parallel coordinates provide an effective representation tool to perform hyperdimensional data analysis [9]. Parallel coordinates were proposed by Inselberg
[10] as a new way to represent multi-dimensional information. Since the original
proposal, much subsequent work has been accomplished, e.g., [11]. In traditional
Cartesian coordinates, all axes are mutually perpendicular. In parallel coordinates, all axes are parallel to one another and equally spaced. By drawing the
axes parallel to one another, one can represent points, lines and planes in hyperdimensional spaces. Points are represented by connecting the coordinates on each
of the axes by a line.
Parallel coordinates are a very useful tool in visual analysis. It is very easy to
identify clusters visually in high dimensional data by using color transparency.
Color transparency is used to darken less clustered areas and brighten highly
clustered ones. By using brushing techniques, it is possible to examine subsets
of the data and to identify relationships between variables.
In this study, parallel coordinates were used to identify the graph statistics
present in all highly successful population topologies. By using brushing, it is
possible to identify highly successful groups and identify what characteristics are
shared by all topologies belonging to them.
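A parallel-coordinates view of this kind can be produced, for instance, with pandas and matplotlib as sketched below. The column names mirror the axes used later in the analysis of the results, but the values are random placeholders rather than the experimental data analyzed with Parvis.

```python
# Sketch of a parallel-coordinates plot over placeholder data. The columns
# mimic the axes used in the analysis; the values are random stand-ins.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "Alg": rng.choice([1, 2], size=100),          # 1 = canonical PSO, 2 = FIPS
    "Prop": rng.uniform(0, 1, 100),               # proportion of successes
    "Perf": rng.uniform(0, 1, 100),               # uniform average performance
    "Degree": rng.integers(3, 11, 100),
    "ClusteringCoefficient": rng.uniform(0, 1, 100),
    "AverageDistance": rng.uniform(1.5, 4.0, 100),
})
# Brushing-like highlight: mark conditions with a high proportion of successes.
df["successful"] = np.where(df["Prop"] > 0.93, "high", "other")
parallel_coordinates(df, class_column="successful",
                     color=["#cc3333", "#cccccc"], alpha=0.5)
plt.show()
```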
6 Parameter Selection and Test Procedure
The present experiments extracted two kinds of measures of performance on
a standard suite of test functions. The functions were the sphere or parabolic
function in 30 dimensions, Rastrigin’s function in 30 dimensions, Griewank’s
function in 10 and 30 dimensions (the importance of the local minima is much
higher in 10 dimensions, due to the product of cosines, making it much harder
to find the global minimum), Rosenbrock’s function in 30 dimensions, Ackley’s
function in 30 dimensions, and Schaffer’s f6, which is in 2 dimensions. Formulas
can be found in the literature (e.g., in [12]).
The experiments conducted compare several conditions among themselves.
A condition is an algorithm paired with a topology. To have a certain degree of
precision as to the value of a certain measure pertaining to a given condition,
50 runs were performed per condition.
6.1 Mean Performance
One of the measures used is the best function result attained after a fixed number of function evaluations. This measure reports the expected performance an
algorithm will have on a specific function; it is thus a rough (“sloppy”) measure of speed. It does not necessarily indicate whether the algorithm is close
to the global optimum. A relatively high score can be obtained on some of these
multi-modal functions simply by finding the best part of a locally optimal region.
When many functions are used, results are usually presented independently for each function, and there is no methodology for concluding which of the approaches performs well over all of them; this considerably complicates the task of evaluating which approach is the best. It is not possible to combine raw results from different functions, as they are all scaled differently. To provide an easier way of combining the results from different functions, uniform fitness is used instead of raw fitness. A uniform fitness can simply be regarded as a proportion: a uniform fitness of less than 0.1 can be interpreted as being one of the top 10% solutions. In this study, performance is recorded after 1,000 iterations have elapsed.
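The rank-based reading of uniform fitness described above can be sketched as follows; whether the paper uses exactly this transformation or a smoothed variant is not specified here, so the code should be read only as an illustration of the idea.

```python
# Sketch of "uniform fitness": convert raw results on one function into a
# proportion in [0, 1] by ranking them, so results from differently scaled
# functions can be combined. This is an illustrative reading, not necessarily
# the exact transformation used in the paper.
import numpy as np

def uniform_fitness(raw_results):
    raw = np.asarray(raw_results, dtype=float)
    order = raw.argsort()               # lower raw value = better (minimization)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(raw))
    return (ranks + 1) / len(raw)       # e.g. < 0.1 means a top-10% result

sphere_results = [1e-9, 3.2, 0.05, 120.0, 2e-6]
print(uniform_fitness(sphere_results))  # best result gets the smallest proportion
```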
6.2 Proportion of Successes
While the measure of mean performance gives an indication of the quality of the
solution found, an algorithm can achieve a good result while getting stuck in a
local optimum. The proportion of successes shows the percentage of times that
the algorithm was able to reach the globally optimal region. The proportion of
successes validates the results of the average performance. It may be possible
for good results to be achieved by combining an extremely good result in one function (e.g. the Sphere) with an average result in a more difficult function.
The algorithm is left to run until 3,000 iterations have elapsed and then its
success is recorded.
6.3 Parameter Selection
As the goal of this study is to verify the impact of the choice of social topologies
in the behavior of the algorithm, the tuning parameters are fixed. They are set
to the values that are widely used by the community and that are deemed to
be the most appropriate ones, as demonstrated in [4]. The value of the acceleration
constant φ was set to 4.1, which is one of the most used in the particle swarm
community. This value is split equally between φ1 and φ2. The value of the
constriction coefficient χ was set to 0.729. All the population topologies used in
this study comprise 20 individuals.
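For reference, a sketch of the constriction-type velocity update assumed by these settings is shown below. It shows the canonical swarm; FIPS instead averages the attraction over all of a particle's informed neighbors. The variable names are illustrative only.

```python
# Canonical constricted velocity update (Clerc and Kennedy [4]):
# chi = 0.729 and phi = phi1 + phi2 = 4.1, split equally.
import random

def update_velocity(v, x, p_best, g_best, chi=0.729, phi=4.1):
    phi1 = phi2 = phi / 2.0
    return [chi * (vi
                   + phi1 * random.random() * (pb - xi)
                   + phi2 * random.random() * (gb - xi))
            for vi, xi, pb, gb in zip(v, x, p_best, g_best)]
```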
6.4 Topology Generation
The graphs representing the social topologies were generated according to a
given set of constraints. These were representative of several parameters deemed
important in the graph structure. Preliminary studies of the graph statistics
indicated that by manipulating the average degree and average clustering, along
with the corresponding standard deviations, it was possible to manipulate the
other statistics over the entire range of possible values. These parameters were
used to create a database of graphs with average degrees ranging from 3 to 10
and clustering from 0 to 1. A database of graph statistics of these topologies was
collected, to be used in the analysis. The total number of population topologies
used amounts to 3,289.
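One possible way to build such a database is sketched below using the networkx library; this is not necessarily the generator used by the authors, which also controls the standard deviations of degree and clustering.

```python
# Hedged sketch: generate 20-node topologies with varying degree and clustering
# and record the graph statistics used in the analysis.
import networkx as nx

def topology_database(n=20, degrees=range(3, 11),
                      rewiring_probs=(0.0, 0.1, 0.3, 1.0)):
    db = []
    for k in degrees:
        for p in rewiring_probs:
            g = nx.connected_watts_strogatz_graph(n, k, p)
            db.append({
                "graph": g,
                # statistics are measured on the generated graph itself
                "avg_degree": sum(d for _, d in g.degree()) / n,
                "clustering": nx.average_clustering(g),
                "avg_distance": nx.average_shortest_path_length(g),
                "radius": nx.radius(g),
                "diameter": nx.diameter(g),
            })
    return db
```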
7 Analysis of the Results
The results obtained are analyzed visually using Parvis, a tool for parallel coordinates visualization of multidimensional data sets. To allow for an easier interpretation of the figures, the name of each of the axes is explained:
Alg 1 for Canonical Particle Swarm, 2 for FIPS.
Prop Proportion of successes.
Perf Average performance.
Degree Average degree of the population topology.
ClusteringCoefficient Clustering coefficient of the population topology.
AverageDistance Average distance between two nodes in the graph.
DistSeq2 The distribution sequence of order 2.
Radius The radius of the graph.
Diameter The diameter of the graph.

Fig. 1. Experiments with a proportion of successes higher than 93%. All the experiments belong to the FIPS algorithm.
First, the experiments responsible for a proportion of successes higher than
93% are isolated (Figure 1). All the results belong to the FIPS algorithm. None of
the canonical experiments was this successful. However, some of the experiments
have low quality average performance. The next step is to isolate the topologies
with both a high proportion of successes and a high quality average performance
(Figure 2). Fortunately, all of these have some characteristics in common:
the average degree is always 4;
the clustering coefficient is low;
the average distance is always similar.
As most of the graph statistics are related to some degree, it seems interesting
to display the graph statistics of all graphs with degree 4 (Figure 3). This shows
that the average distance is similar for graphs with a somewhat low clustering
coefficient. Thus, it makes sense to concentrate the efforts in just the average
degree and clustering coefficient.
Figure 4 shows the experiments of FIPS, using topologies with average degree
4 and clustering lower than 0.5. This figure is similar to Figure 2. As a further
exercise, Figure 5 shows what happens when the clustering is restricted to values
lower than 0.0075. This set identifies very high quality solutions, according to
both measures.
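The brushing steps used in this analysis can be paraphrased as simple filters over the table of experiments; the sketch below uses illustrative column names matching the axis labels listed above, not the tool's own data model.

```python
# Rough equivalent of the brushing operations described in this section.
def brush(experiments, min_prop=0.93, degree=None, max_clustering=None):
    selected = [e for e in experiments if e["Prop"] >= min_prop]
    if degree is not None:
        selected = [e for e in selected if round(e["Degree"]) == degree]
    if max_clustering is not None:
        selected = [e for e in selected if e["ClusteringCoefficient"] < max_clustering]
    return selected
```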
Fig. 2. Experiments with a high proportion of successes and a high quality average
performance. The following conclusions can be drawn: the average degree is always 4;
the clustering coefficient is low; the average distance is always similar.

Fig. 3. Graph statistics of all topologies with average degree 4.

8 Conclusions and Further Work

This study corroborates the results reported in [7,8] that FIPS shows results
superior to those of the canonical particle swarm. It showed that the successful
topologies had an average of four neighbors. This result can be easily rationalized:
The use of more particles triggers the possibility of crosstalk effects encountered
in neural network learning algorithms. In other words, the pulls experienced in
the directions of multiple particles will mostly cancel each other and reduce the
possible benefits of considering their knowledge.
Parallel coordinates proved to be a powerful tool to analyze the results. The
capabilities of the tool used allowed for a very straightforward test of different
hypotheses. The visual analysis of the results was able to find a set of graph
statistics that explains what makes a good social topology.
To validate the conjectures drawn from this work, a large number of graphs
with the characteristics found should be generated and tested to see if all the
graphs in the set have similar characteristics when interpreted as a population
topology. Further tests with other problems should also be performed, especially
with real-life problems, to validate the results found.
Fig. 4. Experiments of FIPS with topologies with average degree 4 and clustering lower
than 0.5.

Fig. 5. Experiments of FIPS with topologies with average degree 4 and clustering lower
than 0.0075.
References
1. Eberhart, R., Kennedy, J.: A new optimizer using particle swarm theory. In: Proceedings of the Sixth International Symposium on Micro Machine and Human
Science, Nagoya, Japan, IEEE Service Center (1995) 39–43
2. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of
ICNN’95 - International Conference on Neural Networks. Volume 4., Perth, Western Australia (1995) 1942–1948
3. Shi, Y., Eberhart, R.C.: Parameter selection in particle swarm optimization. In
Porto, V.W., Saravanan, N., Waagen, D., Eiben, A.E., eds.: Evolutionary Programming VII, Berlin, Springer (1998) 591–600 Lecture Notes in Computer Science
1447.
4. Clerc, M., Kennedy, J.: The particle swarm: Explosion, stability, and convergence
in a multi-dimensional complex space. IEEE Transactions on Evolutionary Computation 6 (2002) 58–73
5. Kennedy, J.: Small worlds and mega-minds: Effects of neighborhood topology on
particle swarm performance. In: Proceedings of the 1999 Conference on Evolutionary Computation, IEEE Computer Society (1999) 1931–1938
6. Kennedy, J., Mendes, R.: Topological structure and particle swarm performance.
In Fogel, D.B., Yao, X., Greenwood, G., Iba, H., Marrow, P., Shackleton, M., eds.:
Proceedings of the Fourth Congress on Evolutionary Computation (CEC-2002),
Honolulu, Hawaii, IEEE Computer Society (2002)
7. Mendes, R., Kennedy, J., Neves, J.: Watch thy neighbor or how the swarm can learn
from its environment. In: Proceedings of the Swarm Intelligence Symposium (SIS2003), Indianapolis, IN, Purdue School of Engineering and Technology, IUPUI,
IEEE Computer Society (2003)
8. Mendes, R., Kennedy, J., Neves, J.: The fully informed particle swarm: Simpler,
maybe better. IEEE Transactions on Evolutionary Computation (in press 2004)
9. Wegman, E.: Hyperdimensional data analysis using parallel coordinates. Journal
of the American Statistical Association 85 (1990) 664–675
10. Inselberg, A.: n-dimensional graphics, part I–lines and hyperplanes. Technical Report G320-2711, IBM Los Angeles Scientific Center, IBM Scientific Center, 9045
Lincoln Boulevard, Los Angeles (CA), 900435 (1981)
11. Inselberg, A.: The plane with parallel coordinates. The Visual Computer 1 (1985)
69–91
12. Reynolds, R.G., Chung, C.: Knowledge-based self-adaptation in evolutionary programming using cultural algorithms. In: Proceedings of IEEE International Conference on Evolutionary Computation (ICEC’97). (1997) 71–76
Splinter: A Generic Framework
for Evolving Modular Finite State Machines
Ricardo Nastas Acras and Silvia Regina Vergilio
Federal University of Parana (UFPR), CP: 19081
CEP: 81531-970, Curitiba, Brazil
[email protected], [email protected]
Abstract. Evolutionary Programming (EP) has been used to solve a
large variety of problems. This technique uses concepts of Darwin’s theory to evolve finite state machines (FSMs). However, most works develop
tailor-made EP frameworks to solve specific problems. These frameworks
generally require significant modifications in their kernel to be adapted
to other domains. To ease reuse and to allow modularity, modular FSMs were
introduced; they are fundamental to obtaining more generic EP frameworks. In this
paper, we introduce the framework Splinter, which is capable of evolving modular
FSMs and can be easily configured to solve different problems. We illustrate this by
presenting results from the use of Splinter for two problems: the artificial ant
problem and the recognition of a sequence of characters. The results validate the
Splinter implementation and show that the benefits of modularity do not decrease
performance.
Keywords: evolutionary programming, modularity
1 Introduction
Evolutionary Computation (EC) techniques have gained attention in recent years,
mainly because they are able to solve a great number of complex problems [7, 11].
These techniques are based on Darwin's theory [4]: the individuals that adapt better
to the environment that surrounds them have a greater chance to survive. They pass
their genetic characteristics on to their descendants and, consequently, after several
generations this process tends to naturally select individuals, eliminating the ones
that do not fit the environment. These concepts are usually applied through genetic
operators such as selection, crossover, mutation and reproduction. EC techniques
include Genetic Algorithms, Genetic Programming, Evolution Strategies and
Evolutionary Programming; the last one is the focus of this paper. In Evolutionary
Programming (EP) the individuals, which represent the solutions for a given
problem, are finite state machines (FSMs).
EP is not a new field. It was first proposed by Fogel for evolving artificial
intelligence in the early 1960’s [6]. Since then, it has been used for the evolution
and optimization of a wide variety of architectures and parameters. According
to Chellapilla and Czarnecki [3], such applications include linear and bilinear
models, neural networks, fuzzy systems, lists, etc. However, most works and EP
frameworks found in the literature deal with problem-specific representations [2,
11]. EP frameworks need significant modifications in their kernel to be adapted to
other domains. In this sense, an evolutionary framework capable of implementing
different problem representations is necessary. Such a generic framework should be
easily configurable and scalable to problems of practical value.
Chellapilla and Czarnecki [3] point out that such a framework should support
automatic discovery of problem representations. To allow this feature, it should use
modular FSMs (MFSMs). The use of MFSMs favors the generation of hierarchical,
modular structures that can decompose a difficult task into simpler subtasks. These
subtasks may then be solved with lower computational effort and their solutions
combined to give the general solution. This also allows reuse when solving similar
sub-problems and eases comprehension.
Modularity is an important concept for reaching generic solutions. Because of
this, we find in EC some works focusing on modularity, such as the evolution of
modules using Genetic Programming [1, 9, 12, 13] and using EP [3]. In this last
work, the authors propose a procedure to evolve MFSMs and present results
showing that the evolution of MFSMs is statistically significantly faster. However,
the authors do not implement a generic framework. In [10] a generic EP framework
is described; it offers a set of C++ classes to be configured to evolve FSMs, but it
does not allow the evolution of MFSMs.
We introduce Splinter, a generic EP framework capable of evolving MFSMs.
Splinter implements the procedure described in [3] and, because of this, it supports
modularity and reuse. It can be easily configured to solve different problems and
allows non-expert users to apply EP to their specific problems, reducing effort
and time.
To illustrate this, we describe two examples of problems solved with Splinter.
The obtained results allow the validation of the Splinter implementation and the
performance evaluation of MFSMs.
The paper is organized as follows. Section 2 presents an overview of MFSMs
and of the evolution process for this kind of machine. Section 3 describes the
framework Splinter. Section 4 shows usage examples and results obtained with
Splinter. Section 5 concludes the paper.
2 MFSMs
A finite state machine M is represented as M = (I, O, S, δ, λ, s0), where:
I, O, and S are finite sets of input symbols, output symbols and states,
respectively.
δ : S × I → S is the state transition function; it can be null (i.e., partial).
λ : S × I → O is the output function.
When the machine is in a current state s in S and receives an input a from I, it
moves to the next state specified by δ(s, a) and produces an output given by λ(s, a).
S includes a special state s0, called the initial state.
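A minimal sketch of this definition as a data structure is shown below; the dictionary-based encoding and the method names are illustrative, not Splinter's actual C++ classes.

```python
# Deterministic, possibly partial (Mealy-style) FSM keyed by (state, input).
class FSM:
    def __init__(self, initial_state, transitions, outputs):
        self.state = initial_state
        self.transitions = transitions   # (state, symbol) -> next state
        self.outputs = outputs           # (state, symbol) -> output symbol

    def step(self, symbol):
        key = (self.state, symbol)
        if key not in self.transitions:  # null transition: the input is ignored
            return None
        out = self.outputs[key]
        self.state = self.transitions[key]
        return out
```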
An FSM can be represented by a state transition diagram, a directed graph whose
vertices correspond to the states of the machine and whose edges correspond to the
state transitions; each edge is labeled with the input and output symbols associated
with the transition. For example, consider the diagram of Figure 1: starting from the
initial state (marked with an extra arrow), upon receiving an input symbol the
machine moves to the state indicated by the corresponding edge and produces the
associated output. Equivalently, an FSM can be represented by a state table¹, as
given in Table 1. Observe that the initial state is marked and null transitions are
represented by –.
Fig. 1. An Example of State Transition Diagram
There are many extensions to FSMs in the literature; some of them allow the
representation of guards and actions [15] and of data-flow information [14]. To allow
modularity and reuse, an FSM can be extended to have one or more modules.
A modular FSM consists of one main FSM and k sub-modules (which are also
FSMs, that is, sub-FSMs). In an MFSM, transitions between the main FSM and
the sub-FSMs are possible. They are represented by hexagons in the state diagram
and by an additional row in the state table (Control).
For example, Figure 2 represents an MFSM with one sub-FSM and the main FSM.
Starting from the initial state of the main FSM, upon input symbol a the machine
moves to a new state and outputs c. In that state, upon input b, the machine moves
to the initial state of the sub-machine and outputs d. According to the input
received, the sub-machine retains control until one of the transitions represented by
the hexagon Main is reached; in this case, control returns to the main FSM.
Observe that when control is transferred to a sub-FSM, the processing of the
input symbol always starts in the sub-FSM initial state. However, when control
returns to the main FSM or to any other sub-FSM, processing continues from the
state that was current when control was transferred. The control transfer is
represented in the state table by the number of the sub-FSM; the main FSM has
number 0.

¹ We consider only deterministic machines, i.e., machines that do not have more than
one transition for each input symbol.

Fig. 2. An Example of State Transition Diagram for a MFSM
The evolution process for MFSMs is based on the evolutionary procedure of
Fogel [5] and of Chellapilla and Czarnecki [3]. This procedure includes the
following steps.
1. Initialization: a population consisting of MFSMs is randomly created. Each
sub-FSM is initialized at random in an identical manner: first the number of
states is initialized and the initial state is selected; the transitions are then
created and, based on the provided input and output alphabets, the symbols
are assigned to each transition.
2. Application of the mutation operators: when the individuals are FSMs, only
mutation operators are applied. An individual P is modified to produce one
offspring P'. The mutation operations are:
delete states: one or more states are randomly selected for deletion. The
links in the machine are reassigned randomly to other states. If the initial
state was deleted, a new one is selected.
reassign the initial state: a new initial state is chosen at random.
reassign transitions: randomly selected links in states are randomly
reassigned to different states.
reassign output symbols: some output symbols are randomly chosen and
reassigned to different symbols randomly chosen from the alphabet.
change control: control entries in the state table are randomly chosen
and reassigned to different machines.
add states: a new state is created and its transitions are randomly generated.
This new state only becomes actually connected to the machine if another
mutation, such as reassign transitions, occurs.
3. Fitness Evaluation: the fitness of each individual is evaluated according to
the objective function for the task.
4. Selection: to determine the individuals to be modified by the mutation
operators, tournament selection [5] is used. For a machine M, a number of
opponents is randomly chosen; if the machine's fitness is no less than an
opponent's fitness, it receives a win. The individuals with the most wins are
selected to be mutated for the next generation.
5. The procedure ends when the halting criterion is satisfied or the maximum
number of generations is reached (a minimal sketch of one generation of this
loop is given below).
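The sketch below is a hedged illustration of one generation of this loop; mutate() and evaluate() stand for the operators and task-specific fitness described above, and the tournament size and survivor selection are simplifying assumptions.

```python
# One generation of the EP loop (steps 2-4): mutation-only reproduction,
# fitness evaluation, and tournament-based selection of the survivors.
import random

def one_generation(population, evaluate, mutate, n_opponents=10):
    offspring = [mutate(parent) for parent in population]
    pool = population + offspring
    scored = [(individual, evaluate(individual)) for individual in pool]
    ranked = []
    for individual, fitness in scored:
        opponents = random.sample(scored, n_opponents)
        wins = sum(1 for _, f in opponents if fitness >= f)
        ranked.append((wins, individual))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return [individual for _, individual in ranked[:len(population)]]
```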
Chellapilla and Czarnecki [3] applied the above procedure to the artificial ant
problem. The results indicate that the proposed EP procedure can rapidly evolve
optimal modular machines in comparison with the evolution of non-modular
FSMs: perfect machines were found in 48 of the 50 MFSM evolution trials, and
in 44 of the 50 non-modular FSM evolution trials.
3 The Framework Splinter
The framework Splinter supports the evolution of MFSMs. It was implemented in
C++. This language allows the use of containers besides the object-oriented
concepts, such as polymorphism, overloading and inheritance, which simplify the
framework implementation.
Fig. 3 shows diagrams illustrating the main modules and classes of Splinter.
They are described as follows.
1. Population: responsible for maintaining the population during the evolution
process. It is implemented by the class CPopulation, which is associated with
several MFSMs, the individuals in the population. Each individual is
represented by the class CMFSM, according to the tables presented in
Section 2. This class is composed of a set of n instances of CModule, where
n is the number of modules of the modular machine. Each CModule, in its
turn, is composed of a set of states and transitions, implemented respectively
by the classes CState and CTransition.
2. Fitness: this module is implemented by the class CFitness associated to
CPopulation. This class has a method evaluate that is related to the fitness
function and is dependent on the problem.
3. Evolver: module responsible for the evolution process and the application
of the genetic operators. The evolution procedure and operators used by
Evolver were described in Section 2.
4. Creator: creates the initial population. It is implemented by the class
CCreator. Two special classes, CUtilsRandom and CUtilsSymbols, are
responsible for the random generation of the individuals according to the
initial configuration file.
Fig. 3. Splinter Diagrams
The configuration file is organized into sections, delimited by “[”. Each section
defines several parameters. An example of a configuration file is presented in
Fig. 4 and explained below.
population and individuals: this section contains information for the random
generation of the individuals: the number of individuals in the initial population,
the maximum and minimum numbers of individuals during the process, the
maximum and minimum numbers of modules for an individual, and the maximum
and minimum numbers of states and transitions. If the maximum number of
modules is 1, non-modular FSMs are evolved.
evolution: this section contains information necessary for the evolution process:
the maximum number of generations and the best fitness, which are possible
termination criteria (the second one depends on the fitness function implemented),
the number of opponents used to select an individual to be mutated, and the
maximum and minimum numbers of children and of mutations used to generate
a child.
mutation: this section contains information necessary for the application of the
mutation operators. The mutation rate defines the probability that a mutation
occurs: in a population of 100 individuals, a mutation rate of 0.7 means that 70
of the parents will be mutated to compose the next generation. A probability is
also given for each mutation operator.
symbols: this section defines the input and output alphabets.
recursion: this section contains a single boolean value indicating whether
recursion is allowed (an illustrative sketch of these parameters follows this list).
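The sketch below is an illustrative, entirely hypothetical set of the parameters listed above, written as a Python dictionary; the real configuration file of Fig. 4 uses bracketed sections and the authors' own parameter names and values.

```python
# Hypothetical Splinter-style configuration; names and values are illustrative only.
splinter_config = {
    "population": {"initial_individuals": 100, "individuals": (50, 200),
                   "modules": (1, 4), "states": (3, 6), "transitions": (1, 5)},
    "evolution":  {"max_generations": 500, "target_fitness": 1.0,
                   "opponents": 10, "children": (1, 4), "mutations_per_child": (1, 6)},
    "mutation":   {"rate": 0.7,
                   "operators": {"delete_states": 0.15, "reassign_initial_state": 0.10,
                                 "reassign_transitions": 0.25, "reassign_outputs": 0.25,
                                 "change_control": 0.10, "add_states": 0.15}},
    "symbols":    {"input": ["F", "N"], "output": ["L", "R", "M"]},
    "recursion":  False,
}
```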
To configure Splinter, it is necessary to define a fitness function adequate to the
problem to be solved: the user should override the method evaluate of CPopulation.
Besides this, he or she needs to write the configuration file. When necessary, the
whole evolution procedure can be changed; in that case, the method evolver of
CPopulation needs to be overridden, but this last modification requires more
knowledge about EC evolution strategies.
Fig. 4. Splinter Configuration File
4 Using Splinter
This section presents how Splinter was configured to solve two different problems
and shows some preliminary results.
4.1 The Tracker Task
This problem was introduced in [8] and is also known as the artificial ant problem.
The problem consists of an ant placed on a 32x32 toroidal grid. Food packets
are scattered along a trail on the grid. The trail begins on the second square
in the first row near the left top corner. It is 127 squares long, and contains 20
turns and 89 squares with food packets. The ant can sense the presence of a
food packet in the square directly ahead and can take three decisions: turn left,
turn right, or move forward one square. The goal of the machine is to guide the
ant to collect all 89 food packets. The ant starts out facing East on the second
square in the first row.
The objective function for evaluating an FSM is the total number of food
packets collected within the allotted time. Each of the ant's actions costs one
time step. A maximum of 600 time steps is allowed.
As mentioned in Section 2, Chellapilla and Czarnecki [3] used this problem
to evaluate their EP procedure. As Splinter implements that procedure, we used
Splinter to solve this problem too. The goal is to validate our implementation.
To configure Splinter, the input alphabet used is {F, N}, representing, respectively,
whether or not there is food ahead. The output alphabet consists of {L, R, M},
representing the three movements mentioned above.
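A hedged sketch of this fitness evaluation is given below; the trail layout of [8] is not reproduced, `trail` is assumed to be the set of grid cells containing food, and `machine` is any object exposing a step() method mapping the sensed symbol ("F"/"N") to an action ("L", "R", "M"), as in the FSM sketch of Section 2.

```python
# Fitness on the tracker task: food packets eaten within 600 steps on a 32x32
# toroidal grid, starting on the second square of the first row, facing East.
def ant_fitness(machine, trail, max_steps=600, grid=32):
    headings = [(0, 1), (1, 0), (0, -1), (-1, 0)]     # East, South, West, North
    row, col, heading = 0, 1, 0
    food, eaten = set(trail), 0
    for _ in range(max_steps):
        dr, dc = headings[heading]
        ahead = ((row + dr) % grid, (col + dc) % grid)
        action = machine.step("F" if ahead in food else "N")
        if action == "L":
            heading = (heading - 1) % 4
        elif action == "R":
            heading = (heading + 1) % 4
        elif action == "M":
            row, col = ahead
            if ahead in food:
                food.remove(ahead)
                eaten += 1
    return eaten
```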
We started with a configuration file similar to the one presented in Fig. 4.
After some experimentation, we used the following main parameters: number of
opponents equal to 10; number of children varying in [1..4]; [1..6] mutations
applied to each child; number of states in [3..6]; and number of modules in [2..5].
Splinter was run 4 times and 50 trials were obtained in each run, for a total of
200. Only three of them were not successful. This result is very similar to that
obtained by Chellapilla and Czarnecki, described in Section 2. They obtained two
unsuccessful modular machines and six unsuccessful non-modular ones.
4.2 Sequence of Characters
This is a very common problem in the programming languages area. The machine
has to identify a specific sequence of characters; in our example, the sequence of
vowels (a, e, i, o, u). The idea of this second experiment is to evaluate the
implementation of Splinter in another context. Besides this, we ran Splinter with
several configuration files to investigate the influence of its different parameters.
The fitness function for evaluating an FSM is given by the number of identified
vowels. The best fitness (of 100%) means that the whole sequence was identified.
The input alphabet for this problem is {a, e, i, o, u}. The output alphabet is
{x}, because the output is not significant in this case.
The different configurations are modifications of the file presented in Fig. 4.
These modifications are described below.
1. the configuration of Fig. 4; this configuration does not include modules.
2. changing the number of modules to the interval [2..4], with [2..5] states.
3. changing the size of the population to 1000.
4. changing the maximum number of transitions to 5 (the same number of input
symbols).
5. changing the number of opponents to 7 and the number of children to [5..10].
6. a combination of the above modifications.
Splinter was run 10 times for each configuration. Table 3 presents the results
obtained for each run. For example, using Configuration 1 the solution with the
best fitness was found in the generation indicated for the first run. This
configuration presents the worst result, that is, the largest number of generations
needed to find the solution in a run; however, it always finds the best solution.
Configuration 2, which includes modules, does not find the solution in two runs
(marked with a '-' in the table). A zero indicates that the initial population already
presented the best fitness. Better solutions were found by increasing the number of
transitions and the number of opponents, represented in the last rows of the table.
These parameters really influence the result. The best result is found by introducing
all the modifications together. This configuration includes modules. Modularity does
not seem to influence the evolution process in this case.
5 Conclusions
EP is an EC technique that can be used to solve different problems in several
domains. However, for its wide application, often in industrial environments, a
generic framework is necessary. This work contributes in this direction by
describing Splinter, a generic EP framework that is capable of evolving MFSMs.
Splinter supports modularity and all its benefits: decomposition of problems,
reduction of complexity and reuse. Besides this, the structure of Splinter allows
easy and quick configuration for diverse kinds of problems. The evolution kernel,
responsible for the genetic operations, is totally independent of the domain.
The user needs only to provide the configuration file and to write the method
responsible for the fitness function. More expert users can easily override other
methods and even modify the evolution process, if desired.
To validate the implementation of Splinter, we explored its use in two problems:
the tracker task problem, used by Chellapilla and Czarnecki and other authors to
investigate MFSMs, and the sequence of characters problem.
For the first problem a very good result was obtained with MFSMs: only three
of the modular machines were not successful. This result is very similar to the
results found in the literature, in which MFSMs achieve a better performance.
In the second problem, we compare modular and non-modular machines and
investigate the influence of the configuration parameters of Splinter. We obtained
better solutions by increasing the number of transitions and opponents. The
results show that the use of modularity does not seem to influence the evolution
process and does not imply a lower performance. However, new experiments
should be conducted to better evaluate MFSMs.
The preliminary experience with Splinter is very encouraging. Due to its easy
configuration, we are now exploring Splinter in the context of software engineering
to select and evaluate test data for specification models. We also intend to conduct
other experiments with Splinter. These new studies should explore other contexts
and the performance of MFSMs.
References
1. Angeline, P. J. and Pollack, J. Evolutionary module acquisition. Proceedings of the
Sec. Annual Conference on Evolutionary Programming. pp 154-163, 1993.
2. Báck, T. and Urich, H. and Schwefel, H.P. Evolutionary Computation: Comments
on the History and Current State IEEE Trans. on Software Engineering Vol 17(6),
pp 3-17, June, 1991
3. Chellapila K. and Czarnecki, D. A Preliminary Investigation into Evolving
Modular Finite States Machines. Proceedings of the Congress on Evolutionary
Computation- CEC 99. IEEE Press. Vol 2, pp 1349-1356, 6-9 July 1999.
4. Darwin, C. On the Origin of Species by Means of Natural Selection or the Preservation of Favored Races in the Struggle for Life, Murray, London-UK”, 1859.
5. Fogel, D.B. Evolutionary Computation - Toward a New Philosophy of Machine
Intelligence”, IEEE Press, Piscataway, NJ, 1995.
6. Fogel, L.J. On the Organization of Intellect, Ph.D. Dissertation, UCLA-USA 1964.
7. Proceedings of Genetic and Evolutionary Computation Conference, New York-USA
2002, Chicago-USA 2003.
8. Jefferson, D. and et al. Evolution of a Theme in Artificial Life: The Genesys:
Tracker System, Tech. Report, Univ. California, Los Angeles, CA, 1991.
9. Koza, J.R. Genetic Programming II: Automatic Discovery of Reusable Programs
MIT Press, 1994.
10. Ladd, S.R. libevocosm - C++ Tools for Evolutionary Software
http://www.coyotegulch.com/docs/evocosm, February, 2004.
11. Michalewicz, Z. and Michalewicz, M. Evolutionary Computation Techniques and
Their Applications. IEEE International Conf. on Intelligent Processing Systems,
1997.
12. Rodrigues, E. and Pozo, A.R.T. Grammar-Guided Genetic Programming and Automatically Defined Functions. Brazilian Symposium on Artificial Intelligence,
SBIA-2002, Porto de Galinhas, Recife.
13. Rosca, J.P. and Ballard, D.H. Discovery of Sub-routines in Genetic Programming.
Advances in Genetic Programming. pp 177-201. MIT Press, 1996.
14. Shehady, R.K. and Siewiorek, D.P. A Method to Automate User Interface Testing
Using Variable Finite State Machines. Proc. of International Symposium on FaultTolerant Computing -FTCS’97. 25-27, June, Seattle, Washington, USA.
15. Wang, C-J. and Liu, M.T. Axiomatic Test Sequence Generation for Extended Finite State Machines Proc. International Conference on Distributed Computing
Systems. 9-12, June, 1992 pp:252-259.
An Hybrid GA/SVM Approach for Multiclass
Classification with Directed Acyclic Graphs
Ana Carolina Lorena and André C. Ponce de Leon F. de Carvalho
Instituto de Ciências Matemáticas e de Computação (ICMC)
Universidade de São Paulo (USP)
Av. do Trabalhador São-Carlense, 400 - Centro - Cx. Postal 668
São Carlos, São Paulo, Brasil
{aclorena,andre}@icmc.usp.br
Abstract. Support Vector Machines constitute a powerful Machine Learning technique originally proposed for the solution of 2-class problems.
In the multiclass context, many works divide the whole problem into multiple
binary subtasks, whose results are then combined. Following this approach, one
efficient strategy employs a Directed Acyclic Graph in the combination of pairwise
predictors in the multiclass solution. However, its generalization depends on the
graph formation, that is, on its sequence of nodes. This paper introduces the use of
Genetic Algorithms for intelligently searching permutations of nodes in a DAG. The
proposed technique is especially useful in problems with a relatively high number
of classes, where the investigation of all possible combinations would be
extremely costly or even impossible.
Keywords: Support Vector Machines, Directed Acyclic Graphs, Genetic
Algorithms, multiclass classification
1 Introduction
Multiclass classification using Machine Learning (ML) techniques consists of
inducing a function f : X → {1, ..., k} from a dataset composed of pairs (x_i, y_i),
where y_i ∈ {1, ..., k}. Some learning methods are originally binary, being able to
carry out classifications where k = 2. Among these one can mention Support Vector
Machines (SVMs) [2].
To generalize an SVM to multiclass problems, several strategies may be
employed [3, 9, 10, 15]. One common extension consists in generating k(k − 1)/2
classifiers, one for each pair of classes (i, j) with i < j. For combining these
predictors, Platt et al. [15] suggested the use of a Decision Directed Acyclic
Graph (DDAG). Each node of the graph corresponds to one binary classifier,
which decides for class i or class j. Based on this decision, a new node is visited.
In each prediction, k − 1 nodes are visited, so that the final classification is given
by the last node. This technique presents in general fast prediction times
and high accuracies. However, its results depend on the sequence of nodes chosen
to compose the graph. Kijsirikul and Ussivakul [9] point out that this causes
high variances in classification results, affecting the reliability of the algorithm.
Based on this observation, and also on the fact that the DDAG architecture
requires an unnecessarily high number of node evaluations involving the correct
class, these authors presented a new graph-based structure for multiclass prediction
with pairwise SVM classifiers, the Adaptive Directed Acyclic Graph (ADAG) [9]. In
the ADAG the graph structure is adaptive, depending on the predictions made
by previous layers of nodes. Although this new approach showed less variance
on results, there were still differences of accuracy between distinct node configurations in the graph.
The present paper then introduces the use of Genetic Algorithms (GAs),
an intelligent search technique founded on principles of genetics and evolution
[12], for finding the ordering of nodes in a DAG (DDAG or ADAG) based on
its accuracy in solving the overall multiclass problem. The coding scheme and
the definition of the genetic operators were adapted from evolutionary strategies
commonly used in the solution of the traveling salesman problem, in which one
wishes to find an order of cities to be visited at minimum cost. Initial experimental
results indicate
that the GA approach can be efficient in finding good class permutations for
both DDAG and ADAG structures.
This paper is organized as follows: Section 2 briefly describes the Support
Vector Machine technique. Section 3 presents the graph based extensions of
SVMs to multiclass problems. Section 4 introduces the genetic algorithm approach for finding the sequence of nodes in a DAG. Section 5 presents some
experimental results. Section 6 concludes this paper.
2 Support Vector Machines
Support Vector Machines (SVMs) represent a learning technique based on
Statistical Learning Theory [17]. Given a dataset with samples (x_i, y_i), where each
x_i is a data sample and y_i ∈ {−1, +1} corresponds to its label, this technique seeks
a hyperplane w · x + b = 0 able to separate the data with a maximal margin. For
performing this task, it solves the following optimization problem (written here in
its standard soft-margin form):

    minimize    (1/2) ||w||^2 + C Σ_i ξ_i
    subject to  y_i (w · x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,

where C is a constant that imposes a tradeoff between training error and
generalization, and the ξ_i are slack variables. The latter relax the restrictions
imposed on the optimization problem, allowing some patterns to be within the
margins and also some training errors.
In the case a non-linear separation of the dataset is needed, its data samples
are mapped to a high-dimensional space. In this space, also named feature space,
the dataset can be separated by a linear SVM with a low training error. This
mapping process is performed with the use of Kernel functions, which compute
dot products between any pair of patterns in the feature space in a simple way.
Thus, the only modification necessary to deal with non-linearity in SVMs is to
substitute any dot product among patterns by a Kernel function. In this work,
the Kernel function used was a Gaussian, illustrated in Equation 1 (in its usual
form, K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)), where σ is the standard deviation
parameter).
3 Multiclass SVMs with Graphs
As described in the previous section, SVMs are originally formulated for the
solution of problems with two classes (+1 and -1, respectively). For extending
this learning technique to multiclass solutions, one common approach consists
of combining the predictions obtained in multiple binary subproblems [8].
One standard method to do so, called all-against-all (AAA), consists of building
k(k − 1)/2 predictors, each differentiating a pair of classes i and j, with i < j.
For combining these classifiers, Platt et al. [15] suggested the use of
Decision Directed Acyclic Graphs (DDAG).
A Directed Acyclic Graph (DAG) is a graph with oriented edges and no
cycles. The DDAG approach uses the classifiers generated in an AAA manner in
each node of a DAG. Computing the prediction of a pattern using the DDAG is
equivalent to operating a list of classes. Starting from the root node, the sample
is tested against the first and last elements of the list. If the predicted value is
+1, the first class is maintained in the list, while the second class is eliminated.
If the output is -1, the opposite happens. The node equivalent to the first and
last elements of the new list obtained is then consulted. This process continues
until one unique class remains. For k classes, k − 1 SVMs are evaluated at
prediction time.
Figure 1 illustrates an example of DDAG where four classes are present.
It also shows how this DDAG can be implemented with the use of a list, as
described above.
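A sketch of this list-based prediction procedure is shown below; `classify(a, b, x)` stands for the pairwise SVM trained on classes a and b, assumed to return +1 for a and −1 for b.

```python
# DDAG prediction by operating a list of classes: k - 1 binary evaluations.
def ddag_predict(classes, classify, x):
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if classify(first, last, x) == +1:
            remaining.pop()        # keep the first class, eliminate the last
        else:
            remaining.pop(0)       # keep the last class, eliminate the first
    return remaining[0]
```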
Kijsirikul and Ussivakul [9] observed that the DDAG results have dependency
on its sequence of nodes, adversely affecting its reliability. They also pointed out
that, depending on the position of the correct class on the graph, the number
of node evaluations with it is unnecessarily high, resulting in a large cumulative
error. For instance, if the correct class is evaluated at the root node, it will be
tested against the other k − 1 classes before generating a response. If there is a
probability of 1% of misclassification in each node, this will cause a cumulative
error rate of 1 − 0.99^(k−1).
Based on these observations, these authors proposed a new graph architecture, the Adaptive DAG (ADAG) [9]. An ADAG is a DDAG with a reversed
structure. The first layer has ⌈k/2⌉ nodes, followed by ⌈k/4⌉ nodes on the second
layer, and so on, until a layer with one unique node is reached, which outputs
the final class.
In the prediction phase, a pattern is submitted to all binary nodes in the
first layer. These nodes give then outputs of their preferred classes, composing
the next layer. In each round, the number of classes is reduced by half. Like in
DDAG, k − 1 nodes are evaluated in each prediction. However, the correct class
is tested against the others at most ⌈log2 k⌉ times, fewer than in the DDAG, where
this number is (at most) k − 1 times.

Fig. 1. (a) Example of DDAG for a problem with four classes; (b) Implementation of
this DDAG with a list [15]
Figure 2 illustrates an example of ADAG for eight classes. It also shows
how this structure can be implemented with a list. The list is initialized with
a permutation of all classes. A test pattern is evaluated against the first and
last elements of the list. The node’s preferred class is kept in the left element’s
position. The ADAG then tests against the second class and the class before the
last in the list. This process is repeated until one or no class remains untested
in the list. A new round is then initiated, with the list reduced to ⌈k/2⌉ elements.
A total of ⌈log2 k⌉ rounds are made, after which a unique class remains on the list.
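A sketch of this rounds-based procedure is shown below, with the same `classify` convention as in the DDAG sketch; an odd-sized round simply passes the middle class on to the next round.

```python
# ADAG prediction: pair the i-th class with the (n - i + 1)-th one in each round,
# keep the preferred class, and halve the list until a single class remains.
def adag_predict(classes, classify, x):
    remaining = list(classes)
    while len(remaining) > 1:
        next_round = []
        i, j = 0, len(remaining) - 1
        while i < j:
            a, b = remaining[i], remaining[j]
            next_round.append(a if classify(a, b, x) == +1 else b)
            i, j = i + 1, j - 1
        if i == j:                 # odd number of classes in this round
            next_round.append(remaining[i])
        remaining = next_round
    return remaining[0]
```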
Empirically, [9] verified that the ADAG was advantageous especially for
problems with a relatively large number of classes. However, they also pointed out
that, although the ADAG was less dependent on the sequence of nodes in the
graph, its accuracy was still affected by this selection, giving rise to differences for
distinct combinations of classes.
4 GA-Based Approach for Finding Node Sequences
Genetic Algorithms (GAs) are search and optimization techniques based on the
mechanisms of genetics and evolution [14]. They aim to solve a particular problem
by investigating populations of possible solutions (also named individuals).
Through several generations, the population's individuals undergo constant evolution
based on their fitness to solve the problem. In each generation, a new population of individuals is produced by genetic operators. The most common genetic
operators are elitism, that maintains copies of the best individuals in the next
generation, cross-over, which combines the structures of pairs of individuals, and
mutation, which changes the features of selected individuals. The principle of using
various individuals representing possible solutions, allied to the processes of
cross-over and mutation, allows a large search space to be swept in multiple
directions, making GAs a global search technique.

Fig. 2. (a) Example of ADAG for a problem with eight classes; (b) Implementation of
this ADAG with a list [9]
Next, the authors show how GAs were applied in finding node orderings in
a DDAG/ADAG.
Individuals Representation. Since the DDAG and ADAG approaches can be
implemented by operating a list of classes, a vector representation was chosen.
Each individual consists of a list (vector) of integers, representing the classes.
Every class has to be present on the list and no repetitions of classes are allowed.
The task is to find the ordering of these classes that leads to higher accuracies
in the multiclass graph operation.
The adopted representation is similar to the path representation commonly
employed in the solution of the traveling salesman problem (TSP), in which one
wants to find the ordering of cities that have minimum traveling cost [12].
However, it should be noticed that, in the present application, a pair of classes
(i, j) with i < j is equivalent to the pair (j, i). This reduces the size of the search
space for ADAGs and DDAGs with respect to the k! orderings of an unrestricted
permutation problem, but the search space still becomes especially critical for
problems with a relatively high number of classes.
Fitness Function. The fitness of each individual was given by its mean accuracy
in the multiclass solution through cross-validation. The datasets used in the
experiments were divided following the cross-validation methodology [13].
According to this method, the dataset is divided into disjoint subsets of
approximately equal size. In each train/validation round, all subsets but one are
used for training and the remaining one is left for validation, making one pair of
training and validation sets per subset. The accuracy (error) of a classifier on
the total dataset is then given by the average of the accuracies (errors) observed
in each validation partition.
The standard deviation of the accuracies in cross-validation was also considered,
so that among two individuals with the same mean accuracy, the one with the lower
standard deviation was considered better. This was accomplished by subtracting
from each individual's mean accuracy its standard deviation.
Elitism. The elitism operator was applied, keeping in each new generation a
fraction of the best individuals of the current population.
Cross-over. Given the similarity between the present GA application and the
travelling salesman one, the partially-mapped cross-over (PMX) operator [7]
from the TSP literature was considered. This operator is able to preserve much of
the order and position of the parents' classes during recombination, and thus good
parent graph orderings. To do so, when obtaining an offspring it chooses a
subsequence of classes from one parent and maintains the order and position of as
many classes as possible from the other parent [12]. The subsequence is obtained
by choosing two cut points at random.
Since in the ADAG implementation a class in position i of the list pairs with the
class in position n − i + 1, only one random cut point was generated. The second
point was given by its pair following the above rule, so that pairs of classes (the
graph nodes) of the parents could be further preserved.
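A sketch of the standard PMX operator is given below; the adaptation of the second cut point to the ADAG pairing rule described above is not reproduced, and both cut points are simply drawn at random.

```python
# Standard partially-mapped cross-over (PMX) over class permutations.
import random

def pmx(parent1, parent2):
    n = len(parent1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b + 1] = parent1[a:b + 1]          # copy the chosen slice from parent1
    for i in range(a, b + 1):                  # relocate displaced classes of parent2
        cls = parent2[i]
        if cls not in child:
            j = i
            while a <= j <= b:                 # follow the mapping chain out of the slice
                j = parent2.index(parent1[j])
            child[j] = cls
    for i in range(n):                         # fill the remaining positions from parent2
        if child[i] is None:
            child[i] = parent2[i]
    return child
```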
For selection of parents in the cross-over process a tournament matching
mechanism was employed [14]. In selecting a parent through the tournament
procedure, initially two individuals of the population are randomly chosen. A
random number in [0,1] is then generated. If this number is less than a constant,
for example 0.75, the individual with highest fitness is selected. Otherwise, the
one with lowest fitness is chosen.
Mutation. The mutation operator applied was the insertion, also borrowed
from the TSP literature. It consists of selecting a class and inserting it in a
random place in the individual [12]. This operator allows large changes in the
configuration of the graph's nodes.
For each individual undergoing mutation, this operator was applied a fixed
number of times (equal to the individual's size) and the best mutation product
was then chosen, constituting a kind of local search procedure.
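A sketch of this mutation with its repeated-application local search follows; `fitness` stands for the cross-validation measure described earlier.

```python
# Insertion mutation: remove one class and reinsert it at a random position,
# repeated as many times as the individual's length; the best product is kept.
import random

def insertion_mutation(individual, fitness):
    best, best_fit = None, float("-inf")
    for _ in range(len(individual)):
        candidate = list(individual)
        cls = candidate.pop(random.randrange(len(candidate)))
        candidate.insert(random.randrange(len(candidate) + 1), cls)
        score = fitness(candidate)
        if score > best_fit:
            best, best_fit = candidate, score
    return best
```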
5 Experiments
Experiments were conducted with the aim of evaluating the GA based approach
performance in obtaining DDAG and ADAG structures. Three datasets were
employed in these experiments: the UCI dataset for optical recognition of handwritten digits [16], the UCI letter image recognition dataset [16] and a protein
fold recognition dataset [5]. These datasets are described in Table 1. This table
shows, for each dataset, the number of training and test examples, the number
of attributes and the number of classes.
A scaling step was applied to all training datasets, consisting of a normalization of attributes to mean zero and unit variance. The independent test datasets
were also pre-processed according to the normalization factors extracted from
training data.
All experiments with SVMs were conducted with the SVMTorch II tool [1].
The Gaussian Kernel standard deviation parameter was set to 10. Other parameters were kept with default values. Although the best values for the SVM
parameters may differ for each multiclass strategy, they were kept the same to
allow a fair evaluation of the differences between the techniques considered. The
GA and DDAG/ADAG codes were implemented in the Perl language.
For the GA fitness evaluation, the training datasets were divided according to
the cross-validation methodology. To speed up the GA processing, a reduced
number of folds was employed. This procedure was adopted in a stratified
manner, in which each validation partition must have the same class distribution
as the original dataset. In the letter dataset, as such a huge number of examples
would slow the GA processing, only a fraction of it was used in the GA training.
For such, 25 elements of every class were randomly selected to compose each
validation dataset.
Table 2 shows the GA parameters employed in each dataset. It shows the
individuals size (Ind size), the population size (Pop size), elitism rate (Elitism),
cross-over rate (Cross-over), mutation rate (Mutation) and the maximum number of generations the GA is run (#generations). If no improvement could be
observed in the best fitness for 10 generations, the GA was also stopped. To prevent
an early stop, this criterion began to be evaluated only after 20 generations.
After the GA training process (in which the permutation is searched), the best
individual obtained in each case was trained on the whole original dataset and
tested on the independent test dataset. As GA solutions depend on the initial
population provided, a total of 5 runs of the GA were performed and the final
accuracy was then averaged over these runs. In each of these rounds, the same
initial random population was provided for both DDAG and ADAG GA search.
Table 3 presents the results achieved. Best results are detached in boldface.
This table also shows the results of a majority voting (MV) of the pairwise
classifiers' outputs. Following this technique, described in [10], each classifier gives
one vote for its preferred class. The final result is given by the class with most
votes. This method is largely employed in the combination of pairwise classifiers. Nevertheless, it has a drawback. If more than one class receives the same
number of votes, the pattern cannot be classified. The graph integration does
not suffer from this problem and has also the advantage of speeding prediction
time. The numbers of unclassified patterns by MV in each dataset are indicated
in parentheses. The best solutions produced in the GA rounds for ADAG and
DDAG are also shown (B-GA-ADAG and B-GA-DDAG, respectively).
Analyzing the results of Table 3, it can be verified that, although the GA-ADAG showed slightly better mean accuracies, the results of GA-ADAG and
GA-DDAG were similar in all cases. Comparing the performance of the best
GA solutions obtained by GA-ADAG and GA-DDAG in each case with the
McNemar statistical test [4], it is not possible to detect a significant difference
between the results achieved, at 95% of confidence level.
Besides that, the accuracies of the MV approach were inferior to the GA ones
in all datasets. In the optical dataset, the difference of performance between MV
and the B-GA-ADAG solution can be considered statistically significant, at 95%
of confidence. In the letter dataset, the difference of performance among MV
and both B-GA-ADAG and B-GA-DDAG was significant, at 95% of confidence.
In the protein dataset, no statistical significance (at 95% of confidence) was
found among the mean accuracies of these techniques, which showed then similar
results. In all tests conducted, unknown classifications were considered errors in
the computation of the statistics. This represents a deficiency of MV over DDAGs
and ADAGs, which was reflected on the results verified. Anyway, the analysis
presented indicate that the GA-based strategy was able of finding good and
plausible multiclass solutions.
In the optical dataset, the GA-ADAG results were more stable than those of the
GA-DDAG, showing a lower standard deviation. The situation was the opposite
in the letter and protein datasets. In general, however, the GA found similar
results in the distinct rounds, which was reflected in the low standard deviation
rates obtained in the experiments. This suggests a robustness of the proposed
approach.
It was also observed that the graphs generated by the GAs showed good
performance in the distinction of each class composing the multiclass datasets
investigated.
6 Conclusion
This paper presented a novel approach to determine the graph structure in a
Decision Directed Acyclic Graph (DDAG) and an Adaptive Directed Acyclic
Graph (ADAG) for multiclass classification with pairwise SVM predictors. This
can be considered an important task, since the results of these strategies depend
on the sequence of classes in the nodes of the graph, which becomes especially
critical for relatively large numbers of classes. The proposed approach offers an
automatic and structured means of searching for good node permutations in these
sets.
Besides that, the proposed approach is general and can also employ other base
learning techniques generating binary classifiers.
Future experiments succeeding this work should consider modifying the GAs
and (also) the SVMs parameters, since this procedure can improve the results
obtained in the experiments conducted. The GA algorithm can also be further
improved with the definition and introduction of new genetic operators. In a
recent work, Martí et al. [11] analyzed the performance of GAs in the solution of
various permutation problems and suggested that the combination of GAs with a
local search procedure can improve the results achieved by this technique. Since a
simple GA implementation was able to find good class permutations in this work,
its modification with the introduction of a more sophisticated local search strategy
may further improve the results obtained.
Other modifications being considered include using leave-one-out bounds from
the SVM literature [18] in the GA's fitness evaluation. Other works using GAs
in conjunction with SVMs have shown that these bounds can be more effective
in evaluating the SVMs' fitness than a cross-validation methodology (e.g., [6]).
The GA approach proposed could also be extended to provide a model selection mechanism for SVMs, by incorporating the parameters of this technique
in the GA search process.
Acknowledgements
The authors would like to thank the financial support provided by the Brazilian
research councils FAPESP and CNPq.
References
1. Collobert, R., Bengio, S.: SVMTorch: Support Vector Machines for Large Scale
Regression Problems. Journal of Machine Learning Research, Vol. 1 (2001) 143–
160
2. Cristianini, N., Taylor, J. S.: An Introduction to Support Vector Machines. Cambridge University Press (2000)
3. Dietterich, T. G., Bariki, G.: Solving Multiclass Learning Problems via ErrorCorrecting Output Codes. Journal of Artificial Intelligence Research, Vol. 2 (1995)
263–286
4. Dietterich, T. G.: Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, Vol. 10, N. 7 (1998) 1895–1924
5. Ding, C. H. Q., Dubchak, I.: Multi-class Protein Fold Recognition using Support
Vector Machines and Neural Networks. Bioinformatics, Vol. 4, N. 17 (2001) 349–
358
6. Fröhlich, H., Chapelle, O., Schölkopf, B.: Feature Selection for Support Vector
Machines by Means of Genetic Algorithms. Proceedings of 15th IEEE International
Conference on Tools with AI (2003) 142–148
7. Goldberg, D. E., Lingle, R.: Alleles, Loci, and the TSP. Proceedings of the 1st
International Conference on Genetic Algorithms, Lawrence Erlbaum Associates
(1985) 154–159
8. Hsu, C.-W., Lin, C.-J.: A Comparison of Methods for Multi-class Support Vector
Machines. IEEE Transactions on Neural Networks, Vol. 13 (2002) 415–425
9. Kijsirikul,B., Ussivakul,N.: Multiclass Support Vector Machines using Adaptive
Directed Acyclic Graph. Proceedings of International Joint Conference on Neural
Networks (IJCNN 2002) (2002) 980–985
10. Pairwise Classification and Support Vector Machines. In Schölkopf, B., Burges,
C. J. C., Smola, A. J. (eds.), Advances in Kernel Methods - Support Vector
Learning, MIT Press (1999) 185–208
11. Martí, R., Laguna, M., Campos, V.: Scatter Search vs. Genetic Algorithms: An
Experimental Evaluation with Permutation Problems. To appear in Rego, C., Alidaee, B. (eds.), Adaptive Memory and Evolution: Tabu Search and Scatter Search
(2004)
12. Michaelewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs.
Springer-Verlag (1996)
13. Mitchell, T.: Machine Learning. McGraw Hill (1997)
14. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press (1998)
15. Platt, J. C., Cristianini, N., Shawe-Taylor, J.: Large Margin DAGs for Multiclass
Classification. In: Solla, S. A., Leen, T. K., Müller, K.-R. (eds.), Advances in Neural
Information Processing Systems, Vol. 12. MIT Press (2000) 547–553
16. University of California Irvine: UCI benchmark repository - a huge collection of
artificial and real-world datasets. http://www.ics.uci.edu/~mlearn
17. Vapnik, V. N.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
18. Vapnik, V. N., Chapelle, O.: Bounds on Error Expectation for Support Vector
Machines. Neural Computation, Vol. 12, N. 9 (2000)
Dynamic Allocation of Data-Objects in the Web,
Using Self-tuning Genetic Algorithms*
Joaquín Pérez O.1, Rodolfo A. Pazos R.1, Graciela Mora O.2,
Guadalupe Castilla V.2, José A. Martínez2, Vanesa Landero N.2,
Héctor Fraire H.2, and Juan J. González B.2
1 Centro Nacional de Investigación y Desarrollo Tecnológico (CENIDET)
AP 5-164, Cuernavaca, Mor. 62490, México
{jperez,pazos}@sd-cenidet.com.mx
2 Instituto Tecnológico de Ciudad Madero, México
[email protected]
Abstract. In this paper, a new mechanism for automatically obtaining some control parameter values for Genetic Algorithms is presented,
which is independent of problem domain and size. This approach differs
from the traditional methods which require knowing the problem domain
first, and then knowing how to select the parameter values for solving
specific problem instances. The proposed method uses a sample of problem
instances, whose solution allows the problem to be characterized and the parameter
values to be obtained. To test the method, a combinatorial optimization model for
data-object allocation in the Web (known as DFAR) was solved using Genetic
Algorithms. We show how the proposed mechanism allows the development of a set
of mathematical expressions that relate the problem instance size to the control
parameters of the algorithm. The expressions are then used, in an on-line process,
to control the parameter values. We show the latest experimental results with the
self-tuning mechanism applied to solve a sample of random instances that simulates
a typical Web workload. We consider that the principles of the proposed method
can be extended to the self-tuning of control parameters for other heuristic
algorithms.
1 Introduction
A large number of real problems are NP-hard combinatorial optimization problems. These problems require the use of heuristic methods for solving large-size instances. Genetic Algorithms (GAs) constitute an alternative that has been used for solving this kind of problem [1].
A framework frequently used for the study of evolutionary algorithms includes the population, the selection operator, the reproduction operators, and the generation overlap. Each GA component has associated control parameters. The choice of an appropriate parameter setting is one of the most important factors affecting the algorithm's efficiency. Nevertheless, it is a difficult task to
* This research was supported in part by CONACYT and COSNET.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 376–384, 2004.
© Springer-Verlag Berlin Heidelberg 2004
devise an effective control parameter mechanism that achieves an adequate balance between solution quality and processing time. It requires a deep knowledge of the nature of the problem to be solved, which is usually not trivial.
For several years we have been working on the distribution design problem and the design of solution algorithms. We have carried out a large number of experiments with different solution algorithms, and a recurrent problem is the tuning of the algorithm control parameters; hence our interest in incorporating self-tuning mechanisms for parameter adjustment. In [2], we proposed an on-line method to set the control parameters of the Threshold Accepting algorithm. However, with that method we cannot relate the algorithm parameters to the problem size. Now, we want to explore, with genetic algorithms, the off-line automatic configuration of parameters.
2 Related Work
Diverse works try to establish the relationship between the values of the genetic algorithm control parameters and the algorithm performance. The following are some of the most important research works on the application of theoretical results in practical methodologies.
Bäck uses an on-line evolutionary strategy to adjust the parameter values [3]. Mercer and Grefenstette use a genetic meta-algorithm to evolve the control parameter values of another genetic algorithm [4, 5]. Smith uses an equation derived from the theoretical model proposed by Goldberg [6]. Harik uses a prospection-based technique [7] for tuning the population size in an on-line process.
Table 1 summarizes research works on parameter adaptation. It shows the
work reference, applied technique and on-line controlled parameters (population
size P, crossover rate C and mutation rate M).
We propose a new method to obtain relationships between the problem size
and the population size, generation number, and the mutation rate. The process
consists of applying off-line statistical techniques to determine mathematical
expressions for the relationships between the problem size and the parameter
values. With this approach it is possible to tune a genetic algorithm to solve
many instances at a lower cost than using the prospection approach.
3 Proposed Method for Self-tuning GA Parameters
In this work we propose the use of off-line sampling to obtain the relationship between the problem size and the control parameters of a Genetic Algorithm. The self-tuning mechanism is constructed iteratively by solving a set of problem instances and gathering statistics on the algorithm performance to obtain the relationship sought. With this approach it is possible to tune a genetic algorithm for solving many problem instances at a low cost. To automate the configuration of the algorithm control parameters, the following steps are executed iteratively (a code sketch of the whole procedure is given after the list):
Step 1. Record instances. Keep a record of all the instances solved so far with the manually configured GA. For each instance, its size, the configuration used, and the corresponding performance are recorded.
Step 2. Select a representative sample. Obtain a representative sample of recorded instances, each one of a different size. The sample is built considering only the best configuration for each selected instance.
Step 3. Determine correlation functions. Obtain the relationship between the problem size and the algorithm parameters.
Step 4. Feedback. The established relationships reflect the behavior of the recorded instances. When new instances with a different structure occur, the adjustment mechanism can lose effectiveness, so the record is updated and the previous steps are repeated.
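A minimal sketch of this off-line procedure, assuming a hypothetical record structure and illustrative function names (none of them come from the paper):

```python
import math

# Hypothetical historical record: one entry per solved instance and configuration.
# Fields follow Table 2: size S, population P, generations G, crossover C,
# mutation M, best solution B, execution time T.
record = []

def record_instance(size, config, best, time):
    """Step 1: keep every manually configured run."""
    record.append({"S": size, **config, "B": best, "T": time})

def select_sample(record):
    """Step 2: one entry per distinct size, keeping its best configuration."""
    best_by_size = {}
    for entry in record:
        s = entry["S"]
        if s not in best_by_size or entry["B"] < best_by_size[s]["B"]:
            best_by_size[s] = entry
    return list(best_by_size.values())

def fit_parameter(sample, key):
    """Step 3: least-squares fit of parameter = a + b*log(S).

    The logarithmic model is only one illustrative choice; the sample
    needs at least two distinct sizes.
    """
    xs = [math.log(e["S"]) for e in sample]
    ys = [e[key] for e in sample]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda size, a=a, b=b: a + b * math.log(size)
```

Under this sketch, step 4 simply re-runs `select_sample` and `fit_parameter` whenever new, better manually tuned runs are appended to the record.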
The proposed method allows advancing toward an optimal parameter configuration with an iterative and systematic approach. An important advantage of this method is that the experimental cost is reduced gradually: we can start with an initial set of solved instances and continue adding new solved instances. In the next section we describe an application problem to explain some details of the method.
4 Application Problem
To test the method, a combinatorial optimization model for data-object allocation in the Web (known as DFAR) was solved using Genetic Algorithms. We show how the proposed method allows developing a set of mathematical expressions that relate the problem instance size to the control parameters of the algorithm. In this section we describe the distribution design problem and the DFAR mathematical model.
4.1 Problem Description
Traditionally it has been considered that distributed database (DDB) distribution design consists of two sequential phases. Contrary to this widespread belief, it has been shown that it is simpler to solve the problem using our approach, which
combines both phases [8]. In order to describe the model and its properties, the following definition is introduced:
DB-object: an entity of a database that requires allocation, which can be an attribute, a relation, or a file. DB-objects are independent units that must be allocated to the sites of a network.
The DDB distribution design problem consists of allocating DB-objects such that the total cost of data transmission for processing all the applications is minimized. New allocation schemas should be generated that adapt to changes in a dynamic query processing environment, thus preventing system degradation. A formal definition of the problem is the following:
Fig. 1. Distribution Design Problem
Assume there are a set of DB-objects, a computer communication network consisting of a set of sites where a set of queries is executed, the DB-objects required by each query, an initial DB-object allocation schema, and the access frequencies of each query from each site in a given time period. The problem consists of obtaining a new allocation schema that adapts to a new database usage pattern and minimizes transmission costs. Figure 1 shows the main elements related to this problem.
4.2 Objective Function
The integer (binary) programming model consists of an objective function and four intrinsic constraints. The decision about storing a DB-object m at a site is represented by a binary variable that takes the value 1 if m is stored at that site and 0 otherwise.
The objective function (1) models costs using four terms: 1) the transmission cost incurred for processing all the queries, 2) the cost of accessing multiple remote DB-objects required for query processing, 3) the cost of DB-object storage in network sites, and 4) the transmission cost of migrating DB-objects between nodes.
where the model parameters are:
the emission frequency of each query from each site during a given period of time;
a usage parameter, equal to 1 if a query uses a DB-object and 0 otherwise;
the number of packets for transporting a DB-object for a query;
the communication cost between each pair of sites;
the cost for accessing several remote DB-objects for processing a query, together with an indicator of whether a query accesses one or more DB-objects located at a given site;
the cost for allocating DB-objects in a site, together with an indicator of whether there exist DB-objects at a given site;
an indicator of whether a DB-object was previously located in a site, and the number of packets required for moving a DB-object to another site.
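As a rough sketch only, the four-term cost has the following shape; the notation here is illustrative and not the paper's own ($x_{mj}$ is the binary allocation variable, $f_{ki}$ the emission frequency of query $k$ from site $i$, $q_{km}$ the usage parameter, $p_{km}$ the packet count, $c_{ij}$ the communication cost, $y_{kj}$ and $z_j$ the remote-access and storage indicators with unit costs $c_a$ and $c_s$, $a_{mi}$ the previous-location indicator, and $d_m$ the migration packet count):

$$
\min \; \sum_{k,i,m,j} f_{ki}\, q_{km}\, p_{km}\, c_{ij}\, x_{mj}
\;+\; c_a \sum_{k,j} y_{kj}
\;+\; c_s \sum_{j} z_{j}
\;+\; \sum_{i,m,j} a_{mi}\, x_{mj}\, d_m\, c_{ij} .
$$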
4.3 Intrinsic Constraints of the Problem
The model solutions are subject to four constraints: each DB-object must be stored in one site only; each DB-object must be stored in a site that executes at least one query that uses it; a constraint determines, for each query, where the required DB-objects are located; and a constraint determines whether each site contains DB-objects. The detailed formulation of the constraints can be found in [2, 8].
5 Implementation
In this section we present some application examples of the proposed method,
using the DDB design problem.
5.1 Record Instances
Table 2 shows four entries of the historical record. Each entry corresponds to an instance solved using a manually configured GA. Columns 1 and 2 contain the instance identifier I and the instance size S in bytes. Columns 3-6 show the configuration of four GA parameters (population size P, generation number G, crossover rate C, and mutation rate M). Columns 7 and 8 present the algorithm performance (the best solution B found by the GA and the execution time T in seconds).
Table 2 shows the best solutions that were obtained with the specified configurations.
5.2 Select a Representative Sample
Table 3 presents an example of a sample of instances of different size extracted
from the record, where column headings have the same meaning as those of
Table 2. For each selected instance only its best configuration is included in the
sample.
5.3 Determine Correlation Functions
Population Correlation Functions. To find the relationship between the problem size and the population size we used two techniques: statistical regression and estimation based on proportions. Three mathematical expressions (2, 3, 4) were constructed to determine the population size P as a function of the problem size S. The expressions contain coefficients derived from the linear and logarithmic statistical estimates and a constant of proportionality.
The proportional estimate was adjusted to obtain the best estimation; as a result of this fine adjustment, the corresponding adjustment factors were defined. Figure 2 shows the graphs of the real data and the adjusted proportional estimate.
Fig. 2. Correlation functions graphs
Correlation Functions for the Generation Number and Mutation Rate. Similarly, the relationships between the problem size and the number of generations and the mutation rate were determined. Expressions (6, 7) specify the relationship between the instance size and these algorithm parameters. In these expressions, G is the number of generations, M is the mutation rate, and an adjustment factor is applied.
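As an illustration of how such expressions could be applied, a configuration routine might look like the sketch below; the functional form and all coefficients are placeholders, not the values fitted in the paper:

```python
import math

def configure_ga(instance_size, a_p=1.0, b_p=1.0, a_g=1.0, b_g=1.0,
                 a_m=0.05, b_m=0.0):
    """Compute GA control parameters from the instance size S (bytes).

    The linear/logarithmic forms mirror the role of expressions (2)-(7);
    the coefficients here must be replaced by the fitted ones.
    """
    S = instance_size
    P = max(2, int(a_p + b_p * math.log(S)))   # population size
    G = max(1, int(a_g + b_g * math.log(S)))   # number of generations
    M = min(1.0, max(0.0, a_m + b_m / S))      # mutation rate
    return {"P": P, "G": G, "M": M, "C": 0.7}  # crossover kept fixed (assumption)

# Example: configure the GA for a 50 KB instance before running it on-line.
params = configure_ga(50_000)
```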
As observed, the parameter tuning mechanism is defined using an off-line procedure. The evaluation and subsequent use of this mechanism are carried out on-line. In this example, for the evaluation of the mechanism, a comparative experiment was carried out using a GA configured manually, according to the recommendations proposed in the literature, and our self-tuning GA. To carry
out the evaluation, a sample of 14 random instances was solved using both algorithms. The instances were created to simulate a typical Web workload, in which 20% of the queries access 80% of the data-objects and 80% of the queries access only 20% of the data-objects. The solution quality improvement percentage is calculated as the reduction of the objective value with respect to the solution obtained by the GA configured using the literature recommendations. Figure 3 shows the graph of the solution improvement percentage for the 14 random instances ordered by size. The graph shows that the self-tuning mechanism tends to obtain better results in the range of large-scale instances.
Fig. 3. Improvement of solution quality percentage
5.4 Feedback
Since the tuning mechanism requires periodic refinement, the performance of the automatically configured GA can be compared against other algorithms when solving new instances. If for some instance another algorithm is superior, the GA is configured manually to equal or surpass the performance of that algorithm. The instance and its different configurations are then added to the historical record and the process is repeated from step 2 through step 4. Hence the experimental cost is relatively low, because it takes advantage of all the experimental results stored in the historical record.
6 Conclusions and Future Work
In this work, we propose a new method to obtain relationships between the problem size and the population size, the generation number, and the mutation rate. The process consists of applying off-line statistical techniques to determine mathematical expressions for these relationships. The mathematical expressions are used on-line to control the values of the algorithm parameters. With this approach it is possible to tune a genetic algorithm to solve many problem instances at a lower cost than with other approaches.
We present a genetic algorithm configured with mathematical expressions designed with the proposed method, which was able to obtain better solutions than the algorithm configured according to the literature. The self-tuning mechanism exhibits a tendency to obtain better results in the range of large-scale instances. To test the method, a mathematical model for dynamic allocation of data-objects in the Web (known as DFAR) was solved using both algorithms with typical Web workloads.
Currently the self-tuning GA is being tested for solving a new model of the
DDB design problem that incorporates data replication, and the preliminary
results are encouraging.
References
1. Fogel, D., Ghozeil, A.: Using Fitness Distributions to Design More Efficient Evolutionary Computations. Proceedings of the 1996 IEEE Conference on Evolutionary
Computation, Nagoya, Japan. IEEE Press, Piscataway N.J. (1996) 11-19
2. Pérez, J., Pazos, R.A., Velez, L., Rodriguez, G.: Automatic Generation of Control Parameters for the Threshold Accepting Algorithm. Lecture Notes in Computer
Science, Vol. 2313. Springer-Verlag, Berlin Heidelberg New York (2002) 119-127.
3. Bäck, T., Schwefel, H.-P.: Evolution Strategies I: Variants and their computational implementation. In: Winter, G., Périaux, J., Galán, M., Cuesta, P. (eds.): Genetic
Algorithms in Engineering and Computer Science. Chichester: John Wiley and
Sons. (1995) Chapter 6, 111-126
4. Mercer, R.E., Sampson, J.R.: Adaptive Search Using a Reproductive Meta-plan.
Kybernetes 7 (1978) 215-228
5. Grefenstette, J.J.: Optimization of Control Parameters for Genetic Algorithms. In:
Sage, A.P. (ed.): IEEE Transactions on Systems, Man and Cybernetics, Volume
SMC-16(1). New York: IEEE (1986) 122-128
6. Smith, R.E., Smuda, E.: Adaptively Resizing Population: Algorithm Analysis and
First Results. Complex Systems 9 (1995) 47-72
7. Harik, G.R., Lobo, F.G.: A parameter-less Genetic Algorithm. In: Banzhaf, W.,
Daida, J., Eiben, A.E., Garzon, M.H., Honavar, V., Jakiela, M., Smith, R.E. (eds.): Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99. San Francisco, CA: Morgan Kaufmann (1999) 258-267
8. Pérez, J., Pazos, R.A., Romero, D., Santaolaya, R., Rodríguez, G., Sosa, V.: Adaptive and Scalable Allocation of Data-Objects in the Web. Lecture Notes in Computer Science, Vol. 2667. Springer-Verlag, Berlin Heidelberg New York (2003) 134-143
Detecting Promising Areas
by Evolutionary Clustering Search
Alexandre C.M. Oliveira1,2 and Luiz A.N. Lorena2
1
Universidade Federal do Maranhão - UFMA, Departamento de Informática
S. Luís MA, Brasil
[email protected]
2
Instituto Nacional de Pesquisas Espaciais - INPE
Laboratório Associado de Computação e Matemática Aplicada
S. José dos Campos SP, Brasil
[email protected]
Abstract. A challenge in hybrid evolutionary algorithms is to define efficient strategies to cover all the search space, applying local search only in actually promising search areas. This paper proposes a way of detecting promising search areas based on clustering. In this approach, an iterative clustering works simultaneously with an evolutionary algorithm, accounting for the activity (selections or updates) in search areas and identifying which of them deserve special interest. The search strategy becomes more aggressive in the detected areas by applying local search. A first application to unconstrained numerical optimization is developed, showing the competitiveness of the method.
Keywords: Hybrid evolutionary algorithms; unconstrained numerical optimization
1 Introduction
In the hybrid evolutionary algorithm scenario, inspiration from nature has been pursued to design flexible, coherent, and efficient computational models. The main focus of such models is real-world problems, considering the known limited effectiveness of canonical genetic algorithms (GAs) in dealing with them. Investments have been made in new methods in which the evolutionary process is only part of the whole search process. Due to their intrinsic features, GAs are employed as a generator of promising search areas (search subspaces), which are more intensively inspected by a heuristic component. This scenario reinforces the parallelism of evolutionary algorithms.
Promising search areas can be detected by fit or frequency merits. With fit merits, the fitness of the solutions can be used to say how good their neighborhoods are. With frequency merits, on the other hand, the evolutionary process naturally privileges the good search areas by sampling them more intensively. Figure 1 shows the 2-dimensional contour map of a test function known as Langerman. The points are candidate solutions over the fitness surface in a typical simulation. One can note their agglomeration over the promising search areas.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 385–394, 2004.
© Springer-Verlag Berlin Heidelberg 2004
Fig. 1. Convergence of typical GA into fitter areas
The main difficulty of GAs is a lack of exploitation moves. Some promising search areas are not found or, even when found, are not consistently exploited. The natural convergence of GAs also contributes to losing the reference to promising search areas, resulting in poor performance.
Local search methods have been combined with GAs in different ways to solve multimodal numerical functions more efficiently. Gradient as well as direct search methods have been employed as exploitation tools. In the Simplex Genetic Algorithm Hybrid [1], a probabilistic version of the Nelder and Mead Simplex [2] is applied to the elite of the population. In [3], good results are obtained just by applying a conjugate gradient method as a mutation operator, with a certain probability. In the Population-Training Algorithm [4], improvement heuristics are employed in the fitness definition, guiding the population to settle down in search areas where the individuals cannot be improved by such heuristics. All those approaches report an increase in function calls that can be prohibitive in the optimization of computationally complex functions.
The main challenge in such hybrid methods is the definition of efficient strategies to cover all the search space, applying local search only in actually promising areas. Elitism plays an important role towards achieving this goal, since the best individuals represent such promising search areas, a priori. But the elite of the population can be concentrated in a few areas, and thus the exploitation moves are not rationally applied.
More recently, a different strategy was proposed to employ local search more rationally: the Continuous Hybrid Algorithm (CHA) [5]. The evolutionary process runs normally until a promising search area is detected. The promising area is detected when the largest distance between the best individual and the other individuals of the population is smaller than a given radius, i.e., when population diversity is lost. Thereafter, the search domain is reduced, an initial simplex is built inside the area around the best found individual, and a local search based upon the Nelder and Mead Simplex is started. With respect to the detection of promising areas, the CHA has a limitation: the exploitation is started once, after diversity loss, and the evolutionary process cannot be continued afterwards unless a new population is created.
Another approach attempting to find relevant areas for numerical optimization is called UEGO by its authors. UEGO is a parallel hill climber, not an evolutionary algorithm. The separated hill climbers work in restricted search areas (or clusters) of the search space. The volume of the clusters decreases as the search proceeds, which results in a cooling effect similar to simulated annealing [6]. UEGO does not work as well as CHA for high-dimensional functions.
Several evolutionary approaches have evoked the concept of species when dealing with the optimization of multimodal and multiobjective functions [6], [7]. The basic idea is to divide the population into several species according to their similarity. Each species is built around a dominating individual, staying in a delimited area.
This paper proposes an alternative way of detecting promising search areas based on clustering. This approach is called Evolutionary Clustering Search (ECS). In this scenario, groups of individuals (clusters) with some similarities (for example, individuals inside a neighborhood) are represented by a dominating individual. The interaction between inner individuals determines some kind of exploitation move in the cluster. The clusters work as sliding windows, framing the search areas. Groups of mutually close points hopefully correspond to relevant areas of attraction. Such areas are exploited as soon as they are discovered, not at the end of the process. An improvement in convergence speed is expected, as well as a decrease in computational effort, by applying local optimizers rationally.
The remainder of this paper is organized as follows. Section 2 describes the basic ideas and conceptual components of ECS. An application to unconstrained numerical optimization is presented in Section 3, as well as the experiments performed to show the effectiveness of the method. The findings and conclusions are summarized in Section 4.
2 Evolutionary Clustering Search
The Evolutionary Clustering Search (ECS) employs clustering for detecting promising areas of the search space. It is particularly interesting to find such areas as soon as possible in order to change the search strategy over them. An area can be seen as an abstract search subspace defined by a neighborhood relationship in genotype space.
The ECS attempts to locate promising search areas by framing them with clusters. A cluster can be defined as a tuple composed of the center and the radius of the area, together with a search strategy associated with the cluster. The radius of a search area is the distance from its center to its edge.
Initially, the center is obtained randomly and progressively it tends to slip toward the really promising points in the close subspace. The total cluster volume is defined by the radius and can be calculated considering the nature of the problem. The important point is that the radius must define a search subspace suitable to be exploited by aggressive search strategies.
In numerical optimization, it is possible to define the radius in a way that all the search space is covered, depending on the maximum number of clusters. In combinatorial optimization, the radius can be defined as a function of some distance metric, such as the number of moves needed to change a solution inside a neighborhood. Note that the neighborhood, in this case, must also be related to the search strategy of the cluster. The search strategy is a kind of local search to be employed inside the clusters, considering the center and radius parameters. The appropriate conditions for triggering it are related to the search area becoming promising.
2.1 Components
The main ECS components are conceptually described here; details of implementation are explained later. The ECS consists of four conceptually independent parts: (a) an evolutionary algorithm (EA); (b) an iterative clustering (IC); (c) an analyzer module (AM); and (d) a local searcher (LS). Figure 2 shows the ECS conceptual design.
Fig. 2. ECS components
The EA works as a full-time solution generator. The population evolves independently of the remaining parts: individuals are selected, crossed over, and updated for the next generations. This entire process works like an infinite loop, in which the population is modified along the generations.
The IC aims to gather similar information (solutions represented by individuals) into groups, maintaining a representative solution associated with this information, named the center of the cluster. The term information is used here because the individuals are not directly grouped; rather, the similar information they represent is. Any candidate solution that is not part of the population is called information. To avoid extra computational effort, the IC is designed as an iterative process that forms groups by reading the individuals being selected or updated by the EA. A similarity degree, based upon some distance metric, must be defined a priori to allow the clustering process.
The AM provides an analysis of each cluster, at regular intervals of generations, indicating probable promising clusters. Typically, the density of the cluster is used in this analysis, that is, the number of selections or updates that have recently happened. The AM is also responsible for eliminating the clusters with lower densities.
Finally, the LS is an internal searcher module that provides the exploitation of a supposedly promising search area framed by a cluster. This process can happen after the AM has discovered a target cluster, or it can be a continuous process, inherent to the IC, performed whenever a new point is grouped.
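A minimal sketch of how these four components could interact; all function arguments are placeholders for problem-specific operators and are not part of the original formulation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Cluster:
    center: list
    density: int = 0

def ecs_loop(population: List[list],
             evaluate: Callable[[list], float],
             select: Callable[[List[list]], tuple],
             recombine: Callable[[list, list], list],
             assimilate: Callable[[List[Cluster], list], Cluster],
             local_search: Callable[[list], list],
             max_gen: int = 1000,
             density_threshold: int = 10,
             cooling_interval: int = 20) -> list:
    """Sketch: the EA generates points, the IC groups them, the AM watches
    cluster densities, and the LS exploits clusters that become active."""
    clusters: List[Cluster] = []
    for gen in range(max_gen):
        # EA: steady-state step, the offspring replaces the worst individual.
        p1, p2 = select(population)
        child = recombine(p1, p2)
        worst = max(range(len(population)), key=lambda i: evaluate(population[i]))
        population[worst] = child
        # IC: group the new point into a cluster (or create a new one).
        cluster = assimilate(clusters, child)
        cluster.density += 1
        # AM + LS: a sufficiently dense cluster triggers local search at once.
        if cluster.density >= density_threshold:
            cluster.center = local_search(cluster.center)
            cluster.density = 0
        # AM: periodic cooling, dropping clusters that stayed inactive.
        if (gen + 1) % cooling_interval == 0:
            clusters[:] = [c for c in clusters if c.density > 0]
            for c in clusters:
                c.density = 0
    return min(population, key=evaluate)
```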
2.2 The Clustering Process
The clustering process described here is based upon Yager's work, which says that a system can learn about an external environment with the participation of the previously learned beliefs of the system itself [8], [9]. The IC is the core of ECS, working as an information classifier, keeping only relevant information in the system, and driving the search intensification in the promising search areas. To avoid propagation of unnecessary information, the local search is performed without generating other points, keeping the population diversified. In other words, the clusters concentrate all the information necessary to exploit the framed search areas.
All information generated by the EA (individuals) passes through the IC, which attempts to group it with known information, according to a distance metric. If the information is considered sufficiently new, it is kept as the center of a new cluster. Otherwise, redundant information activates a cluster, causing some kind of perturbation in it. This perturbation means an assimilation process, in which the knowledge (the center of the cluster) is updated by the innovative received information.
The assimilation process is applied over the center, considering the newly generated individual. It can be done by: (a) a random recombination between the center and the new individual; (b) a deterministic move of the center in the direction of the new individual; or (c) samples taken between the center and the new individual. Assimilation types (a) and (b) generate only one internal point to be evaluated afterwards. Assimilation type (c), instead, can generate several internal points or even external ones, holding the best evaluated one as the new center, for example. It seems to be advantageous, but it is clearly costly. These exploratory moves are commonly referred to in path relinking theory [10].
Whenever a cluster reaches a certain density, meaning that some information template has become predominantly generated by the evolutionary process, such an information cluster must be better investigated to accelerate the convergence process in it. The cluster activity is measured at regular intervals of generations. Clusters with lower density are eliminated, as part of a mechanism that allows creating other centers of information, keeping the most active ones framed. The cluster elimination does not affect the population; only the center of information is considered irrelevant for the process.
3 ECS for Unconstrained Numerical Optimization
A real-coded version of ECS for unconstrained numerical optimization is presented in this section. Several test functions related to such problems can be found in the literature. Their general presentation is:
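In the standard bound-constrained form commonly used for these benchmarks (a general statement, not copied from the paper), this reads

$$
\min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad l_i \le x_i \le u_i, \qquad i = 1, \dots, n,
$$

where $l_i$ and $u_i$ are the lower and upper bounds of variable $x_i$.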
In test functions, the upper and lower bounds are defined a priori and are part of the problem, bounding the search space over the challenging areas of the function surface. This work uses some well-known test functions, such as Michalewicz, Langerman, Shekel [11], Rosenbrock, Sphere [12], Schwefel, Griewank, and Rastrigin [13]. Table 1 shows all test functions with their respective known optimal solutions and bounds.
3.1 Implementation
The application details are now described, clarifying the approach. The component EA is a steady-state real-coded GA employing well-known genetic operators: roulette wheel selection [14], blend crossover (BLX0.25) [15], and non-uniform mutation [16]. Briefly, in each generation a fixed number of individuals are selected, crossed over, mutated, and updated in the same original population, replacing the worst individual (steady-state updating). Parents and offspring are always competing against each other and the entire population tends to converge quickly.
The component IC performs an iterative clustering of each selected individual. A maximum number of clusters must be defined a priori. Each cluster has its own center, but a common radius is calculated in each generation for all clusters as a function of the current number of clusters and of the known upper and lower bounds of the variable domain, considering that all variables have the same domain.
Whenever a selected individual is far away from all centers (a distance above the common radius), a new cluster must be created. Evidently, the maximum number of clusters is a bound value that prevents unlimited cluster creation, but this is not a problem because the clusters can slip along the search space.
The cluster assimilation is a foreseen step that can be implemented by different techniques. The selected individual and the center of the cluster it belongs to are the participants of the assimilation process, through some operation that uses the new information to cause a change in the cluster location. In this work, the cluster assimilation moves the center toward the selected individual by a fraction given by a disorder degree associated with the assimilation process. In this application, the centers are kept conservative with respect to new information.
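A minimal sketch of a conservative assimilation of this kind, assuming the center is moved a small fraction (the disorder degree, here `beta`) toward the newly grouped individual; the value of `beta` is illustrative:

```python
def assimilate_center(center, individual, beta=0.05):
    """Move the cluster center a fraction beta of the way toward the new point.

    beta plays the role of the disorder degree: a small value keeps the center
    conservative with respect to new information (illustrative choice).
    """
    return [c + beta * (x - c) for c, x in zip(center, individual)]
```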
These choices are due to computational requirements. Complex clustering algorithms could make ECS a slow solver for high-dimensional problems. Considering the Euclidean distance calculated for each cluster, for an n-dimensional problem the IC complexity grows with the number of clusters and with n.
At the end of each generation, the component AM performs the cooling of all clusters, i.e., their density counts are reset. Eventually some (or all) clusters can be re-heated by selections or become inactive, being eliminated thereafter by the AM. A cluster is considered inactive when no selection has occurred in the last generation. This mechanism is used to eliminate clusters that have lost importance along the generations, allowing other search areas to be framed. The AM is also evoked whenever a cluster is activated; it starts the component LS at once if the cluster density exceeds a threshold.
The pressure of density allows controlling the sensitivity of the component AM. It represents the desirable cluster density beyond the normal density that would be obtained if the selections were equally divided among all clusters. In this application, satisfactory behavior was obtained with fixed settings for these parameters.
The component LS was implemented by a Hooke and Jeeves direct search (HJD) [17]. The HJD is an early 1960s method that presents some interesting features: excellent convergence characteristics, low memory storage, and only basic mathematical calculations. The method works with two types of moves. At each iteration there is an exploratory move with one discrete step size per coordinate direction. Supposing that the line connecting the first and last points of the exploratory move represents an especially favorable direction, an extrapolation is made along it before the variables are varied again individually. Its efficiency depends decisively on the choice of the initial step sizes; in this application, the initial step size was set to 5% of the initial radius.
The Nelder and Mead Simplex (NMS) has been more widely used as a numerical parameter optimization procedure. For few variables the simplex method is known to be robust and reliable, but its main drawback is its cost; moreover, there are several parameter vectors to be stored. According to the authors, the number of function calls increases approximately quadratically with the number of variables, but these numbers were obtained only for few variables [2]. On the other hand, the HJD is less expensive: Hooke and Jeeves found empirically that the number of function evaluations increases only linearly with the dimension [17].
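For reference, a compact version of the HJD pattern search under the exploratory-move/pattern-move scheme described above; the stopping rule and step-reduction factor are illustrative choices, not the settings used in this paper:

```python
def hooke_jeeves(f, x0, step, shrink=0.5, tol=1e-6, max_iter=10_000):
    """Hooke and Jeeves direct search: exploratory moves per coordinate,
    followed by a pattern (extrapolation) move along the improving direction."""
    def explore(base, fx, s):
        x = list(base)
        for i in range(len(x)):
            for delta in (s, -s):
                trial = list(x)
                trial[i] += delta
                ft = f(trial)
                if ft < fx:
                    x, fx = trial, ft
                    break
        return x, fx

    x, fx = list(x0), f(x0)
    for _ in range(max_iter):
        if step < tol:
            break
        y, fy = explore(x, fx, step)
        if fy < fx:
            # Pattern move: extrapolate along x -> y, then re-explore around it.
            pattern = [yi + (yi - xi) for xi, yi in zip(x, y)]
            z, fz = explore(pattern, f(pattern), step)
            if fz < fy:
                y, fy = z, fz
            x, fx = y, fy
        else:
            step *= shrink  # no improvement: reduce the step size
    return x, fx
```

For example, `hooke_jeeves(lambda x: sum(v * v for v in x), [1.0, -2.0], step=0.1)` converges toward the origin of the Sphere function.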
3.2 Computational Experiments
The ECS was coded in ANSI C and run on an AMD (1.33 GHz) platform. The population size was varied in {10, 30, 100}, depending upon the problem size. The maximum number of clusters was set to 20 for all test functions. In the first experiment, ECS is compared against two other approaches well known in the literature: Genocop III [16] and the OptQuest Callable Library (OCL) [18]. Genocop III is the third version of a genetic algorithm designed to search for optimal solutions
in optimization problems with real-coded variables and linear and nonlinear constraints. The OCL is a commercial software library designed for optimizing complex systems, based upon the metaheuristic framework known as scatter search [10]. Both approaches were run using the default values that the systems recommend, and their results shown in this work were taken from [18].
The results in Table 2 were obtained, over 20 trials, allowing ECS to perform 10,000 function evaluations, in the same way that Genocop III and OCL were tested. The average of the best solutions found (FS) and the average number of function calls (FC) were considered to compare the algorithm performances. The average execution time in seconds (ET) is only illustrative, since the platforms used are not the same. The values in bold indicate which procedure yields the solution with the better objective function value for each problem. Note that ECS found better solutions in two test functions, while OCL and Genocop III each had better results in one function.
In the second experiment, ECS is compared against another approach found in the literature that works with the same idea of detecting promising search areas: the Continuous Hybrid Algorithm (CHA), briefly described in the introduction. The CHA results were taken from [5], where the authors worked with test functions of several dimensions; the most challenging of them are used for comparison in this work. The results in Table 3 were obtained allowing ECS to perform up to 100,000 function evaluations in each one of the 20 trials. There is no information about the corresponding CHA bound. The average of the gaps between the solution found and the best known one (GAP) and the average number of function calls (FC) were considered to compare the algorithm performances, besides the success rate (SR) obtained. In the ECS experiments, the SR reflects the percentage of trials that reached a gap of at most 0.001. The SR obtained in the CHA experiments is not a classical one, according to the authors, because it considers the actual landscape of the function at hand [5].
One can observe that ECS seems to be better than CHA in all test functions shown in Table 3, except for Zakharov, for which ECS did not find the best known solution. It is known that the 2-dimensional Zakharov function is a monomodal one with the minimum lying at a corner of a wide plain. Nevertheless, no reason was found for such poor performance. For the Shekel function, although ECS found better gaps, the success rate is not as good as CHA's. The values in bold indicate in which aspects ECS was worse than CHA.
Other results obtained by ECS are shown in Table 4. The gap of 0.001 was reached a certain number of times for all these functions. The worst performance was on the Michalewicz and Langerman functions (SR about 65%).
4 Conclusion
This paper proposes a new way of detecting promising search areas based upon clustering. The approach is called Evolutionary Clustering Search (ECS). The ECS attempts to locate promising search areas by framing them with clusters. Whenever a cluster reaches a certain density, its center is used as the starting point of some aggressive search strategy.
An ECS application to unconstrained numerical optimization is presented, employing a steady-state genetic algorithm, an iterative clustering algorithm, and a local search based upon the Hooke and Jeeves direct search. Some experiments are presented, showing the competitiveness of the method. The ECS was compared with other approaches taken from the literature, including the well-known Genocop III and the OptQuest Callable Library.
For further work, it is intended to perform more tests on other benchmark functions. Moreover, heuristics and distance metrics for discrete search spaces are being studied, aiming to build applications in combinatorial optimization.
References
1. Yen, J., Lee, B.: A Simplex Genetic Algorithm Hybrid, In: IEEE International
Conference on Evolutionary Computation - ICEC 97 (1997) 175-180.
2. Nelder, J.A., Mead, R.: A simplex method for function minimization. Computer
Journal (1965) 7:308-313.
3. Birru, H.K., Chellapilla, K., Rao, S.S.: Local search operators in fast evolutionary
programming. Congress on Evolutionary Computation,(1999)2:1506-1513.
4. Oliveira, A.C.M., Lorena, L.A.N.: Real-Coded Evolutionary Approaches to Unconstrained Numerical Optimization. Advances in Logic, Artificial Intelligence and
Robotics. Jair Minoro Abe and João I. da Silva Filho (Eds). Plêiade, ISBN:
8585795778. (2002)2.
5. Chelouah, R., Siarry, P.: Genetic and Nelder-Mead algorithms hybridized for a
more accurate global optimization of continuous multiminima functions. Euro.
Journal of Operational Research, (2003)148(2):335-348.
6. Jelasity, M., Ortigosa, P., García, I.: UEGO, an Abstract Clustering Technique for
Multimodal Global Optimization, Journal of Heuristics (2001)7(3):215-233.
7. Li, J.P., Balazs, M.E., Parks, G.T., Clarkson, P.J.: A species conserving genetic algorithm for multimodal function optimization, Evolutionary Computation,
(2002)10(3):207-234.
8. Yager, R.R.: A model of participatory learning. IEEE Trans. on Systems, Man and
Cybernetics, (1990)20(5)1229-1234.
9. Silva, L.R.S. Aprendizagem Participativa em Agrupamento Nebuloso de Dados,
Dissertation, Faculdade de Engenharia Elétrica e de Computação, Unicamp, Campinas SP, Brasil (2003).
10. Glover, F., Laguna, M., Martí, R.: Fundamentals of scatter search and path relinking. Control and Cybernetics, (2000) 39:653-684.
11. Bersini, H., Dorigo, M., Langerman, S., Seront G., Gambardella, L.M.: Results of
the first international contest on evolutionary optimisation - 1st ICEO. In: Proc.
IEEE-EC96. (1996)611-615.
12. De Jong, K.A.: An analysis of the behavior of a class of genetic adaptive systems,
Ph.D. dissertation, University of Michigan Press, Ann Arbor, 1975.
13. Digalakis, J., Margaritis, K.: An experimental study of benchmarking functions for
Genetic Algorithms. IEEE Systems Transactions (2000) 3810-3815.
14. Goldberg, D.E.: Genetic algorithms in search, optimisation and machine learning.
Addison-Wesley, (1989).
15. Eshelman, L.J., Schaffer, J.D.: Real-coded genetic algorithms and interval-schemata. In: Foundations of Genetic Algorithms 2, L. Darrell Whitley (ed.), Morgan Kaufmann Pub., San Mateo (1993) 187-202.
16. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs.
Springer-Verlag, New York (1996).
17. Hooke, R., Jeeves, T.A.: “Direct search” solution of numerical and statistical problems. Journal of the ACM, (1961)8(2):212-229.
18. Laguna, M., Martí, R.: The OptQuest Callable Library. In: Optimization Software
Class Libraries, Stefan Voss and David L. Woodruff (Eds.), Kluwer Academic Pub.,
(2002)193-218.
A Fractal Fuzzy Approach
to Clustering Tendency Analysis
Sarajane Marques Peres1,2 and Márcio Luiz de Andrade Netto1
1
Unicamp - State University of Campinas
School of Electrical and Computer Engineering
Department of Computer Engineering and Industrial Automation
Campinas SP 13083-970, Brazil
{smperes,marcio}@dca.fee.unicamp.br
2
Unioeste - State University of Western Paraná
Department of Computer Science
Campus Cascavel, Cascavel PR 85814-110, Brazil
Abstract. A hybrid system was implemented combining Fractal Dimension Theory and Fuzzy Approximate Reasoning in order to analyze datasets. In this paper, we describe its application to the initial phase of a clustering methodology: clustering tendency analysis. The Box-Counting Algorithm is carried out on a dataset, and from its resultant curve one obtains numeric indications related to the features of the dataset. Then, a fuzzy inference system acts upon these indications and produces information which enables the analysis mentioned above.
Keywords: Clustering Tendency Analysis, Fractal Dimension Theory, Fuzzy Approximate Reasoning
1 Introduction
The treatment of high-dimensional and large datasets is a critical issue for data analysis, and necessary for most of the problems in this area. The computational complexity of the methods used plays an important role when they are applied to datasets with a high number of descriptive attributes and a high number of data points. Efforts to find simpler and more efficient alternatives have grown in recent years. In this paper, we present a new approach (described in detail in [13]), implemented with the aid of Fractal Dimension Theory (FDT) and Fuzzy Approximate Reasoning (FAR), to analyze the clustering tendency (CT) of a dataset. This task, also known as the Clustering Tendency Problem, helps make decisions about whether or not to apply a clustering process to a dataset. The main objective is to avoid spending excessive computational time and resources on a poor and more complex data analysis process, as in [3] and [9].
In fact, this hybrid system classifies, in a heuristic way, the “spatial distribution” of the data points in the dataset1 space as uniform, normal, or clustered.
1 In this context, the dataset is a stochastic fractal.
A.L.C. Bazzan and S. Labidi (Eds.): SBIA 2004, LNAI 3171, pp. 395–404, 2004.
© Springer-Verlag Berlin Heidelberg 2004
This discovery is made by fuzzy analysis of the information obtained from the process of measuring the dataset's fractal dimension.
This paper is organized as follows: Section 2 describes the CT analysis and a classical approach to solving it, the Hopkins approach; the motivation for using FDT and FAR2 in the conception of our system is discussed in Section 3; in Sections 4 and 5 we describe our hybrid system and compare the complexity of the two approaches; the tests and results are shown in Section 6. Considerations about the limitations of our approach and future work are discussed in Section 7 and, finally, the references are listed.
2 Clustering Tendency Analysis
Several methodologies, with different phases and taxonomies, have been defined to guide the clustering process [6]. One of these phases is the CT analysis, which is defined as the problem of deciding whether the dataset exhibits a “predisposition” to cluster into natural groups. The information acquired in this phase can avoid the inappropriate application of clustering algorithms, which could find clusters where they do not exist. The most common approach to this problem is the Hopkins approach [6].
The Hopkins approach provides a numeric indicator useful for discovering the CT. It examines whether the data points contradict the assertion that they are distributed uniformly. If the disposition is not similar to a uniform distribution, the CT is certified. According to [6], the Hopkins index for multi-dimensional datasets is determined by (1), which involves: a set of points uniformly spread over the space of the dataset; a set of points randomly chosen from the dataset; the set of minimal distances between each uniformly generated point and its nearest neighbor in the dataset; and the set of distances between each sampled data point and its nearest neighbor in the dataset.
Generally, in clusters, neighbors are closer than the samples of a uniformly distributed set of points. Thus, if the index is close to 1, clusters are suggested. Similarly, values near 0 suggest uniform scattering.
3 Motivation: FDT and FAR
Fractal Theory studies “complex” subsets located inside simple metric spaces (like Euclidean space) and generated by simple self-transformations of these spaces. The fractal dimension of a dataset is the real dimension represented by its data points, i.e., a dataset can represent an object which has lower dimensions
2 We presume that the reader is familiar with FDT and FAR; [1] and [7] have specific information about them.
than the dimensions of the space where it is located. One of the ways to obtain this measure is through the BC algorithm. A classical implementation of this algorithm is described in [10].
The BC algorithm analyzes the distribution of the data points through successive hypercube grid coverings of the dataset space. Each iteration realizes a finer covering than the previous one, by shrinking the sides of all the hypercubes needed to cover the whole space. Thus, it is possible to observe the distribution behavior of the data points under the successive coverings. For datasets with uniform distributions, this behavior must be uniform, i.e., the number of occupied hypercubes must increase uniformly. Datasets with clusters cause stronger or weaker changes in the number of occupied hypercubes. This distribution behavior is reflected in a log/log curve, which is formed by a sequence of straight-line segments, with different slopes, limited by the points that represent the relationship between the shrinking of the hypercubes and their occupation rates.
The difference of slopes between successive straight-line segments of the curve represents the changes in the number of occupied hypercubes in successive coverings. Bigger or smaller slopes represent the dataset structure, or its spatial distribution style. Thus, it is necessary to know how big or how small the variation must be to indicate a specific spatial distribution. The FAR, which allows working with linguistic variables and fuzzy values modeled by fuzzy sets (see some examples in [11]), makes it possible to answer this question.
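As an illustration of the BC step only (not the actual implementation used by the FFA), a minimal box-counting routine for a dataset normalized to the unit hypercube could be written as:

```python
import numpy as np

def box_counting_curve(points, max_level=8):
    """Return the log/log box-counting curve for a 2D array of points in [0, 1]^d.

    At level k the unit hypercube is covered by a grid of side 1/2^k; the curve
    pairs log(1/side) with log(number of occupied hypercubes).
    """
    points = np.asarray(points, dtype=float)
    curve = []
    for k in range(1, max_level + 1):
        side = 1.0 / (2 ** k)
        # Grid cell index occupied by each point along every dimension.
        cells = np.floor(points / side).astype(int)
        cells = np.minimum(cells, 2 ** k - 1)  # points lying exactly at 1.0
        occupied = len({tuple(c) for c in cells})
        curve.append((np.log(1.0 / side), np.log(occupied)))
    return curve
```

The slopes between successive points of the returned curve are the quantities the fuzzy module reasons about.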
4 The Fractal Fuzzy Approach – FFA
The FFA is composed of two modules: the former carries out the classical BC algorithm and the latter performs the fuzzy analysis of the resultant curve. The output of this system enables the decision about the distribution style observed in the analyzed dataset, and the possible conclusions are: uniform, normal, and clustered distribution. The first option means that the CT is not verified and the others mean that it is verified. A “normal distribution” may or may not characterize the presence of clusters; this case demands a more specific analysis and the CT must be considered. However, the conclusion “normal distribution” is weaker than the conclusion “clustered distribution” with respect to the CT. The outputs are accompanied by a membership degree (a confidence value in [0,1]), which allows an evaluation of the strength of each answer.
The curve resulting from the BC module is described by two coordinates3. The fuzzy module analyzes the information obtained from this curve (normalized): the difference of slope between each pair of consecutive segments and the slope of the second segment of each pair. These values are mapped to linguistic variables through the fuzzy sets, and the Mamdani fuzzy inference (with the operator min for implications and max for aggregations) is triggered (Figure 1). The parameters of the fuzzy system are listed in Tables 1 and 2. They were obtained through a supervised procedure of adjustment, which
3 One coordinate of the curve is used as a precision measurement and the other is the current iteration of the algorithm.
Fig. 1. Fractal Fuzzy Approach process. The parameters in this figure are only illustrative.
relates the spatial distribution of known datasets to features of their BC curves.
Changes in these parameters can make the system more sensitive or less sensitive
to anomalies in the curve.
The fuzzy rules are based on the relationship between the hypercube occupation behavior and the number of hypercubes on the grid. For example, the third rule in Table 2 infers the existence of a clustered spatial distribution (this situation can be observed in Figure 2). The two straight-line segments created by the second, third, and fourth iterations of the BC process have a negative difference of slope, so the existence of a change in the features of the curve can be inferred; it indicates an increase of the relative hypercube occupation rate. The high value of this difference, indicated by the slope of the second segment, reveals the existence of very close data points4. This situation shows that the hypercube grid was not able to separate some subsets of data points in the second and third iterations, so these subsets form clusters with some degree of granularity.
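A toy illustration of how a rule of this kind could be evaluated with min/max Mamdani reasoning; the membership functions and numeric ranges below are invented for illustration and are not the parameters of Tables 1 and 2:

```python
def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def rule_clustered(slope_diff, second_slope):
    """Rule: IF the slope difference is negative AND the second-segment slope is
    high THEN clustered distribution (degree = min of the antecedents)."""
    negative_diff = tri(slope_diff, -2.0, -1.0, 0.0)
    high_slope = tri(second_slope, 1.0, 2.0, 3.0)
    return min(negative_diff, high_slope)

# Aggregating several fired rules for the same conclusion uses max:
# degree = max(rule_clustered(d1, s1), rule_clustered(d2, s2), ...)
```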
4 There are justifications for the establishment of all the rules; we mention just one as an example, due to space restrictions.
Fig. 2. Didactic example. (a) second BC iteration; (b) third BC iteration; (c) fourth BC iteration; (d) resultant curve from a clustered distribution; (e) resultant curve from a normal distribution. All graphs have the axes normalized in [0,1].
The sequence of conclusions represents the behavior of the BC curve, and to analyze it means to discover the style of the spatial distribution of the dataset and, consequently, to analyze the CT. For example, if most of the conclusions in the sequence are “Uniformity”, it means that the curve behaves like a straight line, i.e., the data points are uniformly distributed in the space and the CT does not exist. We defined some rules of thumb in order to analyze the sequence; the algorithm below describes these rules5.
In this context, the sequence of straight-line segments which determines the result supplied by the system (for example, the part of the sequence of conclusions that is mostly composed of “Uniformity”) is called the “meaningful part”.
5 All possibilities of arrangement for the sequence of conclusions are covered by these rules.
5 Complexity Analysis
The computational complexity of the Hopkins approach depends on: nSS, the number of data points in the sample set; mUD, the number of samples in the uniform distribution6; and the dimensionality of the space where the dataset is located. Thus, the upper limit of the complexity function is expressed in terms of these quantities.
The computational complexity of the FFA approach is determined by the BC algorithm, which is the only process carried out with the data points. Fuzzy mappings and fuzzy inferences are carried out with a very short sequence of numbers, and their running times are not significant for the whole process. There are several implementations of the BC algorithm with different upper limit functions, as in: [14]; [5], a recursive algorithm plus the running time of each iteration (O(N)); and [8], where N is the number of data points, D is the dimension of the dataset space, and I is the number of points on the resultant curve. The algorithm presented by [14] constitutes the best solution for high-dimensional and large datasets.
6 Tests and Results
The tests were done using numerical7 datasets with several characteristics referring to the spatial distributions, the number of instances, and the dimensionality of the space8.
6 I.e., nSS is equal to mUD.
7 Non-numerical datasets must be converted to numerical datasets.
8 We show some of the datasets used in the tests, as a representative set. The other datasets and the respective results can be obtained from the authors.
In order to analyze the performance of the Hopkins approach, the tests were done with different configurations of two parameters: the number of iterations and the size of the sample set. There were variations in the results obtained due to the random features of this approach. The tests which presented an average result had 50 iterations and a sample set with half of the analyzed dataset.
The “uniform spatial distribution” was detected by the FFA when the resultant curve, or its meaningful part, was like an “inclined straight line”. This fact was observed with datasets whose distribution occupied all the space, with few or many data points, without concentrations. Table 39 lists three examples where the conclusion “uniform distribution” was obtained. It shows the comparisons with the Hopkins index and with the expected result10. The dataset “Space Clusters” is a difficult example for our approach (due to the sparsity of the data points) and the Hopkins index obtained a low indication of CT. The conclusion obtained by the FFA for the dataset “Normal Distribution 1” (a normal distribution with 3000 data points in a 5-dimensional space, mean 0 and variance 1) showed a weak CT, which was revealed by a decision limit situation.
Table 4 shows the datasets for which the FFA observed the “normal spatial distribution”. In these cases, the resultant curve was like a “simple curve” (see Figure 2(e)). This distribution was found in two situations: when the data points were strongly concentrated in one region of the distribution, with some dispersion around it12; or when the data points presented ill-defined concentrations.
The datasets Normal Distribution13 2 and 3 (located in a 1-dimensional and a 3-dimensional space, respectively, with 100 data points) and 4 (located in a 1-dimensional space, 3000 data points) were classified by the Hopkins index as clustered datasets. Our approach was able to identify them as normal sets (with CT). The dataset Random Distribution has short clusters scattered in the space. The dataset Spiral has two spirals [4].
9 For all tables like this, the marks (Ok) and (Not) give our evaluation of the results obtained as the solution to the CT analysis: (Ok) correct result; (Not) incorrect result.
10 The expected result is determined by the features of the distribution used to create the datasets or by information found in the reference where the dataset was obtained.
11 The symbol marks a conclusion obtained in a decision limit region.
12 Like a cloud of data points.
13 All normal distributions were generated with mean 0 and variance 1.
The results shown in Table 5 refer to decision limit situations. The datasets have high dimension, with the exception of the dataset Normal Distribution 4, which is located in a 1-dimensional space (5000 data points). The Hopkins index was very high for this dataset. The next four datasets are normal distributions with: 100 points in a 4-dimensional space; 100 points in a 5-dimensional space; 3000 points in a 4-dimensional space; and 3000 points in a 5-dimensional space. The other datasets [2] are: Iris (4-dimensional space), Abalone (8-dimensional space) and Hungary Heart Diseases (13-dimensional space).
The “clustered spatial distribution” was observed when the resultant curve presented some kind of anomaly (as shown in Figure 2(d)). Table 6 shows the datasets for which this situation was observed. The first four datasets are located in a 2-dimensional space and contain groups that are well separated, partially overlapping, or strongly overlapping so as to form two groups. The other datasets are located, respectively, in 34-, 7-, 13- and 13-dimensional spaces, and they are available in [2].
In relation to the solutions for the CT, the FFA approach gave 96% correct answers; the same percentage was obtained by the Hopkins index. With respect to the expected answers, the FFA obtained 76% correct results. The datasets with “normal spatial distribution” were not considered when evaluating the Hopkins approach on this requisite (because the answer “normality” cannot be obtained from it), so its performance was 100% under these restrictive conditions (on 16 datasets, including the random dataset). Other
considerations about the Hopkins index could be made, and these results would change accordingly. For example, a more restrictive, although common, index threshold for determining the CT is 0.75; under this condition, the performance of this approach was 43.75% with respect to the expected answers and 52% with respect to the CT. Another, less common, possibility is to use three index intervals (a policy of intervals): [0, 0.3] for regularity; (0.3, 0.75) for normality; and [0.75, 1] for clustered. Under this condition, the performance of the approach was 48% with respect to the expected answers (now considering all datasets) and 100% with respect to the CT.
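As an illustration of the comparison above, the sketch below computes a commonly used form of the Hopkins index and maps its value to a conclusion under the three-interval policy just described. It is not the authors' implementation; the particular Hopkins formulation, the sample fraction and all function names are assumptions.

import numpy as np
from scipy.spatial import cKDTree

def hopkins_index(data, sample_frac=0.5, rng=None):
    """One common estimate of the Hopkins index H (not necessarily the exact
    variant used in the experiments above).

    H = sum(u_i) / (sum(u_i) + sum(w_i)), where
      u_i: distance from a uniformly random point in the bounding box of the
           data to its nearest data point;
      w_i: distance from a sampled data point to its nearest neighbour among
           the remaining data points.
    H is close to 0.5 for uniform data and approaches 1 for clustered data.
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    m = max(1, int(sample_frac * n))              # size of the sample set
    tree = cKDTree(data)
    lo, hi = data.min(axis=0), data.max(axis=0)
    uniform_pts = rng.uniform(lo, hi, size=(m, d))
    u = tree.query(uniform_pts, k=1)[0]
    sample_idx = rng.choice(n, size=m, replace=False)
    w = tree.query(data[sample_idx], k=2)[0][:, 1]  # skip the point itself
    return u.sum() / (u.sum() + w.sum())

def interval_policy(h):
    """Map an index value to a conclusion using the three-interval policy."""
    if h <= 0.3:
        return "regularity"
    if h < 0.75:
        return "normality"
    return "clustered"

# Average over several iterations, as in the experimental setup described above.
data = np.random.randn(3000, 5)                   # a normal distribution in 5-D
hs = [hopkins_index(data, rng=i) for i in range(50)]
print(np.mean(hs), interval_policy(np.mean(hs)))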
7 Conclusions and Trends
In this paper we demonstrated how to solve the preliminary phase of a clustering methodology (the CT analysis) using a new approach. We implemented a hybrid system combining FDT and FAR to enable the analysis of the relationship between the data points and the dataset space. The efficiency of this system was evaluated in relation to the algorithmic complexity and the quality of the analysis (with tests on synthetic and real datasets). We compared the efficiency of our approach with that of the classical Hopkins approach.
The capacity of the FFA to detect the CT is similar to that of the Hopkins approach with the classical parameters. On this question, the FFA presented 96% correct answers, while the Hopkins approach reached 96% with the common index threshold (0.5), 52% with a more restrictive index threshold (0.75), and 100% with a relaxed index threshold (the policy of intervals). The FFA is able to supply discriminatory information about the dataset structure more effectively than the Hopkins approach: the percentage of correct answers with respect to the expected result (uniform, normal or clustered) was better for our approach (76% against 48%). Moreover, the Hopkins approach is able to supply these three styles of information only with the use of the policy of intervals. The upper limit of the complexity function for the Hopkins approach indicates that it can be slower than some implementations of the BC algorithm (which determines the complexity of our approach).
For large datasets, the BC implementations developed by [14] or [8] are good alternatives for implementing our approach, because the upper limits of their complexity functions do not depend on a quadratic function of the number of data points used.
The studies about FAR are not finished. There are problems related to sparse datasets, and we are currently exploring them. The use of this system to determine an “accurate fractal dimension measure”, and the application of this measure to other problems of clustering processing, are also being explored. We have reached some interesting preliminary results with the combination of FAR and neural networks [12].
References
1. M. Barnsley. Fractals Everywhere. Academic Press Inc, San Diego, California, USA,
1988.
2. C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.
3. F. Can, I. S. Altingovde, and E. Demir. Efficiency and effectiveness of query processing in cluster-based retrieval. Information Systems, 2003. To appear.
4. L. N. Castro and F. J. Von Zuben. Data Mining: A Heuristic Approach, chapter
aiNet: an artificial immune network for data analysis, pages 231–259. Idea Group
Publishing, USA, 2001.
5. B.F. Feeny. Fast multifractal analysis by recursive box covering. International Journal of Bifurcation and Chaos, 10(9):2277–2287, 2000.
6. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc.,
New Jersey, USA, 1988.
7. G. J. Klir and B. Yuan. Fuzzy Sets and Fuzzy Logic: Theory and Applications.
Prentice-Hall, 1995.
8. A. Kruger. Implementation of a fast box-counting algorithm. Computer Physics
Communications, 98:224–234, 1996.
9. L. Massey. Using ART1 neural networks to determine clustering tendency. In: Applications and Science in Soft Computing. Springer-Verlag, 2003.
10. H. O. Peitgen, H. Jurgens, and D. Saupe. Chaos and Fractals: New Frontiers of
Science. Springer-Verlag New York Inc., New York, USA, 1992.
11. S.M. Peres and M.L.A. Netto. Using fractal dimension to fuzzy pre-processing of n-dimensional datasets. In ICSE 2003 - Sixteenth International Conference on System Engineering, Coventry, United Kingdom, 2003. (accepted).
12. S.M. Peres and M.L.A. Netto. Fractal fuzzy decision making: What is the adequate
dimension for self-organizing maps. In NAFIPS 2004 - North American Fuzzy Information Processing Society, Banff, Canada, 2004. to appear.
13. S.M. Peres and M.L.A. Netto. Um sistema hibrido para analise heuristica de dados
utilizando teoria de fractais e raciocinio aproximado. Technical report, Universidade Estadual de Campinas, Campinas, Sao Paulo, Brasil, 2004.
14. C. Traina Jr., A. Traina, L. Wu, and C. Faloutsos. Fast feature selection using fractal dimension. In XV Brazilian Database Symposium, pages 158–171, João Pessoa, PA, Brazil, 2002.
On Stopping Criteria for Genetic Algorithms
Martín Safe1, Jessica Carballido1,2, Ignacio Ponzoni1,2, and Nélida Brignole1,2
1 Grupo de Investigación y Desarrollo en Computación Científica (GIDeCC)
Departamento de Ciencias e Ingeniería de la Computación
Universidad Nacional del Sur, Av. Alem 1253, 8000, Bahía Blanca, Argentina
[email protected], {jac,[email protected]}, [email protected]
2 Planta Piloto de Ingeniería Química - CONICET
Complejo CRIBABB, Camino La Carrindanga km.7
CC 717, Bahía Blanca, Argentina
Abstract. In this work we present a critical analysis of various aspects
associated with the specification of termination conditions for simple genetic algorithms. The study, which is based on the use of Markov chains,
identifies the main difficulties that arise when one wishes to set meaningful upper bounds for the number of iterations required to guarantee
the convergence of such algorithms with a given confidence level. The
latest trends in the design of stopping rules for evolutionary algorithms
in general are also put forward and some proposals to overcome existing
limitations in this respect are suggested.
Keywords: stopping rule, genetic algorithm, Markov chains, convergence analysis
1 Introduction
During the last few decades genetic algorithms (GAs) have been widely employed
as effective search methods in numerous fields of application. They are typically
used in problems with huge search spaces, where no efficient algorithms with low
polynomial times are available, such as NP-complete problems [1].
Although in practice GAs have clearly proved to be efficacious and robust
tools for the treatment of hard problems, the theoretical fundamentals behind
their success have not been well-established yet [2]. There are very few studies
on key aspects associated with how a GA works, such as parameter control and
convergence analysis [3]. More specifically, the answers to the following questions
concerning GA design remain open and constitute subjects of current interest.
How can we define an adequate termination condition for an evolutionary process? [4–6]. Given a desired confidence level, how can we estimate an upper
bound for the number of iterations required to ensure convergence? [7–9].
In this work we present a critical review of the state-of-the-art in the design
of termination conditions and convergence analysis for canonical GAs. The main
contributions in the field are discussed, as well as some existing limitations. On
the basis of this analysis, future research lines are put forward. The article has
been organized as follows. In section 2 the traditional criteria typically employed
to express GA termination conditions are presented. Then, basic concepts on the
use of Markov chain models for GA convergence analysis are summed up. Section
4 contains a discussion of the results obtained in the estimation of upper bounds
for the number of iterations required for GA convergence. A description of the
present trends as regards termination conditions for evolutionary algorithms in
general is given next. Finally, some concluding remarks and proposals for further work are stated in section 6.
2 Termination Conditions for the sGA
A simple Genetic Algorithm (sGA) exhibits the following features: finite population, bit representation, one-point crossover, bit-flip mutation and roulette wheel
selection. The sGA and its elitist variation are the most widely employed kinds of
GA. Consequently, this variety has been studied quite extensively. In particular,
most of the scarce theoretical formalizations of GAs available in the literature
are focused on sGAs. The following three kinds of termination conditions have
been traditionally employed for sGAs [10, p. 67]:
An upper limit on the number of generations is reached,
An upper limit on the number of evaluations of the fitness function is reached,
or
The chance of achieving significant changes in the next generations is excessively low.
The choice of sensible settings for the first two alternatives requires some knowledge about the problem to allow the estimation of a reasonable maximum search length. In contrast, the third alternative, whose nature is adaptive, does not require such knowledge. In this case there are two variants, namely genotypical and phenotypical termination criteria. The former end when the population reaches a certain convergence level with respect to the chromosomes it contains. In short, the number of genes that have converged to a certain value of the allele is checked. The convergence or divergence of a gene to a certain allele is established by the GA designer through the definition of a preset percentage, a threshold that should be reached. For example, when 90% of the population in a GA has a 1 in a given gene, it is said that that gene has converged to the allele 1. Then, when a certain percentage of the genes in the population (e.g. 80%) has converged, the GA ends. Unlike the genotypical approach, phenotypical termination criteria measure the progress achieved by the algorithm over a given number of recent generations, this number being preset by the GA designer. When this measurement, which may be expressed in terms of the average fitness value for the population, yields a value beyond a certain limit, it is said that the algorithm has converged and the evolution is immediately interrupted.
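As a concrete illustration, both checks can be written as short predicates. The sketch below uses the example thresholds quoted above (90% per gene, 80% of the genes) and an assumed progress limit for the phenotypical rule; it is not a prescribed implementation, and all function and parameter names are ours.

import numpy as np

def genotypical_stop(population, allele_threshold=0.9, gene_threshold=0.8):
    """population: (pop_size, chrom_length) array of 0/1 genes.

    A gene is considered converged when at least `allele_threshold` of the
    population carries the same allele; the GA stops when at least
    `gene_threshold` of the genes have converged."""
    pop = np.asarray(population)
    ones_ratio = pop.mean(axis=0)                     # fraction of 1s per gene
    converged = np.maximum(ones_ratio, 1.0 - ones_ratio) >= allele_threshold
    return converged.mean() >= gene_threshold

def phenotypical_stop(avg_fitness_history, window=10, epsilon=1e-4):
    """Stop when the average fitness of the population improved by less than
    `epsilon` over the last `window` generations (both values preset by the
    GA designer)."""
    if len(avg_fitness_history) < window + 1:
        return False
    return abs(avg_fitness_history[-1] - avg_fitness_history[-1 - window]) < epsilon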
The main difficulty that arises in the design of adaptive termination policies concerns the establishment of appropriate values for their associated parameters (such as the progress threshold in phenotypical rules), while for the criteria that set a fixed number of iterations, the fundamental problem is how to determine a reasonable value for
that number, so that sGA convergence is guaranteed with a certain confidence
level. In this case, the values not only depend on the dimension of the search
space, but also on the rest of the parameters involved in the sGA, which include
the crossover and mutation probabilities as well as the population size.
The minimum number of iterations required in a GA can be found by means of a convergence analysis. This study may be carried out using different approaches, such as schema theory [11, Chap. 2] or Markov chains [12–14]. The usefulness of the schema theorem has been widely criticised [15]: since it gives a lower bound on the expectation for one generation only, it is very difficult to extrapolate its conclusions to multiple generations accurately. In this article we concentrate on Markov chains because, as pointed out by Bäck et al. [2], this approach has already provided remarkable insight into convergence properties and dynamic behaviour.
3 Markov Chains and Convergence Analysis of the sGA
A Markov chain may be viewed as a stochastic process that traverses a sequence of states through time; the passage from one state to the next is called a transition. A distinguishing feature of Markov chains is the fact that, given the present state, future states are independent of past states, though they may depend on time. For a formal definition see, for example, [16, pp. 106–107].
Nix and Vose [12] showed how the sGA can be modelled exactly as a finite Markov chain, i.e. a Markov chain with a finite set of states. In their model, each state represents a population and each transition corresponds to the application of the three genetic operators. They found exact formulas for the transition probabilities to go from one population to another in one GA iteration, as functions of the mutation and crossover rates. By forming a matrix with these transition probabilities and computing its powers, one can predict precisely the behaviour of the sGA in terms of probability, for fixed genetic rates and fitness function. This approach was taken up by De Jong et al. [17]. Unfortunately, the number of rows and columns of the corresponding matrices is equal to the number of all possible populations which, according to [12], amounts to the number of multisets of size n drawn from the 2^ℓ binary strings, i.e. C(n + 2^ℓ − 1, 2^ℓ − 1), where n is the population size and ℓ the string length. This quantity becomes extremely large as n or ℓ grows. Also notice that these matrices are not sparse, because their entries are all non-zero probabilities. Therefore, this method can only be applied for small values of n and ℓ.
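The growth of this state space is easy to check numerically. The short sketch below, an illustration assuming the multiset count given above, prints the number of possible populations for a few small parameter settings.

from math import comb

def num_populations(n, length):
    """Number of possible populations of size n over binary strings of the
    given length, counted as multisets of the 2**length possible individuals
    (the state count assumed in the text above)."""
    k = 2 ** length
    return comb(n + k - 1, k - 1)

for n, length in [(4, 3), (10, 4), (20, 8), (50, 10)]:
    print(n, length, num_populations(n, length))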
Nevertheless, Nix and Vose’s formulation can lead to an analysis of the sGA convergence behaviour. For instance, they confirm the intuitive fact that, unless the mutation rate is zero, each population is reachable from any other in one transition, i.e. the transition probability is non-zero for any pair of populations.
According to the theory of finite Markov chains, this simple fact has immediate consequences for the sGA behaviour as the number of iterations grows indefinitely. More specifically, whatever the initial population, the probability of reaching any other given population after t iterations does not approach 0 as t tends to infinity; it tends instead to a positive limit probability, which depends on the target population but is independent of the initial one. Then, although the selection process tends to favour populations that contain high-fitness individuals by making them more probable, the constant-rate mutation introduces enough diversity to ensure that all populations are visited again and again. Thus, the sGA fails to converge to a population subset, no matter how much time has elapsed.
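The limiting behaviour just described is the standard property of Markov chains whose transition matrix has strictly positive entries. The toy sketch below (a generic 3-state chain with an arbitrary matrix, not the actual sGA chain) shows how powers of such a matrix converge to a matrix with identical rows, i.e. to a positive limit distribution that does not depend on the starting state.

import numpy as np

# Toy 3-state chain: every transition probability is positive, as happens for
# the sGA whenever the mutation rate is non-zero.
P = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.70, 0.20],
              [0.25, 0.25, 0.50]])

# All rows of P^50 are (numerically) identical and strictly positive: the
# limit probability of each state is positive and independent of the
# initial state.
print(np.linalg.matrix_power(P, 50))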
Moreover, Rudolph [14] showed that the same holds for more general crossover and selection operators, provided a constant-rate mutation is kept. Nevertheless, reducing the mutation rate progressively does not seem to be enough. Davis and Príncipe [13] presented a variation of the sGA that uses the mutation rate as a control parameter analogous to temperature in simulated annealing. They show how the mutation rate can be reduced during execution in order to ensure that the limiting distribution concentrates only on populations consisting of replicas of the same individual, which is, however, not necessarily a globally optimal one.
In contrast, the elitist version of the sGA, which always remembers the best individual found so far, does converge in a probabilistic sense. In this respect, Rudolph [14] shows that the probability of having found the best individual at some time during the process approaches 1 as the number of iterations tends to infinity, and he points out that this property does not mean that genetic algorithms have special capabilities for finding the best individual. In fact, since any population has a non-zero probability of being visited and there is a finite number of populations, each of them will eventually be visited with probability 1 as the number of iterations grows indefinitely. This observation therefore lacks practical significance because, for example, the direct enumeration of all the individuals guarantees the discovery of the global optimum in finite time.
4 Stopping Criteria for the sGA with Elitism
Aytug and Koehler [7, 8] formulated a stopping criterion for the elitist sGA based on the fact that all populations are visited with probability 1. Given a threshold α, they aimed at finding an upper bound for the number of iterations t required to guarantee that the global optimum has been visited, with probability at least α, in one of these iterations. Using Nix and Vose’s model [12], they showed [7] that it is enough to have

    t ≥ ln(1 − α) / ln(1 − (min{μ, 1 − μ})^(nℓ))                    (2)

to ensure that all the populations, and consequently all the individuals, have been visited with probability greater than or equal to α. In equation (2), μ is the mutation rate, ℓ is the length of the binary strings that represent the individuals, and n is the population size. Later, Aytug and Koehler [8] determined an upper
bound for the number of iterations required to guarantee, with probability at least α, that all possible individuals have been inspected, instead of imposing the condition on all the populations. In this way, they managed to improve the bound in (2) significantly, proving that a number of iterations that satisfies

    t ≥ ln(1 − α) / (n ln(1 − (min{μ, 1 − μ})^ℓ))                    (3)

is enough to achieve this objective. Greenhalgh and Marshall [9] obtained similar results independently on the basis of simpler arguments. In the rest of this section we will show that, in spite of being theoretically correct, these criteria are of little practical interest.
Let us consider a random algorithm (RA) that generates in each iteration a population of n individuals, not necessarily different from each other, chosen at random and independently. Just as Aytug and Koehler [7, 8] did for the elitist sGA, we shall determine the lowest number of iterations t required to guarantee, with probability at least α, that the RA has generated all the possible individuals in the course of the procedure. Let us consider the populations P1, …, Pt generated by the RA and an element x from the space of individuals (for example, a global optimum). Our objective is to find the lowest value of t such that Pr{x ∈ P1 ∪ … ∪ Pt} ≥ α. Since the populations are generated independently from each other,

    Pr{x ∈ P1 ∪ … ∪ Pt} = 1 − [(1 − 1/2^ℓ)^n]^t.                    (4)

The expression in brackets is lower than 1, so the whole expression approaches 1 as t tends to infinity. Since x is an arbitrary individual, (4) shows that the RA will visit all individuals with probability 1 if it is allowed to iterate indefinitely. Moreover, by applying logarithms to (4), we get

    t ≥ ln(1 − α) / (n ln(1 − 1/2^ℓ)).                    (5)

This is an upper bound for the number of iterations required to ensure, with probability at least α, that the RA has examined all the individuals, and consequently discovered the global optimum.
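Evaluating the bound (5), as written above, is straightforward. The sketch below is an illustration only, with arbitrarily chosen parameter values; it computes the smallest integer t for a given confidence level α, population size n and string length ℓ.

from math import ceil, log

def ra_iterations(alpha, n, length):
    """Smallest integer t satisfying bound (5) above:
    t >= ln(1 - alpha) / (n * ln(1 - 2**(-length))),
    i.e. enough independent random populations of size n for a given
    individual (e.g. the global optimum) to have appeared with probability
    at least alpha."""
    return ceil(log(1.0 - alpha) / (n * log(1.0 - 2.0 ** (-length))))

# e.g. a 99% confidence level, populations of 20 individuals, strings of length 10
print(ra_iterations(0.99, 20, 10))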
Since (3) reaches its minimum at μ = 1/2, the bound for RAs given in (5) is always at least as good as the bound for GAs presented in (3). Hence, the latter does not provide a stopping criterion of practical use, because it always suggests waiting for at least as many iterations as an RA without heuristics of any kind would require. Moreover, when μ tends to 0, which constitutes the situation of practical interest, the number of iterations required by (3) grows to infinity. Figure 1 depicts the behaviour of (3) and its relation to (5).
Fig. 1. The lower bound for GA iterations (3) (continuous line) grows quickly to infinity as the mutation rate moves away from 1/2. The lower bound for RA iterations (5) for the same parameter values is also indicated (dashed line); it coincides with the minimum attained by (3).
Due to the way Aytug and Koehler posed their problem, they were theoretically prevented from going beyond the bound for RAs given in (5). In fact, since they make no hypotheses about the fitness function, they implicitly include the possibility of dealing with a fitness function that assigns a randomly chosen value to each individual. When this is the case, only exploration is required and no exploitation should be carried out. Therefore, the RA exhibits better performance than the sGA.
5 Present Trends in Stopping Rules for Evolutionary Algorithms
Whatever the problem, it is nowadays considered inappropriate to employ sets of fixed values for the parameters of an evolutionary algorithm [3]. Therefore, it would be inadvisable to choose a termination condition based on a pre-established number of iterations. Some adaptive alternatives have been explored in the last decade.
Among them we can cite Meyer and Feng [4], who suggested using fuzzy logic to establish the termination condition, and Howell et al. [5], who designed a new variant of evolutionary algorithms called Genetic Learning Automata (GLA). This algorithm uses a peculiar representation of the chromosomes, in which each gene is a probability. On this basis, a novel genotypical stopping rule is defined: the execution stops when the alleles reach values close to 0 or 1.
In turn, Carballido et al. [6] present a representative example of a stopping
criterion designed ad hoc for a specific application, namely the traveling salesman
problem (TSP). In that work, a genotypical termination criterion defined for
both ordinal and path representations is proposed.
Finally, it is important to remark on the possibility of increasing efficiency, in particular by using parallel genetic algorithms (pGAs). As stated in Hart et al. [18], the performance measurements employed for parallel algorithms, such as the speed-up, are usually defined in terms of the cost required to reach a solution with a pre-established precision. For this reason, when one wishes to calculate parallel performance metrics, it is incorrect to stop a pGA either after a fixed number of iterations or when the average fitness exhibits little variation.
This constitutes a motivation for the definition of stopping rules based on the
attainment of thresholds. The central idea is to stop the execution of the pGA
when a solution that reaches this threshold is found. For instance, Sena et al. [19]
present a parallel distributed GA (pdGA) based on the master-worker paradigm.
The authors illustrate how this algorithm works by applying it to the TSP,
using a lower bound estimated for the minimum-cost tour as threshold for the
termination condition.
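Such a threshold-based rule reduces to a one-line predicate. The sketch below is only an illustration (it is not the pdGA of [19]); the tolerance parameter and the minimization setting are assumptions of ours.

def threshold_stop(best_cost, lower_bound, tolerance=0.0):
    """Stop a (p)GA run once the best tour cost found so far reaches the
    estimated lower bound for the minimum-cost tour, within a tolerance."""
    return best_cost <= lower_bound * (1.0 + tolerance)

# e.g. stop when the best tour is within 2% of an estimated lower bound
print(threshold_stop(best_cost=1020.0, lower_bound=1000.0, tolerance=0.02))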
Nevertheless, threshold definition requires a good estimation of the optimum
of the problem under study, which is unavailable in many cases. Unfortunately,
the most recent reviews on pGAs ([20,21]) fail to provide effective strategies to
overcome these limitations.
6 Conclusions
Research work in this field shows that the sGA does not necessarily lead to better
and better populations. Although its elitist version converges probabilistically
to the global optimum, this is due to the fact that the sGA tends to explore
the whole space, rather than to the existence of any special capability in its exploitation mechanism. This is not, indeed, in contradiction to the interpretation
of sGAs as evolutionary mechanisms because the introduction of a fixed fitness
function implies an assumption that may not be in exact correspondence with
natural environments, whose character is inherently dynamic. As pointed out by
De Jong [22], Holland’s initial motivation for introducing the concept of GAs
was to devise an implementation for robust adaptive systems, without focusing
on the optimization of functions. Furthermore, De Jong makes a clear distinction between sGAs and GA-based function optimizers. The successful results achieved through the use of the latter for the solution of hard problems often blur this distinction.
Until recently, this trend has led researchers to look for a general measure of elitist sGA efficiency from a theoretical viewpoint, applicable to the solution of any optimization problem on binary strings of length ℓ [8, 9]. The smoothness of the fitness function is extremely important when choosing the most convenient kind of heuristic strategy for a given problem: the higher the smoothness, the higher the exploitation level; conversely,
the less smooth the function, the more exploration is required. This fact is so clear that it should not be overlooked. Since no hypotheses about the fitness function have been made, and no measure of its smoothness appears in these bounds, this approach overestimates exploration to the detriment of exploitation, which is just the opposite of what one really wishes in practice when implementing a heuristic search.
In view of the fact that the sGA is not a function optimizer, efforts should be directed to devising adequate modifications for each specific problem, in order to design an optimizer that is really efficient for a given family of functions. Besides, it is important to remark that current trends favour the employment of adaptive termination conditions, either genotypical or phenotypical, rather than a fixed number of iterations, because for most real-world applications the mere estimation of the size of the search space constitutes in itself an extremely complex problem.
Acknowledgments
The authors would like to express their gratitude to the “Agencia Nacional de Promoción Científica y Tecnológica” (Argentina) for its financial support through Grant N° 11-12778, awarded to the research project entitled “Procesamiento paralelo distribuido aplicado a ingeniería de procesos” (ANPCYT Res. N° 117/2003) as part of the “Programa de Modernización Tecnológica, Contrato de Préstamo BID 1201/OC-AR”.
References
1. Brassard, G., Bratley, P.: Algorithmics: Theory and Practice. Prentice-Hall, Inc.,
New Jersey (1988)
2. Bäck, T., Hammel, U., Schwefel, H.P.: Evolutionary computation: Comments on
the history and current state. IEEE Transactions on Evolutionary Computation 1
(1997) 3–17
3. Eiben, Á.E., Hinterding, R., Michalewicz, Z.: Parameter control in evolutionary
algorithms. IEEE Transactions on Evolutionary Computation 3 (1999) 124–141
4. Meyer, L., Feng, X.: A fuzzy stop criterion for genetic algorithms using performance estimation. In: Proceedings of the Third IEEE Conference on Fuzzy Systems. (1994) 1990–1995
5. Howell, M., Gordon, T., Brandao, F.: Genetic learning automata for function optimization. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 32 (2002) 804–815
6. Carballido, J.A., Ponzoni, I., Brignole, N.B.: Evolutionary techniques for the travelling salesman problem. In Rosales, M.B., Cortínez, V.H., Bambill, D.V., eds.:
Mecánica Computacional. Volume XXII. Asociación Argentina de Mecánica Computacional (2003) 1286–1294
7. Aytug, H., Koehler, G.J.: Stopping criterion for finite length genetic algorithms.
INFORMS Journal on Computing 8 (1996) 183–191
8. Aytug, H., Koehler, G.J.: New stopping criterion for genetic algorithms. European
Journal of Operational Research 126 (2000) 662–674
9. Greenhalgh, D., Marshall, S.: Convergence criteria for genetic algorithms. SIAM
Journal on Computing 20 (2000) 269–282
10. Michalewicz, Z.: Genetic Al