LNCS 4172 - Research and Advanced Technology for Digital Libraries

Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
University of Dortmund, Germany
Madhu Sudan
Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Moshe Y. Vardi
Rice University, Houston, TX, USA
Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany
Julio Gonzalo, Costantino Thanos,
M. Felisa Verdejo, Rafael C. Carrasco (Eds.)
Research and
Advanced Technology
for Digital Libraries
10th European Conference, ECDL 2006
Alicante, Spain, September 17-22, 2006
Volume Editors
Julio Gonzalo
M. Felisa Verdejo
Universidad Nacional de Educación a Distancia (UNED)
Departamento de Lenguajes y Sistemas Informáticos
c/Juan del Rosal, 16, 28040 Madrid, Spain
Costantino Thanos
Consiglio Nazionale delle Ricerche
Istituto di Scienza e Tecnologie dell’Informazione
Via Moruzzi, 1, 56124, Pisa, Italy
E-mail: [email protected]
Rafael C. Carrasco
Universidad de Alicante
Departamento de Lenguajes y Sistemas Informáticos
03071 Alicante, Spain
E-mail: [email protected]
Library of Congress Control Number: Applied for
CR Subject Classification (1998): H.3.7, H.2, H.3, H.4.3, H.5, J.7, J.1, I.7
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISBN-10: 3-540-44636-2 Springer Berlin Heidelberg New York
ISBN-13: 978-3-540-44636-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 11863878
Preface

We are proud to present the proceedings of the 10th European Conference on
Digital Libraries (ECDL 2006) which, following Pisa (1997), Heraklion (1998),
Paris (1999), Lisbon (2000), Darmstadt (2001), Rome (2002), Trondheim (2003),
Bath (2004) and Vienna (2005), took place on September 17-22, 2006 at the
University of Alicante, Spain. Over the last ten years, ECDL has created a strong
interdisciplinary community of researchers and practitioners in the field of digital
libraries, and has formed a substantial body of scholarly publications contained
in the conference proceedings. As a commemoration of its 10th anniversary, and
by special arrangement with Springer, these proceedings include (as an attached
CD-ROM) an electronic copy of all ECDL proceedings since its inception in
1997: a small but rich digital library on digital libraries.
ECDL 2006 featured separate calls for paper and poster submissions, resulting
in 130 full papers and 29 posters being submitted to the conference. All papers
were subject to an in-depth peer-review process; three reviews per submission
were produced by a Program Committee of 92 members and 42 additional reviewers from 30 countries. Finally, 36 full paper submissions were accepted at
the Program Committee meeting, resulting in an acceptance rate of 28%. Also,
15 poster submissions plus 18 full paper submissions were accepted as poster or
demo presentations, which are also included in this volume as four-page extended abstracts.
The conference program started on Sunday, September 17, with a rich tutorial program, which included a tutorial on thesauri and ontologies in digital libraries by Dagobert Soergel, an introduction to digital libraries by Ed Fox, a tutorial on bringing digital libraries to distributed infrastructures by Thomas Risse and Claudia Niederée, a description of the Fedora repository and service framework by Sandy Payette and Carl Lagoze, a tutorial on creating full-featured institutional repositories combining DSpace, ETD-db and DigiTool, and a tutorial on the use of XML and TEI for content production and metadata.
The main conference featured three keynote speakers: Michael A. Keller, Ida
M. Green University Librarian at Stanford, Director of Academic Information
Resources, Publisher of HighWire Press, and Publisher of the Stanford University
Press; Horst Forster, Director of Interfaces, Knowledge and Content Technologies at the Directorate-General for Information Society of the European Commission; and Ricardo Baeza-Yates, Director of Yahoo! Research Barcelona and of Yahoo! Research Latin America in Santiago de Chile.
The rest of the main conference program consisted of 12 technical sessions,
a panel, and a poster session preceded by a spotlight session that served as a
quick guide to the posters for the conference participants.
Following the main conference, ECDL hosted eight workshops, including the long-standing workshop of the Cross-Language Evaluation Forum, a major event in its own right that ran an intensive three-day program devoted to discussing the outcome of its annual competitive evaluation campaign in the field of Multilingual Information Access. The other workshops were: NKOS 2006 (5th European Networked Knowledge Organization Systems workshop), DORSDL 2006
(Digital Object Repository Systems in Digital Libraries), DLSci 2006 (Digital
Library goes e-science: perspectives and challenges), IWAW 2006 (6th International Workshop on Web Archiving and Digital Preservation), LODL 2006
(Learning Object repositories as Digital Libraries: current challenges), M-CAST
2006 and CSFIC 2006 (Embedded e-Learning – critical success factors for institutional change). All information about ECDL 2006 is available from the conference
homepage at http://www.ecdl2006.org.
We would like to take this opportunity to thank all the institutions and individuals who made this conference possible, starting with the conference participants and presenters, who provided a dense one-week program of high technical quality. We are also indebted to the Program Committee members, who did an outstanding reviewing job under tight time constraints; and to all Chairs and
members of the Organization Committee, including the organizing teams at the
University of Alicante, Biblioteca Virtual Miguel de Cervantes and UNED. We
would specifically like to thank Miguel Ángel Varó for his assistance with the
conference management system, and Valentín Sama for his help when compiling
these proceedings.
Finally, we would also like to thank the conference sponsors and cooperating
agencies: the DELOS Network of Excellence on Digital Libraries, Grupo Santander, Ministerio de Educación y Ciencia, Patronato Municipal de Turismo de
Alicante, Red de investigación en Bibliotecas Digitales, Fundación Biblioteca
Miguel de Cervantes, Departamento de Lenguajes y Sistemas Informáticos de la
Universidad de Alicante, and UNED.
Julio Gonzalo
Costantino Thanos
Felisa Verdejo
Rafael Carrasco
Organization Committee
General Chair
Felisa Verdejo
UNED, Spain
Program Chairs
Julio Gonzalo
Costantino Thanos
UNED, Spain
CNR, Italy
Organization Chair
Rafael C. Carrasco
Universidad de Alicante, Spain
Workshops Chairs
Donatella Castelli
José Luis Vicedo
CNR, Italy
Universidad de Alicante, Spain
Poster and Demo Chair
Günter Mühlberger
University of Innsbruck, Austria
Panel Chairs
Andreas Rauber
Liz Lyon
Vienna University of Technology, Austria
UKOLN, University of Bath, UK
Tutorial Chairs
Marcos Andre Gonçalves
Ingeborg Solvberg
Federal University of Minas Gerais, Brazil
Norwegian University of Science and
Technology, Norway
Publicity and Exhibits Chairs
Maristella Agosti
Tamara Sumner
Shigeo Sugimoto
University of Padua, Italy
University of Colorado at Boulder, USA
University of Tsukuba, Japan
Doctoral Consortium Chairs
Jose Borbinha
Lillian Cassel
IST, Lisbon Technical University, Portugal
Villanova University, USA
Local Organization Chair
Rafael C. Carrasco
University of Alicante
Program Committee
Alan Smeaton
Allan Hanbury
Andras Micsik
Andy Powell
Anita S. Coleman
Ann Blandford
Anselmo Peñas
Antonio Polo
Birte Christensen-Dalsgaard
Boris Dobrov
Carl Lagoze
Carlo Meghini
Carol Peters
Ching-Chih Chen
Christine L. Borgman
Clifford Lynch
Dagobert Soergel
Dieter Fellner
Dimitris Plexousakis
Djoerd Hiemstra
Donna Harman
Douglas W. Oard
Edward A. Fox
Edleno Silva de Moura
Ee-Peng Lim
Elaine Toms
Erik Duval
Fernando López-Ostenero
Floriana Esposito
Franciska de Jong
Frans Van Assche
Gary Marchionini
George Buchanan
Dublin City University, Ireland
Vienna University of Technology, Austria
SZTAKI, Hungary
Eduserv Foundation, UK
University of Arizona, USA
University College London, UK
UNED, Spain
University of Extremadura, Spain
State and University Library, Denmark
Moscow State University, Russia
Cornell University, USA
Simmons College, USA
University of California, USA
Coalition for Networked Information, USA
University of Maryland, USA
Graz University of Technology, Austria
FORTH, Greece
Twente University, Netherlands
University of Maryland, USA
Virginia Tech, USA
Universidade do Amazonas, Brazil
Nanyang Technological University, Singapore
Dalhousie University, Canada
Katholieke Universiteit Leuven, Belgium
UNED, Spain
University of Bari, Italy
University of Twente, Netherlands
European Schoolnet, Belgium
University of North Carolina Chapel Hill, USA
University of Wales, Swansea, UK
Gerhard Budin
Gregory Crane
George R. Thoma
Hanne Albrechtsen
Harald Krottmaier
Heiko Schuldt
Herbert Van de Sompel
Howard Wactlar
Hussein Suleman
Ian Witten
Ingeborg Solvberg
Jacques Ducloy
Jan Engelen
Jane Hunter
Jela Steinerova
Jesús Tramullas
José Hilario Canós Cerdá
Jussi Karlgren
Key-Sun Choi
Laurent Romary
Lee-Feng Chien
Leonid Kalinichenko
Liddy Nevile
Lloyd Rutledge
Lynda Hardman
Marc Nanard
Margaret Hedstrom
Margherita Antona
Mária Bieliková
Maria Sliwinska
Mario J. Silva
Martin Kersten
Michael Mabe
Mike Freeston
Mounia Lalmas
Nicholas Belkin
Nicolas Spyratos
Norbert Fuhr
Nozha Boujemaa
Pablo de la Fuente
Paul Clough
Rachel Bruce
Ray R. Larson
University of Vienna, Austria
Tufts University, USA
U.S. National Library of Medicine, USA
Institute of Knowledge Sharing, Denmark
Graz University of Technology, Austria
University of Basel, Switzerland
Los Alamos National Laboratory, USA
Carnegie Mellon University, USA
University of Cape Town, South Africa
University of Waikato, New Zealand
Norwegian University of Science and Technology, Norway
Katholieke Universiteit Leuven, Belgium
University of Queensland, Australia
Comenius University in Bratislava, Slovakia
University of Zaragoza, Spain
Universidad Politécnica de Valencia, Spain
SICS, Sweden
Korea Advanced Institute of Science and
Technology, Korea
Laboratoire Loria CNRS, France
Academia Sinica, Taiwan
Russian Academy of Sciences, Russia
La Trobe University, Australia
CWI, Netherlands
CWI, Netherlands
University of Montpellier, France
University of Michigan, USA
FORTH, Greece
Slovak University of Technology in Bratislava, Slovakia
ICIMSS, Poland
Universidade de Lisboa, Portugal
CWI, Netherlands
Elsevier, UK
University of California, Santa Barbara, USA
Queen Mary University of London, UK
Rutgers University, USA
Université de Paris-Sud, France
University of Duisburg-Essen, Germany
INRIA, France
University of Valladolid, Spain
University of Sheffield, UK
University of California, Berkeley, USA
Reagan Moore
Ricardo Baeza-Yates
Richard Furuta
Sally Jo Cunningham
Sarantos Kapidakis
Schubert Foo
Stavros Christodoulakis
Stefan Gradmann
Susanne Dobratz
Thomas Baker
Thomas Risse
Timos Sellis
Tiziana Catarci
Traugott Koch
Yahoo! Research, Spain and Chile
Texas A&M University, USA
University of Waikato, New Zealand
Ionian University, Greece
Nanyang Technological University, Singapore
Technical University of Crete, Greece
University of Hamburg Computing Center, Germany
Humboldt University, Germany
State and University Library, Germany
Fraunhofer IPSI, Germany
National Technical University of Athens, Greece
University of Rome 1, Italy
Additional Reviewers
Enrique Amigó
Luis J. Arévalo Rosado
Tobias Blanke
You-Jin Chang
Nicola Fanizzi
Daniel Gomes
Jesús Herrera
Sascha Kriewel
Manuel Llavador
Jorge Martínez Gil
Patrick Lehti
Antonella Poggi
Dimitris Sacharidis
Maria Sliwinska
Alia Amin
Javier Artiles
André Carvalho
Theodore Dalamagas
Stefano Ferilli
Sheila Gomes
Michiel Hildebrand
Monica Landoni
Natalia Loukachevitch
Roche Mathieu
Víctor Peinado
Konstantin Pussep
Monica Scannapieco
Zoltan Szlavik
Local Organization Committee
Laura Devesa
Antonio Carrasco
Rafael González
Ester Serna
Ángel Clar
(Office Contact)
(Communication and Press)
(Website Development)
(Graphic Design)
Bruno Araújo
David Bainbridge
Michelangelo Ceci
Reza Eslami
Gudrun Fischer
Mark Hall
Stephen Kimani
Francesca A. Lisi
Ming Luo
Diego Milano
Thomaz Philippe
Philippe Rigaux
Yannis Stavrakas
Theodora Tsikrika
Table of Contents
Architectures I
OpenDLibG: Extending OpenDLib by Exploiting a gLite Grid
Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Leonardo Candela, Donatella Castelli, Pasquale Pagano,
Manuele Simi
A Peer-to-Peer Architecture for Information Retrieval Across Digital
Library Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Ivana Podnar, Toan Luu, Martin Rajman, Fabius Klemm,
Karl Aberer
Scalable Semantic Overlay Generation for P2P-Based Digital
Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Christos Doulkeridis, Kjetil Nørvåg, Michalis Vazirgiannis
Reevaluating Access and Preservation Through Secondary Repositories:
Needs, Promises, and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Dean Rehberger, Michael Fegan, Mark Kornbluh
Repository Replication Using NNTP and SMTP . . . . . . . . . . . . . . . . . . . . . .
Joan A. Smith, Martin Klein, Michael L. Nelson
Genre Classification in Automated Ingest and Appraisal Metadata . . . . . .
Yunhyong Kim, Seamus Ross
The Use of Summaries in XML Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Zoltán Szlávik, Anastasios Tombros, Mounia Lalmas
An Enhanced Search Interface for Information Discovery from Digital
Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Georgia Koutrika, Alkis Simitsis
The TIP/Greenstone Bridge: A Service for Mobile Location-Based
Access to Digital Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Annika Hinze, Xin Gao, David Bainbridge
Architectures II
Towards Next Generation CiteSeer: A Flexible Architecture for Digital
Library Deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Isaac G. Councill, C. Lee Giles,
Ernesto Di Iorio, Marco Gori, Marco Maggini,
Augusto Pucci
Digital Object Prototypes: An Effective Realization of Digital Object
Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Kostas Saidis, George Pyrounakis, Mara Nikolaidou, Alex Delis
Design, Implementation, and Evaluation of a Wizard Tool for Setting
Up Component-Based Digital Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Rodrygo L.T. Santos, Pablo A. Roberto, Marcos André Gonçalves,
Alberto H.F. Laender
Design of a Digital Library for Early 20th Century Medico-legal
Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
George R. Thoma, Song Mao, Dharitri Misra, John Rees
Expanding a Humanities Digital Library: Musical References
in Cervantes’ Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Manas Singh, Richard Furuta, Eduardo Urbina, Neal Audenaert,
Jie Deng, Carlos Monroy
Building Digital Libraries for Scientific Data: An Exploratory Study
of Data Practices in Habitat Ecology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Christine Borgman, Jillian C. Wallis, Noel Enyedy
Designing Digital Library Resources for Users in Sparse, Unbounded
Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Richard Butterworth
Design and Selection Criteria for a National Web Archive . . . . . . . . . . . . . . 196
Daniel Gomes, Sérgio Freitas, Mário J. Silva
What Is a Successful Digital Library? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
Rao Shen, Naga Srinivas Vemuri, Weiguo Fan,
Edward A. Fox
Evaluation of Metadata Standards in the Context of Digital
Audio-Visual Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Robbie De Sutter, Stijn Notebaert, Rik Van de Walle
On the Problem of Identifying the Quality of Geographic Metadata . . . . . 232
Rafael Tolosana-Calasanz, José A. Álvarez-Robles,
Javier Lacasta, Javier Nogueras-Iso, Pedro R. Muro-Medrano,
F. Javier Zarazaga-Soria
Quality Control of Metadata: A Case with UNIMARC . . . . . . . . . . . . . . . . . 244
Hugo Manguinhas, José Borbinha
Large-Scale Impact of Digital Library Services: Findings from a Major
Evaluation of SCRAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Gobinda Chowdhury, David McMenemy, Alan Poulter
A Logging Scheme for Comparative Digital Library Evaluation . . . . . . . . . 267
Claus-Peter Klas, Hanne Albrechtsen, Norbert Fuhr, Preben Hansen,
Sarantos Kapidakis, Laszlo Kovacs, Sascha Kriewel, Andras Micsik,
Christos Papatheodorou, Giannis Tsakonas, Elin Jacob
Evaluation of Relevance and Knowledge Augmentation in Discussion
Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Ingo Frommholz, Norbert Fuhr
User Studies
Designing a User Interface for Interactive Retrieval of Structured
Documents — Lessons Learned from the INEX Interactive Track . . . . . . . . 291
Saadia Malik, Claus-Peter Klas, Norbert Fuhr, Birger Larsen,
Anastasios Tombros
“I Keep Collecting”: College Students Build and Utilize Collections
in Spite of Breakdowns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Eunyee Koh, Andruid Kerne
An Exploratory Factor Analytic Approach to Understand Design
Features for Academic Learning Environments . . . . . . . . . . . . . . . . . . . . . . . . 315
Shu-Shing Lee, Yin-Leng Theng, Dion Hoe-Lian Goh,
Schubert Shou-Boon Foo
Representing Contextualized Information in the NSDL . . . . . . . . . . . . . . . . . 329
Carl Lagoze, Dean Krafft, Tim Cornwell, Dean Eckstrom,
Susan Jesuroga, Chris Wilper
Towards a Digital Library for Language Learning . . . . . . . . . . . . . . . . . . . . . 341
Shaoqun Wu, Ian H. Witten
Beyond Digital Incunabula: Modeling the Next Generation of Digital
Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
Gregory Crane, David Bamman, Lisa Cerrato, Alison Jones,
David Mimno, Adrian Packel, David Sculley, Gabriel Weaver
Audiovisual Content
Managing and Querying Video by Semantics in Digital Library . . . . . . . . . 367
Yu Wang, Chunxiao Xing, Lizhu Zhou
Using MILOS to Build a Multimedia Digital Library Application: The
PhotoBook Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
Giuseppe Amato, Paolo Bolettieri, Franca Debole, Fabrizio Falchi,
Fausto Rabitti, Pasquale Savino
An Exploration of Space-Time Constraints on Contextual Information
in Image-Based Testing Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
Unmil Karadkar, Marlo Nordt, Richard Furuta, Cody Lee,
Christopher Quick
Language Technologies
Incorporating Cross-Document Relationships Between Sentences
for Single Document Summarizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Xiaojun Wan, Jianwu Yang, Jianguo Xiao
Effective Content Tracking for Digital Rights Management in Digital
Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Jen-Hao Hsiao, Cheng-Hung Li, Chih-Yi Chiu, Jenq-Haur Wang,
Chu-Song Chen, Lee-Feng Chien
Semantic Web Techniques for Multiple Views on Heterogeneous
Collections: A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
Marjolein van Gendt, Antoine Isaac, Lourens van der Meij,
Stefan Schlobach
A Content-Based Image Retrieval Service for Archaeology
Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
Naga Srinivas Vemuri, Ricardo da S. Torres, Rao Shen,
Marcos André Gonçalves, Weiguo Fan, Edward A. Fox
A Hierarchical Query Clustering Algorithm for Collaborative
Querying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
Lin Fu, Dion Hoe-Lian Goh, Schubert Shou-Boon Foo
A Semantics-Based Graph for the Bib-1 Access Points of the Z39.50
Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
Michalis Sfakakis, Sarantos Kapidakis
A Sociotechnical Framework for Evaluating a Large-Scale Distributed
Educational Digital Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
Michael Khoo
A Tool for Converting from MARC to FRBR . . . . . . . . . . . . . . . . . . . . . . . . . 453
Trond Aalberg, Frank Berg Haugen, Ole Husby
Adding User-Editing to a Catalogue of Cartoon Drawings . . . . . . . . . . . . . . 457
John Bovey
ALVIS - Superpeer Semantic Search Engine . . . . . . . . . . . . . . . . . . . . . . . . . . 461
Gert Schmeltz Pedersen, Anders Ardö, Marc Cromme, Mike Taylor,
Wray Buntine
Beyond Error Tolerance: Finding Thematic Similarities in Music
Digital Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
Tamar Berman, J. Stephen Downie, Bart Berman
Comparing and Combining Two Approaches to Automated Subject
Classification of Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
Koraljka Golub, Anders Ardö, Dunja Mladenić, Marko Grobelnik
Concept Space Interchange Protocol: A Protocol for Concept Map
Based Resource Discovery in Educational Digital Libraries . . . . . . . . . . . . . 471
Faisal Ahmad, Qianyi Gu, Tamara Sumner
Design of a Cross-Media Indexing System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
Murat Yakıcı, Fabio Crestani
Desired Features of a News Aggregator Service: An End-User
Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
Sudatta Chowdhury, Monica Landoni
DIAS: The Digital Image Archiving System of NDAP Taiwan . . . . . . . . . . 485
Hsin-Yu Chen, Hsiang-An Wang, Ku-Lun Huang
Distributed Digital Libraries Platform in the PIONIER Network . . . . . . . . 488
Cezary Mazurek, Tomasz Parkola, Marcin Werla
EtanaCMV: A Visual Browsing Interface for ETANA-DL Based
on Coordinated Multiple Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
Johnny L. Sam-Rajkumar, Rao Shen, Naga Srinivas Vemuri,
Weiguo Fan, Edward A. Fox
Intelligent Bibliography Creation and Markup for Authors: A Step
Towards Interoperable Digital Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
Bettina Berendt, Kai Dingel, Christoph Hanser
Introducing Pergamos: A Fedora-Based DL System Utilizing Digital
Object Prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
George Pyrounakis, Kostas Saidis, Mara Nikolaidou,
Vassilios Karakoidas
Knowledge Generation from Digital Libraries and Persistent Archives . . . . 504
Paul Watry, Ray R. Larson, Robert Sanderson
Managing the Quality of Person Names in DBLP . . . . . . . . . . . . . . . . . . . . . 508
Patrick Reuther, Bernd Walter, Michael Ley, Alexander Weber,
Stefan Klink
MedSearch: A Retrieval System for Medical Information Based
on Semantic Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
Angelos Hliaoutakis, Giannis Varelas, Euripides G.M. Petrakis,
Evangelos Milios
Metadata Spaces: The Concept and a Case with REPOX . . . . . . . . . . . . . . 516
Nuno Freire, José Borbinha
Multi-Layered Browsing and Visualisation for Digital Libraries . . . . . . . . . 520
Alexander Weber, Patrick Reuther, Bernd Walter, Michael Ley,
Stefan Klink
OAI-PMH Architecture for the NASA Langley Research Center
Atmospheric Science Data Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
Churngwei Chu, Walter E. Baskin, Juliet Z. Pao, Michael L. Nelson
Personalized Digital E-library Service Using Users’ Profile
Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
Wonik Park, Wonil Kim, Sanggil Kang, Hyunjin Lee,
Young-Kuk Kim
Representing Aggregate Works in the Digital Library . . . . . . . . . . . . . . . . . . 532
George Buchanan, Jeremy Gow, Ann Blandford, Jon Rimmer,
Claire Warwick
Scientific Evaluation of a DLMS: A Service for Evaluating Information
Access Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
Giorgio Maria Di Nunzio, Nicola Ferro
SIERRA – A Superimposed Application for Enhanced Image
Description and Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
Uma Murthy, Ricardo da S. Torres, Edward A. Fox
The Nautical Archaeology Digital Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
Carlos Monroy, Nicholas Parks, Richard Furuta, Filipe Castro
The SINAMED and ISIS Projects: Applying Text Mining Techniques
to Improve Access to a Medical Digital Library . . . . . . . . . . . . . . . . . . . . . . . 548
Manuel de Buenaga, Manuel Maña, Diego Gachet, Jacinto Mata
The Universal Object Format – An Archiving and Exchange Format
for Digital Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
Tobias Steinke
Tsunami Digital Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
Sayaka Imai, Yoshinari Kanamori, Nobuo Shuto
Weblogs for Higher Education: Implications for Educational Digital
Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
Yin-Leng Theng, Elaine Lew Yee Wan
XWebMapper: A Web-Based Tool for Transforming XML Documents . . . . 563
Manel Llavador, José H. Canós
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
OpenDLibG: Extending OpenDLib by Exploiting a
gLite Grid Infrastructure
Leonardo Candela, Donatella Castelli, Pasquale Pagano, and Manuele Simi
Istituto di Scienza e Tecnologie dell’Informazione “Alessandro Faedo” – CNR
Via G. Moruzzi, 1 - 56124 PISA - Italy
{candela, castelli, pagano, simi}@isti.cnr.it
Abstract. This paper illustrates how an existing digital library system, OpenDLib, has been extended to exploit the storage and processing capabilities offered by a gLite Grid infrastructure. Thanks to this extension, OpenDLib is now able to handle a much wider class of documents than in its original version and, consequently, can serve a larger class of application domains. In particular, OpenDLib can manage documents that require huge storage capabilities, like particular types of images, videos, and 3D objects, and can also create them on-demand as the result of a computationally intensive elaboration on a dynamic set of data, while requiring only a cheap investment in terms of computing resources.
1 Introduction
In our experience in working with digital libraries (DLs) we have often had to face the problem of resource scalability. Recent technological progress now makes it possible to support DLs where multimedia and multi-type content can be described, searched and retrieved with advanced services that make use of complex automatic tools for feature extraction, classification, summarization, etc. Despite the feasibility of such DLs, their actual use is still limited because of the high cost of the computing resources they require. Thus, for example, institutions that need to automatically classify images or 3D objects are forced to bear the cost of large processing capabilities even if this elaboration is only done sporadically. In order to overcome this problem, a couple of years ago we decided to start investigating the use of Grid technologies for supporting an effective handling of these objects. By using the features of the Grid, several institutions can share a number of storage and computing resources and use them on-demand, whenever they temporarily need them. This organization minimizes the total number of resources required while maximizing their utilization.
Our attempt to exploit Grid technologies is not isolated. Others are moving in the same direction, even if with different objectives. Widely used content repository systems, like DSpace [18] and Fedora [13], as well as DLs, like the Library of Congress, are presently using the SDSC Storage Resource Broker (SRB) [17] as a platform for ensuring preservation and, more generally, the long-term availability of access to digital information [15, 16].
Cheshire3 [14] is an Information Retrieval system that operates both in single-processor and in Grid distributed computing environments. A new release of this system,
capable of also processing and indexing documents stored in the SRB via their inclusion
in workflows, has recently been designed.
DILIGENT [6] aims at generalizing the notion of sharing proposed by Grid technologies by creating an infrastructure that connects not only computing and storage resources but, more generally, all the resources that compose a DL, i.e., archives of information, thesauri, ontologies, tools, etc. By exploiting the functionality provided by DILIGENT, digital libraries will be created on-demand from the resources connected through this infrastructure.
This paper describes how we have extended an existing DL system, OpenDLib [4], to enable it to exploit the sharing of storage and processing capabilities offered by a gLite Grid infrastructure [12] for effectively handling new document types. The system resulting from this extension, named OpenDLibG, can manage documents that require huge storage capabilities, like particular types of images, videos, and 3D objects, and can also create them on-demand as the result of a computationally intensive elaboration on a dynamic set of data. The novelty of this system with respect to its predecessor is that, by exploiting the on-demand usage of resources provided by the Grid, it can provide reliable, scalable and high-throughput functionality on complex information objects without necessarily requiring large investments in computing resources.
The paper presents the adopted technical solution, highlighting not only the potential related to the use of a Grid infrastructure in this particular DL application framework, but also the aspects that have to be carefully considered when designing services
that exploit it. The additional features offered by this new version of the OpenDLib
system are illustrated by presenting a real application case that has been implemented
to concretely evaluate the proposed solution.
The rest of the paper is organized as follows: Section 2 briefly introduces OpenDLib
and gLite; Section 3 provides details on the technical solution that has been implemented; Section 4 describes the new functionality OpenDLibG is able to offer and illustrates this functionality by presenting an implemented usage scenario; and finally,
Section 5 contains final remarks and plans for further extensions.
2 The Framework
In this section we present a very brief overview of the OpenDLib and gLite technologies, focusing on the aspects that are relevant for describing OpenDLibG.
2.1 OpenDLib
OpenDLib [4] is a digital library system developed at ISTI-CNR. It is based on a Service-Oriented Architecture that enables the construction of networked DLs hosted by multiple servers belonging to different institutions. Services implementing the DL functionality communicate through an HTTP-based protocol named OLP [5]. These services can be distributed or replicated on more than one server and can be logically organised as in Figure 1.
The Collective Layer contains services performing the co-ordination functions (e.g.
mutual reconfiguration, distribution and replication handling, work-load distribution) on
the services federation. In particular, the Manager Service maintains a continually updated view of the status of the DL service instances and disseminates it on request to the other services, which use this information to dispatch requests to the appropriate service instances. Thanks to this functionality, no service instance needs to know in advance where the other instances are located or how to discover the appropriate instances to call.
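The routing mechanism described above can be sketched in a few lines of Python; the class names, the registry structure and the round-robin choice below are illustrative assumptions, not the actual OpenDLib implementation.

```python
import itertools

class ManagerService:
    """Toy model of the Manager Service: it keeps an up-to-date map of
    live service instances and hands it out to callers on request."""
    def __init__(self):
        self._instances = {}  # service name -> list of instance URLs

    def register(self, service, url):
        self._instances.setdefault(service, []).append(url)

    def status(self):
        # Disseminated on request to every other service.
        return dict(self._instances)

class ServiceProxy:
    """A service uses the disseminated status to route calls without
    knowing in advance where the other instances live."""
    def __init__(self, manager):
        self._cycles = {name: itertools.cycle(urls)
                        for name, urls in manager.status().items()}

    def dispatch(self, service):
        # Pick the next known instance of the requested service.
        return next(self._cycles[service])

mgr = ManagerService()
mgr.register("Repository", "http://nodeA/repo")
mgr.register("Repository", "http://nodeB/repo")
proxy = ServiceProxy(mgr)
print(proxy.dispatch("Repository"))  # alternates between the two instances
```

The point of the sketch is only that callers never hold hard-coded instance locations: they ask the manager for the current status and dispatch accordingly.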
The DL Components area includes services implementing DL functions. The basic OpenDLib release offers services to support the description, indexing, browsing, retrieval, access, preservation, storage, and virtual organization of documents. In particular, the storage and dissemination of documents are handled by the Repository service.
The Workflows area provides functionality implemented through workflows, i.e. structured plans of service calls. In particular, this area includes the Library Manager, which manages and controls the submission, withdrawal and replacement of documents.
The Presentation area contains services implementing the user front-ends to the other services. It contains a highly customisable User Interface and an OAI-PMH Publisher.

Fig. 1. The OpenDLib Layered Architecture
The OpenDLib Kernel supports all the above services by providing mechanisms to
ensure the desired quality of service.
These services can be configured by specifying a number of parameters, like metadata and document formats, user profile format, query language, etc. The set of services
illustrated above can be extended by including other services that implement additional
application-specific functionality.
The OpenDLib services interact by sharing a common information objects model,
DoMDL [2]. This model can represent a wide variety of information object types with
different formats, media, languages and structures. Moreover, it can represent new types
of documents that have no physical counterpart, such as composite documents consisting of the slides, video and audio recordings of a lecture, a seminar or a course. It can
also maintain multiple editions, versions, and manifestations of the same document,
each described by one or more metadata records in different formats. Every manifestation of a digital object can either be stored locally or retrieved from a remote server, and it can be displayed either at run time or in its remote location. A manifestation can also be a reference to another object manifestation; through this mechanism, data duplication can be avoided.
2.2 gLite
gLite [12] is Grid middleware recently released by EGEE [7], the largest Grid infrastructure project currently funded in Europe.
The role of gLite is to hide the heterogeneous nature of both the computing elements
(CEs), i.e. services representing a computing resource, and storage elements (SEs), i.e.
services representing a storage resource, by providing an environment that facilitates
and controls their sharing.
The services constituting the gLite software are logically organized as in Figure 2.

Fig. 2. The gLite Services

The Job Management Services subsystem is in charge of managing jobs and DAGs¹. Its core components are the Computing Element, the Workload Manager (WMS), and the Logging and Bookkeeping services. The former represents a computing resource and provides an interface for job submission and control. It is worth noting that the back end of the CE is composed of a set of computational resources managed by a Local Resource Management System (LRMS), e.g. Torque or Condor. The Workload Manager is the subsystem whose main role is to accept job submission requests and forward them to the appropriate CEs.
The Logging and Bookkeeping service is in charge of tracking jobs in terms of events (e.g. submitted, running, done) gathered from the various WMSs and CEs.
The Data Management Services subsystem is in charge of managing data and file access. gLite assumes that the granularity of data is at the file level and that access is controlled by Access Control Lists. The main services are the gLite I/O, the Storage Element, and the Data Catalog. The former provides a POSIX-like file I/O API, while the Storage Element represents the back-end storage resource and can be implemented with various Storage Resource Managers, e.g. dCache³ or DPM⁴. The Data Catalog makes it possible to perceive the storage capacity of the infrastructure as a single file system.
The Security Services subsystem is in charge of dealing with authentication, authorization and auditing issues. Currently, the Virtual Organization Membership Service (VOMS) is the main service dealing with these issues. Other aspects are regulated via well-known standards and technologies, e.g. X.509 Proxy Certificates [19] and Grid Map Files.
The Information and Monitoring Services subsystem discovers and monitors the resources forming the infrastructure. The main service is the Relational Grid Monitoring Architecture (R-GMA), a registry supporting the addition and removal of data about the resources constituting the infrastructure.
The Access Services subsystem enables end-users to access and use the resources of the infrastructure. Its main component is the User Interface (UI), a suite of clients and APIs enabling users to perform the common tasks of a gLite-based infrastructure, e.g. storing and retrieving files, running jobs and monitoring their status.
¹ In gLite terminology, jobs are applications that can run on a CE, and DAGs are directed acyclic graphs of dependent jobs.
³ dCache is accessible at http://www.dcache.org
⁴ DPM information can be found at http://wiki.gridpp.ac.uk/wiki/Disk_Pool_Manager
3 OpenDLibG: gLite Integration and Exploitation
The OpenDLib document model is flexible enough to represent a large variety of complex information objects that, if widely employed, could change the way in which research is done. By exploiting the functionality built on this model, multimedia objects can be composed with tables, graphs or images generated by processing large amounts of raw data, videos can be mixed with text and geographic information, and so on. Even if support for this type of complex information object is theoretically possible with OpenDLib, in practice it turns out to be unrealistic due to the large amount of computing and storage resources that have to be employed to provide performance acceptable to users. Our decision to extend OpenDLib so that it can exploit the storage and processing capabilities provided by a gLite-compliant infrastructure was mainly motivated by the aim of overcoming this severe limitation. In the rest of this section we describe the components that we have added, how they have been integrated in the architecture, and the difficulties that we faced in performing this integration.
3.1 The Integrated Architecture
In order to equip OpenDLib with the capabilities required to exploit a gLite-compliant
infrastructure we designed the following new services:
– gLite SE broker: interfaces OpenDLib services with the pool of SEs made available
via the gLite software and optimises their usage.
– gLite WMS wrapper: provides OpenDLib services with an interface to the pool of
gLite CEs and implements the logic needed to optimize their usage.
– gLite Identity Provider: maps the OpenDLib user and service identities onto gLite
user identities that are recognized and authorized to use gLite resources.
– OpenDLib Repository++: implements an enhanced version of the OpenDLib
Repository service. It is equipped with the logic required to manage and optimize
the usage of both OpenDLib repositories and gLite SEs as well as to manage novel
mechanisms for the dynamic generation of document manifestations.
The architecture of the resulting system is depicted in Figure 3.
Thanks to the extensibility of the OpenDLib application framework the integration
of these services has been obtained by only modifying the configuration of some of
the existing services without any change in their code. In particular, the OpenDLib
Manager Service has been appropriately configured to provide information about the
new services and to disseminate new routing rules. These rules enable the OpenDLib
UI to interact with instances of the OpenDLib Repository++ service in a completely
transparent way for both the submission of, and the access to, documents while the
Repository service is only accessed through its new enhanced version.
We explicitly chose to build the enhanced version of the Repository service as an
abstraction of the basic version. It does not replace the original service because not all
digital libraries require a Grid-based infrastructure. However, this new service maintains
all the main characteristics of the basic version and, in particular, it can be replicated
and/or distributed in the infrastructure designed to provide the DL application. Finally, a Repository++ can manage a multitude of basic Repository services, while the same basic Repository service can accept requests coming from different Repository++ instances.
In the rest of this section we present each of the new services in detail.
Fig. 3. OpenDLib integrated with gLite: the Architecture
The gLite SE broker. It provides the other OpenDLib services with the capability of using gLite-based storage resources. In particular, this service interfaces with the gLite I/O server to store, withdraw, and access files.
In designing this service, one of our main concerns was to overcome two problems we discovered while experimenting with the current gLite release: (i) inconsistency between the catalog and the storage resource management systems, and (ii) failures of access or remove operations without notification.
Although the gLite SE broker cannot by itself improve the reliability of the underlying gLite operations, we designed it to: (i) monitor its requests, (ii) verify the status of the resources after each operation has been processed, (iii) repeat the file registration in the catalog and/or storage until it is considered correct or unrecoverable, and (iv) return a valid message reporting the exit status of the operation. The feasibility of this approach was validated by the small delay measured experimentally, as well as by the judgement of real users.
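A minimal sketch of this verify-and-retry behaviour, with hypothetical `put`/`register`/`lookup` operations standing in for the real gLite I/O and catalog APIs (the in-memory stubs simulate the catalog/storage inconsistency described above):

```python
MAX_ATTEMPTS = 3  # illustrative bound, not a value from the real service

def verified_store(io_client, catalog, path, data, max_attempts=MAX_ATTEMPTS):
    """Store a file and repeat the catalog registration until the state
    is verified as consistent, or declare the operation unrecoverable."""
    for attempt in range(1, max_attempts + 1):
        io_client.put(path, data)        # (i) issue and monitor the request
        catalog.register(path)           # (iii) (re)try the registration
        # (ii) verify the status of the resources after the operation
        if catalog.lookup(path) and io_client.exists(path):
            return {"status": "ok", "attempts": attempt}   # (iv) exit status
    return {"status": "unrecoverable", "attempts": max_attempts}

class MemIO:
    """In-memory stand-in for a gLite I/O server."""
    def __init__(self): self.files = {}
    def put(self, path, data): self.files[path] = data
    def exists(self, path): return path in self.files

class MemCatalog:
    """In-memory stand-in for the catalog; can drop the first registrations
    to simulate the silent failures observed in the current gLite release."""
    def __init__(self, fail_first=0):
        self.entries, self._fail = set(), fail_first
    def register(self, path):
        if self._fail > 0:
            self._fail -= 1
            return                        # silent failure, no notification
        self.entries.add(path)
    def lookup(self, path): return path in self.entries

io, cat = MemIO(), MemCatalog(fail_first=1)
print(verified_store(io, cat, "lfn:/dl/doc1", b"bytes"))  # succeeds on attempt 2
```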
In order to appropriately exploit the great number of resources provided by the infrastructure, the gLite SE broker service was designed to interface with more than one I/O server, distributing storage and access operations among them. In particular, this service can be configured to support three types of storage strategies for distributing files among the I/O servers, namely: (i) round-robin; (ii) file-type-based, which places the files of a certain type on a predefined set of I/O servers; and (iii) priority-based, which is useful to enhance one of the previous strategies with a prioritized list of I/O servers ordering the requests to them. It is worth noting that the service can also dynamically rearrange the prioritized list by taking into account performance characteristics, e.g. the time and the number of failures in executing I/O actions.
Inspired by RAID technology⁵, we designed the gLite SE broker to support the RAID 1 modality, which mirrors each file by creating a copy of it on two or more servers. This feature is activated by default but can be explicitly turned off at configuration time. The RAID 0 (a.k.a. striped) modality, which splits files across two or more servers, and the possibility of selecting the appropriate modality for each file at submission time, are under investigation.
⁵ A Redundant Array of Independent Disks, a.k.a. Redundant Array of Inexpensive Disks.
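The distribution strategies and the RAID 1 mirroring modality described above can be sketched as follows; the `SEBroker` class, its API and the server names are illustrative assumptions, not the real broker interface.

```python
import itertools

class SEBroker:
    """Toy model of the gLite SE broker's file-distribution strategies."""
    def __init__(self, servers, by_type=None, priority=None):
        self._rr = itertools.cycle(servers)      # (i) round-robin state
        self._by_type = by_type or {}            # (ii) file type -> servers
        self._priority = priority or list(servers)  # (iii) priority order

    def pick(self, file_type=None, strategy="round-robin"):
        """Choose the I/O server for the next operation."""
        if strategy == "file-type-based" and file_type in self._by_type:
            return self._by_type[file_type][0]
        if strategy == "priority-based":
            return self._priority[0]
        return next(self._rr)

    def store_mirrored(self, data, copies=2):
        """RAID-1-like modality: write the same file on `copies` servers."""
        return [self.pick() for _ in range(copies)]

broker = SEBroker(["io1", "io2", "io3"], by_type={"video": ["io3"]})
print(broker.pick(file_type="video", strategy="file-type-based"))  # io3
print(broker.store_mirrored(b"raw-data"))  # two distinct I/O servers
```

Dynamic rearrangement of the priority list based on measured failures would amount to re-sorting `self._priority` between calls, which the sketch omits.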
The gLite WMS wrapper. It provides the other OpenDLib services with the computing power supplied by gLite CEs. In particular, this service offers an interface for
managing jobs and DAGs with an abstraction level higher than that provided by gLite.
The gLite WMS wrapper has been designed to: (i) deal with more than one WMS, (ii) monitor the quality of service provided by these WMSs by analyzing the number of managed jobs and the average time of their execution, and, finally, (iii) monitor the status of each submitted job by querying the Logging and Bookkeeping service. As a consequence of the implemented functionality, the gLite WMS wrapper represents a single point of access to the computing capabilities provided by the WMS services and to the monitoring capabilities provided by the Logging and Bookkeeping services. This approach decouples the gLite infrastructure from the OpenDLib federation of services while hiding their characteristics. Moreover, by exploiting the features provided by the OpenDLib application framework, the gLite WMS wrapper service can be replicated in a number of different service instances, each managing the same set of gLite services, or distributed over a number of different service instances, each managing a different pool of gLite services.
In implementing this component we provided both round-robin and priority-based scheduling strategies to manage the distribution of jobs to WMSs. In particular, the second approach enhances the first by identifying a priority list of WMSs that orders the requests to them. The possibility of automatically manipulating this priority to take into account performance metrics, such as the time and the number of failures, is still under investigation.
Finally, we equipped the service with a basic fault-tolerance capability for job submission tasks, which repeats the submission in case of failure.
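The priority-ordered dispatch with resubmission on failure might look like the following sketch; the `WMSWrapper` API and the use of exceptions to signal a failed submission are assumptions made for illustration.

```python
class WMSWrapper:
    """Toy model of the gLite WMS wrapper: priority-based dispatch over
    several WMSs plus simple resubmission on failure."""
    def __init__(self, wmss):
        self._wmss = list(wmss)  # ordered priority list of WMS endpoints

    def submit(self, job, max_retries=2):
        for _attempt in range(max_retries + 1):
            for wms in self._wmss:            # walk the priority list
                try:
                    return wms.submit(job)
                except RuntimeError:
                    continue                  # fall back to the next WMS
        raise RuntimeError("job submission failed on all WMSs")

class FakeWMS:
    """Stand-in WMS endpoint for the sketch."""
    def __init__(self, name, failing=False):
        self.name, self.failing = name, failing
    def submit(self, job):
        if self.failing:
            raise RuntimeError(self.name + " unavailable")
        return (self.name, job)

wrapper = WMSWrapper([FakeWMS("wms-a", failing=True), FakeWMS("wms-b")])
print(wrapper.submit("BEAT2GRID"))  # falls back to wms-b
```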
gLite Identity Provider. The mechanisms that support authentication and authorization in OpenDLib and gLite are different. The two systems have been designed to satisfy different goals in completely different usage scenarios: OpenDLib operates in a distributed framework where the participating institutions collaborate and share the same rules and goals under the supervision of a distributed management, while gLite has to work in an environment where policies and access rules are managed differently by the participating institutions. OpenDLib builds its authentication mechanism on user names and passwords, while gLite builds it on X.509 certificates. Moreover, the authorization mechanisms for assigning policies to users are proprietary in OpenDLib, while they are based on the Virtual Organization mechanism and Grid Map Files in a gLite-based infrastructure. In order to reconcile these authentication and authorization frameworks, a service able to map OpenDLib identities onto gLite identities was introduced. The main characteristics of this service are:
– it generates the Proxy Certificates [19] that are needed to interact with gLite resources. In order to support this functionality it has to be equipped with the appropriate pool of personal certificates that, obviously, must be stored on a secure device.
– it can be configured to establish the mapping rules for turning OpenDLib identities
into gLite identities.
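A toy model of the configurable mapping rules is shown below; glob-style patterns are an assumed rule syntax, and in the real service a proxy certificate would then be generated for the mapped identity.

```python
import fnmatch

class IdentityProvider:
    """Sketch of the gLite Identity Provider: an ordered list of
    (OpenDLib-id pattern, gLite identity) rules, first match wins."""
    def __init__(self, rules, default=None):
        self._rules = rules
        self._default = default

    def map_identity(self, opendlib_id):
        for pattern, glite_id in self._rules:
            if fnmatch.fnmatch(opendlib_id, pattern):
                return glite_id
        return self._default

# Hypothetical configuration: administrators get a dedicated gLite
# identity, everybody else is mapped onto a generic one.
idp = IdentityProvider(
    rules=[("admin*", "/C=IT/O=DL/CN=dl-admin"),
           ("*", "/C=IT/O=DL/CN=dl-generic")])
print(idp.map_identity("admin-alice"))   # /C=IT/O=DL/CN=dl-admin
print(idp.map_identity("reader-bob"))    # /C=IT/O=DL/CN=dl-generic
```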
The OpenDLib Repository++. This service was designed to act as a virtual repository,
capable of the same operations as those required to store and access documents in a
traditional OpenDLib DL. In this way the other services of the infrastructure need neither to be aware of this service's enhanced capabilities nor to be redesigned and re-implemented. Although the public interface of this service completely resembles the Repository interface, its logic is completely different: it does not store any content locally but relies on the storage facilities provided by both the OpenDLib Repository and the gLite infrastructure via the gLite SE broker.
In designing this component we decided to make the strategy for distributing content
on the two kinds of storage systems configurable.
The configuration aspects exploit the DoMDL management functionality, allowing any supported manifestation type to be associated with a predefined workflow customising storage, access, and retrieval capabilities. It is thus possible to design and implement the most appropriate processes for each new type of raw data managed by the DL and easily associate them with the manifestation type. In the current version, one workflow to store, access, and retrieve files through the described gLite wrappers has been implemented. For instance, it is possible to configure the Repository++ service to maintain all metadata manifestations on a specific OpenDLib Repository instance and a certain manifestation type on another OpenDLib Repository, while raw data and satellite products that are accessed less frequently and require a huge amount of storage can be stored on an SE provided by the gLite-based infrastructure. The characteristics of the content to be stored should drive the designer in making the configuration. Usually, manifestations that need to be accessed frequently, or that need to be maintained under the physical control of a specific library institution, should be stored on standard OpenDLib Repository services. By contrast, content returned by processes that either is not directly usable by the end-user or can be freely distributed on third-party storage devices should be stored on gLite SEs.
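The configurable distribution of manifestations over the two kinds of storage can be modelled as a simple routing table; the manifestation type names and back-end identifiers below are illustrative, not part of the actual configuration format.

```python
class RepositoryPlusPlus:
    """Sketch of the Repository++ routing: a configurable map from
    manifestation type to storage back end, with a default fallback."""
    def __init__(self, routing, default_backend):
        self._routing = routing          # manifestation type -> backend id
        self._default = default_backend

    def backend_for(self, manifestation_type):
        return self._routing.get(manifestation_type, self._default)

# Hypothetical configuration: metadata stays on standard Repositories,
# bulky raw data goes to the Grid via the SE broker.
repo = RepositoryPlusPlus(
    routing={"metadata": "opendlib-repo-1",
             "thumbnail": "opendlib-repo-2",
             "raw-data": "glite-se-broker"},
    default_backend="opendlib-repo-1")
print(repo.backend_for("raw-data"))   # stored on the Grid
print(repo.backend_for("video"))      # falls back to a standard Repository
```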
Cryptography capabilities are under investigation to mitigate the problems, mostly related to copyright management, of storing content on third-party devices.
Another important feature added to the enhanced repository is the capability of associating a job, or a DAG of jobs, with a manifestation. This feature makes it possible to manage new types of document manifestations, i.e. manifestations dynamically generated by running a process at access time. The realisation of this extension has been quite simple in OpenDLib thanks to DoMDL, which is able to associate the URI of a specific task with a manifestation. In this case, the task uses the gLite WMS wrapper to execute a process customized with the information identifying the job/DAG to be run, together with the appropriate parameters.
An example of the exploitation of this functionality is given in the following section.
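Such a dynamically generated manifestation can be sketched as follows; the task URI scheme, the parameter names and the stand-in for the gLite WMS wrapper call are all invented for illustration.

```python
class DynamicManifestation:
    """Sketch of a manifestation generated at access time: DoMDL stores
    the URI of a task, and accessing the manifestation submits the
    associated job through the WMS wrapper."""
    def __init__(self, task_uri, params, wms_submit):
        self.task_uri, self.params = task_uri, params
        self._submit = wms_submit  # callable standing in for the wrapper

    def access(self):
        # Run the registered job/DAG and return the generated content.
        return self._submit(self.task_uri, self.params)

def fake_wms_submit(task_uri, params):
    """Stand-in for the gLite WMS wrapper submission call."""
    return f"product generated by {task_uri} with {params}"

m = DynamicManifestation("wms://jobs/no2-profile", {"region": "med"},
                         fake_wms_submit)
print(m.access())
```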
4 OpenDLibG in Action: An Environmental DL
Stimulated by the long discussions we had with members of the European Space Agency (ESA)⁶, we decided to experiment with the construction of an OpenDLibG DL to support the work of the agencies that collaborate on the definition of environmental conventions. By exploiting their rich information sources, ranging from raw data sets to
⁶ These discussions and the corresponding requirements were raised mainly in the framework of the activities related to the DILIGENT project.
maps and graphs archives, these agencies periodically prepare reports on the status of the environment. Currently, this task is performed by first selecting the relevant information from each of the multiple and heterogeneous sources available, then launching complex processing on large amounts of data to obtain graphs, tables and other summary information and, finally, producing the required report by assembling all the different parts together. This process, repeated periodically, requires a lot of work due to the complexity of interfacing the different sources and tools. Despite the effort spent, the resulting reports do not always meet the requirements of their users, since they present a picture of the environmental status at the time the report was produced, rather than at the time the information is accessed and used. To overcome this problem and, more generally, to simplify the generation of environmental reports, we created an OpenDLibG DL prototype.
From the architectural point of view, the OpenDLibG components of this DL are
hosted on three servers. The first server is publicly accessible and hosts the User Interface service that allows end-users to easily interact with a human-friendly interface.
The second and third servers are protected behind a firewall and host the basic and the
extended OpenDLib services respectively.
As far as the Grid infrastructure is concerned, the OpenDLibG environmental DL exploits the DILIGENT gLite infrastructure. This infrastructure consists of five sites located in Pisa (Italy), Rome (Italy), Athens (Greece), Hall in Tyrol (Austria) and Darmstadt (Germany). Each site provides storage and computational capabilities, for a total of 41 processors, 38.72 GB of RAM, and 3.28 TB of disk space. For the scope of this DL, we decided to exploit only two storage elements based on dCache and two other storage elements based on DPM.
Fig. 4. A GOMOS Document
In this experimental environmental DL the Repository service has been configured to manage DoMDL instances that are able to maintain both information objects selected from third-party information sources, whose content is imported into or linked from the DL, and information objects whose manifestations are generated on demand
using a registered workflow that invokes the gLite WMS Wrapper for executing specific elaborations.
This DL provides the data, the documents, the dynamically generated reports, and any other content and services deemed relevant to the day-by-day activity of the people who make decisions on environmental strategies. In particular, the prototype can: (i) manage environmental reports, workshop proceedings and other types of documents relevant to the Earth Observation community, collected from different and heterogeneous information sources; (ii) deal with added-value satellite products like chlorophyll concentration maps and mosaics; (iii) dynamically produce reports related to certain world regions and time periods; (iv) use the gLite job management facilities to produce nitrate and ozone profiles from satellite products, thus enabling experiments on the management of such data; and (v) support the search and use of a number of relevant external data, services, and resources concerned with Earth Science, like glossaries, gazetteers, and other digital libraries of interest.
In particular, the above information objects have been obtained by harvesting: (i) documents gathered or linked from external information sources, like MFSTEP monthly reports and the European Environment Agency reports, briefings, indicators and news; (ii) high-resolution satellite images, both directly acquired and dynamically generated; and (iii) level-two ENVISAT-GOMOS products⁹ containing raw data on ozone, temperature, moisture, NO₂, NO₃, OClO and O₃ measures collected by the GOMOS sensor.
Figure 4 shows an example of the novel type of documents that can be managed by this DL. This document is composed of (i) a metadata view, containing descriptive information like the start and stop sensing dates and the coordinates of the geographical area the document refers to, and (ii) three products, defined via appropriate workflows, whose manifestations are generated by running those workflows on the Grid infrastructure. These workflows exploit the BEAT2GRID application, provided by ESA and adapted by us to run on a gLite-based infrastructure, together with the appropriate operations for gathering from the Grid the raw data to be elaborated, storing the obtained products on the Grid, and linking them as document manifestations. The workflows generate geolocation information extracted from the raw data; the NO₂/NO₃ image profile information, showing the density with respect to the altitude; the NO₂/NO₃ profile information, comprising date, time, longitude and latitude of the tangent point, and longitude and latitude of the satellite; and the ozone density with respect to the altitude, together with the ozone density covariance. Each of these products represents a manifestation of a product view. According to the document definition, each product manifestation can be retrieved from the Grid or dynamically generated at access time. To give access to these complex objects a specialised user interface has been designed. It is capable of starting the product generation process, progressively showing the status of the workflow execution and, once the products are generated, giving access to them.
⁹ ENVISAT (ENVIronment SATellite) is an ESA Earth Observation satellite whose purpose is to collect earth observations; it is fitted with 10 sensors (ASAR, MERIS, AATSR, RA-2, MWR, DORIS, GOMOS, MIPAS, SCIAMACHY, LRR) and other units. Detailed information about the ENVISAT satellite can be found at http://envisat.esa.int/
It is worth noting that the BEAT2GRID application executes in a couple of minutes on an ordinary entry-level bi-processor server. However, standard DL applications cannot provide such functionality to end-users, since hundreds of concurrent requests expose the limited scalability of a static infrastructure. OpenDLibG, powered by the described gLite-based infrastructure, instead proves able to manage tens of concurrent requests with the same throughput as a single execution on a dedicated server, and to correctly manage a higher number of requests by using queueing mechanisms. The same observation holds for storage capacity. Raw data and intermediate processing results require a huge amount of storage space. Thanks to Grid technology, this space can be obtained on demand from third-party institutions, whereas a standard DL would need to be equipped with that amount of resources even if they are needed only for a limited time period.
This experimentation can be considered the first step in the exploitation of Grid-enabled DLs. Moreover, it represents a great opportunity for both users and digital library developers to share views and language, to express needs via practical examples, to understand capabilities for future exploitation, to assess practical progress, to evaluate opportunities and alternative solutions, to support technical decisions and, last but not least, to develop critical interfaces.
5 Conclusions and Lessons Learned
This paper has described OpenDLibG, a system obtained by extending the OpenDLib
digital library system with services exploiting a gLite Grid infrastructure. As a result
of this extension, OpenDLibG can provide both more advanced functionality on novel information objects and a better quality of service, without requiring a very expensive investment in computing and storage resources.
We strongly believe that the new information objects described in this paper will
play an increasingly important role in the future as they can contribute to revolutionize
the way in which many communities perform their activities. In this paper we have shown only one example of the exploitation of such documents, but many others have been suggested to us by the many user communities we are in contact with.
The integration of OpenDLib with a Grid infrastructure not only makes it possible
to handle the new type of objects but it also supports any functionality whose implementation requires intensive batch computations. For example, periodic complex feature extraction on large document collections or generation and storage of multiple and
alternative manifestations for preservation purposes can similarly be supported while
maintaining a good quality of service. Our next plan is to extend the system with novel, distributed algorithms that provide DL functionality by relying on the huge amount of computing and storage power provided by the Grid.
While carrying out this experience we have learnt that there are a number of aspects
that have to be carefully considered in designing a DL system that exploits a Grid infrastructure. In this framework, resources are provided by third parties and there is no central control over their availability. These resources can disappear or become unavailable without informing any central authority, which therefore has no means to prevent it. This problem is made worse by the lack of advance reservation, i.e. the
possibility for a resource user to agree with the resource provider on the availability
of a resource for a well established time period and on a given quality of service. This
feature is a long term goal in the Grid research area and it is expected that it will be
provided in future releases of Grid middleware. This lack has strong implications for the reliability of Grid resource usage. For example, a document stored on a single SE can be lost if the SE is removed from the Grid by its provider. Appropriate measures have to be taken to reduce this risk: for example, a DL service must be designed in such a way that, if the CE running one of its processes disappears, it is able to recover from the malfunction. Other aspects to be carefully taken into account are related to performance. Some of them apply to any Grid infrastructure, while others are more specific and relate to the gLite software and its current release. Perhaps the most important among these aspects is the communication overhead that arises when using resources spread over the Net. In this context, where resources are SEs and CEs, the decision to rely on third parties for storage or processing capabilities must be carefully evaluated: the enhancement obtained must be weighed against the overhead incurred, and the right trade-off between these aspects must be found.
Acknowledgments. This work is partially funded by the European Commission in the context of the DILIGENT project, under the 2nd call of the FP6 IST priority.
We thank all the DILIGENT partners who have contributed their suggestions and expertise to this work. Particular thanks go to ESA, which provided us with the requirements for live documents, the real data, and the specific applications for our experimentation; CERN, which supported us in our understanding of the gLite technology; and FhG-IPSI, which studied the problem of dynamically defining the structure and visualization of the on-demand documents and collaborated with CNR in setting up the experimentation.
A Peer-to-Peer Architecture for Information Retrieval
Across Digital Library Collections
Ivana Podnar, Toan Luu, Martin Rajman, Fabius Klemm, and Karl Aberer
School of Computer and Communication Sciences
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Lausanne, Switzerland
{ivana.podnar, vinhtoan.luu, martin.rajman,
fabius.klemm, karl.aberer}@epfl.ch
Abstract. Peer-to-peer networks have been identified as a promising architectural concept for developing search scenarios across digital library collections. Digital libraries typically offer sophisticated search over their local content; however, search methods involving a network of such stand-alone components are currently quite limited. We present an architecture for highly-efficient search over digital library collections based on structured P2P networks. As the standard single-term indexing strategy faces significant scalability limitations in distributed environments, we propose a novel indexing strategy: key-based indexing. The keys are term sets that appear in a restricted number of collection documents. Thus, they are discriminative with respect to the global document collection and ensure scalable search costs. Moreover, key-based indexing computes posting list joins at indexing time, which significantly improves query performance. As search-efficient solutions usually imply costly indexing procedures, we present experimental results showing acceptable indexing costs, while the retrieval performance is comparable to standard centralized solutions with TF-IDF ranking.
1 Introduction
Research in the area of information retrieval has largely been motivated by the growth of digital content provided by digital libraries (DLs). Today DLs offer sophisticated retrieval features; however, search methods are typically bound to a single stand-alone library. Recently, peer-to-peer (P2P) networks have been identified as a promising architectural concept for integrating search facilities across DL collections [1, 2]. P2P overlays are self-organizing systems for decentralized data management in distributed environments. They can be seen as a common medium for ‘advertising’ DL contents, e.g. to specialists in a particular area or to the broader public. We argue that a wide range
of topic- and genre-specific P2P search engines can increase the visibility of existing DLs while providing guarantees for objective search and ranking performance. Note that P2P networks cannot be centrally controlled: peers are located in various domains and require minimal in-place infrastructure and maintenance.
The work presented in this paper was carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European FP6 STREP project ALVIS (002068).
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 14–25, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Full-text P2P search is currently an active research area, as existing P2P solutions still do not meet the requirements of relevance-based retrieval. It is a challenging problem since search engines traditionally rely on central coordination, while P2P is inherently decentralized. For example, global document collection statistics are not readily available in P2P environments, and naïve broadcast solutions for acquiring such statistics induce huge network traffic. In fact, scalability issues and potentially high bandwidth consumption are among the major obstacles to large-scale full-text P2P search [3].
In this paper we present an integrated architecture for information retrieval over textual DL collections. We assume DLs are cooperative and provide an index of a representative sample of their collections, or supply the documents they want to make searchable through a P2P engine. In this way DLs can choose the content that becomes globally available, which naturally resolves the problems related to restricted crawler access. The architecture accommodates distributed indexing, search, retrieval, and ranking over structured P2P networks by means of a common global inverted index, and serves as a blueprint for our prototype system ALVIS PEERS, a full-text search engine designed to offer highly-efficient search with retrieval quality comparable to centralized solutions. It is the result of our research efforts within the ALVIS project, which aims at building an open-source semantic search engine with P2P and topic-specific technology at its core [4].
We propose a novel indexing scheme and design a distributed algorithm for maintaining the global index in structured P2P networks. Our engine indexes keys, i.e. terms and term sets appearing in a restricted number of global collection documents, while keeping indexing at document granularity. Indexed keys are rare and discriminative with respect to the global document collection. They represent selective queries readily retrievable from the global P2P index, while search costs are significantly reduced due to the limited posting list size. As our engine provides highly-efficient search over a global P2P network, the indexing procedure is costly. However, since DL collections are rather static, it is appropriate to invest resources in the indexing procedure and benefit greatly from the search performance. We will show experimentally that, as we carefully choose keys, the key indexing costs remain acceptable. The number of indexed keys per peer is nearly constant for large document collections, as is the average posting list size when we keep the number of documents per peer constant and increase the global collection by adding new peers. The bandwidth consumption during retrieval is substantially smaller compared to single-term indexing, while the observed retrieval quality (top-k precision) is comparable to standard centralized solutions with TF-IDF ranking. In contrast to the majority of published experimental results that rely on simulations, our experiments have been performed using a fully-fledged prototype system built on top of the P-Grid P2P platform.
The paper is structured as follows. Section 2 reviews the characteristics of P2P networks in the context of full-text search, while Section 3 presents our novel key-based indexing strategy. Section 4 specifies the integrated architecture for P2P full-text search and defines a distributed algorithm for building the key index. Experimental results investigating indexing costs and retrieval performance are presented in Section 5. Section 6 briefly covers related work, and we conclude the paper in Section 7.
2 Unstructured vs. Structured P2P
There are two main categories of P2P systems: unstructured and structured. In unstructured systems peers broadcast search requests in the network, which works well when searching for popular, highly-replicated content. However, broadcast performs poorly when searching for rare items, as many messages are sent through the network. More advanced approaches restrict the number of query messages by using random walks [5] or special routing indexes, which maintain content models of neighboring peers in order to determine routing paths for a query [6]. The second class is structured P2P, also called structured overlay networks or distributed hash tables (DHTs) [7, 8, 9]. In structured P2P, each peer is responsible for a subset of identifiers id in a common identifier space. Multiple peers may be responsible for the same identifier space to achieve higher reliability. All peers use an overlay routing protocol to forward messages for which they are not responsible. To allow efficient routing, most DHTs maintain routing tables of size O(log N), where N is the number of peers in the network. Starting at any peer in the network, a message with any destination id can be routed in O(log N) overlay hops to the peer responsible for id. Structured P2P overlay networks therefore exhibit much lower bandwidth consumption for search compared to unstructured networks. However, they are limited to exact-match key search. Please refer to [10] for a comprehensive analysis of generic P2P properties.
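The identifier-space responsibility described above can be sketched as follows. This is a minimal toy model in the spirit of Chord-style overlays, not the implementation of any system cited here; the hashing scheme, ring size, and class names are our illustrative assumptions.

```python
import hashlib

def node_id(name, bits=16):
    """Hash a peer or key name onto a 2^bits identifier ring."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (1 << bits)

class ToyRing:
    """Minimal DHT sketch: each peer is responsible for all identifiers up to
    its own id, wrapping around the ring."""

    def __init__(self, peer_names, bits=16):
        self.bits = bits
        self.ids = sorted(node_id(p, bits) for p in peer_names)

    def responsible_peer(self, key):
        """Return the id of the peer responsible for `key`."""
        kid = node_id(key, self.bits)
        for nid in self.ids:
            if nid >= kid:
                return nid
        return self.ids[0]  # wrap around to the first peer on the ring
```

A real DHT reaches the responsible peer in O(log N) overlay hops using per-peer routing tables, instead of the global view this sketch assumes.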
There are two architectural concepts for designing P2P search engines in the area of information retrieval: a) local indexes in unstructured/hierarchical P2P networks, and b) a global index in structured P2P networks. The first strategy [6] divides documents over the peer network, and each peer maintains the index of its local document collection. Such indexes are in principle independent, and in unstructured networks a query is broadcast to all the peers, generating an enormous number of messages. To limit the query traffic, the query can be answered at two levels, the peer level and the document level: the first step locates a group of peers with potentially relevant document collections, while in the second step the query is submitted to these peers, which then return answers by querying their local indexes. The answers are subsequently merged to produce a single ranked hit list. The second strategy [11] distributes the global document index over a structured P2P network. Each peer is responsible for a part of the global vocabulary and the associated posting lists. A posting list consists of references to the documents that contain the associated index term. Queries are processed by retrieving the posting lists of the query terms from the P2P network. Our approach is based on the second strategy.
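As an illustration of the second strategy, the following sketch builds a term-partitioned global index and answers a conjunctive query by intersecting the retrieved posting lists. The data layout and function names are ours, for illustration only; in the P2P setting each peer would hold the entries for its part of the vocabulary.

```python
def build_global_index(docs):
    """Map each term to its posting list: the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def conjunctive_query(index, terms):
    """Retrieve the posting list of every query term and intersect them."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```

With a single-term index, every posting list involved in the intersection must cross the network, which is exactly the cost the key-based strategy of Section 3 avoids.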
3 Our Approach: Indexing Rare Keys
The key idea of our indexing strategy is to limit the posting list size of the global P2P index to a constant predefined value and to extend the index vocabulary to improve retrieval effectiveness. Fig. 1 compares our rare-key indexing strategy to the standard single-term indexing approach. We trade an increased index vocabulary for a limited posting list size. As posting lists are extremely large for a single-term index, the process of joining them at query time generates unacceptable network traffic, which makes this approach practically unfeasible. On the contrary, rare-key indexing offers highly-efficient query performance, as we limit the posting list size according to network characteristics and intersect posting lists at indexing time.
Fig. 1. The basic idea of indexing with rare keys: the naïve single-term approach yields a small vocabulary with long posting lists, whereas indexing with rare keys yields a large vocabulary with short posting lists
Let D be a global document collection and T its single-term vocabulary. A key k ∈ K consists of a set of terms {t1, t2, ..., ts}, ti ∈ T, appearing within the same document d ∈ D. The number of terms comprising a key is bounded, i.e. 1 ≤ s ≤ smax. The quality of a key k for a given document d with respect to indexing adequacy is determined by its discriminative power. To be discriminative, a key k must be as specific as possible with respect to d and the corresponding document collection D [12]. We categorize a key on the basis of its global document frequency (DF), and define a threshold DFmax that divides the set of keys K into two disjoint classes, the sets of rare and frequent keys. If a key k appears in more than DFmax documents, i.e. DF(k) > DFmax, the key is frequent and has low discriminative power. Otherwise, k is rare and specific with respect to the document collection.
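In code, the rare/frequent distinction reduces to a document-frequency count over term sets. The following brute-force sketch uses a toy threshold for illustration; the experiments in Section 5 use DFmax values of 250 and 500.

```python
def document_frequency(key, doc_term_sets):
    """DF(k): the number of documents in which all terms of the key co-occur."""
    key = set(key)
    return sum(1 for terms in doc_term_sets if key <= terms)

def is_rare(key, doc_term_sets, df_max):
    """A key is rare (hence discriminative) iff DF(k) <= DFmax."""
    return document_frequency(key, doc_term_sets) <= df_max
```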
Although the size of the key vocabulary is bounded for a bounded collection of limited-size documents, there are many term combinations that form potential rare keys, and special filtering methods are needed to reduce the key vocabulary to a practically manageable size. We currently use proximity and redundancy filters to produce the highly-discriminative keys (HDKs) indexed by our search engine. The proximity filter uses textual context to reduce the size of the rare key vocabulary: it retains keys built of terms appearing in the same textual context, a document window of predefined size w, because words appearing close together in documents are good candidates to appear together in a query. The analysis presented in [13] reports the importance of text passages, which are more responsive to particular user needs than the full document. The redundancy filter removes supersets of rare keys from the vocabulary, as such keys are redundant and only increase the vocabulary size without improving retrieval performance. Therefore, all proper term subsets of a retained rare key are frequent, and we call such keys intrinsically rare (i-rare) keys. Proximity filtering strongly depends on the window size and document characteristics. Although it seems intuitive that it would remove most keys, our experiments show the great importance of the redundancy filter, which removes many keys after proximity filtering (e.g. 83% of 2-term and 99% of 3-term keys). By applying both the proximity and redundancy filters to rare keys, we obtain a significantly smaller set of HDKs compared to the theoretical value, as reported in Section 5.
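The two filters can be sketched as follows. The window check is a brute-force simplification over per-document term positions, and all names are our illustrative assumptions.

```python
from itertools import combinations, product

def passes_proximity(key, positions, w):
    """Proximity filter: the key's terms co-occur within a window of size w
    at least once. `positions` maps each term to its positions in one document."""
    for combo in product(*(positions[t] for t in key)):
        if max(combo) - min(combo) < w:
            return True
    return False

def passes_redundancy(key, rare_keys):
    """Redundancy filter: keep a rare key only if every proper term subset is
    frequent, i.e. no proper subset is itself rare (the i-rare property)."""
    for size in range(1, len(key)):
        for sub in combinations(sorted(key), size):
            if frozenset(sub) in rare_keys:
                return False
    return True
```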
As our engine indexes keys, it is essential to map queries to keys for effective retrieval. We now discuss the problem of finding, given a query Q = {t1, t2, ..., tq}, ti ∈ T, the corresponding relevant keys in the HDK index. A perfect situation occurs when {t1, t2, ..., tq} is itself an HDK; in other words, the user has posed a discriminative query for the indexed document collection: the posting list is readily available and is simply retrieved from the global index. However, this may not be the case for all user queries. Therefore, we use terms and term sets from Q to form potential HDKs. We extract all the subsets of smax, (smax − 1), ..., 1 terms from the query Q, retrieve the posting lists associated with the corresponding keys, and provide the union of the retrieved posting lists as the answer to Q. In fact, we first check the smax-term combinations, and if all of them retrieve posting lists, we stop the procedure because there can be no (smax − 1)-term HDKs. For example, for a query Q = {t1, t2, t3} and smax = 2, the possible 2-term keys are {t1, t2}, {t1, t3}, and {t2, t3}. If we retrieve postings for {t1, t2} and {t1, t3}, there is no need to check whether {t1}, {t2}, or {t3} are indexed, because i-rare keys cannot be subsets of other i-rare keys. If we retrieve a posting only for {t1, t2}, we still need to check {t3}, as it may be an HDK. A similar query mapping principle has recently been proposed for structuring user queries into smaller maximal term sets [14].
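The mapping procedure above can be sketched as follows, modeling the global index as a dictionary from keys to posting lists. The names and the "covered terms" bookkeeping are our illustrative reading of the rule that i-rare keys cannot be subsets of other i-rare keys.

```python
from itertools import combinations

def map_query_to_hdks(query_terms, index, s_max):
    """Check term subsets from size s_max down to 1; skip any subset already
    covered by a found key, since i-rare keys cannot be subsets of other
    i-rare keys. Return the union of the retrieved posting lists."""
    found, covered = [], set()
    for s in range(min(s_max, len(query_terms)), 0, -1):
        for key in combinations(sorted(query_terms), s):
            if set(key) <= covered:
                continue  # a superset is indexed, so this subset cannot be i-rare
            if frozenset(key) in index:
                found.append(frozenset(key))
                covered |= set(key)
    result = set()
    for key in found:
        result |= index[key]
    return result
```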
However, users may still pose queries containing only frequent keys, or some query terms may not be covered by HDKs. A valid option is to notify the user that his/her query is non-discriminative with respect to the document collection, and to provide support for refining the query. We have also devised two other possible strategies to improve the retrieval performance in such cases: the first uses distributional semantics [15] to find terms semantically similar to the query terms, while the second indexes the k best documents for frequent keys, as the size of the frequent key vocabulary is less than 1% of the HDK vocabulary size. We leave further analysis of the two strategies for future work.
4 Architecture
We assume an environment comprising a set of M independent DLs hosting local document collections and willing to make a part of their collections searchable through a global distributed index. Each DL is a stand-alone component that can index and search its local document collection, and can therefore provide (a part of) its local single-term index as a contribution to the global index. A structured P2P network with N peers is available to share the global index and offer efficient search over the global collection composed of the documents contributed by the M DLs.
Fig. 2. An overview of the P2P architecture for digital libraries
The high-level architecture of our P2P search engine is presented in Fig. 2. DLs interact with peers to submit an index and to send queries to the engine. A peer can be regarded as an entry point to the distributed index, and the P2P network as a scalable and efficient medium for sharing information about DL content. The architecture is layered to enable a clean separation of the different concepts related to P2P networks, document and content modeling, and the applied retrieval model [16]. As the global index is key-based, the system is decomposed into the following four layers: 1) the transport layer (TCP/UDP), providing the means for host communication; 2) the P2P layer, building a distributed hash table and storing global index entries; 3) the HDK layer, building the key vocabulary and corresponding posting lists, and mapping queries to keys; and 4) the ranking layer, implementing distributed document ranking.
Each peer incorporates a local and a global system view. The HDK layer focuses on the local view and builds the key index from a received single-term index for a DL's local collection. The received single-term index must contain the positional information needed for key computation, and may provide the DL's relevance scores for (term, document) pairs. The P2P layer provides the global system view by maintaining the global key index with information about rare and frequent keys. Global index entries have the following structure: {k, DF(k), PeerList(k), Posting(k)}, where DF(k) is the key's global document frequency, PeerList(k) is the list of peers that have reported local document frequencies df(k), and Posting(k) is k's global posting list. Posting(k) is null in case k is frequent.
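A global index entry as described above could be modeled like this; the field types are our assumptions, and only the four components named in the text come from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class GlobalIndexEntry:
    """One entry {k, DF(k), PeerList(k), Posting(k)} of the global key index."""
    key: frozenset                                       # the key k (a term set)
    df: int = 0                                          # global DF(k)
    peer_list: List[str] = field(default_factory=list)   # peers reporting df(k)
    posting: Optional[Set[str]] = field(default_factory=set)  # None if frequent

    def mark_frequent(self):
        """A frequent key keeps its DF and peer list but no posting list."""
        self.posting = None
```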
4.1 Distributed Indexing
The indexing process is triggered when a DL inserts a single-term index or a document collection into the P2P search engine. Since the indexing process is computationally intensive, peers share the computational load and build the HDK vocabulary in parallel. Each peer creates HDKs from the received index, inserts local document frequencies for the HDKs it considers locally i-rare or frequent, and subsequently inserts posting lists for globally i-rare keys into the P2P overlay. The P2P layer stores posting lists for globally i-rare keys, maintains the global key vocabulary with global DFs, and notifies the HDK layer when i-rare keys become frequent due to the addition of new documents.
Algorithm 1 defines the process of computing HDKs locally by a peer Pi at its HDK layer. It proceeds in levels, computing single-term, 2-term, ..., smax-term keys. The peer stores the set of potentially i-rare keys in Kir, and the globally frequent keys in Kfreq. Note that a locally frequent key is also globally frequent, but a locally rare key may become globally frequent. The P2P overlay is aware when a key becomes frequent, and notifies the interested peers from PeerList(k).
The algorithm starts by inserting local document frequencies for the single-term vocabulary Ti and classifying terms as frequent or rare. Note that a peer is notified when its locally rare keys become globally frequent, which depends on the HDK computation performed by other peers. Next, Pi re-checks the single-term DFs and inserts posting lists for the rare ones into the P2P overlay. The approach is tolerant to erroneous insertions of posting lists for frequent keys: the P2P overlay disregards the received posting list, updates the global document frequency of the key, and notifies the peer that the key is frequent.
For determining multi-term i-rare keys, the algorithm uses term locations from the received single-term index. A potential term combination needs to appear within a predefined window; next the redundancy property is checked, and if a key passes both filters, it is an HDK candidate. Its global frequency is updated in the P2P overlay, but at this point the HDK layer updates its posting list only locally. The global posting list is updated subsequently, in case the key is not reported globally frequent by the P2P layer.
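The overlay-side bookkeeping sketched above, including the tolerance to erroneous posting-list insertions for frequent keys, might look like the following. All names and the return-value protocol are our illustrative assumptions, not the gLite or P-Grid API.

```python
class OverlaySketch:
    """P2P-layer bookkeeping: accumulate global DFs and drop posting lists
    for keys whose global DF exceeds DFmax."""

    def __init__(self, df_max):
        self.df_max = df_max
        self.df = {}        # key -> global document frequency
        self.posting = {}   # key -> global posting list (absent if frequent)

    def report_df(self, key, local_df):
        """A peer reports its local df(k); returns True while k is still rare."""
        self.df[key] = self.df.get(key, 0) + local_df
        if self.df[key] > self.df_max:
            self.posting.pop(key, None)   # key became frequent: drop its postings
            return False                  # the overlay notifies interested peers
        return True

    def insert_posting(self, key, docs):
        """Disregard postings for frequent keys instead of failing."""
        if self.df.get(key, 0) > self.df_max:
            return False                  # notification: the key is frequent
        self.posting.setdefault(key, set()).update(docs)
        return True
```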
4.2 Distributed Retrieval
The query and retrieval scenario involves all four architectural layers. A query is submitted through a peer's remote interface to the HDK layer, which maps query terms to HDKs as discussed in Section 3. The HDK layer retrieves the posting lists associated with the relevant HDKs from the global P2P index. The received posting lists are merged and submitted to the ranking layer. The ranking layer ranks documents, and must be designed to provide relevance scores with minimal network usage. There are a number of ranking techniques the proposed architecture can accommodate, but here we only sketch an approach using content-based ranking, since distributed ranking is outside the scope of this paper.
As the P2P index maintains global DFs for all frequent and rare terms, DFs for the vocabulary T are readily available in the index and may be retrieved to be used for ranking. Term frequencies are local, document-related values that are also used for computing content-based relevance scores. As DLs provide either a single-term index or the original documents when initiating the indexing procedure, the indexing peer can use them to extract/compute document-related term statistics. Consequently, we can rank an answer set using a relevance ranking scheme that relies on global document frequencies and term frequencies, without knowing the total global collection size, as this parameter is typically used only to normalize the scores.
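One way to realize such a scheme is to compute TF-IDF-style scores from the locally known term frequencies and the globally retrieved DFs, substituting a fixed, agreed-upon constant for the unknown collection size. This is a sketch under that assumption, not the paper's exact formula.

```python
import math

# Fixed constant shared by all peers, replacing the unknown global collection size.
N_ESTIMATE = 1_000_000

def score(term_freqs, query_terms, global_df):
    """TF-IDF-style relevance score for one document.
    term_freqs: term -> tf in the document; global_df: term -> DF from the index."""
    s = 0.0
    for t in query_terms:
        tf = term_freqs.get(t, 0)
        df = global_df.get(t, 0)
        if tf and df:
            s += tf * math.log(N_ESTIMATE / df)
    return s
```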
Algorithm 1. Computing HDKs at peer Pi

for s = 1 to smax do
  Kir(s) ← ∅; Kfreq(s) ← ∅
  if s = 1 then
    /* process single-term keys */
    for all tk ∈ Ti do
      if df(tk) ≤ DFmax then
        Kir(s) ← Kir(s) ∪ {tk}
      else
        Kfreq(s) ← Kfreq(s) ∪ {tk}
      end if
    end for
  else
    /* generate new keys from frequent keys */
    for all key = (tk1, ..., tks−1) ∈ Kfreq(s − 1) do
      /* process each document in the key posting list to create a set of
         potential term combinations */
      for all dj ∈ localPostingList(key) do
        for all tks ∈ windowOf(key) do
          newKey ← concat(key, tks)
          if checkRedundancy(newKey) then
            Kir(s) ← Kir(s) ∪ {newKey}
            updateLocalPostingList(newKey, dj)
          end if
        end for
      end for
    end for
  end if
  /* update global key frequencies and insert posting lists for i-rare keys */
  for all key ∈ (Kir(s) ∪ Kfreq(s)) do
    if DF(key) > DFmax then
      /* key is globally frequent */
      Kir(s) ← Kir(s) \ {key}
      Kfreq(s) ← Kfreq(s) ∪ {key}
    end if
  end for
end for
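For illustration, the following is a simplified, single-collection version of the level-wise construction in Algorithm 1. It omits proximity filtering and equates global DF with local df, so it shows the structure of the algorithm rather than the paper's distributed implementation.

```python
from itertools import combinations

def compute_hdks(docs, df_max, s_max):
    """Level-wise computation of i-rare keys: at level s, extend frequent
    (s-1)-term keys with co-occurring terms, apply the redundancy filter,
    then split the candidates into rare and frequent by their DF."""
    doc_sets = [frozenset(d) for d in docs]

    def df(key):
        return sum(1 for d in doc_sets if key <= d)

    rare, prev_freq = set(), set()
    for s in range(1, s_max + 1):
        if s == 1:
            candidates = {frozenset([t]) for d in doc_sets for t in d}
        else:
            candidates = set()
            for key in prev_freq:
                for d in doc_sets:
                    if key <= d:
                        for t in d - key:
                            new_key = key | {t}
                            # redundancy filter: no proper subset may be rare
                            if all(frozenset(sub) not in rare
                                   for r in range(1, s)
                                   for sub in combinations(sorted(new_key), r)):
                                candidates.add(new_key)
        level_freq = {k for k in candidates if df(k) > df_max}
        rare |= candidates - level_freq
        prev_freq = level_freq
    return rare
```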
5 Experimental Evaluation
Experimental setup. The experiments were carried out using a subset of news articles from the Reuters corpus. The documents in our test collection contain between 70 and 3000 words, while the average number of terms in a document is 170 and the average number of unique terms is 102. To simulate the evolution of a P2P system, i.e. peers joining the network and increasing the document collection, we started the experiment with 2 peers and added 2 more peers at each new experimental run. Each peer contributes 5000 documents to the global collection and computes HDKs for its local documents. Therefore, the initial global document collection for 2 peers is 10,000 documents, and it is augmented by 10,000 new documents at each experimental run. The maximum number of peers is 16, hosting in total a global collection of 80,000 documents. The experiments were performed on our campus intranet. Each peer runs on a Red Hat Linux PC with 1 GB of main memory, connected by 100 Mbit Ethernet. The prototype system is implemented in Java.
Performance analysis. The experiments investigate the number of keys generated by our HDK algorithm and the resulting average posting list size maintained by the P2P network. All documents were pre-processed: first we removed 250 common English stop words and applied the Porter stemmer, and then we removed 100 extremely frequent terms (e.g. the term ‘reuters’ appears in all the news articles). DFmax is set to 250 and 500, smax is 3, and w = 20 for the proximity filter.
Fig. 3. Average HDK vocabulary per peer
Fig. 4. Average posting list size
Figure 3 shows the total number of HDKs stored per peer for DFmax = 250 and DFmax = 500. As expected, an increased value of DFmax results in a decreased key vocabulary. Both experimentally obtained result sequences exhibit logarithmic growth and are expected to converge to a constant value, because the number of generated term combinations is limited by the proximity window and the total key vocabulary size grows linearly with the global collection size for large collections. The number of keys is quite large compared to the single-term vocabulary, but we expect to benefit from the query performance.
Figure 4 shows the average posting list size for HDK and single-term indexing. As the average posting list size for HDK indexing remains constant, the expected bandwidth consumption is significantly smaller than for the single-term index, which exhibits a linear increase.
For the retrieval performance evaluation, we created a total of 200 queries by randomly choosing 2 to 3 terms from the news titles. Because of the lack of relevance judgments for our query set, we compared the retrieval performance to a centralized baseline, the Terrier search engine (http://ir.dcs.gla.ac.uk/terrier/), by indexing the collection using both single-term and HDK indexing with different DFmax values (200, 250, 500). Then for each query we compared the top 20 documents retrieved by our prototype and by the baseline; both hit lists were ranked using TF-IDF. We are interested in the high-end ranking, as typical users are often interested only in the top 20 results. Two metrics are used to compare the result sets: the first is the overlap between our system and the centralized baseline, and the second is the average number of posting lists transmitted during retrieval.
Table 1. Retrieval quality of HDK indexing compared to the centralized TF-IDF system

Approach               | Overlap ratio on top-20 | Transmitted postings
single-term (TF-IDF)   | 100 %                   |
HDK, DFmax = 500       |                         | 232.925 (7.63%)
HDK, DFmax = 250       |                         | 96.91 (3.17%)
HDK, DFmax = 200       |                         | 75.37 (2.47%)
Table 1 presents our findings related to retrieval performance for the collection of 30,000 documents over 6 peers. The results show an extreme reduction in the average number of transmitted postings per query for HDK compared to a naïve P2P approach with single-term indexing, which compensates for the increased indexing costs. The results also show acceptable retrieval performance of the HDK approach. As expected, the retrieval performance is better for larger DFmax, as we get closer to single-term indexing, but the average number of transmitted postings also increases, although it remains significantly smaller than in the single-term case.
6 Related Work
Full-text P2P search is investigated in two overlapping domains: DLs and the Web. There is an ongoing debate on the feasibility of P2P Web search for scalability reasons. In [3] it is shown that the naïve use of unstructured or structured overlay networks is practically infeasible for the Web, since the traffic generated for indexing and search exceeds the available Internet capacity. Thus different schemes have been devised to make P2P Web search feasible. Several approaches target a term-to-peer indexing strategy, where the units of indexing are peers rather than individual documents: PlanetP [17] gossips compressed information about peers' collections in an unstructured P2P network, while MINERVA [18] maintains a global index with peer collection statistics in a structured P2P overlay to facilitate the peer selection process.
As DLs represent only a small fraction of the entire Web space, the feasibility of full-text P2P search across DL collections is not in question. Hierarchical solutions have been investigated for federated search, where a backbone P2P network maintains a directory service to route queries to peers with relevant content [6, 1]. A recently proposed solution uses collection-wide statistics to update routing indexes dynamically at
query time, and reports low traffic overheads for Zipf-distributed queries after the initial ‘learning phase’ [19]. These solutions are orthogonal to our approach, since they are designed for unstructured P2P networks with low-cost indexing schemes, where the processing and most of the network traffic are generated during the query phase. Our technique is costly in terms of indexing; however, it offers highly efficient and responsive
querying performance. It is comparable to solutions for distributed top-k retrieval that
aim at minimizing query costs by transmitting a limited number of postings [19, 20].
However, the major difference is our novel indexing strategy. The HDK approach is
not the only indexing strategy that uses term sets as indexing features. The set-based
model [21] indexes term sets occurring in queries, and exploits term correlations to reduce the number of indexed term sets. The authors report significant gains in terms of
retrieval precision and average query processing time, while the increased index processing time is acceptable. In contrast to our indexing scheme, the set-based model has
been used to index frequent term sets and is designed for a centralized setting.
7 Conclusion
We have presented a P2P architecture for information retrieval across digital library
collections. It relies on a novel indexing strategy that indexes rare terms and term sets
to limit the bandwidth consumption during querying and enable scalable and highly efficient search performance. As a proof of concept, we have implemented a prototype
system following the presented architectural design, and performed experiments to investigate query performance and indexing costs. Our experiments have demonstrated
significant benefits of the HDK approach in terms of reduced networking costs and the
feasibility of the proposed indexing strategy for P2P environments. Our future work will
further investigate techniques for reducing the cost of the proposed indexing strategy,
e.g., by using query statistics, or query-driven indexing. We will perform experiments
with larger and more varied document collections and an increased peer network size, to confirm the existing positive results related to both networking costs and retrieval performance.

References
1. Lu, J., Callan, J.: Federated search of text-based digital libraries in hierarchical peer-to-peer
networks. In: Advances in Information Retrieval, 27th European Conference on IR Research
(ECIR). (2005) 52–66
2. Balke, W.T., Nejdl, W., Siberski, W., Thaden, U.: DL Meets P2P - Distributed Document
Retrieval Based on Classification and Content. In: 9th European Conference on Research
and Advanced Technology for Digital Libraries (ECDL). (2005) 379–390
3. Li, J., Loo, B., Hellerstein, J., Kaashoek, F., Karger, D., Morris, R.: The feasibility of peer-to-peer web indexing and search. In: Peer-to-Peer Systems II: 2nd International Workshop on Peer-to-Peer Systems (IPTPS). (2003) 207–215
4. Buntine, W., Aberer, K., Podnar, I., Rajman, M.: Opportunities from open source search.
In: Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence.
(2005) 2–8
5. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in unstructured peer-to-peer networks. In: 16th International Conference on Supercomputing. (2002) 84–95
6. Lu, J., Callan, J.: Content-based retrieval in hybrid peer-to-peer networks. In: Proceedings
of the 12th International Conference on Information and Knowledge Management (CIKM).
(2003) 199–206
7. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A scalable content-addressable network. In: SIGCOMM ’01. (2001) 161–172
8. Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H.: Chord: A scalable peer-to-peer lookup service for internet applications. In: SIGCOMM ’01. (2001) 149–160
9. Aberer, K.: P-Grid: A self-organizing access structure for P2P information systems. In:
CooplS ’01: Proceedings of the 9th International Conference on Cooperative Information
Systems. (2001) 179–194
10. Aberer, K., Alima, L.O., Ghodsi, A., Girdzijauskas, S., Haridi, S., Hauswirth, M.: The
Essence of P2P: A Reference Architecture for Overlay Networks. In: Fifth IEEE International Conference on Peer-to-Peer Computing. (2005) 11–20
11. Reynolds, P., Vahdat, A.: Efficient Peer-to-Peer Keyword Searching. In: Middleware 2003. (2003)
12. Salton, G., Yang, C.: On the specification of term values in automatic indexing. Journal of Documentation 29 (1973) 351–372
13. Salton, G., Allan, J., Buckley, C.: Approaches to Passage Retrieval in Full Text Information
Systems. In: SIGIR’93. (1993) 49–58
14. Pôssas, B., Ziviani, N., Ribeiro-Neto, B., Meira Jr., W.: Maximal termsets as a query structuring mechanism. In: CIKM ’05. (2005) 287–288
15. Rajman, M., Bonnet, A.: Corpora-Based Linguistics: New Tools for Natural Language Processing. In: 1st Annual Conference of the Association for Global Strategic Information. (1992)
16. Aberer, K., Klemm, F., Rajman, M., Wu, J.: An Architecture for Peer-to-Peer Information
Retrieval. In: SIGIR’04, Workshop on Peer-to-Peer Information Retrieval. (2004)
17. Cuenca-Acuna, F.M., Peery, C., Martin, R.P., Nguyen, T.D.: PlanetP: Using Gossiping to
Build Content Addressable Peer-to-Peer Information Sharing Communities. In: 12th IEEE
International Symposium on High Performance Distributed Computing (HPDC-12), IEEE
Press (2003) 236–246
18. Bender, M., Michel, S., Triantafillou, P., Weikum, G., Zimmer, C.: Improving collection selection with overlap awareness in P2P search engines. In: SIGIR ’05: Proceedings of the 28th
annual international ACM SIGIR conference on Research and development in information
retrieval. (2005) 67–74
19. Balke, W., Nejdl, W., Siberski, W., Thaden, U.: Progressive distributed top-k retrieval in peer-to-peer networks. In: Proceedings of the 21st International Conference on Data Engineering (ICDE 2005). (2005) 174–185
20. Michel, S., Triantafillou, P., Weikum, G.: KLEE: a framework for distributed top-k query
algorithms. In: VLDB ’05. (2005) 637–648
21. Pôssas, B., Ziviani, N., Meira Jr., W., Ribeiro-Neto, B.: Set-based vector model: An efficient approach for correlation-based ranking. ACM Trans. Inf. Syst. 23 (2005) 397–429
Scalable Semantic Overlay Generation for P2P-Based
Digital Libraries
Christos Doulkeridis1, Kjetil Nørvåg2, and Michalis Vazirgiannis1
1 Dept. of Informatics, AUEB, Athens, Greece
{cdoulk, mvazirg}@aueb.gr
2 Dept. of Computer Science, NTNU, Trondheim, Norway
[email protected]
Abstract. The advent of digital libraries along with the tremendous growth of
digital content call for distributed and scalable approaches for managing vast data
collections. Peer-to-peer (P2P) networks emerge as a promising solution to deal with these challenges. However, the lack of global content/topology knowledge in an unstructured P2P system demands unsupervised methods for content organization and necessitates efficient and high-quality search mechanisms. Towards
this end, Semantic Overlay Networks (SONs) have been proposed in the literature, and in this paper, an unsupervised method for decentralized and distributed
generation of SONs, called DESENT, is proposed. We prove the feasibility of our
approach through analytical cost models, and we show through simulations that, compared to flooding, our approach improves recall by a factor of 3 to 10, depending on the network topology.
1 Introduction
The advent of digital libraries along with the tremendous growth of digital content call
for distributed and scalable approaches for managing vast data collections. Future digital libraries will enable citizens to access knowledge anytime and anywhere, in a friendly, multi-modal, efficient and effective way. Reaching this vision requires the development of
new approaches that will significantly reform the current form of digital libraries. Key
issues in this process are [9]: the system architecture and the information access means.
With respect to system architecture, peer-to-peer (P2P) is identified as a topic of primary interest, as P2P architectures allow for loosely-coupled integration of information
services and sharing of information/knowledge [1,6,11].
In this paper, we present a scalable approach to P2P document sharing and retrieval.
Because scalability and support for semantics can be difficult in structured P2P systems
based on DHTs, we instead base our approach on unstructured P2P networks. Such systems, in their basic form, suffer from very high search costs, in terms of both consumed bandwidth and latency, so in order to be useful for real applications, more sophisticated search mechanisms are required. We solve this problem by employing semantic
overlay networks (SONs) [5], where peers containing related information are connected
together in separate overlay networks. If SONs have been created, queries can be forwarded to only those peers containing documents that satisfy the constraints of the
query context, for example based on topic, user profiles or features extracted from previous queries.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 26–38, 2006.
© Springer-Verlag Berlin Heidelberg 2006
One of the problems of SONs is the actual construction of these overlays, because in
a P2P context there is a lack of knowledge of both global content and network topology.
In a P2P architecture, each peer is initially aware only of its neighbors and their content.
Thus, finding other peers with similar contents, in order to form a SON, becomes a tedious
problem. This contrasts to a centralized approach, where all content is accessible to a
central authority, and clustering becomes a trivial problem, in the sense that only the
clustering algorithm (and its input parameter values) determines the quality of the results.
The contribution of this paper is a distributed and decentralized method for hierarchical SON construction (DESENT) that provides an efficient mechanism for search in
unstructured P2P networks. Our strategy for creating SONs is based on clustering peers
based on their content similarity. This is achieved by a recursive process that starts on
the individual peers. Through applying a clustering algorithm on the documents stored
at the peer, one or more feature vectors are created for each peer, essentially one for
each topic a peer covers. Then representative peers, each responsible for a number of
peers in a zone are selected. These peers, henceforth called initiators, will collect the
feature vectors from the members of the zone and use these as basis for the next level of
clustering. This process is applied recursively, until we have a number of feature vectors
covering all available documents.
The organization of the rest of this paper is as follows. In Section 2, we give an
overview of related work. In Section 3, we present our method for creating SONs that
can be used in the search process (Section 4). In Section 5, we use analytical cost
models to study the cost and the time required for overlay creation, while, in Section 6,
we present the simulation results. Finally, in Section 7, we conclude the paper.
2 Related Work
Several techniques have been proposed that can improve search in unstructured P2P
systems [2,8], including techniques for improved routing that give a direction towards
the requested document, like routing indices [4], and connectivity-based clustering that
creates topological clusters that can be used as starting points for flooding [12]. An approach that alleviates some of the problems of Gnutella-like systems [2] is to use a super-peer architecture [15], which can also be used to realize a hierarchical summary index, as described in [13].
The concept of semantic overlay networks (SONs) [5] is about directing searches
only to a specific subset of peers with content relevant to the query. The advantage of
this approach is that it reduces the flooding cost in the case of unstructured systems.
Crespo and Garcia-Molina [5] essentially base their approach on partly pre-classified documents that consist only of information about the song contained in a particular file. Moreover, they do not provide any search algorithm other than flooding. In order to be useful in a large system, unsupervised and decentralized creation of SONs is necessary, as well as efficient routing to the appropriate SON(s). The DESENT approach described in our paper solves these issues.
Although several papers describe how to use SON-like structures for P2P content
search [3,10], little work exists on the issue of how to actually create SONs in an
unsupervised, decentralized and distributed way in unstructured networks. Distributed
clustering in itself is considered a challenge demanding efficient and effective solutions. In [14], a P2P architecture where nodes are logically organized into a fixed
number of clusters is presented. The main focus of the paper is fairness with respect to
the load of individual nodes. In contrast to our approach, the allocation of documents to
clusters is done by classification, it is not unsupervised, and clusters are not hierarchical. We believe that current research in P2P digital libraries [1,6,11] can benefit from
the merits of our approach.
3 Overlay Network Creation
In this section, we describe SON generation, assuming peers that store digital content and are connected in an unstructured P2P network. Each peer represents a digital library node, and in this paper we focus on peers that store documents, though other
data representations can also be supported. The approach is based on creating local
zones of peers, forming semantic clusters based on data stored on these peers, and then
merging zones and clusters recursively until global zones and clusters are obtained.
3.1 Decentralized and Distributed Cluster Creation
The peer clustering process is divided into 5 phases: 1) local clustering, 2) zone initiator
selection, 3) zone creation, 4) intra-zone clustering, and 5) inter-zone clustering.
Phase 1: Local Clustering. In the process of determining sites that contain related
documents, feature vectors are used instead of the actual documents because of the
large amounts of data involved. A feature vector Fi is a vector of tuples, each tuple containing a feature (word) fi and a weight wi. The feature vectors are created using a feature extraction process (more on this in Section 6). By
performing clustering of the document collection at each site, a set of document clusters
is created, each cluster represented by a feature vector.
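As an illustration of Phase 1, the following is a minimal sketch of building one feature vector for a cluster of documents, assuming simple term-frequency weights and a fixed top-k cutoff (the paper's actual feature extraction is described in Section 6; the function name and weighting are our assumptions):

```python
from collections import Counter

def feature_vector(docs, k=5):
    """Build one feature vector (a list of (feature, weight) tuples)
    for a cluster of documents, keeping the k highest-weighted terms."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts.values())
    # Normalize term counts to weights and keep the top-k features
    return sorted(((term, n / total) for term, n in counts.items()),
                  key=lambda tw: -tw[1])[:k]
```

In the full system, one such vector would be produced per document cluster found at the peer, so a peer covering several topics exports several vectors.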
Phase 2: Initiator Selection. In order to be able to create zones, a subset of the peers
have to be designated the role of zone initiators that can perform the zone creation
process and subsequently initiate and control the clustering process within the zone.
The process of choosing initiators is completely distributed and ideally would be performed at all peers concurrently, in order to have approximately SZ peers in each zone. (In order to avoid some initiators being overloaded, the aim is to have zone sizes as uniform as possible. Note that although a uniform zone size and having the initiator in the center of the zone are desirable for load-balancing reasons, this is not crucial for the correctness or quality of the overlay construction.) However, this concurrency is not necessary, since the use of zone partitioning in the next phase eliminates the danger of excessive zone sizes.
Assuming the IP address of a peer Pi is IP_Pi and the time is T, rounded to the nearest ta (each peer is assumed to have a clock that is accurate within a certain amount of time ta; note that DESENT itself can be used to improve the accuracy), a peer will discover that it is an initiator if (IP_Pi + T) mod SZ = 0. The aim of the function is to select initiators that are uniformly spread out in the network and an appropriate
number of initiators relative to the total number of peers in the network. By including time in the function, we ensure that we obtain different initiators each time the clustering algorithm is run. This tackles the problem of being stuck with faulty initiators, as well as reduces the problem of permanent cheaters.

Fig. 1. Step-wise zone creation given the three initiators A, B, and C
If no initiator is selected by the above strategy, this will be discovered from the fact that the subsequent zone creation phase is not started within a given time (i.e., no message is received from an initiator). In this case, a universal decrease of the modulo parameter is performed, by dividing it by an appropriate prime number as many times as necessary, in order to increase the chance of selecting (at least) one peer at the next attempt.
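The selection rule can be sketched as follows (a hypothetical helper: treating the IP address as an integer and rounding the clock to the nearest ta are assumptions consistent with the text):

```python
def is_initiator(ip: int, clock: float, sz: int, t_a: int = 60) -> bool:
    """A peer elects itself initiator when (IP + T) mod SZ == 0,
    with T the local clock rounded to the nearest t_a seconds."""
    t = round(clock / t_a) * t_a
    return (ip + t) % sz == 0

def relaxed_modulus(sz: int, prime: int = 2) -> int:
    """Universal decrease of the modulo parameter when no initiator
    announced itself within the timeout."""
    return max(1, sz // prime)
```

Because all peers round to the same synchronization points, they evaluate the same predicate, and roughly one peer in SZ passes it.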
Phase 3: Zone Creation. After a peer Pi has discovered that it is an initiator, it uses
a probe-based technique to create its zone. An example of zone creation is illustrated
in Fig. 1. This zone creation algorithm has a low cost with respect to the number of messages (see
Section 5), and in the case of excessive zone sizes, the initiator can decide to partition
its zone, thus sharing its load with other peers. When this algorithm terminates, 1) each
initiator has assembled a set of peers Zi and their capabilities, in terms of resources they
possess, 2) each peer knows the initiator responsible for its zone and 3) each initiator
knows the identities of its neighboring initiators. An interesting characteristic of this
algorithm is that it ensures that all peers in the network will be contacted, as long as
they are connected to the network. This is essential; otherwise, there may exist peers whose content would never be retrieved. We refer to the extended version of this paper [7] for more details on initiator selection and zone creation.
Phase 4: Intra-zone Clustering. After the zones and their initiators have been determined, global clustering starts by collecting feature vectors from the peers (one feature
vector for each cluster on a peer) and creating clusters based on these feature vectors:
1. The initiator of each zone i sends probe messages FVecProbe to all peers in Zi .
2. When a peer Pi receives a FVecProbe it sends its set of feature vectors {F } to the
initiator of the zone.
3. The initiator performs clustering on the received feature vectors. The result is a set
of clusters represented by a new set of feature vectors {Fi }, where an Fi consists
of the top-k features of cluster Ci . Note that a peer can belong to more than one
cluster. In order to limit the computations that have to be performed in later stages
at other peers, when clusters from more than one peer have to be considered, the
clustering should result in at most NC0 such basic clusters (NC0 is controlled by
the clustering algorithm). The result of this process is illustrated in the left part of
Fig. 2.
Fig. 2. Left: Possible result of intra-zone clustering of zone A, resulting in the four clusters C0, C1, C2, and C3. Right: Hierarchy of zones and initiators
4. The initiator selects a representative peer Ri for each cluster, based on resource
information that is provided during Phase 3, like peer bandwidth, connectivity, etc.
One of the purposes of a representative peer is to represent a cluster at search time.
5. The result kept at the initiator is a set of cluster descriptions (CDs), one for each
cluster Ci . A CD consists of the cluster identifier Ci , a feature vector Fi , the set
of peers {P } belonging to the cluster, and the representative R of the cluster, i.e.,
CDi = (Ci , Fi , {P }, R). For example, the CD of cluster C2 in Fig. 2 (assuming
A7 is the cluster representative) would be CD2 = (C2 , F2 , {A5 , A7 , A8 , A9 }, A7 ).
6. Each of the representative peers is informed by the initiator about the assignment and receives a copy of the CDs (of all clusters in the zone). The representatives then inform peers of their cluster membership by sending them messages of the type (Ci, Fi, R).
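The cluster descriptions of step 5 map naturally onto a simple record type. A minimal sketch, where the field types are our assumptions and the example reproduces the CD of cluster C2 from the text:

```python
from dataclasses import dataclass

@dataclass
class ClusterDescription:
    """CD_i = (C_i, F_i, {P}, R) from step 5 of intra-zone clustering."""
    cluster_id: str        # C_i
    features: list         # F_i, top-k (feature, weight) tuples
    peers: set             # {P}, peers belonging to the cluster
    representative: str    # R, the representative peer

# The paper's example: cluster C2 with representative A7
cd2 = ClusterDescription("C2", [("term", 0.5)],
                         {"A5", "A7", "A8", "A9"}, "A7")
```

The feature list shown is a placeholder; in the system it would hold the top-k features of the cluster produced in step 3.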
Phase 5: Inter-zone Clustering. At this point, each initiator has identified the clusters
in its zone. These clusters can be employed to reduce the cost and increase the quality
of answers to queries involving the peers in one zone. However, in many cases peers in
other zones will be able to provide more relevant responses to queries. Thus, we need
to create an overlay that can help in routing queries to clusters in remote zones. In order
to achieve this, we recursively apply merging of zones to larger and larger super-zones,
and at the same time merge clusters that are sufficiently similar into super-clusters: first
a set of neighboring zones are combined to a super-zone, then neighboring super-zones
are combined to a larger super-zone, etc. The result is illustrated in the right part of
Fig. 2 as a hierarchy of zones and initiators. Note that level-i initiators are a subset of
the level-(i − 1) initiators.
This creation of the inter-zone cluster overlay is performed as follows:
1. From the previous level of zone creation, each initiator maintains knowledge about
its neighboring zones (and their initiators). Thus, the zones essentially form a zone-to-zone network resembling the P2P network that was the starting point.
2. A level-i zone should consist of a number of neighboring level-(i − 1) zones, on average |SZ| in each (where SZ denotes a set of zones, and |SZ| the number of zones in the set). This implies that 1/|SZ| of the level-(i − 1) initiators should be level-i initiators. This is achieved by using the same technique for initiator selection
as described in Phase 2, except that in this case only peers already chosen to be
initiators at level-(i − 1) in the previous phase are eligible for this role.
3. The level-i initiators create super-zones using the algorithm of Phase 3. In the same
way, these level-i initiators will become aware of their neighboring super-zones.
4. In a similar way to how feature vectors were collected during the basic clustering,
the approximately NC |SZ| CDs created at the previous level are collected by the
level-i initiator (where NC denotes the number of clusters per initiator at the previous level). Clustering is performed again and a set of super-clusters is generated.
Each of the newly formed super-clusters is represented by top-k features produced
by merging the top-k feature vectors of the individual clusters. The result of cluster
merging is a set of super-clusters. A peer inside the super-cluster (not necessarily
one of the representatives of the cluster) is chosen as representative for the super-cluster. The result is a new set of CDs, CDi = (Ci, Fi, {P}, R), where the set of peers {P} contains the representatives of the clusters forming the base of the new super-cluster.
5. The CDs are communicated to the appropriate representatives. The representatives
of the merged clusters (the peers in {P } in the new CDs) are informed about the
merging by the super-cluster representative, so that all cluster representatives know
about both their representatives below as well as the representative above in the
hierarchy. Note that although the same information could be obtained by traversing
the initiator/super-initiator hierarchy, the use of cluster representatives distributes
the load more evenly and facilitates efficient searching.
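Step 4's merging of top-k feature vectors into a super-cluster vector might look as follows; summing weights before re-ranking is our assumption, as the text does not fix the exact merge rule:

```python
from collections import defaultdict

def merge_feature_vectors(vectors, k=5):
    """Merge the top-k feature vectors of several clusters into one
    top-k vector representing the super-cluster."""
    merged = defaultdict(float)
    for vec in vectors:
        for feature, weight in vec:
            merged[feature] += weight
    # Keep only the k highest-weighted features
    return sorted(merged.items(), key=lambda t: -t[1])[:k]
```

Features shared by many member clusters accumulate weight, so the super-cluster vector emphasizes the topics its sub-clusters have in common.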
This algorithm terminates when only one initiator is left, i.e., when an initiator has no
neighbors. Unlike the initiators at the previous levels that performed clustering operations, the only purpose of the final initiator is to decide the level of the final hierarchy.
The aim is to have at the top level a number of initiators that is large enough to provide
load-balancing and resilience to failures, but at the same time low enough to keep the
cost of exchanging clustering information between them during the overlay creation to
a manageable level. Note that there can be one or more levels below the top-level initiator that have too few peers. The top-level peer probes level-wise down the tree in order to find the number of peers at each level, until it reaches a level j with an appropriate number minF of peers. The level-j initiators are then informed about the decision and they are
given the identifiers of the other initiators at that level, in order to send their CDs to
them. Finally, all level-j initiators have knowledge about the clusters in zones covered
by the other level-j initiators.
3.2 Final Organization
To summarize, the result of the zone- and cluster-creation process is two hierarchies:
Hierarchy of peers: Starting with individual peers at the bottom level, zones are formed around the initiating peers, which act as zone controllers. Neighboring zones recursively form super-zones (see the right part of Fig. 2), finally ending up at a level where the tops of the hierarchies have replicated the cluster information of the other initiators at that level. This is a forest of trees. The peers maintain the following information about the rest of the overlay network: 1) Each peer knows its initiator. 2) A level-1 initiator knows
the peers in its zone as well as the level-2 initiator of the super-zone it is covered by.
3) A level-i initiator (for i > 1) knows the identifiers of the level-(i − 1) initiators of the
zones that constitute the super-zone as well as the level-(i+1) initiator of the super-zone
it is covered by. 4) Each initiator knows all cluster representatives in its zone.
Hierarchy of clusters: Each peer is a member of one or more clusters at the bottom
level. Each cluster has one of its peers as representative. One or more clusters constitute
a super-cluster, which again recursively form new super-clusters. At the top level a
number of global clusters exist. The peers store the following information about the
cluster hierarchy: 1) Each peer knows the cluster(s) it is part of, and the representative
peers of these clusters. 2) A representative also knows the identifiers of the peers in its
cluster, as well as the identifier of the representative of the super cluster it belongs to.
3) A representative for a super-cluster knows the identifier of the representative at the
level above as well as the representatives of the level below.
3.3 Peer Join
A peer PJ that joins the network first establishes a connection to one or more peers as part of the basic P2P bootstrapping protocol. These neighbors provide PJ with their
zone initiators. Through one of these zone initiators, PJ is able to reach one of the
top-level nodes in the zone hierarchy and through a search downwards find the most
appropriate lowest-level cluster, which PJ will then subsequently join. Note that no reclustering is performed, so after a while a cluster description might no longer be accurate; this cannot be enforced in any way in a large-scale, dynamic peer-to-peer system, given the lack of total knowledge. However, the global clustering process is performed at regular intervals and will then create a new clustering that also reflects the contents of new nodes (as well as new documents that have changed the individual peers' feature vectors). This strategy considerably reduces the maintenance cost, in
terms of communication bandwidth compared with incremental reclustering, and also
avoids the significant cost of continuous reclustering.
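The join-time descent to the most appropriate lowest-level cluster can be sketched as a greedy walk down the cluster hierarchy; the dictionary-based cluster shape and the term-overlap similarity are illustrative assumptions:

```python
def find_cluster(peer_features, cluster):
    """Greedy descent: at each level, follow the child cluster whose
    feature set overlaps most with the joining peer's features."""
    while cluster["children"]:
        cluster = max(cluster["children"],
                      key=lambda c: len(peer_features & c["features"]))
    return cluster

# Hypothetical two-level hierarchy
leaf_a = {"features": {"p2p", "overlay"}, "children": [], "peers": {"A1"}}
leaf_b = {"features": {"music", "audio"}, "children": [], "peers": {"B1"}}
root = {"features": {"p2p", "music"},
        "children": [leaf_a, leaf_b], "peers": set()}
```

The joining peer then registers with the representative of the returned cluster; no reclustering is triggered, matching the lazy maintenance described above.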
4 Searching
In this section we provide an overview of query processing in DESENT. A query Q
in the network originates from one of the peers P , and it is continually expanded until
satisfactory results, in terms of number and quality, have been generated. All results that
are found as the query is forwarded are returned to P . Query processing can terminate
at any of the steps below if the result is satisfactory:
1. The query is evaluated locally on the originating peer P .
2. A peer is a member of one or more clusters Ci . The Ci which has the highest similarity sim(Q, Ci ) with the query is chosen, and the query is sent to and evaluated
by the other peers in this cluster.
3. Q is sent to one of the top-level initiators (remember that each of the top-level
initiators knows about all the top-level clusters). At this point we employ two alternatives for searching:
Table 1. Parameters and default values used in the cost models

Parameter                              Default value
Total # of peers in the network        NP
Avg. zone size                         SZ
# of peers/zones at level i            (SZ)^i
Avg. # of neighbors at level 0         SZ/4
Avg. # of neighbors at level i         > SZ/4
# of initiator levels                  log_SZ NP
Min. # of trees in top-level forest    minF
# of trees in top-level forest
# of clusters per peer
# of clusters per level-i initiator
Max zone radius
Size of a CD                           ≈ 1.5 SF
Size of feature vector (SF)            200 bytes
Size of packet overhead                60 bytes
Minimum bandwidth available            1 KB/s
Time between synch. points             60 seconds
(a) The most appropriate top-level cluster is determined based on a similarity measure, and Q is forwarded to the representative of that cluster. Next, Q is routed
down the cluster hierarchy until the query is actually executed at the peers in
a lowest-level cluster. The path is chosen based on highest sim(Q, Ci ) of the
actual sub-clusters of a level-i cluster. If the number of results is insufficient,
then backtracking is performed in order to extend the query to more clusters.
(b) All top-level clusters that have some similarity sim(Q, Ci) > 0 to the query Q are found, and the query is forwarded to all cluster representatives. The query is routed down all paths of the cluster hierarchy until level 0. Practically, all subtrees that belong to a matching top-level cluster are searched.
The first approach reduces query latency, since the most relevant subset of peers will be identified at a small message cost. However, the number of returned documents will probably be restricted, since the search will focus on a local area only. This approach is more suitable for top-k queries. The second approach can access peers residing in remote areas (i.e., remote zones), with acceptable recall; however, this results in a larger number of messages. It is more suitable for cases where we are interested in the completeness of the search (retrieval of as many relevant documents as possible). In the following, we provide simulation results only for the second scenario, since we are mainly interested in testing the recall of our approach.
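Search alternative (b) amounts to pruning the cluster hierarchy on sim(Q, Ci) > 0. A minimal sketch, assuming clusters expose a term set, children, and member peers (the Cluster class and the term-overlap similarity are illustrative, not the paper's implementation):

```python
class Cluster:
    """Minimal stand-in for a node in the cluster hierarchy."""
    def __init__(self, features, children=(), peers=()):
        self.features = set(features)
        self.children = list(children)
        self.peers = set(peers)

def route_query(query_terms, cluster, matched_peers):
    """Descend every branch whose cluster has nonzero similarity
    to the query; collect the peers of matching level-0 clusters."""
    if not (set(query_terms) & cluster.features):   # sim(Q, Ci) = 0: prune
        return
    if not cluster.children:                        # level-0 cluster: search it
        matched_peers.update(cluster.peers)
    for child in cluster.children:
        route_query(query_terms, child, matched_peers)
```

Non-matching subtrees are pruned at the highest possible level, which is what keeps this alternative cheaper than blind flooding while preserving recall.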
5 Feasibility Analysis
We have studied the feasibility of applying DESENT in a real-world P2P system
through analytical cost models. Due to lack of space, we present here only the main
results of the analytical study, whereas the actual cost models are described in detail in
the extended version of this paper [7]. The parameters and default values used in the
cost models are summarized in Table 1. These are typical values (for sizes and performance) or values based on observations and conclusions from simulations.
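As a back-of-the-envelope illustration of how NP (the total number of peers) and SZ (the average zone size) shape the hierarchy: since each level groups roughly SZ units of the level below, there are about log_SZ NP initiator levels and roughly NP/SZ^i initiators at level i. A hypothetical helper:

```python
import math

def hierarchy_shape(n_peers: int, zone_size: int):
    """Approximate number of initiators at each level above the peers,
    assuming every level groups about `zone_size` units of the level below."""
    levels = max(1, math.ceil(math.log(n_peers, zone_size)))
    return [max(1, n_peers // zone_size ** i) for i in range(1, levels + 1)]
```

For example, 10,000 peers with zones of 100 give two initiator levels of roughly 100 and 1 initiators; the top-level initiators are the ones that bear the maximum cost CM discussed below.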
A very important concern is the burden the DESENT creation imposes on participating nodes. We assume that the communication cost is the potential bottleneck and hence
the most relevant metric, and we consider the cost of creating DESENT acceptable if
the cost it imposes is relatively small compared to the ordinary document-delivery load
on a web server.
C. Doulkeridis, K. Nørvåg, and M. Vazirgiannis
Fig. 3. Left: maximum cost of participation in overlay network creation for different values of network size NP and zone size SZ. Right: time TC to create DESENT as a function of ta, for different zone sizes (SZ = 50, 100) and bandwidths (e.g., B = 100 KB/s)
In studying the feasibility of DESENT, it is important that the average communication cost for each peer is acceptable, but even more important is the maximum cost that can be incurred by a peer, i.e., the cost for the initiators at the top level of the hierarchy.
In order to study the maximum cost CM for a particular peer to participate in the creation of the overlay network, both received and sent data should be counted because
both pose a burden on the peer. Fig. 3 (left) illustrates CM for different values of NP
and zone size SZ. We see that a large zone size results in higher cost, but with very high variance. This happens when the number of top-level peers is just below the minF threshold, so that the level below is used as the top level instead. With a large zone size this level contains a large number of peers, and the final exchange of cluster information between the roots of this forest becomes expensive. In practice, however, this could be mitigated by merging zones at this level. Regarding
the maximum cost, for a zone size of SZ = 100 it is just above 100 MB. Compared with the load of a typical web server, which delivers several GB of documents per day,³ this is acceptable even with daily reclustering. Moreover, since the role of the upper-level initiators changes every time the overlay network is created, it could even be feasible to perform this clustering more often. In addition to the cost described above, there is also a certain cost for maintaining replicas and handling peer dynamics in the network; however, this cost is relatively small compared to the upper-level exchange of CDs.
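The acceptability argument reduces to simple arithmetic, which can be checked directly; the figures are the paper's, while the 10% "acceptable" threshold is an illustrative assumption of ours.

```python
# Back-of-envelope check of the feasibility argument: the ~100 MB worst-case
# construction cost (zone size SZ = 100) against a typical web server's
# daily delivery load of ~4 GB. The 10% threshold is our own assumption.

max_construction_cost_mb = 100          # worst-case peer cost from the study
daily_server_load_mb = 4 * 1024         # ~4 GB of documents delivered per day

overhead = max_construction_cost_mb / daily_server_load_mb
print(f"Daily reclustering adds {overhead:.1%} of the normal load")  # ~2.4%

assert overhead < 0.10   # small relative to ordinary document delivery
```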
In order to ensure freshness of the search results, it is important that the duration
of the DESENT creation itself is not too long. The results, illustrated in Fig. 3 (right),
show the time needed to create DESENT for different values of maximum assumed
clock deviation, zone size SZ , and minimum available bandwidth for DESENT participation B. For typical parameter values and ta = 30s, the time needed to construct the
DESENT overlay network is between 3000 and 4000 seconds, i.e., approximately one
hour. This means that the DESENT creation could run several times a day, if desired. An
important point is that even if the construction takes a certain time, the average load the
construction imposes on peers will be relatively low. Most of the time is used to ensure that events are synchronized, without having to use communication for this purpose.
³ Using a web server in our department as an example, it delivers on the order of 4 GB per day, and a large fraction of this data is requested by search engines crawling the web.
Regarding values of parameters, it should be stressed that the actual number of peers
has only minimal impact on the construction time, because the height of the tree is the
important factor, and this increases only logarithmically with the number of peers.
6 DESENT Simulation Results
We have developed a simulation environment in Java, which covers all intermediate
phases of the overlay network generation as well as the searching part. We ran all our
experiments on Pentium IV computers with 3GHz processors and 1-2GB of RAM.
At initialization of the P2P network, a topology of NP interconnected peers is created. We used the GT-ITM topology generator⁴ to create random graphs of peers (we also used power-law topologies, with the same results, since the underlying topology only affects the zone creation phase), as well as our own SQUARE topology, which is similar to GT-ITM except that the connectivity degree is constant and neighboring peers share 3-5 common neighbors, i.e., the network is denser than GT-ITM. A
collection of ND documents is distributed to peers, so that each peer retains ND /NP
distinct documents. Every peer runs a clustering algorithm on its local documents resulting in a set of initial clusters. In our experiments we chose the Reuters-21578 text
categorization test collection,⁵ from which we used 8000 pre-classified documents belonging to 60 distinct categories, as well as a different setup with 20000 documents. We tried experimental setups with 2000, 8000 and 20000 peers. We then performed feature extraction (tokenization, stemming, stop-word removal, and finally keeping the top-k features based on their TF/IDF⁶ values), so that each document is represented by a compact top-k feature vector. Initiators retrieve the feature vectors of all peers within their zone in order to execute intra-zone clustering. We used hierarchical agglomerative clustering (HAC) to create clusters of documents. Clustering is based on computing document similarities and merging feature vectors by taking the union of the clusters' features and keeping the top-k features with the highest TF/IDF values. We used the cosine similarity, with the similarity threshold Ts as the merging criterion. Clusters are created by grouping together sufficiently similar documents, and each cluster is also represented by a top-k feature vector. Obviously, other clustering algorithms, as well as other similarity measures, can be used.
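The per-peer pipeline described above can be sketched as follows. This is a hedged reconstruction under our own assumptions: the function names, the tiny corpus, and the exact merge rule (keeping the higher weight of a term present in both vectors before truncating to k) are illustrative, not the authors' code.

```python
import math
from collections import Counter

# Sketch of the per-peer pipeline: local TF-IDF feature vectors (no global
# IDF is available, so IDF is computed over the peer's own documents only)
# and agglomerative merging of sufficiently similar clusters.

def local_tfidf_vectors(docs, k):
    """One top-k {term: tf-idf} vector per document, with IDF local to this peer."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n / df[t]) for t in tf}
        top = dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k])
        vectors.append(top)
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def merge(u, v, k):
    """Union of two feature vectors, keeping the k highest-weighted terms."""
    union = {t: max(u.get(t, 0.0), v.get(t, 0.0)) for t in set(u) | set(v)}
    return dict(sorted(union.items(), key=lambda kv: kv[1], reverse=True)[:k])

def hac(vectors, threshold, k):
    """Repeatedly merge the most similar pair of clusters while their cosine
    similarity exceeds `threshold` (the role of Ts above)."""
    clusters = [dict(v) for v in vectors]
    while len(clusters) > 1:
        pairs = [(cosine(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        best, i, j = max(pairs)
        if best < threshold:
            break
        clusters[i] = merge(clusters[i], clusters[j], k)
        del clusters[j]
    return clusters
```

Each surviving cluster is itself a top-k feature vector, so the same representation is reused at every level of the hierarchy, as the text describes.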
6.1 Zone Creation
We studied the average zone size after the zone creation phase at level 1. The network
topology consists of NP = 20000 peers, each having 10 neighbors on average and
⁶ Notice that the inverse document frequency (IDF) is not available, since no peer has global knowledge of the document corpus, so we use the TF/IDF values produced on each peer locally, taking only the local documents into account.
Fig. 4. Simulation results: DESENT clustering quality relative to centralized clustering, for different network sizes and values of k (left), and average recall compared to normalized flooding using the same number of messages (right)
SZ = 100. We ran the experiment with and without the zone partitioning mechanism. The simulations confirmed the value of zone partitioning: this mechanism keeps all zones smaller than SZ, with most of size 50-100. Without zone partitioning, about 30% of the zones grow larger than SZ, and some become twice as large as SZ, imposing a heavy load on several initiators.
6.2 Clustering Results Quality
Measuring the quality of the DESENT clustering results is essential for assessing the value of the approach. We define clustering quality in our context as the similarity of the clusters produced by our algorithm (Ci) to an optimal clustering (Kj). In our experiments we used the F-measure as the cluster quality measure; it ranges between 0 and 1, with higher values corresponding to better clustering.
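The paper does not spell out which F-measure variant it uses; a common formulation for comparing produced clusters Ci against reference classes Kj (per-class best F score, weighted by class size) can be sketched as follows, and should be read as a stand-in, not the authors' exact measure.

```python
# A standard clustering F-measure: for each reference class K_j, take the
# best F score over produced clusters C_i, and weight by class size.
# This particular formulation is our assumption.

def f_measure(clusters, classes):
    n = sum(len(k) for k in classes)
    total = 0.0
    for k in classes:
        best = 0.0
        for c in clusters:
            overlap = len(set(c) & set(k))
            if overlap == 0:
                continue
            p = overlap / len(c)    # precision of cluster c w.r.t. class k
            r = overlap / len(k)    # recall of cluster c w.r.t. class k
            best = max(best, 2 * p * r / (p + r))
        total += len(k) / n * best
    return total                     # 1.0 only for a perfect clustering
```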
We compare the clustering quality of our approach to the centralized clustering results. The average values of the DESENT F-measure relative to centralized clustering are illustrated in the left part of Fig. 4 and show that DESENT achieves high clustering quality. Note also that the results remain relatively stable as the network size increases, indicating that DESENT scales well with the number of participating peers. Thus, the proposed system forms high-quality SONs despite the lack of global knowledge and the wide distribution of the content.
6.3 Quality and Cost of Searching
In order to study the quality of searching in DESENT, we consider as baseline the
search that retrieves all documents that contain all keywords in a query. We measure the
searching quality using recall, representing the percentage of the relevant documents
found. Note that, for the assumed baseline, precision will always be 100% in our approach, since the returned documents will always be relevant, due to the exact matching
of all keywords. We generated a synthetic query workload consisting of queries with an average term count of 2.0 and a standard deviation of 1.0. Query terms were selected randomly from the documents (ignoring terms with document frequency below 1%), and the querying peer was selected randomly.
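A workload generator matching this description can be sketched as follows; clamping the sampled length to at least one term and the exact frequency-cutoff handling are our assumptions.

```python
import random

# Sketch of the synthetic query workload described above: query length drawn
# from a normal distribution (mean 2.0, stddev 1.0, clamped to at least one
# term -- the clamping is our assumption), terms sampled uniformly from the
# vocabulary after dropping terms appearing in fewer than 1% of documents.

def build_vocabulary(docs, min_doc_freq=0.01):
    counts = {}
    for doc in docs:
        for term in set(doc):
            counts[term] = counts.get(term, 0) + 1
    cutoff = min_doc_freq * len(docs)
    return [t for t, c in counts.items() if c >= cutoff]

def generate_queries(docs, num_queries, rng=random):
    vocab = build_vocabulary(docs)
    queries = []
    for _ in range(num_queries):
        length = max(1, round(rng.gauss(2.0, 1.0)))
        queries.append(rng.sample(vocab, min(length, len(vocab))))
    return queries
```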
Scalable Semantic Overlay Generation for P2P-Based Digital Libraries
In the right part of Fig. 4, we show the average recall of our approach compared to
normalized flooding using the same number of messages for different values of k, for
the GT-ITM topology and the SQUARE topology for 8000 peers. Normalized flooding [8] is a variation of naive flooding that is widely used in practice, in which each
peer forwards a query to d neighbors, instead of all neighbors, where d is usually the
minimum connectivity degree of any peer in the network. The chart shows that, with the same number of messages, our approach improves recall by a factor of 3-5 for GT-ITM and by more than 10 for SQUARE, compared to normalized flooding. Furthermore, the absolute recall values increase with k, since more queries can match the enriched (with more features) cluster descriptions. Notice also that our approach achieves the same recall independently of the underlying network topology.
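The normalized flooding baseline of [8] can be sketched as follows; the TTL-based termination and the random choice of the d forwarding targets are our assumptions about an otherwise underspecified detail.

```python
import random

# Sketch of normalized flooding, the baseline above: each peer forwards the
# query to d randomly chosen neighbors (d = minimum connectivity degree in
# the network) instead of all of them. `graph` maps each peer to its
# neighbor list; the TTL-based termination is our assumption.

def normalized_flood(graph, start, ttl, rng=random):
    d = min(len(nbrs) for nbrs in graph.values())   # minimum degree
    visited = {start}
    frontier = [start]
    messages = 0
    for _ in range(ttl):
        next_frontier = []
        for peer in frontier:
            targets = rng.sample(graph[peer], min(d, len(graph[peer])))
            for nbr in targets:
                messages += 1                        # every forward costs a message
                if nbr not in visited:
                    visited.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
    return visited, messages
```

Counting `messages` this way gives the budget against which DESENT's recall is compared in Fig. 4 (right).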
7 Conclusions and Further Work
In this paper, we have presented algorithms for distributed and decentralized construction of hierarchical SONs to support search in a P2P-based digital library context. Future work includes evaluating the performance and quality of the search algorithm on large document collections, studying the use of other clustering algorithms, and employing caching techniques and ranking to increase efficiency.
Acknowledgments. The authors would like to thank George Tsatsaronis and Semi
Koen for their help in preparing the feature extraction and clustering modules.
References
1. W.-T. Balke, W. Nejdl, W. Siberski, and U. Thaden. DL meets P2P - Distributed Document
Retrieval based on Classification and Content. In Proceedings of ECDL’2005, 2005.
2. Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, and S. Shenker. Making Gnutella-like
P2P Systems Scalable. In Proceedings of SIGCOMM’03, 2003.
3. E. Cohen, H. Kaplan, and A. Fiat. Associative Search in Peer-to-Peer Networks: Harnessing
Latent Semantics. In Proceedings of INFOCOM’03, 2003.
4. A. Crespo and H. Garcia-Molina. Routing Indices for Peer-to-Peer Systems. In Proceedings
of ICDCS’2002, 2002.
5. A. Crespo and H. Garcia-Molina. Semantic Overlay Networks for P2P Systems. Technical
report, Stanford University, 2002.
6. H. Ding and I. Sølvberg. Choosing Appropriate Peer-to-Peer Infrastructure for your Digital
Libraries. In Proceedings of ICADL’2005, 2005.
7. C. Doulkeridis, K. Nørvåg, and M. Vazirgiannis. DESENT: Decentralized and Distributed
Semantic Overlay Generation in P2P Networks. Technical report, AUEB, 2005. http://www.db-net.aueb.gr/index.php/publications/technical reports/
8. C. Gkantsidis, M. Mihail, and A. Saberi. Hybrid Search Schemes for Unstructured Peer-to-Peer Networks. In Proceedings of INFOCOM'05, 2005.
9. Y. Ioannidis, H.-J. Schek, and G. Weikum, editors. Proceedings of the 8th International
Workshop of the DELOS Network of Excellence on Digital Libraries on Future Digital Library Management Systems (System Architecture & Information Access), 2005.
10. X. Liu, J. Wang, and S. T. Vuong. A Category Overlay Infrastructure for Peer-to-Peer Content
Search. In Proceedings of IPDPS’05, 2005.
11. H. Nottelmann and N. Fuhr. Comparing Different Architectures for Query Routing in Peer-to-Peer Networks. In Proceedings of ECIR'2006, 2006.
12. L. Ramaswamy, B. Gedik, and L. Liu. Connectivity based Node Clustering in Decentralized
Peer-to-Peer Networks. In Proceedings of P2P’03, 2003.
13. H. T. Shen, Y. Shu, and B. Yu. Efficient Semantic-based Content Search in P2P Network.
IEEE Transactions on Knowledge and Data Engineering, 16(7):813–826, 2004.
14. P. Triantafillou, C. Xiruhaki, M. Koubarakis, and N. Ntarmos. Towards High Performance
Peer-to-Peer Content and Resource Sharing Systems. In Proceedings of CIDR’03, 2003.
15. B. Yang and H. Garcia-Molina. Designing a Super-Peer Network. In Proceedings of
ICDE’03, 2003.
Reevaluating Access and Preservation Through
Secondary Repositories: Needs, Promises, and Challenges
Dean Rehberger, Michael Fegan, and Mark Kornbluh
MATRIX, 310 Auditorium, Michigan State University, East Lansing, MI 48824-1120, USA
[email protected], [email protected], [email protected]
Abstract. Digital access and preservation questions for cultural heritage institutions have focused primarily on primary repositories — that is, around collections of discrete digital objects and associated metadata. Much of the promise
of the information age, however, lies in the ability to reuse, repurpose, combine
and build complex digital objects [1-3]. Repositories need both to preserve and make accessible primary digital objects and to facilitate their use in a myriad of ways. Following the lead of other annotation projects, we argue for the development of secondary repositories where users can compose structured collections of complex digital objects. These complex digital objects point back to the
primary digital objects from which they are produced (usually with URIs) and
augment these pointers with user-generated annotations and metadata. This paper examines how this layered approach to user generated metadata can enable
research communities to move forward into more complex questions surrounding digital archiving and preservation, addressing not only the fundamental
challenges of preserving individual digital objects long term, but also the access
and usability challenges faced by key stakeholders in primary digital repository
collections—scholars, educators, and students. Specifically, this project will
examine the role that secondary repositories can play in the preservation and
access of digital historical and cultural heritage materials with particular emphasis on streaming media.
1 Introduction
To date, digital access and preservation questions for cultural heritage institutions
have focused chiefly on primary repositories — that is, around collections of discrete
digital objects and associated metadata. From the Library of Congress’ American
Memory to the digital image collection of the New York Public Library to the University of Heidelberg Digital Archive for Chinese Studies to the digital collections at
the National Library of Australia, millions of objects are being made available to the
general public that were once only the province of the highly trained researcher. Students have unprecedented access to illuminated manuscripts, primary and secondary
documents, art, sheet music, photographs, architectural drawings, ethnographic case
studies, historical voices, video, and a host of other rich and varied resources. The
rapid growth of primary materials available online is well documented, as are the
challenges posed by the “deep web.”
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 39 – 50, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Access is at issue as well as preservation. While access to "deep web" resources is difficult for most internet users, access to items in even the most well-established repositories is largely limited to search, browse, and view. Much of the
promise of the information age, however, lies in the ability to reuse, repurpose,
combine and build complex digital objects [1-3]. Repositories need both to preserve and make accessible primary digital objects and to facilitate their use in a myriad of ways. Following the EU-NSF DL all projects meeting in March 2002 in Rome,
Dagobert Soergel outlined a framework for the development of digital libraries by
proposing that DLs need to move beyond the “paper-based metaphors” that privilege the finding and viewing of documents to support new ways of doing
intellectual work [4]. The framework offers, among others, seven key points for
this transformation of digital libraries: one, DLs need to support collaboration and
communities of users with tools; two, the tools must be able to process and present
the materials in ways that “serve the user’s ultimate purpose”; three, users need to
build their own individual or community “information spaces through the process of
selection, annotation, contribution, and collaboration”; four, the tools need to be
easy to use and should automate as many processes as possible; five, users need to
be able to retrieve complex objects and interrelated structures; six, developers need
to do careful analysis of user tasks and needs; and seven, user training and education must be supported to enhance further exploration and use of digital libraries.
While this framework appears ambitious (and expensive), we propose the development of secondary repositories where users can compose structured collections of
complex digital objects with easy to use tools. These complex digital objects point
back to the primary digital objects from which they are produced (usually with
URIs) and users can augment these pointers with user-generated annotations and
metadata. In so doing, users can organize the objects they find from a variety of
DLs, personalizing and contextualizing the objects. They can gather a variety of
media and format types, providing a meaningful presentation for themselves and
their communities of users as well as a portal back to the digital libraries to encourage further investigation and discovery. The key element of the tool set is to provide affordances that encourage users to improve their ability to access digital libraries and to develop ontologies that make sense to their communities of users.
Since information in a secondary repository is generated and layered outside the controlling system of the primary repository, such contextualized metadata is proposed not as a replacement for current practices and initiatives but as an enhancement that supports the ongoing paradigm shift in research from object to use, from presentation to interaction.
This paper examines how this layered approach to user generated metadata can
enable research communities to move forward into more complex questions surrounding digital archiving and preservation, addressing not only the fundamental
challenges of preserving individual digital objects long term, but also the access and
usability challenges faced by key stakeholders in primary digital repository collections—scholars, educators, and students. Specifically, this project will examine the
role that secondary repositories can play in the preservation and access of digital
historical and cultural heritage materials with particular emphasis on streaming media.
2 Paradigm Shift
“Many of the digital resources we are creating today will be repurposed and re-used for reasons we cannot imagine today. . . .
Digital technologies are shaping creation, management, preservation, and access in ways which are so profound that traditional
methods no longer are effective. These changes will require a paradigm shift in research if it is to provide the innovations—whether
theoretical, methodological or technical—necessary to underpin
long term access to digital resources.”[1]
Many researchers and scholars within the digital library community recognize that
new and innovative research directions are required to stimulate research on the long-term management and preservation of digital media [2]. The reasons for the call for a
paradigm shift in the Digital Library community’s research agenda are simple and direct. While access to online resources has steadily improved in the last decade, online
archives and digital libraries still remain difficult to use, particularly for students and
novice users [5]. In some cases, large amounts of resources have been put into massive digitization initiatives that have opened rich archives of historical and cultural
materials to a wide range of users. Yet the traditional cataloging and dissemination
practices of libraries and archives make it difficult for these users to locate and effectively use these sources, especially within scholarly and educational contexts [6].
Many digital libraries around the country, large and small, have made admirable efforts toward creating user portals and galleries to enhance the usability of their holdings, but these efforts are often expensive and labor-intensive, frequently speaking directly only to a small segment of users or giving limited options for user interactivity. Most popular is the user-generated collection (e.g., the Main Memory Network, where users create their own image galleries [7]). While an important step forward, these initiatives often produce tools that can be used only within the single archive that developed them.
To address these problems and to initiate the paradigm shift, researchers have
questioned the gulf that separates issues of access from issues of preservation. Preservation and access are no longer thought of entirely in terms of stand-alone files or individual digital objects, but in terms of active use—how users find, use, reuse, repurpose, combine and build complex digital objects out of the objects they collect.
This assumption relies on a more complex meaning for the term “access.” Many
scholars in the field have called for a definition of access that goes beyond search interfaces to the ability of users to retrieve information “in some form in which it can be
read, viewed, or otherwise employed constructively”[6, 8, 9]. Access thus implies
four related conditions that go beyond the ability to link to a network: 1) equity—the
ability of “every citizen” and not simply technical specialists to use the resources;
2) usability—the ability of users to easily locate, retrieve, use, and navigate resources;
3) context—the conveyance of meaning from stored information to users, so that it
makes sense to them; and 4) interactivity—the capacity for users to be both consumers and producers of information.
Researchers have noted that the keys to enhancing access for specific user groups,
contexts, and disciplines are to build repositories with resources and tools that allow
users to enhance and augment materials [10], share their work with a community of users [11], and easily manipulate the media with simple and intuitive tools. Users will also need portal spaces that escape the genre of link indexes and become flexible work environments that allow users to become interactive producers [12].
2.1 The Challenges of Metadata
Over the past decade, the digital library community has tried to reduce the labor and
expense of creating, cataloging, storing, and disseminating digital objects through the
research and development of specific practices to facilitate each of these stages. In
the face of ever-accelerating rates of complex data-creation and primary repository
development, the central challenge to the digital library community is the long term
sustainability and cost-effectiveness of primary digital repositories. The greatest cost
factor in the field of digital preservation is human labor, “with current methods relying on significant human intervention for selection, organization, description and access” [1]. Leaders in the field of digital preservation are asking how metadata,
semantics, and knowledge management technologies can enable the future reuse of primary repository collections while at the same time minimizing the labor-intensiveness of the process [2]. Although current processes have become easier, better documented, and more automated, creating and working with digital objects is still a very specialized endeavor that requires specialized hardware, software, and expertise. This expertise is largely beyond the reach and resources of many cultural institutions and small digital libraries.
In line with digital library best practices, digitized sources are typically cataloged to
describe their bibliographic information, along with technical, administrative, and
rights metadata. While these practices are essential for preserving the digital object
and making it available to users, unfortunately they do so in a language and guise often difficult to understand within the context of use [3, 13]. As Hope Olson points out, traditional cataloguing practices based on LCSH and DDC, while essential to giving access to items, often disproportionately affect access for marginalized groups and topics falling outside mainstream culture [14]. Similarly, even though the
author’s name, the title of the work, and keywords are essential for describing and locating a digital object, this kind of information is not always the most utilized information for ascertaining the relevance of a digital object. For instance, K-12 teachers
often do not have specific authors or titles in mind when searching for materials for
their classes. Teachers more frequently search in terms of grade level, the state and
national standards that form the basis of their teaching, or broad overarching topics
derived from the required content and benchmark standards (e.g., core democratic
values or textbook topics) that tend to return too many search results for the information to be of value.
This problem for educators has been one of the primary reasons for the development of Learning Object Metadata (LOM) [15]. Through improved metadata attached to learning objects, the hope is that educators can more easily find, assemble,
and use units of educational content. Using object-oriented programming as a metaphor, the emphasis is on avoiding needless replication of labor by assembling learning
objects found on the internet to build course material. This approach has provided
excellent resources, particularly for the sciences, math and engineering. Yet Paul
Shabajee has chronicled well the problems associated with learning object metadata
[10]. While it can do an excellent job of facilitating access to learning objects, especially for well-developed models and simulations, for raw assets (images, video segments, audio clips) assigning learning object metadata can exclude as much as give
access. For example, a set of images of a New Hampshire village may be designed for a college-level course on ethnography, but could be used at any level for a number of subjects from art to history to social studies to architecture (an infinite variety of uses). Moreover, learning object repositories are usually either collections of objects with no relation to other digital libraries (from which facets of the objects may have been taken) or collections of link reviews. While instructors can assemble
good materials for their classes, the materials are often in the form of sets of links that
do not articulate or contextualize access to related digital libraries nor do they allow
for much personalization or change.
Researchers have long grappled with the problems of costs, knowledge, and resources needed to do full cataloguing of digital objects. As is well known, the Dublin
Core initiative directly addresses the problem by specifying a minimal set of metadata
to enhance searching and indexing of digital objects. The Dublin Core has worked so
well that studies are now demonstrating that authors can apply metadata to their creations as well as professionals [16]. Similarly, taking advantage of XML namespaces, the Resource Description Framework (RDF) provides a modular approach to metadata, allowing for the accommodation of numerous and varied metadata packages from a variety of user groups. While viable instantiations of RDF have been limited to specialized areas and commerce, it does provide a wrapper that would work well for exchanging metadata between secondary repositories. Dublin Core metadata (which could be harvested or submitted from participating digital repositories) provides the initial metadata needed to create and access secondary repositories, and is then enhanced by user-generated metadata.
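The layering described above can be made concrete with a small sketch: a secondary-repository record that points back to the primary object by URI, carries a seed of harvested Dublin Core fields, and is enriched with user-generated annotations. All field names and the example values (including the URI) are hypothetical, not an existing repository's schema.

```python
# Illustrative shape of a secondary-repository record: a URI pointing back
# to the primary digital object, a seed of harvested Dublin Core metadata,
# and user-generated annotations layered on top. Every name and value here
# is a hypothetical example, not a real repository entry.

record = {
    "primary_uri": "http://primary.example.org/item/123",   # hypothetical URI
    "dublin_core": {                     # harvested from the primary repository
        "title": "Oral history interview",
        "creator": "Unknown",
        "type": "Sound",
        "format": "audio/mpeg",
    },
    "annotations": [                     # user-generated, discipline-specific
        {"user": "teacher42", "purpose": "teaching",
         "text": "Opening segment works well for a unit on oral tradition",
         "tags": ["grade-11", "US-history", "labor"]},
    ],
}

def searchable_terms(record):
    """Flatten seed Dublin Core values and user tags into one index entry,
    leaving the primary object (referenced only by its URI) untouched."""
    terms = set()
    for value in record["dublin_core"].values():
        terms.update(value.lower().split())
    for ann in record["annotations"]:
        terms.update(t.lower() for t in ann["tags"])
    return terms
```

The point of the design is that the enriched index lives entirely in the secondary repository: user metadata broadens discovery without requiring any change to the primary repository's cataloging.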
2.2 The Challenges of Annotating Streaming Media
Even though access by specialist scholars and educators to digital objects has grown
at an exponential rate, tangible factors have prevented them from fully taking advantage of these resources in the classroom, where they could provide the conceptual and
contextual knowledge of primary objects for their students. When educators do find
the materials they need, using objects from various primary repositories to put together presentations and resources for their students and research can be challenging.
Beyond merely creating lists of links to primary and secondary resources, assembling
galleries of images, segmenting and annotating long audio and video files require far
more technical expertise and time than can realistically be expected in the educational
context. Additionally, even though scholars have a long history of researching archives and are comfortable sifting through records, locating items, and making annotations, comparisons, summaries, and quotations, these processes do not yet translate
into online tools. Contemporary bibliographic tools have expanded to allow these users to catalogue and keep notes about media, but they do not allow users to mark specific passages and moments in multimedia, segment it, and return to specific places at
a later time. Multimedia and digital repository collections thus remain underutilized in
education and research because the tools to manipulate the various formats often
“frustrate would-be users” and take too much cognitive effort and time to learn [17].
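The missing capability described here (marking specific moments in streaming media, segmenting it, and returning to an exact place later) can be captured by a simple time-anchored annotation structure. This is a minimal sketch of our own; the class and field names are illustrative, not an existing tool's API.

```python
from dataclasses import dataclass, field

# Minimal sketch of a time-anchored annotation: it segments a streaming-media
# object (referenced by URI) and lets a user return to an exact moment later.
# All names are illustrative assumptions.

@dataclass
class MediaSegment:
    media_uri: str        # pointer back to the primary streaming object
    start_s: float        # segment start, in seconds
    end_s: float          # segment end, in seconds
    note: str = ""        # user's annotation for this passage
    tags: list = field(default_factory=list)

    def overlaps(self, t: float) -> bool:
        """True if playback time t falls inside this segment."""
        return self.start_s <= t <= self.end_s

def segments_at(segments, t):
    """All annotated passages covering playback time t."""
    return [s for s in segments if s.overlaps(t)]
```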
While cursory studies have indicated these access issues, still very little is known
about archival use or how these users express their information needs [18, 19]. For
digital libraries to begin to fulfill their potential, much research is needed to understand better the processes by which primary repositories are accessed and how information needs are expressed. For example, research needs to address the ways in
which teachers integrate content into their pedagogy so that bridges can be built from
digital repositories to the educational process, bridges that greatly facilitate the ability
of teachers and students to access specific information within the pedagogical process. Recent research strongly suggests that students need conceptual knowledge of
information spaces that allow them to create mental models to do strategic and successful searches. As with any primary source, the materials in digital libraries do not
literally “speak” for themselves and impart wisdom; they require interpretation and
analysis [20]. Allowing communities of users to enhance metadata and actively use,
reuse, repurpose, combine and build complex digital objects can help users to contextualize the information they find, draw from deeper resources within the digital library, and find more meaningful relationships between digital objects and their needs.
A distributed model, similar to that of the open-source software community, gives users both easier access to materials and a greater range of search criteria, while also providing opportunities for active engagement in the generation of metadata and complex digital objects; thinking in these terms promises to help us rethink our most basic assumptions about user access and long-term preservation.
Researchers have long recognized the importance of user generated annotations
and developing ontologies for differing user communities. Relevance feedback
from users and interactive query expansion have been used to successfully augment metadata for document and image retrieval. The annotation and Semantic
Web communities have made great strides in developing semi-automated annotation tools to enhance searching for a variety of media. Although many of the developed tools (SHOE Knowledge Annotator, MnM annotation tool, and WebKB)
focus on HTML pages, the CREAting Metadata for the Semantic Web (CREAM)
annotation framework promises to support manual and semi-automated annotation
of both the shallow and deep web through the development of OntoAnnotate [21].
Other annotation projects tend to focus on particular fields, such as G-Portal (geography) and ATLAS (linguistics), and support a number of user groups within the field.
Several of these annotation projects have worked remarkably well within distinct,
highly trained user groups, but are more problematic when used by untrained, general users or in fields with less highly defined ontologies.
The secondary repository that we have built draws on the lessons learned by the annotation community. It is responsible for handling secondary metadata, extended materials and
resources, interactive tools and application services. This information is cataloged,
stored, and maintained in a repository outside of the primary repository that holds the
digital object. The comments and observations generated by users in this context are
usually highly specialized because such metadata is created from discipline-specific,
scholarly perspectives (as an historian, social scientist, teacher, student, enthusiast, etc.)
and for a specific purpose (research, publishing, teaching, etc.). Affordances are built in
to help users identify themselves and their fields of interest. Even though the
Reevaluating Access and Preservation Through Secondary Repositories
information generated by a secondary repository directly relates to digital objects in
primary repositories, secondary repositories remain distinctly separate from the traditional repository. The information gathered in secondary repositories would rarely be
used in the primary cataloging and maintenance of the object, and primary repositories
would continue to be responsible for preservation, management, and long-term access
but could be freed from creating time-consuming and expensive materials, resources,
services, and extended metadata for particular user groups.
MATRIX: Center for Humane Arts, Letters and Social Sciences OnLine, at Michigan State University, for instance, has created a secondary repository using a server-side application called MediaMatrix [22]. This application is an online tool that
allows users to easily find, segment, annotate and organize text, image, and streaming
media found in traditional online repositories. MediaMatrix works within a web
browser, using the browser’s bookmark feature, a familiar tool for most users. When
users find a digital object at a digital library or repository, they simply click the MediaMatrix bookmark and it searches through the page, finds the appropriate digital
media, and loads it into an editor. Once this object is loaded, portions of the media
can be isolated for closer and more detailed work—portions of an audio or video clip
may be edited into annotated time-segments, images may be cropped then enlarged to
highlight specific details. MediaMatrix provides tools so that these media can be
placed in juxtaposition, for instance, two related images, a segment of audio alongside
related images and audio, and so forth. Most importantly, textual annotations can be
easily added to the media, and all this information is then submitted and stored on a
personal portal page.
This portal page can be created by a scholar-educator who wishes to provide specific and contextualized resources for classroom use, and/or by a student creating a
multimedia-rich essay for a class assignment. While these users have the immediate
sense that they are working directly with primary objects, it is important to emphasize
that primary repository objects are not actually being downloaded and manipulated.
MediaMatrix does not store the digital object; rather, it stores a pointer to the digital
object (URI) along with time or dimension offsets the user specified for the particular
object and the user’s annotation for that particular object. This use of URI pointing as
opposed to downloading is especially significant because it removes the possibility
that items may be edited and critiqued in contexts divorced from their original repositories, which hold the primary and crucial metadata for such objects. As long as
primary repositories maintain persistent URIs for their holdings the pointer to the
original digital object will always remain within the secondary repository, which acts
as a portal to both the primary collection and contextualizing and interpretive information generated by individuals on items in those collections. This information is
stored in a relational database along with valuable information about the individual,
who supplies a profile regarding their scholarly/educational background, and provides
information of the specific purposes for this work and the user-group (a class, for example) accessing the materials. The secondary repository can thus be searched and
utilized in any number of ways.
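The shape of such an entry can be sketched as follows; the field names here are illustrative assumptions, not MediaMatrix's actual schema. A minimal record couples the persistent URI with the user's offsets, annotation, and profile:

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class SecondaryRecord:
    """One user-generated entry: a pointer to a primary-repository object
    plus the user's segment offsets and annotation. The digital object
    itself is never copied into the secondary repository."""
    object_uri: str                        # persistent URI from the primary repository
    start_offset: Optional[float] = None   # seconds into an audio/video stream
    end_offset: Optional[float] = None
    crop_box: Optional[Tuple[int, int, int, int]] = None  # (x, y, w, h) for images
    annotation: str = ""
    profile: dict = field(default_factory=dict)  # discipline, role, purpose

entry = SecondaryRecord(
    object_uri="http://primary.example.org/objects/1234",
    start_offset=62.0,
    end_offset=95.5,
    annotation="Speaker recalls the 1963 march; fits the grade 10-12 unit.",
    profile={"role": "teacher", "field": "history", "grade": "10-12"},
)
```

Because every entry carries both the pointer and the creator's profile, the same table supports re-access, profile-filtered search, and usage reporting back to the primary repository.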
3 Secondary Repositories and the Sustainability of Primary Repositories
At its most basic level, a secondary repository provides four levels of information concerning the use of digital objects housed in the primary repository: what is being used;
what portions of those files are most utilized; who is using the digital objects; and, for
what purpose are they using it. This information may be utilized in a number of different ways to support preservation and migration practices and the long-term sustainability of digital archives. Secondary repositories can instantly generate a list of the digital
objects being used from any primary repository. This information could be used in determining digitization and preservation strategies as materials that are being utilized
most by users might be pushed up the migration schedule and materials similar to those
being most utilized might be digitized ahead of those materials that are least used. Because secondary repositories like MediaMatrix also allow users to segment digital
objects by storing the time parameters of the sections they use, secondary repositories
reveal what parts of digital objects users are most frequently accessing. This is not only helpful in determining segmentation strategies for all files and whether to create further semantic, intellectually meaningful segments for specific files; it also removes the need for segmentation by the primary repository altogether. Repositories can store the
time offsets (for audio and video files) or dimension markers (for images) to dynamically create segments of whole digital objects by feeding the offsets to the appropriate
media player when the digital object is streamed or downloaded.
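This offset-driven segmentation can be sketched in a few lines; the `#t=start,end` fragment convention used here is one illustrative choice, and a streaming server could equally accept the offsets as query parameters:

```python
def segment_url(object_uri: str, start: float, end: float) -> str:
    """Build a playback URL for a stored time segment, so the primary
    repository never has to split the file itself; the player (or
    streaming server) seeks to the stored offsets at request time."""
    return f"{object_uri}#t={start:g},{end:g}"

print(segment_url("http://primary.example.org/oralhistory/44.mp4", 62.0, 95.5))
# http://primary.example.org/oralhistory/44.mp4#t=62,95.5
```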
Of key importance to digital libraries is the issue of getting digital access and preservation on the agenda of key stakeholders such as universities and education
systems. This agenda must be presented in terms that they will understand, and the
ability to provide information about who from these various communities is accessing particular digital objects from their holdings and for what purpose they are using
them will be invaluable. The information contained in secondary repositories can assist stewards of primary repositories in building galleries and portals of digital objects
that pertain to the needs of specific populations of users. This enables a more targeted
approach to funding and project development. Whereas most primary repositories
have educational sections, limitations in resources and labor often mean that they can
typically only offer a limited number of lesson plans that have relatively few digital
objects (in relation to whole collections) from the primary holdings associated with
them. Secondary repositories may give curators of primary repositories a better
glimpse into how a specific user-base is using their holdings. Digital libraries can
package materials especially suited for a specific demographic as well as instantly offer “additional suggestions” via a qualitative recommender system (for example, “Social Science, Grade 10-12 Teachers who accessed this image also viewed these
resources”). Secondary repositories can even offer suggestions and links to similar
digital objects housed at other primary repositories, therefore offering a truly federated resource. Secondary repositories can not only directly impact the sustainability of long-term preservation projects, but also provide fruitful areas for further research and development on how recommender systems can be used effectively in these contexts, and how users interact with digital objects and personalize and repurpose information within specific contexts for specific purposes.
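A minimal co-occurrence recommender over secondary-repository access logs might look like the following sketch; the record layout and profile keys are assumptions for illustration, not an implemented system:

```python
from collections import Counter

def also_viewed(entries, target_uri, profile_filter, top_n=3):
    """Suggest objects used by similar users who also used target_uri.

    entries: list of (user_id, profile_dict, object_uri) access records.
    profile_filter: profile keys defining the demographic, e.g.
                    {"role": "teacher", "grade": "10-12"}.
    """
    def matches(profile):
        return all(profile.get(k) == v for k, v in profile_filter.items())

    # users in the demographic who accessed the target object
    cohort = {uid for uid, prof, uri in entries
              if uri == target_uri and matches(prof)}

    # everything else that cohort touched, ranked by frequency
    counts = Counter(uri for uid, prof, uri in entries
                     if uid in cohort and uri != target_uri)
    return [uri for uri, _ in counts.most_common(top_n)]

log = [
    ("u1", {"role": "teacher", "grade": "10-12"}, "obj:A"),
    ("u1", {"role": "teacher", "grade": "10-12"}, "obj:B"),
    ("u2", {"role": "teacher", "grade": "10-12"}, "obj:A"),
    ("u2", {"role": "teacher", "grade": "10-12"}, "obj:C"),
    ("u3", {"role": "student"}, "obj:A"),
    ("u3", {"role": "student"}, "obj:D"),
]
suggestions = also_viewed(log, "obj:A", {"role": "teacher", "grade": "10-12"})
# the student's obj:D is excluded; only the teacher cohort's objects appear
```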
In creating new models for making digital preservation affordable and attractive to
individuals, government agencies, universities, cultural institutions, and society at
large, secondary repositories can perform vital roles. By enhancing and increasing
meaningful access to primary repository holdings and by providing tools for quantifying and assessing that access within specific groups and educational context, secondary repositories can raise public awareness of digital preservation needs and also
attract key stakeholders such as universities, libraries, and government agencies to invest in the continuance of digital preservation and access.
3.1 Secondary Repositories and Metadata
Secondary repositories may also provide a wealth of extensive metadata pertaining to the digital objects to which they point. While many would discount the usefulness of this metadata because it is primarily user-generated and does not follow cataloging standards, the annotations and notes generated by users could, like dirty transcripts that contain various kinds and levels of errors, still be used as additional criteria for keyword searching. This metadata would not replace traditional descriptions, keywords,
and subject headings developed by catalogers, but rather it would be used in tandem
with this metadata. As noted above, the real utility of this metadata is that it is generated from a very discipline/user specific vantage point and speaks to the language and
conventions of that group. Traditional finding tools (keyword searches, thematic
browsing, galleries, etc.) are problematic to many segments of users, stemming not
only from the user’s inability to formulate effective searches or lack of knowledge,
but also the metadata that is searched and used to create these utilities. Because the
metadata generated from secondary repositories is created by the same kind of user
who will eventually search for specific digital objects, it often speaks directly to the
methods and language they will use with search and browse utilities. User-specific
metadata sets can be created using user profiles so that scholars have the ability to
search the traditional catalogs, but also search through the annotations created by others within their field. Teachers will be able to search through the metadata created by
others teaching the same grade level and subject matter. While traditional metadata
approaches need to remain driven by best practices and community standards, secondary repositories provide a way to augment this metadata with a very personalized
method of finding information.
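One way to sketch such profile-filtered searching over user annotations (the record and profile field names are illustrative assumptions):

```python
def search_secondary(records, terms, field=None, grade=None):
    """Keyword search over user annotations, optionally restricted to
    annotations written by users in a given discipline or grade level."""
    hits = []
    for rec in records:
        prof = rec["profile"]
        if field is not None and prof.get("field") != field:
            continue
        if grade is not None and prof.get("grade") != grade:
            continue
        text = rec["annotation"].lower()
        if all(t.lower() in text for t in terms):
            hits.append(rec["uri"])
    return hits

notes = [
    {"uri": "obj:12",
     "annotation": "Great Migration oral history, useful for unit 3",
     "profile": {"field": "history", "grade": "10-12"}},
    {"uri": "obj:34",
     "annotation": "Oral history of mill workers",
     "profile": {"field": "sociology"}},
]
# an unfiltered query matches both records; field="history" keeps only the first
```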
This personalized and organic approach to metadata will help archivists of primary
repositories identify what types of information future generations will need to use archival records, and help us to begin to answer the question “what information will
people need to be able to continuously use records across time?” Secondary repositories can thus raise interesting questions as to the very function of metadata and what it
means to preserve an object. The object itself represents “the tip of a very large iceberg; the tip is visible above the water only because there is a large mass of complex
social relationships ‘underneath’ it—that generate, use and give meaning to, the digital documents.” The object itself is more effectively thought of as a principle of organization for a complex nexus of interactions, events, and conversations that give
meaning to a particular object. But, as Wendy Duff asks, how would archivists begin
to represent the context of these records? What types of metadata are needed to
document these relationships? There are many levels of metadata that need to be
addressed to properly catalog the creation, nature, and life of a digital object [19].
Descriptive, copyright, source, digital provenance, and technical metadata work to ensure digital repositories can properly manage, find, migrate, and disseminate digital
objects. In a sense they track the life of a digital object (where it came from, its different manifestations, changes in copyright, etc.) and ensure its access in the future
(descriptive metadata will function as a traditional finding aid, and technical metadata will provide information on how to render the object). While these hooks into digital objects can never be replaced by user-generated metadata, other disciplines would argue that more is needed to truly preserve the life and meaning of an
object over time. Indeed, social theorists would argue that a digital object’s meaning
is socially constructed through its use. Thus one way to begin to understand an object
is to understand how people interpreted and used the object at a particular point in
time. Similar to the marginalia written in books, interpretations of works of art or historical artifacts, translations of the now-lost Dead Sea Scrolls, or the scribbled notes and diagrams in Watson and Crick’s workbooks, secondary repositories provide a unique way of documenting and preserving the meaning—and the construction of the meaning—of an object by revealing how specific users made meaning out of the object
at specific times and for specific uses. If the goal of preservation is to retain the truest
sense of an object over time, this information would help define a richer sense of an
object’s meaning at any given time.
The preservation of metadata that works to preserve the meaning of a digital object
over time is being broached in an indirect way through the development of Fedora (Flexible Extensible Digital Object and Repository Architecture, http://www.fedora.info/) and the use and development of METS (Metadata Encoding and Transmission Standard, http://www.loc.gov/standards/mets/). METS is a metadata standard that was
specifically created to encode all levels of metadata needed to preserve, manage and disseminate a digital object. Fedora, which is an open-source digital object repository
management system, uses METS as its primary metadata scheme. In its early conception, Fedora struggled with the METS scheme because METS did not have a specific way of documenting the behaviors (definitions and mechanisms) Fedora uses for each digital object. Behaviors are directions for doing something with a digital object, together with the parameters needed to perform that action. A sample behavior might be to
get an image and display it at a specific size in a web browser. This information does
not specifically describe the digital object; instead it provides instructions for computer
applications on how to process the digital object in a particular way. The original inception of METS did not have an obvious place to store this information within the METS
scheme. The creators of Fedora successfully lobbied to have a section added where multiple behaviors could be tied to a single digital object.
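A behavior can be thought of as a small, declarative record coupling an operation with its parameters. The following Python sketch illustrates the concept only; the field names and the stand-in service are assumptions, not Fedora's actual encoding:

```python
# A behavior couples an abstract operation with the parameters needed to
# perform it; the repository resolves the mechanism against a stored object.
behavior = {
    "definition": "getThumbnail",           # what to do
    "mechanism": "image-resize-service",    # which service performs it
    "parameters": {"width": 120, "height": 90, "format": "image/jpeg"},
    "object": "demo:1234",                  # the digital object it applies to
}

def invoke(behavior, render):
    """Dispatch a behavior: apply the named operation with its parameters."""
    p = behavior["parameters"]
    return render(behavior["object"], p["width"], p["height"])

# a stand-in for a real dissemination service:
rendered = invoke(behavior, lambda obj, w, h: f"{obj} rendered at {w}x{h}")
```

The key point for preservation is that the record describes how the object was processed and presented, not what the object is.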
While this information is functional in the use and dissemination of digital objects,
it also presents an interesting history of how specific digital objects were processed
and presented to users over time. It documents the evolution of technology and how
technology was used to present digital objects to users in a meaningful way. Secondary repositories would work much the same way by preserving which digital objects were selected and how users processed digital objects in their own work. While repositories produce metadata that documents the nature and life of a digital object so
that it can be managed and found, the difficult question remains: what other kinds of
metadata are required so that multiple audiences can successfully use digital objects
each in their own discipline-specific practices?
4 Conclusions
From this survey of work and our initial studies, we have found that several serious research and community challenges still need to be broached.
Persistent URIs: URIs are an important aspect of secondary repositories and of tools
for building secondary repositories like MediaMatrix. Digital archives and libraries
have increasingly hindered access and re-access to digital objects by limiting access
or granting temporary URIs for a digital object. While the importance of stable, persistent URIs has been well documented in the library community, they are especially
important to secondary repositories. To respect the access restrictions built around
digital objects by primary repositories, secondary repositories need to store a unique,
persistent URI that allows the user to re-access the digital object they have annotated.
Standardizing Secondary Repository Metadata: While metadata standards have
been thoroughly researched and developed for primary repositories, a standardized
metadata scheme for secondary repositories has yet to be developed. Metadata, semantics, and knowledge management technologies need to enable future reuse of collections in digital archives. In particular, research and standardization are required for
the metadata needed to help users make sense of objects, to help the secondary repository administrators manage the entries of users, and to preserve that information over
time. The standardization of secondary repository metadata is especially important so
metadata can be easily exchanged between secondary repository tools and between
the secondary and primary repository. For secondary repositories to be truly useful
for the user, they need to be able to use a number of different tools to work with and
produce information about digital objects. To work with multiple tools developed by
any number of institutions, a common approach to documenting and exchanging
metadata needs to be adopted so that users can easily take their entries from one tool
and import them into another. As noted above, this is one area in which substantial
work has been done by the annotation community of researchers; porting and reevaluating of this work needs to be done in relation to cultural heritage materials and the
humanities. It is also essential to produce a mutually beneficial relationship between
secondary and primary repositories. Primary repositories need to be able to easily access information that specifically pertains to the digital objects from their repository and utilize it
within their own infrastructure. It will also be beneficial if secondary repositories can
access and integrate small bits of metadata from the primary repository. This would
provide the user with official metadata to accompany their annotations (helping to automate as much of the process as possible) as well as provide a means of detecting changes in URIs and updates to the digital object itself.
Preservation of complex digital objects: Primary repositories primarily change
through the addition of digital objects to their holdings. The metadata for those digital objects is relatively static except for documenting the migration of those objects to
different storage mediums or file formats. Secondary repositories on the other hand
are organic, ever-changing entities. Users adjust the segments they have created, revise
annotations and other information they have recorded for the object, and delete whole
entries at will; they can restrict or allow varying levels of access. Because these entries work to help preserve the meaning of a digital object over time, questions arise as to
how we will preserve the changing metadata created by users or whether we will preserve these changes at all. Like dirty transcripts, do we accept the flawed nature of the
metadata created by users or are the changes made by users important bits of information in studying the use and evolution of the digital objects they describe?
Acknowledgments. Support for the project comes in large part from the JISC/NSF Digital Libraries Initiative II: Digital Libraries in the Classroom Program (National Science Foundation, award no. IIS-0229808).
References
1. M. Hedstrom, in Wave of the Future: NSF Post Digital Libraries Futures Workshop, Chatham, MA, 2003.
2. National Science Foundation and the Library of Congress, in Workshop on Research Challenges in Digital Archiving and Long-Term Preservation (M. Hedstrom, ed.), 2003, p. 1.
3. C. Lynch, in NSF Post Digital Libraries Futures Workshop, 2003.
4. D. Soergel, in D-Lib Magazine, Vol. 8, 2002.
5. W. Y. Arms, Digital libraries, MIT Press, Cambridge, Mass., 2000.
6. C. L. Borgman, From Gutenberg to the global information infrastructure: access to information in the networked world, MIT Press, Cambridge, Mass., 2000.
7. M. H. Society, Vol. 2004, 2001.
8. C. Lynch, Wilson Library Bulletin 69 (1995) 38.
9. B. Kahin, J. Keller, and Harvard Information Infrastructure Project, Public access to the Internet, MIT Press, Cambridge, Mass., 1995.
10. P. Shabajee, in D-Lib Magazine, Vol. 8, 2000.
11. R. Waller, in Ariadne, UKOLN, 2004.
12. P. Miller, in Ariadne, 2001.
13. C. Lynch, Personalization and Recommender Systems in the Larger Context: New Directions and Research Questions (keynote speech), Dublin City University, Ireland, 2001.
14. H. A. Olson, The power to name: locating the limits of subject representation in libraries, Kluwer Academic Publishers, Dordrecht, The Netherlands; Boston, 2002.
15. IEEE Learning Technology Standards Committee (LTSC), Vol. 2003, IEEE, 2002.
16. J. Greenberg, M. C. Pattuelli, B. Parsia, and W. D. Robertson, Journal of Digital Information 20 (2001).
17. J. R. Cooperstock, in HCI International, Conference on Human-Computer Interaction, New Orleans, LA, 2001, p. 688.
18. W. Duff and C. A. Johnson, American Archivist 64 (2001) 43.
19. W. Duff, Archival Science (2001) 285.
20. G. C. Bowker and S. L. Star, Sorting things out: classification and its consequences, MIT Press, Cambridge, Mass., 1999.
21. S. Handschuh and S. Staab, in WWW2002, Honolulu, Hawaii, 2002.
22. M. L. Kornbluh, D. Rehberger, and M. Fegan, in 8th European Conference, ECDL 2004, Springer, Bath, UK, 2004, p. 329.
Repository Replication Using NNTP and SMTP
Joan A. Smith, Martin Klein, and Michael L. Nelson
Old Dominion University, Department of Computer Science
Norfolk, VA 23529 USA
{jsmit, mklein, mln}@cs.odu.edu
Abstract. We present the results of a feasibility study using shared,
existing, network-accessible infrastructure for repository replication. We
utilize the SMTP and NNTP protocols to replicate both the
metadata and the content of a digital library, using OAI-PMH to
facilitate management of the archival process. We investigate how dissemination of repository contents can be piggybacked on top of existing email and Usenet traffic. Long-term persistence of the replicated
repository may be achieved thanks to current policies and procedures
which ensure that email messages and news posts are retrievable for
evidentiary and other legal purposes for many years after the creation
date. While the preservation issues of migration and emulation are not addressed with this approach, it does provide a simple method of refreshing content with unknown partners for smaller digital repositories that do not have the administrative resources for more sophisticated solutions.

We propose and evaluate two repository replication models that rely on shared,
existing infrastructure. Our goal is not to “hijack” other sites’ storage, but to take
advantage of protocols which have persisted through many generations and which
are likely to be supported well into the future. The premise is that if archiving
can be accomplished within a widely-used, already deployed infrastructure whose
operational burden is shared among many partners, the resulting system will
have only an incremental cost and be tolerant of dynamic participation. With
this in mind, we examine the feasibility of repository replication using Usenet
news (NNTP, [1]) and email (SMTP, [2]).
There are reasons to believe that both email and Usenet could function as persistent, if diffuse, archives. NNTP provides well-understood methods for content
distribution and duplicate deletion (deduping) while supporting a distributed
and dynamic membership. The long-term persistence of news messages is evident in “Google Groups,” a Usenet archive with posts dating from May 1981
to the present [3]. Even though blogs have supplanted Usenet in recent years,
many communities still actively use moderated news groups for discussion and
awareness. Although email is not usually publicly archivable, it is ubiquitous
and frequent. Our departmental SMTP email server averaged over 16,000 daily
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 51–62, 2006.
© Springer-Verlag Berlin Heidelberg 2006
outbound emails to more than 4000 unique recipient servers during a 30-day
test period. Unlike Usenet, email is point-to-point communication but, given
enough time, attaching repository contents to outbound emails may prove to be
an effective way to disseminate contents to previously unknown locations. The
open source products for news (“INN”) and email (“sendmail” and “postfix”)
are widely installed, so including a preservation function would not impose a
significant additional administrative burden.
These approaches do not address the more complex aspects of preservation
such as format migration and emulation, but they do provide alternative methods
for refreshing the repository contents to potentially unknown recipients. There
may be quicker and more direct methods of synchronization for some repositories,
but the proposed methods have the advantage of working with firewall-inhibited
organizations and repositories without public, machine-readable interfaces. For
example, many organizations have web servers which are accessible only through
a VPN, yet email and news messages can freely travel between these servers and
other sites without compromising the VPN. Piggybacking on mature software
implementations of these other, widely deployed Internet protocols may prove
to be an easy and potentially more sustainable approach to preservation.
Related Work
Digital preservation solutions often require sophisticated system administrator
participation, dedicated archiving personnel, significant funding outlays, or some
combination of these. Some approaches, for example Intermemory [4], Freenet [5],
and Free Haven [6], require personal sacrifice for public good in the form of donated storage space. However, there is little incentive for users to incur such
near-term costs for the long-term benefit of a larger, anonymous group. In contrast, LOCKSS [7] provides a collection of cooperative, deliberately slow-moving
caches operated by participating libraries and publishers to provide an electronic
“inter-library loan” for any participant that loses files. Because it is designed to
service the publisher-library relationship, it assumes a level of at least initial
out-of-band coordination between the parties involved. Its main technical disadvantage is that the protocol is not resilient to changing storage infrastructures.
The rsync program [8] has been used to coordinate the contents of digital library
mirrors such as the arXiv eprint server but it is based on file system semantics
and cannot easily be abstracted to other storage systems. Peer-to-peer services
have been studied as a basis for the creation of an archiving cooperative among
digital repositories [9]. The concept is promising but their simulations indicated
scalability is problematic for this model. The Usenet implementation [10] of the
Eternity Service [11] is the closest to the methods we propose. However, the
Eternity Service focuses on non-censorable anonymous publishing, not preservation per se.
The Prototype Environment
We began by creating and instrumenting a prototype system using popular,
open source products: Fedora Core (Red Hat Linux) operating system; an NNTP
news server (INN version 2.3.5); two SMTP email servers, postfix version 2.1.5
and sendmail version 8.13.1; and an Apache web server (version 2.0.49) with
the mod_oai module installed [12]. mod_oai is an Apache module that provides Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [13] access to a web server. Unlike most OAI-PMH implementations, mod_oai does not just provide metadata about resources; it can encode the entire web resource itself in MPEG-21 Digital Item Declaration Language [14] and export it through OAI-PMH. We used Perl to write our own repository replication tools, which were
operated from separate client machines.
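A GetRecord harvest of a single resource reduces to building one OAI-PMH request URL. The sketch below uses the test server's base URL from the experiment; the helper function itself is illustrative:

```python
from urllib.parse import urlencode

BASE_URL = "http://beatitude.cs.odu.edu:8080/modoai/"  # test server from the experiment

def getrecord_url(base_url: str, identifier: str) -> str:
    """Build the OAI-PMH GetRecord request that returns one record,
    with the resource itself encoded as an MPEG-21 DIDL by mod_oai."""
    params = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": "oai_didl",
        "identifier": identifier,
    })
    return f"{base_url}?{params}"

url = getrecord_url(BASE_URL,
                    "http://beatitude.cs.odu.edu:8080/1000/pg1000-1.pdf")
# fetching this URL (e.g. with urllib.request.urlopen) yields the DIDL XML
```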
As part of our experiment, we created a small repository of web resources
consisting of 72 files in HTML, PDF and image (GIF, JPEG, and PNG) formats.
The files were organized into a few subdirectories with file sizes ranging from
less than a kilobyte to 1.5 megabytes. For the NNTP part of the experiment,
we configured the INN news server with common default parameters: messages
could be text or binary; maximum message life was 14 days; and direct news
posting was allowed. For email, we did not impose restrictions on the size of
outgoing attachments and messages. For each archiving method, we harvested
the entire repository over 100 times.
Both the NNTP and SMTP methods used a simple, iterative process: (1) read a repository record; (2) format it for the appropriate archive target (mail or news); (3) encode the record content using base64; (4) add human-readable X-headers (for improved readability and recovery); (5) transmit the message (email or news post) to the appropriate server; (6) repeat steps 1 through 5 until the entire repository has been archived. Below, we discuss details of the differences in each of these steps as applied specifically to archiving via news or email.
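Steps 2 through 4 of this loop can be sketched for the email case using Python's standard library; the X-Header names follow Table 1, and the addresses are placeholders:

```python
import base64
from email.message import EmailMessage

def build_archive_message(identifier: str, content: bytes,
                          sender: str, recipient: str) -> EmailMessage:
    """Steps 2-4 for the email target: wrap one harvested record in a
    message, base64-encode its content, and add human-readable X-Headers."""
    msg = EmailMessage()
    msg["From"], msg["To"] = sender, recipient
    msg["Subject"] = f"repository record {identifier}"
    msg["X-OAI-PMH-verb"] = "GetRecord"        # how the record was harvested
    msg["X-OAI-PMH-Identifier"] = identifier   # where to re-find the original
    msg.set_content(base64.b64encode(content).decode("ascii"))
    return msg

msg = build_archive_message(
    "http://beatitude.cs.odu.edu:8080/1000/pg1000-1.pdf",
    b"%PDF-1.4 sample",
    "[email protected]", "[email protected]")
# step 5 would hand msg to smtplib.SMTP(host).send_message(msg)
```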
We took advantage of OAI-PMH and the flexibility of email and news to embed the URL of each record as an X-Header within each message. X-Headers
are searchable and human-readable, so their contents give a clue to the reader
about the purpose and origin of the message. Since we encoded the resource
itself in base64, this small detail can be helpful in a forensic context. If the URL still exists, then the X-Headers could be used to re-discover the original resource. Table 1 shows the actual X-Headers added to each archival message.
The News Prototype
For our experiment, we created a moderated newsgroup, which means that postings must be authorized by the newsgroup owner. This is one way newsgroups
keep spam from proliferating on the news servers. We also restricted posts to
selected IP addresses and users, further reducing the “spam window.” For the experiment, we named our newsgroup “repository.odu.test1,” but groups can have
any naming scheme that makes sense to the members. For example, a DNS-based
J.A. Smith, M. Klein, and M.L. Nelson
Table 1. Example of Human-Readable X-Headers Added to Archival Messages
X-Harvest Time: 2006-2-15T18:34:51Z
X-baseURL: http://beatitude.cs.odu.edu:8080/modoai/
X-OAI-PMH verb: GetRecord
X-OAI-PMH metadataPrefix: oai_didl
X-OAI-PMH Identifier: http://beatitude.cs.odu.edu:8080/1000/pg1000-1.pdf
X-sourceURL: http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord
&metadataPrefix=oai_didl
X-HTTP-Header: HTTP/1.1 200 OK
scheme that used “repository.edu.cornell.cs” or “repository.uk.ac.soton.psy”
would be a reasonable naming convention.
Using the simple six-step method outlined above, we created a news message
with X-Headers for each record in the repository. We also collected statistics
on (a) original record size vs. posted news message size; (b) time to harvest,
convert, and post a message; and (c) the impact of line-length limits in news posts.
Our experiment showed high reliability for archiving using NNTP: 100% of the
records arrived intact on the target news server (“beatitude”), and 100%
of the records were almost instantaneously mirrored on a subscribing news server
(“beaufort”). A network outage during one of the experiments temporarily prevented communication between the two news servers, but the records were replicated as soon as connectivity was restored.
The Email Prototype
The two sides of SMTP-method archiving, outbound and inbound, are shown
in Figure 1. Archiving records by piggybacking on existing email traffic requires
sufficient volume to support the effort and to determine which hosts are the
best recipients. Analysis of outbound email traffic from our department during a
30-day period showed 505,987 outgoing messages to 4,081 unique hosts.
(a) Outbound Mail
(b) Inbound Mail
Fig. 1. Archiving Using SMTP
Repository Replication Using NNTP and SMTP
A power-law relationship is also evident (see Figure 2) between a domain’s rank
and the email volume sent to that domain:
Vκ = c × κ^(−1.6)    (1)
Using the Euler Zeta function (discussed in detail in [15]), we derived the value
of the constant in Equation 1: c = 7378.
Fig. 2. Email distribution follows a power law
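As a quick check on Equation 1, the constant can be recovered from the measured traffic: the 30-day total corresponds to roughly 16,866 messages per day, and normalising that by the zeta value at 1.6 (approximated here by direct summation) yields a constant close to the reported c = 7378. This is our own back-of-envelope sketch, not code from the paper.

```python
# Sketch: recover the power-law constant c of Equation 1, assuming total
# daily volume equals the sum over all ranks of c * k**-1.6.
daily_volume = 505987 / 30                    # ~16,866 messages per day

# Approximate the zeta value at 1.6 by direct summation.
zeta_1_6 = sum(k ** -1.6 for k in range(1, 200_000))

c = daily_volume / zeta_1_6
print(round(c))   # close to the reported constant c = 7378
```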
Prototype Results
Having created tools for harvesting the records from our sample digital library,
and having used them to archive the repository, we were able to measure the
results. How fast is each prototype and what penalties are incurred? In our
email experiment, we measured approximately a one-second delay in processing
attachments of sizes up to 5 MB. With NNTP, we tested postings in a variety of
sizes and found processing times ranging from 0.5 seconds (12 KB) to 26.4 seconds (4.9 MB). Besides the trivial linear relationship between repository size and
replication time, we found that even very detailed X-Headers do not add a significant burden to the process. Not only are they small (a few bytes) relative to
record size, but they are quickly generated (less than 0.001 seconds per record)
and incorporated into the archival message. Both NNTP and SMTP protocols
are robust, with most products (like INN or sendmail) automatically handling
occasional network outages or temporary unavailability of the destination host.
News and email messages are readily recovered using any of a number of “readers” (e.g., Pine for email or Thunderbird for news). Our experimental results
formed the basis of a series of simulations using email and Usenet to replicate a
digital library.
Simulating the Archiving Process
When transitioning from live, instrumented systems to simulations, there are
a number of variables that must be taken into consideration in order to arrive
at realistic figures (Table 2). Repositories vary greatly in size, rate of updates
and additions, and number of records. Regardless of the archiving method, a
repository will have specific policies (“Sender Policies”) covering the number of
copies archived; how often each copy is refreshed; whether intermediate updates
are archived between full backups; and other institutional-specific requirements
such as geographic location of archives and “sleep time” (delay) between the
end of one completed archive task and the start of another. The receiving agent
will have its own “Receiver Policies” such as limits on individual message size,
length of time messages live on the server, and whether messages are processed
by batch or individually at the time of arrival.
Table 2. Simulation Variables
R       Number of records in repository
Rs      Mean size of records
Ra      Number of records added per day
Ru      Number of records updated per day
ρ       Number of records posted per day
Nttl    News post time-to-live
S       “Sleep” time between baseline harvests
ρnews   Records postable per day via news
Tnews   Time to complete baseline using news
κ       Rank of receiving domain
c       Constant derived from Euler Zeta function
ρemail  Records postable per day via email
Temail  Time to complete baseline using email
A key difference between news-based and email-based archiving is the active-vs-passive nature of the two approaches. This difference is reflected in the policies
and how they impact the archiving process under each method. A “baseline”
refers to making a complete snapshot of a repository. A “cyclic baseline” is the
process of repeating the snapshot over and over again (S = 0), which may result
in the receiver storing more than one copy of the repository. Of course, most
repositories are not static. Repeating baselines will capture new additions (Ra )
and updates (Ru ) with each new baseline. The process could also “sleep” between
baselines (S > 0), sending only changed content. In short, the changing nature
of the repository can be accounted for when defining its replication policies.
Archiving Using NNTP
Figure 3 illustrates the impact of policies on the news method of repository
replication. A baseline, whether it is cyclic or one-time-only, should finish before
the end of the news server message life (Nttl ), or a complete snapshot will not be
achieved. The time to complete a baseline using news is obviously constrained
by the size of the repository and the speed of the network. NNTP is an older
protocol, with limits on line length and content. Converting binary content to
base64 overcomes these restrictions, but at the cost of increased file size (by one-third) and replication time.
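The one-third figure is easy to verify: base64 maps every 3 input bytes to 4 output characters, so any binary payload grows by a factor of 4/3. A quick illustration of ours using Python's standard base64 module:

```python
import base64

payload = bytes(range(256)) * 48        # 12,288 bytes of "binary" content
encoded = base64.b64encode(payload)

print(len(payload), len(encoded))       # 12288 16384
print(len(encoded) / len(payload))      # exactly 4/3: one-third overhead
```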
Fig. 3. NNTP Timeline for Sender & Receiver Policies
Archiving Using SMTP
One major difference in using email as the archiving target instead of news
is that it is passive, not active: the email process relies on existing traffic between the archiving site and one or more target destination sites. The prototype
is able to attach files automatically with only a small processing-delay penalty.
Processing options include selecting only every E th email, a factor we call “granularity” [15]; selecting records randomly rather than in a specific order; and/or maintaining replication lists for each destination site. Completing
a baseline using email is subject to the same constraints as news - repository
size, number of records, etc. - but is particularly sensitive to changes in email
volume. For example, holidays are often used for administrative tasks since they
are typically “slow” periods, but there is little email generated during holidays
so repository replication would be slowed rather than accelerated. However, the
large number of unique destination hosts means that email is well adapted to
repository discovery through advertising.
In addition to an instrumented prototype, we simulated a repository profile
similar to some of the largest publicly harvestable OAI-PMH repositories. The
simulation assumed a 100 gigabyte repository with 100,000 items (R = 100000,
Rs = 1M B); a low-end bandwidth of 1.5 megabits per second; an average
daily update rate of 0.4% (Ru = 400); an average daily new-content rate of 0.1%
(Ra = 100); and a news-server posting life (Nttl ) of 30 days. For simulating email
replication, our estimates were based on the results of our email experiments:
Granularity G = 1; 16,866 emails per day; and the power-law factor applied to
the ranks of receiving hosts. We ran the NNTP and SMTP simulations for the
equivalent of 2000 days (5.5 years).
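For intuition on the simulated scale, a back-of-envelope estimate (ours, not a figure from the paper) of the time to push one full baseline over the low-end link, including the base64 inflation, shows the 30-day posting life comfortably accommodates a baseline:

```python
# Rough estimate of T_news for the simulated profile, assuming the link is
# fully utilised and base64 inflates the payload by one-third.
repo_bytes = 100_000 * 1_000_000      # R = 100,000 records, Rs = 1 MB each
bandwidth_bps = 1_500_000             # low-end 1.5 megabits per second
wire_bits = repo_bytes * 8 * 4 / 3    # base64 overhead
baseline_days = wire_bits / bandwidth_bps / 86_400
print(f"{baseline_days:.1f}")         # about 8.2 days, well under Nttl = 30
```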
Policy Impact on NNTP-Based Archiving
News-based archiving is constrained primarily by the receiving news server and
network capacity. If the lifetime of a posting (Nttl ) is shorter than the archiving
time of the repository (Tnews ), then a repository cannot be successfully archived
to that server. Figure 4 illustrates different repository archiving policies, where S
ranges from 0 (cyclic baseline) to infinity (single baseline). The “Cyclic Baseline
with Updates” in Figure 4 graphs a sender policy covering a 6-week period: The
entire repository is archived twice, followed by updates only, then the cycle is
repeated. This results in the news server holding between one and two full copies
of the repository, at least for the first few years.
Fig. 4. Effect of Sender Policies on News-Method Archiving
The third approach, where the
policy is to make a single baseline copy and follow up with only updates and
additions, results in a rapidly declining archive content over time, with only
small updates existing on the server. Clearly, as a repository grows while
other factors such as the news posting lifetime remain constant, the archive eventually
contains less than 100% of the library’s content, even with a policy of continuous
updates. Nonetheless, a significant portion of the repository remains archived
for many years if some level of negotiated baseline archiving is established. As
derived in [15], the probability of a given repository record r being currently
replicated on a specific news server N on day D is:
P(r) = ((ρnews × D) − ρnews × (D − Nttl)) / (R + (D × Ra))
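The equation above can be read as "records still within the retention window over total records": the numerator collapses to ρnews × Nttl. A small sketch of ours (the posting rate of 3,000 records/day is hypothetical; R, Ra, and Nttl follow the simulated profile):

```python
def p_news(rho_news, day, n_ttl, R, Ra):
    """Probability a record is currently replicated on the news server."""
    # numerator: posts still alive = rho_news*D - rho_news*(D - Nttl)
    alive = rho_news * day - rho_news * max(day - n_ttl, 0)
    return min(alive / (R + day * Ra), 1.0)

# Hypothetical posting rate against the simulated repository, one year in
print(p_news(rho_news=3000, day=365, n_ttl=30, R=100_000, Ra=100))  # ≈ 0.66
```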
Policy Impact on SMTP-Based Archiving
SMTP-based replication is obviously constrained by the frequency of outbound
emails. Consider the following two sender policies: The first policy maintains
just one queue where items of the repository are being attached to every E th
email regardless of the receiver domain. In the second policy, we have more than
one queue where we keep a pointer for every receiver domain and attach items
to every E th email going out to these particular domains. The second policy
will allow the receiving domain to converge on 100% coverage much faster, since
accidental duplicates will not be sent (which does happen with the first policy).
However, this efficiency comes at the expense of the sending repository tracking
separate queues for each receiving domain.
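The two policies can be contrasted with a toy sketch (the function names and data shapes are ours, not the paper's): a single global queue can resend a domain the same record before it has seen the whole repository, while per-domain pointers walk each domain through the repository without duplicates.

```python
# Sketch of the two sender policies for attaching records to outbound email.
from itertools import cycle

def single_queue(records, outbound):
    """One global queue: every outbound email gets the next record."""
    q = cycle(records)
    return [(dom, next(q)) for dom in outbound]

def per_domain(records, outbound):
    """Per-domain pointers: each domain walks the repository in order."""
    ptr, sent = {}, []
    for dom in outbound:
        i = ptr.get(dom, 0)
        if i < len(records):
            sent.append((dom, records[i]))
            ptr[dom] = i + 1
    return sent

mail = ["a.edu", "b.org", "a.edu", "a.edu", "b.org"]
recs = ["r1", "r2", "r3"]
print(single_queue(recs, mail))  # a.edu gets r1 twice before ever seeing r2
print(per_domain(recs, mail))    # no duplicates: each domain converges faster
```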
Because email volume follows a power law distribution, receiver domains
ranked 2 and 3 achieve 100% repository coverage fairly soon but Rank 20 takes
significantly longer (2000 days with a pointer), reaching only 60% if no pointer is
maintained. Figure 5(a) shows the time it takes for a domain to receive all files of
a repository without the receiver pointer, and Figure 5(b) shows the same
setup with the receiver pointer. In both graphs, the first-ranked receiver domains
are left out because they represent internal email traffic. Figure 5 shows how important record history is to achieving repository coverage using email. If a record
history is not maintained, then the domain may receive duplicate records before
a full baseline has been completed, since there is a decreasing statistical likelihood of a new record being selected from the remaining records as the process
progresses. Thus, the number of records replicated per day via email, ρemail, is
a function of the receiver’s rank (κ), the granularity (G), and a probability factor
based on use of a history pointer (h). That is, ρemail = c × κ^(−1.6) × G × h. If a pointer
is maintained then h = 1; and if every outbound email to the domain is used,
then G = 1 as well. The probability that a given record r has been replicated
via email is therefore:
P(r) = (ρemail × D) / (R + (D × Ra))
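A sketch of ours combining the two formulas, using c = 7378 from the power-law fit and the simulated repository profile; the rank values are illustrative:

```python
# Records per day piggybacked to the rank-kappa domain, and the resulting
# coverage probability on day D.
def rho_email(c, kappa, G=1.0, h=1.0):
    return c * kappa ** -1.6 * G * h

def p_email(c, kappa, day, R, Ra, G=1.0, h=1.0):
    return min(rho_email(c, kappa, G, h) * day / (R + day * Ra), 1.0)

# A rank-2 domain vs a rank-20 domain after one year (c = 7378):
print(p_email(7378, 2, 365, 100_000, 100))   # rank 2 already at full coverage
print(p_email(7378, 20, 365, 100_000, 100))  # rank 20 still far from complete
```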
How would these approaches work with other repository scenarios? If the archive
were substantially smaller (10,000 records with a total size of 15 GB), the time
to upload a complete baseline would also be proportionately smaller since replication time is linear with respect to the repository’s size for both the news and
email methods of archiving. The news approach actively iterates through the
repository, creating its own news posts, and is therefore constrained primarily
by bandwidth to the news server. Email, on the other hand, passively waits
for existing email traffic and then “hitches a ride” to the destination host. The
SMTP approach is dependent on the site’s daily email traffic to the host, and a
reduction in the number of records has a bigger impact if the repository uses the
email solution because fewer emails will be needed to replicate the repository.
A repository consisting of a single record (e.g., an OAI-PMH “Identify” response) could be effectively used to advertise the existence of the repository
regardless of the archiving approach or policies. After the repository was discovered, it could be harvested via normal means. A simple “Identify” record (in
OAI-PMH terms) is very small (a few kilobytes) and would successfully publish the repository’s existence in almost zero time regardless of the archiving
approach that was used.
(a) Without Record History
(b) With Record History
Fig. 5. Time To Receive 100% Repository Coverage by Domain Rank
Future Work and Conclusions
Through prototypes and simulation, we have studied the feasibility of replicating
repository contents using the installed NNTP and SMTP infrastructure. Our
initial results are promising and suggest areas for future study. In particular,
we must explore the trade-off between implementation simplicity and increased
repository coverage. For the SMTP approach, this could involve the receiving email
domains informing the sender (via email) that they are receiving and processing
attachments. This would allow the sender to adjust its policies to favor those
sites. For NNTP, we would like to test varying the sending policies over time as
well as dynamically altering the time between baseline harvests and the transmission
of updates and additions. Furthermore, we plan to revisit the structure of the
objects that are transmitted, including taking advantage of the evolving research
in preparing complex digital objects for preservation [16],[17].
It is unlikely that a single, superior method for digital preservation will
emerge. Several concurrent, low-cost approaches are more likely to increase the
chances of preserving content into the future. We believe the piggyback methods
we have explored here can be either a simple approach to preservation or a
complement to existing methods such as LOCKSS, especially for content unencumbered by restrictive intellectual property rights. Even if NNTP and SMTP
are not used for resource transport, they can be effectively used for repository
awareness. We have not explored what the receiving sites do with the content
once it has been received. In most cases, it is presumably unpacked from its
NNTP or SMTP representation and ingested into a local repository. On the
other hand, sites with apparently infinite storage capacity such as Google Groups
could function as long-term archives for the encoded repository contents.
Acknowledgements
This work was supported by NSF Grant ISS 0455997. B. Danette Allen contributed to the numerical analysis.
References
1. Brian Kantor and Phil Lapsley. Network news transfer protocol, Internet RFC-977,
February 1986.
2. Jonathan B. Postel. Simple mail transfer protocol, Internet RFC-821, August 1982.
3. 20 year archive on Google Groups. archive announce 20.html.
4. Andrew V. Goldberg and Peter N. Yianilos. Towards an archival intermemory. In
Proceedings of IEEE Advances in Digital Libraries (ADL ’98), pages 147–156, April
1998.
5. Ian Clarke, Oskar Sandberg, Brandon Wiley, and Theodore W. Hong. Freenet: a
distributed anonymous information storage and retrieval system. In International
Workshop on Design Issues in Anonymity and Unobservability, LNCS 2009.
6. Roger Dingledine, Michael J. Freedman, and David Molnar. The Free Haven
project: distributed anonymous storage service. Lecture Notes in Computer Science, 2009:67–95, 2001.
7. Petros Maniatis, Mema Roussopoulos, T.J. Giuli, David S. H. Rosenthal, and Mary
Baker. The LOCKSS peer-to-peer digital preservation system. ACM Transactions
on Computer Systems, 23:2–50, February 2005.
8. Andrew Tridgell and Paul Mackerras. The rsync algorithm. Technical report, The
Australian National University, 1996.
9. Brian F. Cooper and Hector Garcia-Molina. Peer-to-peer data trading to preserve
information. ACM Transactions on Information Systems, 20(2):133–170, 2002.
10. Adam Back. The eternity service. Phrack Magazine, 7(51), 1997.
11. Ross J. Anderson. The eternity service. In 1st International Conference on the
Theory and Applications of Cryptology (Pragocrypt ’96), pages 242–252, 1996.
12. Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, and Terry L. Harrison. mod_oai: An Apache module for metadata harvesting. Technical report, Old
Dominion University, 2005. arXiv cs.DL/0503069.
13. Carl Lagoze, Herbert Van de Sompel, Michael L. Nelson, and Simeon Warner.
The Open Archives Initiative Protocol for Metadata Harvesting.
14. Jeroen Bekaert, Patrick Hochstenbach, and Herbert Van de Sompel. Using
MPEG-21 DIDL to represent complex digital objects in the Los Alamos National
Laboratory digital library. D-Lib Magazine, 9(11), November 2003.
15. Joan A. Smith, Martin Klein, and Michael L. Nelson. Repository replication using NNTP and SMTP. Technical report, Old Dominion University, 2006. arXiv
16. Jeroen Bekaert, Xiaoming Liu, and Herbert Van de Sompel. Representing digital
assets for long-term preservation using MPEG-21 DID. In Ensuring Long-term
Preservation and Adding Value to Scientific and Technical data (PV 2005), 2005.
arXiv cs.DL/0509084.
17. Herbert Van de Sompel, Michael L. Nelson, Carl Lagoze, and Simeon Warner.
Resource harvesting within the OAI-PMH framework. D-Lib Magazine, 10(12),
December 2004. doi:10.1045/december2004-vandesompel.
Genre Classification in Automated Ingest and
Appraisal Metadata
Yunhyong Kim and Seamus Ross
Digital Curation Centre (DCC)
Humanities Advanced Technology and Information Institute (HATII)
University of Glasgow
Glasgow, UK
Abstract. Metadata creation is a crucial aspect of the ingest of digital materials into digital libraries. The metadata needed to document and
manage digital materials are extensive, and creating them manually is expensive. The Digital Curation Centre (DCC) has undertaken research
to automate this process for some classes of digital material. We have
segmented the problem and this paper discusses results in genre classification as a first step toward automating metadata extraction from
documents. Here we propose a classification method built on looking at
documents from five directions: as an object exhibiting a specific visual format; as a linear layout of strings with a characteristic grammar;
as an object with stylometric signatures; as an object with intended
meaning and purpose; and as an object linked to previously classified
objects and other external sources. The results of some experiments
relating to the first two directions are described here; they are meant to
be indicative of the promise underlying this multi-faceted approach.
Background and Objective
Construction of persistent, cost-contained, manageable and accessible digital collections depends on the automation of appraisal, selection, and ingest of digital
material. Descriptive, administrative, and technical metadata play a key role in
the management of digital collections ([37],[21]). As DELOS/NSF ([13],[14],[21])
and PREMIS ([34]) working groups noted, metadata are expensive to create and
maintain. Digital objects are not always accompanied by adequate metadata, and
the number and variety of digital objects being created are increasing at an exponential rate. As a result, the manual collection of metadata
cannot keep pace with the number of digital objects that need to be documented. It seems reasonable to conclude that automatic extraction of metadata
would be an invaluable step in the automation of appraisal, selection, and ingest
of digital material. ERPANET’s ([17]) Packaged Object Ingest Project ([18])
identified only a limited number of automatic extraction tools, mostly geared
to extracting technical metadata (e.g. [29],[31]), illustrating the intensive manual
labour required to ingest digital material into a repository. Subsequently
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 63–74, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Y. Kim and S. Ross
substantial work on descriptive metadata extraction has emerged: extraction from structured documents has been attempted by MetadataExtractor
from the University of Waterloo ([27]), the Dublin Core Metadata Editor ([11]), and Automatic Metadata Generation (AMG) at the Catholic University of Leuven ([2]);
and the extraction of bibliographic information from medical articles, based on
the detection of contiguous blocks and fuzzy pattern matching, is available in the
Medical Article Record System (MARS) ([42]) developed at the US National Library of Medicine (NLM) ([30]). There has also been previous work on metadata
extraction from scientific articles in postscript using a knowledge base of stylistic
cues ([19],[20]) and, from the language processing community, there have been
results in automatic categorisation of emails ([6],[24]), text categorisation ([39])
and document content summarisation ([43]). Other communities have used image analysis for information extraction from the Internet ([3]), document white
space analysis ([9]), graphics recognition in PDF files ([41]), and algorithms for
page segmentation ([40]). Despite the wealth of research being conducted, no
general tool has yet been developed which can be employed to extract metadata from digital objects of varied types and genres, nor are there dependable
extraction tools for the extraction of deeper semantic metadata such as content
summary. The research in this paper is motivated by an effort to address this
problem by integrating the methods available in the area to create a prototype
tool for automatically extracting metadata across many domains at different
semantic levels. This would involve:
– constructing a well-structured experimental corpus of one file type (for use
in this and future related research);
– summarising and integrating existing research related to automatic metadata
extraction;
– determining the limit and scope of metadata that can be extracted, and building a prototype descriptive and semantic metadata extraction tool applicable
across many domains;
– extending the tool to cover other file types and metadata; and,
– integrating it with other tools to enable automatic ingest, selection, and/or
appraisal.
The initial prototype is intended to extract Genre, Author, Title, Date, Identifier, Pagination, Size, Language, Keywords, Composition (e.g. existence and
proportion of images, text and links) and Content Summary. In the present
paper, we discuss genre classification of digital documents represented in PDF
([32]) as a step towards acquiring the appropriate metadata. The term genre
does not always carry a clear meaning. We follow the definition of Kessler ([25])
who refers to genre as “any widely recognised class of texts defined by some
common communicative purpose or other functional traits, provided the function is connected to some formal cues or commonalities and that the class is
extensible”. For instance, a scientific research article is a theoretical argument
or communication of results relating to a scientific subject usually published in
a journal, often starting with a title, followed by author, abstract, and body
of text, and finally ending with a bibliography. One important aspect of genre classification is that it is distinct from subject classification, which can coincide over
many genres (e.g. a mathematical paper on number theory versus a news article
on the proof of Fermat’s Last Theorem). The motivation for starting with genre
classification is as follows:
– Identifying the genre first will limit the scope of document forms from which
to extract other metadata:
• The search space for further metadata will be reduced; within a single genre, metadata such as author, keywords, identification numbers or
references can be expected to appear in a specific style and region.
• A lot of independent work exists for extraction of metadata within a
specific genre which can be combined with a general genre classifier for
metadata extraction over many domains (e.g. the papers listed at the
beginning of this section).
• The resources available for extracting further metadata are different for each
genre; for instance, research articles, unlike newspaper articles, come with
a list of reference articles closely related to the original article, leading to
better subject classification.
– Scoping new genres not apparent in the context of conventional libraries becomes possible.
– Different institutional collecting policies might focus on digital materials in
different genres. Genre classification will support automating the identification, selection, and acquisition of materials in keeping with local collecting policies.
We have opted to consider 60 genres (Table 1). This list is not meant to represent
a complete spectrum of possible genres; it is meant to be a starting point from
which to determine what is possible.
We have focused our attention on different genres represented in PDF files.
By limiting the research to one file type we hoped to put a boundary on the
problem space. The choice of PDF as the format stems from the fact that
– PDF is a widely used format. Specifically, PDF is a common format for
digital objects ingested into digital libraries including eprint services.
– It is a portable format, distributed over many different platforms.
– There are many tools available for conversion to and from other formats.
– It is a versatile format which includes objects of different type (e.g. images,
text, links) and different genres (e.g. data structure, fiction, poetry, research
In the experiment which follows we worked with a developmental data set collected via the Internet using a random PDF-grabber which
1. selects a random word from a Spell Checker Oriented Word List (from sourceforge.net),
2. searches the Internet using Google for PDF files containing the chosen word,
Table 1. Scope of genres
Academic book, Fiction (book), Poetry (book), Other book
Scientific research article, Other research article, Magazine article, News report
Periodicals: Periodicals, Newsletter
Email, Letter
Thesis, Business/Operational report, Technical report, Misc report
Calendar, Menu, Other table
Grant/Project proposal, Legal appeal/proposal/order
Description: Job/Course/Project description, Product/Application description
Minutes, Proceedings
Instruction/Guideline, Regulations
Other: Abstract, Advertisement, Announcement, Appeal/Propaganda, Biography, Chart/Graph, Contract, Drama, Essay, Exam/Worksheet, Fact sheet, Fiction piece, Forms, Forum discussion, Image, Interview, Lecture notes/presentation, Speech transcript, Manual, Memo, Sheet music, Notice, Posters, Programme, Questionnaire, Q & A, Resume/CV, Review, Slides, Poetry piece, Other genre not listed
3. selects a random PDF file from the returned list and places it in a designated collection.
We collected over 4000 documents in this manner. Labelling of this document
corpus is still in progress (for genre classification) and is mostly being carried
out by one of the authors. Currently 570 are labelled with one of the 60 genres.
A significant amount of disagreement is expected in labelling genre even between
human labellers; we intend to cross check the labelled data in two ways:
– We will employ others to label the data to determine the level of disagreement
between different human labellers; this will enable us to analyse at what level
of accuracy the automated system should be expected to perform, while also
providing us with a gauge of the difficulty of labelling individual documents.
– We will gather PDF files which have already been classified into genres as a
fresh test data for the classifier; this will also serve as a means of indexing
the performance on well-designed classification standards.
Along with the theoretical work of Biber ([7]) on genre structures, there have
been a number of studies in automatic genre classification: e.g. Karlgren and
Cutting ([23], distinguishing Press, Misc, Non-fiction and Fiction), Kessler et
al. ([25], distinguishing Reportage, Fiction, Scitech, Non-fiction, Editorial and
Legal; they also attempt to detect the level of readership - which is referred
to as Brow - divided into four levels, and make a decision on whether or not
the text is a narrative), Santini ([38], distinguishing Conversation, Interview,
Public Debate, Planned Speech, Academic prose, Advert, Biography, Instruction,
Popular Lore and Reportage), and Bagdanov and Worring ([4], fine-grained
genre classification using first order random graphs modeled on trade journals
and brochures found in the Océ Competitive Business Archive) not to mention
a recent MSc. dissertation written by Boese ([8], distinguishing ten genres of
web documents). There are also related studies in detecting document logical
structures ([1]) and clustering documents ([5]). Previous methods can be divided
into groups which look at one or more of the following:
– Document image analysis
– Syntactic feature analysis
– Stylistic feature analysis
– Semantic structure analysis
– Domain knowledge analysis
We would eventually like to build a tool which looks at all of these for the
60 genres mentioned (see Table 1). The experiments in this paper however are
limited to looking at the first two aspects for seven genres. Considering only seven
genres out of 60 is a significant cutback, but the fact that none of the studies
known to us has combined the first two aspects for genre classification, and
that very few studies have looked at the task in the context of PDF files, makes the
experiments valuable as a report on the first steps to a general process. This
paper is not meant to be a conclusive report, but the preliminary findings of an
ongoing project and is meant to show the promise of combining very different
classifying methods in identifying the genre of a digital document. It is also
meant to emphasise the importance of looking at information extraction across
genres; genre-specific information extraction methods usually depend heavily
on the structures held in common by the documents in the chosen domain; by
looking at differences between genres we can determine the variety of structures
one might have to resolve in the construction of a general tool.
The experiments described in this paper require the implementation of two
classifiers:
Image classifier: this classifier depends on features extracted from the PDF
document when handled as an image.
– It uses the pdftoppm module from XPDF to extract the first page of the
document as an image, then employs the Python Imaging Library (PIL) ([35],
[33]) to extract pixel values. The image is then sectioned into ten regions for
an examination of the number of non-white pixels. Each region is rated
as level 0, 1, 2, 3 (larger number indicating a higher density of non-white
space). The result is statistically modelled using the Maximum Entropy
principle. The tool used for the modelling is MaxEnt for C++ developed
by Zhang Le ([26]).
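The region-density extraction described above can be sketched as follows. This is only an illustrative reading of the description: the horizontal banding, the white threshold of 250, the quantisation rule and the function name are our assumptions, not details given in the paper, and the subsequent MaxEnt modelling step is omitted.

```python
def page_density_features(pixels, width, height, regions=10, levels=4,
                          white_threshold=250):
    """Rate each of `regions` horizontal bands of a page image with a
    density level 0..levels-1 (a higher level = more non-white pixels).
    `pixels` is a flat list of greyscale values (0 = black, 255 = white),
    e.g. list(img.convert("L").getdata()) for a PIL image of the first
    PDF page rendered with pdftoppm."""
    band_height = height // regions
    features = []
    for r in range(regions):
        band = pixels[r * band_height * width:(r + 1) * band_height * width]
        if not band:
            features.append(0)
            continue
        # fraction of pixels darker than the white threshold
        density = sum(1 for p in band if p < white_threshold) / len(band)
        # quantise the non-white density into one of `levels` levels
        features.append(min(levels - 1, int(density * levels)))
    return features
```

The resulting ten-element feature vector would then be fed to the Maximum Entropy model.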
Y. Kim and S. Ross
Language model classifier: this classifier depends on an N-gram model on
the level of words, Part-of-Speech tags and Partial Parsing tags.
– N-gram models estimate the probability of word w(N) following a string
of words w(1), w(2), ..., w(N-1). A popular model is the case N=3.
This model is usually constructed on the word level. In this research we
would eventually like to make use of the model on the level of Part-of-Speech
(POS) tags (for instance, tags which denote whether a word
is a verb, noun or preposition) or Partial Parsing (PP) tags (e.g. noun
phrases, verb phrases or prepositional phrases). Initially we only work
with the word-level model. This has been modelled with the BOW toolkit
developed by Andrew McCallum ([28]). We used the default Naive Bayes
model without a stoplist.
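The word-level model can be illustrated with a minimal Naive Bayes text classifier. This is a stand-in sketch, not the BOW toolkit itself; the Laplace smoothing scheme and class structure below are our assumptions about a typical default configuration (no stoplist, as in the paper).

```python
import math
from collections import Counter

class NaiveBayesGenre:
    """Minimal word-level Naive Bayes genre classifier (illustrative
    stand-in for the BOW toolkit's default model, without a stoplist)."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)            # class frequencies
        self.counts = {c: Counter() for c in self.classes}
        for text, c in zip(docs, labels):
            self.counts[c].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        self.n = len(labels)
        return self

    def predict(self, text):
        def log_posterior(c):
            s = math.log(self.prior[c] / self.n)
            for w in text.lower().split():
                # Laplace smoothing over the shared vocabulary
                s += math.log((self.counts[c][w] + 1) /
                              (self.total[c] + len(self.vocab)))
            return s
        return max(self.classes, key=log_posterior)
```

Training on per-genre word counts and predicting the class with the highest log posterior mirrors the standard Naive Bayes setup the paper relies on.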
Although the tools for extracting the image and text of the documents used in
these classifiers are specific to PDF files, a comparable representation can be
extracted in other formats by substituting these tools with corresponding tools
for those formats. In the worst-case scenario the process can be approximated
by first converting the format to PDF and then using the same tools; the wide
distribution of PDF ensures the existence of a conversion tool for most common formats.
Using the image of a text document in the classification of the document has
several advantages:
– it will be possible to extract some basic information about documents without accessing content or violating password protection or copyright;
– it is more likely that language modelling tools need not be substituted when moving between languages, i.e. it maximises the possibility of achieving a language-independent tool;
– the classification will not be solely dependent on fussy text processors and language tools (e.g. encoding requirements, problems relating to special characters or line-breaks);
– it can be applied to paper documents digitally imaged (i.e. scanned) for inclusion in digital repositories without relying heavily on the accuracy of character recognition.
Experiment Design
The experiments in this paper are the first steps towards testing the following
hypothesis:
Hypothesis A: Given a collection of digital documents consisting of several different genres, the set of genres can be partitioned into groups such
that the visual characteristics concur and linguistic characteristics differ between documents within a single group, while visual aspects differ
between the documents of two distinct groups.
Genre Classification in Automated Ingest and Appraisal Metadata
An assumption in the two experiments described here is that PDF documents
belong to one of four categories: Business Report, Minutes, Product/Application Description, Scientific Research Article. This, of course, is a false assumption, and
limiting the scope in this way changes the meaning of the resulting statistics
considerably. However, the contention of this paper is that high-level performance on a limited data set, combined with a suitable means of accurately
narrowing down the candidates to be labelled, would achieve the desired end.
Steps for the first experiment
1. take all the PDF documents belonging to the above four genres (70 documents in the current labelled data),
2. randomly select a third of the documents in each genre as training data (27
documents) and the remaining documents as test data (43 documents),
3. train both the image classifier and language model classifier (on the level of
words) on the selected training data,
4. examine result.
Steps for the second experiment
1. using the same training and test data as that for the first experiment,
2. allocate the genres to two groups, each group containing two genres: Group
I contains business reports and minutes while Group II contains scientific
research articles and product descriptions,
3. train the image classifier to differentiate between the two groups and use
this to label the test data as documents of Group I or Group II,
4. train two language model classifiers: Classifier I which distinguishes business
reports from minutes and Classifier II which labels documents as scientific
research articles or product descriptions,
5. take test documents which have been labelled Group I and label them with
Classifier I; take test documents which have been labelled Group II and label
them with Classifier II,
6. examine result.
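Steps 3–5 above amount to a two-stage cascade, which can be sketched generically. The classifier interfaces and the stand-in keyword classifiers below are purely hypothetical illustrations, not the trained models used in the experiment.

```python
def cascade_classify(doc, group_classifier, genre_classifiers):
    """Two-stage labelling: an image-based classifier first assigns the
    document to a group, then that group's language-model classifier
    assigns the final genre label."""
    group = group_classifier(doc)          # e.g. "I" or "II"
    return genre_classifiers[group](doc)   # e.g. Classifier I or II

# Hypothetical stand-in classifiers, purely for illustration:
group_clf = lambda text: "I" if "minutes" in text else "II"
genre_clfs = {
    "I": lambda text: "Minutes" if "agenda" in text else "Business Report",
    "II": lambda text: "Sci. Res. Article" if "abstract" in text
          else "Product Desc.",
}
```

The cascade only consults the language model appropriate to the group chosen by the image classifier, which is what allows the two very different feature sets to be combined.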
The genres to be placed in Group I and Group II were selected by choosing the
partition which showed the highest training accuracy for the image classifier.
In the evaluation of the results to follow we will use three indices which are
considered standard in classification tasks: accuracy, precision and recall. Let N
be the total number of documents in the test data, Nc the number of documents
in the test data which are in class C, T the total number of correctly labelled
documents in the data independent of the class, Tc the number of true positives
for class C (documents correctly labelled as class C), and Fc the number of false
positives for class C (documents labelled incorrectly as class C). Accuracy is
defined to be A = T/N, while precision and recall for each class C are defined to be
Pc = Tc/(Tc + Fc) and Rc = Tc/Nc.
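These definitions translate directly into code; the function names below are ours, but the formulas follow the definitions in the text.

```python
def accuracy(true_labels, predicted):
    """A = T / N: correctly labelled documents over all documents."""
    return sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

def precision_recall(true_labels, predicted, cls):
    """Pc = Tc / (Tc + Fc) and Rc = Tc / Nc for a single class `cls`."""
    tc = sum(1 for t, p in zip(true_labels, predicted) if p == cls and t == cls)
    fc = sum(1 for t, p in zip(true_labels, predicted) if p == cls and t != cls)
    nc = sum(1 for t in true_labels if t == cls)
    precision = tc / (tc + fc) if tc + fc else 0.0
    recall = tc / nc if nc else 0.0
    return precision, recall
```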
The precision and recall for the first and second experiments are given in
Table 2 and Table 3.
Table 2. Result for first small experiment
Overall accuracy (Language model only): 77%
Business Report
Sci. Res. Article
Product Desc.
Table 3. Result for second small experiment
Overall accuracy (image and language model): 87.5%
Business Report
Sci. Res. Article
Product Desc.
Although the performance of the language model classifier given in Table 2
is already surprisingly high, this, to a great extent, depends on the four categories chosen. In fact, when the classifier was expanded to include 40 genres,
it performed at an accuracy of only approximately 10%. When a
different set was employed which included Periodicals, Thesis, Minutes and Instruction/Guideline, the language model performed at an accuracy of 60.34%. It
is clear from these two examples that such high performance cannot be expected
for any collection of genres.
The image classifier on Group I (Periodicals) and Group II (Thesis, Minutes,
Instruction/Guideline) performs at an accuracy of 91.37%. The combination of
the two classifiers has not been tested, but even in the worst-case scenario,
where we assume that the sets of mislabelled documents for the two classifiers
have no intersection, the combined classifier would still show an increase in
overall accuracy of approximately 10%.
The experiments show an increase in the overall accuracy when the language
classifier is combined with the image classifier. To gauge the significance of the
increase, a statistically valid significance test would be required. The experiments
here however are intended not to be conclusive but indicative of the promise
underlying the combined system.
Conclusion and Further Research
Intended Extensions
The experiments show that, although there is a lot of confusion visually and
linguistically over all 60 genres, subgroups of the genres exhibit statistically
well-behaved characteristics. This encourages the search for groups which are
similar or different visually or linguistically, to further test Hypothesis A. To
extend the scenario in the experiment to all the genres, the following steps are
envisaged:
1. randomly select a third of the documents in each genre as training data and
the remaining documents as test data,
2. train the image and language model classifiers on the resulting training data and test over
all genres,
3. try to re-group the genres so that each group contains genres resulting in a high
level of cross-labelling in the previous experiment,
4. re-train and test.
Employment of Further Classifiers
Further improvement can be envisioned by integrating more classifiers into the
decision process. For instance consider the following classifiers.
Extended image classifier: In the experiments described in this paper the
image classifier looked at only the first page of the document. A variation
or extension of this classifier to look at different pages, or several pages, of
the document will be necessary for a complete image analysis. This would,
however, involve several decisions: given that documents
have different lengths, the optimal number of pages to be used needs to be
determined, and we need to examine the best way to combine the information from different pages (e.g. will several pages be considered to be one
image; if not, how will the classifications of individual pages be statistically
combined to give a global classification?).
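One of the options just mentioned, statistically combining per-page classifications, could for instance average per-page log-probabilities. The class names, probabilities and combination rule below are illustrative assumptions, not a method from the paper.

```python
import math
from collections import defaultdict

def combine_page_scores(page_scores):
    """Combine per-page class-probability dicts into one document-level
    label by averaging log-probabilities across pages (an illustrative
    combination rule for the multi-page extension discussed)."""
    totals = defaultdict(float)
    for scores in page_scores:
        for cls, prob in scores.items():
            totals[cls] += math.log(prob)
    n = len(page_scores)
    # averaging by n does not change the argmax, but yields comparable scores
    return max(totals, key=lambda c: totals[c] / n)
```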
Language model classifier on the level of POS and phrases: This is an
N-gram language model built on the part-of-speech tags of the underlying
text of the document, and also on partial chunks resulting from the detection of
phrases.
Stylo-metric classifier: This classifier takes its cue from the positioning of text
and image blocks, font styles, font size, length of the document, and average
sentence and word lengths. This classifier is expected to be useful for
both genre classification (by distinguishing the linguistically similar Thesis and
Scientific Research Article genres by, say, the length of the document) and other
bibliographic data extraction (by detecting which strings are the Title and
Author by font style, size and position).
Semantic classifier: This classifier will combine the extraction of keywords and of subjective or objective noun phrases (e.g. using [36]). This classifier is expected
to play an important role in the summarisation stage, if not already in the
genre classification stage.
Classifier based on external information: When the source information of
the document is available, features such as the name of the journal, the subject
or address of the webpage, and anchor texts can be gathered for statistical
analysis or rule-based classification.
Labelling More Data
To draw any reasonable conclusions from this study, further data needs to be
labelled, both for fresh experiments and to make up for the lack of training data.
Although 60 genres are in play, only 40 genres had more than 3 items in the set,
and only 27 genres had at least 15 items available.
Putting It into Context
Assuming we are able to build a reasonable extractor for genre, we will move on to
implementing the extraction of author, title, date, identifier, keywords, language,
summarisations and other compositional properties within each specific genre.
After this has been accomplished, we should augment the tool to handle subject
classification and to cover other file types.
Once the basic prototype for automatic semantic metadata extraction is tamed
into a reasonable shape, we will pass the prototype to other colleagues in the
Digital Curation Centre ([10]) to be integrated with other tools (e.g. technical
metadata extraction tools) and standardised frameworks (e.g. ingest or preservation model) for the development of a larger scale ingest, selection and appraisal
application. Eventually, we should be able at least to semi-automate essential
processes in this area.
This research is being conducted as part of The Digital Curation Centre’s (DCC)
[10] research programme. The DCC is supported by a grant from the United
Kingdom’s Joint Information Systems Committee (JISC) [22] and the e-Science
Core Programme of the Engineering and Physical Sciences Research Council
(EPSRC) [16]. The EPSRC grant (GR/T07374/01) provides the support for the
research programme. Additional support for this research comes from the DELOS: Network of Excellence on Digital Libraries (G038-507618) funded under
the European Commission’s IST 6th Framework Programme [12]. The authors
would like to thank their DCC colleague Adam Rusbridge whose work on ERPANET’s Packaged Object Ingest Project [18] provided a starting point for
the current project on automated metadata extraction. We are grateful to the
anonymous ECDL reviewers of this paper who provided us with very helpful
comments, which enabled us to improve the paper.
Note on website citations: All citations of websites were validated on 29 May
References
1. Aiello, M., Monz, C., Todoran, L., Worring, M.: Document Understanding for
a Broad Class of Documents. International Journal on Document Analysis and
Recognition 5(1) (2002) 1–16.
2. Automatic Metadata Generation: http://www.cs.kuleuven.ac.be/~hmdb/amg/documentation.php
3. Arens, A., Blaesius, K. H.: Domain Oriented Information Extraction from the Internet. Proceedings of SPIE Document Recognition and Retrieval 2003, Vol. 5010 (2003) 286.
4. Bagdanov, A. D., Worring, M.: Fine-Grained Document Genre Classification Using
First Order Random Graphs. Proceedings of International Conference on Document Analysis and Recognition 2001 (2001) 79.
5. Barbu, E., Heroux, P., Adam, S., Trupin, E.: Clustering Document Images Using
a Bag of Symbols Representation. International Conference on Document Analysis
and Recognition, (2005) 1216–1220.
6. Bekkerman, R., McCallum, A., Huang, G.: Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora. CIIR Technical Report IR-418 (2004).
7. Biber, D.: Dimensions of Register Variation: a Cross-Linguistic Comparison. Cambridge University Press (1995).
8. Boese, E. S.: Stereotyping the web: genre classification of web documents. Master’s
thesis, Colorado State University (2005).
9. Breuel, T. M.: An Algorithm for Finding Maximal Whitespace Rectangles at Arbitrary Orientations for Document Layout Analysis. 7th International Conference
for Document Analysis and Recognition (ICDAR), 66–70 (2003).
10. Digital Curation Centre: http://www.dcc.ac.uk
11. DC-dot, Dublin Core metadata editor: http://www.ukoln.ac.uk/metadata/dcdot/
12. DELOS Network of Excellence on Digital Libraries: http://www.delos.info/
13. NSF International Projects: http://www.dli2.nsf.gov/intl.html
14. DELOS/NSF Working Groups: Reference Models for Digital Libraries: Actors and Roles (2003) http://www.dli2.nsf.gov/internationalprojects/working_group_reports/actors_final_report.html
15. Dublin Core Initiative: http://dublincore.org/tools/#automaticextraction
16. Engineering and Physical Sciences Research Council: http://www.epsrc.ac.uk/
17. Electronic Resources Preservation Access Network (ERPANET): http://
18. ERPANET: Packaged Object Ingest Project. http://www.erpanet.org/events/2003/rome/presentations/ross_rusbridge_pres.pdf
19. Giuffrida, G., Shek, E., Yang, J.: Knowledge-based Metadata Extraction from PostScript Files. Proc. 5th ACM Intl. Conf. on Digital Libraries (2000) 77–84.
20. Han, H., Giles, L., Manavoglu, E., Zha, H., Zhang, Z., Fox, E. A.: Automatic Document Metadata Extraction using Support Vector Machines. Proc. 3rd ACM/IEEE-CS Conf. on Digital Libraries (2000) 37–48.
21. Hedstrom, M., Ross, S., Ashley, K., Christensen-Dalsgaard, B., Duff, W.,
Gladney, H., Huc, C., Kenney, A. R., Moore, R., Neuhold, E.: Invest to Save:
Report and Recommendations of the NSF-DELOS Working Group on Digital
Archiving and Preservation. Report of the European Union DELOS and US
National Science Foundation Workgroup on Digital Preservation and Archiving
22. Joint Information Systems Committee: http://www.jisc.ac.uk/
23. Karlgren, J. and Cutting, D.: Recognizing Text Genres with Simple Metrics using Discriminant Analysis. Proc. 15th Conf. on Computational Linguistics, Vol. 2 (1994) 1071–1075.
24. Ke, S. W., Bowerman, C., Oakes, M.: PERC: A Personal Email Classifier. Proceedings of the 28th European Conference on Information Retrieval (ECIR 2006) 460–463.
25. Kessler, B., Nunberg, G., Schuetze, H.: Automatic Detection of Text Genre. Proc.
35th Ann. Meeting ACL (1997) 32–38.
26. Zhang, L.: Maximum Entropy Toolkit for Python and C++. LGPL license, http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
27. MetadataExtractor: http://pami-xeon.uwaterloo.ca/TextMiner/ MetadataExtractor.aspx
28. McCallum, A.: Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering (1998). http://www.cs.cmu.edu/~mccallum/bow/
29. National Archives UK: DROID (Digital Object Identification). http://www.nationalarchives.gov.uk/aboutapps/pronom/droid.htm
30. National Library of Medicine US: http://www.nlm.nih.gov/
31. National Library of New Zealand: Metadata Extraction Tool. http://www. natlib.
32. Adobe Acrobat PDF specification: http://partners.adobe.com/public/developer/pdf/index_reference.html
33. Python Imaging Library: http://www.pythonware.com/products/pil/
34. PREMIS (PREservation Metadata: Implementation Strategy) Working Group:
35. Python: http://www.python.org
36. Riloff, E., Wiebe, J., and Wilson, T.: Learning Subjective Nouns using Extraction
Pattern Bootstrapping. Proc. 7th CoNLL, (2003) 25–32.
37. Ross, S. and Hedstrom, M.: Preservation Research and Sustainable Digital Libraries. International Journal on Digital Libraries (Springer) (2005) DOI: 10.1007/s00799-004-0099-3.
38. Santini, M.: A Shallow Approach To Syntactic Feature Extraction For Genre Classification. Proceedings of the 7th Annual Colloquium for the UK Special Interest
Group for Computational Linguistics (CLUK 04) (2004).
39. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34 (2002) 1–47.
40. Shafait, F., Keysers, D., Breuel, T. M.: Performance Comparison of Six Algorithms for Page Segmentation. 7th IAPR Workshop on Document Analysis Systems (DAS) (2006) 368–379.
41. Shao, M. and Futrelle, R.: Graphics Recognition in PDF Documents. Sixth IAPR International Workshop on Graphics Recognition (GREC 2005) 218–227.
42. Thoma, G.: Automating the Production of Bibliographic Records. R&D report of the Communications Engineering Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine (2001).
43. Witte, R., Krestel, R. and Bergler, S.: ERSS 2005: Coreference-based Summarization Reloaded. DUC 2005 Document Understanding Workshop, Canada.
The Use of Summaries in XML Retrieval
Zoltán Szlávik, Anastasios Tombros, and Mounia Lalmas
Department of Computer Science,
Queen Mary University of London
Abstract. The availability of the logical structure of documents in
content-oriented XML retrieval can be beneficial for users of XML retrieval systems. However, research into structured document retrieval has
so far not systematically examined how structure can be used to facilitate
the search process of users. We investigate how users of an XML retrieval
system can be supported in their search process, if at all, through summarisation. To answer this question, an interactive information retrieval
system was developed and a study using human searchers was conducted.
The results show that searchers actively utilise the provided summaries,
and that summary usage varied at different levels of the XML document
structure. The results have implications for the design of interactive XML
retrieval systems.
As the eXtensible Markup Language (XML) is becoming increasingly used in
digital libraries (DL), retrieval engines that allow search within collections of
XML documents are being developed. In addition to textual information, XML
documents provide a markup that allows the representation of the logical structure of XML documents in content-oriented retrieval. The logical units, called
elements, are encoded in a tree-like structure by XML tags. The logical structure
allows DL systems to return document portions that may be more relevant to
the user than the whole document, e.g. if a searcher wants to read about how
Romeo and Juliet met, we do not return the whole play but the actual scene
about the meeting. This content-oriented retrieval has received considerable interest
over the last few years, mainly through the INEX initiative [6].
As the number of XML elements is typically large (much larger than that
of documents), we believe it is essential to provide users of XML information
retrieval systems with overviews of the contents of the retrieved elements. One
approach is to use summarisation, which has been shown to be useful in interactive information retrieval (IIR) [9,7,15].
In this paper, we investigate the use of summarisation in XML retrieval in
an interactive environment. In interactive XML retrieval, a summary can be
associated with each document element returned by the XML retrieval system.
Because of the nature of XML documents, users can, in addition to accessing
any retrieved element, browse within the document containing that element. One
method to allow browsing XML documents is to display the logical structure
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 75–86, 2006.
© Springer-Verlag Berlin Heidelberg 2006
of the document containing the retrieved elements [13]. This has the benefit
of providing (sometimes necessary) context to users when reading an element.
Therefore, summaries can also be associated with the other elements of the
document, in addition to the returned elements themselves.
The aim of our work is to investigate how users of an XML retrieval system
can be supported in their search process, if at all, through summarisation. To
answer this question, an interactive information retrieval system was developed
and a study using human searchers was conducted.
The paper is organised as follows. In Section 2 we present the background of
our work, then we describe the experimental system and methodology that was
used in Section 3. The analysis of our data is described in Section 4, which is
followed by the conclusions and future work.
In recent years, interactive aspects of the IR process have been extensively investigated. Major advances have been made by co-ordinated efforts in the interactive track at TREC. These efforts have been in the context of unstructured
documents (e.g. news articles) or in the context of the loosely-defined structure
encountered in web pages. XML documents, on the other hand, define a different
context, by offering the possibility of navigating within the structure of a single
document, or following links to another document part.
The interactive aspect of XML IR has recently been investigated through the
interactive track at INEX (iTrack) [13,10,8]. A major result from iTrack 2004
was that searchers did not interact enough with the elements of retrieved XML
documents [14]. Searchers seemed to appreciate the logical structure of XML
documents as a means of providing context for identifying interesting XML elements within a document, but they did not browse much within XML documents.
Tombros et al. suggest that this behaviour may have been due to limitations of
the interactive XML IR system used. Among these limitations was that XML element (or document) summarisation capabilities were limited, and therefore searchers
did not have enough relevance clues to decide which elements to visit [14]. In this
paper, we focus on the presentation of the document structure as a hierarchical
table of contents, and on the use of summarisation to facilitate the users’ search process.
Text summarisation has attracted attention primarily after the information
explosion on the Internet; however, significant work was done as early as the
1950’s and 1960’s. Edmundson proposed extraction methods considering various
sentence features, e.g. location, title words [5]. In recent summarisation systems,
users’ query terms are also considered in generating summaries [15]. Few researchers have recently investigated the summarisation of information available
in XML format (e.g. [1,2]). In our work, we considered a simple summarisation
algorithm that takes advantage of sentence location and the query (referred
to as query-biased), as our main aim is to study how users “interact” with summaries.
The use of summaries in interactive IR has been shown to be useful for various information seeking tasks in a number of environments such as the web
(e.g. [16,4]). However, in the context of interactive XML retrieval, summarisation has not yet been investigated extensively. Our main focus in this paper
is to study how searchers behave in an environment that provides them with
structural documents, and how they use summaries of document elements that
are presented to them. To do so, we created and tested, through user-based
studies, an interactive XML retrieval system with XML element summarisation
capabilities. We describe the system and the setup of our study in the next section.
Experimental Setup
In this section, we describe the system and method used in our study. We
include only the details necessary for the presentation of the analysis and
results reported in this paper. A more detailed description can be found in [12].
User Interface. The user interface is a web based system which passes the query
to a retrieval module, processes and displays the retrieved list of elements and
shows each of these elements. The system allows users to enter a search query
and start the retrieval process by clicking on the search button. The display of
the list of retrieved elements is similar to standard search interfaces (Figure 1).
Fig. 1. The list of the result elements
Once searchers follow the link to a particular result element, the element is
displayed in a new window (Figure 2). The frame on the right shows the content
of the target element. The structure is displayed on the left as an automatically
generated table of contents (ToC) where each structural item is a hyperlink that
will show the corresponding XML element in the right window when clicked. For
this user study, four levels of structural items were displayed. Level one always
refers to the whole article; level two contains the body, front and backmatters;
level three usually contains the abstract, sections and appendices; and level four
usually means subsections or paragraphs, depending on the inner structure of
articles. The number of levels could be changed by searchers. For each item in
the ToC, summaries were generated and displayed as ‘tool tips’, i.e. when users
moved the mouse pointer over an item in the ToC, the summary of the target
element was shown. Query terms in the summaries, as well as in the displayed
documents, were highlighted.
Fig. 2. On the left, the structure of the XML document with a summary; on the right,
the content of a section element displayed
Summarisation. Summaries were displayed in the result list view for each result
element and for the displayed elements in the ToC in element view. Since our
aim at this stage of the research was not to develop sophisticated summarisation
methods, but to investigate summary usage in XML retrieval, we implemented
and used a simple query-biased algorithm. Four sentences with the highest scores
were presented as extracts of the source XML elements, in order of appearance
in the source element (for further details, see [12]).
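A query-biased extractor of this kind can be sketched as follows. The scoring function (query-term overlap plus a small location bonus) is our illustrative assumption, not the authors' exact formula; only the "top-k sentences, returned in order of appearance" behaviour is taken from the text.

```python
import re

def query_biased_summary(text, query, k=4):
    """Select the k highest-scoring sentences as an extract, returned in
    order of appearance in the source element. Scoring here is a simple
    assumption: query-term overlap plus a bonus for earlier sentences."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    qterms = set(query.lower().split())
    def score(i, s):
        words = set(re.findall(r'\w+', s.lower()))
        location_bonus = 1.0 / (i + 1)          # earlier sentences favoured
        return len(words & qterms) + location_bonus
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(i, sentences[i]), reverse=True)
    top = sorted(ranked[:k])                     # restore document order
    return [sentences[i] for i in top]
```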
Document Collection. The document collection we used was the IEEE collection
(version 1.4) which contains 12,107 articles, marked up in XML, of the IEEE
Computer Society’s publications from 12 magazines and 6 transactions, covering
the period of 1995-2002. On average, an article contains 1532 XML nodes and
the average depth of a node is 6.9. These properties provided us with a suitably
large collection of articles of varying depth of logical structure.
XML Retrieval Engine. The retrieval was based on the HySpirit retrieval framework [11]. To be able to examine the relation between the structural display and
the use of summaries, only paragraphs were returned as retrieval results. This
strategy ensured that elements deeply nested in a document logical structure
were returned, so as to “force” searchers to browse through the structural display on the left panel of Figure 2 (instead of simply scrolling down the right panel).
Searchers. Twelve searchers (9 males and 3 females) were recruited for this study.
All of them had a computer science background, as the collection used contained
articles from the field of computer science.
Experimental and Control Systems. Two versions of the developed system were
used in this study. The control system (Sc) had all the functionalities described
in the previous sections, whereas the experimental system (Se) differed in the
display mode of summaries: Se displayed summaries only at the higher levels of
the hierarchical structure, i.e. the upper three levels had associated summaries
while the fourth level did not. The rationale behind this is that we wanted to see
whether searchers’ behaviour is affected by the different display. To avoid bias
towards the use of the hierarchical structure and summarisation, we employed a
blind study, i.e. searchers were not told what the purpose of the study was.
Tasks. Four search tasks were used in the experiments. The tasks described
simulated work task situations [3]. We used modified versions of the INEX 2005
ad-hoc track topics which ensured that the tasks were realistic, and that relevant
documents could be found in the document collection. Two types of search tasks
were chosen. Background type tasks instructed searchers to look for information
about a certain topic (e.g. concerns about the CIA and FBI’s monitoring of the
public) while List type tasks asked searchers to create a list of products that are
connected to the topic of their tasks (e.g. a list of speech recognition software).
From each group of tasks, searchers could freely choose the one that was more
interesting to them. Searchers had a maximum of 20 minutes for each task. This
period is defined as a search session. Search sessions of the same searcher (i.e.
one searcher had two search sessions) are defined and used in this paper as a
user session.
Search Design. To rule out the fatigue and learning effects that could affect the
results, we adopted a Latin square design. Participants were randomly assigned into
groups of four. Within groups, the system order and the task order were permuted,
i.e. each searcher performed two tasks on different systems which involved two different task types. We made an effort to keep situational variables constant, e.g. the
same computer settings were used for each subject, the same (and only) experimenter was present, and the place of the experiments was the same.
Data Collected. Two types of events were logged. One type was used to save
the users’ actions based on their mouse clicks (e.g. when users clicked on the
‘search’ button, or opened an element for reading). The other type corresponds
to the summary-viewing actions of users, i.e. we logged whenever a summary
was displayed (users moved the mouse pointer over an item in the ToC).
During the analysis of the summary log files, summary-viewing times that were
shorter than half a second or longer than twenty seconds were discarded,
because the former probably corresponds to a quick mouse move (without users
having read the summary), and the latter may have recorded user actions when
the keyboard only was used (e.g. opening another window by pressing CTRL+N).
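The filtering rule can be captured in a one-liner; the bounds (0.5 s and 20 s) follow the text, while the function name is ours.

```python
def filter_viewing_times(times_s, lower=0.5, upper=20.0):
    """Keep summary display times within [0.5 s, 20 s]: shorter ones are
    likely quick mouse moves, longer ones keyboard-only activity."""
    return [t for t in times_s if lower <= t <= upper]
```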
In this section, the analysis of the recorded log files is described. To investigate
whether summarisation can be effectively used in interactive XML retrieval, we
formed four groups of research questions. The first group (Section 4.1) is about
summary reading times. The second group (Section 4.2) is about the number of
summaries searchers read in their search sessions. Section 4.3 investigates the
relation between summary reading times and number of summaries read (the
third group). The fourth group (Section 4.4) looks into the relation between
the multi-level XML retrieval and traditional retrieval.
Summary Times
In this section, we examine how long searchers read an average summary for;
whether reading times differ for summaries associated with elements at various
structural levels and element types; and whether the average summary reading
time changes when summaries are not shown at all structural levels in the ToC.
Taking both systems Se and Sc into account, an average summary was displayed
for 4.24 seconds, with a standard deviation of 3.9. The longest-viewed summary
was displayed for 19.57s, while the shortest counted summary was viewed for 0.51s.
Figure 3a shows the distribution of summary display times by structural level
for each system. Display times in Sc tend to be shorter when users read
summaries of deeper, i.e. shorter, elements, although the length of the
summaries was the same (four sentences). For Se, times are more balanced. This
indicates that when summaries exist for more levels and the lowest-level
elements are very short (sometimes these paragraphs are as short as the summary
itself), people trust the summaries of larger, i.e. higher-level, elements more.
If the difference in size between the deepest and highest elements is not so
large, times are more balanced.
The Use of Summaries in XML Retrieval
Fig. 3. Summary times by structural levels (a) and XML element types (b)
Figure 3b shows the display time distribution by XML element types (tags).
We can see, for example, that the bdy (body) element has high summary viewing
times; this is the element that contains all the main text of the article. We can
also see that paragraphs (para) and subsections (ss1 and ss2) have low summary
reading times for Sc and, obviously, none for Se (as they are not displayed
at these levels). These three element types appear on the lowest, i.e. fourth,
structural level.
We compared the two systems (Se and Sc ) to find out whether significant
differences in summary reading times can be found. The comparison of the overall
summary-viewing times showed a significant difference between Se and Sc: the
average summary viewing time for system Se (4.58s) is significantly higher than
that of system Sc (3.98s). To examine where this difference comes from, we
compared the two systems by tag types (e.g. whether summary reading times
for sections are different for the two systems). However, we did not find significant
differences at comparable tag types.¹ We also compared the two systems with
respect to structural levels (e.g. whether average summary reading time at level
one is significantly different for the two systems). We did not find significant
difference for level one (article), two (body, front and back matters) and three
(abstract, sections, appendix) elements.
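A comparison of this kind can be sketched with a two-sample t statistic. The per-view times below are synthetic (the paper reports only the means), so only the direction of the difference is meaningful:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic (allows unequal variances)."""
    return (mean(a) - mean(b)) / ((variance(a) / len(a) + variance(b) / len(b)) ** 0.5)

# Synthetic per-view times in seconds; real values would come from the log files.
se_times = [4.1, 5.2, 4.8, 4.3, 4.6]
sc_times = [3.7, 4.2, 3.9, 4.0, 4.1]

t = welch_t(se_times, sc_times)   # > 0: Se summaries viewed longer on average
```

Note that `statistics.variance` is the sample variance (n − 1 denominator), which is what the t statistic requires.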
To sum up, our results showed that users of system Se read summaries 0.5s
longer than users of system Sc. However, we could not find a significant
difference at individual levels or element types between the two systems. One
interpretation of this result is that since Se searchers had fewer available
summaries to examine, they were less confused and overloaded by the information
available and could take their time reading a particular summary.
4.2 Number of Summaries Read
This section looks into the number of summaries that were read by searchers. We
first examine the average number of summaries seen by users in a search session,
¹ Tag types for which summaries were not displayed in either system were not
compared, as one of the sample groups would contain zero samples.
and then we look into the distributions of the number of read summaries at
different structural levels and element types. Differences between the two systems
with respect to the number of read summaries are also discussed in this section.
Considering both systems together, an average user read 14.42 summaries in
a search session (20 minutes long), with a standard deviation of 10.77. This
shows a considerable difference in user behaviour regarding summary reading.
The least active summary reader read only one summary in a search session,
while the most active saw 52 summaries for at least half a second.
Figure 4a shows that the deeper we are in the structure of the ToC, the
more summaries are read, on average, in a search session. This is consistent with
the nature of XML, and all tree-like structures: the deeper we are in a tree,
the more elements are available on that level. However, our data shows that the
difference between the two systems is not only based on this structural property,
because when only three levels of summaries were displayed, reading of third
level summaries (usually summaries of sections) showed higher activity than
when four levels of summaries were displayed, i.e. the third level seems to be
more interesting than the first and second.
Fig. 4. Number of read summaries per search sessions, by structural levels (a) and
XML element types (b)
The next step is to find out whether this interest lies only at these deeper
levels, or is connected to particular element types. Contents of the same
element type are supposed to carry the same amount and kind of information,
e.g. paragraphs are a few sentences long; front matters usually contain the
author and title information and the abstract of the paper. Our log analysis
shows that summaries of sections, subsections and paragraphs are those most
read (Figure 4b), although users take less time to read them (see previous
section). Other tag types are less promising to users, judging by their summary
usage. We can also see in Figure 4b that when paragraph and subsection
summaries are not available (Se), section summary reading increases
dramatically. We interpret these results as an indication that, for the IEEE
collection, sections, which appear mostly at level three, are the most
promising elements to look at when answering an average query.
The comparison of the overall number of viewed summaries showed that an
average user of system Se read 12.5 summaries per search session, and of system
Sc 16.33 summaries per session. In other words, our test persons read more
summaries where more summaries were available. However, this difference is not
statistically significant.
We compared the two systems using the same categories (i.e. tag types and
levels) as previously for summary reading times. T-tests did not show significant
differences at comparable levels and element types between Se and Sc in number
of read summaries.
4.3 Reading Times vs. Number of Read Summaries
In this section, we examine the relationships between the data and findings of
the previous two sections. One question we are looking into is whether searchers
with higher summary reading times read fewer summaries in a search session.
Users of system Se read fewer summaries than those who used system Sc, which is
in accordance with the fact that they had fewer summaries available. However,
users of system Se also read summaries for longer. This suggests that if fewer
summaries are available, users can focus more on each particular summary, and,
vice versa, if there are many summaries to view, reading can become superficial.
Considering both systems and tag types, we found a negative correlation between
summary reading time and the number of read summaries. In other words, for
users of both systems, the more summaries they read at a particular level, the
shorter the corresponding reading times. However, this is only an indication,
as the correlation coefficient (-0.5) is not statistically significant. Also,
since the number of summaries read increases when going deeper in the
structure, we view this as an indication that, for searchers, summaries of
higher-level elements are more indicative of the contents of the corresponding
elements than those of lower, and also shorter, elements.
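The correlation reported here can be computed with the standard Pearson coefficient. The per-level numbers below are synthetic stand-ins that echo the observed pattern (more summaries read at deeper levels, shorter reading times):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

n_read = [2, 4, 9, 15]             # summaries read per session, levels 1-4 (synthetic)
view_time = [5.1, 4.6, 4.0, 3.2]   # mean viewing time in seconds (synthetic)

r = pearson(n_read, view_time)     # negative, echoing the reported r = -0.5
```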
4.4 Usage of the ToC and Article Level
XML retrieval has the advantage of breaking down a document into smaller
elements according to the document’s logical structure. We investigated whether
searchers take advantage of this structure: do they click on items in the ToC, do
they use the article (unstructured) level of a document, and how frequently do
they alternate between full-article and smaller XML element views?
Regarding the usage of the XML structure in terms of the ToC, 58.16% of the
displayed elements were results of at least “second” clicks, i.e. more than half of
the elements were displayed by clicking on an element in the ToC. This shows
that searchers actively used the ToC provided (unlike those in [14]), and that
they used the logical structure of the documents by browsing within the ToC.
The log files show that only 25% of the searchers clicked on article elements
to access the whole document, and none of these searchers clicked on an article
type link more than three times in a search session. The distribution of viewing
whole articles did not depend on the system, i.e. we observed three article clicks
for each system. This result follows naturally, since the display of the article level
was not different in Se and Sc .
Article-level clicks show that articles accounted for only 3.56% of all
displayed elements. This may be misleading, as the retrieval system did not return article
elements in the result list. We therefore compared article usage to elements that
were displayed when users were already in the document view, i.e. we excluded
elements that were shown right after a searcher clicked on a link in the result
list. The updated number shows that article elements were displayed in 6.12%
of these clicks. This suggests that searchers of an XML retrieval system do use
the structure available in terms of the ToC, and although it was the first time
they had used an XML retrieval system, they did not simply prefer to see the
whole document as they were accustomed to.
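The two percentages can be reproduced from counts with the right ratios. The absolute counts below are hypothetical, chosen only so that the ratios match the reported figures (the paper reports percentages, not counts):

```python
# Hypothetical counts chosen so the ratios match the reported figures.
total_displays = 1096        # all displayed elements (assumed)
article_displays = 39        # article-level views (assumed)
after_result_click = 459     # elements shown right after a result-list click (assumed)

pct_of_all = 100 * article_displays / total_displays
in_document = total_displays - after_result_click
pct_in_document = 100 * article_displays / in_document
# pct_of_all is ~3.56%, pct_in_document is ~6.12%
```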
Our results from the previous sections suggest that searchers still want to have
access to, and use, the full-article level. For example, searchers read summaries
of articles, and read them for longer, but they did not necessarily want to use
the full articles directly, i.e. looking at the full-article summary may be enough
to decide whether reading any part of the article is worthwhile.
5 Conclusions and Future Work
In this paper, we presented a study of an interactive XML retrieval system
that uses summarisation to help users in navigation within XML documents.
The system takes advantage of the logical structure of an XML document by
presenting the structure in the form of a table of contents (ToC). In this ToC,
summaries of corresponding XML elements were generated and displayed.
Searchers in our study did indeed use the provided structure actively and did
not only use the whole article to identify relevant content. In addition,
searchers made good use of the XML element summaries, spending a significant
amount of time reading them. This implies that our system, through the use of
summarisation, facilitated browsing at the ToC level more than the system used
at the INEX 2004 interactive track [13].
Regarding the use of element summaries, in our study searchers tended to read
more summaries that were associated with elements at lower levels in the structure (e.g. summaries of paragraphs), and at the same time summaries of lower
elements were read for a shorter period of time. Our results also suggest that if
more summaries are made available, searchers tend to read more summaries in
a search session, but for shorter time.
In our experiment, the ToC display and summary presentation were highly
connected (i.e. no summary can be displayed without a corresponding item in
the ToC). Based on the close relation between them, for such an XML retrieval
system it is important to find the appropriate ToC, and summary, presentation.
If the ToC is too deep, searchers may lose focus, as the reading of many
summaries and the short reading times at low levels indicated. Conversely, if
the ToC is not detailed enough, users may lose potentially good links to
relevant elements. Our results suggest that, for the collection used, a one- or
two-level ToC (containing references to the whole article, body, front and back
matter) would probably be too shallow, while displaying the full fourth level
(normally down to paragraph level) is sometimes too deep.
We view our results as having implications for the design of interactive XML
IR systems that support searchers by providing element summaries and structural information. One implication of the results is that the display of the ToC
in XML IR systems needs to be carefully chosen (see previous paragraph). Our
results also showed that summarisation can be effectively used in XML retrieval.
A further implication of the results is that XML retrieval systems should consider the logical structure of the document for summary generation, as searchers
do not use summaries in the same way at all levels of the structure.
Based on the outcomes of this study, our future work includes the development of an improved interactive XML retrieval system that includes adaptive
generation of summaries at the ToC level, where the structural position and estimated relevance of the element to be summarised will also be considered (some
initial work has been done in [2]). We also plan to take into account structural IR search
task types (e.g. fetch and browse), that are currently being investigated at INEX
[6]. The aim of the fetch and browse retrieval strategy is to first identify relevant
documents (the fetching phase), and then to identify the most relevant elements
within the fetched articles (the browsing phase). We believe that summarisation can be particularly helpful in the browsing phase, where finding relevant
elements within a document is required.
Acknowledgements
This work was partly funded by the DELOS Network of Excellence in Digital
Libraries, to which we are very grateful.
References
1. A. Alam, A. Kumar, M. Nakamura, A. F. Rahman, Y. Tarnikova, and C. Wilcox.
Structured and unstructured document summarization: Design of a commercial
summarizer using lexical chains. In ICDAR, pages 1147–1152. IEEE Computer
Society, 2003.
2. M. Amini, A. Tombros, N. Usunier, M. Lalmas, and P. Gallinari. Learning to
summarise XML documents by combining content and structure features (poster).
In CIKM’05, Bremen, Germany, October 2005.
3. P. Borlund. The IIR evaluation model: a framework for evaluation of interactive
information retrieval systems. Information Research, 8(3), 2003.
4. S. Dumais, E. Cutrell, and H. Chen. Optimizing search by showing results in
context. In CHI ’01: Proceedings of the SIGCHI conference on Human factors in
computing systems, pages 277–284, New York, NY, USA, 2001. ACM Press.
5. H. P. Edmundson. New methods in automatic extracting. J. ACM, 16(2):264–285, 1969.
6. N. Fuhr, M. Lalmas, S. Malik, and Z. Szlávik, editors. Advances in XML Information Retrieval: Third International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2004, volume 3493 of Lecture Notes in Computer Science. Springer-Verlag GmbH, May 2005.
7. J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell. Summarizing text documents: sentence selection and evaluation metrics. In SIGIR’99, pages 121–128.
ACM Press, 1999.
8. H. Kim and H. Son. Interactive searching behavior with structured XML documents.
In Fuhr et al. [6], pages 424–436.
9. J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In SIGIR’95, pages 68–73. ACM Press, 1995.
10. N. Pharo and R. Nordlie. Context matters: An analysis of assessments of XML
documents. In Fabio Crestani and Ian Ruthven, editors, CoLIS, volume 3507 of
Lecture Notes in Computer Science, pages 238–248. Springer, 2005.
11. T. Rölleke, R. Lübeck, and G. Kazai. The HySpirit retrieval platform. In SIGIR’01,
page 454, New York, NY, USA, 2001. ACM Press.
12. Z. Szlávik, A. Tombros, and M. Lalmas. Investigating the use of summarisation for
interactive XML retrieval. In Proceedings of the 21st ACM Symposium on Applied
Computing, Information Access and Retrieval Track (SAC-IARS’06), to appear,
pages 1068–1072, 2006.
13. A. Tombros, B. Larsen, and S. Malik. The interactive track at INEX 2004. In Fuhr
et al. [6].
14. A. Tombros, S. Malik, and B. Larsen. Report on the INEX 2004 interactive track.
ACM SIGIR Forum, 39(1):43–49, June 2005.
15. A. Tombros and M. Sanderson. Advantages of query biased summaries in information retrieval. In SIGIR’98, pages 2–10. ACM Press, 1998.
16. R. W. White, J. M. Jose, and I. Ruthven. A task-oriented study on the influencing
effects of query-biased summarisation in web searching. Inf. Process. Manage.,
39(5):707–733, 2003.
An Enhanced Search Interface for Information
Discovery from Digital Libraries
Georgia Koutrika1,* and Alkis Simitsis2,**
University of Athens,
Department of Computer Science,
Athens, Greece
[email protected]
National Technical University of Athens,
Department of Electrical and Computer Engineering,
Athens, Greece
[email protected]
Abstract. Libraries, museums, and other organizations make their electronic
contents available to a growing number of users on the Web. A large fraction of
the information published is stored in structured or semi-structured form.
However, most users have no specific knowledge of schemas or structured
query languages for accessing information stored in (relational or XML)
databases. Under these circumstances, the need for facilitating access to
information stored in databases becomes increasingly important. Précis queries
are free-form queries that, instead of simply locating and connecting values in
tables, also consider information around these values that may be related to
them. Therefore, the answer to a précis query might also contain information
found in other parts of the database. In this paper, we describe a précis query
answering prototype system that generates personalized presentations of short
factual information précis in response to keyword queries.
1 Introduction
The emergence of the World Wide Web has given libraries, museums, and other
organizations the opportunity to make their electronic contents available to a
growing number of users. A large fraction of that information is stored in
structured or semi-structured form. However, most users have no specific knowledge
of schemas or (semi-)structured query languages for accessing information stored in
(relational or XML) databases. Under these circumstances, the need to facilitate
access to information stored in databases becomes increasingly important.
Towards that direction, existing efforts have mainly focused on facilitating
querying over structured data proposing either handling natural language queries [2,
14, 17] or free-form queries [1, 18].
* This work is partially supported by the Information Society Technologies (IST) Program of the European Commission as part of the DELOS Network of Excellence on Digital Libraries (Contract G038-507618).
** This work is co-funded by the European Social Fund (75%) and National Resources (25%) - Operational Program for Educational and Vocational Training II (EPEAEK II), and particularly the Program PYTHAGORAS.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 87–98, 2006.
© Springer-Verlag Berlin Heidelberg 2006
However, end users want to achieve their goals with a minimum of cognitive load
and a maximum of enjoyment [12]. In addition,
they often have very vague information needs or know a few buzzwords. Therefore,
the usefulness of keyword-based queries, especially compared to a natural language
approach in the presence of complex queries, has been acknowledged [26].
Consider a digital collection of art works made available to people on the Web. A
user browses the contents of this collection with the purpose of learning about
“Michelangelo”. If this need is expressed as a free-form query, existing keyword
searching approaches focus on finding and possibly interconnecting entities that
contain the query terms; thus, they would return an answer as brief as “Michelangelo:
painter, sculptor”. This answer conveys little information to the user and more
importantly does not help or encourage him in searching or learning more about
“Michelangelo”. On the other hand, a more complete answer containing, for instance,
biographical data and information about this painter’s work would be more
meaningful and useful instead. This could be in the form of the following précis:
“Michelangelo (March 6, 1475 - February 18, 1564) was born in Caprese,
Tuscany, Italy. As a painter, Michelangelo's work includes Holy Family of the
Tribune (1506), The Last Judgment (1541), The Martyrdom of St. Peter (1550).
As a sculptor Michelangelo's work includes Pieta (1500), David (1504).”
A précis is often what one expects in order to satisfy an information need
expressed as a question or as a starting point towards that direction. Based on the
above, support of free-form queries over databases and generation of answers in the
form of a précis comprises an advanced searching paradigm helping users to gain
insight into the contents of a database. A précis may be incomplete in many ways; for
example, the abovementioned précis of “Michelangelo” includes a non-exhaustive list
of his works. Nevertheless, it provides sufficient information to help someone learn
about Michelangelo and identify new keywords for further searching. For example,
the user may decide to explicitly issue a new query about “David” or implicitly by
following underlined topics (hyperlinks) to pages containing relevant information.
In the spirit of the above, précis queries have recently been proposed [11].
These are free-form queries that, instead of simply locating and connecting
values in tables, also consider information around these values that may be
related to them.
Therefore, the answer to a précis query might also contain information found in other
parts of the database, e.g., frescos created by Michelangelo. This information needs to
be “assembled” (in perhaps unforeseen ways) by joining tuples from multiple
relations. Consequently, the answer to a précis query is a whole new database,
a logical database subset, derived from the original database, in contrast to
the flattened-out results returned by other approaches. This subset is useful
in many cases and gives the user much greater insight into the original data.
The work that we describe in this paper focuses on design and implementation
issues of a précis query answering prototype with the following characteristics:
− Support of a keyword-based search interface for accessing the contents of the
underlying collection.
− Generation of a logical subset of the database that answers the query, which
contains not only items directly related to the query selections but also items
implicitly related to them in various ways.
− Personalization of the logical subset generated and hence the précis returned
according to the needs and preferences of the user as a member of a group of users.
− Translation of the structured output of a précis query into a synthesis of results.
The output is an English presentation of short factual information précis.
Outline. Section 2 discusses related work. Section 3 describes the general framework
of précis queries. Section 4 presents the design and implementation of our prototype
system, and Section 5 concludes with prospects for future work.
2 Related Work
The need for free-form queries was recognized early in the context of databases
[18]. With the advent of the World Wide Web, the idea has been revisited. Several
research efforts have emerged for keyword searching over relational [1, 3, 8, 13] and
XML data [5, 6, 9]. Oracle 9i Text [19], Microsoft SQL Server [16] and IBM DB2
Text Information Extender [10] create full text indexes on text attributes of relations
and then perform keyword queries.
Existing keyword searching approaches focus on finding and possibly
interconnecting tuples in relations that contain the query terms. For example, the
answer for “Michelangelo” would be in the form of relation-attribute pair, such as
(Artist, Name). In many practical scenarios, this answer conveys little information
about “Michelangelo”. A more complete answer containing, for instance, information
about this artist's works would be more useful. In the spirit of the above,
précis queries have recently been proposed [11]. The answer to a précis query is a whole new
database, a logical database subset, derived from the original database. Logical
database subsets are useful in many cases. However, naïve users would rather prefer a
friendly representation of the information contained in a logical subset, without
necessarily understanding its relational character. In earlier work [11], the
importance of such a representation, constructed based on information conveyed
by the database graph, has been suggested. A method for generating an English presentation of the
information contained in a logical subset as a synthesis of simple SPO sentences has
been proposed [21]. The process resembles those involved in handling natural
language queries over relational databases in that they both involve some amount of
additional predefinitions for the meanings represented by relations, attributes and
primary-to-foreign key joins. However, natural language query processing is more
complex, since it has to handle ambiguities in natural language syntax and semantics
whereas this approach uses well defined templates to rephrase relations and tuples.
The problem of facilitating the naïve user has been thoroughly discussed in the
field of natural language processing (NLP). Over the last couple of decades,
several works have been presented concerning NL querying [26, 15], NL and
schema design [23, 14, 4], NL and DB interfaces [17, 2], and question answering
[25, 22]. Related
literature on NL and databases has focused on rather different issues, such as
the translation of users’ phrasal questions into a database language, e.g. SQL,
or automatic database design, e.g. using ontologies [24]. There exist some
recent efforts that use phrasal patterns or question templates to facilitate the
answering procedure [17, 22]. Moreover, these works produce pre-specified answers,
where only the values in the patterns change. This is in contrast to précis queries,
which construct logical subsets on demand and use templates and constructs of
sentences defined on the constructs of the database graph, thus generating dynamic
answers. This characteristic of précis queries also enables template multi-utilization.
In this paper, we build upon the ideas of [11, 20, 21] and describe the design
and implementation of a system that supports précis queries for different user groups.
3 The Précis Query Framework
The purpose of this section is to provide background information on précis queries.
Preliminaries. We consider the database schema graph G(V, E) as a directed graph
corresponding to a database schema D. There are two types of nodes in V: (a) relation
nodes, R, one for each relation in the schema; and (b) attribute nodes, A, one for each
attribute of each relation in the schema. Likewise, edges in E are the following: (a)
projection edges, Π, each connecting an attribute node with its container relation
node, representing the possible projection of the attribute in the system’s answer; and
(b) join edges, J, from one relation node to another relation node, representing a
potential join between these relations. These could be joins that arise naturally due to
foreign key constraints, but could also be other joins that are meaningful to a domain
expert. Joins are directed for reasons explained later. Therefore, a database graph is a
directed graph G(V, E), where V = R∪A and E = Π∪J.
A weight, w, is assigned to each edge of the graph G. This is a real number in
the range [0, 1], representing the significance of the bond between the
corresponding nodes. A weight equal to 1 expresses a strong relationship: if
one node of the edge appears in an answer, the edge should be taken into
account, making the other node appear as well. If a weight equals 0, the
occurrence of one node of the edge in an answer does not imply the occurrence
of the other node. Based on the above, two relation nodes could be connected
through two different join edges, in the two possible directions, between the
same pair of attributes, but carrying different weights. For simplicity, we
assume that there is at most one directed edge from one node to the same
destination node.
A directed path between two relation nodes, comprising adjacent join edges,
represents the “implicit” join between these relations. Similarly, a directed path
between a relation node and an attribute node, comprising a set of adjacent join edges
and a projection edge represents the “implicit” projection of the attribute on this
relation. The weight of a path is a function of the weights of constituent edges, which
should satisfy the condition that the estimated weight should decrease as the length of
the path increases, based on human intuition and cognitive evidence. In our system,
we have considered the product of weights over a path.
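The definitions above can be sketched directly. The join-edge weights below follow the WORK/OWNER example discussed later in this section; the projection-edge weight is an assumption for illustration:

```python
# Directed, weighted edges of a toy database graph.
edges = {
    ("OWNER", "WORK"): 1.0,        # join edge: owners strongly imply their works
    ("WORK", "OWNER"): 0.7,        # join edge: works imply owners more weakly
    ("WORK", "WORK.title"): 0.9,   # projection edge (weight is an assumption)
}

def path_weight(path):
    """Weight of a directed path = product of its edge weights,
    so weights decrease as paths get longer."""
    w = 1.0
    for edge in zip(path, path[1:]):
        w *= edges[edge]
    return w

w = path_weight(["OWNER", "WORK", "WORK.title"])   # 1.0 * 0.9
```

The product satisfies the stated condition: adding an edge with weight below 1 can only lower a path's weight.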
Logical Database Subsets. Consider a database D properly annotated with a set of
weights and a précis query Q, which is a set of tokens, i.e. Q = {k1, k2, …, km}. We
define as an initial relation any database relation that contains at least one tuple in
which one or more query tokens have been found. A tuple containing at least one
query token is called an initial tuple.
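Finding initial relations and tuples amounts to a token scan over the stored values. A minimal sketch with tables modelled as lists of dicts (the schema and data are invented for illustration, echoing the Michelangelo example):

```python
database = {
    "ARTIST": [{"name": "Michelangelo", "type": "painter, sculptor"}],
    "WORK":   [{"title": "David", "year": 1504}],
}

def initial_tuples(db, query_tokens):
    """Return, per relation, the tuples containing at least one query token.
    Relations with a non-empty list are the initial relations."""
    hits = {}
    for relation, tuples in db.items():
        found = [t for t in tuples
                 if any(tok.lower() in str(v).lower()
                        for tok in query_tokens
                        for v in t.values())]
        if found:
            hits[relation] = found
    return hits

hits = initial_tuples(database, ["Michelangelo"])   # ARTIST is an initial relation
```

A real system would use full-text indexes (as the commercial engines cited in Section 2 do) rather than a linear scan.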
An Enhanced Search Interface for Information Discovery from Digital Libraries
Fig. 1. An example database graph (including relations such as CONTENT(wid, eid, notes) and OWNER(wid, mid, acquisition))
A logical database subset D’ of D satisfies the following:
− The set of relation names in D’ is a subset of that in the original database D.
− For each relation Ri’ in the result D’, its set of attributes in D’ is a subset of its set
of attributes in D.
− For each relation Ri’ in the result D’, the set of its tuples is a subset of the set of
tuples in the original relation Ri in D (when projected on the set of attributes that
are present in the result).
The result of applying query Q on a database D given a set of constraints C is a
logical database subset D’ of D, such that D’ contains initial tuples for Q and any other
tuple in D that can be transitively reached by joins on D starting from some initial
tuple, subject to the constraints C [11]. Possible constraints include the maximum
number of attributes in D’, the minimum weight of paths in D’, the maximum number
of tuples in D’, and so forth. Using different constraints and weights on the edges of
the database graph allows generating different answers for the same query.
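The expansion from initial tuples can be sketched at the relation level as a weighted reachability computation, with a minimum path weight acting as the constraint C. The relation names and weights below are invented for illustration:

```python
# Directed join edges with weights: relation -> [(neighbour, weight), ...]
join_edges = {
    "ARTIST": [("WORK", 0.9)],
    "WORK":   [("OWNER", 0.7), ("ARTIST", 0.8)],
    "OWNER":  [("WORK", 1.0)],
}

def reachable(start, min_weight):
    """Relations reachable from an initial relation via paths whose weight
    (product of edge weights) is at least `min_weight`."""
    best = {start: 1.0}          # best known path weight per relation
    frontier = [start]
    while frontier:
        rel = frontier.pop()
        for nxt, w in join_edges.get(rel, []):
            pw = best[rel] * w
            if pw >= min_weight and pw > best.get(nxt, 0.0):
                best[nxt] = pw
                frontier.append(nxt)
    return best

subset = reachable("ARTIST", min_weight=0.5)   # includes WORK and OWNER
strict = reachable("ARTIST", min_weight=0.8)   # OWNER falls below the threshold
```

Tightening the minimum path weight shrinks the logical subset, which is how different constraint settings yield different answers to the same query.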
Précis Patterns. Weights and constraints may be provided in different ways. They
may be set by the user at query time using an appropriate user interface. This option is
attractive in many cases since it enables interactive exploration of the contents of a
database. This bears a resemblance to query refinement in keyword searches. In case
of précis queries, the user may explore different regions of the database starting, for
example, from those containing objects closely related to the topic of a query and
progressively expanding to parts of the database containing objects more loosely
related to it. Although this approach is quite elegant, the user should spend some time
on a procedure that may not always seem relevant to his need for a certain answer.
Thus, weights and criteria may be pre-specified by a designer, or stored as part of a
profile corresponding to a user or a group of users. In particular, in our framework,
we have adopted the use of patterns of logical subsets corresponding to different
queries or groups of users, which are stored in the system [20]. For instance, different
patterns would be used to capture preferences of art reviewers and art fans.
G. Koutrika and A. Simitsis
Fig. 2. Example précis patterns
Formally, given the database schema graph G of a database D, a précis pattern is a
directed rooted tree P(V,E) on top of G annotated with a set of weights. Given a query
Q over database D, a précis pattern P(V,E) is applicable to Q, if its root relation
coincides with an initial relation for Q. The result of applying query Q on a database D
given an applicable pattern P is a logical database subset D’ of D, such that:
− The set of relation names in D’ is a subset of that in P.
− For each relation Ri’ in the result D’, its set of attributes in D’ is a subset of its set
of attributes in P.
− For each relation Ri’ in the result D’, the set of its tuples is a subset of the set of
tuples in the original relation Ri in D (when projected on the set of attributes that
are present in the result).
In order to produce the logical database subset D’, a précis pattern P is enriched
with tuples extracted from the database based on constraints, such as the maximum
number of attributes in D', the maximum number of tuples in D' and so on.
Example. Consider the database graph presented in Fig. 1. Observe the two directed
edges between WORK and OWNER. Works and owners are related but one may consider
that owners are more dependent on works than the other way around. In other words,
an answer regarding an owner should always contain information about related works,
while an answer regarding a work may not necessarily contain information about its
owner. For this reason, the weight of the edge from OWNER to WORK is set to 1, while
the weight of the edge from WORK to OWNER is 0.7. Précis patterns corresponding to
different queries and/or groups of users may be stored in the system. In Fig. 2,
patterns P1 and P2 correspond to different queries, regarding artists and exhibitions,
respectively (the initial relation in each pattern is shown in grey).
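As a hedged illustration, the asymmetric weights of the example and the pattern-applicability test can be encoded as follows (relation names from Fig. 1; the dictionary structure is invented for illustration):

```python
# Database graph edges with asymmetric weights, as in the WORK/OWNER example.
graph = {
    ("OWNER", "WORK"): 1.0,   # an owner's answer always includes related works
    ("WORK", "OWNER"): 0.7,   # a work's answer only sometimes includes its owner
    ("ARTIST", "WORK"): 1.0,
}

def applicable(pattern_root, initial_relations):
    """A précis pattern applies to a query iff its root is an initial relation."""
    return pattern_root in initial_relations

print(applicable("ARTIST", {"ARTIST", "WORK"}))  # → True
print(graph[("WORK", "OWNER")])                  # → 0.7
```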
4 System Architecture
In this section, we describe the architecture of a prototype précis query answering
system, depicted in Fig. 3.
Fig. 3. System Architecture
Each time a user poses a question, the system finds the initial relations that match
this query, i.e. database relations containing at least one tuple in which one or more
query tokens have been found (Keyword Locator). Then, it determines the database
part that contains information related to the query; for this purpose, it searches in a
repository of précis patterns to extract an appropriate one (Précis Manager). If an
appropriate pattern is not found, then a new one is created and registered in the
repository. Next, this précis pattern is enriched with tuples extracted from the
database according to the query keywords, in order to produce the logical database
subset (Logical Subset Generator). Finally, an answer in the form of a précis is
returned to the user (Translator). The creation and maintenance of the inverted index,
patterns and templates is controlled through a Designer component. In what follows,
we discuss in detail the design and implementation of these components.
Designer Interface. This module provides the necessary functionality that allows a
designer to create and maintain the knowledge required for the system to operate, i.e.:
− inverted index: with a click of a button, the designer may create or drop the
inverted index for a relational database.
− templates: through a graphical representation of a database schema graph, the
designer may define templates to be used by the Translator.
− user groups: the designer may create pre-specified groups of users. Then, when a
new user registers in the system, he may choose the group he belongs to.
− patterns: through a graphical representation of a database schema graph, the
designer may define précis patterns targeting different groups of users and different
types of queries for a specific domain. These are stored in a repository.
Manual creation of patterns and user groups assumes good domain and application
knowledge and understanding. For instance, the pattern corresponding to a query
about art works would probably contain the title and creation date of art works along
with the names of the artists that created them and museums that own them; whereas
the pattern corresponding to a query about artists would most likely contain detailed
information about artists such as name, date and location of birth, and date of death
along with titles of works an artist has created. Furthermore, different users or groups
of users, e.g., art reviewers vs. art fans, would be interested in different logical subsets
for the same query. We envision that the system could learn and adapt précis patterns
for different users or groups of users by using logs of past queries or by means of
social tagging by large numbers of users. Then, the only work a designer would have
to do would be the creation of templates.
Keyword Locator. When a user submits a précis query Q={k1,k2,…,km}, the system
finds the initial relations that match this query, i.e. database relations containing at
least one tuple in which one or more query tokens have been found. For this purpose,
an inverted index has been built, which associates each keyword that appears in the
database with a list of occurrences of the keyword. Modern RDBMSs provide
facilities for constructing full text indices on single attributes of relations (e.g.,
Oracle9i Text). In our approach, we chose to create our own inverted index, mainly
for the following reasons: (a) a keyword may be found in more than one tuple and
attribute of a single relation and in more than one relation; and (b) we consider
keywords of other data types as well, such as date and number.
In its current version, the system considers that query keywords are connected with
the logical operator or. Keywords enclosed in quotation marks, e.g., “Leonardo da
Vinci”, are considered as one keyword that must be found in the same tuple. This means
that the user can issue queries such as “Michelangelo” or “Leonardo da Vinci”, but not
queries such as “Michelangelo” and “Leonardo da Vinci”, which would essentially ask
about the connection between these two entities/people. We are currently working on
supporting more complex queries involving operators and and not.
Based on the above, given a user query, Keyword Locator consults the inverted
index, and returns for each term ki in Q, a list of all initial relations, i.e. ki→ {Rj},
∀ki in Q. (If no tuples contain the query tokens, then an empty answer is returned.)
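A minimal sketch of such an inverted index, mapping each token to the relations in which it occurs (the full index would also record tuples and attributes); the relation contents below are invented for illustration:

```python
from collections import defaultdict

def build_index(db):
    """db: {relation_name: [tuples]}. Returns token → set of relations."""
    index = defaultdict(set)
    for relation, tuples in db.items():
        for t in tuples:
            for value in t:
                for token in str(value).lower().split():
                    index[token].add(relation)
    return index

db = {
    "ARTIST": [("Michelangelo", "1475")],
    "WORK":   [("David", "sculpture")],
}
index = build_index(db)
print(index["david"])             # → {'WORK'}
print(index.get("plato", set()))  # no occurrences: the empty-answer case
```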
Précis Manager. Précis Manager determines the schema of the logical database
subset, i.e. the database part that contains information related to the query. This
should involve initial relations and relations around them containing relevant
information. The schema of the subset that should be extracted from a database given
a précis query may vary depending on the type of the query issued and the user
issuing the query. Patterns of logical subsets corresponding to different queries or
groups of users are stored in the system. For instance, different patterns would be
used to capture preferences of art reviewers and fans.
Each time an individual poses a question, Précis Manager searches into the
repository of précis patterns to extract those that are appropriate for the situation. If
users are categorized into groups, then this module examines only patterns assigned to
the group the active user belongs to. Based on the initial relations identified for query
Q, one or more applicable patterns may be identified. Recall that a précis pattern
P(V,E) is applicable to Q, if its root relation coincides with an initial relation for Q.
For instance, given a query on “David”, a pattern may correspond to artists
(“Michelangelo”) and another to owners (“Accademia di Belle Arti, Florence, Italy”).
If none is returned for a certain initial relation, then the request is propagated to a
Schema Generator. This module is responsible for finding which part of the database
schema may contain information related to Q. The output of this step is the schema D'
of a logical database subset comprised of: (a) relations that contain the tokens of Q;
(b) relations transitively joining to the former, and (c) a subset of their attributes that
should be present in the result, according to the preferences registered for the user that
poses the query. (For more details, we refer the interested reader to [20].) After its
creation, the schema of the logical database subset is stored in the graph database as a
pattern associated with the group that the user submitting the query belongs to.
Logical Subset Generator. A précis pattern selected from the previous step is
enriched with tuples extracted from the database according to the query keywords, in
order to produce the logical database subset. For this purpose, the Logical Subset
Generator starts from the initial relations where tokens in Q appear. Then, more tuples
from other relations are retrieved by (foreign-key) join queries starting from the initial
relations and transitively expanding on the database schema graph following edges of
the pattern. Joins on a précis pattern are executed in order of decreasing weight. In
other words, a précis pattern comprises a kind of a “plan” for collecting tuples
matching the query and others related to them. At the end of this phase, the logical
database subset has been produced.
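The traversal order just described, heaviest pattern edges first, can be sketched as follows; in the real system each visited edge would trigger a foreign-key join query, and all names here are illustrative:

```python
def join_order(pattern_edges, root):
    """pattern_edges: {(src, dst): weight}. Returns (src, dst, weight) in visit order,
    expanding from the pattern root and taking heavier edges first at each relation."""
    order, frontier, seen = [], [root], {root}
    while frontier:
        src = frontier.pop(0)
        out = sorted(((w, dst) for (s, dst), w in pattern_edges.items() if s == src),
                     reverse=True)            # decreasing weight
        for w, dst in out:
            order.append((src, dst, w))
            if dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return order

edges = {("ARTIST", "WORK"): 1.0, ("WORK", "OWNER"): 0.7, ("WORK", "EXHIBITS"): 0.9}
print(join_order(edges, "ARTIST"))
# → [('ARTIST', 'WORK', 1.0), ('WORK', 'EXHIBITS', 0.9), ('WORK', 'OWNER', 0.7)]
```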
Translator. The Translator is responsible for rendering a logical database subset to a
more user-friendly synthesis of results. This is performed by a semi-automatic method
that uses templates over the database schema. In the context of this work, the
presentation of a query answer is defined as a proper structured management of
individual results, according to certain rules and templates predefined by a designer.
The result is a user-friendly response through the composition of simple clauses.
In this framework, in order to describe the semantics of a relation R along with its
attributes in natural language, we consider that relation R has a conceptual meaning
captured by its name, and a physical meaning represented by the value of at least one
of its attributes that characterizes tuples of this relation. We name this attribute the
heading attribute and we depict it as a hachured rounded rectangle. For example, in
Fig. 1, the relation ARTIST conceptually represents “artists” in real world; indeed, its
name, ARTIST, captures its conceptual meaning. Moreover, the main characteristic of
an “artist” is its name, thus, the relation ARTIST should have the NAME as its heading
attribute. By definition, the edge that connects a heading attribute with the respective
relation has a weight 1 and it is always present in the result of a précis query. A
domain expert makes the selection of heading attributes.
The synthesis of query results follows the database schema and the correlation of
relations through primary and foreign keys. Additionally, it is enriched by
alphanumeric expressions called template labels mapped to the database graph edges.
A template label, label(u,z) is assigned to each edge e(u,z)∈ E of the
database schema graph G(V,E). This label is used for the interpretation of the
relationship between the values of nodes u and z in natural language.
Each projection edge e ∈ Π that connects an attribute node with its container
relation node, has a label that signifies the relationship between this attribute and the
heading attribute of the respective relation; e.g., the BIRTH_DATE of an ARTIST
(.NAME). If a projection edge is between a relation node and its heading attribute, then
the respective label reflects the relationship of this attribute with the conceptual
meaning of the relation; e.g., the NAME of an ARTIST. Each join edge e ∈ J between
two relations has a label that signifies the relationship between the heading attributes
of the relations involved; e.g., the WORK (.TITLE) of an ARTIST (.NAME). The label
of a join edge that involves a relation without a heading attribute signifies the
relationship between the previous and subsequent relations.
We define the label l of a node n as the name of the node, and we denote it as
l(n). For example, the label of the attribute node NAME is “name”. The name of a
node is determined by the designer/administrator of the database. The template label
label(u,z) of an edge e(u,z) formally comprises the following parts: (a) lid, a
unique identifier for the label in the database graph; (b) l(u), the name of the starting
node; (c) l(z), the name of the ending node; (d) expr1, expr2, expr3 alphanumeric
expressions. A simple template label has the form:
label(u,z) = expr1 + l(u) + expr2 + l(z) + expr3
where the operator “+” acts as a concatenation operator.
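A minimal sketch of this concatenation, with invented expressions and node labels:

```python
def label(expr1, l_u, expr2, l_z, expr3=""):
    """Compose a template label: expr1 + l(u) + expr2 + l(z) + expr3,
    where '+' is string concatenation as in the text."""
    return expr1 + l_u + expr2 + l_z + expr3

# e.g. a projection-edge label relating BIRTH_DATE to the heading attribute NAME:
print(label("the ", "birth date", " of the artist named ", "Michelangelo", "."))
# → the birth date of the artist named Michelangelo.
```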
In order to use template labels or to register new ones, we use a simple language
for templates that supports variables, loops, functions, and macros.
The translation is realized separately for every occurrence of a token. At the end,
the précis answer lists all the clauses produced. For each occurrence of a token, the
analysis of the query result graph starts from the relation that contains the input token.
The labels of the projection edges that participate in the query result graph are
evaluated first. The label of the heading attribute comprises the first part of the
sentence. After having constructed the clause for the relation that contains the input
token, we compose additional clauses that combine information from more than one
relation by using foreign key relationships. Each of these clauses has as subject the
heading attribute of the relation that has the primary key. The procedure ends when
the traversal of the database graph is complete. For further details, we refer the
interested reader to [21].
User Interface. The user interface of our prototype comprises a simple form where
the user can enter one or more keywords describing the topic of interest. Currently,
the system considers that query keywords are connected with the logical operator or.
This means that the user can ask about “Michelangelo” or “Leonardo da Vinci”, but
cannot submit a query about “Michelangelo” and “Leonardo da Vinci”, which
essentially would ask about the connection between these two entities/people.
Before using the system, a user identifies oneself as belonging to one of the
existing groups, i.e. art reviewers or fans. Fig. 4 displays an example of a user query
and the answer returned by the system. Underlined topics are hyperlinks. Clicking
such a hyperlink, the user implicitly submits a new query regarding the underlined
topic. For example, clicking on “David” will generate a new précis regarding this
sculpture. Hyperlinks are defined on heading attributes of relations.
Although extensive testing of the system with a large number of users has not
taken place yet, a small number of people have used the system to search for
pre-selected topics as well as topics of their own interest and reported their experience. This
has indicated the following:
− The précis query answering paradigm allows users with little or no knowledge of
the application domain schema, to quickly and easily gain an understanding of the
information space.
− Naïve users find précis answers to be user-friendly and feel encouraged to use the
system more.
Fig. 4. Example précis query
− By providing précis of information as answers and hyperlinks inside these answers,
the system encourages users to get involved in a continuous search-and-learn
process.
5 Conclusions and Future Work
We have described the design, prototyping and evaluation of a précis query answering
system with the following characteristics: (a) support of a keyword-based search
interface for accessing the contents of the underlying collection, (b) generation of a
logical subset of the database that answers the query, which contains not only items
directly related to the query selections but also items implicitly related to them in
various ways, (c) personalization of the logical subset generated and hence the précis
returned according to the needs and preferences of the user as a member of a group of
users, and (d) translation of the structured output of a précis query into a synthesis of
results. The output is an English presentation of short factual information précis. As
far as future work is concerned, we are interested in implementing a module for
learning précis patterns based on logs of queries that domain users have issued in the
past. In a similar line of research, we would like to allow users to provide feedback
regarding the answers they receive. Then, user feedback will be used to modify précis
patterns. Another challenge will be the extension of the translator to cover answers to
more complex queries. Finally, we are working towards the further optimization of
various modules of the system.
References
1. S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search
over relational databases. In ICDE, pp. 5-16, 2002.
2. I. Androutsopoulos, G.D. Ritchie, and P. Thanisch. Natural Language Interfaces to
Databases - An Introduction. NL Eng., 1(1), pp. 29-81, 1995.
3. G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching
and browsing in databases using BANKS. In ICDE, pp. 431-440, 2002.
4. A. Dusterhoft, and B. Thalheim. Linguistic based search facilities in snowflake-like
database schemes. DKE, 48, pp. 177-198, 2004.
5. D. Florescu, D. Kossmann, and I. Manolescu. Integrating keyword search into XML query
processing. Computer Networks, 33(1-6), 2000.
6. L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRank: Ranked keyword search
over XML documents. In SIGMOD, pp. 16-27, 2003.
7. L. R. Harris. User-Oriented Data Base Query with the ROBOT Natural Language Query
System. VLDB 1977: 303-312.
8. V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over
relational databases. In VLDB, pp. 850-861, 2003.
9. V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML
graphs. In ICDE, pp. 367-378, 2003.
10. IBM. DB2 Text Information Extender. url: www.ibm.com/software/data/db2/extender/
11. G. Koutrika, A. Simitsis, and Y. Ioannidis. Précis: The essence of a query answer. In
ICDE, 2006.
12. G. Marchionini. Interfaces for End-User Information Seeking. J. of the American Society
for Inf. Sci., 43(2), 156-163, 1992.
13. U. Masermann, and G. Vossen. Design and implementation of a novel approach to
keyword searching in relational databases. In ADBIS-DASFAA, pp. 171-184, 2000.
14. E. Metais, J. Meunier, and G. Levreau. Database Schema Design: A Perspective from
Natural Language Techniques to Validation and View Integration. In ER, pp. 190-205,
15. E. Metais. Enhancing information systems management with natural language processing
techniques. DKE, 41, pp. 247-272, 2002.
16. Microsoft. SQL Server 2000. url: http://msdn.microsoft.com/library/.
17. M. Minock. A Phrasal Approach to Natural Language Interfaces over Databases. In NLDB,
pp. 181-191, 2005.
18. A. Motro. Constructing queries from tokens. In SIGMOD, pp. 120-131, 1986.
19. Oracle. Oracle 9i Text. url: www.oracle.com/technology/products/text/.
20. A. Simitsis, and G. Koutrika. Pattern-Based Query Answering. In PaRMa, 2006.
21. A. Simitsis, and G. Koutrika. Comprehensible Answers to Précis Queries. In CAiSE, pp.
142-156, 2006.
22. E. Sneiders. Automated Question Answering Using Question Templates That Cover the
Conceptual Model of the Database. In NLDB, pp. 235-239, 2002.
23. V.C. Storey, R.C. Goldstein, H. Ullrich. Naive Semantics to Support Automated Database
Design. IEEE TKDE, 14(1), pp. 1-12, 2002.
24. V.C. Storey. Understanding and Representing Relationship Semantics in Database Design.
In NLDB, pp. 79-90, 2001.
25. A. Toral, E. Noguera, F. Llopis, and R. Munoz. Improving Question Answering Using
Named Entity Recognition. In NLDB, pp. 181-191, 2005.
26. Q. Wang, C. Nass, and J. Hu. Natural Language Query vs. Keyword Search: Effects of
Task Complexity on Search Performance, Participant Perceptions, and Preferences. In
INTERACT, pp. 106-116, 2005.
The TIP/Greenstone Bridge:
A Service for Mobile Location-Based Access to
Digital Libraries
Annika Hinze, Xin Gao, and David Bainbridge
University of Waikato, New Zealand
{a.hinze, xg10, d.bainbridge}@cs.waikato.ac.nz
Abstract. This paper introduces the first combination of a mobile tourist
guide with a digital library. Location-based search allows for access to a rich
set of materials with cross references between different digital library collections and the tourist information system. The paper introduces the system’s
design and implementation; it also gives details about the user interface and
interactions, and derives a general set of requirements through a discussion
of related work.
Keywords: Digital Libraries, Tourist information, mobile system,
1 Introduction
Digital Libraries provide valuable information for many aspects of people’s lives
that are often connected to certain locations. Examples are maps, newspaper
articles, detailed information about sights and places all over the world. Typically, whenever people are at the location in question, computers are not close
by to access the abundant information. In contrast, mobile tourist information
systems give access to well-formatted data regarding certain sights, or information about travel routes. The abundance of information in history books or art
catalogues has so far been (to all intents and purposes) excluded from these kinds
of systems.
This paper introduces the first known combination of a mobile tourist guide
with a digital library. Location-based search allows access to a set of rich materials with cross references between different digital library collections and the
tourist information system. The intention of the hybrid system is that a user,
traveling with an internet-connected mobile device such as a pocketPC, is actively
presented information based on their location (automatically detected through
GPS) and a user profile that records their interests (such as architectural follies);
when a passing detail seems particularly pertinent or piques their interest (hopefully the norm rather than the exception, otherwise the information the tourism
system is providing is not useful) the user seamlessly taps into the “deeper” resources managed by the digital library that can better satisfy their quest for both
more details and related information. Usage scenarios for location-based access
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 99–110, 2006.
© Springer-Verlag Berlin Heidelberg 2006
A. Hinze, X. Gao, and D. Bainbridge
Fig. 1. TIP: location-based personalised information delivery
to digital library collections include looking up the pictures of van Gogh while
being at their locations of origin in France, comparing old postcard photographs
with current buildings, reading Goethe while traveling Italy.
Due to their open source nature and the authors’ familiarity with the software, TIP [6] and Greenstone [13] are used as the foundations of the hybrid
system. The paper commences with a brief review of these two projects, emphasising aspects pertinent to the work at hand. Next we discuss the challenges of
a location-based bridge between the two systems (Section 2). We subsequently
show the TIP/Greenstone service in use (Section 3). In Sections 4 and 5, we
give details of the service’s design and architecture, respectively. Related work
is discussed in Section 6. Our paper concludes with a summary and an outlook
to future research.
This section describes the foundations of the TIP project and the Greenstone
project that are necessary for this paper.
TIP Core
The Tourist Information Provider (TIP) System delivers location-based information to mobile users. The information delivered is based on a user’s context,
such as their current location, their interest in particular semantic groups of
sights and topics, and their travel history. Semantic groups and topics are captured in a user’s profile. Examples of groups of sights are public art, buildings,
or beaches; topics may be history or architecture. The travel history of a user
includes the locations/sights that the user visited and the user’s feedback about
these sights.
Figure 1 shows the TIP standard interface in a mobile emulator. The user is
at the University of Waikato. Their profile is groups={buildings; parks}; topics
= {architecture; history}. The university is displayed as a building close to the
user’s current position. In addition to the core functionality, TIP supports several
services such as recommendations and travel navigation on maps (for details
see [6]). The TIP system combines an event-based infrastructure and location-based service for dynamic information delivery. The heart of the system is a
filter engine cooperating with a location engine. The filter engine selects the
appropriate information from the different source databases based on the user
and sight context. Changes in the user’s location are transmitted to the TIP
server, where they are treated as events that have to be filtered. For the filtering,
the sight context and the user context are taken into account. The location
engine provides geo-spatial functions, such as geo-coding, reverse geo-coding, and
proximity search. For places that are currently of interest, the system delivers
sight-related information.
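As an illustration of the kind of proximity search the location engine provides, the following sketch uses the standard haversine formula; the coordinates and radius are approximate and purely illustrative, not taken from TIP's data:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Toy sight database with approximate coordinates.
sights = {"University of Waikato": (-37.7870, 175.3160),
          "Hamilton Gardens":      (-37.8060, 175.3080)}
user = (-37.7878, 175.3180)

# Deliver only sights within a 1 km radius of the user's current position.
nearby = [s for s, (la, lo) in sights.items()
          if haversine_km(user[0], user[1], la, lo) < 1.0]
print(nearby)  # → ['University of Waikato']
```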
Greenstone is a versatile open source digital library toolkit [13]. Countless digital
libraries have been formed with it since its release on SourceForge in 2000: from
historic newspapers to books on humanitarian aid; from eclectic multimedia content on pop-artists to curated first editions of works by Chopin; from scientific
institutional repositories to personal collections of photos and other document
formats. All manner of topics are covered—the black abolitionist movement,
bridge construction, flora and fauna, the history of the Indian working class,
medical artwork, and shipping statistics are just a random selection. All manner
of document formats are covered, including: HTML, PDF, Word, PowerPoint,
and Excel; MARC, Refer, Dublin Core, LOM (Learning Object Metadata) and
BibTeX metadata formats; as well as a variety of image, audio, and video formats. It also supports numerous standards including OAI-PMH (Open Archives
Initiative Protocol for Metadata Harvesting), Z39.50 and METS to assist interoperability. See www.greenstone.org for more details.
For the pattern of use needed by this project Greenstone is an ideal vehicle,
providing a rapid way to “tera-form” a haphazard mass of digital resources into
an organized, manageable collection accessible through browsing structures as
well as direct access through fielded and full-text searching. The end result is
shaped by a configuration process. Of particular relevance here to the hybrid
system is the algorithmic role metadata plays in shaping this process through
its storage, access and ultimately presentation. Later we will see this aspect of
Greenstone exploited—seeded through a gazetteer—for identifying and marking
up place names.
Fig. 2. Overview of example interaction with TIP/Greenstone
In addition to the production-level version of the software used for the cited
examples above (known as Greenstone 2), for exploratory purposes a digital
library framework that utilises web services is available for research-based work.
Called Greenstone 3 and backwards compatible, this forms the digital library
code base used for the hybrid system. In particular it allows for the fine-grained
interaction necessary for the required functionality, as evidenced through the
worked example in the next section.
Usage Scenarios and Interface
Here we demonstrate usage in a typical interaction scenario with the TIP/Greenstone Service: the user travels to a location using the TIP system and then accesses
more detailed information about this location from the digital library. An overview
of the interactions for the example Gardens in Hamilton is given in Figure 2.
For this usage scenario, we follow a user Kate who visits the Waikato area in
New Zealand. She uses TIP with the TIP/Greenstone Service when she arrives
in the Waikato, near the university. Her initial view is that shown in Figure 1.
(For clarity, we will show further TIP screens using screenshots taken from a
browser window in Figure 3.) Kate then decides to look up the digital library
collections that refer to her current location. When she switches to the page
of TIP/Greenstone Service, the system will display regions and places that are
nearby that she might want to search for in the collection repository provided
by Greenstone. This is necessary as her location is the University of Waikato,
close to the river Waikato, in the region Waikato, in New Zealand, on the north
island, etc. All these locations could be used to search the library and the user
can guide the selection. This step is referred to in the overview as Step 1 (see
Figure 2). Based on Kate’s selection, the system triggers a location-based search
in DL collections. The user is presented with a list of all collections that refer to
the selected region. (We will give the details about the system internal design
for this step later).
After selecting the region ‘Hamilton’ in Step 2, Kate has the choice between
the Hamilton postcard collection, the Waikato newspaper collection and the
Plant & Garden collection; she selects the Plant & Garden collection (Step 3).
Amongst others, this collection contains references to the local Hamilton Gardens. Within the collection, Kate selects a document about the Chinese Garden
in Hamilton (Step 5). Figure 3(a) shows the Greenstone interface with the Plant
& Garden collection. Kate chose to indicate all place names in the document:
a special feature of the TIP/Greenstone service. All words that are recognised
as place names are highlighted. Kate can direct the focus for these highlights to
particular countries (see Figure 3(b)).
Kate can now decide to look up the highlighted places in TIP or in Greenstone.
This link to TIP or to other programs is reached via a pull-down menu shown in
Figure 3(c). The menu displays links only to existing data pages/documents. Different background colors indicate different target programs. One of the options
is to display information about the places from the geo-gazetteer (a geographical database of place names worldwide). The gazetteer provides information
about location, population, province, country. It displays this information in the
location-context of the document, i.e., only locations within the selected country are displayed. Figure 3(d) shows the information about ‘Hamilton’ when
selecting ‘New Zealand’—showing only 2 of the 26 Hamiltons worldwide in the
geo-gazetteer, the second referring to the wider conurbation that strays outside
the designated city boundary.
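The country-filtered gazetteer lookup described above can be sketched as follows; the entries are invented toy data, not the actual geo-gazetteer contents:

```python
# Toy gazetteer: a list of place records, several sharing the name "Hamilton".
gazetteer = [
    {"name": "Hamilton", "country": "New Zealand", "province": "Waikato"},
    {"name": "Hamilton", "country": "Canada", "province": "Ontario"},
    {"name": "Hamilton", "country": "New Zealand", "province": "Waikato (conurbation)"},
]

def lookup(name, country=None):
    """Return gazetteer entries for a place name, optionally restricted to the
    country selected as the document's location context."""
    return [e for e in gazetteer
            if e["name"] == name and (country is None or e["country"] == country)]

print(len(lookup("Hamilton")))                 # all Hamiltons in the toy data
print(len(lookup("Hamilton", "New Zealand")))  # only those in the selected country
```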
The TIP/Greenstone service connects TIP’s service communication layer (in the same manner as other TIP services) to Greenstone, where it acts as a third-party application.
Figure 4 shows the position of the service between TIP and Greenstone.
The communication with TIP is currently handled via TCP/IP and HTTP.
Greenstone provides a SOAP interface that allows communication
A. Hinze, X. Gao, and D. Bainbridge
Fig. 3. Overview of example interaction with TIP/Greenstone: (a) highlighted text (context world); (b) context selection (context New Zealand); (c) back link pop-up (context world); (d) gazetteer (context New Zealand)
to take place between its agents as well as with third-party programs; consequently the TIP/Greenstone service uses this protocol to connect to Greenstone's message router, thereby gaining access to the full topology of a particular instantiation of a collection server. To initiate a fielded search based on
location, for example, the user’s profile information and current location are
translated into XML format to call the Text-Query Service of the Greenstone
collections. Search results are also returned in XML format, which is then translated into HTML for presentation to the user.
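The translation of a user's location into an XML call on a collection's Text-Query service might look like the following sketch. The element and parameter names here are assumptions, not the actual Greenstone message schema:

```python
# Sketch of building an XML fielded-search request from a user's current
# location. Element/attribute names are illustrative assumptions only.
import xml.etree.ElementTree as ET

def build_text_query(collection, location, max_docs=10):
    """Wrap a location term as an XML request to a collection's TextQuery."""
    req = ET.Element("request", {"to": f"{collection}/TextQuery"})
    params = ET.SubElement(req, "paramList")
    ET.SubElement(params, "param", {"name": "query", "value": location})
    ET.SubElement(params, "param", {"name": "maxDocs", "value": str(max_docs)})
    return ET.tostring(req, encoding="unicode")
```

The XML response from the collection would then be transformed to HTML before being shown to the user.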
For the interaction with users, Greenstone by default uses HTML pages to present the responses of the underlying services. A user's interaction with a web page initiates a data transfer to the Library Servlet. The
The TIP/Greenstone Bridge: A Service for Mobile Location-Based Access
Fig. 4. Architecture of TIP and Greenstone with the TIP/Greenstone Bridge (the figure shows the GPS location source, program logic, event-based communication, the SOAP interface, the message-based communication router, the Library Servlet, and the Greenstone Receptionist)
Library Servlet translates the information into XML format and forwards it
to the Receptionist. The Receptionist is an agent responsible for knowing
which channels of communication to use to satisfy a request and, upon the return of XML-borne messages, how to merge and present the information to the
user. The TIP/Greenstone Service performs similar interactions between users
and Greenstone, except it does not work through the receptionist. Instead its
contact point, as mentioned above, is the message router, through which it can
mesh with the array of services on offer. Effectively the TIP/Greenstone Service
takes on the role of receptionist (in Greenstone terminology)—factoring in information retrieved from a user's TIP profile and current location—deciding which channels of communication to use and how to merge and present the resulting information.
Detailed Design
Handling location-based documents in a mobile setting falls into two phases:
preparation of documents and retrieval. The steps are described in detail in the
subsequent paragraphs. Figure 5 gives an overview of the phases.
Preprocessing: Location identification
The pre-processing phase locates place names in the documents of a collection
and enhances the documents. To recognize those place names that contain more
than one word in the gazetteer and TIP system, a place name window has been
designed. Figure 6 shows an example of how a place name window works.
The documents are first analysed for their textual content—all HTML markup
is ignored. All remaining words are analysed using the place name window and
Fig. 5. Two phases of location-based handling of DL documents: (a) preparation and indexing of documents (pre-processing, GS Build, GS collection and GS index); (b) location-based retrieval and display (GS search and browse, scoped by user location and context)
Fig. 6. A sliding place name window to identify composite and nested place names, illustrated on the example sentence 'The University of Waikato is in Hamilton, which is a beautiful city.'
the place name validator of the gazetteer and TIP. The array of words in the current place name window defines the set of words to be analysed. The
initial size of the place name window can be set by the user; it is currently set
to five words. By changing the size of the sliding window, place names that are
nested within longer place names (‘Waikato’ within ‘University of Waikato’) are
also recognised. The validator returns the name of the country in the gazetteer
and/or the site in TIP.
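The sliding-window analysis described above can be sketched as follows. The validator here is a stand-in set of known names rather than the real gazetteer/TIP lookup, and the tokenisation is simplified:

```python
# Sketch of the sliding place-name window: from each starting word, the
# window shrinks from its maximum size (five words) down to one, and every
# sub-span is checked against a validator. This finds nested names such as
# 'Waikato' inside 'University of Waikato'.
KNOWN_PLACES = {"University of Waikato", "Waikato", "Hamilton"}  # stand-in validator

def find_place_names(text, window=5):
    words = text.replace(",", "").replace(".", "").split()
    found = set()
    for start in range(len(words)):
        for size in range(window, 0, -1):  # shrink the window word by word
            candidate = " ".join(words[start:start + size])
            if candidate in KNOWN_PLACES:
                found.add(candidate)
    return found
```

On the example sentence of Figure 6 this recognises 'University of Waikato', the nested 'Waikato', and 'Hamilton'.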
The preprocessor marks up the documents with location-based information (in
the form of JavaScript) that provides the multiway hyperlink seen in Figure 3(c).
The first parameter includes all the place names found in the longest place name.
For instance, if the place name is New York, then the parameter will be York and
New York. The second parameter refers to the countries. The third parameter
is a reference for the place name in the TIP system. The next one is the ID for
the gazetteer.
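A hypothetical sketch of this markup step follows. The JavaScript function name (`showPlaceMenu`) and the exact parameter layout are assumptions; only the four parameters themselves are taken from the description above:

```python
# Sketch of the preprocessor markup: each recognised place is wrapped in a
# JavaScript call carrying (1) all nested names within the longest match,
# (2) the candidate countries, (3) a TIP reference, and (4) a gazetteer ID.
# 'showPlaceMenu' and the attribute layout are illustrative assumptions.
def mark_up_place(longest_name, nested_names, countries, tip_ref, gaz_id):
    names = ";".join(nested_names)
    ctys = ";".join(countries)
    return (f'<span onclick="showPlaceMenu(\'{names}\', \'{ctys}\', '
            f'\'{tip_ref}\', \'{gaz_id}\')">{longest_name}</span>')
```

For the place name 'New York', the first parameter would carry both 'York' and 'New York', as in the example above.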
DL Collections for Location-based access
Collections that are used by the TIP/Greenstone Service are stored in the digital
library using the standard storage facility of Greenstone. Currently, the collections in the DL need to be built in a particular way to interoperate with TIP.
More specifically, existing collections have to be pre-processed before they can enter the standard build process for collections. The preprocessor assigns location
information to each occurrence of a place name in the collection’s documents. Location information contains details on related countries or coordinates. This data
is encoded within the collection's HTML documents using JavaScript. Currently,
only collections with HTML documents are supported in the TIP/Greenstone
bridge. After the collections have been built by Greenstone they will be kept as
normal Greenstone collections. An improvement would be to integrate the preprocessing step into the main building code, but this was not done due to time
constraints. By delaying the application of place name metadata markup to the
point where the internal document model (text and structure) has been built,
the technique would apply equally to the wide range of textual formats
handled by Greenstone cited in Section 2.
Location-based search
Special features have been implemented for location-based search and location
highlighting. Location information is obtained from a geo-gazetteer stored in
the TIP database. The information for the gazetteer has been imported from
www.world-gazetteer.com. The TIP/Greenstone Service has access to the PostgreSQL database used in TIP, in addition to the information in the gazetteer, to
store information about locations. The TIP database is used in this service to
load the information about the place rather than using the information in the
gazetteer. The coordinates of places are queried in the TIP database to calculate
the nearest place in the gazetteer. TIP uses the PostGIS spatial extension which
supports spatial querying.
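A nearest-place query of the kind PostGIS supports might be constructed as in the sketch below. The table and column names (`gazetteer`, `geom`) are assumptions, not the actual TIP schema; `ST_Distance` and `ST_MakePoint` are standard PostGIS functions:

```python
# Sketch of a nearest-place query against a PostGIS-enabled database.
# Table/column names are illustrative; ordering by distance to the given
# coordinates returns the closest gazetteer entry first.
def nearest_place_sql(lon, lat, limit=1):
    return (
        "SELECT name, country "
        "FROM gazetteer "
        f"ORDER BY ST_Distance(geom, ST_MakePoint({lon}, {lat})) "
        f"LIMIT {limit};"
    )
```

In practice the coordinates would come from the TIP database and the query would run against the spatially indexed gazetteer table.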
All accessed documents and collections from the digital library are filtered
according to the location-restriction. Additional restrictions on countries help
identify appropriate place names as a large number of English words can also
be identified as place names (e.g., ‘by’, which is a village in France). In addition
and on request, stop words are excluded from the location filter. Finally, the
filter introduces hyperlinks into the document that allow for references to TIP,
the gazetteer, and Greenstone. Original links are preserved.
Location-based presentation
The Presentation-Filter component uses the location markup (JavaScript in
the documents) to highlight places that are within the current scope of the user.
To determine the current user scope, the user is offered a drop-down box of all
countries that appear in the text. The user can then select the current scope.
Place names are highlighted and hyperlinks are added that link to related pages
in TIP, Greenstone, and the Gazetteer. This list of hyperlinks to different
target services is accessible on left-click in a pop-up menu using dynamic HTML
to effectively give web pages multiway hyperlinks. References to other services
can be easily added. Original hyperlinks in the document need to be treated with
care: they are still visually indicated on the page as hyperlinks. In addition, the
list of hyperlinks contains the phrase ‘original link’ which then refers to the
original target of the document’s hyperlink. If a nested place name (a name
within another name) is not within the current location scope, the longer name
is highlighted but the shorter (nested) one is removed from the hyperlink list.
For further implementation details and a first-cut performance study see [3].
Discussion of Related Work
To the best of our knowledge, no combination of a digital library with a mobile
tourist system has been developed previously. Mobile tourist information systems
focus mainly on providing (short) structured information and recommendations.
Example systems are AccesSights [7], CATIS [10], CRUMPET [11], CyberGuide [1], Guide [2], and Gulliver's Genie [9]. This lack of location-based access to rich sources was our motivation for combining the tourist information
system with a digital library resource. It is also the case that most research in
the area of electronic guides has focused on indoor users.
In considering the digital library aspect to the work and how it relates to
other digital library projects, we identify the following key requirements for the
hybrid system:
1. bi-directional interoperability
2. geographical "aware" digital library
3. generic collection building
4. extensible presentation manipulation
5. markup restructuring
6. fine-grained interaction
7. fielded searching
Bi-directional interoperability allows the two sub-systems to be able to work in
unison, and some shared notion of geographical information is needed in which
to have a meaningful exchange. To populate the digital library resource it is
necessary to have a digital library system with a workflow that includes flexible indexing and browsing controls (generic collection building), to allow the
terra-forming of initially formless source documents. To provide the representative functionality shown in the worked example (Section 3), control at the
presentation-level is required that is context based and includes fine-grained interaction with sub-systems to garner the necessary information and the filtering
and restructuring of markup.
The related issue of spatial searches for place names and locations has been addressed in digital libraries (for example, in the Alexandria digital library project
[4, 12]). Access to geo-spatial data is typically given in Geo-Information Systems
(GIS), which use spatial databases for storage. A spatial digital library with a GIS
viewing tool has also been proposed [5]. In these systems, the focus lies on the
geo-spatial features and their presentation on maps or in structured form. Rich
documents with ingrained location information are not the focus. Nor do these
projects have the ability to be installed by third party developers and populated
with their own content.
Contemporary digital library architectures such as Fedora [8] that include a
protocol as part of their design satisfy the requirement for fine-grained interaction, and given that the de facto standard for digital libraries is presentation in a web
browser there is a variety of ways manipulation of presentation through restructuring markup etc. can be achieved, for instance the dynamic HTML approach
deployed in our solution. Fedora, however, only digests one canonical format of
document, so does not meet the generic building requirement without an extra
preprocessing step. Like the preprocessing step we added to Greenstone to augment it with place name metadata, this would be straightforward to do; indeed,
the same metadata place name enhancement would be required also.
In conclusion, this paper has described a bridge between a mobile tourist information system and a digital library. This bridge service allows for location-based
access of documents in the digital library. We explored the usage of the proposed
hybrid system through a worked example. We gave an overview of the architecture and details of the implementation design.
Our system is an example of the trend through which digital libraries are
integrated into co-operating information systems. Moreover, this work represents a focused case-study of how digital library systems are being applied to
our increasingly mobile IT environments, and our experiences with the project
encourage us to pursue further research towards interoperability between TIP
and Greenstone. Although the work is centered on these two systems, through
an analysis of general requirements we have outlined the key attributes necessary for developing co-operative hybrid information systems that combine mobile
location information with digital libraries.
1. G. Abowd, C. Atkeson, J. Hong, S. Long, R. Kooper, and M. Pinkerton. Cyberguide: A mobile context-aware tour guide. ACM Wireless Networks, 3:421–433,
2. K. Cheverst, K. Mitchell, and N. Davies. The role of adaptive hypermedia in a
context-aware tourist guide. Communications of the ACM, 45(5):47–51, 2002.
3. X. Gao, A. Hinze, and D. Bainbridge. Design and implementation of Greenstone
service in a mobile tourist information system. Technical Report X/2006, University of Waikato, March 2006.
4. M. Goodchild. The Alexandria digital library project. D-Lib Magazine, 10, 2004.
5. P. Hartnett and M. Bertolotto. Gisviewer: A web-based geo-spatial digital library.
In Proceedings of the 5th International Workshop on Database and Expert Systems
Applications (DEXA 2004), 30 August - 3 September 2004, Zaragoza, Spain, pages
856–860, 2004.
6. A. Hinze and G. Buchanan. The challenge of creating cooperating mobile services:
Experiences and lessons learned. In V. Estivill-Castro and G. Dobbie, editors,
Twenty-Ninth Australasian Computer Science Conference (ACSC 2006), volume 48
of CRPIT, pages 207–215, Hobart, Australia, 2006. ACS.
7. P. Klante, J. Krösche, and S. Boll. AccesSights – a multimodal location-aware mobile
tourist information system. In Proceedings of the 9th Int. Conf. on Computers
Helping People with Special Needs (ICCHP’2004), Paris, France, July 2004.
8. C. Lagoze, S. Payette, E. Shin, and C. Wilper. Fedora: An architecture for complex
objects and their relationships. Journal of Digital Libraries, 2005. Special Issue
on Complex Objects.
9. G. O’Hare and M. O’Grady. Gulliver’s genie: A multi-agent system for ubiquitous
and intelligent content delivery. Computer Communications, 26(11):1177–1187,
10. A. Pashtan, R. Blattler, and A. Heusser. Catis: A context-aware tourist information
system. In Proceedings of the 4th International Workshop of Mobile Computing,
Rostock, Germany, 2003.
11. S. Poslad, H. Laamanen, R. Malaka, A. Nick, P. Buckle, and A. Zipf. CRUMPET:
Creation of user-friendly mobile services personalised for tourism. In Proc. 3G2001
Mobile Communication Technologies, London, U.K., Mar. 2001.
12. T. R. Smith, G. Janee, J. Frew, and A. Coleman. The Alexandria digital earth
prototype. In ACM/IEEE Joint Conference on Digital Libraries, pages 118–119,
13. I. H. Witten and D. Bainbridge. How to Build a Digital Library. Elsevier Science
Inc., 2002.
Towards Next Generation CiteSeer:
A Flexible Architecture for Digital Library
I.G. Councill¹, C.L. Giles¹, E. Di Iorio², M. Gori², M. Maggini², and A. Pucci²
¹ School of Information Sciences and Technology, The Pennsylvania State University,
332 IST Building, University Park, PA 16802
{icouncil, giles}@ist.psu.edu
² Dipartimento di Ingegneria dell’Informazione, University of Siena,
Via Roma, 56, Siena, Italy
{diiorio, marco, maggini, augusto}@dii.unisi.it
Abstract. CiteSeer began as the first search engine for scientific literature to incorporate Autonomous Citation Indexing, and has since grown
to be a well-used, open archive for computer and information science publications, currently indexing over 730,000 academic documents. However,
CiteSeer currently faces significant challenges that must be overcome in
order to improve the quality of the service and guarantee that CiteSeer will continue to be a valuable, up-to-date resource well into the
foreseeable future. This paper describes a new architectural framework
for CiteSeer system deployment, named CiteSeer Plus. The new framework supports distributed indexing and storage for load balancing and
fault-tolerance as well as modular service deployment to increase system
flexibility and reduce maintenance costs. In order to facilitate novel approaches to information extraction, a blackboard framework is built into
the architecture.
The World Wide Web has become a staple resource for locating and publishing
scientific information. Several specialized search engines have been developed
to increase access to scientific literature including publisher portals such as the
ACM Portal1 and IEEE Xplore2 as well as other academic and commercial sites
including Google Scholar3. A key feature common to advanced scientific
search applications is citation indexing [3]. Many popular commercial search
services rely on manual information extraction in order to build citation indexes;
however, the labor involved is costly. Autonomous citation indexing (ACI) [4]
has emerged as an alternative to manual data extraction and has proven to
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 111–122, 2006.
© Springer-Verlag Berlin Heidelberg 2006
be successful despite some loss of data accuracy. Additionally, the ACI model
has traditionally been coupled with autonomous or semi-autonomous content
acquisition. In this approach, focused crawlers are developed to harvest the web
for specific types of documents, in this case academic research documents, in
order to organize distributed web content within a single repository. Automatic
content acquisition is particularly useful for organizing literature that would
otherwise be difficult to locate via general search engines [8].
CiteSeer [4] emerged as one of the first focused search engines to freely provide
academic papers, technical reports, and pre-prints, and is also the first example
of a working ACI system. CiteSeer consists of three basic components: a focused
crawler or harvester, the document archive and specialized index, and the query
interface. The focused spider or harvester crawls the web for relevant documents
in PDF and PostScript formats. After filtering crawled documents for academic
documents, these are then indexed using autonomous citation indexing, which
automatically links references in research articles to facilitate navigation and
evaluation. Automatic extraction of the context of citations allows researchers
to determine the contributions of a given research article quickly and easily; and
several advanced methods are employed to locate related research based on citations, text, and usage information. Additional document metadata is extracted
from each document including titles, author lists, abstracts and reference lists,
as well as the more recent addition of author information such as affiliations and
contact information [6] as well as acknowledgement information [5]. CiteSeer is
a full text search engine with an interface that permits search by document or
by numbers of citations or fielded searching, not currently possible on general-purpose web search engines.
CiteSeer has proven its usefulness to the computer and information science
communities. The CiteSeer installation at Penn State University4 currently receives over one million requests and serves over 25 GB of information daily. The
CiteSeer service is currently being made more available to the world community
through the advent of several mirrors. At the time of this writing there are CiteSeer mirrors hosted at MIT, Switzerland, Canada, England, Italy, and Singapore
in various stages of completion. However, CiteSeer currently faces significant
challenges of interoperability and scalability that must be overcome in order to
improve the quality of the services provided and to guarantee that CiteSeer will
continue to be a valuable, up-to-date resource well into the foreseeable future.
The current architecture of the CiteSeer application is monolithic, making
system maintenance and extension costly. Internal system components are not
based on any established standards, such that all interoperability features incorporated have necessarily been crafted as wrappers to exposed functionality.
The resulting lack of integration reduces the potential of CiteSeer to serve the research community. Additionally, as the CiteSeer collection grows (to over 730,000
documents as of the time of this writing), query latencies are rising and document updates are becoming increasingly cumbersome as the system pushes the
boundaries of its current architecture.
Recently, other ACI-enabled search engines for scientific literature have been
developed, including Google Scholar. Although Google Scholar indexes at least
an order of magnitude more documents than CiteSeer, CiteSeer remains competitive as an open archive and offers more features. A separate effort that has
shown much promise is OverCite, a re-implementation of CiteSeer within a peer-to-peer architecture based on distributed hash tables [15].
In this paper we present our own re-invention of CiteSeer, currently named
CiteSeer Plus. This work builds on a previous architectural proposal for digital libraries [13]. CiteSeer Plus is based upon a new architecture designed to
be flexible, modular, and scalable. As CiteSeer is currently operated within an
academic environment with a focus on research as well as production, we have
developed a framework that allows scalable, distributed search and storage while
easing deployment of novel and improved algorithms for information extraction
as well as entirely new service features.
The resulting architecture is oriented toward a collection of deployed services
instead of a traditional web search engine approach. Each service component
can be treated as a stand-alone application or as part of a larger service context.
Users and programs can interact directly with individual services or with the
entire system through web-based service front-ends such as a traditional search
engine interface, consistent with ideas emerging from Web 2.0 [11].
Project Goals
Flexibility. CiteSeer’s current monolithic architecture limits the extensibility of
the system. Information extraction routines are difficult to upgrade or change
since they are tightly coupled with other system components. Not only does this
cause maintenance difficulty, but it also limits the potential scope of the CiteSeer
system. Adopting a highly modular service-oriented architecture will make the
system more easily extendable with new services and more adaptable to different content domains. This is a core requirement for a next-generation CiteSeer
service. Although an API has been developed for the existing CiteSeer [13], the
API does not expose internal system functionality that is needed for a powerful
extension environment. To alleviate this problem, each service module should
carry its own API. This will allow service extensions to combine components in
a flexible manner without incurring the overhead of refactoring existing code,
and will allow the system to be more easily extensible to novel content domains.
Performance. A next-generation CiteSeer system must show improvements
over the current system in terms of both query processing and update performance. Due to the current indexing and database framework, CiteSeer shows
significant performance degradation when handling more than five simultaneous
queries. Traffic spikes often account for more than 30 simultaneous queries and
as many as 130 simultaneous connections have been observed. The resulting performance drop often limits the query response times to well below acceptable
standards, in many cases turning users away outright. The new system should be
able to handle at least 30 simultaneous queries without significant performance
degradation. In addition, CiteSeer currently indexes no more than 3-4 papers per
minute, resulting in poor speed for acquiring new content. The update processes
are large batch operations that typically take three days for every two weeks
of content acquisition. To improve the freshness of information in the repository, it is desirable for a next-generation CiteSeer architecture to handle content
updates quickly in an iterative process, so new content can be made available
immediately after acquisition.
Distributed Operation. Although CiteSeer is currently implemented as a collection of processes that interoperate over network sockets, the architecture does
not currently support redundant service deployment. This situation is mitigated
through the use of Linux Virtual Server for service load balancing and fail-over;
however, this increases maintenance demands and does not support distributed
operation in a WAN environment. There is no support for propagating updates
to mirrors without large file copies containing much redundant information. The
new system should be natively capable of distributed operation with no single
point of failure and should be easily extendable to support incremental updates
over a WAN deployment.
System Features and Architecture
This section details the features supported by the CiteSeer Plus framework as
well as its architecture. CiteSeer Plus is designed to be a flexible platform for
digital library development and deployment, supporting standard digital library
features as well as plugins for advanced automation. In keeping with the goals
presented in Section 2, the feature set is expandable based on application or
domain requirements and the user interface to the application is arbitrary, to be
built on top of a rich system API. An experimental prototype of a CiteSeer Plus
deployment is publicly available5 .
The CiteSeer Plus system architecture is highly modular. In the following
sections every module is presented and module interactions are discussed. The
system architecture is organized in four logical levels as shown in Figure 1.
The Source Level contains document files and associated data. The Core Level
contains the central part of the system in which document and query processing
occurs. The Interface Level offers interface functions to allow the communication
between the Core Level and services that can be developed using CiteSeer Plus
(in the Service Level). This level is implemented as a collection of Web Services.
Finally, the Service Level contains every service that is running on top of the
CiteSeer Plus system.
Figure 2 maps the levels to the actual system architecture. At the Core Level
are the sets of master and slave indexing nodes. These sets contain redundant
indexing nodes tailored for specific tasks within the CiteSeer Plus system, and
are the fundamental processing nodes. A single node is made of different subcomponents. Figure 3 shows the details of a master indexing node. We can
describe these nodes by following a typical paper lifecycle through an indexing node.
Fig. 1. Logical levels
Fig. 2. System architecture overview
The system is agnostic regarding the method of content acquisition. New
content may be harvested by a crawler, received from an external library, or
submitted by users, so long as documents are posted to the system via a supported acquisition interface. Once a paper has been received it is stored in the
PDF cache to guarantee persistence of a document in the original format, then
submitted to a document processing workflow for integration into the system
data. The paper encounters a PDF parser whose duty is to extract text from
the original file and produce a new XML-based representation. This new document contains text and some layout information such as text alignment, size,
style, etc. Next the raw XML file enters the metadata extraction subsystem.
This subsystem is composed of several modules, including a BlackBoard Engine
that is used to run a pool of experts (shown as EXP 1, EXP 2, . . . , EXP N in
Figure 3) that cooperate to extract information from the document. This process
is presented in more detail in Section 5. This process outputs an XML document
that contains all tagged metadata.
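The blackboard pattern described above can be sketched as follows. The two experts here (and the assumption that the title is the first line and the authors the second) are purely illustrative; the real system's experts are presented in Section 5:

```python
# Minimal sketch of the blackboard engine: each "expert" inspects the shared
# blackboard and contributes metadata it can extract; the engine loops until
# no expert adds anything new (quiescence). Expert logic is illustrative.
def title_expert(bb):
    if "title" not in bb and bb["raw"]:
        bb["title"] = bb["raw"][0]   # assume the first line is the title
        return True
    return False

def author_expert(bb):
    if "authors" not in bb and "title" in bb:
        bb["authors"] = bb["raw"][1]  # assume the line after the title
        return True
    return False

def run_blackboard(raw_lines, experts):
    bb = {"raw": raw_lines}
    progress = True
    while progress:  # iterate until no expert makes progress
        progress = any(expert(bb) for expert in experts)
    return bb
```

Note that experts can depend on each other's output: the author expert only fires once the title expert has written to the blackboard, which is what the iterate-until-quiescent loop accommodates.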
Finally the paper is ready to be indexed: the labeled XML is stored in the
XML cache (to make it available for later retrieval) and passed to the indexer.
Fig. 3. Indexing node detailed structure
At this point the Query Engine will be able to provide responses to user or
system queries involving the indexed document. Metadata elements are stored
in separate index fields, enabling complex queries to be built according to various
document elements. Every indexing node is able to communicate with the other
system components by exposing a set of methods as a web service. The entire
indexing process takes place in on-line mode, such that a paper entering the
system will enter one or more indexing nodes for immediate consumption by the query engine.
In addition to normal indexing nodes (called master nodes) there are also slave
nodes. Slave nodes are a lighter version of master nodes; their inner structure
is just the same as seen in Figure 3, with the exception that slave nodes do
not maintain any kind of cache (no PDF cache nor XML cache). Furthermore,
their indexes contain only metadata slices (such as title, author, abstract and
citation slices), but they do not contain generic text slices, which support full-text queries. Both master and slave nodes can be deployed redundantly for load
balancing. During initial indexing, a paper can be routed to any number of slave
nodes but must be routed to at least one master node, in order to allow the
system to provide full-text indexing and caching. Slave nodes are provided in
order to support frequent operations such as citation browsing, graph building,
and paper header extraction (a header contains just title, author, abstract and
references) since those operations do not require access to a full-text index. In
this way, performance can be improved by adding new slave nodes that do not
incur large additional storage requirements. Slave nodes can also be used to
support system processing for graph analysis and the generation of statistics
without affecting performance for user queries; however, only a single master
node is needed to run a CiteSeer Plus system.
It is also possible to split the indexes among different machines (in this case
the controller will send a query to all of them and then organize the different
responses received). At the same time, indexes can be redundant; that is, the
same indexes can be managed by different mirror nodes running on different
computers in order to improve system performance through load balancing. In
Figure 4 we show a typical system configuration.
In this deployment we have divided the index into two parts (A and B), so
every time a document is accepted by the system, the controller decides which
subindex will receive the document, such that indexes are balanced. Nodes in
Fig. 4. Example of system deployment
the same node set have the same indexes to support index redundancy. In this
example “MN A” (master node set of subindex A) contains three computers
running three separate and identical master node instances, and “SN A” provides support to “MN A” nodes. In this case “SN A” contains only one slave
node, but, in general, it can be a set of slave nodes. The same configuration is
kept for the “B” (in this case we have “MN B” and “SN B”). In this scenario, if
a user submits a full-text query the controller will route the query to a master
node chosen from the “MN A” set and one from “MN B”, so the system, in this
sample configuration, is able to provide service for up to three concurrent users
just the same as one by sharing the workload among redundant master node
mirrors inside “MN A” and “MN B”. The same situation happens when a query
does not involve a full-text search, but is just referred to metadata indexes. The
only difference in this case is the fact that slave nodes (“SN A” and “SN B”)
will respond to the query instead of master nodes.
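As a rough sketch, the mirror-selection policy described above might look as follows (class and method names are our own illustration; the paper gives no code):

```java
import java.util.*;

// Sketch of controller-side routing over partitioned, mirrored indexes.
// Node sets such as "MN A"/"MN B" become lists of node ids; a round-robin
// pick per subindex spreads concurrent queries across mirrors.
public class QueryRouter {
    private final Map<String, List<String>> nodeSets = new LinkedHashMap<>();
    private final Map<String, Integer> next = new HashMap<>();

    public void addNodeSet(String subindex, List<String> mirrors) {
        nodeSets.put(subindex, mirrors);
        next.put(subindex, 0);
    }

    // Choose one mirror per subindex, cycling so that successive queries
    // land on different mirrors (load balancing across redundant nodes).
    public List<String> route() {
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : nodeSets.entrySet()) {
            List<String> mirrors = e.getValue();
            int i = next.get(e.getKey());
            chosen.add(mirrors.get(i));
            next.put(e.getKey(), (i + 1) % mirrors.size());
        }
        return chosen;
    }
}
```

The controller would then send the query to every chosen node (one per subindex) and merge the partial responses.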
At the Interface Level we find the Middleware, which is the active part of the
external SOAP API. This component converts API methods into procedure calls
on the services provided by the components in the Core Level. The Middleware
contains methods to perform user authentication control in order to determine
whether a system user is authorized to perform the requested operations. The
Middleware also manages the controller threads and performs query and paper
routing in order to maintain consistency in the distributed and redundant sets
of Master and Slave Nodes. Every operation regarding resource distribution and
redundancy is performed in this module.
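A minimal sketch of this dispatch pattern follows; the class, method, and operation names are hypothetical, not the actual CiteSeer Plus API:

```java
import java.util.*;
import java.util.function.Function;

// Sketch of the Middleware dispatch: an external API call is checked
// against the user's authorization and then converted into a procedure
// call on a Core Level component.
public class Middleware {
    private final Map<String, Set<String>> grants = new HashMap<>();        // user -> allowed ops
    private final Map<String, Function<String, String>> core = new HashMap<>(); // op -> core handler

    public void grant(String user, String op) {
        grants.computeIfAbsent(user, k -> new HashSet<>()).add(op);
    }

    public void register(String op, Function<String, String> handler) {
        core.put(op, handler);
    }

    // Authentication control first, then delegation to the core component.
    public String invoke(String user, String op, String arg) {
        if (!grants.getOrDefault(user, Collections.emptySet()).contains(op))
            throw new SecurityException("user not authorized for " + op);
        return core.get(op).apply(arg);
    }
}
```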
Each system component exposes public methods through the SOAP API,
allowing the development of discrete services using the CiteSeer Plus framework.
The Service Level uses the API to define prescribed usage scenarios for the
system and interfaces for user control. This level contains HTML forms and
representations for user and administrative interaction. Some exemplar services
that have been built include tools to add or remove documents and correct
document metadata, deployment configuration tools, and search interfaces for
users (a web application) or programs (via SOAP).
Citation Graph Management
A document citation graph is a directed graph where the nodes correspond to the
documents and edges correspond to citation relationships among the documents.
A document citation graph is useful for deriving bibliometric analyses, such as computing document authorities and author importance, as well as for performing social network analysis. In order to construct a document citation graph, all
citations contained in each document must be identified and parsed, and then
the citations must be matched to corresponding document records. CiteSeer Plus
uses an approach that differs in many ways from the legacy CiteSeer.
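The graph just defined can be sketched as a simple adjacency structure, with in-degree serving as a crude authority measure of the kind bibliometric analyses start from (illustrative code, not the system's implementation):

```java
import java.util.*;

// Minimal citation graph: nodes are document ids, and a directed edge
// u -> v means document u cites document v.
public class CitationGraph {
    private final Map<String, Set<String>> cites = new HashMap<>();

    public void addCitation(String from, String to) {
        cites.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // Number of distinct documents citing `doc` (a simple authority score).
    public int inDegree(String doc) {
        int n = 0;
        for (Set<String> targets : cites.values())
            if (targets.contains(doc)) n++;
        return n;
    }
}
```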
CiteSeer’s method could be defined as a “hard approach”. Each citation is
parsed using heuristics to extract fields such as title, authors, year of publication,
page numbers and the citation identifier (used to mark the citation in the body
text). The fields of each citation are compared with one another based on a
string distance threshold in order to cluster citations into groups representing
a single document. Finally, the metadata from each citation group is compared
to existing document records in order to match the citations to documents.
Citations to a given paper may have widely varying formats; hence, developing
rules for citation field identification can be very time consuming and error prone.
CiteSeer’s approach relies heavily on off-line computations in order to build the
document citation graph. If no document is found to match a citation group, all citations in the group remain unresolved, and cannot be resolved until the next graph update, even if a matching document enters the system beforehand.
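The clustering step of this hard approach can be illustrated with a greedy sketch that groups citation strings falling under a normalized edit-distance threshold; this is our own simplification of the legacy heuristics, not CiteSeer's actual code:

```java
import java.util.*;

// Greedy sketch of citation clustering: a citation joins the first
// existing cluster whose representative is within the distance threshold,
// so each cluster is assumed to denote a single document.
public class CitationClusterer {
    // Classic Levenshtein edit distance.
    static int edit(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    public static List<List<String>> cluster(List<String> citations, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        for (String c : citations) {
            List<String> home = null;
            for (List<String> cl : clusters) {
                String rep = cl.get(0);  // first member acts as representative
                double dist = edit(c, rep) / (double) Math.max(c.length(), rep.length());
                if (dist < threshold) { home = cl; break; }
            }
            if (home == null) { home = new ArrayList<>(); clusters.add(home); }
            home.add(c);
        }
        return clusters;
    }
}
```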
The CiteSeer Plus approach could be defined as a “soft approach”. Our method is less computationally costly and can be performed online, in an approach similar to the SFX system [16]. The process of building the citation graph in CiteSeer Plus is query-based; that is, citations are resolved using queries performed in the query module. The Indexer allows metadata to be stored in different subindexes (slices), so a query can be performed on a specific slice of the main index. Subfields parsed from citations are used to perform complex document queries on appropriate index slices, and the top-ranked document is matched to a citation if its similarity to the query surpasses a given threshold. In the other
direction, to find citations matching a new document, CiteSeer Plus makes a
query using all the words of the document title and authors. This query is
performed on the citation slice; thus the query results are all documents that
have a citation containing some words of the query.
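A toy sketch of this query-based resolution follows, using word overlap as a stand-in for the real index's similarity scoring (the class, the scoring function, and the threshold semantics are illustrative assumptions):

```java
import java.util.*;

// Sketch of the "soft" approach: fields parsed from a citation form a
// query against an index slice (here, title + authors text); the best
// document is accepted only if its similarity passes a threshold.
public class CitationResolver {
    private final Map<String, String> slice = new LinkedHashMap<>(); // docId -> indexed text

    public void index(String docId, String text) {
        slice.put(docId, text.toLowerCase());
    }

    // Fraction of query words that appear in the document text.
    static double similarity(String query, String doc) {
        Set<String> q = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        Set<String> d = new HashSet<>(Arrays.asList(doc.split("\\s+")));
        int hits = 0;
        for (String w : q) if (d.contains(w)) hits++;
        return q.isEmpty() ? 0 : hits / (double) q.size();
    }

    // Returns the best matching docId, or null if nothing clears the threshold.
    public String resolve(String citationQuery, double threshold) {
        String best = null;
        double bestScore = 0;
        for (Map.Entry<String, String> e : slice.entrySet()) {
            double s = similarity(citationQuery, e.getValue());
            if (s > bestScore) { bestScore = s; best = e.getKey(); }
        }
        return bestScore >= threshold ? best : null;
    }
}
```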
Master nodes do not cache the document citation graph since they have to
provide query results that are as fresh as possible. However, slave nodes can
use a query result caching mechanism in order to improve performance at the
cost of reduced information freshness. Repository statistics are built using slave
nodes, but user queries operate on the master node. When a user tries to follow
a citation, this produces a corresponding query on the master node and the user
will obtain one or more documents that are likely to match the citation. This
framework relieves workload on dynamic components that handle user queries
while allowing detailed statistics and graph management activities to be handled
online within separate components.
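The freshness trade-off on slave nodes amounts to a time-bounded result cache; a minimal sketch (the paper does not specify the actual caching mechanism, so the TTL policy here is an assumption):

```java
import java.util.*;

// Sketch of a slave-node result cache: entries are served until a
// freshness window (ttlMillis) expires, trading some staleness for a
// reduced query load on the index.
public class ResultCache {
    private static class Entry {
        final List<String> value;
        final long at;
        Entry(List<String> v, long at) { this.value = v; this.at = at; }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;

    public ResultCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Returns cached results while still fresh, otherwise null.
    public List<String> get(String query, long now) {
        Entry e = cache.get(query);
        return (e != null && now - e.at < ttlMillis) ? e.value : null;
    }

    public void put(String query, List<String> results, long now) {
        cache.put(query, new Entry(results, now));
    }
}
```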
Metadata Extraction System
Metadata extraction is the most difficult task performed by an automated digital library system for research papers. In the literature, there are two main approaches to information extraction: knowledge engineering and machine learning.
In the knowledge engineering approach, the extraction rules used by the system
are constructed manually by using knowledge about the application domain.
The skill of the knowledge engineer plays a large role in the level of system performance, and the best performing systems are often handcrafted. However, the
development process can be very laborious and sometimes the required expertise
may not be available. Additionally, handcrafted rules are typically brittle and do
not perform well when faced with variation in the data or new content domains.
CiteSeer uses this approach, employing information about the computer science
document styles (or templates) to extract metadata.
In the machine learning approach, less human expertise regarding template
styles is required when customizing the system for a new domain. Instead, someone with sufficient knowledge of the domain and the task manually labels a set
of training documents and the labeled data is used to train a machine learning
algorithm. This approach is more flexible than the knowledge engineering approach, but requires that a sufficient volume of training data is available. In the
last decade, many techniques have been developed for metadata extraction from
research papers. There are two major sets of machine learning techniques in the
metadata extraction literature. Generative models such as Hidden Markov Models (HMM) (e.g. [14], [9]) learn a predictive model over labeled input sequences.
Standard HMM models have difficulty modeling multiple non-independent features of the observation sequence, but more recently Conditional Random Fields
(CRF) have been developed to relax independence assumptions [7]. The second
set of techniques is based on discriminative classifiers such as Support Vector Machines (SVM) (e.g. [6]). SVM classifiers can handle large sets of non-independent
features. For the sequence labeling problem, [6] works in a two-stage process: first classifying each text line independently in order to assign it a label, then adjusting these labels based on an additional classifier that examines larger windows
of labels. The best performance in metadata extraction from research papers
has been reached by McCallum and Peng in [12] using CRFs. The CiteSeer Plus
metadata extraction system has been built to maximize flexibility such that it
is simple to add new extraction rules or extraction models into the document
processing workflow. In our metadata extraction system, different kinds of models can be used, trained for the same or different extraction tasks using various techniques, including but not limited to HMM, CRF, regular expression, and SVM classifiers. The CiteSeer Plus metadata extraction system
is based on a blackboard architecture ([10], [1], [2]) such that extraction modules can be designed as standalone processes or within groups of modules with
dependencies. A blackboard system consists of three main components:
Knowledge Sources (in our framework these are named Experts): independent
modules that specialize in some part of the problem solving. These experts can
be widely different in their inference techniques and in their knowledge representation.
BlackBoard: a global database containing the input data, partial solutions and
many informational items produced by experts to support the problem solving.
Control component: a workflow controller that makes runtime decisions about
the course of problem solving. In our framework, the control component consists
of a set of special experts called scheduling experts that are able to schedule the
knowledge sources registered in the framework. The scheduling expert is chosen by the control component based on the problem solving strategy that is employed and the kinds of metadata that the system needs in order to progress. Using
different scheduling experts, it is possible to change the problem solving strategy
dynamically in order to experiment with various learning strategies.
Although an individual expert can be independent from all the other experts
registered in the framework, each expert can declare its information dependences,
that is, all the information that it needs to work. The control component activates the expert when all these dependences are satisfied. As such, experts can be
activated when all the information required by the expert has been extracted and
stored on the BlackBoard module. The experts declare their skills (the information they can extract) to the Control component, so that during problem solving (metadata extraction) the control component can activate the experts at the right moment and reason about which intermediary experts must be employed in order to reach a final result. The BlackBoard groups
similar information and registers expert accuracies based on the prior expertise (a measure of expert ability, as an F score, on a standard dataset) declared by each expert. In this way, if more than one expert produces the same
(or similar) kinds of information, the accuracy value of that information will be
computed as the joint confidence among the experts.
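The dependency-driven activation described above can be sketched as a simple control loop; the expert names and skill vocabulary in the example are illustrative, not the actual CiteSeer Plus experts:

```java
import java.util.*;

// Sketch of the blackboard control loop: each expert declares the
// information it needs and the information it provides; the control
// component activates an expert once the blackboard satisfies its
// declared dependences.
public class Blackboard {
    public static class Expert {
        final String name;
        final Set<String> needs;
        final Set<String> provides;
        public Expert(String name, Set<String> needs, Set<String> provides) {
            this.name = name; this.needs = needs; this.provides = provides;
        }
    }

    // Fires experts until no further one can fire; returns activation order.
    public static List<String> run(List<Expert> experts, Set<String> board) {
        List<String> order = new ArrayList<>();
        Set<Expert> done = new HashSet<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Expert e : experts) {
                if (!done.contains(e) && board.containsAll(e.needs)) {
                    board.addAll(e.provides);   // the expert posts its results
                    done.add(e);
                    order.add(e.name);
                    progress = true;
                }
            }
        }
        return order;
    }
}
```

Note how the loop naturally yields the entity-recognition, row-labeling, metadata-construction ordering of the example configuration below, without hard-coding it.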
An example configuration may group experts into three classes or functional
levels, although the framework does not restrict the processing workflow. The
first level is the Entity Recognition level. In this level are all the experts able
to give words a specific semantic augmentation, including part-of-speech tagging and recognition of named entities such as first or last name, city, country,
abbreviation, organization, etc. Experts at this level will be activated first for
processing workflows. The second level is the Row Labeling level. At this level
are all the experts able to classify a paper line with one or more defined labels
such as author, title, affiliation, citation, section title and so on. The experts
at this level classify the paper lines using a document representation supplied
by the Document module, a framework object able to elaborate the document structure by supplying a representation based on many different features regarding line contents, layout and font styles. Row labeling can be an iterative process,
reclassifying lines based on tagged context in subsequent passes. The last level is
the Metadata Construction level. Using all the extracted information from the
previous levels, the experts at this level can build the final metadata record for
a document.
Conclusions
This paper has presented a new version of the CiteSeer system, showing significant design improvements over its predecessor. The new system reproduces
every core feature of the previous version within a modular architecture that
is easily expandable, configurable, and extensible to new content domains. Increased flexibility is obtained through a design based on customizable plug-in
components (for the metadata extraction phase) and the extensive use of web
service technology to provide an interface into every system component. CiteSeer
Plus can also be a useful tool for researchers or other developers interested in
information retrieval and information extraction, as CiteSeer Plus can be used
as a powerful yet easy to use framework to test new ideas and technologies by
developing third party applications that bind with specific components of the
CiteSeer Plus framework.
Acknowledgments
We thank Nicola Baldini and Michele Bini (FocuSeek.com) for fruitful discussions, suggestions and support during the system design and development process. We also thank Fausto Giunchiglia and Maurizio Marchese (University of
Trento, Italy) for fruitful discussions that have aided the evolution of the system.
References
1. B. L. Buteau. A generic framework for distributed, cooperating blackboard systems. Proceedings of the 1990 ACM Annual Conference on Cooperation, pp. 358–365, February 1990.
2. H. Chen, V. Dhar. A knowledge-based approach to the design of document-based retrieval systems. ACM SIGOIS Bulletin, v. 11, n. 2-3, pp. 281–290, Apr. 1990.
3. E. Garfield. Science Citation Index - A new dimension in indexing. Science, 144,
pp. 649-654, 1964.
4. C.L. Giles, K. Bollacker and S. Lawrence. CiteSeer: An Automatic Citation Indexing System, Digital Libraries 98: Third ACM Conf. on Digital Libraries, ACM
Press. New York, 1998, pp. 89-98.
5. C.L. Giles and I.G. Councill. Who gets acknowledged: measuring scientific contributions through automatic acknowledgement indexing. PNAS, 101, Number 51,
pp. 17599-17604, 2004.
6. H. Han, C. Lee Giles, E. Manavoglu, H. Zha, Z. Zhang, E. A. Fox. Automatic
Document Metadata Extraction using Support Vector Machines. Proceedings of
the 2003 Joint Conference on Digital Libraries (JCDL03), 2003.
7. J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic
models for segmenting and labeling sequence data. In International Conference on
Machine Learning, 2001.
8. S. Lawrence, C. Lee Giles. Searching the World Wide Web. Science, 280, Number
5360, pp. 98-100, 1998.
9. T. R. Leek. Information extraction using hidden Markov models. Master’s thesis, UC San Diego, 1997.
10. H. Penny Nii. Blackboard systems: The blackboard model of problem solving and the evolution of blackboard architectures. The AI Magazine, VII(2):38–53, Summer 1986.
11. T. O’Reilly. What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software. http://www.oreillynet.com/pub/a/oreilly/tim/news
12. F. Peng and A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 329–336, 2004.
13. Y. Petinot, C. Lee Giles, V. Bhatnagar, P. B. Teregowda, H. Han, I. Councill. A Service-Oriented Architecture for Digital Libraries. ICSOC ’04, November 15–19, 2004.
14. K. Seymore, A. McCallum and R. Rosenfeld. Learning hidden Markov model structure for information extraction. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, pages 37–42, July 1999.
15. J. Stribling, I.G. Councill, M.F. Kaashoek, R. Morris, and S. Shenker. OverCite: A cooperative digital research library. In Proceedings of the International Workshop on Peer-To-Peer Systems (IPTPS 05), Ithaca, NY, 2005.
16. H. Van de Sompel, P. Hochstenbach. Reference linking in a hybrid library environment. Part 1: Frameworks for linking. D-Lib Magazine, v.5 n.4, 1999.
Digital Object Prototypes: An Effective
Realization of Digital Object Types
Kostas Saidis¹, George Pyrounakis², Mara Nikolaidou², and Alex Delis¹
¹ Dept. of Informatics and Telecommunications
² Libraries Computer Center
University of Athens, University Campus, Athens, 157 84, Greece
{saiko, forky, mara, ad}@di.uoa.gr
Abstract. Digital Object Prototypes (DOPs) provide the DL designer
with the ability to model diverse types of digital objects in a uniform
manner while offering digital object type conformance; objects conform
to the designer’s type definitions automatically. In this paper, we outline
how DOPs effectively capture and express digital object typing information and finally assist in the development of unified web-based DL
services such as adaptive cataloguing, batch digital object ingestion and
automatic digital content conversions. In contrast, conventional DL services require custom implementations for each different type of material.
Introduction
Several formats and standards, including METS [10], MPEG-21 [15], FOXML [7]
and RDF [11] are in general able to encode heterogeneous content. What they
all have in common is their ability to store and retrieve arbitrary specializations
of a digital object’s constituent components, namely, files, metadata, behaviors
and relationships [9]. The derived digital object typing information – that is,
which components constitute each different type of object and how each object
behaves – is not realized in a manner suitable for effective use by higher level
DL application logic including DL modules and services [13].
Our main objective is to enhance how we express and use the types of digital
objects independently of their low-level encoding format used for storage. Digital
object prototypes (DOPs) [13] provide a mechanism that uniformly resolves
digital object typing issues in an automated manner. The latter relieves DL users such as cataloguers, developers and designers from manually dealing with the underlying complexity of typing. A DOP is a digital object type definition that
provides a detailed specification of its constituent parts and behaviors. Digital
objects are conceived as instances of their respective prototypes. DOPs enable
the generation of user-defined types of digital objects, allowing the DL designer
to model the specialities of each type of object in a fine-grained manner, while
offering an implementation that guarantees that all objects conform to their type
automatically. Using DOPs, the addition of a new digital object type requires
no custom development and services can be developed to operate directly on all
types of material without additional coding for handling “special” cases.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 123–134, 2006.
© Springer-Verlag Berlin Heidelberg 2006
DOPs assist in dealing with important “every day” DL development issues
in a unified way: how to speed up and simplify cataloguing, how to automate
content ingestion, how to develop effective web interfaces for presenting and
manipulating heterogeneous types of digital objects. In this paper, we focus on
the benefits offered by the deployment of DOPs in the development of high
level services in Pergamos, the University of Athens DL. In particular, we point
out how web based services such as browsing, cataloguing, batch ingestion and
automatic digital content conversion cope with any type of DOP defined object,
while having all services reside in a single, uniform implementation.
The remainder of the paper is organized as follows. Section 2 provides a detailed description of the current implementation of DOPs and pinpoints how DOPs assist in the development of uniform yet effective DL services. In Section 3 we present several DOP examples originating from Pergamos collections.
Finally, Section 4 concludes the paper discussing related and future work.
Digital Object Prototypes in Pergamos
We have implemented DOPs in Java. As depicted in Figure 1a, DOPs operate
atop the repository / storage layer of the DL (in Pergamos we use FEDORA [14]).
Fig. 1. (a) The 3-tier Pergamos architecture incorporating the “type enforcement” layer
of DO Dictionary [13] and (b) A digital object instance as composed by its respective
prototype and the underlying stored digital object
The DO Dictionary layer of Figure 1a exposes the DOPs API to high level DL
services or the application logic. The underlying repository’s “mechanics” remain
hidden, since all service functionality is directed through DOPs. We define DOPs in terms of XML documents that are loaded by the DO Dictionary at bootstrap
time. These XML documents provide the type specification that is translated to a
Java representation wrapped by the DOPs API. At runtime, the DO Dictionary
loads stored digital objects from the repository and generates Java artifacts
named digital object instances that conform to their respective DOP definition.
High level services operate on digital object instances; any modification occurring
in an instance’s data is serialized back to the repository when the object is saved.
In order to illustrate how DOPs effectively realize digital object types, in this
section we use examples drawn from the Senate Archive’s Session Proceedings
collection in Pergamos DL. We model Session Proceedings using Session and
Page DOPs; each Senate Session is modelled as a complex object containing
Pages. Figure 1b depicts the runtime representation of a Session digital object
instance, while Figure 2 illustrates the definition of the Session DOP, encoded
in XML. The Session instance reflects the specifications found in the Session
DOP. The instance’s behaviors are defined in the DOP the instance conforms
to, while its metadata, files and relations are loaded from its underlying stored
digital object.
Fig. 2. The Session prototype defined in XML terms
DOP definitions are encoded in XML as depicted by the Session DOP of
Figure 2 and are made up of four parts according to [9]: (a) metadata element
set definitions expressed in the MDSets XML section, (b) digital content specifications expressed in the files section, (c) relationships, defined in the relations
section and (d) behaviors, defined in the behaviors XML section. In the following we provide a detailed description of each of these four definition parts, while,
in parallel, we discuss how these type definitions are interpreted at runtime. It is
worth pointing out that, although most of the examples we use herein originate
from object input scenarios, the automatic type conformance offered by DOPs
covers all aspects of digital object manipulation. The DOPs framework is not a
static digital object model. On the contrary, it can be conceived as a framework
that allows users to define their own digital object models.
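Since Figure 2 itself is not reproduced here, the following fragment is only an illustrative sketch of how a DOP definition is organized; the section names (MDSets, files, relations, behaviors) and attributes follow the prose of this paper, but the exact schema and element names are those shown in Figure 2:

```xml
<!-- Illustrative sketch only; the authoritative example is the Session
     DOP of Figure 2. "DateValidation" is a hypothetical plugin name. -->
<prototype id="session">
  <MDSets>
    <MDSet id="dc">
      <datastream loader="StandardLoader"/>
      <field id="dc:date" isMandatory="true" validation="DateValidation"/>
    </MDSet>
  </MDSets>
  <files>
    <!-- digital content specifications, as in Listing 1.1 -->
  </files>
  <relations>
    <!-- structural relationships, e.g. Session contains Page -->
  </relations>
  <behaviors>
    <!-- public behaviors, e.g. batchIngest -->
  </behaviors>
</prototype>
```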
Behaviors in DOPs
The behaviors of a digital object constitute the set of operations it supports.
All the instances of the same DOP share the same behaviors; for example, all
Session Proceedings behave in the same manner. This is reflected by the fact
that with DOPs, behaviors are defined only in the object’s respective prototype
and are automatically bound to the digital object instance at runtime by the
DO Dictionary.
DOPs implement digital object types by drawing on the notions of the OO
paradigm. In order to support OO encapsulation, our approach distinguishes
private from public behaviors. Private behaviors refer to operations that are
executed by the digital object instance in a private fashion, hidden from third
parties. For example, validations of metadata element values are private behaviors that are executed by instances according to their DOP specification, without
user intervention. Private behaviors are triggered on specific events of the digital
object instance’s lifecycle; for instance, when a DL service updates the metadata
of an object. Private behaviors are implicitly defined in the DOP, as described in
the examples presented later in this section. On the other hand, public behaviors
constitute the interface through which third parties can interact with the digital
object instance at hand. Public behaviors are explicitly defined in a DOP and
are described in Section 2.5.
Metadata Elements in DOPs
DOPs support the use of multiple metadata element sets for describing different
digital object characteristics [9,10]. There are three ways to specify a metadata
element set in a DOP: (a) as a standard element set, such as the Dublin Core
(DC) [3], (b) as a user-defined extension of a standard element set (e.g. qualified
DC) or (c) as a totally custom element set. In detail, a DOP specifies:
- the individual metadata sets contained in the objects of this type, supplied
with an identifier and a multi-lingual label and description.
- the specific elements that constitute each metadata set. Each element is
designated by an identifier, desired labels and descriptions, and additional behavioral characteristics expressed in terms of private behaviors.
- the possible mappings among elements of the various metadata sets.
As the MDSets section of Figure 2 illustrates, Session objects are characterized using a qualified DC metadata set, called dc. Due to the archival nature
of the material, we also use a second, custom element set called ead, that follows
the principles of Encoded Archival Description (EAD) [6], yet without encoding
the EAD Finding Aid in its entirety.
In what follows, we describe the metadata handling capabilities of DOPs and
provide appropriate examples drawn from the MDSets specifications found in the
Session prototype of Figure 2.
Automatic loading & serialization of Metadata sets: Loading and serialization of metadata sets are private behaviors, both executed by the DOP
behind the scenes. For example, if a DL service requests the dc metadata set
values of a Session digital object instance, the DOP specified loader is used to
load the corresponding element values from the underlying stored digital object.
Respectively, whenever a DL service stores the digital object instance to the
repository, the DOP supplied serializer is used to serialize each metadata set
to the appropriate underlying format. Loaders and serializers are defined in the
datastream XML section of the MDSet definition. Each DOP is allowed to define
its custom loading / serialization plugins, given that they constitute valid implementations of the respective Loader and Serializer Java interfaces supplied by
the DO Dictionary. The Session DOP, for example, uses the StandardLoader
plugin to load the metadata of Session Proceedings objects.
Behavioral characteristics of Metadata elements: The DOPs metadata specification inherently offers additional behavioral characteristics for each
metadata element. These characteristics are exploited by DL services on a case-by-case basis for each element. DOPs define behavioral characteristics in terms
of XML attributes of the respective field definitions appearing in the MDSet
specification. In DOPs, we support the following behavioral characteristics:
- isMandatory: the instance will throw an exception if the metadata element
is to be saved with a null value.
- isHidden: advises the UI to hide the element from end-users.
- isRepeatable: the metadata element is allowed to have multiple values. The
UI service adjusts accordingly, by supplying the cataloguer with the ability to
insert multiple values or by displaying the values to the end-user in a list.
- validation: digital object instances apply the given validation whenever
they are called to set values to the element. The validation occurs just before
the user-supplied values are serialized and sent to the repository. DOPs support
user-defined, pluggable validations, given that they implement the Validation
interface provided by the DO Dictionary. For example, the definition of the
dc:date element in Figure 2 specifies the use of a validation that checks whether the supplied values conform to the date format selected by the Senate Archive’s cataloguing staff.
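A sketch of how a digital object instance might enforce these characteristics just before serialization (the class name, exception type, and use of a predicate instead of the framework's actual Validation interface are our own simplifications):

```java
import java.util.*;

// Sketch of element-level enforcement of the behavioral characteristics
// (isMandatory, isRepeatable, validation) declared in a prototype.
public class ElementSpec {
    final boolean mandatory;
    final boolean repeatable;
    final java.util.function.Predicate<String> validation;

    public ElementSpec(boolean mandatory, boolean repeatable,
                       java.util.function.Predicate<String> validation) {
        this.mandatory = mandatory;
        this.repeatable = repeatable;
        this.validation = validation;
    }

    // Called just before values are serialized to the repository.
    public void check(List<String> values) {
        if (mandatory && (values == null || values.isEmpty()))
            throw new IllegalStateException("mandatory element has no value");
        if (!repeatable && values != null && values.size() > 1)
            throw new IllegalStateException("element is not repeatable");
        if (values != null)
            for (String v : values)
                if (!validation.test(v))
                    throw new IllegalStateException("validation failed: " + v);
    }
}
```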
Mappings among Metadata Elements: The Session DOP of Figure 2
maps ead:unitid to dc:identifier physical. A mapping between elements
represents another example of a private behavior. Whenever the value of the
ead:unitid element is modified, the digital object propagates its new value to
the dc:identifier physical. In Session objects, the mappings are created
from selected ead elements to members of the dc metadata set. This is performed in order to allow us to offer cross-collection search to our users, given
that FEDORA only supports DC metadata searches. With the use of DOP-based
mappings we supply Pergamos with such search capabilities, without having to limit our material description requirements to DC metadata only or force our cataloguing staff to provide redundant information for both the ead and dc metadata sets.
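A minimal sketch of mapping propagation as a private behavior (the class is illustrative, and the exact qualified name of the target dc element is the one given in Figure 2):

```java
import java.util.*;

// Sketch of element mappings: setting a source element silently
// propagates its value to the mapped target element, so DC-based
// cross-collection search sees values entered only once.
public class MappedMetadata {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, String> mappings = new HashMap<>(); // source -> target

    public void addMapping(String source, String target) {
        mappings.put(source, target);
    }

    public void set(String element, String value) {
        values.put(element, value);
        String target = mappings.get(element);
        if (target != null) values.put(target, value);  // private behavior: propagate
    }

    public String get(String element) {
        return values.get(element);
    }
}
```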
Digital Content in DOPs
With regard to digital content, a prototype:
- specifies the various files and their respective formats,
- provides the necessary information required for converting a primary file
format to derivatives in order to automate and speed up the ingestion process,
- enables batch ingestion of content and automatic creation of the appropriate
digital objects.
Listing 1.1 depicts the files configuration of the Senate Archive’s Page DOP.
The latter specifies that Page objects should contain three file formats, namely
a high quality TIFF image (hq), a JPEG image of lower quality for web display
(web) and a small JPEG thumbnail image for browsing (thumb). In what follows
we describe batch ingestion and content conversion capabilities of DOPs.
<files>
  <file id="hq" type="primary" datastream="HQ">
    <label lang="en">High Quality Image</label>
    <mime-type id="image/tiff">
      <conversion target="web" task="convRes" hint="scale:0.6,quality:0.7"
          mimeType="image/jpeg" converter="gr.uoa.dl.core.conv.ImageConverter"/>
      <conversion target="thumb" task="convRes"
          hint="width:120,height:120,quality:0.6"
          mimeType="image/jpeg" converter="gr.uoa.dl.core.conv.ImageConverter"/>
    </mime-type>
  </file>
  <file id="web" type="derivative" datastream="WEB">
    <label lang="en">Web Image</label>
    <mime-type id="image/jpeg"/>
  </file>
  <file id="thumb" type="derivative" datastream="THUMB">
    <label lang="en">Thumbnail Image</label>
    <mime-type id="image/jpeg"/>
  </file>
</files>
Listing 1.1. The files section of the Page prototype
Automatic Digital Content Conversions: Each file format is characterized either as primary or derivative. In the case of files of Senate Archive’s
Page objects, as defined in the files section of Listing 1.1, the hq file is primary,
referring to the original digitized material. The web and thumb files are treated
as derivatives of the primary file, since the prototype’s conversion behavior
can generate them automatically from the hq file. Conversion details reside in
the conversion section of each file specification. After the ingestion of the
primary file, the digital object instance executes the conversions residing in
its prototype automatically.
We support three conversion tasks, namely (a) convert, used to convert a file
from one format to another, (b) resize, used to resize a file while maintaining its
format and (c) convRes, used to perform both (a) and (b). Each task is carried
out by the Java module supplied in the converter attribute, offering flexibility
to users to provide their own custom converters. The converter is supplied with
a hint, specifying either the required width and height of the resulting image
in pixels, the scale factor as a number within (0, 1) or the derivative’s quality
as a fraction of the original. In the case of Page objects (Listing 1.1), the hq file
is converted to a web JPEG image using compression quality of 0.7 and resized
using a scale factor of 0.6. Additionally, the hq file is also converted to a thumb
JPEG image using compression quality 0.6 and dimensions equal to 120 x 120
pixels. The Page instance stores both derivatives in the FEDORA datastreams
specified in the datastream attribute of their respective file XML element.
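The hint attribute is a simple comma-separated list of key:value pairs, so a custom converter first has to parse it before acting on the image. The following minimal sketch shows such parsing; the class and method names are ours, not taken from the Pergamos codebase:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative parser for the hint attribute of a conversion element,
// e.g. "width:120, height:120, quality:0.6" or "scale:0.6, quality:0.7".
public class ConversionHint {

    // Splits the hint into comma-separated "key:value" pairs and
    // collects them into a map for the converter to consume.
    public static Map<String, String> parse(String hint) {
        Map<String, String> values = new HashMap<>();
        for (String pair : hint.split(",")) {
            String[] kv = pair.trim().split(":", 2);
            if (kv.length == 2) {
                values.put(kv[0].trim(), kv[1].trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> web = parse("scale:0.6, quality:0.7");
        System.out.println(web.get("quality")); // 0.7
        Map<String, String> thumb = parse("width:120, height:120, quality:0.6");
        System.out.println(thumb.get("width")); // 120
    }
}
```

A real converter would then interpret the presence of width/height versus scale to decide between absolute and proportional resizing.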
Batch Digital Object Ingestion: We also use DOPs to automate digital
object ingestion. The files section of the Session prototype (Figure 2), depicts
that Session objects are complex entities that contain no actual digital content
but act as containers of Page objects. However, the Session prototype defines a
zip file that is characterized as container. Containers correspond to the third
supported file format. If the user uploads a file with the application/zip mime
type in a Session instance, the latter initiates a batchIngest procedure. The
Session DOP’s batchIngest specification expects each file contained in the zip
archive to abide by the hq file definitions of the Page prototype. In other words,
if the user supplies a Session instance with a zip file containing TIFF images,
as the Session zip file definition requires, the instance will automatically create
the corresponding Page digital objects. Specifically, the Session batchIngest
procedure extracts the zip file in a temporary location and iterates over the files
it contains in file-name sort order. If the file at hand abides by the Page's
primary file format, the procedure:
a. creates a new Page digital object instance;
b. adds the Page instance to the current Session instance (as required by the
structural relationships described in Section 2.4);
c. adds the file to the Page instance at hand, triggering the automatic
file conversion process of the Page prototype, as outlined earlier.
For a Session comprising 120 Page objects, the ingestion automation supplied by DOPs thus spares the user from manually creating 120
digital objects and performing 240 file format conversions.
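The iteration step of batchIngest can be sketched as follows. This is our illustrative reconstruction, which assumes the Page prototype's primary format is TIFF (as the text suggests) and leaves out the actual object creation and repository calls:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the Session batchIngest iteration; names are ours, not Pergamos'.
// A real implementation would first extract the zip to a temporary location.
public class BatchIngestSketch {

    // Keeps the zip entries matching the Page prototype's primary (hq)
    // format and orders them by file name, as batchIngest does; each
    // surviving name stands for one Page digital object to be created,
    // added to the Session, and run through the automatic conversions.
    public static List<String> planIngestion(List<String> zipEntries) {
        List<String> pages = new ArrayList<>();
        for (String name : zipEntries) {
            String lower = name.toLowerCase();
            if (lower.endsWith(".tif") || lower.endsWith(".tiff")) {
                pages.add(name);
            }
        }
        Collections.sort(pages); // file-name sort order drives page ordering
        return pages;
    }

    public static void main(String[] args) {
        List<String> entries = List.of("p002.tif", "notes.txt", "p001.tif");
        System.out.println(planIngestion(entries)); // [p001.tif, p002.tif]
    }
}
```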
Relationships in DOPs
DOPs specify the different relationships that their instances may be allowed to
participate in. Currently, DOPs support the following relationships:
- Internal Relationships: Digital objects reference other DL pertinent objects.
- Structural Relationships: These model the “parent / child” relationships
generated between digital objects that act as containers and their respective
contained objects.
K. Saidis et al.
- External Relationships: Digital objects reference external entities, providing
their respective URLs.
A Session object is allowed to contain Page objects; this specification appears
in the relations section of the Session DOP (Figure 2). The existence of a
structure specification in the Session prototype yields the following private
behavior in the participating entities:
- Every Session object instance maintains a list of all the digital object
identifiers the instance contains.
- Every Page instance uses the dc:relation isPartOf element to hold the
identifier of its parent Session.
Finally, the references part of the relation section informs DL services
whether custom relationships are supported by this type of object. In the
Session DOP of Figure 2, the references value guides UI services to allow
the cataloguer to relate Session instances only with DL internal objects and
not with external entities.
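The two-sided bookkeeping that a structure specification yields can be sketched as follows; the class and field names are hypothetical, chosen only to mirror the description above:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the private behavior generated by a structure
// specification: the parent tracks child identifiers, the child records
// its parent in the dc:relation isPartOf element.
public class StructureSketch {

    static class SessionInstance {
        final List<String> childIdentifiers = new ArrayList<>();
    }

    static class PageInstance {
        String isPartOf; // held in the dc:relation isPartOf element
    }

    // Adding a Page to a Session updates both sides of the relationship.
    static void addChild(SessionInstance session, String sessionId,
                         PageInstance page, String pageId) {
        session.childIdentifiers.add(pageId);
        page.isPartOf = sessionId;
    }

    public static void main(String[] args) {
        SessionInstance session = new SessionInstance();
        PageInstance page = new PageInstance();
        addChild(session, "uoadl:1209", page, "uoadl:1210");
        System.out.println(session.childIdentifiers); // [uoadl:1210]
        System.out.println(page.isPartOf);            // uoadl:1209
    }
}
```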
Public Behaviors in DOPs
We define public behaviors in DOPs using the notion of behavioral scheme. A
behavioral scheme is a selection of the entities that are part of a digital object. Behavioral schemes are used to generate projections of the content of the
digital object. Figure 2 illustrates the behaviors section of the Session prototype, which defines three behavioral schemes, namely browseView, zipView,
and detailView. The browseView scheme supplies the user with a view of the
digital object instance containing only three elements of the qualified DC metadata set, namely dc:identifier, dc:title and dc:date. Respectively, zipView
generates a projection containing the dc:title metadata element and the zip
file, while detailView provides a full-detail view of the object’s metadata elements. This way, the DL designer is able to generate desired “subsets” of the
encapsulated data of the digital object instance at hand for different purposes.
Execution of public behavior is performed by the invocation of a high level
operation on a digital object instance, supplying the desired behavioral scheme.
High level operations correspond to the actions supported by the DL modules.
For example, the cataloguing module supports the editObject, saveObject and
deleteObject actions, the browsing module supports the browseObject action,
while the object display module supports the viewObject action. At this stage, all
Pergamos DL modules support only HTML actions:
- viewObject("uoadl:1209", shortView): Dynamically generates HTML
that displays the elements participating in the shortView of the “uoadl:1209”
object in read-only mode. The DO Dictionary will first instantiate the object
via its respective Session DOP (Fig. 1b). The new instance “knows” how to
provide its shortView elements to the object display module.
- editObject("uoadl:1209", zipView): Dynamically generates an HTML
form that allows the user to modify the instance’s elements that participate
in zipView. This view is used by the digitization staff in order to upload the
original material and trigger the batch ingestion process, as described earlier in
this section.
- editObject("uoadl:1209", detailView): Generates an HTML form that
displays all the metadata elements of the given instance in an editable fashion.
This is used by the cataloguing staff in order to edit digital object’s metadata.
The cataloguing module uses the behavioral characteristics described in Section
2.2 (e.g. isMandatory, isRepeatable) to generate the appropriate, type-specific
representation of the digital object.
- saveObject("uoadl:1209", zipView): Saves “uoadl:1209” instance back
to the repository. Only the zipView scheme elements are modified. Cataloguing module knows how to direct the submission of the web form generated by
its aforementioned editObject action to saveObject. Respectively, cataloguing
deleteObject action is bound to a suitable UI metaphor (e.g. a “delete” button
of the web form). The scheme supplied to deleteObject is used to generate a
“deletion confirmation view” of the digital object.
The execution of public behaviors is governed by the particular scheme at
hand, while the DOP specifications enable DL application logic to adjust to the
requirements of each element participating in the scheme.
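A behavioral scheme can be pictured as an ordered selection of element names, and a projection as the corresponding subset of the object's encapsulated data. The sketch below is ours; the class name and the sample metadata values are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a behavioral scheme as a named selection of a digital object's
// elements, used to generate a projection of the object's content.
public class SchemeSketch {

    // Keeps only the elements named by the scheme, preserving scheme order.
    static Map<String, String> project(Map<String, String> elements,
                                       List<String> scheme) {
        Map<String, String> view = new LinkedHashMap<>();
        for (String name : scheme) {
            if (elements.containsKey(name)) {
                view.put(name, elements.get(name));
            }
        }
        return view;
    }

    public static void main(String[] args) {
        // Hypothetical qualified DC metadata of a Session instance.
        Map<String, String> session = new LinkedHashMap<>();
        session.put("dc:identifier", "uoadl:1209");
        session.put("dc:title", "Session proceedings");
        session.put("dc:date", "1932-10-04");
        session.put("dc:description", "Full description, hidden in browseView");

        List<String> browseView = List.of("dc:identifier", "dc:title", "dc:date");
        System.out.println(project(session, browseView).keySet());
        // [dc:identifier, dc:title, dc:date]
    }
}
```

A detailView would simply name every element, while zipView would name dc:title plus the zip file part.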
Organization of Collections in Pergamos Using DOPs
Currently, Pergamos contains more than 50,000 digital objects originating from
the Senate Archive, the Theatrical Collection, the Papyri Collection and the
Folklore Collection. Table 1 provides a summary of the DOPs we generated for
modeling the disparate digital object types of each collection, pinpointing the
flexibility of our approach. It should be noted that DOPs are defined with a
collection-pertinent scope [13] and are supplied with fully qualified identifiers,
such as folklore.page and senate.page, avoiding name collisions. These identifiers apply to the object's parts, too; the folklore.page.dc metadata set is different from the senate.page.dc set, both containing suitable qualifications of
the DC element set for different types of objects.
a. Folklore Collection. The Folklore Collection consists of about 4,000 handwritten notebooks created by students of the School of Philosophy. We modeled the
Folklore Collection using the Notebook, Chapter and Page DOPs. Notebooks are
modeled as complex objects that reflect their hierarchical nature; the Notebook
DOP allows notebooks to contain Chapter objects, which in turn are allowed
to contain other Chapter objects or Page objects. Notebooks are supplied with
metadata that describe the entire physical object, while Chapter metadata characterize the individual sections of the text. Finally, Page objects are not supplied
with metadata but contain three files, resembling the definition of the Senate
Archive’s Pages provided in Listing 1.1.
b. Papyri Collection. This collection comprises about 300 papyri of
the Hellenic Papyrological Society. We modeled papyri using the Papyrus DOP,
consisting of a suitable DC qualification and four file formats. The orig file format
Table 1. A summary of the DOPs we generated for four Pergamos collections

a. Folklore Collection
   Notebook: dc metadata; contains Chapter or Page
   Chapter:  dc metadata; contains Chapter or Page
   Page:     hq, web, thumb files (hq to web, hq to thumb conversions); no relationships
b. Papyri Collection
   Papyrus:  dc metadata; orig, hq, web, thumb files (hq to web, hq to thumb conversions); no relationships
c. Theatrical Collection
   Album:    custom → dc metadata; zip file (triggers batch import); contains Photo
   Photo:    niso → dc metadata; hq, web, thumb files (hq to web, hq to thumb conversions); no relationships
d. Senate Archive’s Session Proceedings
   Session:  ead → dc metadata; zip file (triggers batch import); contains Page
   Page:     hq, web, thumb files (hq to web, hq to thumb conversions); no relationships
corresponds to the original digitized papyrus image, while hq refers to a processed
version, generated to improve the original image's readability. The orig image
is defined as primary, without conversions. The hq image, which is also defined
as primary, is the one supplied with the suitable conversion specifications that
generate the remaining two derivative formats, namely web and thumb.
c. Theatrical Collection. The Theatrical Collection consists of albums containing photographs taken from performances of the National Theater. Each Photo
digital object contains three different forms of the photograph and is accompanied by the metadata required for describing the picture, either descriptive
(dc) or technical (niso). As in the case of the Senate Session Proceedings, mappings
are used to map niso elements to dc. Albums do not themselves contain any
digital content, since they act as containers of Photo digital objects. However,
Albums are accompanied by the required theatrical play metadata, encoded in
terms of a custom metadata set, which is also mapped to dc.
d. Senate Archive. The Senate Archive's Session Proceedings have been discussed in Section 2.
Discussion and Related Work
To our knowledge, DOPs provide the first concrete realization of digital object
types and their enforcement. Our approach draws on the notions of the OO
paradigm, due to its well established foundations and its well known concepts.
Approaches on the formalization of OO semantics [2,12] show that the notion
of objects in OO languages and the notion of digital objects in a DL system
present significant similarities, yet at a different level of abstraction. [1] defines
OO systems in terms of the following requirements:
- encapsulation: support data abstractions with an interface of named operations and hidden state,
- type conformance: objects should be associated with a type,
- inheritance: types may inherit attributes from super types.
At this stage, DOPs fulfill the encapsulation and type conformance requirements. The inclusion of inheritance is expected to provide explicit polymorphic
capabilities to DOPs, since polymorphism is currently implicitly supported; the
high level actions residing in the DL modules, as presented in Section 2.5, are
polymorphic and can operate on a variety of types. Inheritance is also expected
to allow designers to reuse digital object typing definitions. The concept of definition reuse through inheritance has been discussed in [8], although targeted at
information retrieval enhancements.
Although DOPs are currently implemented atop the FEDORA repository,
we believe that the presented concepts are of broader interest. The core type
enforcement implementation of DOPs regarding digital object instances and
their respective behavior is FEDORA independent and only stored digital object operations are tied to FEDORA specific functionality (e.g. getDatastream,
saveDatastream services). Taking into consideration that DOPs, conceptually,
relate to the OO paradigm and the digital object modeling approach of Kahn
and Wilensky [9], we argue that there are strong indications that DOPs can be
implemented in the context of other DL systems as well.
DOPs are complementary to FEDORA, or any other underlying repository.
FEDORA can effectively handle low-level issues regarding digital object storage,
indexing and retrieval. DOPs provide an architecture for the effective manipulation of digital objects in the higher level context of DL application logic. DOPs
behaviors are divided into private and public, in order to support encapsulation,
while their definition is performed in the object’s respective prototype. FEDORA
implements behaviors in terms of disseminators, which associate functionality
with datastreams. FEDORA disseminators must be attached to each individual
digital object at ingestion time. With DOPs, all objects of the same type
behave in the same manner; their respective behaviors are dynamically bound
to the instances at runtime, while the behaviors are defined once and in one
place, easing management and maintenance. aDORe [4] deploys
a behavior mechanism that, although similar to FEDORA's, attaches behaviors to stored digital objects in a more dynamic fashion, at dissemination
time, using disseminator-related rules stored in a knowledge base. Finally, DOPs
behaviors operate on digital objects in a more fine-grained manner, since they
can explicitly identify and operate upon the contents of FEDORA datastreams.
[5] enables the introspection of digital object structure and behavior. A DOP
can be conceived as a meta-level entity that provides structural and behavioral
metadata for a specific subset of base-level digital objects. Put in other terms,
a DOP acts as an introspection guide for its respective digital object instances.
DOP-supplied type conformance and type-driven introspection of digital object
structure and behavior allow third parties to adjust to each object's “idiosyncrasy” in a uniform manner.
1. L. Cardelli and P. Wegner. On understanding types, data abstraction, and polymorphism. ACM Computing Surveys, 17(4):471–522, 1985.
2. W. Cook and J. Palsberg. A denotational semantics of inheritance and its correctness. In Proceedings of the ACM Conference on Object-Oriented Programming:
Systems, Languages and Application (OOPSLA), pages 433–444, New Orleans,
Louisiana, USA, 1989.
3. DCMI Metadata Terms. Dublin Core Metadata Initiative, January 2005.
4. H. Van de Sompel, J. Bekaert, X. Liu, L. Balakireva, and T. Schwander. aDORe:
a modular, standards-based digital object repository. The Computer Journal,
48(5):514–535, 2005.
5. N. Dushay. Localizing experience of digital content via structural metadata. In
Proceedings of the Joint Conference on Digital Libraries, pages 244–252, Portland,
Oregon, USA, 2002.
6. Encoded Archival Description (EAD). Library of Congress, 2006.
7. Introduction to Fedora Object XML. Fedora Project.
8. N. Fuhr. Object-oriented and database concepts for the design of networked information retrieval systems. In Proceedings of the 5th international conference
on Information and knowledge management, pages 164–172, Rockville, Maryland,
USA, 1996.
9. R. Kahn and R. Wilensky. A Framework for Distributed Digital Object Services.
Corporation for National Research Initiatives, Reston, VA, 1995.
10. METS: An Overview & Tutorial. Library of Congress, Washington, D.C., 2006.
11. Resource Description Framework (RDF). World Wide Web Consortium.
12. U.S. Reddy. Objects as closures: Abstract semantics of object-oriented languages.
In Proceedings of the ACM Conference on Lisp and Functional Programming, pages
289–297, Snowbird, Utah, USA, 1988.
13. K. Saidis, G. Pyrounakis, and M. Nikolaidou. On the effective manipulation of
digital objects: A prototype-based instantiation approach. In Proceedings of the
9th European Conference on Digital Libraries, pages 26–37, Vienna, Austria, 2005.
14. T. Staples, R. Wayland, and S. Payette. The Fedora project: An open-source digital
object repository management system. D-Lib Magazine, 9(4), April 2003.
15. T. Staples, R. Wayland, and S. Payette. Using MPEG-21 DIP and NISO OpenURL
for the dynamic dissemination of complex digital objects in the Los Alamos National
Laboratory Digital Library. D-Lib Magazine, 10(2), February 2004.
Design, Implementation, and Evaluation of a
Wizard Tool for Setting Up Component-Based
Digital Libraries
Rodrygo L.T. Santos, Pablo A. Roberto,
Marcos André Gonçalves, and Alberto H.F. Laender
Department of Computer Science, Federal University of Minas Gerais
31270-901 Belo Horizonte MG, Brazil
{rodrygo, pabloa, mgoncalv, laender}@dcc.ufmg.br
Abstract. Although component-based architectures favor the building
and extension of digital libraries, the configuration of such systems is
not a trivial task. Our approach to simplify the tasks of constructing
and customizing component-based digital libraries is based on an assistant tool: a setup wizard that segments those tasks into well-defined
steps and drives the user along these steps. For generality purposes, the
architecture of the wizard is based on the 5S framework and different
wizard versions can be specialized according to the pool of components
being configured. This paper describes the design and implementation of
this wizard, as well as usability experiments designed to evaluate it.
The complexity of a digital library, with respect to its content and the range of
services it may provide, varies considerably. As an example of a simple system,
we could cite BDBComp (Biblioteca Digital Brasileira de Computação) [7], which
provides, basically, searching, browsing, and submission facilities. More complex
systems, such as CITIDEL (Computing and Information Technology Interactive
Digital Educational Library) [3], may also include additional services such as
advanced searching and browsing through unified collections, binding, discussion
lists, etc.
Many of the existing digital libraries are based on monolithic architectures
and their development projects are characterized by intensive cycles of design,
implementation and tests [13]. Several have been built from scratch, aiming to
meet the requirements of a particular community or organization [4].
The utilization of modular architectures, based on software components, besides being a widely accepted software engineering practice, favors the interoperability of such systems at the levels of information exchange and service
collaboration [13].
However, although component-based architectures favor the building and extension of digital libraries, the configuration of such systems is not a trivial task.
In this case, the complexity falls on the configuration at the level of each component and on the resolution of functional dependencies between components.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 135–146, 2006.
© Springer-Verlag Berlin Heidelberg 2006
In existing systems, in general, such configurations are performed manually or
via command-line scripts. Both alternatives, however, seem inappropriate in a
broader context of digital libraries utilization. Instead, higher level techniques
to support the creation of complete digital libraries in a simple manner should
be investigated [14].
The approach taken in this paper for simplifying the tasks of constructing
and customizing digital libraries consists in segmenting such tasks into steps
and in driving the user along these steps. This approach is achieved through
the development of a digital library setup wizard running on top of a pool of
software components.
Wizards are applications specially suited to assisting users in the execution
of both complex and infrequent tasks, presenting such tasks as a series of well-defined steps. Though efficient as assistant tools, such applications are not meant
for didactic purposes; on the contrary, they should be designed to hide most
of the complexity involved in the task to be accomplished. Besides, they should
provide a supplementary rather than a substitutive way to accomplish the task,
so that they do not restrict its execution by specialist users [8].
This paper is organized as follows. In Section 2, the architecture of the wizard is described in detail. Following, Section 3 shows some usage examples. In
Section 4, we discuss the usability experimental evaluation of the prototype developed. Section 5 discusses related work. Finally, Section 6 presents conclusions
and perspectives for future work.
Architecture Overview
In this section, we describe the architecture of the wizard, which basically follows the MVC (Model-View-Controller) framework [2] with the addition of a
persistence layer.
The model layer was primarily designed [12] based on configuration requirements gathered from the ODL (Open Digital Libraries) framework [14]. Later, it
was extended in order to support the configuration of different component pools.
Such an extension was inspired by the definition of a digital library from
the 5S (Streams, Structures, Spaces, Scenarios, Societies) framework [6]. According to 5S, a typical digital library is informally defined as a set of mathematical
components (e.g., collections, services), each component being precisely defined
as functional compositions or set-based combinations of formal constructs from
the framework. Our configuration model was devised regarding the components
that make up a 5S-like digital library as configurable instances of software components provided by a component pool. By “configurable instances”, we mean
software components whose behaviors are defined as sets of user-configurable parameters.
The class diagram [11] in Fig. 1 shows a simplified view of the devised model.
As shown in the diagram, a digital library is implemented as a set of configurable
instances of provider components, among those supplied by the pool being used.
A provider may be typed as either a repository or a service, according to its role
within the library. For orthogonality purposes, the digital library itself is also
implemented as a configurable instance of a component. Additionally, components may be declared mandatory, as well as dependent on other components.
The configuration of each component is implemented as a set of parameters,
semantically organized into parameter groups. For validation purposes, each parameter is associated with an existing Java type; parameters may also have a default
value, in conformance with their defined type. Parameters may also be declared
mandatory (not null) and/or repeatable (with cardinality greater than one).
Fig. 1. Class diagram for the model layer
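A minimal sketch of this parameter model, as we read it from Fig. 1, might look as follows; the class names and the exception type are our assumptions, not the wizard's actual code:

```java
// Illustrative sketch of a typed, optionally mandatory configuration
// parameter whose values are validated against a declared Java type.
public class ParameterSketch {

    static class Parameter {
        final String id;
        final Class<?> type;
        final boolean mandatory;
        final Object defaultValue;

        Parameter(String id, Class<?> type, boolean mandatory, Object defaultValue) {
            this.id = id;
            this.type = type;
            this.mandatory = mandatory;
            this.defaultValue = defaultValue;
        }

        // Checks the mandatory (not null) constraint and the declared Java
        // type; an erroneous value raises an exception, mirroring the
        // wizard's notification behavior.
        void validate(Object value) {
            if (value == null) {
                if (mandatory) {
                    throw new IllegalArgumentException(id + " is mandatory");
                }
                return;
            }
            if (!type.isInstance(value)) {
                throw new IllegalArgumentException(id + " expects " + type.getName());
            }
        }
    }

    public static void main(String[] args) {
        Parameter name = new Parameter("libraryName", String.class, true, "My New Library");
        name.validate("BDBComp");   // ok
        try {
            name.validate(42);      // wrong type
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // libraryName expects java.lang.String
        }
    }
}
```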
View and controller layers are integrated – a common simplification of the
original MVC framework. They are responsible for handling user interactions,
performing the corresponding modifications to the configuration model and displaying the updated model back to the user. Until user interactions are committed
to the model, users modify a clone rather than the original configuration of each component. This allows users to cancel all the modifications
performed to a given component at any time.
The configuration interface is organized into steps, in a wizard-like fashion.
Each step comprises a major aspect of a digital library: the library itself, its
collections and/or metadata catalogs, and the services it may provide. In each
of these steps, the parameters associated to each of the components they list are
presented in dynamically created, tab-organized forms (Figs. 2, 3, and 4). Each
tab corresponds to a parameter group. Form elements are designed according
to the type of the parameter they represent: repeatable parameters are shown
as lists, parameters representing file descriptors present a file chooser dialog,
parameters with values restricted to an enumerable domain are displayed as a
combo box, and strings and integers are simply shown as text fields. The semantics
of every parameter is displayed as a tooltip near the parameter label. Type checking is performed against every value entered by the user; in case of an
erroneous value, a corresponding exception is raised and the user is notified
about the error.
The persistence layer is responsible for loading and saving the components'
configuration. In addition, this layer is in charge of setting environment
variables and preparing the databases that support the execution of some components. Its working scheme is based on two XML documents: a pool descriptor
and a configuration log. The pool descriptor document details every component
in the pool, including all configuration parameters associated with them. The description of each configuration parameter contains path entries of the form document:xpath expression that uniquely locate the parameter in each of its source
documents. Since some path entries depend on auto-detected or user-entered information, both only known at runtime (e.g., the base directory of the
wizard and the current digital library identifier), the pool descriptor document
also comprises a list of definitions to be used in path entries declaration. For
example, in the listing below, the path entry for the “libraryName” parameter
is declared relatively to the definitions “wizardHome” (auto-detected) and “libraryId” (user-entered). The other document, a configuration log, acts as a cache
for the persistence layer. It comprises information about the currently configured
digital libraries running in the server.
<component id="library" type="model.pool.library.DigitalLibrary">
<label>General Configuration</label>
<parameter id="libraryName" type="java.lang.String" mandatory="yes">
<default>My New Library</default>
<label>Library Name:</label>
<description>A human readable name for the library.</description>
Both XML documents are handled via DOM. Loading and saving of components are performed through XPath expressions. Based on the specification
of each component (from the pool descriptor document), configured instances
of them are loaded into the digital library model; besides, a template of each
component is added to the model so that new instances of components can be
added later. Loading is performed in a lazy fashion, i.e., objects are created only
when needed. On the other hand, saving is performed only at the end of the
whole configuration task, along with some additional tasks, such as environment
variable and database setup, performed via system calls.
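Resolving a path entry against the runtime definitions can be sketched as below. The ${...} placeholder syntax and the method names are our assumptions, since the paper does not show the exact notation used in the pool descriptor:

```java
import java.util.Map;

// Illustrative resolution of a pool-descriptor path entry of the form
// "document:xpath expression" against runtime definitions such as
// "wizardHome" (auto-detected) and "libraryId" (user-entered).
public class PathEntrySketch {

    // Substitutes ${name} placeholders, then splits the entry into the
    // target document and the XPath expression locating the parameter.
    static String[] resolve(String entry, Map<String, String> definitions) {
        String resolved = entry;
        for (Map.Entry<String, String> def : definitions.entrySet()) {
            resolved = resolved.replace("${" + def.getKey() + "}", def.getValue());
        }
        return resolved.split(":", 2); // [document, xpath expression]
    }

    public static void main(String[] args) {
        Map<String, String> defs = Map.of(
                "wizardHome", "/opt/wizard",
                "libraryId", "bdbcomp");
        String[] loc = resolve(
                "${wizardHome}/conf/${libraryId}.xml:/library/name", defs);
        System.out.println(loc[0]); // /opt/wizard/conf/bdbcomp.xml
        System.out.println(loc[1]); // /library/name
    }
}
```

The first part would then be opened via DOM and the second evaluated as an XPath expression to load or save the parameter's value.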
Specializing the wizard to assist the configuration of different component pools
can be done just by providing a description document for each pool to be configured, as well as any accessory scripts for performing system calls. In fact,
during the development project, we produced wizard versions for two component
pools, namely, the ODL and WS-ODL frameworks.
Usage Examples
In this section, we show some usage examples of configuration tasks performed
with the aid of the wizard developed.
The initial step welcomes the user and states the purpose of the wizard. The
following step (Fig. 2) handles the digital library’s configuration. At this step,
previously configured digital libraries are listed by the wizard and the user can
choose to modify or even remove any of them. Besides, he/she can choose to
create a new digital library. Both library creation and modification are handled
by a component editor dialog. For instance, selecting “BDBComp” from the
list and clicking on “Details” opens this library’s configuration editor dialog.
This dialog comprises the digital library's general configuration (e.g., the library's
home directory, name, and description), as well as its hosting information (e.g.,
the server name and port number for the library’s application and presentation
layers). Selecting a digital library from the list enables the “Next” button on the
navigation bar.
Fig. 2. Configuring digital libraries
Clicking on “Next” drives the user to the following step (Fig. 3), which handles
the configuration of the digital library’s repositories. Similarly to the previous
step, this one shows a list of existing repositories under the currently selected
digital library so that the user can choose to modify or remove any of them. As in
the previous step, he/she can also add a new repository to the library. Clicking on
“Details” after selecting “BDBComp Repository” shows its configuration editor
dialog. Repositories’ configuration parameters include administrative data (e.g.,
repository administrators’ e-mails and password), hosting information and access
permissions (e.g., the repository’s server name and a list of hosts allowed to
access the repository), database connection and storage paths (e.g., the JDBC
driver used to connect to the repository’s database and the PID namespace
associated to records stored in the repository), etc. Since the whole configuration
is performed on the currently selected digital library and is only saved at the
end of the configuration task, clicking on “Back” warns the user that selecting
a new library to be configured implies discarding the current configuration. If
there is at least one repository under the currently selected digital library, the
“Next” button is enabled and the user can go forth.
Fig. 3. Configuring repositories
The following step (Fig. 4) handles the configuration of the digital library’s
services. A list of all the services provided by the pool of components being used
is displayed – those already configured under the current library are marked.
Selecting any of the services displays its description on the right panel. An exception is raised when the user tries to unmark a service that is an instance of a mandatory component, to mark a service component that depends on
other components, or to unmark a service component that other components depend on. Selecting a service component which has additional parameters to be
configured enables the “Details” button. For instance, selecting “Browsing” and
clicking on “Details” launches this service’s configuration editor. Its configuration includes navigational parameters, such as a list of dimensions for browsing
and the number of records to be displayed per page, and presentational parameters, such as the XSL stylesheets to be used when displaying browsing results. As
another example, the “Searching” service’s configuration includes parsing and
indexation parameters, such as lists of delimiters, stopwords and fields to be
indexed, among others.
After configuring the services that will be offered by the digital library, the
user is driven to the penultimate step. This step summarizes all the configuration
performed so far, showing a list of repositories and services comprised by the
library being configured. If anything is wrong, the user can go back and correct
the proper parameters. Otherwise, clicking on “Configure” saves the current
digital library’s configuration and drives the user to the last step.
The last step (Fig. 5) notifies the user about the result of the whole configuration task. If no problem has occurred while saving the configurations performed,
links to the digital library’s services are made available to the user.
Fig. 4. Configuring services
Fig. 5. Configuration completion
System Evaluation
In order to evaluate the usability of our tool, we conducted a series of experiments involving four users from Computer Science (CS) and four from Library and Information Science (LIS). The experiments included performing two configuration tasks and filling in an evaluation questionnaire. Both tasks thoroughly exercise the interface elements of the wizard, such as lists and file choosers. The first and simpler task, aimed at helping users get familiar with the tool, consisted of modifying a few parameters of a pre-configured digital library. The second and more complex one consisted of configuring a whole library from scratch. Since the wizard prototype we tested was running on top of the WS-ODL framework [10], we designed this second task to be comparable to the one performed in a command-line installation test conducted with that framework.
Although data insertion is out of the scope of our tool yet was performed in the command-line installation experiments of WS-ODL, the comparison was still possible, since those experiments measured the installation time at distinct checkpoints, allowing us to discard the data insertion time when comparing overall times.
Table 1 shows the completion time and correctness for the two experiments conducted with the wizard prototype (namely, tasks #1 and #2), as well as those for the users who also performed the command-line driven configuration experiment (task #2c). For comparison purposes, the performance of an expert user (a developer of both the wizard and the WS-ODL framework) is also shown at the end of the table. Time is displayed in the form hh:mm:ss, and correctness stands for the number of correctly executed items in the configuration task divided by the total number of items in that task.
Table 1. Completion time and correctness per task

                  Completion Time
                  Task #1    Task #2    Task #2c
CS #1             00:05:16   00:10:48
CS #2             00:07:27   00:17:36
CS #3             00:07:26   00:08:09   01:36:00
CS #4             00:07:54   00:09:10   01:12:00
CS Mean           00:07:01   00:11:26   01:24:00
CS Std. Dev.      00:01:11   00:04:15   00:16:58
LIS #1            00:15:59   00:20:38
LIS #2            00:08:01   00:17:22   01:36:00
LIS #3            00:08:59   00:16:11
LIS #4            00:11:21   00:20:03   01:35:00
LIS Mean          00:11:05   00:18:33   01:35:30
LIS Std. Dev.     00:03:33   00:02:08   00:00:42
Global Mean       00:09:03   00:15:00   01:29:45
Global Std. Dev.  00:03:17   00:04:55   00:11:51
Expert            00:01:53   00:04:33   00:37:00
Comparing the wizard-guided and the command-line driven approaches for task #2 shows that configuring WS-ODL components with the aid of the wizard is much faster (about 500% on average) than doing so manually (hypothesis accepted by statistical analysis: t-test with α = 0.05). Configuration correctness is also substantially increased (about 34% on average) with the aid of the wizard (hypothesis accepted by statistical analysis: t-test with α = 0.05). This is mainly due to the wizard’s type-checking and component dependency checking. Speed and correctness attest to the effectiveness of the wizard against the command-line driven approach. Effectiveness was also subjectively rated by the users who participated in both tasks, measured on a 5-point bipolar scale ranging from 1 (worst rating) to 5 (best rating). On average, the effectiveness of the wizard-guided approach, in terms of easing the configuration task, was rated 4.5.
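For illustration, the paired comparison for task #2 can be reproduced from the completion times in Table 1 for the four users who performed both the wizard-guided task (#2) and the command-line task (#2c); the snippet below is a plain re-implementation of a paired t-test, not the authors’ actual analysis.

```python
from math import sqrt

def to_seconds(t):
    h, m, s = (int(x) for x in t.split(":"))
    return h * 3600 + m * 60 + s

# Completion times (Table 1) for the users who performed both task #2
# (wizard) and task #2c (command line): CS #3, CS #4, LIS #2, LIS #4.
wizard = [to_seconds(t) for t in ("00:08:09", "00:09:10", "00:17:22", "00:20:03")]
manual = [to_seconds(t) for t in ("01:36:00", "01:12:00", "01:36:00", "01:35:00")]

# Paired (dependent-samples) t statistic on the per-user differences.
diffs = [m - w for m, w in zip(manual, wizard)]
n = len(diffs)
mean = sum(diffs) / n
var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
t = mean / sqrt(var / n)

# The critical value of Student's t for alpha = 0.05 (two-sided) with
# 3 degrees of freedom is about 3.182; t far exceeds it here.
print(round(t, 2))
```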
The learnability of the tool was also derived from Table 1. To this end, we devised two measures: configuration efficiency and expertise. Efficiency stands for the total number of items in the task divided by the overall task completion time. Expertise measures how close the user’s completion time is to the expert’s completion time. Table 2 shows the values for these two learnability measures. Efficiency is measured in terms of task items performed per minute.
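A sketch of the two learnability measures; the expertise formula (expert time over user time) is our reading of the paper’s informal definition, and the item count and times below are illustrative.

```python
def efficiency(task_items, completion_seconds):
    """Configuration efficiency: task items performed per minute."""
    return task_items / (completion_seconds / 60.0)

def expertise(user_seconds, expert_seconds):
    """How close the user's completion time is to the expert's
    (1.0 = as fast as the expert). One plausible formula; the paper
    does not give the exact one."""
    return expert_seconds / user_seconds

# Hypothetical: a 30-item task done in 15 minutes, vs. the expert's
# task #2 time of 00:04:33 (273 seconds, from Table 1).
print(efficiency(30, 15 * 60))
print(round(expertise(900, 273), 2))
```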
Table 2. Efficiency and expertise per task
From Table 2, we can see that in most cases (CS #2 and LIS #2 are the only exceptions) configuration efficiency increases (about 33% on average) from task #1 to task #2. Here we regard all task items as equally difficult, which is quite reasonable since all of them consist of setting configuration parameters. Also, the few items that differ in difficulty (e.g., choosing a file in a dialog or adding an item to a list) are homogeneously distributed across the two tasks. Expertise, another learnability indicator, also increases (about 49% on average) from task #1 to task #2, which could indicate that the wizard is easy to learn. However, the hypotheses of efficiency and expertise growth from task #1 to task #2 were rejected by statistical analysis (t-test with α = 0.05), which suggests that task #1 may not have been enough for users to become familiar with the tool.
From the questionnaire filled in by the users who performed the wizard-guided configuration tasks, we derived two other metrics: didactical applicability and satisfaction, both measured on 5-point bipolar scales ranging from 1 (worst rating) to 5 (best rating). On average, in terms of understanding of the concepts being configured (i.e., concepts pertaining to the domain of the component pool on top of which the wizard is running), the didactical applicability of the wizard was subjectively rated 3.75. This was an unexpectedly but not unwelcomely high value, since wizards are not designed for didactical purposes. Satisfaction was measured in terms of comfort and ease of use; on average, users rated them 4.25 and 4, respectively.
Related Work
Several works in the literature deal with component-based frameworks for building digital libraries. As far as we know, however, few works relate specifically to the task of configuring such systems. In this section, we present four works that fall into the latter category.
5SGraph [15], a tool based on the 5S framework, provides a visual interface
for conceptual modeling of digital libraries from a predefined metamodel. In the
modeling task, the user interacts with the tool by incrementally constructing a
tree where each node, picked from the metamodel, represents a construct of the
digital library being modeled. Differently from the other works presented here,
this one has a didactical goal: to teach the 5S theory.
BLOX [5] is a tool that hides most of the complexity involved in the task of
configuring distributed component-based digital libraries. However, as in 5SGraph, users interact with this tool in a flexible manner: its interface comprises
a set of windows, each one representing the configuration of an ODL component.
The Greenstone suite [1] incorporates a wizard that allows non-specialist users
to create and organize digital collections from local or remote documents. Driving
the user step by step, this tool gets information such as the name and the purpose
of the collection, administrator’s e-mail, existing collections to serve as a model,
base directories or URLs, etc. This tool, on the other hand, does not deal with
the configuration of service provider components.
Finally, the OAIB application (Open Archives in a Box) [9], based on the
COCOA framework (Components for Constructing Open Archives), provides
a wizard for configuring metadata catalogs stored in RDBMS’s. Its interface
consists of a series of tabs where each tab presents different configuration options.
Similarly to the wizard provided by the Greenstone suite, this one does not deal
with the configuration of service providers.
Table 3 summarizes the characteristics of all these tools, comparing them to
the ones present in our wizard.
Table 3. Wizard vs. related tools
Conclusions and Future Work
This paper has presented a wizard tool for setting up component-based digital libraries. The tool is aimed at assisting users in the nontrivial task of configuring software components in order to build a fully functional digital library. The architecture of the wizard comprises a generic model layer designed to support the configuration of different component pools with minimal specialization.
The paper has also presented an experimental usability evaluation of a prototype running on top of the WS-ODL framework. Despite the relatively small number of users, the (statistically meaningful) results show that our approach is quite effective in easing the task of configuring that framework by hiding most of the complexity involved in the configuration task.
As future work, we plan to extend the wizard tool to support the customization of user interfaces and workflows. Although its comfort and ease of use have been well rated, we plan to further enhance some interface aspects of the wizard, based on users’ suggestions and on observations we made during the experiment sessions, in order to improve the overall learnability of the tool. We also intend to perform additional experiments to compare the guided and flexible interaction approaches, as provided by the wizard and (for instance) the BLOX tool, respectively. In the near future, we plan to incorporate the wizard into the WS-ODL framework. Additionally, prototype versions for other component pools could be produced to test and expand the generality of the model layer.
Acknowledgments. This work was partially supported by the CNPq-funded projects I3DL and 5S/VQ. The authors would like to thank Allan J. C. Silva for his valuable help with the statistical analysis of our experimental evaluation.
References

1. Buchanan, G., Bainbridge, D., Don, K. J., Witten, I. H.: A new framework for building digital library collections. In: Proceedings of the 5th ACM-IEEE Joint Conference on Digital Libraries (2005) 25–31
2. Burbeck, S.: Applications Programming in Smalltalk-80: How to use Model-View-Controller (MVC), tech. report. Softsmarts Inc. (1987)
3. CITIDEL. http://www.citidel.org, March (2006)
4. Digital Libraries in a Box. http://dlbox.nudl.org, March (2006)
5. Eyambe, L., Suleman, H.: A Digital Library Component Assembly Environment.
In: Proceedings of the 2004 Annual Research Conference of the SAICSIT on IT
Research in Developing Countries (2004) 15–22
6. Gonçalves, M. A., Fox, E. A., Watson, L. T., Kipp, N.: Streams, Structures, Spaces,
Scenarios, Societies (5S): A Formal Model for Digital Libraries. ACM Transactions
on Information Systems 22 (2004) 270–312
7. Laender, A. H. F., Gonçalves, M. A., Roberto, P. A.: BDBComp: Building a Digital
Library for the Brazilian Computer Science Community. In: Proceedings of the 4th
ACM-IEEE Joint Conference on Digital Libraries (2004) 23–24
8. MSDN. http://msdn.microsoft.com/library/en-us/dnwue/html/ch13h.asp, March (2006)
9. Open Archives in a Box. http://dlt.ncsa.uiuc.edu/oaib, March (2006)
10. Roberto, P. A.: Um Arcabouço Baseado em Componentes, Serviços Web e Arquivos Abertos para Construção de Bibliotecas Digitais. Master’s thesis, Federal
University of Minas Gerais (2006)
11. Rumbaugh, J., Jacobson, I., Booch, G.: The Unified Modeling Language Reference
Manual. Addison-Wesley Professional (2004)
12. Santos, R. L. T.: Um Assistente para Configuração de Bibliotecas Digitais Componentizadas. In: I Workshop in Digital Libraries, Proceedings of the XX Brazilian
Symposium on Databases (2005) 11–20
13. Suleman, H., Fox, E. A.: A Framework for Building Open Digital Libraries. D-Lib
Magazine 7 (2001)
14. Suleman, H., Feng, K., Mhlongo, S., Omar, M.: Flexing Digital Library Systems. In:
Proceedings of the 8th International Conference on Asian Digital Libraries (2005)
15. Zhu, Q., Gonçalves, M. A., Shen, R., Cassell, L., Fox, E. A.: Visual Semantic
Modeling of Digital Libraries. In: Proceedings of the 7th European Conference on
Digital Libraries (2003) 325–337
Design of a Digital Library for Early 20th Century
Medico-legal Documents
George R. Thoma, Song Mao, Dharitri Misra, and John Rees
U.S. National Library of Medicine, Bethesda, Maryland, 20894, USA
{gthoma, smao, dmisra, jrees}@mail.nih.gov
Abstract. The research value to historians of medicine and law of an important collection of government documents is enhanced by a digital library being designed at the U.S. National Library of Medicine. This paper presents work toward the design of a system for the preservation of and access to this material, focusing mainly on the automated extraction of the descriptive metadata needed for future access. Since manual entry of these metadata for thousands of documents is unaffordable, automation is required. Successful metadata extraction relies on accurate classification of key textlines in the document. Methods are described for selecting optimal scanning alternatives that lead to high OCR conversion performance, and for combining a Support Vector Machine (SVM) and a Hidden Markov Model (HMM) for the classification of textlines and metadata extraction. Experimental results from our initial research toward an optimal textline classifier and metadata extractor are given.
1 Introduction
As the United States moved from an agrarian economy to an industrial one during the late 19th and early 20th centuries, food and drug regulation became increasingly important to American public health. Prior to this transformation, most food and medication came from natural sources or trusted people, but as the nation’s population became more urbanized, food and drug production became more of a manufacturing process. The mostly unregulated practice of applying chemicals, compounds, and physical processes to increase the shelf life of foods, as well as outright medical quackery, became issues of political and social concern, leading to legislation.
A landmark piece of legislation, the 1906 Federal Food and Drug Act [1], established
mechanisms for the federal government to seize, adjudicate, and punish manufacturers of adulterated or misbranded food, drugs and cosmetics. These federal activities
were carried out by the various sub-offices we now know as the U.S. Food and Drug
Administration (FDA). The legal proceedings associated with each case resulting
from these activities were documented as Notices of Judgment (NJs), published synopses created on a monthly basis.
The U.S. National Library of Medicine (NLM) has acquired a collection of FDA
documents (70,000+ pages) containing more than 65,000 NJs dating between 1906
and 1964. (In this paper, we refer to this collection as FDA documents.) To preserve
these NJs and make them accessible, our goal is to create a digital archive of both
page images and metadata. By providing access to NJs through metadata, this digital library will offer insight not only into U.S. legal and governmental history, but also into the evolution of clinical trial science and the social impact of medicine on health. The history of some of today’s best-known consumer items, such as Coca Cola, can be traced in the NJs. The intellectual value of this data for historians of medicine is expected to be high, and a Web service should increase its use exponentially.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 147 – 157, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Apart from providing access, digitization of this collection is needed strictly for preservation purposes, since many of the existing volumes of NJs are one of a kind and the earliest ones are printed on paper that is extremely brittle and prone to crumbling. Constant physical handling of the print would probably shorten its lifespan.
The creation of a digital library for this material requires a system for ingesting the scanned FDA documents, extracting the metadata, storing the documents (in TIFF and PDF forms) and metadata, and providing Web access. This paper gives an overall system description (Section 2) and focuses on techniques for automated metadata extraction, experiments and results (Section 3).
2 System Description
A critical step in preserving the FDA documents for future access is the recording of
the metadata elements pertaining to each NJ, and making the metadata accessible to
users. The manual input of metadata for 65,000 NJs would be prohibitively expensive
and error-prone. On the other hand, since these NJs are self-documenting, with important metadata elements (such as case number, description, defendant, and judgment date) encoded in the pages following certain structured layout patterns, it is possible to consider automated extraction of these elements as a cost-effective and reliable solution. In our work, this automated metadata extraction is performed using a
prototype preservation framework called System for the Preservation of Electronic
Resources (SPER) [2], which incorporates in-house tools to extract metadata from
text-based documents through layout analysis.
SPER is an evolving Java-based system to research digital preservation functions
and capabilities, including automated metadata extraction, retrieval of available metadata from Web-accessed databases, document archiving, and ensuring long term use
through bulk file format migration. The system infrastructure is implemented through
DSpace [3] (augmented as necessary to suit our purpose), along with a MySQL 5.0
database system.
The part of SPER that extracts metadata, called SPER-AME, is used for the preservation of the FDA documents. The overall workflow of the FDA documents
through the system, as well as a description of the SPER-AME architecture with focus
on components used for metadata extraction from the documents, are given below.
2.1 Preservation Workflow
Figure 1 depicts the high level workflow and processing steps involved in the preservation of the FDA documents. There are three basic steps, as described below.
Fig. 1. Preservation Workflow for FDA Notices of Judgment
As the first step, the FDA paper documents (either the originals or, more frequently, their reproduction copies) are sent to a designated external scanning facility. The TIFF images of the scanned documents are sent back to an in-house facility (represented here as the FDA NJ Preservation Facility, or FPF) and are considered to be the master images for preservation. Besides these TIFF images, derivative documents such as PDF files, created for dissemination, are also received and stored at the FPF.
In the next step, NJs are identified and metadata is automatically extracted from
these TIFF documents using SPER-AME. In this client-server system, the back-end server process runs on a stand-alone Windows 2000-based server machine, while the front-end client process, with a graphical user interface (GUI), runs on
a console used by an archivist or operator.
Using the SPER-AME GUI, the operator sends the master TIFF files in manageable batches to the server for automated extraction of metadata. The server receives the TIFF documents, identifies and extracts the embedded metadata for
each NJ using the automated metadata extractor, stores both the image files and
the extracted metadata (as XML files) in its storage system, and adds related information to the database. The operator may then view the extracted metadata for
each NJ, perform editing if necessary, validate/qualify them for preservation, and
download validated metadata to FPF local storage.
For efficiency, the SPER-AME server may perform metadata extraction from
one batch while supporting interactive metadata review and editing by the operator from an already processed batch.
In Step 3, the master TIFF images, the derivatives and the metadata are ingested into the FPF Content Management system for preservation and Web access. If necessary, the XML-formatted metadata from SPER will be reformatted to be compliant with the chosen Content Management system. This step will be discussed in a future report.
2.2 SPER-AME Architecture
As mentioned earlier, SPER is a configurable system, which (among other preservation functions) can accommodate metadata extraction for different types of documents
and collections by using pluggable tailored interfaces encapsulating the internal characteristics of those documents. Here we describe a light-weight version of SPER
(called SPER-AME), for the extraction of metadata from the FDA documents.
The SPER-AME system architecture is shown in Figure 2. Its operator interface
runs as a separate GUI process, and communicates with the SPER-AME Server using
Java Remote Method Invocation (RMI) protocols [4]. The File Copy Server is an
RMI server, which runs on the operator’s machine to transfer specified TIFF files
from FPF local storage to the server upon request. These image files are stored on a
multi-terabyte NetAPP RAID system and used for metadata extraction by the server.
The three major components that participate in the metadata extraction process are the
Metadata Manager, the Metadata Extractor, and the OCRConsole module. They are
briefly described below. (Other essential components such as the Batch Manager and
the Property Manager are not shown here for simplicity.)
Metadata Manager – This module receives all metadata-related requests from the
GUI, through higher level RMI modules, and invokes lower level modules to perform
the desired function such as extracting metadata from the documents, storing original/edited metadata in the database as XML files, and fetching these files to be sent to
the operator upon request.
Metadata Extractor – This is the heart of the SPER-AME system, which identifies
a specific NJ in a document batch and extracts the corresponding metadata elements
by analyzing its layout from the associated OCR file. Further details on this module
are provided in Section 3.
The metadata extractor for the FDA documents is chosen by the Metadata Manager
(from a set of several extractors that have been developed for different document types)
through an associated Metadata Agent module, shown in Figure 2. The Metadata Agent
returns the metadata results from the Metadata Extractor in a standardized XML format.
OCRConsole – This is an optical character recognition module, external to SPER, invoked by the Metadata Extractor to take a TIFF image, generate a set of feature values for each character in the TIFF image (such as its ASCII code, bounding box coordinates, font size, and font attributes), and store them in a machine-readable OCR output file. This OCR data is then used for layout analysis, metadata field classification, and metadata extraction.
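Before layout analysis, the per-character OCR records must be segmented into textlines; a minimal sketch of such grouping, with a made-up record format, might look like:

```python
# Group per-character OCR records into textlines by vertical proximity of
# their bounding boxes (record format is hypothetical: char, x, y, w, h).

def group_textlines(chars, y_tolerance=5):
    lines = []
    for ch in sorted(chars, key=lambda c: (c["y"], c["x"])):
        for line in lines:
            if abs(line[-1]["y"] - ch["y"]) <= y_tolerance:
                line.append(ch)
                break
        else:
            lines.append([ch])
    # Order characters left to right within each line.
    return [sorted(line, key=lambda c: c["x"]) for line in lines]

chars = [
    {"char": "N", "x": 10, "y": 100, "w": 8, "h": 12},
    {"char": "J", "x": 20, "y": 102, "w": 8, "h": 12},
    {"char": "1", "x": 10, "y": 130, "w": 8, "h": 12},
]
lines = group_textlines(chars)
print(["".join(c["char"] for c in line) for line in lines])
```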
The module Metadata Validator, shown in Figure 2, performs front-end checks, such as for missing mandatory metadata elements for an NJ item, invalid NJ identifiers, etc., so as to alert the FPF operator to review the item and make manual corrections as needed.

Fig. 2. SPER-AME System Components and Data Flow
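The front-end checks performed by the Metadata Validator can be sketched as follows; the field names and the digits-only NJ-number rule are assumptions for illustration, not the system’s actual rules.

```python
# Sketch of front-end validation: flag NJ items with missing mandatory
# fields or malformed NJ numbers so the operator can correct them.
# Field names and the up-to-five-digits rule are illustrative assumptions.

MANDATORY = ("nj_number", "case_issue_date", "defendant")

def validate_nj(item):
    problems = []
    for field in MANDATORY:
        if not item.get(field):
            problems.append(f"missing mandatory field: {field}")
    nj = item.get("nj_number", "")
    if nj and not (nj.isdigit() and len(nj) <= 5):
        problems.append(f"invalid NJ identifier: {nj!r}")
    return problems

item = {"nj_number": "123a5", "case_issue_date": "1933-05-01", "defendant": ""}
print(validate_nj(item))
```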
3 Automated Metadata Extraction
Automated metadata extraction, an essential step in the economical preservation of these historic medico-legal documents, consists of the stages shown in Figure 3. Since the originals are brittle and have a small font size, they are first photocopied at a magnified scale and an appropriate toner level. Another reason for photocopying is the reluctance to send one-of-a-kind rare documents to an outside facility. The photocopied version is then digitized as a TIFF image, which is recognized by the OCRConsole module, whose design relies on libraries of the FineReader 6.0 OCR engine.
Textlines are first segmented using the OCR output, and then fourteen features are extracted from each textline. Layout is classified using layout-type-specific keywords.
Each textline is classified as a case header, case body, page header (including page
number, act name, and N. J. type or case range), and case category (e.g. cosmetics,
food, drug, etc.) using a pre-trained layout type specific model file. Finally, metadata
is extracted from the classified textline using metadata specific tags. Figure 4 shows
an example of textline classes and its class syntax model that will be described in
Section 3.2.
Fig. 3. Automated metadata extraction system. Ovals represent processes and rectangles represent objects or data.
In the following subsections, we first describe the required metadata and layout classification, and then describe the 14 features extracted from each textline. Given next are the methods for classifying textlines and for extracting metadata from the classified textlines. Finally, we report experimental results.
3.1 Metadata and Layout Classification
Metadata important for future access to the FDA documents occur in the text. There
are also metadata that are either constant such as format of the image (e.g., TIFF) or
related to system operation (e.g., metadata creation time stamp). Table 1 provides a
list of the metadata items of interest contained in these documents. Note that IS and
Sample numbers are related to “Interstate Shipment” of food, drug and cosmetic
products and are used to identify a specific type of case. FDC and F&D numbers are used to categorize cases into Food, Drug and Cosmetic publications.

Fig. 4. Textline classes in a sample TIFF image and its class syntax model
Table 1. Metadata items in historical medico-legal documents

Metadata item                        Location
Case issue date                      Page header text
Case/NJ number                       Case header text
Case keyword                         Case header text
F.D.C, Sample, IS and F&D numbers    Page header text or Case header text
Defendant Name(s)                    Case body text
Adjudicating court jurisdiction      Case body text
Seizure location                     Case body text
Seizure date                         Case body text
These historical documents possess different layout types; Figure 5 shows three typical ones. We recognize the layout types by layout-specific keywords in the OCR results. For example, keywords such as “QUANTITY” and “LIBEL FILED” in layout type 1 are used for its detection. Once the layout type of a set of TIFF images is detected, a classification model is learned for this particular layout type and used for textline classification in subsequent TIFF images possessing the same layout.
Fig. 5. Three typical layout types. Note that capitalized keywords such as “QUANTITY” and
“NATURE OF CHARGE” are used to tag case body text in layout type 1, while case body text
in layout types 2 and 3 appears as free text without such tags.
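Layout-type detection by keywords, as described above, can be sketched with a simple scoring rule; the keyword lists for types 2 and 3 are placeholders, since the paper only names the type-1 keywords.

```python
# Detect the layout type of a page from layout-specific keywords in the
# OCR text. Only "QUANTITY" / "LIBEL FILED" / "NATURE OF CHARGE" are
# named in the paper (layout type 1); the other lists are placeholders.

LAYOUT_KEYWORDS = {
    1: ["QUANTITY", "LIBEL FILED", "NATURE OF CHARGE"],
    2: ["PLACEHOLDER_TYPE2_KEYWORD"],
    3: ["PLACEHOLDER_TYPE3_KEYWORD"],
}

def detect_layout(ocr_text):
    scores = {t: sum(kw in ocr_text for kw in kws)
              for t, kws in LAYOUT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

page = "QUANTITY: 45 cases ... LIBEL FILED: May 12, 1933 ..."
print(detect_layout(page))
```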
3.2 Features, Textline Classification and Metadata Extraction
We extract a set of 14 features from each textline using the OCR results:

1: ratio of black pixels;
2-5: mean of character width, height, aspect ratio, and area;
6-9: variance of character width, height, aspect ratio, and area;
10: total number of letters and numerals / total number of characters;
11: total number of letters / total number of letters and numerals;
12: total number of capital letters / total number of letters;
13-14: indentation, where 00 denotes a centered line, 10 a left-indented line, 11 a full line, and 01 a right-indented line; thus the 13th feature value can indicate whether the line touches the left margin, and the 14th whether it touches the right margin.
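A sketch of the 14-feature vector computed from per-character OCR output; the record format, the black-pixel-ratio input, and the use of touch-left/touch-right flags for features 13-14 are our assumptions, since the paper does not give exact formulas.

```python
from statistics import mean, pvariance

def textline_features(chars, black_ratio, touches_left, touches_right):
    """14 features per textline, following the paper's list. `chars` is a
    list of (character, width, height) tuples; `black_ratio` is assumed
    to come from the line's image; formats are illustrative."""
    widths = [w for _, w, h in chars]
    heights = [h for _, w, h in chars]
    aspects = [w / h for _, w, h in chars]
    areas = [w * h for _, w, h in chars]
    letters = sum(c.isalpha() for c, _, _ in chars)
    numerals = sum(c.isdigit() for c, _, _ in chars)
    capitals = sum(c.isupper() for c, _, _ in chars)
    n = len(chars)
    return [
        black_ratio,                                   # 1
        mean(widths), mean(heights),                   # 2-3
        mean(aspects), mean(areas),                    # 4-5
        pvariance(widths), pvariance(heights),         # 6-7
        pvariance(aspects), pvariance(areas),          # 8-9
        (letters + numerals) / n,                      # 10
        letters / (letters + numerals) if letters + numerals else 0.0,  # 11
        capitals / letters if letters else 0.0,        # 12
        int(touches_left), int(touches_right),         # 13-14
    ]

chars = [("N", 8, 12), ("J", 8, 12), ("1", 6, 12)]
fv = textline_features(chars, black_ratio=0.18,
                       touches_left=True, touches_right=False)
print(len(fv))
```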
We classify textlines with a method that combines static classifiers with stochastic language models representing temporal class syntax. Support Vector Machines (SVMs) [5] are used as static feature classifiers. They achieve good classification performance by producing nonlinear class boundaries in the original feature space, constructed as linear boundaries in a larger, transformed version of that space. However, they cannot model the class syntax, i.e., the evolution of class labels in a sequence, as shown in Figure 4. Stochastic language models such as Hidden Markov Models (HMMs) [6], on the other hand, are appropriate for modeling such class syntax. When features from different textline classes overlap in feature space, SVM classifiers can produce misclassification errors, while HMMs can correct such errors by enforcing the class syntax constraints. We therefore combine SVMs and HMMs in our algorithm [7] for optimal classification performance.
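One way to realize such a combination can be sketched as follows: the class syntax (initial and transition probabilities) is estimated by counting over labeled training pages, and Viterbi decoding then reconciles per-line classifier scores with that syntax. This is an illustrative sketch, not the authors’ algorithm [7]; the class names and scores are made up.

```python
import math
from collections import Counter, defaultdict

def estimate_syntax(pages):
    """Initial-state and transition probabilities from labeled sequences."""
    starts = Counter(page[0] for page in pages)
    trans = defaultdict(Counter)
    for page in pages:
        for a, b in zip(page, page[1:]):
            trans[a][b] += 1
    start_p = {s: c / len(pages) for s, c in starts.items()}
    trans_p = {a: {b: c / sum(row.values()) for b, c in row.items()}
               for a, row in trans.items()}
    return start_p, trans_p

def viterbi(emissions, states, start_p, trans_p, eps=1e-9):
    """Most likely state sequence given per-line class scores."""
    lp = lambda x: math.log(max(x, eps))
    V = [{s: lp(start_p.get(s, 0)) + lp(emissions[0][s]) for s in states}]
    back = [{}]
    for i in range(1, len(emissions)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: V[i-1][p] + lp(trans_p.get(p, {}).get(s, 0)))
            V[i][s] = (V[i-1][prev] + lp(trans_p.get(prev, {}).get(s, 0))
                       + lp(emissions[i][s]))
            back[i][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    seq = [last]
    for i in range(len(emissions) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))

# Class syntax learned from two labeled training pages.
pages = [["page_header", "case_header", "case_body", "case_body"],
         ["page_header", "case_header", "case_body", "case_header", "case_body"]]
start_p, trans_p = estimate_syntax(pages)

states = ["page_header", "case_header", "case_body"]
# Hypothetical SVM scores: the second line is ambiguous on features alone,
# but the syntax (a case header follows the page header) resolves it.
emissions = [{"page_header": 0.8, "case_header": 0.1, "case_body": 0.1},
             {"page_header": 0.1, "case_header": 0.45, "case_body": 0.45},
             {"page_header": 0.1, "case_header": 0.1, "case_body": 0.8}]
labels = viterbi(emissions, states, start_p, trans_p)
print(labels)
```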
To represent class syntax in a one-dimensional sequence of labeled training textlines with an HMM, we order textlines from left to right and top to bottom. Each distinct
state in the HMM represents a textline class. State transitions represent possible class
label ordering in the sequence as shown in Figure 4. Initial state probabilities and state
transition probabilities are estimated directly from the class labels in the sequence. In
the training phase, both the SVM and HMM are learned from the training dataset. In
the test phase, they are combined in our algorithm [7] to classify textlines in the test
dataset. Once a textline is classified, metadata items are extracted from it using metadata specific tags. Table 2 lists tag names used for different metadata items.
Table 2. Specific tags for metadata extraction

Metadata item                        Tag(s)
Case issue date                      No tags needed (full text in identified field)
Case/NJ number                       First word (in case header text)
Case keyword                         Adulteration or misbranding (in case header text)
F.D.C, Sample, IS, and F&D numbers   Last open and closing parenthesis (in case header text)
Defendant Name(s)                    Against, owned by, possession of, shipped by, manufactured by, transported by, consigned by
Adjudicating court jurisdiction      Filed in, convicted in, term of, session of, indictment in, pending in
Seizure location                     From … to …
Seizure date                         Shipped on or about, shipped during, shipped within the period
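As a sketch, extraction from a classified case-body textline could key regular expressions on the tags of Table 2; the patterns and the sample sentence below are illustrative, not the system’s actual ones.

```python
import re

# Illustrative tag-keyed extraction from a classified case-body textline.
# The regex patterns and the sample sentence are made up; the system's
# actual patterns are not given in the paper.

TAG_PATTERNS = {
    "seizure_date": r"shipped on or about ([A-Z][a-z]+ \d{1,2}, \d{4})",
    "seizure_route": r"from (.+?) to (.+?)[,.]",
}

def extract_from_body(text):
    out = {}
    m = re.search(TAG_PATTERNS["seizure_date"], text, re.IGNORECASE)
    if m:
        out["seizure_date"] = m.group(1)
    m = re.search(TAG_PATTERNS["seizure_route"], text, re.IGNORECASE)
    if m:
        out["seizure_origin"] = m.group(1)
    return out

body = ("Adulteration of canned peaches. Shipped on or about May 12, 1933, "
        "from Atlanta, Ga. to New Orleans, La., by a hypothetical shipper.")
fields = extract_from_body(body)
print(fields)
```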
3.3 Experiments
To investigate optimal OCR and textline classification performance, we first photocopy the original document pages at different scales and toner levels, scan the photocopies into TIFF images, and then run our algorithm on these TIFF images. We select
a scale of 130% for photocopying the 38 original pages of layout type 3 since this is
the maximum possible scale that magnifies the text for the best OCR results while at
the same time avoiding border cut-off. The classification algorithm is trained on a
different training dataset of the same layout type at 130% scale and toner level 0. The
reason for this choice is evident from Table 3, which shows the OCR performance (in terms of NJ number recognition error rate) and the textline classification error rate at different toner levels. We consider an NJ number to be incorrectly recognized if any of its digits (up to five) is in error or if extra text is inadvertently included. Test results are from an older version of the OCR engine; upgrading to the latest version is expected to significantly improve character recognition accuracy.
Note that when the toner level increases, there tend to be more noisy textlines and more misclassified textlines. When the toner level decreases, the text becomes too light, there are more OCR errors, and therefore fewer NJ numbers are recognized correctly. OCR performance is optimal at toner level 0. Since the number of misclassified textlines at toner level 0 is not very different from that at other toner levels, we select toner level 0 as the optimal value for our experiment. We can also see that the classification performance of our algorithm is relatively insensitive to changes in toner level.
G.R. Thoma et al.
Table 3. Textline classification and OCR performance at different toner levels
(Columns: toner level; textline classification error, as number of incorrectly classified textlines / total number of textlines; OCR performance, as NJ number recognition error rate, i.e., number of incorrectly recognized NJ numbers / total number of NJ numbers.)
We then train our classification algorithm on training datasets of two of the layout types shown in Figure 5, and test the algorithm on separate test datasets of these layout types. We do not report experimental results for layout type 2, since it has a very limited number of pages in our test sample. Table 4 shows the experimental results. Introducing class syntax models (HMMs) reduces the textline classification errors of the static classifiers (SVMs) from 2.22% to 1.22% for layout type 1 and from 1.98% to 0.33% for layout type 3, a substantial improvement that justifies our hybrid classifier design. Since most textlines are correctly classified, appropriate metadata items can be extracted from them using specific tags.
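The hybrid correction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: per-line class probabilities from a static classifier (standing in for the SVM outputs) serve as emission scores, and a class syntax model over label sequences (the HMM) is decoded with the Viterbi algorithm. The class names, transition table, and priors below are hypothetical.

```python
import math

def viterbi_correct(posteriors, trans, init):
    """Decode the most likely textline-label sequence under a class syntax
    model, using static-classifier posteriors as emission scores."""
    states = list(init)
    eps = 1e-12  # guard against log(0)
    # best log-score of a path ending in each state at line 0
    score = {s: math.log(init[s] * posteriors[0][s] + eps) for s in states}
    back = []
    for t in range(1, len(posteriors)):
        new_score, ptrs = {}, {}
        for s in states:
            # best predecessor state for label s at line t
            p = max(states, key=lambda q: score[q] + math.log(trans[q][s] + eps))
            ptrs[s] = p
            new_score[s] = (score[p] + math.log(trans[p][s] + eps)
                            + math.log(posteriors[t][s] + eps))
        score = new_score
        back.append(ptrs)
    # backtrack from the best final state
    last = max(states, key=score.get)
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

For instance, if the static classifier weakly labels a mid-page line as a header, but the syntax model makes a header following body lines very unlikely, decoding corrects the label to body.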
Table 4. Experimental results for two layout types

Layout type 1, training: 30 pages; 1,423 textlines; SVM errors: 5 (5/1,423 = 0.35%); corrected by HMM: 3; final errors: 2 (2/1,423 = 0.14%)
Layout type 3, training: 30 pages; 1,849 textlines; SVM errors: 3 (3/1,849 = 0.16%); corrected by HMM: 1; final errors: 2 (2/1,849 = 0.11%)
Layout type 1, test: 189 pages; 9,524 textlines; SVM errors: 211 (211/9,524 = 2.22%); corrected by HMM: 95; final errors: 116 (116/9,524 = 1.22%)
Layout type 3, test: 195 pages; 11,646 textlines; SVM errors: 231 (231/11,646 = 1.98%); corrected by HMM: 193; final errors: 38 (38/11,646 = 0.33%)
4 Conclusion
Design of a Digital Library for Early 20th Century Medico-legal Documents

In this paper, we have described research toward a system for automated metadata extraction from historic medico-legal documents. Specifically, we introduce a method that combines the power of static classifiers and class syntax models for optimal classification performance. In this method, each textline in these documents is classified into a category of interest. We tested our method on several hundred pages, and our experimental results show that the use of a class syntax model significantly reduces the classification errors made by static classifiers. Future work includes automated selection of metadata-specific tags for metadata extraction from free text, feature subset selection, and image enhancement during digitization.
Acknowledgments

This research was supported by the Intramural Research Program of the U.S. National Library of Medicine, National Institutes of Health.
References

1. Public Law 59-384, repealed in 1938 by 21 U.S.C. Sec. 329(a); and U.S. Food and Drug Administration, “Federal Food and Drugs Act of 1906 (The ‘Wiley Act’),” http://www.fda.gov/opacom/laws/wileyact.htm (3 Feb. 2006).
2. Mao, S., Misra, D., Seamans, J., Thoma, G. R.: Design Strategies for a Prototype Electronic Preservation System for Biomedical Documents. Proc. IS&T Archiving Conference, Washington, DC, pages 48–53 (2005).
3. DSpace at MIT, http://www.dspace.org.
4. Java Remote Method Invocation, http://java.sun.com/products/jdk/rmi/.
5. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning, Vol. 20, pages 273–297 (1995).
6. Rabiner, L. R., Juang, B. H.: Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall (1993).
7. Mao, S., Mansukhani, P., Thoma, G. R.: Feature Subset Selection and Classification using Class Syntax Models for Document Logical Entity Recognition. Proc. IEEE International Conference on Image Processing, Atlanta, GA (2006). Submitted.
Expanding a Humanities Digital Library: Musical
References in Cervantes’ Works
Manas Singh1, Richard Furuta1, Eduardo Urbina2, Neal Audenaert1,
Jie Deng1, and Carlos Monroy1
1 Department of Computer Science, Texas A&M University
Center for the Study of Digital Libraries*
College Station, TX 77843-3112
2 Department of Hispanic Studies, Texas A&M University
Center for the Study of Digital Libraries*
College Station, TX 77843-4238
[email protected]
Abstract. Digital libraries focused on developing humanities resources for both
scholarly and popular audiences face the challenge of bringing together digital
resources built by scholars from different disciplines and subsequently
integrating and presenting them. This challenge becomes more acute as libraries
grow, both in terms of size and organizational complexity, making the
traditional humanities practice of intensive, manual annotation and markup
infeasible. In this paper we describe an approach we have taken in adding a
music collection to the Cervantes Project. We use metadata and the
organization of the various documents in the collection to facilitate automatic
integration of new documents—establishing connections from existing resources
to new documents as well as from the new documents to existing material.
1 Introduction
As a digital library grows in terms of both size and organizational complexity, the
challenge of understanding and navigating the library’s collections increases
dramatically. This is particularly acute in scenarios (e.g., scholarly research) in which
readers need and expect to be able to survey all resources related to a topic of interest.
While large collections with a rich variety of media and document sources make
valuable information available to readers, it is imperative to pair these collections
with tools and information organization strategies that enable and encourage readers
to develop sophisticated reading strategies in order to fully realize their potential [11].
Traditional editorial approaches have focused on detailed hand editing—carefully
reading and annotating every line on every page with the goal of producing a
completed, authoritative edition. Often, such approaches are infeasible in a digital
library environment. The sheer magnitude of many digital collections (e.g., the
Gutenberg Project [13], the Christian Classics Ethereal Library [20], the Making of
America [7][17]) makes detailed hand editing unaffordably labor intensive, while the
very nature of the project often conflicts with the traditional goal of producing a final,
* Authors’ academic affiliations.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 158 – 169, 2006.
© Springer-Verlag Berlin Heidelberg 2006
fixed edition. Previously, we have described the multifaceted nature of humanities
collections focused on a single author and argued that these projects will require
automatic integration of many types of documents, drawn from many sources, compiled
by many independent scholars, in support of many audiences [1]. Such collections are
continuously evolving. As each new artifact is added to the collection, it needs to be
linked to existing resources and the existing resources need to be updated to refer to the
new artifact, where appropriate. Constructing these collections will require new tools
and perspectives on the practice of scholarly editing [10]. One such class of tools supports automatic discovery and interlinking of related resources.
For the last ten years, the Cervantes Project [25] has focused on developing on-line resources on the life and works of Miguel de Cervantes Saavedra (1547–1616), the author of Don Quixote [5], and has thus proven to be a rich
environment for exploring these challenges. Given its canonical status within the
corpus of Hispanic literature and its iconic position in Hispanic culture, the Quixote
has received a tremendous amount of attention from a variety of humanities
disciplines, each bringing its own unique questions and approaches. Within the broad
scope of this project, individual researchers have made a variety of contributions,
each centered on narrowly scoped research questions. Currently, work in the project
can be grouped into six sub-projects: bibliographic information, textual studies,
historical research, music, ex-libris, and textual iconography. Together, these
contributions span the scope of Cervantes’ life and works and their impact on society.
In this paper, we describe the approach that we have taken in connection with the
presence and influence of music in Cervantes’ works. The data for this project was
collected by Dr. Juan José Pastor as part of his dissertation work investigating
Cervantes’ interaction with the musical culture of his time and the subsequent musical
interpretations of his works [18]. Pastor’s collection is organized in five main
categories (instruments, songs, dances, composers, and bibliographical records) and
contains excerpts from Cervantes’ writings, historical and biographical information,
technical descriptions, images, audio files, and playable scores from songs. Although
Pastor has completed his dissertation, the collection is still growing, as new scores,
images, and documents are located. For example, a recent addition, published in
conjunction with the 400th anniversary of the publication of the Quixote, is a
professionally-produced recording of 22 of the songs referred to by Cervantes [12].
The music sub-project reflects many aspects of the complexity of the Cervantes
Project as a whole, and thus provides an excellent testbed for developing tools and
strategies for integrating an evolving collection of diverse artifacts for multiple
audiences. A key challenge has been determining how to build an interface that
effectively integrates the various components, in a manner that supports the reader’s
understanding of the implicit and explicit relationships between items in the
collection. In particular, since the collection is growing with Pastor’s ongoing
research, it was necessary that the interface be designed so that new resources could
be easily added and the connections between new and old resources generated
automatically. To address this challenge we have developed an automatic linking
system that establishes relationships between resources based on the structural
organization of the collection and various metadata fields associated with individual
documents. An editor’s interface allows users an easy way to add new resources to the
collection and to specify the minimal set of metadata required to support link
generation. Further, a reader’s interface is provided that identifies references within
texts to other items in the collection and dynamically generates navigational links.
M. Singh et al.
2 Background
Developing a system to integrate resources within the collection required attention to
three basic questions: What types of reader (and writer/editor) interactions are to be
supported? What types of information and connections are to be identified? How will that
information be identified and presented to readers? A brief survey of related projects will
help to set the context for the design decisions we have made in these areas.
The Perseus Project [26] has developed a number of sophisticated strategies for
automatically generating links in the context of cultural heritage collections [8][9].
Our work has been heavily influenced by their use of dense navigational linking both
to support readers exploring subjects with which they are unfamiliar and to encourage
readers more closely acquainted with a subject to more fully explore and develop their
own interpretive perspectives. Early work focused on developing language based
tools to assist readers of their extensive Greek and Latin collections. These tools
linked words to grammatical analysis, dictionaries and other linguistic support tools,
helping a wider audience understand and appreciate them. More recently, they have
focused on applying some of the techniques and technologies developed for their
Classical collection to a variety of other, more recent data sets including American
Civil War and London collections. This work has focused on identifying names,
places, and dates to provide automatically generated links to supplementary
information and to develop geospatial representations of the collection’s content.
They have had good results from a layered approach using a combination of a priori
knowledge of semi-structured documents (e.g., of the British Directory of National
Biography and London Past and Present), pattern recognition, named entity recognition,
and gazetteers to identify and disambiguate references to people, places, and events.
A key technology for supporting this type of integration between resources within
a collection is the use of name authority services. The SCALE Project (Services for a
Customizable Authority Linking Environment) is developing automatic linking
services that bind key words and phrases to supplementary information and
infrastructure to support automatic linking for collections within the National Science
Digital Library [19]. This collaborative effort between Tufts University and Johns
Hopkins University builds on the tools and techniques developed in the Perseus
Project in order to better utilize the authority controlled name lists, thesauri,
glossaries, encyclopedias, subject hierarchies and object catalogs traditionally
employed in library sciences in a digital environment.
As an alternative to authority lists, the Digital Library Service Integration (DLSI)
project uses lexical analysis and document structure to identify anchors for key terms
within a document [6]. Once the anchors are identified, links are automatically
generated to available services based on the type of anchor and the specified rules.
For example, if a term is a proper noun it can be linked to glossaries and thesauri to
provide related information.
Also of relevance is the long history in the hypertext research community of link
finding and of link structures that are more than simple source to destination
connections. Early work in link finding includes Bernstein’s Link Apprentice [4] and
Salton’s demonstration of applications [22] of his Smart system’s vector-space model
[21]. Link models related to our work include those that are multi-tailed, for example
MHMT [15] and that represented in the Dexter model [14].
Fig. 1. Related links and a sample image for sonaja instrument
3 Interface and Usage Scenario
Within the context of the Cervantes music collection, we have chosen to focus on
identifying interrelationships between the structured items in our collection in order to
provide automatic support for the editorial process rather than relying on authority lists
or linguistic features to connect elements of the collection to externally supplied
information sources (such support could be added later, if warranted). We have
divided the resources in our collection into categories of structured information (e.g.,
instruments, songs, composers). Each category contains a set of items (e.g., a particular
song or composer). Each item is in turn represented by a structured set of documents.
How the documents for any given item are structured is determined by the category it is
a member of. For example, arpa (a harp) is an item within the instruments category.
This instrument (like all other instruments) may have one or more of each of the
following types of documents associated with it: introductory articles, images, audio
recordings, historical descriptions, bibliographic references, links to online resources,
and excerpts from the texts of Cervantes that refer to an arpa.
Each item is identified by its name and by a list of aliases. Our system identifies references to these terms in all of the resources located elsewhere in the collection, either as direct references in texts or within the metadata fields of non-textual documents. At present, the matching algorithm is a simple longest match: when candidate terms overlap, the longest term string found at the target is selected. Once identified, the references are linked to the item.
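The longest-match identification described above can be sketched as follows. This is an illustrative rendering, not the project's actual code, and the function and sample terms are hypothetical.

```python
import re

def find_keyword_refs(text, keywords):
    """Find references to item names and aliases in a text, preferring the
    longest matching term wherever candidate keywords overlap."""
    hits, taken = [], set()
    # try longer terms first, so "guitarra barroca" wins over "guitarra"
    for term in sorted(keywords, key=len, reverse=True):
        for m in re.finditer(r'\b' + re.escape(term) + r'\b', text, re.IGNORECASE):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):  # region not claimed by a longer term
                taken.update(span)
                hits.append((m.start(), term))
    return [term for _, term in sorted(hits)]
```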
The presentation of information to the reader uses a custom representation of links.
This is because of the complexity of the object linked to—a complexity that reflects
the multiple user communities that we expect will make use of the collection.
Moreover, the collection provides multiple roots that reflect different reader communities.
In developing the Cervantes music collection we have focused our design on
meeting the needs of two primary communities of readers. One group is composed of
Cervantes scholars and music historians interested in research about Cervantes’ works
and music. The second group is composed of non-specialists interested in gaining
access to information they are unfamiliar with. For both the specialist and the non-specialist, the collection provides two major focal points, or roots, for access. For
example, a reader might approach the music collection from the texts of Cervantes
(which themselves compose a distinct collection), asking how a particular passage
reflects Cervantes’ understanding of contemporary musical trends or in order to better
understand what, for example, an albogue looks and sounds like.1 Another reader
might begin by considering a particular composition that alludes to Cervantes and ask
how this particular piece reflects (or is distinct from) other popular interpretations of
the Quixote. Similarly, a non-expert might find his understanding of a particular opera
enhanced by learning more about an obscure reference to one of Cervantes’ works. In
this way the linkages generated between these two distinct but related collections
allow readers access to a rich and diverse body of resources from multiple
perspectives to achieve a variety of goals. We refer to collections that exhibit this type
of structure as being multi-rooted. Natural roots for the music collection include
compositions (e.g., songs and dances), composers, instruments, and the writings of
Cervantes. In the remainder of this section we present several brief reader interaction
scenarios to help illustrate the design of the system from a reader’s perspective. In the
following section we present an overview of the technical design and implementation
of the link generation system and the interface.
In the first scenario, a native, modern Spanish speaker is reading a less well-known
text of Cervantes, Viaje del Parnaso (1614), and encounters a reference to an
instrument she is unfamiliar with, the sonaja. Curious, she clicks on the link and a
drop-down menu appears displaying links to the various types of documents present
in the collection. She elects to view the ‘sample image,’ resulting in the display
shown in Figure 1. The image sparks her curiosity and she decides to see what it
sounds like by clicking on the ‘sample audio’ link. What is this, who would use it, and
why? To find out more, she clicks to read the introductory text and finds a list of
definitions where she learns that it is a small rustic instrument that was used in the
villages by beating it against the palm of the hands. Interestingly, the Egyptians used
it in the celebrations and sacrifices to the goddess. Having learned what she wanted to
know, she returns to reading Viaje del Parnaso.
“What are albogues?” asked Sancho, “for I never in my life heard tell of them or saw them.” “Albogues,” said Don Quixote, “are brass plates like candlesticks that struck against one another on the hollow side make a noise which, if not very pleasing or harmonious, is not disagreeable and accords very well with the rude notes of the bagpipe and tabor.” [Don Quixote, Part 2, Chapter 65]
Fig. 2. Learning more about the arpa
In the second scenario, a music historian with relatively little familiarity with Don
Quixote or the other works of Cervantes is interested in exploring how string
instruments were used and their societal reception. On a hunch, he decides to see how
societal views of the harp and other instruments might be reflected in the works of
Cervantes. Browsing the collection, he navigates to the section for the harp and
peruses the texts of Cervantes that refer to the harp (Figure 2). After surveying that
information, he explores some of the other instruments in order to get a broader
perspective on how Cervantes tends to discuss and incorporate musical instruments in
his writings. He finds a couple of passages that help to illustrate the ideas he has been
developing, and makes a note of them to refer to later.
In the final scenario, an editor is working with the collection, adding the historical
documents to the song, “Mira Nero de Tarpeya.” As shown in Figure 3, he browses to
the list of composers and notices that, while there is a link to Mateo Flecha, there is
no information provided for Francisco Fernández Palero. He quickly navigates to the
“composers” category, adds Palero as a new composer (Figure 4), and writes a short
description of him and his relevance to classical music. The system recognizes the
new composer and updates its generated links accordingly. Currently, since only
minimal information is present, these links refer only to the newly written
introductory text. A few weeks later, the editor returns to the collection after finding
images, lists of songs written, and historical descriptions. He adds these through
forms similar to the one he used to add Fernández Palero. Links to these new
resources are now added to the drop down menu associated with references to
Fernández Palero. In this way, the editor is able to focus on his area of expertise in
finding and gathering new information that will enhance the scholarly content of the
collection, removing the burden of manually creating links from all the existing
documents to the newly added composer.
Fig. 3. Browsing a Song in the Editor’s Interface
Fig. 4. Adding the composer Francisco Fernández Palero
4 Organization of the Digital Library
Information in the collection is organized as hierarchical groups. At the highest level,
materials are grouped into eight categories:
Instruments: information pertaining to the different musical instruments that
have been referred to by Cervantes in his works.
Songs: information regarding the different songs that have influenced Cervantes and his work.
Dances: resources related to the dances that have been referred to in
Cervantes’ texts.
Composers: the composers who have influenced Cervantes and his work.
Bibliography: bibliographical entries related to instruments, songs, and
dances that have been referred to in Cervantes’ texts.
Musical Reception: bibliographical entries about musical compositions that
have been influenced by Cervantes or refer to his works.
Cervantes Text: full texts of Cervantes’ works.
Virtual Tour: links to virtual paths, constructed and hosted using Walden’s
Paths [23]. This allows the information to be grouped and presented in
different manners, catering to the interests of diverse scholars, thus opening
up the digital library to unique interpretive perspectives.
Most categories are further subdivided into items. An item defines a unique logical
entity, as appropriate for the type of category. For example, the category “Instruments”
contains items such as arpa and guitarra. Similarly, each composer would be
represented as an item in the respective category as would each dance and each song. The
item is identified by its name, perhaps including aliases (e.g., variant forms of its name).
Artifacts associated with each item are further categorized into different topics like
image, audio, and text. The topics under an item depend on the category to which the item belongs. For example, an item under the category “Instruments” will have topics like introduction, audio, image, text, and bibliography, but an item under the category “Composer” will have topics like life, image, work, and bibliography.
An artifact (e.g., an individual picture; a single essay) is the “atomic” unit in the
collection. Thus artifacts are grouped under topics, which in turn are grouped into
items, which in turn are grouped into categories. A unique item identifier identifies
each item in the digital library. Additionally, each artifact placed under an item is
assigned a sub-item identifier that is unique among all the artifacts under that item.
Thus all the artifacts, including texts, audio files, images, musical scores, etc., are
uniquely identified by the combination of item identifier and sub-item identifier.
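The identifier scheme described above can be summarized in a small data model. This is a sketch with hypothetical class and field names, not the system's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    sub_id: int   # unique only among the artifacts under one item
    topic: str    # e.g. "image", "audio", "introduction"
    path: str     # location of the underlying file

@dataclass
class Item:
    item_id: str                 # unique across the whole digital library
    category: str                # e.g. "Instruments", "Composers"
    name: str
    aliases: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)  # sub_id -> Artifact

    def add(self, artifact):
        assert artifact.sub_id not in self.artifacts, "sub-item id must be unique"
        self.artifacts[artifact.sub_id] = artifact
        # (item id, sub-item id) uniquely identifies the artifact library-wide
        return (self.item_id, artifact.sub_id)
```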
5 Interlinking
The process of creating interlinks and presenting the related links can be broadly
classified into four major steps. The first is maintaining the list of item names for
which information exists in the digital library. The second is a batch job, which identifies references to these terms in all the texts present in the digital library. The
third step is a run time process, which, while displaying a text, embeds the terms that
need to be linked with a hyperlink placeholder (i.e., hyperlink without any specific
target). This step uses the data from the batch job to identify the terms that should be
presented with the hyperlink for any text. The final step generates the actual related
links for a term and is invoked only when the user clicks on a hyperlink placeholder.
A description of these steps follows.
Maintaining the keyword list: In order for the system to provide related links, it
should be able to identify the terms for which information exists in the digital library.
This is achieved by maintaining a keyword list. To identify variations in names, a synonym list is also maintained. The system depends on the user to provide a list of
synonyms for the item being added. This may include alternate names for the item or
just variations in the spelling of the item name.
When a new item is added to the digital library its name or title is added to the
keyword list and its aliases to the synonym list. In the following sections the
keyword and synonym lists will be referred to collectively as keywords.
Document keyword mapping batch job: The document keyword mapping is created by indexing all the texts using Lucene and finding references to each term in the keyword list across all the texts. This is done offline in a batch process, which also populates a document keyword map that maps each document to all the keywords it refers to.
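The batch job's output can be sketched as below. The real system queries a Lucene index; in this illustration a plain text scan stands in for the index lookup, and all names are assumptions.

```python
import re

def build_doc_keyword_map(documents, keywords):
    """Offline batch job: map each document id to the keywords it refers to.
    (A plain scan stands in for the Lucene index query used by the system.)"""
    doc_map = {}
    for doc_id, text in documents.items():
        doc_map[doc_id] = [k for k in keywords
                           if re.search(r'\b' + re.escape(k) + r'\b',
                                        text, re.IGNORECASE)]
    return doc_map
```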
Runtime display of texts with hyperlink placeholders: While displaying a text, the system uses the document keyword map to identify the keywords from the keyword list that are present in the text. Once the list of keywords present in the text is known, their occurrences in the text are identified and embedded with hyperlink placeholders. In essence, each instance of a keyword in the source is replaced by

<a href="javascript:nop" class="cerhyperlink">keyword</a>

which invokes the appropriate display function when selected.
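The substitution step can be sketched as follows. This is a simplified illustration; the real system works from the precomputed document keyword map rather than rescanning the text, and the helper name is hypothetical.

```python
import re
from html import escape

PLACEHOLDER = '<a href="javascript:nop" class="cerhyperlink">{}</a>'

def embed_placeholders(text, keywords):
    """Wrap each keyword occurrence in the hyperlink placeholder; the actual
    link targets are resolved later, when the reader clicks."""
    # longest terms first, so multi-word keywords are not split by shorter ones
    pattern = '|'.join(re.escape(k)
                       for k in sorted(keywords, key=len, reverse=True))
    return re.sub(r'\b(' + pattern + r')\b',
                  lambda m: PLACEHOLDER.format(escape(m.group(1))),
                  text, flags=re.IGNORECASE)
```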
Display of composite links: The related links display is generated when the user
clicks on a keyword’s hyperlink placeholder. The click event is intercepted by a client
side JavaScript function that parses the hyperlink statement and retrieves the actual
keyword, sending the keyword to the server.
When the request is received at the server, the keyword is retrieved from the
request parameters and the metadata repository is used to find all the artifacts related
to the keyword. Using these related artifacts, the distinct list of topics to which they
belong is identified and links to these topics are generated. For example, if the item
has some related image resources then a link to view the images is added to the
related links list. Furthermore, the format of these artifacts is also noted. If they are of a format like image or audio, then a link to a sample image or audio is also added to the
related links list. This sample audio or image is displayed in a new page right on top
of the text. This allows the user to view a sample image or listen to a sample audio
clip without leaving the text.
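The server-side generation of the related links can be sketched as follows. The data structures, field names, and URL scheme here are hypothetical; the real system reads the metadata repository.

```python
def related_links(keyword, items, base_url="/music"):
    """Find the item named by a clicked keyword, group its artifacts by
    topic, and add 'sample' links when image or audio artifacts exist."""
    kw = keyword.lower()
    item = next(i for i in items
                if kw == i["name"].lower()
                or kw in (a.lower() for a in i["aliases"]))
    # one "view" link per distinct topic the item's artifacts belong to
    links = [("view " + topic, "%s/%s/%s" % (base_url, item["id"], topic))
             for topic in sorted({a["topic"] for a in item["artifacts"]})]
    # extra "sample" links for media formats, shown on top of the text
    for fmt in ("image", "audio"):
        if any(a["format"] == fmt for a in item["artifacts"]):
            links.append(("sample " + fmt,
                          "%s/%s/sample-%s" % (base_url, item["id"], fmt)))
    return links
```

The returned list corresponds to the entries shown in the tooltip below the clicked keyword.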
The response from the server is received by the client as an XML document object,
which is parsed using JavaScript to obtain the related links. The related links are
displayed in a tooltip just below the keyword clicked by the user. Cascading style
sheets are used to control the look and feel of the tooltip.
6 Discussion and Future Work
This work is one of three major directions we are pursuing to better understand how
complex, highly interdisciplinary humanities collections can be designed to enable
tight integration resulting in a single, multi-rooted collection. Our focus in the
Cervantes music collection has been on leveraging structural information captured as
a natural part of the collection building process. Using this information, we are able
both to identify link anchors (references to items in the collections) and resources to
connect them to. The resulting navigational hypermedia archive enhances the reader’s
ability to access and interact with the collection as a whole. We would like to expand
this approach by more formally investigating the types of structures that can be
included in a system such as this, and the types of automatic linking strategies that each
of these structures might support. For example, how might hierarchically structured
categories be incorporated? How might that affect the types of interlinkages that can
be established? Such an investigation will help us better understand what additional
structures are needed to scale this approach to incorporate the full breadth of
resources included within the Cervantes Project and how it could be generalized to
meet the needs of other humanities projects.
In addition to the structural approach to integrating resources presented here, we
have also reported on work that uses a formal narrative and thematic taxonomy to
provide an integrative framework [2], and also the use of a framework for identifying
key features within documents [3]. More work is needed to bring these three
directions together to form a unified approach and to understand how each contributes
to the larger goal of a single, multi-rooted collection.
As we are developing these ideas, we are becoming more aware of the need for a
shift in the way we understand the editorial process. Traditional editorial work is
focused on the development of a single, centered, completed work that is relatively
fixed over time—a published edition. Despite the growing calls to shift from the book
as the primary technology for developing scholarly editions to electronic media [16],
the resulting editions bear much similarity to their ancestors. In particular, they retain
the notion of a “completed” work that is developed to meet narrowly defined research
objectives. They are typically created by a single editor or by the highly coordinated
efforts of a group of authors working under the guidance of an editorial board. Such
editions do not allow for the more complex types of informational needs that are
required to support a more broadly defined humanities research agenda, such as that
of the Cervantes Project [1]. This type of work is open ended and difficult to restrict
to a closed set of scholarly perspectives—new research directions continually pop up,
often initiated by people outside of the core project members. Its contributors are not
the carefully orchestrated cadre of authors one might find in a scholarly encyclopedia
(e.g., the Stanford Encyclopedia of Philosophy [24]), but rather are individual
researchers pursuing their own unique research ideas (and making their own unique
contributions). These researchers will often be uncoordinated, if for no other reason
than their divergent interests (the ethnomusicologist is not likely to be overly
concerned about the work of the textual critic), yet their research may contribute
significantly to the broader goals of such a digital library. This is not to suggest that
traditional approaches are bad or should be abandoned, but rather to propose that we
need to creatively explore how to best employ digital technologies to empower
humanities research.
[1] N. Audenaert, R. Furuta, E. Urbina, J. Deng, C. Monroy, R. Saenz, and D. Careaga,
“Integrating Diverse Research in a Digital Library Focused on a Single Author,” Proc.
9th European Conf. on Research and Advanced Technology for Digital Libraries, Vienna,
Austria, 2005.
[2] N. Audenaert, R. Furuta, E. Urbina, J. Deng, C. Monroy, R. Saenz, and D. Careaga,
“Integrating Collections at the Cervantes Project,” Proceedings of the 5th ACM/IEEE-CS
joint conference on Digital Libraries, Denver, USA, 2005.
M. Singh et al.
[3] N. Audenaert, R. Furuta, and E. Urbina, “A General Framework for Feature
Identification,” Digital Humanities 2006, to appear.
[4] M. Bernstein, “An Apprentice that Discovers Hypertext Links,” Proceedings of the
European Conference on Hypertext, November 1990, pp. 212-223.
[5] Miguel de Cervantes Saavedra, (1998) Don Quijote de la Mancha, Francisco Rico,
Director. Barcelona: Biblioteca Clásica, 2 vols; Don Quixote (English translation by Edith
Grossman) New York: HarperCollins, 2003.
[6] Xin Chen, Dong-ho Kim, N. Nnadi, H. Shah, P. Shrivastava, M. Bieber, Il Im, and
Yi-Fang Wu, “Digital Library Service Integration,” Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital Libraries, p. 384, Houston, Texas, 2003.
[7] Cornell University (2005) Making of America: http://moa.cit.cornell.edu/moa/ [accessed
8 Sept 2005]
[8] G. Crane, “The Perseus Project and Beyond,” D-Lib Magazine, 1998.
[9] G. Crane, “Designing Documents to Enhance the Performance of Digital Libraries:
Time, Space, People and a Digital Library of London,” D-Lib Magazine 6 (7/8), 2000.
[10] G. Crane and J.A. Rydberg-Cox, “New Technology and New Roles: The Need for
‘Corpus Editors’,” Proc. 5th ACM Conference on Digital Libraries, San Antonio, TX, 2000.
[11] G. Crane, E. David, A. Smith, and C. E. Wulfman, “Building a Hypertextual Digital
Library in the Humanities: A Case Study on London,” Proc. 1st ACM/IEEE-CS joint
conference on Digital libraries, pp. 426-434, Roanoke, Virginia, United States, 2001.
[12] Ensemble Durendal. Por ásperos caminos. Nueva música cervantina. Ediciones de la
Universidad de Castilla-La Mancha, Cuenca, 2005. Text by J. J. Pastor and musical
direction by S. Barcellona.
[13] Project Gutenberg Literary Archive Foundation (2005), Project Gutenberg
http://www.gutenberg.org/. [accessed 9 Sept 2005]
[14] F. Halasz and M. Schwartz. “The Dexter Hypertext Reference Model,” Communications
of the ACM, 37(2), February 1994. pp. 30-39.
[15] B. Ladd, M. Capps, D. Stotts, and R. Furuta. “Multi-head/Multi-tail Mosaic: Adding
Parallel Automata Semantics to the Web,” Proc. 4th WWW Conf., pp. 422-440, 1995.
[16] J. McGann, The Rationale of Hypertext: http://www.iath.virginia.edu/public/jjm2f/
[17] University of Michigan (2005) Making of America. http://www.hti.umich.edu/m/moagrp/
[accessed 8 Sept 2005]
[18] J. J. Pastor, “Música y literatura: la senda retórica. Hacia una nueva consideración de la
música en Cervantes,” Doctoral Dissertation, Universidad de Castilla-La Mancha, 2005.
[19] M. S. Patton and D. M. Mimno, “Services for a Customizable Authority Linking
Environment,” Proc. 4th ACM/IEEE-CS joint conf. Digital Libraries, p. 420,
Tucson, AZ, 2004.
[20] H. Plantinga, coord. (2005) Christian Classics Ethereal Library, Calvin College, Grand
Rapids, MI. http://www.ccel.org/ [accessed 8 September 2005]
[21] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of
Information by Computer, Addison-Wesley, Reading, MA, 1989.
[22] G. Salton, J. Allan, C. Buckley and A. Singhal, “Automatic Analysis, Theme Generation,
and Summarization of Machine-Readable Texts,” Science, Vol. 264, No. 5164, Jun. 1994,
pp. 1421-1426.
Expanding a Humanities Digital Library: Musical References in Cervantes’ Works
[23] F. Shipman, R. Furuta, D. Brenner, C. Chung, and H. Hsieh, “Guided Paths through
Web-based Collections: Design, Experiences, and Adaptations,” Journal of the American
Society for Information Science (JASIS), 51(3), March 2000, pp. 260-272.
[24] Stanford Encyclopedia of Philosophy, http://plato.stanford.edu/, [accessed March 7, 2006].
[25] Cervantes Project, E. Urbina, director. Center for the Study of Digital Libraries, Texas
A&M University, http://csdl.tamu.edu/cervantes, [accessed Nov 29 2005].
[26] The Perseus Digital Library, G. Crane, ed. Tufts University. http://www.
perseus.tufts.edu/, [accessed March 7, 2006].
Building Digital Libraries for Scientific Data: An
Exploratory Study of Data Practices in Habitat Ecology
Christine Borgman¹, Jillian C. Wallis², and Noel Enyedy³
¹ Department of Information Studies, Graduate School of Education & Information Studies, UCLA
[email protected]
² Center for Embedded Networked Sensing, UCLA
[email protected]
³ Department of Education, Graduate School of Education & Information Studies, UCLA
[email protected]
Abstract. As data become scientific capital, digital libraries of data become
more valuable. To build good tools and services, it is necessary to understand
scientists’ data practices. We report on an exploratory study of habitat
ecologists and other participants in the Center for Embedded Networked
Sensing. These scientists are more willing to share data already published than
data that they plan to publish, and are more willing to share data from
instruments than hand-collected data. Policy issues include responsibility to
provide clean and reliable data, concerns for liability and misappropriation of
data, ways to handle sensitive data about human subjects arising from technical
studies, control of data, and rights of authorship. We address the implications of
these findings for tools and architecture in support of digital data libraries.
1 Introduction
The emerging cyberinfrastructure is intended to facilitate distributed, information-intensive, data-intensive, collaborative research [1]. Digital libraries are essential to
the cyberinfrastructure effort. As scholarship in all fields becomes more data-intensive and collaborative, the ability to share data becomes ever more essential [2,
3]. Data increasingly are seen as research products in themselves, and as valuable
forms of scientific capital [4]. “Big science” fields such as physics, chemistry, and
seismology already are experiencing the “data deluge” [5, 6]. Data repositories and
associated standards exist for many of these fields, including astronomy, geosciences,
seismology, and bioinformatics [7-10]. “Little science” fields such as habitat ecology
are facing an impending data deluge as they deploy instrumented methods such as
sensor networks. Progress toward repositories and information standards for these
fields is much less mature, and the need is becoming urgent.
2 Research Domain
We have a unique opportunity to study scientific data practices and to construct
digital library architecture to support the use and reuse of research data. The Center
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 170 – 183, 2006.
© Springer-Verlag Berlin Heidelberg 2006
for Embedded Networked Sensing (CENS), a National Science Foundation Science
and Technology Center based at UCLA, conducts collaborative research among
scientists, technologists, and educators. CENS’ goals are to develop, and to
implement in diverse contexts, innovative wireless sensor networks. CENS’ scientists
are investigating fundamental properties of these systems, designing and deploying
new technologies, and exploring novel scientific and educational applications.
CENS’ research crosses four primary scientific areas: habitat ecology, marine
microbiology, seismology, and environmental contaminant transport, plus
applications in urban settings and in the arts. The research reported here addresses the
use of embedded networked sensor technology in biocomplexity and habitat
monitoring, supplemented by findings about data sharing from other CENS’ areas. In
these scientific areas, the goals are to develop robust technologies that will operate in
uncontrolled natural settings and in agricultural settings. The science is based on in
situ monitoring, with the goal of revealing patterns and phenomena that were not
previously observable. While the initial framework for CENS was based on
autonomous networks, early results revealed the difficulty of specifying field
requirements in advance well enough to operate systems remotely. Thus we have
moved toward more “human in the loop” models where investigators can adjust
monitoring conditions in real time.
3 Background
3.1 Data Digital Libraries and the Data Deluge
Science is a technical practice and a social practice [11]. It is the interaction between
technological and social aspects of scientific research that underlies the design
challenge. Modern science is distinguished by the extent to which its products rely on
the generation, dissemination, and analysis of data. These practices are themselves
distinguished by their massive scale and global dispersion. New technologies for data
collection are leading to data production at rates that exceed scientists’ abilities to
analyze, interpret, and draw conclusions. No respite from this data deluge is foreseen;
rather, the rate at which data are generated is expected to increase with the
advancement of instrumentation [5]. Consequently, scientists urgently require
assistance to identify and select data for current use and to preserve and curate data
over the long term. Data resources are dispersed globally, due to more international
collaboration and distributed access to computing resources for analyzing data.
Cyberinfrastructure is expected to provide capabilities to (i) generate scientific data in
consistent formats that can be managed with appropriate tools; (ii) identify and
extract—from vast, globally distributed repositories—those data that are relevant to
their particular projects; (iii) analyze those data using globally distributed
computational resources; (iv) generate and disseminate visualizations of the results of
such analyses; and (v) preserve and curate data for future reuse. An effective
cyberinfrastructure is one that provides distributed communities, scientific and
nonscientific, with persistent access to distributed data and software routinely,
transparently, securely, and permanently.
3.2 Data Management Practices
The willingness to use the data of others may be a predictor of willingness to share
one’s own data. Scholars in fields that replicate experiments or that draw heavily on
observational data (e.g., meteorological, astronomical records) appear more likely to
contribute data for mutual benefit within their fields. Conversely, scholars in many
fields work only with data they have produced. The graph or table that results from
analyzing the data may be the essential product of a study. Many scholars assume that
the underlying data are not of value beyond that study or that research group. Heads
of small labs often have difficulty reconstructing datasets or analyses done by prior
lab members, as each person used his or her own methods of data capture and
analysis. Local description methods are common in fields such as environmental
studies where data types and variables may vary widely by study [12, 13].
The degree of instrumentation of data collection also appears to be a factor in data
sharing. Sharing expensive equipment is among the main drivers for collaboration,
especially in fields such as physics and chemistry. In these cases, collaboration,
instrumentation, and data sharing are likely to be correlated. The relationship between
instrumentation and data sharing may be more general, however. A small but detailed
study conducted at one research university found that scholars whose data collection
and analysis were most automated were the most likely to share raw data and
analyses; these also tended to be the larger research groups. When data production
was automated but other preparation was labor-intensive, scholars were less likely to
share data. Those whose data collection and analysis were the least automated and
most labor-intensive were most likely to guard their data. These behaviors held across
disciplines; they were not specific to science [14].
3.3 Habitat Ecology Data and Practices
The study of biodiversity and ecosystems is a complex and interdisciplinary domain
[15]. The mechanisms used to collect and store biological data are almost as varied as
the natural world those data document. Over the last thirty years, data management
systems for ecological research have evolved out of large projects such as the
International Biological Program (IBP; established in 1964 by the International
Council of Scientific Unions), the Man and the Biosphere program (MAB;
established in 1971 by the United Nations), and the U.S. Long-Term Ecological
Research program (LTER; established in 1980 by the National Science Foundation)
[16]. These systems need to support multiple data types (numerical measurements,
text, images, sound and video), and to interact with other systems that manage
geographical, meteorological, geological, chemical, and physical data. Currently one
of the biggest challenges to the development of effective data management systems in
habitat ecology is the “datadiversity” that accompanies biodiversity [17].
The Knowledge Network for Biocomplexity (KNB) [http://knb.ecoinformatics.org/],
an NSF-funded project whose first products became available in 2001, is a significant
development for data management in habitat ecology. KNB tools include a data
management system for ecologists, based on the Morpho client software and Metacat
server software, and a standard format for the documentation of ecological data—the
Ecological Metadata Language (EML). SensorML is an equally important development
for sensor data [18].
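As a rough illustration of the kind of dataset-level documentation such a standard records, the following sketch builds a small, EML-inspired metadata record. The element names and example values are simplified stand-ins for illustration only; they do not reproduce the actual Ecological Metadata Language schema, which is far richer.

```python
# Illustrative sketch only: a minimal, EML-inspired metadata record for an
# ecological sensor dataset. Element names are simplified stand-ins, not the
# real EML schema.
import xml.etree.ElementTree as ET

def describe_dataset(title, creator, attributes):
    """Build a small XML metadata record for a dataset.

    attributes: list of (name, definition, unit) tuples describing the
    measured variables.
    """
    root = ET.Element("dataset")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "creator").text = creator
    attr_list = ET.SubElement(root, "attributeList")
    for name, definition, unit in attributes:
        attr = ET.SubElement(attr_list, "attribute")
        ET.SubElement(attr, "attributeName").text = name
        ET.SubElement(attr, "definition").text = definition
        ET.SubElement(attr, "unit").text = unit
    return ET.tostring(root, encoding="unicode")

record = describe_dataset(
    "Micrometeorological measurements at a field site",
    "habitat ecology team",
    [("airTemp", "air temperature above ground", "celsius"),
     ("soilMoisture", "volumetric soil moisture", "percent")],
)
print(record)
```

The point of such a record is that a second research group, or a later member of the same lab, can interpret the variables without reconstructing the original scientist's private conventions.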
4 Research Problem
While practices associated with scholarly publication vary widely between fields, the
resulting journal articles, papers, reports, and books can be described consistently with
bibliographic metadata. Data are far more problematic. Disciplines vary not only in their
choice of research methods and instruments; the data gathered may also vary in form and
structure by individual scholar and by individual experiment or study. Multidisciplinary
collaboration, which is among the great promises of cyberinfrastructure, will depend
heavily on the ability to share data within and between fields. However, very little is yet
known about practices associated with the collection, management, and sharing of data.
Despite these limitations, immediate needs exist to construct systems to capture and
manage scientific data for local and shared use. These systems need to be based on an
understanding of the practices and requirements of scientists if they are to be useful and
to be used.
Habitat ecology is a “small science,” characterized by small research teams and
local projects. Aggregating research results from multiple projects and multiple sites
has the potential to advance the environmental sciences significantly. The choice of
research problems and methods in environmental research was greatly influenced by
the introduction of remote sensing (satellite) technology in the 1980s and 1990s [19].
Thus one of our research concerns is how habitat ecology may evolve with the use of
embedded networked sensing. These scientists are deploying dense sensor networks
in field locations to pursue research on topics such as plant growth, bird behavior, and
micrometeorological variations.
Our research questions address the initial stages of the data life cycle in which
data are captured, and subsequent stages in which the data are cleaned, analyzed,
published, curated, and made accessible. The questions can be categorized as
• Data characteristics: What data are being generated? To whom do these data
belong? To whom are these data useful?
• Data sharing: When will scientists share data? With whom will they share data?
What are the criteria for sharing? Who can authorize sharing?
• Data policy: What are fair policies for providing access to these data? What
controls, embargoes, usage constraints, or other limitations are needed to assure
fairness of access and use? What data publication models are appropriate?
• Data architecture: What data tools are needed at the time of research design?
What tools are needed for data collection and acquisition? What tools are needed
for data analysis? What tools are needed for publishing data? What data models do
the scientists who generate the data need? What data models do others need to use
the data?
5 Research Method
5.1 Data Sources
Our goal is to understand data practices and functional requirements for CENS
ecology and environmental engineering researchers with respect to architecture and
policy, and to identify where architecture meets policy. The results reported here are
drawn from multiple sources over a three-year period (2002-2005). In the first year
(2002-2003), we sat in on team meetings across CENS scientific activities and we
inventoried data standards for each area [20]. In year 2 (2003-04), we interviewed
individual scientists and teams and continued to inventory metadata standards. We
used the results of the first two years to design an ethnographic study of habitat
biologists, conducted in year 3 (2004-05). In the current year (2005-6), we are
interviewing individual members of habitat ecology research teams, including
scientists, their technology research partners in computer science and engineering,
and graduate students, postdoctoral fellows, and research staff.
5.2 Process
The ethnographic work from the first three years of the study (interviewing teams and
individuals, participating in working groups, etc.) is documented in notes, internal
memoranda, and a white paper [20]. We did not audiotape or videotape these
meetings to avoid interfering with the local activities. Knowledge from this part of the
research was used to identify data standards relevant to the research areas. We shared
our results with individuals and teams to get feedback on the relevance of these
standards. We also constructed prototypes of data analysis and management tools as
components of the educational aspects of our research [21]. Thus we are conducting
iterative research, design, and development for data management tools in CENS.
5.3 Participants
Our population at CENS comprises about 70 faculty and other researchers, a
varying number of post-doctoral researchers, and many student researchers. About 50
scientists, computer scientists, engineering faculty, and their graduate students, postdoctoral fellows, and research staff are working in the area of habitat ecology. The
data reported here are drawn primarily from in-depth interviews of two participants,
each two to three hours over two to three sessions. The direct quotes are from these
interviews. Results from one-hour interviews with two other scientists and from a
large group meeting (about 20 people) to discuss data sharing policy also are reported
here. These results are informed by interviews, team meetings, and other background
research conducted in earlier stages of our data management studies.
5.4 Analysis
We used the results of interviews and documentary analyses in the first two years of
our research to design the ethnographic study. This study used the methods of
grounded theory [22]. The interview questions are based on Activity Theory [23-25],
which analyzes communities and their evolution as “activity systems.” Activity
systems are defined by the shared purposes that organize a community and by the
ways in which joint activity to achieve these purposes is mediated by shared tools,
rules for behavior, and divisions of labor. When analyzing how activity systems
change and develop, the focus is on contradictions that occur within the system or as a
result of the system interacting with other activity systems. These contradictions are
analyzed as the engine for organizational change.
Based on this theoretical framework, we developed interview questions about
participants’ motives, understanding of their community’s motives, tools used in daily
work, ways they divided labor, power relations within their community, and rules and
norms for the community. Interviews were then fully transcribed. In the initial phase
of analysis we looked at the first interviews with participants for emergent themes.
The analysis progressed iteratively. Subsequent interviews were analyzed with an eye
towards testing and further refining the themes identified in the initial coding. With
each refinement, the remaining corpus was searched for confirming or contradictory
evidence. At this stage, however, the work is still preliminary. As such, no formal
coding scheme was developed and systematically applied to the entire corpus.
Rather, what we present below are the emergent themes and representative
illustrations in the participants’ own words.
6 Results
In the first two years of study, we learned that CENS’ habitat biologists perceived a
lack of established standards for managing sensor data, specifically those that support
the sharing of data among colleagues. They are eager to work with us because they
need tools to capture, manage, and share sensor data more effectively, efficiently, and
easily. They are committed to participating in the development of domain data
repositories and standards such as the Knowledge Network for Biocomplexity and the
Ecological Metadata Language, but are not yet implementing them. A good starting point for
exploring the metadata requirements of this community is to assess scientists’
experiences with implementing these tools and standards, and to evaluate those tools’
utility. The results are organized by the research questions outlined above: Data
characteristics, data sharing, data policy, and data architecture.
6.1 Data Characteristics
Our interview questions explored what data are being generated, to whom these
data belong, and to whom they are useful. CENS is a collaboration between technologists and
scientists, thus the technologies are being evaluated and field tested concurrently with
the scientific data collection. The scientists interviewed reported that the sensors were
unreliable. In the early stages, for example, one researcher found that he was losing about
25% of every transmission from every sensor. While the sensors were sending
data every five minutes, they only produced usable data every 10 or 15 minutes. Thus
they could not always trust what they were getting from the instruments:
I’m highly suspicious [of automated data collection]. I mean it works, but
then sometimes it doesn’t, and then sometimes weird things happen. You get
a glitch, and then you start getting the same value twice or something. …
when you do averages, it’s all funny.
These researchers have a story in mind when they design a field experiment. They
also know about how much data they will need to support that story in a published
paper in their discipline. One of our scientists explicitly told us that
… to tell that story I’m going to need an average of five figures and a table.
He sketches out mock figures on paper as part of his research design. We were
particularly intrigued by his notion of “least publishable unit” – in this case, five
figures and a table. However, he also told us that he tends to collect more data so that
he can elaborate on his story:
I collect another data set just to round it out rather than [picking] an
interesting sub-phenomenon of another sub-phenomenon. That’s boring.
Thus his research design is based on the amount of data he needs to tell a story of this
scope and form. The publication is the product of the study, rather than the data per se.
6.2 Data Sharing
We collected some useful commentary on issues of what data scientists will share,
when, with whom, and with what criteria. Within this small sample, scientists
generally are more willing to share data that already have been published and less
willing to share data that they plan to publish. The latter type of data represent claims
for their research.
Sharing data- if it’s already published? It’s your data, no problem. I can
give that to [you]. If it’s something I’m working on to get published or
somebody else is working on to get published, or if they want to publish the
paper together, it gets a little bit funnier.
For this scientist, willingness to share also is influenced by the effort required to
collect the data. His hand-collected data are more precious than his instrument-generated data:
… if you walk out into a swamp .. out in this wacky eel grass, and marsh
along with your hip-waders and [are] attacked by alligators ..and then you
do it again and again and again... I don’t [want to] share that right away. I
[want to] analyze it because I feel like it’s mine.
The above dataset was seen as “hard won.” When they do share experimental data
with collaborators, they feel an ethical and scientific responsibility to clean the dataset
sufficiently that it has scientific value. Raw data are meaningless to others. Unless the
data are useful and relevant, they would be “just taking up space and nobody’s going
to be able to use [them].”
If they feel they are forced to share data they will, knowing that it may not be of
much use to others. One scientist told us if someone wants his data, he or she can
have it in the raw form. His Excel spreadsheets are cryptic and exist in multiple
versions, representing each transformation.
Conversely, we found less evidence of proprietary ownership over reference data
that provides context, but is not specifically relevant to their research questions. One
scientist gave the example of measuring the density of shade cloth for a field
experiment. Much work went into determining the amount of shade a particular type
of cloth provided in a 24-hour period. He was happy to save other people the effort of
reconstructing that number. Similarly, our subjects usually are willing to share
software and other tools.
Several of the researchers interviewed did not think their data would be of much
use to other researchers. Conversely, some said the data they are collecting already are
being shared among themselves and statisticians, engineers, and computer scientists,
all with different purposes for the data. One of our subjects recognized the possible
uses of his data on water contaminants in a river confluence for such diverse fields as
fluid mechanics, public health, ichthyology, and agriculture.
6.3 Data Policy
We encountered a particularly enlightening scenario in which questions arose about whether
the data from an instrument belonged to the designers of the instrument or the designers
of the experiment. The team member (engineering faculty) who designed and installed
the instrument had plans for using the data from the instrument but did not implement
those plans. After several years of data production, one of the scientists found the data
useful for his own research, and asked the head of the research site for permission to use
the data in a publication. Given that no other claims were being made on the data, and
that these data were being posted openly on the website of the research site, permission
was granted. After some investment in cleaning and analysis, the data looked promising
for publication, so the scientist and site director invited the instrument designer to
participate in the publication. However, the designer objected strongly on the grounds
that they were his data because he had deployed the instrument. The situation was
complicated further by the existence of a pending grant proposal involving this
instrument by the same designer. We learned later that the situation was resolved only
when the pending grant was not awarded, relieving some of the proprietary tension. The
scientist we interviewed commented that this was the first real intellectual property issue
over data that he had encountered.
In the above case, the technology people are faculty partners in the research. Yet
they view the status of the data and the control over it rather differently. Authorship
credit on publications is a common issue in research. In cases where instrumentation
is essential to the data collection, questions sometimes arise as to whether technical
support people should be authors. In another interview, the scientist who provided the
above example commented that
... tech-support people might get an acknowledgement ... but they’re not co-authors on a scientific paper.
The two situations described here are distinctly different. In the former, the
technologist was a researcher who had deployed the instrument, and all agreed that he
was entitled to some form of authorship credit. The issue appeared to be about who
had priority over the data in determining what should be published, when, and by
whom. In the latter situation, a scientist is referring to people who assist with
equipment but are not themselves researchers. However, situations may arise where
the distinction between technical research and technical assistance is not clear.
In the group meeting (about 20 CENS faculty, students, and research staff) to
discuss the ethics and policy of data sharing, several interesting issues arose. One
frequent question was about the condition of data to be shared. The group generally
agreed that those generating the data had responsibility to assure that the data were
reliable and were verified before sharing or posting. Most participants were aware of
NSF rules for sharing the data from funded research. At the same time, they were
concerned about premature release prior to publication, and whether any sort of
liability disclaimers or rights disclaimers (e.g., Creative Commons licenses for
attribution and non-commercial reuse) should be applied.
Several of the meeting participants were involved in a small project involving cameras
triggered by sensors. The purpose of the project was to test the sensors, but capturing
identifiable data on individuals might be unavoidable. They were very concerned about
privacy and security issues with these data. Because the study was not about human
subjects, they had not sought human subjects review. Several people analogized the
situation to webcams on university campuses. Images of people are streaming to public
websites without the knowledge or permission of those involved. They also discussed
technical solutions such as anonymizing faces captured by the cameras.
6.4 Data Architecture
A longer term goal of our research is to design tools to support data acquisition,
management, and archiving. A number of questions addressed the use of data and
tools at each stage in the life cycle.
6.4.1 Research Design and Hypothesis Testing
Field research in habitat ecology begins with identifying a research site in which the
phenomena of interest can be studied. Before scientists begin setting up sensors or
deciding which extant sensors might produce data of interest, they would like a map
of the research site that includes the location of sensors and the types of data that each
sensor could produce. For example, this scientist would like a map of the area and a
table of the data from each set of sensors:
I want a table that I can skim really easily and say Okay of these ten stations
how many of them have temperature data available? I don’t want to look
around on a map and have to click on each link
Scientists spend much time on activities such as sensor placement and development
prior to fieldwork. They test equipment and sample the quality of observations from
the sensors before they start doing any real science. Thus tools for this exploratory
phase are desirable.
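The "skimmable table" this scientist asks for amounts to a simple query over station metadata. A minimal sketch, with station names and record fields invented for illustration:

```python
# Hypothetical station metadata (all names and fields are invented for
# illustration; this is not CENS's actual data model).
stations = {
    "station-01": {"sensors": ["temperature", "humidity"]},
    "station-02": {"sensors": ["soil-moisture"]},
    "station-03": {"sensors": ["temperature"]},
}

def stations_with(measurement, metadata):
    """Return the ids of stations whose sensors produce `measurement`."""
    return sorted(
        sid for sid, info in metadata.items()
        if measurement in info["sensors"]
    )

print(stations_with("temperature", stations))
# → ['station-01', 'station-03']
```

A tabular view of such a query is exactly what the scientist describes: which of the ten stations have temperature data, without clicking through a map.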
6.4.2 Data Collection and Acquisition
Due largely to the relative immaturity of the sensor technology, several of our science
subjects were suspicious of automated data collection. They expressed reluctance to
take data streams straight from the sensors without good tools to assess the cleanliness
of the data. They want simple and transparent methods to find potential gaps in the
data; also desirable are tools to identify when values are duplicated or missing, when
sensors are failing, and other anomalous situations.
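The kind of simple, transparent check described here could be sketched as follows; the reading format and the expected sampling interval are assumptions for illustration, not the scientists' actual tooling:

```python
# Flag gaps and duplicated timestamps in a time-ordered stream of
# (timestamp, value) readings. The 15-minute expected interval is an
# assumed sampling rate, chosen only for this sketch.
EXPECTED_INTERVAL = 15 * 60  # seconds between readings

def audit_readings(readings):
    """Return (gaps, duplicates) found in time-ordered (timestamp, value) pairs."""
    gaps, duplicates, seen = [], [], set()
    # A gap is any pair of consecutive readings further apart than expected.
    for (t0, _), (t1, _) in zip(readings, readings[1:]):
        if t1 - t0 > EXPECTED_INTERVAL:
            gaps.append((t0, t1))
    # A duplicate is any timestamp that occurs more than once.
    for t, _ in readings:
        if t in seen:
            duplicates.append(t)
        seen.add(t)
    return gaps, duplicates

readings = [(0, 21.0), (900, 21.2), (900, 21.2), (3600, 20.8)]
gaps, dups = audit_readings(readings)
# gaps → [(900, 3600)], dups → [900]
```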
Some also expressed the need to annotate the data in the field, which would
provide essential context that cannot be anticipated in data models. For example, data

collection tools can have default menus for the available data elements generated by
any particular sensor. What they cannot do in advance is predict how scientists will
modify the instruments or the field conditions. One scientist mentioned moving his
equipment to a different location to compare temperatures. Documenting which data
were collected at which location and when is essential to later interpretation. Another
good example is distinguishing between common instruments with known
characteristics and one that was hand-made. Another scientist might use the same
off-the-shelf instrument for another experiment, enabling easy comparisons. If the
instrument were unique, the results may be much more difficult to interpret. The
scientist might calibrate the two instruments and find they could be used
interchangeably, but the future user of the data needs to know what instruments were
used and how. The example here is
air temperature [from] a little therma-couple that I stuck one foot off the
ground with a little aluminum shield that I made. Someone else might use [a
common purchased instrument] interchangeably, but they should know that
one of them was my home-built little therma-couple thing.
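One way such field annotations might travel with the readings is a free-form annotation slot on each record; the structure below is a hypothetical sketch, not the data model the scientists actually used:

```python
# Sketch of a reading record that carries ad hoc field annotations, so a
# later user can tell a home-built thermocouple from an off-the-shelf one.
# All field names here are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Reading:
    sensor_id: str
    value: float
    annotations: dict = field(default_factory=dict)

r = Reading("air-temp-1", 18.4)
r.annotations["instrument"] = "home-built thermocouple, aluminium shield"
r.annotations["height_m"] = 0.3      # roughly one foot off the ground
r.annotations["relocated_from"] = None  # record moves as they happen
```

The point of the free-form slot is exactly what the text argues: default menus can enumerate known data elements, but not the modifications scientists make in the field.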
6.4.3 Data Analysis
Cleaning. The two scientists in the ethnographic study would extract data from the
sensor network database to perform any correlations. They prefer to use graphing
programs with which they are most familiar. One scientist was uninterested in the
offer of graphing tools or visualizations, because he was reluctant to trust other
people’s graphs. He wants his own graphs so that he can make his own correlations.
He also wants the data in a form that he can import into his preferred analysis
programs. Data are cleaned with respect to specific research questions. If data are
extraneous to a current paper, they may not be cleaned or analyzed. Thus the datasets
resulting from a field project are not necessarily complete.
Version Control. Even when scientists are willing to share archived data, those data
may be poorly labeled, rendering them difficult to use. Multiple versions may exist of
the same data set, resulting from different cleaning or analysis processes. Most of
these processes are not recorded, making the data even less accessible to the potential
consumer. Multiple versions of datasets complicate retrieval for the scientists who
created them and for future researchers who may want to use them. One of our
researchers acknowledged that each person on the team created his or her own Excel
spreadsheets, and the only access to the data of their teammates was to ask for the
spreadsheets and explanations of their contents. This scientist worried that when any
of her team members left the project, their data essentially would be lost.
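Recording each cleaning or analysis step as an explicit version entry would address the unrecorded processes described above. A minimal sketch, with all names invented:

```python
# Hypothetical version trail for a dataset: every cleaning or analysis step
# appends an entry naming the step, the author, and a fingerprint of the
# resulting data. This is an illustrative sketch, not CENS infrastructure.
import hashlib
import json

def new_version(history, data, step, author):
    """Append a version entry describing how `data` was derived."""
    digest = hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()
    history.append({
        "version": len(history) + 1,
        "step": step,            # e.g. "removed dead-sensor rows"
        "author": author,
        "sha256": digest[:12],   # short fingerprint of this version's data
    })
    return history

history = []
new_version(history, [21.0, 21.2, 21.2], "raw export", "scientist A")
new_version(history, [21.0, 21.2], "dropped duplicate reading", "scientist A")
# history now records who produced each version and why
```

Even this much provenance would let a teammate ask "which spreadsheet is this, and what was done to it?" without tracking down its creator.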
Tools. In these interviews, and in prior interviews and meetings over the last several
years, we often found that scientists prefer viewing data in tables, especially
Microsoft Excel spreadsheets. One scientist in the ethnographic study offered a
detailed explanation. Columns and tables enable him to identify holes in data and to
determine how to clean them. Graphs and plots show different types of data
inconsistencies, such as identifying dead sensors or graphics in the wrong time scale.
Graphs are also personal, because scientists reduce data to test their own hypotheses.
C. Borgman, J.C. Wallis, and N. Enyedy
These scientists do not appear to trust transformations made by others; they are
more likely to apply their own tools and methods to the original data. The scientist
noted above would trust fellow biologists to clean the data:
I want some radiation measurements ... you can do energy budget
calculations that … have to be cleaned up by a biologist in order to get
incorporated into the data set... any old biologist can come by and start
doing correlations, because they know what the data is...
7 Discussion and Conclusions
While the number of interviews reported in this study is small, the results are based
on four years of work with these scientists and technology researchers. Collaborative
work can be much slower than solo work due to the effort involved in learning a
common language and in finding common ground in research interests [26]. The
investments pay off in new insights and new approaches. CENS collaborations
between scientists and technologists did not lead to new forms of science as quickly
as expected. In its fourth year of existence, the Center is now deploying networked
sensor technologies for multiple scientific studies. The promised payoffs are
beginning to accelerate. Many of the problems with data cleaning and sensor
reliability are due to the immature stage of the instruments and networks. As these
scientists become more experienced with these technologies, they are likely to
experiment with more instruments and configurations, so the data cleaning and
calibration issues will not go away. Experience also is likely to make them more
discriminating consumers of the technology, making yet greater demands on the
technology researchers.
Scholarly publications have been the product of science for several centuries.
Viewing data as a product per se is a relatively new idea. The scientists that we
interviewed for this study continue to focus on the paper as their primary product,
designing their experiments accordingly. Whether the data will become a direct
product of their research to be reused and shared is one of our continuing research
questions.
Our tentative findings about data sharing are consonant with other research: in this
small sample, our scientists are more willing to share data already published than data
that they plan to publish [27]. One scientist was explicit about guarding hand-collected
data more closely than data from sensor networks [14]. Our subjects expressed
responsibility to assure that any shared data are documented sufficiently to be
interpreted correctly [27]. If required to share data they will do so, knowing that raw data are
of little value to others without sufficient cleaning and documentation. However, given
a choice, most prefer to exploit their research data fully before releasing them to others.
A number of interesting policy issues arose that we are now studying in much more
depth. The Center has made a commitment to share its data with the community and
is seeking ways to do so. Members want to provide clean and reliable data, but are
understandably concerned about liability and misappropriation of data. Among the
questions to address are how to handle sensitive data about human subjects that arise
from technical studies, who controls data from a project, and who has first rights to
authorship. These are common issues in collaboration, especially at the boundaries
between fields. The boundary between life sciences, such as habitat biology, and
technology may be even more complex. Not only do differences in data practices
between these domains arise, but the distinction between technical research and
technical assistance may not always be clear.
These findings have some promising implications for architecture and tools for
data management in habitat ecology and perhaps to other field research disciplines.
One is the need for tools to explore the research site and the availability of extant data
sources. These are especially valuable in the early design stages of an experiment.
Visualization tools in the field may be less helpful than expected, as these scientists
want to get the data into their own, familiar tools. Quick prototyping of data sources
in the field is an essential requirement, and one already recognized in CENS’ “human
in the loop” architecture. These scientists wish to add new instruments and new
details about data and instruments in the field on an ad hoc basis. They invent new
tools as needed, using aluminum foil, cloth, tape, and other available equipment. They
want data analysis in the field so that they can adjust experiments in real time.
Most of the above requirements suggest hand-crafted tools and structures for this
research community. The longer term goal, however, is to build generalizable,
scalable tools that facilitate sharing and curation of scientific data. We will continue
to work with habitat biologists and other CENS scientists to find a balance between
local and global requirements for tools and architecture. While the study reported here
relies on a small dataset and is exploratory in nature, it identifies a number of
important questions for the design of cyberinfrastructure for science. Research to
pursue these questions in more depth is currently under way.
Acknowledgements. CENS is funded by National Science Foundation Cooperative
Agreement #CCR-0120778, Deborah L. Estrin, UCLA, Principal Investigator. CENSEI, under which
much of this research was conducted, is funded by National Science Foundation grant
#ESI-0352572, William A. Sandoval, Principal Investigator and Christine L.
Borgman, co-Principal Investigator. Jonathan Furner and Stasa Milojevic of the
Department of Information Studies provided valuable material to the fieldwork on
which this study is based. Robin Conley and Jeffrey Good, UCLA graduate students,
assisted in interviewing and data transcription.
References

1. Atkins, D.E., et al., Revolutionizing Science and Engineering Through
Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon panel on
Cyberinfrastructure. 2003, National Science Foundation: Washington, D.C.
2. Unsworth, J., et al. Draft Report of the American Council of Learned Societies'
Commission on Cyberinfrastructure for Humanities and Social Sciences. Last visited 5
November 2005 http://www.acls.org/cyberinfrastructure/acls-ci-public.pdf.
3. Berman, F. and H. Brady. Final Report: NSF SBE-CISE Workshop on Cyberinfrastructure
and the Social Sciences. Last visited 18 May 2005 http://vis.sdsc.edu/sbe/reports/SBE-CISE-FINAL.pdf.
4. Schroder, P., Digital Research Data as Floating Capital of the Global Science System, in
Promise and Practice in Data Sharing, P. Wouters and P. Schroder, Editors. 2003, NIWI-KNAW: Amsterdam. p. 7-12.
5. Hey, T. and A. Trefethen, The Data Deluge: An e-Science Perspective, in Grid Computing
– Making the Global Infrastructure a Reality. 2003, Wiley.
6. Hey, T. and A. Trefethen, Cyberinfrastructure and e-Science. Science, 2005. 308: p. 818-821.
7. International Virtual Observatory Alliance. Last visited 2 March 2005
8. Incorporated Research Institutions for Seismology. Last visited 25 November 2004
9. Biomedical Informatics Research Network. Last visited 19 March 2005
10. GEON. Last visited 19 March 2005 http://www.geongrid.org/.
11. Star, S.L., The politics of formal representations: Wizards, gurus and organizational
complexity, in Ecologies of Knowledge: Work and Politics in Science and Technology,
S.L. Star, Editor. 1995, State University of New York Press: Albany, NY.
12. Estrin, D., W.K. Michener, and G. Bonito, Environmental cyberinfrastructure needs for
distributed sensor networks: A report from a National Science Foundation sponsored
workshop. 2003, Scripps Institute of Oceanography.
13. Zimmerman, A., New Knowledge from Old Data: The Role of Standards in the Sharing
and Reuse of Ecological Data. Science, Technology, & Human Values, under review.
14. Pritchard, S.M., L. Carver, and S. Anand, Collaboration for knowledge management and
campus informatics. 2004, University of California, Santa Barbara: Santa Barbara, CA.
Retrieved from http://www.library.ucsb.edu/informatics/informatics/documents/UCSB_
Campus_Informatics_Project_Report.pdf on 14 November 2005.
15. Schnase, J.L., et al. Building the next generation biological information infrastructure. in
Proceedings of the National Academy of Sciences Forum on Nature and Human Society:
The Quest for a Sustainable World. 1997. Washington, DC: National Academy Press.
16. Michener, W.K. and J.W. Brunt, eds. Ecological Data: Design, Management and
Processing. 2000, Blackwell Science: Oxford.
17. Bowker, G.C., Biodiversity datadiversity. Social Studies of Science, 2000. 30(5): p. 643-683.
18. Brown, C., Lineage metadata standard for land parcels in colonial states. GIS/LIS '95
Annual Conference and Exposition. American Soc. Photogrammetry & Remote Sensing &
American Congress on Surveying & Mapping. Bethesda, MD, USA., 1995. Part 1, Vol 1:
p. 121-130.
19. Kwa, C., Local ecologies and global science: Discourses and strategies of the
International Geosphere-Biosphere Programme. Social Studies of Science, 2005. 35(6): p.
20. Shankar, K., Scientific data archiving: the state of the art in information, data, and
metadata management. 2003.
21. Sandoval, W.A. and B.J. Reiser, Explanation-driven inquiry: Integrating conceptual and
epistemic supports for science inquiry. Science Education, 2003. 87: p. 1-29.
22. Glaser, B.G. and A.L. Strauss, The discovery of grounded theory: Strategies for qualitative
research. 1967, Chicago: Aldine Publishing Co.
23. Engeström, Y., Activity theory and individual and social transformation, in Perspectives
on activity theory. 1999, New York: Cambridge University Press: p. 19-38.
24. Engeström, Y., Learning by Expanding: An activity-theoretical approach to
developmental research. 1987, Helsinki: Orienta-Konsultit.
25. Cole, M. and Y. Engeström, eds. A Cultural-historical Approach to Distributed Cognition.
1993, New York: Cambridge University Press.
26. Cummings, J.N. and S. Kiesler, Collaborative research across disciplinary and
organizational boundaries. Social Studies of Science, 2005. 35(5): p. 703-722.
27. Arzberger, P., et al., An International Framework to Promote Access to Data. Science,
2004. 303(5665): p. 1777-1778.
Designing Digital Library Resources for Users in
Sparse, Unbounded Social Networks
Richard Butterworth
Interaction Design Centre, School of Computing Science,
Middlesex University, London, UK NW4 4BT.
[email protected]
Abstract. Most digital library projects reported in the literature build
resources for dense, bounded user groups, such as students or research
groups in tertiary education. Having such highly interrelated and well
defined user groups allows digital library developers to use existing design methods to gather and implement requirements from those
groups. This paper, however, looks at situations where digital library
resources are aimed at much more sparse, ill defined networks of users.
We report on a project which explicitly set out to ‘broaden access’ to
tertiary education library resources to users not in higher education. In
particular we discuss the problem of gathering a priori user requirements
when, by definition, we did not know who the users would be, we look
at how disintermediation plays an even stronger negative role for sparse
groups, and how we designed a system to replicate an intermediation
service.
If one were to consider a ‘typical’ digital library (DL) project reported in the
literature one is likely to think of a digital library resource based on university
library holdings and aimed at students or academic researchers (eg. [1, 2, 3]). The
user groups in this case:
– are well defined — it is possible to tell who is and who is not a potential
user of the DL system,
– have well defined needs and tasks — it is possible to tell what they want
to use the system for,
– are co-located or easily accessible — it is not expensive to question them
to gather their requirements,
– are homogeneous — their requirements are broadly similar; if you have a user
population of one hundred undergraduate students doing the same course,
interviewing, say, ten of them is likely to give an adequate picture of the
group as a whole, and
– are highly interrelated — the individuals in the groups tend to be closely
related in their work, research or studies.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 184–195, 2006.
c Springer-Verlag Berlin Heidelberg 2006
(Note that although we assert that it is possible to discover the boundaries and
needs of such groups, we do not suggest it is particularly easy.)
In network analysis [4] such collections of users are called ‘dense, bounded
groups’. This work, however, contends that many users of both traditional and
digital libraries are not sufficiently accurately modelled by dense, bounded groups,
particularly when looking at user networks outside tertiary education. If dense,
bounded groups are at one end of a continuum, then at the other end are ‘sparse,
unbounded networks’. It is these networks of users we look at in this work, and
report how we set out to design better digital library resources for them.
Dense, Bounded Groups and Sparse, Unbounded Networks
Network analysis (eg. [5, 6]) is a branch of social science that looks at the structure of social networks, in particular analysing the relationships between the
actors in networks. The domain of social networks looked at by network analysis is very broad: from analyses of markets to the structure of riots, but it is
the work of Wellman [4] that applied network analysis to the field of IT, by
characterising the different social networks that can be mediated online, and it
is Wellman’s description of the difference between dense, bounded groups and
sparse, unbounded networks that we base our work on.
Dense, bounded groups¹ are characterised by networks that have well defined
boundaries and a high degree of interrelationship between the actors in the
network. Typically the starting point for an analysis of a dense bounded group
is the definition of the boundaries of the group: this implies who is or is not
inside the group, and analysis can proceed on the group members.
In contrast sparse, unbounded networks are characterised by relationships
that cross formal boundaries. For example, a formal boundary may be organisational: there is a clear line around who does and does not work for a particular
organisation. Networks that cross these boundaries may be friendship networks,
or networks of common interests. Because by definition we cannot start an analysis by defining the boundaries of an unbounded group, analyses of sparse, unbounded groups start by looking at the relationships of one or two individuals
and then trace their relationships outwards. If boundaries are discovered, then
they emerge as a consequence of the analysis, not as in bounded groups where
they are the starting point for the analyses.
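The tracing strategy described here is essentially a breadth-first traversal outwards from one or two seed individuals, with any boundary emerging as the set of reachable actors. A toy sketch, with an invented network:

```python
# Trace a sparse network outwards from a seed actor, breadth-first.
# The tie data are invented for illustration; any 'boundary' is simply
# the set of actors the traversal can reach.
from collections import deque

ties = {  # who is related to whom
    "ann": ["bob"], "bob": ["ann", "cat"], "cat": ["bob"],
    "dan": ["eve"], "eve": ["dan"],  # unreachable from ann
}

def trace_network(seed, ties):
    """Return everyone reachable from `seed` by following ties outwards."""
    found, frontier = {seed}, deque([seed])
    while frontier:
        person = frontier.popleft()
        for contact in ties.get(person, []):
            if contact not in found:
                found.add(contact)
                frontier.append(contact)
    return found

print(trace_network("ann", ties))
# → {'ann', 'bob', 'cat'}: the boundary emerges from the analysis
```

The contrast with bounded groups is visible in the code: nothing delimits the population in advance; the traversal itself produces the boundary.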
Wellman notes that one of the characteristics differentiating sparse networks
from bounded groups is that the relationships in sparse networks ‘tend to ramify
out in many directions like an expanding spider’s web’ whereas the relationships
in dense groups ‘curl back on themselves into a densely knit tangle’ [4, page
180]. Note that the difference between the two types of networks is defined
both internally by the characteristics of the relationships in the network, and
externally by the way that they are analysed.
¹ Note that the term ‘group’ has a specialised meaning in network analysis: a dense,
bounded network is referred to as a ‘group’. In this paper we shall adhere to this
specialised terminology.
Is the Difference Important?
This paper gives examples of digital library users that are much better characterised as sparse, unbounded networks. We argue that standard development
methodologies are not ideal for designing systems with unbounded user networks,
and that intermediation is critical for sparse networks. But before moving on we
need to address the question: how much does it matter? Even though standard development methodologies and disintermediated DL models are based on
the assumption that user groups are dense and bounded, can we still use those
methodologies and models to develop good DL resources?
Evidently the answer is yes: good DL resources have been developed using
standard methodologies. However we would argue that the risk of failure is higher
because of this mismatch between the assumed and actual characteristics of the
user populations. Furthermore as we argued above much reported DL work has
been developing resources for tertiary education users, where user groups are
generally dense and bounded. However, once we move outside the tertiary education domain the evidence for successful DL projects becomes weaker (see [7]).
There are many different reasons for this lack of success, but we suggest that
the much sparser, unbounded nature of user networks outside tertiary education
is a contributing factor.
The Accessing Our Archival and Manuscript Heritage Project
The Accessing our Archival and Manuscript Heritage (AAMH) project was a
fourteen month project undertaken at Senate House Library, University of London which aimed to develop online resources to encourage and assist life-long
learners to use the materials held in University of London libraries and archives².
The project was particularly aimed at opening up access to the libraries’ special
collections and archives. It was felt that these collections held much material
that would be of benefit to users outside tertiary education.
The explicit aim of the project was to broaden access to library resources.
Precisely how this broadening of access was to be facilitated was not explicitly
addressed in the early project proposal. It was up to the project staff to decide
(for example) whether directly surrogating library resources by digitising materials or by the more indirect route of surrogating library services would best
fulfil the remit of the project.
Taken to its furthest implications ‘broadening access’ means that we could not
know beforehand who the users of the proposed system would be. We would have
to build it and see who came; we could not perform a priori user requirements
gathering. But given that non-existent, incomplete, changeable or otherwise ill
defined requirements are often quoted [8] as the main culprit in project failure
² The University of London is a federated university, consisting of several colleges and
academic institutions. Many of the constituent colleges and institutions have their
own libraries and archives.
this seemed to be a recipe for potential disaster. A search of requirements and
software engineering texts for methodologies that helped us gather user requirements when we did not know who the users were was, unsurprisingly, fruitless.
However some requirements gathering methodologies showed more applicability
to our situation than others, and in the next section we outline the methodology
we used and review how it worked in practice.
In order to get started we had to devise some assumptions about who our
potential user base would be. After liaison with several of the university archivists
it became clear that the users, outside of the usual students and academics, to
whom the resources in their archives were of most use were ‘amateur’ family and local history
researchers, whom we refer to collectively as ‘personal history researchers’. Even
though deciding to initially limit ourselves to personal history researchers set
some sort of bounds on our user population, this user population was still fairly
unbounded and sparse.
Compare the characteristics of these users to that of the ‘typically’ reported
user population set out at the beginning of this paper. They:
– are only very loosely bounded — an interest in personal history hardly
constitutes a limiting boundary: who is not interested in their family history?
– do not have well defined needs and tasks — there are a multitude of ways
of tracing your family tree, particularly once researchers have moved beyond
the basic census and birth, marriage and death registers.
– are not easily accessible and co-located — personal history researchers are
quite happy to work on their own: how does one find and contact these
researchers to analyse their needs?
– are extremely heterogeneous — in talking to several members of local and
family history groups we encountered researchers with an enormous range
of skills, from researchers who had no training in research skills to a retired
history professor.
– are only weakly interrelated — many researchers we contacted were members
of local or family history groups, but these met only occasionally (typically
monthly) and in most cases this was the only contact they had with other
similar researchers.
All this adds up to a sparse, unbounded user population.
Note that the AAMH project was action research. The main outcome of the
project was a working, useful DL resource: we did not explicitly set out to develop
for sparse, unbounded networks of users. That the users we were looking at
shared characteristics with models described in social science literature emerged
as a consequence of the design work we were doing. The work described below is
therefore a largely post hoc rationalisation, looking at the work we did through
the lens of network analysis.
Iterative Requirements Gathering and Implementation
The software and requirements engineering literature (eg [9, 10]) was surveyed,
but we could not find a methodology that suited our needs. It is clear that
most software development and requirements gathering processes are based on
the assumption that the projected user groups for the system to be developed
are bounded groups. Most requirements engineering methods advise that the
developers should first define who the users are, and then carefully and explicitly
gather and analyse their requirements. Clearly the process of ‘defining who the
users are’ is about setting boundaries on the group, and gathering requirements
takes place within that defined group: this is an exact parallel to the approach
to analysing bounded groups set out in the introduction.
Furthermore, DL systems are highly interactive, user-driven systems, and therefore to ensure usability and usefulness there are strong arguments [11, 12] that an
iterative design process is needed. Simply put, an iterative design process gathers requirements from a user group, rapidly prototypes an implementation that
hopefully meets those requirements, then evaluates the implementation with the
users. Evaluation will suggest changes to the prototype or to the requirements,
and the process iterates taking these changes into account. The idea is that
from an approximately acceptable starting prototype an increasingly acceptable
implementation is developed.
The spiral model [13] is about iteratively developing prototypes whereas the
star model [11] shows that the requirements should also be included in the process. The star model also argues that developers do not work in a linear way
from requirements to implementation at all: they may start with a prototype,
and then work ‘backwards’ so that the requirements for the prototype emerge.
The key point is that whenever any artefact (requirements statement, prototype,
etc) is proposed it should be evaluated before moving on to develop further design artefacts.
However the unbounded nature of our users posed problems for this iterative
process. Recall that the boundaries of an unbounded network emerge (if at all)
as a consequence of analysing the network, in other words, we would have to do
a lot of analysis in order to delimit who the users actually are, before we could
embark on the sort of iterative design process described above. In a time limited
project like AAMH this was not practical: once we had an analysis of our user
population that was good enough to use in the design, we were likely to have
run out of time to develop anything. Therefore a different approach was needed
such that the analysis of the unbounded user network took place at the same
time as the DL resources were being developed.
‘Early Phase’ Requirements Engineering
However, ‘early phase’ requirements engineering [14] offered promise. The key
principle in standard requirements engineering is that requirements state what
a system should do, as opposed to how the system should do it. This should
promote a clearer understanding of the system among the designers, who are
liable to lose track of what a system should be doing among the messy details
of how it does it. ‘Early phase’ requirements engineering goes one step further,
not only describing what a system should do, but why it should do it. These ‘why’
statements should promote a clearer understanding not only of the projected
system, but the context (organisational, environmental, user, etc) in which it sits.
Early phase requirements engineering looked valuable because it is the why
statements — the understanding of the users — which would hopefully emerge
as the project progressed. We therefore proposed to use informal early phase
requirements engineering in an iterative manner to develop our DL resources.
Our Proposed Design Process
Using early phase terminology there are three groups of design artefacts: why
statements which describe context and assumptions, what statements which describe requirements, and how statements which describe implementations. As
described above the spiral model is about iteratively refining how statements,
and the star model iteratively refines what and how statements. The innovation
of our design process is that it includes why, what and how statements in the
process. In effect our design process was an augmentation of the star model,
where contextual assumptions are also treated as design artefacts.
A likely consequence of such a process would be that we would not get a
‘neat’ incremental improvement of the prototype: changes in the why statements
were likely to result in very dramatic changes to the prototype. Such large
changes are probably unavoidable, but the important point is that the project
expects them, and leaves enough slack in the project schedule to deal with them.
In our case the design process would start with a set of educated guesses about
who the potential users might be and a broad description of their characteristics
(the why statements), what their needs would be (the what statements) and a
rapid prototype of a DL system that met those needs (the how statements). We
would then evaluate these three sets of statements with potential users, change
them according to the evaluation, and then iterate.
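The loop just described can be sketched schematically. Everything below (the artefact contents and the evaluation hook) is invented for illustration; the point is only that why, what and how statements are all revisable design artefacts:

```python
# Schematic sketch of the augmented star model: why/what/how statements
# are all design artefacts, each revisited after evaluation with users
# rather than fixed up front. Contents and names are illustrative only.
artefacts = {
    "why":  ["users are personal history researchers"],
    "what": ["tutorials on forming research questions"],
    "how":  ["static prototype website"],
}

def iterate(artefacts, evaluate):
    """One design iteration: evaluate every artefact set, apply revisions."""
    for kind, statements in artefacts.items():
        revisions = evaluate(kind, statements)  # feedback from users
        if revisions:
            artefacts[kind] = revisions         # even the 'why' may change
    return artefacts
```

Because the 'why' statements are inside the loop, a round of evaluation can force dramatic prototype changes, which is exactly the slack the text says the project schedule must allow for.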
The Design Process in Action
Space precludes a detailed description of how this design process worked in action
on the AAMH project (see [15, section 5] for a more detailed account), but we
include a sketch here to demonstrate the value that this design process added
to the project.
First iteration. Our initial ‘why’ statement proposed that our potential users
would be people interested in using library archive materials in their research.
We further proposed a model of the four processes they would engage in to use
archive material. We suggested that they would:
propose research questions,
identify archival collections that would help answer those questions,
search for materials in those collections, and
interpret the materials they found.
We also proposed this as a roughly cyclical model: we were aware of researchers
who look in collections, and then form research questions based on what they
R. Butterworth
know they can find, and so on. It is not necessarily a linear process from question
formation, through archive identification and searching to interpretation.
Furthermore we proposed that question formation and identifying archives
were the two key processes for users not in higher education. Undergraduate students are given research questions (usually in the form of essay titles or project
proposals) and are pointed by their tutors in the direction of the useful library
collections. Similarly postgraduate students and academics have (or are developing) skills in identifying sensible, tractable research questions, and know how to
use the library staff and their colleagues to identify likely looking archives.
In other words academics and students are a dense group: there are strong and
supportive relationships between students, tutors, colleagues and library staff
which help them construct research questions and identify useful research materials. Sparsely related non-HE users (we assumed) would have neither the skills
nor the supportive network. However we assumed that the users would have good
skills in searching and interpretation, or at least would have access to tutorials
in these skills that our project would not need to replicate.
From this ‘why’ statement stemmed a set of ‘what’ statements: that the website should offer a collection of online tutorials on question formation, and a
discussion group-like facility to allow interaction between users and library staff
to help users identify collections.
A prototype of this system was mocked up and made available to users. We
then set out to evaluate the prototype and the assumptions underlying it. This
was done by inviting local and family history research groups into the library
and visiting meetings of such groups. Individual researchers were also invited into
the library to discuss their work with the project team, and public libraries with
strong local history sections were contacted and they supplied us with contacts
with researchers who used their facilities. We also tried indirect routes to get at
possible users: primarily by interviewing archivists about what their collections
were used for by non-HE users. Our contact with potential users began to ‘span
out’ from the first users we contacted in exactly the way Wellman’s analysis
predicts a sparse group would.
Results of evaluating the first iteration. We found that three of the four
main assumptions were correct: users did need support identifying archives, and
were already competent searching and interpreting archival materials. However
we found that, contrary to our expectations, they did have well developed,
tractable research questions, or if they did not, then they would not be interested in the collections held in university libraries. In retrospect this makes
sense: researchers with badly thought out research questions are likely to be beginners, and would only be interested in the records held in public libraries or
in census data. Once the possibilities of the census data and so on have been exhausted, then the researcher may find value in the collections held in university
libraries, but by this time they will have become experienced researchers and
will have defined and refined their research questions. Note how through this
analysis a boundary for our user population has emerged, again, as predicted by Wellman.
Designing DL Resources for Users in Sparse, Unbounded Social Networks
Our contact with the archivists also provided a key insight: that what was published about an archival collection described objectively what was in it, whereas
the archivists told us subjectively what research one could do with the collection.
This shows the intermediation role that the archivists play (and is discussed in
more detail in the next section) but also suggested to us that a better way of
supporting users in finding archives useful for their research questions would
be encode the archivists’ knowledge of what a collection could be used for as a
searchable, online database.
Second iteration. Based on the evaluation we then had a much clearer idea
of who the users were, what they needed, and how we could fulfill those needs.
We now developed a prototype that did not have tutorials on question formation, and had a database of ‘use centred descriptions’ of University of London
archival collections that suggested to personal history researchers what they
could do with those collections. This prototype and its underlying assumptions
were evaluated, and this time the feedback was much more positive: we now felt
we were firmly on track to deliver a useful DL resource.
Third and subsequent iterations. The way that we were to structure these
use centred descriptions was determined by further evaluation and iteration,
and various user interface issues were dealt with, until a finished artefact was
launched at the end of the project.
We have shown a design process which is intended not only to design a working
artefact, but also to iteratively develop the designers’ understanding of the user
population. It is not a radical departure from existing methods, but simply makes
it explicit that when designing for unbounded groups, the very basic underlying
assumptions need to be evaluated and refined as much as the working artefact.
When looking at the process in action we see that there was a sizeable change
in our ideas about the characteristics of the users after the first iteration, and
correspondingly the first prototype was completely dropped before entering the
second iteration. The key point is that this big change occurred relatively late
in the project, but the project managed to cope with it and still deliver a working product on time, largely because we were expecting a large change once we
had explored enough of the users in our sparse network. This meant that the
early decisions and prototypes were held very lightly, and therefore could be
abandoned without major cost.
Obviously this description of what actually happened has been retrospectively
neatened up. In particular the process of exploring the user networks was not a
linear one: to visit local history groups we had to wait until they had meetings,
or for personal history researchers to visit the library we had to fit our timetables
around theirs. This meant that the analysis came in fits and starts and did not
fit neatly into our design iterations.
Another observation that emerged was how important liaison with archivists
and other front line library staff was in designing the system. This is because
such staff have long been in the job of analysing the needs of their users, and
the results of that analysis were very useful to us in designing our DL resources.
Even though it may be difficult for system developers working on short term
projects to gather requirements directly from users in unbounded networks, it
is possible to get a good indirect picture of their needs through librarians and
archivists who liaise with them in the long term.
Disintermediation in Sparse Networks
Butterworth and Davis Perkins [16] presented an analysis of ‘small, specialist
libraries’ with a focus on how the requirements for developing their digital incarnations differ from those of commercial and academic libraries. In particular
they showed that the intermediation roles of the librarian are even more important and extensive for small, specialist libraries. They showed that librarians
not only play the intermediation roles between information sources and readers
described elsewhere [17, chapter 7], but also play a more social intermediation
role between the readers themselves.
In a sparse network the effect of a social intermediator is dramatic: it turns a
weakly connected or disconnected network into a much more highly connected
network. A sparse network of users may contain several completely separate sub
networks, or even completely isolated actors: a social intermediator connects all
the sub networks and actors together. In theory, if the intermediator is in contact
with all the actors in a network, the effect is to render all the actors at most two
relationships away from each other.
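The claimed effect can be checked with a small illustration (plain Python, actor names hypothetical): adding a single intermediator connected to every actor leaves all actors at most two relationships apart.

```python
from collections import deque

def shortest_path_lengths(adj, start):
    """BFS shortest-path lengths from `start` over an undirected
    adjacency dict mapping node -> set of neighbours."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

# A sparse network: two disconnected pairs and one isolated actor.
actors = ["a", "b", "c", "d", "e"]
adj = {"a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c"}, "e": set()}

# Add an intermediator (e.g. a librarian) in contact with every actor.
adj["hub"] = set(actors)
for actor in actors:
    adj[actor].add("hub")

# Every actor is now at most two relationships from every other.
dist = shortest_path_lengths(adj, "a")
assert all(dist[x] <= 2 for x in actors)
```

In a dense network the same construction adds little, since most pairs are already directly or nearly directly connected, which is exactly the contrast drawn in the text.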
This effect is much more profound for a sparse network than for a dense one:
for a librarian (or anyone) to play a social intermediation role in a dense network would not dramatically increase the interconnectedness of a dense network,
because it is highly interconnected already. There are strong arguments in the
literature [18, 19] against disintermediation, and in the case of digital library systems for sparse user networks we contend that disintermediation is particularly
harmful.
In the Accessing our Archival and Manuscript Heritage project it became
apparent as we explored our potential users that the main way we could benefit
them was helping them to link potential archives with their research questions.
This relationship between research question and an archival collection that can
be used to address that question is often not clear. Very often an archive can
be used in very different ways to the purposes it was collected for. For example
London University’s School of Oriental and African Studies holds an extensive
collection of correspondence sent by 19th Century African missionaries, which
has been used by a researcher to create a climate map of Africa in the 19th
century. This was possible because the missionaries often wrote home and gave
detailed descriptions of the local geography and climate.
Use centred description: The London School of Hygiene and Tropical
Medicine holds an archive of the medical examinations of people who emigrated to the British
colonies and protectorates between 1898 and 1919. As well as giving a detailed account of
the subject’s health, each record gives a small amount of family history (parents, children and
siblings) as well as some details about their current job, the job that they were intending to take
up in the colonies and its location.

Detailed usage description: If you have a relative who apparently ‘disappeared’ at the end of the 19th Century, e.g.
they’re in the 1891 census, but not in the 1901 census, they may have emigrated, and this collection may give you a clue as to where and when
they went [. . . ]
How to tell if the collection is useful: If you know that a family member emigrated
between 1898 and 1919 then this collection is clearly useful. If you don’t know for sure, but
suspect that you may have an ancestor who emigrated, you may email a query to LSHTM’s
archivist, giving as much detail as possible [. . . ]

ISAD(G) description: Patrick Manson was born in 1844 and studied
medicine at Aberdeen University, passing M.B. and C.M. in 1865. In 1866 he became medical
officer of Formosa for the Chinese imperial maritime customs, moving to Amoy in 1871.
Here, while working on elephantoid diseases, he discovered in the tissues of blood-sucking
mosquitoes the developmental phase of filaria worms. From 1883 to 1889 he was based in Hong
Kong, where he set up a school of medicine that developed into the university and medical
school of Hong Kong [. . . ]
Scope and content/abstract: Papers of Sir Patrick Manson, 1865-1964, including Manson’s
diaries, 1865-1879, containing notes on the discovery of mosquitoes as carriers of malaria and
patient case notes; bound manuscript notes of his discovery of filaria, 1877; original drawings of
eggs of bilharzias and embryos of guinea worms, 1893; drawings by Manson of filarial embryos,
1891; correspondence with Charles Wilberforce Daniels [. . . ]

Fig. 1. A partial ISAD(G) [20] collection level description and partial use centred
description of the Sir Patrick Manson archive held at the London School of Hygiene
and Tropical Medicine (both descriptions are edited for size)
A potential problem for a researcher is that the published archival descriptions objectively describe what is in an archive, who created it and when, but do
not describe what can be done with the collection. To identify what a collection
can be used for takes either lateral thinking, a lucky guess, or intermediation
by the archivists who know what uses their collections have been put to in the
past and can pass this knowledge on to other researchers. This social intermediation role of passing knowledge from researcher to researcher is crucial in sparse
networks; if one researcher works out that archive X can be used for purpose
Y, this knowledge is not likely to propagate around a sparse network without
intermediation.
The disparity between what a published description of an archival collection
says (i.e. what is in the collection), and what an archivist will tell you about their
collections (i.e. what you can do with a collection), became one of the central
points of the project. We set about interviewing archivists about what personal
history researchers can use their collections for, and published these as a set of
‘use centred descriptions’ on the site developed by the project. (See the ‘Helpers’
site: http://helpers.shl.lon.ac.uk/.)
Again space precludes a full review of use centred descriptions (see [15, Section
6]) but as an example figure 1 shows part of a use centred description and the
standard archival description of the same collection. The collection described is
the Sir Patrick Manson archive held at the London School of Hygiene and Tropical Medicine. Sir Patrick was a founder of the School and was instrumental in
showing that malaria was transmitted by mosquitoes. His working papers form
the basis of this archival collection. The standard archival description details Sir
Patrick’s life and lists the different materials held in his archive. If you are a
family history researcher there is no indication in the description that the collection would be of any use to you. However the use centred description shows that
the collection contains a list of the medical examinations of twelve thousand people who
emigrated to the British colonies between 1898 and 1919. As well as containing
medical data, these records also detail where the subjects were going in the
colonies to work and their immediate family. This information could be vital to
family historians.
The common approach to repairing disintermediation gaps in digital libraries
is to allow users direct contact with library staff, via email, discussion groups
and chat rooms. (For example, see the People’s Network Enquire service.) We
attempted the same end by encoding the archivists’ knowledge about which
archives are useful to which users, and making it available online as a searchable
database. Clearly no-one would claim that this is a replacement for the job
that archivists and other front end library staff do, but it is a way of allowing
knowledge about the uses that an archive can be put to to travel through a
sparse network of researchers.
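A minimal sketch of such a searchable database of use centred descriptions might look as follows (the field names and sample entries are illustrative only, not the Helpers site's actual schema):

```python
# Toy database of 'use centred descriptions': each entry records what
# research a collection can be USED for, not just what it contains.
use_centred_descriptions = [
    {
        "collection": "Sir Patrick Manson archive",
        "holder": "London School of Hygiene and Tropical Medicine",
        "uses": "emigration records 1898-1919; family history; medical examinations",
    },
    {
        "collection": "Missionary correspondence",
        "holder": "School of Oriental and African Studies",
        "uses": "19th century African climate and geography; family history",
    },
]

def search_by_use(query, descriptions):
    """Return the collections whose described uses mention every query term."""
    terms = query.lower().split()
    return [d["collection"] for d in descriptions
            if all(t in d["uses"].lower() for t in terms)]

matches = search_by_use("family history", use_centred_descriptions)
```

Searching on intended use rather than on objective content is the whole point: a family historian querying "family history" finds the Manson archive even though its ISAD(G) description never suggests that use.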
This paper has argued that there are classes of potential digital library users
outside of tertiary education who are best characterised as sparse, unbounded
networks. We argue that most requirements engineering techniques make the assumption that a system is designed for a bounded group of users, and therefore
do not serve DL development well. Furthermore we have contended that intermediation is particularly important in a sparse network, and we have discussed how
‘use centred descriptions’ of archival collections act as an intermediation tool.
We believe that sparse, unbounded networks are much more common within
traditional and digital library users than is suggested by the concentration on
dense, bounded networks reported in the DL literature. We would propose further work, particularly looking at users of public libraries, to further draw out
the characteristics of these user networks.
References

1. Shen, R., Gonçalves, M.A., Fan, W., Fox, E.: Requirements gathering and modelling of domain-specific digital libraries with the 5S framework: an archaeological
case study with ETANA. [21] 1–12
2. Monroy, C., Furuta, R., Mallen, E.: Visualising and exploring Picasso’s world. In
Marshall, C., Henry, G., Delcambre, L., eds.: Proceedings of 2003 Joint Conference
on Digital Libraries, IEEE (2003)
3. Wilson, R., Landoni, M., Gibb, F.: Guidelines for designing electronic books. In
Agosti, M., Thanos, C., eds.: Proceedings of 6th European Conference on Digital
Libraries, ECDL 2002. Volume LNCS 2458., Springer Verlag (2002)
4. Wellman, B.: An electronic group is virtually a social network. In Kiesler, S., ed.:
Culture of the internet. Lawrence Erlbaum (1997) 179–205
5. Wasserman, S., Faust, K.: Social network analysis: methods and applications.
Cambridge University Press (1994)
6. Wellman, B., Berkowitz, S.D., eds.: Social structures: a network approach. JAI
Press Inc. (1988)
7. Davis Perkins, V., Butterworth, R., Curzon, P., Fields, B.: A study into the effect
of digitisation projects on the management and stability of historic photograph
collections. [21] 278–289
8. Keil, M., Cule, P., Lyytinen, K., Schmidt, R.: A framework for identifying software
project risks. Commun. ACM 41(11) (1998) 76–83
9. McGraw, K., Harbison, K.: User-centred requirements: the scenario based engineering process. Lawrence Erlbaum Associates (1997)
10. Sommerville, I., Sawyer, P.: Requirements engineering: a good practice guide.
Wiley (1997)
11. Hix, D., Hartson, H.R.: Developing user interfaces: Ensuring usability through
product and process. John Wiley and Sons (1993)
12. Dix, A., Finlay, J., Abowd, G., Beale, R.: Human computer interaction. Third edn.
Pearson (2004)
13. Boehm, B.: A spiral model of software development and enhancement. IEEE
Computer 21(5) (1988) 61–72
14. Yu, E.: Towards modelling and reasoning support for early-phase requirements
engineering. In: Proceedings of RE-97: 3rd International Symposium on Requirements Engineering. (1997) 226–235
15. Butterworth, R.: The Accessing our Archival and Manuscript Heritage project and
the development of the ‘Helpers’ website. Technical Report IDC-TR-2006-001, Interaction Design Centre, School of Computing Science, Middlesex University (2006)
Available from http://www.cs.mdx.ac.uk/research/idc/tech_reports.html.
16. Butterworth, R., Davis Perkins, V.: Assessing the roles that a small specialist
library plays to guide the development of a hybrid digital library. In Crestani,
F., Ruthven, I., eds.: Information Context: Nature, Impact, and Role: 5th International Conference on Conceptions of Library and Information Sciences, CoLIS
2005. Volume LNCS 3507., Springer Verlag (2005) 200–211
17. Nardi, B.A., O’Day, V.: Information Ecologies. MIT Press (2000)
18. Borgman, C.: From Gutenberg to the global information infrastructure: access to
information in the networked world. MIT Press (2000)
19. Vishik, C., Whinston, A.: Knowledge sharing, quality, and intermediation. In:
WACC ’99: Proceedings of the international joint conference on Work activities
coordination and collaboration, ACM Press (1999) 157–166
20. International Council on Archives: ISAD(G): General international standard
archival description. (1999) Second edition.
21. Rauber, A., Christodoulakis, S., Min Tjoa, A., eds.: Research and Advanced Technology for Digital Libraries: 9th European Conference, ECDL 2005. Springer Verlag (2005)
Design and Selection Criteria for a
National Web Archive
Daniel Gomes, Sérgio Freitas, and Mário J. Silva
University of Lisbon, Faculty of Sciences
1749-016 Lisboa, Portugal
[email protected], [email protected], [email protected]
Abstract. Web archives and Digital Libraries are conceptually similar,
as they both store and provide access to digital contents. The process
of loading documents into a Digital Library usually requires a strong intervention from human experts. However, large collections of documents
gathered from the web must be loaded without human intervention. This
paper analyzes strategies to select contents for a national web archive and
proposes a system architecture to support it.1
Publishing tools, such as Blogger, enabled people with limited technical skills
to become web publishers. Never before in the history of mankind was so much information published. However, it was never so ephemeral. Web documents
such as news, blogs or discussion forums are valuable descriptions of our times,
but most of them will not last longer than one year [21]. If we do not archive the
current web contents, future generations could face an information gap
about our days. The archival of web data is of interest beyond historical purposes.
Web archives are valuable resources for research in Sociology or Natural Language Processing. Web archives could also provide evidence in judicial matters
when ephemeral offensive contents are no longer available online. The archival of
conventional publications has been directly managed by human experts, but this
approach cannot be directly adapted to the web, given its size and dynamics.
We believe that web archiving must be performed with minimum human intervention. However, this is a technologically complex task. The Internet Archive
collects and stores contents from the world-wide web. However, it is difficult for
a single organization to archive the web exhaustively while satisfying all needs,
because the web is permanently changing and many contents disappear before
they can be archived. As a result, several countries are creating their own national archives to ensure the preservation of contents of historical relevance to
their cultures [6].
Web archivists define boundaries of national webs as selection criteria. However, these criteria influence the coverage of their archives.1

1 This study was partially supported by FCT under grant SFRH/BD/11062/2002
(scholarship) and FCCN.

J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 196–207, 2006.
© Springer-Verlag Berlin Heidelberg 2006

Fig. 1. Distribution of documents per domain from the Portuguese web

In this paper, we
analyze strategies for selecting contents for a national web archive and present
a system architecture to support a web archiving system. This architecture
was validated through a prototype named Tomba. We loaded Tomba with 57
million documents (1.5 TB) gathered from the Portuguese web during 4 years
of crawls run to update the indexes of a search engine, and made this information publicly
available through a web interface (available at tomba.tumba.pt). The main contributions of this paper are: i) the evaluation of selection strategies to populate
a web archive; ii) a system architecture to support a web archive.
In the following Section we discuss strategies to populate a web archive. In
Section 3, we present the architecture of the Tomba prototype. Section 4 presents
related work and in Section 5 we conclude our study and propose future work.
Web archivists define strategies to populate web archives according to the scope
of their actions and the resources available. An archive can be populated with
contents delivered from publishers or harvested from the web. The delivery of
contents published on the web works on a voluntary basis in The Netherlands
but it is a legislative requirement in Sweden [20]. However, the voluntary delivery
of contents does not motivate most publishers, because it imposes additional
costs without providing any immediate income. On the other hand, it is difficult
to legally impose the delivery of contents published on sites hosted on foreign
web servers, outside a country’s jurisdiction. The absence of standard methods
and file formats to support the delivery of contents is also a major drawback,
because it inhibits the inclusion of delivery mechanisms in popular publishing
tools. Alternatively, a web archive can be populated with contents periodically
harvested from the country’s web. However, defining the boundaries of a national
web is not straightforward and the selection policies are controversial.
We used the Portuguese web as a case study of a national web and assumed
that it was composed of the documents hosted on a site under the .PT domain, or written in the Portuguese language, hosted in other domains and linked
from .PT [10]. We used a crawl of 10 million documents harvested from the Portuguese web in July, 2005 as a baseline to compare the coverage of various selection
strategies.
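The selection rule can be sketched as a predicate (an illustrative reconstruction; in a real crawler, language identification and link analysis would supply the extra signals):

```python
from urllib.parse import urlparse

def in_portuguese_web(url, language=None, linked_from_pt=False):
    """Sketch of the paper's selection rule: a document belongs to the
    Portuguese web if it is hosted under the .PT domain, or if it is
    written in Portuguese, hosted elsewhere, and linked from .PT.
    `language` and `linked_from_pt` stand in for language identification
    and link analysis performed during the crawl."""
    host = urlparse(url).hostname or ""
    if host.lower().endswith(".pt"):
        return True
    return language == "pt" and linked_from_pt
```

For example, a page on a .COM blog host written in Portuguese and linked from a .PT site would be selected, while an unlinked English-language .COM page would not.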
Country Code Top Level Domains
There are two main classes of top-level domains (TLD): generic (gTLDs) and
country code (ccTLDs). The gTLDs were meant to be used by particular classes
of organizations (e.g. COM for commercial organizations) and are administered
by several institutions worldwide. The ccTLDs are delegated to designated
managers, who operate them according to local policies adapted to best meet
the economic, cultural, linguistic, and legal circumstances of the country. Hence,
sites with a domain name under a ccTLD are strong candidates for inclusion in a
web archive. However, this approach excludes the documents related to a country
hosted outside the ccTLD. Figure 1 presents the distribution of documents from
the Portuguese web per domain and shows that 49% of its documents are hosted
outside the ccTLD .PT.
Exclude Blogs
Blogs have been introduced as frequent, chronological publications of personal
thoughts on the web. Although the presence of blogs is increasing, most of them
are rarely seen and quickly abandoned. According to a survey, "the typical blog
is written by a teenage girl who uses it twice a month to update her friends and
classmates on happenings on her life" [5], which hardly matches the common
requirements of a document with historical relevance. On the other hand, blogs
are also used to easily publish and debate any subject, gaining popularity against
traditional web sites. Blogs that describe the life of citizens from different ages,
classes and cultures will be an extremely valuable resource for a description of
our times [8].
We considered that a site is a blog if it contained the string "blog" in the
site name, and observed that 15.5% of the documents in the baseline would have
been excluded from a national web archive if blogs were not archived. 67% of
the blog documents were hosted under the .com domain and 33% were hosted
on blogs under the .PT domain. One reason for this observation is that
most popular blogging sites are hosted under the .COM domain, which tends
to increase the number of documents from a national web hosted outside the
country code TLD (e.g. Blogspot, which holds 63% of the Portuguese blogs).
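The blog heuristic described above can be expressed directly (a sketch; the study inspected crawled site names rather than a fixed URL list):

```python
from urllib.parse import urlparse

def is_blog_site(url):
    """The paper's heuristic: a site counts as a blog if the string
    'blog' appears in the site (host) name."""
    host = urlparse(url).hostname or ""
    return "blog" in host.lower()

# Hypothetical sample of crawled URLs.
sites = [
    "http://myblog.blogspot.com/post1",
    "http://www.example.pt/index.html",
    "http://blogs.example.pt/diary",
]
blog_fraction = sum(is_blog_site(s) for s in sites) / len(sites)
```

The heuristic is deliberately crude: it misses blogs hosted under names without "blog" and may catch non-blog sites, but it is cheap enough to apply to an entire crawl.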
Physical Location of Web Servers
The RIPE Network Management Database provides the country where an IP address was first allocated or assigned. One could assume that a country’s web
is composed of the documents hosted on servers physically located in the country. We observed that only 39.4% of the IP addresses of the baseline Portuguese
web were assigned to Portugal.
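This strategy could be sketched as follows (the allocation table below is an illustrative stand-in for real queries against the RIPE database, and the example prefixes are assumptions):

```python
import ipaddress

# Hand-filled mapping standing in for RIPE database lookups; a real
# system would query the RIPE Network Management Database (whois).
# The prefixes and country assignments here are illustrative.
RIPE_ALLOCATIONS = {
    ipaddress.ip_network("193.136.0.0/15"): "PT",   # assumed example allocation
    ipaddress.ip_network("130.206.0.0/16"): "ES",   # assumed example allocation
}

def country_of(ip):
    """Return the country code of the allocation containing `ip`, if any."""
    addr = ipaddress.ip_address(ip)
    for net, country in RIPE_ALLOCATIONS.items():
        if addr in net:
            return country
    return None
```

As the 39.4% figure shows, physical location is a poor proxy for national webs: many national sites are hosted abroad, e.g. on popular .COM hosting services.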
Select Media Types
A web archive may select the types of the contents it will store depending on the
resources available and the scope of the archive. For instance, one may populate
Table 1. Prevalence of media types on the Portuguese web (for each media type, its
average size and percentage of documents)
a web archive exclusively with audio contents. Preservation strategies must be
implemented according to the format of the documents. For instance, preserving
documents in proprietary formats may require having to preserve also the tools
to interpret them. The costs and complexity of the preservation of documents
increase with the variety of media types archived and may become unmanageable. Hence, web archivists focus their efforts on the preservation of documents
with a selected set of media types. Table 1 presents the coverage of selection
strategies according to the selected media types. We can observe that a web
archive populated only with HTML pages, JPEG and GIF images covers 95.2%
of a national web.
Ignore Robots Exclusion Mechanisms
The Robots Exclusion Protocol (REP) enables authors to define which parts of
a site should not be automatically harvested by a crawler through a file named
"robots.txt" [16], and the meta-tag ROBOTS indicates if a page can be indexed
and its links followed [26]. Search engines present direct links to the pages containing relevant information to answer a given query. Some publishers only allow
the crawl of the site’s home page to force readers to navigate through several
pages containing advertisements until they find the desired page, instead of finding it directly from search engine results. One may argue that archive crawlers
should ignore these exclusion mechanisms to achieve the maximum coverage of
the web. However, the exclusion mechanisms are also used to prevent the crawling of sites under construction and infinite contents such as online calendars [24].
Moreover, some authors create spider traps, sets of URLs that cause
the infinite crawl of a site [15], to punish the crawlers that do not respect the
exclusion mechanisms. So, ignoring the exclusion mechanisms may degrade the
performance of an archive crawler.
We observed that 19.8% of the Portuguese web sites contained the "robots.txt"
file but the REP forbade the crawl of just 0.3% of the URLs. 10.5% of the pages
contained the ROBOTS meta-tag but only 4.3% of them forbade the indexing of
Fig. 2. Architecture of the Tomba web archive
the page and 5% disallowed the following of links. The obtained results suggest
that ignoring exclusion mechanisms does not significantly increase the coverage
of a national web crawl. However, this behavior may degrade the crawler’s performance because exclusion mechanisms are also used to protect crawlers from
hazardous situations.
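For a crawler that does respect the REP, Python's standard library provides a ready-made checker; the site content and user-agent name in this example are hypothetical:

```python
from urllib import robotparser

# Check the Robots Exclusion Protocol before fetching: an archive crawler
# that honours robots.txt would run a test like this for every URL.
rp = robotparser.RobotFileParser()

# Parse a robots.txt body directly (instead of rp.set_url(...) + rp.read(),
# which would fetch it over HTTP). This one excludes an 'infinite' calendar.
rp.parse("""User-agent: *
Disallow: /calendar/
""".splitlines())

allowed = rp.can_fetch("archive-crawler", "http://example.pt/index.html")
blocked = rp.can_fetch("archive-crawler", "http://example.pt/calendar/2006")
```

Here `allowed` is True and `blocked` is False: the home page may be archived, but the calendar pages (a typical crawler trap) are skipped.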
The Tomba Web Archive
The Tomba web archive is a prototype system developed at the University of
Lisbon to research web archiving issues. A web archive system must present
an architecture able to follow the pace of the evolution of the web, supporting distinct selection criteria and gathering methods. Meta-data must be kept
to ensure the correct interpretation and preservation of the archived data. A
collection of documents built through incremental crawls of the web contains
duplicates, arising both from documents that remain unchanged between crawls
and from different URLs that reference the same document. It is desirable to minimize duplication among
the archived data to save storage space without jeopardizing performance. The
storage space must be extensible to support the collection growth and support
various storage policies according to the formats of the archived documents and
the level of redundancy required. The archived data should be accessible to
humans and machines, supporting complementary access methods to fulfill the
requirements of distinct usage contexts. There must be adequate tools to manage and preserve the archived documents, supporting their easy migration to
different technological platforms.
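The duplicate-elimination requirement can be illustrated with content hashing (a toy sketch, not Tomba's actual Volume implementation):

```python
import hashlib

class Volume:
    """Toy content store that eliminates exact duplicates by storing each
    document body once, keyed by the hash of its bytes. This is a
    simplification of what a web-archive repository might do."""

    def __init__(self):
        self._store = {}

    def put(self, content: bytes) -> str:
        """Store `content` and return its key; identical bodies
        (unchanged pages, or the same page under two URLs) share one entry."""
        key = hashlib.sha256(content).hexdigest()
        self._store.setdefault(key, content)
        return key

    def __len__(self):
        return len(self._store)

vol = Volume()
k1 = vol.put(b"<html>unchanged page</html>")   # first crawl
k2 = vol.put(b"<html>unchanged page</html>")   # later crawl, same content
```

Both calls return the same key and the body is stored only once, so incremental crawls of a mostly static web cost little extra storage.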
Figure 2 represents the architecture of Tomba. There are 4 main components. The Gatherer is responsible for collecting web documents and integrating
them in the archive. The Repository stores the contents and their corresponding
meta-data. The Preserver provides tools to manage and preserve the archived
data. The Searcher allows human users to easily access the archived data. The
Archivist is a human expert that manages preservation tasks and defines selection criteria to automatically populate the archive.
Design and Selection Criteria for a National Web Archive
Fig. 3. Data model
A content is the result of a successful download from the web (e.g. an HTML
file), while meta-data is information that describes it (e.g. size). The Repository
is composed of the Catalog [3], which provides high-performance structured access
to meta-data, and the Volumes [9], which provide an extensible storage space to
keep the contents, eliminating duplicates among them.
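The duplicate elimination performed by the Volumes can be illustrated with a content-addressable store keyed by a cryptographic hash. This is a sketch under the assumption that identical byte sequences are detected by hashing; it is not the actual Tomba implementation:

```python
import hashlib

class Volume:
    """Minimal content-addressable store: identical payloads are kept once."""
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()  # plays the role of a contentKey
        self._blobs.setdefault(key, data)       # store only if not seen before
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

vol = Volume()
k1 = vol.put(b"<html>same page</html>")
k2 = vol.put(b"<html>same page</html>")  # same content reached via another URL
assert k1 == k2 and len(vol._blobs) == 1  # stored exactly once
```

Two URLs referencing identical content yield the same key, so the bytes are stored only once while both catalog entries remain resolvable.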
Figure 3 describes the data model of the Catalog. We assume that the archive
is loaded in bulk with snapshots of the web. The Source class identifies the origin
of the document, for example a URL on the web. Each Version represents a
snapshot of the information gathered from a Source. The Versions corresponding
to the same snapshot of the web are aggregated in Layers. A Layer represents the
time interval from its creation until the creation of the next one. This way, time is
represented in a discrete fashion within the archive, facilitating the identification
of web documents that need to be presented together, such as a page and the
embedded images. The Property class holds property lists containing meta-data
related to a Version. The use of property lists instead of a static meta-data
model enables the incremental annotation of contents with meta-data items
when required in the future. The Content and Facet classes reference documents
stored in the Volumes. The former references the documents in their original
format and the latter alternative representations. For instance, a Content may be an
HTML page that has a Facet providing the text contained in it. In the archive,
Facets provide storage for current representations of contents retrieved earlier
in obsolete formats. The Repository supports merging the Content, Facets and
meta-data of a Version into a single Facet in a semi-structured format (XML),
so that each document archived in a Volume can be independently accessed
from the Catalog. Some web documents contain information that is useful for preserving others. For instance, a web page containing the specification of the
HTML format could be used in the future to interpret documents written in this
format. The Reference class enables the storage of associations of Versions that
are related to each other.
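The Catalog's classes described above can be rendered, for illustration only, as Python dataclasses. The field names follow Fig. 3; the relationships are our reading of the text, and the contentKey values are hypothetical:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class Source:                 # origin of a document, e.g. a URL on the web
    source: str
    type: str

@dataclass
class Property:               # open-ended meta-data item attached to a Version
    key: str
    value: str

@dataclass
class Content:                # document in its original format, kept in a Volume
    contentKey: str

@dataclass
class Facet:                  # alternative representation, e.g. extracted text
    contentKey: str

@dataclass
class Version:                # one snapshot of a Source within a Layer
    source: Source
    creationDate: date
    properties: List[Property] = field(default_factory=list)
    content: Optional[Content] = None
    facets: List[Facet] = field(default_factory=list)

@dataclass
class Layer:                  # one crawl of the web; time is discrete per Layer
    creationDate: date
    versions: List[Version] = field(default_factory=list)

layer = Layer(date(2003, 1, 1))
v = Version(Source("http://www.tumba.pt/", "web"), date(2003, 1, 1))
v.content = Content("sha256-of-original-html")      # hypothetical key
v.facets.append(Facet("sha256-of-extracted-text"))  # hypothetical key
layer.versions.append(v)
```

Grouping Versions under a Layer is what lets the archive display a page together with the embedded images gathered in the same crawl.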
The Gatherer, composed of the Loader and the Crawler, integrates web data
in the Repository. The Loader was designed to support the delivery of web
contents by publishers and receive previously compiled collections of documents.
The Crawler iteratively harvests information from the web, downloading pages
and following the linked URLs. Ideally, a page and the embedded or referenced
documents would be crawled sequentially, to avoid some of them becoming
unavailable in the meantime. Sequentially crawling all the documents referenced by
a page degrades the crawler’s performance, because harvesting the documents
hosted outside a site requires additional DNS lookups and establishment of new
TCP connections. According to Habib and Abrams, these two factors account for
55% of the time spent downloading web pages [12]. Crawling the documents of
one site at a time in a breadth-first mode, while postponing the crawl of external
documents until the corresponding sites are visited, is a compromise solution
that ensures that the majority (71%) of the embedded documents internal to
each site are crawled at short notice, without requiring additional bandwidth
usage [18].
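The compromise crawling order just described (breadth-first within one site, external links deferred until their own site is visited) can be sketched as follows. The frontier structure is an assumption for illustration, not the paper's implementation:

```python
from collections import deque
from urllib.parse import urlsplit

def site(url: str) -> str:
    return urlsplit(url).netloc

def crawl_order(seeds, links):
    """Visit one site at a time, breadth-first within the site; defer
    external links until their own site's queue is processed."""
    queues = {}                                   # per-site FIFO frontiers
    for s in seeds:
        queues.setdefault(site(s), deque()).append(s)
    seen, order = set(seeds), []
    while queues:
        host = next(iter(queues))                 # pick the next pending site
        q = queues[host]
        while q:
            url = q.popleft()
            order.append(url)
            for link in links.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queues.setdefault(site(link), deque()).append(link)
        del queues[host]
    return order

links = {
    "http://a.pt/": ["http://a.pt/p1", "http://b.pt/x"],  # internal + external link
}
print(crawl_order(["http://a.pt/"], links))
# → ['http://a.pt/', 'http://a.pt/p1', 'http://b.pt/x']
```

The internal page is fetched right after its parent, while the external document waits until its own site's queue is opened, so DNS lookups and TCP connections are amortized per site.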
Replication is crucial to prevent data loss and ensure the preservation of the
archived documents. The replication of data among mirrored storage nodes must
consider the resources available, such as disk throughput and network bandwidth. A new document loaded into the archive can be immediately stored
across several mirrors, but this is less efficient than replicating documents in
bulk. Considering that an archive is populated with documents crawled from
the web within a limited time interval, the overhead of replicating each document individually could be prohibitive. The Replicator copies the information
kept in a Volume to a mirror in batch after each crawl is finished. The Dumper
exports the archived data to a file using three alternative formats: i) WARC, proposed by the Internet Archive to facilitate the exportation of data to other web
archives [17]; ii) an XML-based format to enable flexible automatic processing;
iii) a textual format with minimal formatting, created to minimize the space
used by the dump file. The dissemination of the archived documents as public
collections is an indirect way to replicate them outside the archive, increasing
their chance of persisting into the future. These collections are interesting for
scientific evaluations [14] or to be integrated in other web archives. The main
obstacles to the distribution of web collections are their large size, the lack of
standards for formatting them so that they can be easily integrated in external systems,
and copyright legislation that requires authorization from the authors of the
documents to distribute them. Obtaining these authorizations is problematic
for web collections having millions of documents written by different authors.
The archived documents in obsolete formats must be converted to up-to-date
formats to maintain their contents accessible. The Converter iterates through
the documents kept in the Repository and generates Facets containing alternative representations in different formats.

Fig. 4. Tomba web interface

The Manager allows a human user to access and alter the archived information. The meta-data contained in the
Content-Type HTTP header field identifies the media type of a web document
but sometimes it does not correspond to the real media type of the document.
In our baseline crawl, 1.8% of the documents identified as plain text were in
fact JPEG image files. The format of a document is commonly related to the
file name extension of the URL that references it. This information can be used
to automatically correct erroneous media type meta-data. However, the usage of
file name extensions is not mandatory within URLs, and the same file name extension may be used to identify more than one format. For example, the extension
.rtf identifies documents in the application/rtf and text/richtext media types. In
these cases, a human expert can try to identify the media type of the document
and correct the corresponding meta-data using the Manager.
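The extension-based media type correction can be sketched with Python's standard mimetypes module. The function is hypothetical; marking .rtf as ambiguous mirrors the example in the text:

```python
import mimetypes
from urllib.parse import urlsplit

AMBIGUOUS = {".rtf"}   # maps to more than one media type; left to a human expert

def guess_media_type(url: str, reported: str) -> str:
    """Return a corrected media type, keeping the reported one when the
    URL has no file name extension or the extension is ambiguous."""
    path = urlsplit(url).path
    name = path.rsplit("/", 1)[-1]
    ext = "." + name.rsplit(".", 1)[-1] if "." in name else ""
    if not ext or ext.lower() in AMBIGUOUS:
        return reported
    guessed, _ = mimetypes.guess_type(path)
    return guessed or reported

print(guess_media_type("http://site.pt/photo.jpg", "text/plain"))      # image/jpeg
print(guess_media_type("http://site.pt/doc.rtf", "application/rtf"))   # application/rtf
```

A JPEG mislabeled as text/plain is corrected automatically, while the ambiguous .rtf case is passed through unchanged for manual review via the Manager.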
The Searcher provides three methods for accessing the archived data: Term Search,
URL History, and Navigation. The Term Search method finds documents containing
a given term. The documents are previously indexed to speed up the searches. The
URL History method finds the versions of a document referenced by a URL. The
Navigation method enables browsing the archive using a web proxy.
Figure 4 presents the public web interface of Tomba that supports the URL
History access method. Navigation within the archive begins with the submission of a URL in the input form of the Tomba home page. In general, multiple
different URLs reference the same resource on the web, and users may submit any of them interchangeably. If only exact matches on the submitted
URL were accepted, some documents might not be found in the archive. Hence,
Tomba expands each submitted URL to a set of URLs that are likely to reference
the same resource, and then searches for all of them. For instance, if a user inputs the
URL www.tumba.pt, Tomba will look for documents harvested from the URLs
www.tumba.pt/, tumba.pt, www.tumba.pt/index.html, www.tumba.pt/index.htm, www.tumba.pt/index.php, and www.tumba.pt/index.asp. On the visualization
interface, the archive dates of the available versions of a document are displayed
on the left frame. The most recent version of the document is initially presented
on the right frame and users can switch to other versions by clicking on the
associated dates. The versions presented on the left frame enable quick tracking of the evolution of a document. The documents harvested from the web are
archived in their original format. However, they are transformed before being
presented to the user, to mimic their original layout and to allow links on a
displayed page to be followed to other documents within the archive. The
documents are parsed, and the URLs of embedded images and links to other documents are replaced to reference archived documents.
When a user clicks on a link, Tomba picks the version of the URL in the same
Layer as the referring document and displays it on the right frame, along with
the corresponding versions on the left frame. A user may retrieve an archived
document without modifications by checking the box original content below the
submission form (Figure 4). This is an interesting feature for authors who want
to recover old versions of a document. The Page Flashback mechanism enables
direct access to the archived versions of a document from the web being displayed on the browser. The user just needs to click on a toolbar icon and the
versions of the page archived in Tomba will be immediately presented.
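The URL expansion behind the URL History method can be sketched as follows. The variant list reproduces the www.tumba.pt example given above; the function name and normalization rules are ours:

```python
from urllib.parse import urlsplit

def expand_url(url: str) -> list:
    """Generate URL variants likely to reference the same resource.
    The variant set mirrors the www.tumba.pt example in the text."""
    if "://" not in url:
        url = "http://" + url
    host = urlsplit(url).netloc
    variants = [f"http://{host}/"]
    if host.startswith("www."):
        variants.append(f"http://{host[4:]}")            # drop the www. prefix
    for index in ("index.html", "index.htm", "index.php", "index.asp"):
        variants.append(f"http://{host}/{index}")
    return variants

print(expand_url("www.tumba.pt"))
```

A real implementation would also need to handle URLs with paths and query strings; this sketch covers only the site-root case shown in the example.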
The URL History access method has three main limitations. First, users may not
know which URL they should submit to find the desired information. Second,
the short life of URLs limits their history to a small number of versions. The
Tomba prototype was loaded with 10 incremental crawls of the Portuguese web,
but on average each URL referenced just 1.7 versions of a document. Third,
the replacement of URLs may not be possible in pages containing format errors
or complex scripts to generate links. If these URLs reference documents that
are still online, the archived information may be presented along with current
documents. The Term Search and Navigation complement the URL History but
they have other limitations. The Term Search finds documents independently
from URLs, but some documents may not be found because the corresponding
text could not be correctly extracted and indexed [7]. The Navigation method
enables browsing the archive without requiring the replacement of URLs because all the HTTP requests issued by the user’s browser must pass through
the proxy, which returns contents only for archived documents. However, it might
be hard to find the desired information by following links among millions of
archived documents.
Related Work
According to the National Library of Australia, there are 16 countries with well-established national web archiving programs [20]. The Internet Archive was the
pioneer web archive. It has been executing broad crawls of the web and released
an open-source crawler named Heritrix [11]. The National Library of Australia
founded its web archive initiative in 1996 [22]. It developed the PANDAS (PANDORA Digital Archiving System) software to periodically archive Australian
online publications, selected by librarians for their historical value. The British
Library leads a consortium that is investigating the issues of web archiving [4]. The
project aims to collect and archive 6,000 selected sites from the United Kingdom
over two years using the PANDAS software. The sites have been stored, catalogued, and checked for completeness. The MINERVA (Mapping the INternet
Electronic Resources Virtual Archive) Web Archiving Project was created by
the Library of Congress of the USA, and archives specific publications available on the web that are related to important events, such as an election [25].
In December 2004 the Danish parliament passed a new legal deposit law that
calls for the harvesting of the Danish part of the Internet for the purpose of
preserving cultural heritage, and two libraries became responsible for the development of the Netarkivet web archive [19]. The legal deposit of web contents in
France will be divided among the Institut National de l’Audiovisuel (INA) and
the National Library of France (BnF). Thomas Drugeon presented a detailed
description of the system developed to crawl and archive specific sites related to
media and audiovisual content [7]. The BnF will be responsible for the archive of online
writings and newspapers and preliminary work in cooperation with a national
research institute (INRIA) has already begun [1].
The National Library of Norway had a three-year project called Paradigma
(2001-2004) to find the technology, methods and organization for the collection
and preservation of electronic documents, and to give the National Library's
users access to these documents [2]. The defunct NEDLIB project (1998-2000)
included national libraries from several countries (including Portugal) and had
the purpose of developing harvesting software specifically for the collection of
web resources for a European deposit library [13]. The Austrian National Library, together with the Department of Software Technology at the Technical
University of Vienna, initiated the AOLA project (Austrian On-Line Archive)
[23]. The goal of this project is to build an archive by periodically harvesting the
Austrian web. The national libraries of Finland, Iceland, Denmark, Norway and
Sweden participate in the Nordic Web Archive (NWA) project [?]. The purpose
of this project is to develop an open-source software tool set that enables the
archiving of, and access to, web collections.
Conclusions and Future Work
We proposed and evaluated selection criteria to automatically populate a national web archive. We observed that no criterion alone provides a solution for
selecting the contents to archive; combinations must be used. Some criteria
are not selective but their use may prevent difficulties found while populating
the archive. In particular, we conclude that populating a national web archive
only with documents hosted in sites under the country's Top Level Domain, or
physically located in the country, excludes a large amount of documents. The
costs and complexity of the preservation of documents increases with the variety
of media types archived. We observed that archiving documents of just three
media types (HTML, GIF and JPEG) reduced the coverage of a national web by
only 5%. We conclude that this is an interesting selection criterion to simplify
web archiving, in exchange for a small reduction in the coverage of the web.
We described the architecture of an information system designed to fulfil
the requirements of web archiving, and validated it through the development of
a prototype named Tomba. We loaded Tomba with 57 million documents (1.5
TB) harvested from the Portuguese web during the past 4 years and explored
three different access methods. None of them is complete by itself, so they must
be used in conjunction to provide access to the archived data.
As future work, we intend to enhance accessibility to the archived information
by studying a user interface suitable for accessing a web archive.
What Is a Successful Digital Library?
Rao Shen, Naga Srinivas Vemuri, Weiguo Fan, and Edward A. Fox
Digital Library Research Laboratory, Virginia Tech, USA
{rshen, nvemuri, wfan, fox}@vt.edu
Abstract. We synthesize diverse research in the area of digital library (DL)
quality models, information systems (IS) success and adoption models, and information-seeking behavior models, to present a more integrated view of the
concept of DL success. Such a multi-theoretical perspective, considering user
community participation throughout the DL development cycle, supports understanding of the social aspects of DLs and the changing needs of users interacting with DLs. It also helps in determining when and how quality issues can be
measured and how potential problems with quality can be prevented.
1 Introduction
Hundreds of millions of dollars have been invested since the early 1990s in research
and development related to digital libraries (DLs). Further R&D is needed worldwide
[17] if the tremendous potential of DLs is to be achieved. Hence, determining the key
characteristics of DL success is of the utmost importance.
What qualifies as a successful DL, and what does not? As this question begins to
be analyzed, more questions arise. Who is the intended user of a DL? What is the
user’s goal for using the DL? What are individual organizations trying to get from
their DLs?
For several years, researchers from various disciplines have studied different
perspectives of DL success and have generated many interesting yet often isolated
findings. Some findings have provided different, although sometimes overlapping, perspectives on how to evaluate DLs. One of them is the DL quality model developed by
Gonçalves [11]. For each key concept of a minimal DL, [11] lists a number of dimensions of quality and a set of numerical measurements for those quality dimensions.
Though many would consider a DL to be a type of information system (IS), it often is
forgotten that there is a long tradition in IS research of evaluating the success of a generic IS. A variety of measures have been used. Two primary research streams, the user
satisfaction literature and the technology acceptance literature (i.e., the technology acceptance model, or TAM), have been investigated. User satisfaction is based on users'
attitudes toward a system. We define satisfaction as a user's affective state, representing an
emotional reaction to an entire DL and a consequence of the user's experiences during
various information-seeking stages. Therefore, we seek to understand the changing
needs of users interacting with the DL, and the users’ information-seeking behavior
during these stages [1]. Fortunately, too, information-seeking behavior has been studied
for decades, and many models have been generated.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 208 – 219, 2006.
© Springer-Verlag Berlin Heidelberg 2006
A system succeeds when its intended users use it as frequently as needed. User
satisfaction prompts user acceptance of the system and leads to higher system
usage, because attitude leads to action. Thus, DL user satisfaction can lead to DL success.
The rest of this paper is organized as follows. Section 2 presents the background
for our proposed model, which is described in Section 3. Section 4 presents a case
study of our model in a domain specific DL. Section 5 concludes the paper.
2 Prior Work
Library and information science researchers, such as those attending the workshop on
“Evaluation of Digital Libraries,” have investigated the evaluation of DLs [2, 18].
Saracevic [21] was one of the first to consider the problem. According to his analysis,
there are no clear agreements regarding the elements of criteria, measures, and methodologies for DL evaluation. The challenge is made more complex by the various
classes of users [4]. In an attempt to fill some gaps in this area, Fuhr et al. [10] proposed a description scheme for DLs based on four dimensions. However, a focus on
the usability of DLs has lagged, especially compared with the non-user-oriented technical
topics in the DL literature. There are a few reported studies: inspection of NCSTRL
was described in [13]; evaluation of the ACM, IEEE-CS, NCSTRL, and NDLTD
digital libraries was reported in [15]; evaluations of ADL and ADEPT were documented in [14] and [6], respectively.
Theories regarding DLs, IS success and adoption, and information-seeking behavior have evolved in parallel. They provide foundations that can be integrated to help
answer the question: what is a successful DL? The prior research suggests the need
for a more comprehensive view of DL success. There also have been calls for research to empirically validate and extend IS success and adoption models into varying contexts [25]. Motivated by these calls for research and the increasing number of
DL users with varying skills and from different backgrounds and cultures, we seek to
answer the question: what is the appropriate model of DL success from the perspective of end users (DL patrons)?
DLs are complex information systems; therefore, research on generic IS may be
applied to DLs. The most prominent IS success models existing in the literature
today are by Venkatesh [25], DeLone [7], and Seddon [22]. They are discussed in
subsections 2 and 3 below. But first we should consider how system usage relates
to success.
1. System Usage as a Success Measure
System usage has been considered to be an important indicator of IS success in a
number of empirical studies, for many systems. However, simply measuring the
amount of time a system is used does not fully capture the relationship between usage
and the realization of expected results. The nature, extent, quality, and appropriateness of the system use also should be considered. The nature of system use should be
addressed by determining whether the full functionality of a system is being used for
the intended purpose. Accordingly, we believe that log analysis could be beneficial to
the measurement of DL usage.
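As a hedged illustration of such log analysis, the fragment below counts which DL functions appear in request paths, rather than only total hits or session time. The log format and feature names are assumptions, not taken from any actual DL:

```python
from collections import Counter

def feature_usage(log_lines):
    """Count which DL functions appear in request paths, rather than
    measuring only total time or raw hit counts."""
    counts = Counter()
    for line in log_lines:
        path = line.split()[1].split("?")[0]          # assumed "<user> <path>" format
        feature = path.strip("/").split("/")[0] or "home"
        counts[feature] += 1
    return counts

log = [                      # hypothetical access-log lines
    "u1 /search?q=etd",
    "u1 /download/123",
    "u2 /search?q=bones",
    "u2 /browse/collections",
]
print(feature_usage(log))
```

Such a breakdown indicates whether the full functionality of the system is actually exercised, which is closer to the "nature of use" criterion than raw usage time.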
2. Technology Acceptance Model (TAM): Predict Intention to Use
TAM provides predictions of intention to use by linking behaviors to attitudes that
are consistent with system usage, in time, target, and context. Venkatesh's model [25]
predicts behavioral intention to use a system and is a unified model of the eight most
popular behavioral IT acceptance theories in the literature. It consists of four core
determinants of intention and usage, as shown in Fig. 1. They are: performance expectancy, effort expectancy, social influence, and facilitating conditions.
Despite its predictive ability, TAM provides only limited guidance about how to influence usage through system design and implementation. Venkatesh et al. stressed the
need to extend the TAM literature by explicitly considering system and information
characteristics and the way in which they might indirectly influence system usage.
Fig. 1. Venkatesh’s model [25]
3. Satisfaction: Attitude toward the System
In contrast to TAM, system and information characteristics have been core elements in the literature on user satisfaction. The DeLone study [7] is one of the first
attempts at a comprehensive review of the literature on IS success. It organized a
broad base of diverse research (180 articles) and presented a more integrated view of
IS success. DeLone’s model consists of six interdependent constructs for IS success:
system quality (SQ), information quality (IQ), use, user satisfaction, individual impact, and organization impact (see Fig. 2). It identified IQ and SQ as antecedents of
user satisfaction and use.
Fig. 2. DeLone’s IS success model [7]
Seddon suggested that DeLone et al. tried to do too much with their model; as a result, the model is confusing and lacks specificity [22]. Seddon’s major contribution is
a re-specified model of IS success. Seddon defined success as a measure of the degree
to which the person evaluating the system believes that the stakeholder is better off.
The model shows that both perceived usefulness and user satisfaction depend on IQ,
SQ, and benefits (see Fig. 3). Both DeLone and Seddon made an explicit distinction
between information aspects and system features as determinants of user satisfaction.
Fig. 3. Seddon’s IS success model [22]
4. Information-seeking Behavior: Identify Temporal Users’ Information Needs
Satisfaction is a consequence of the user's experience during various information-seeking stages. The changing needs of users interacting with the DL should be identified. Therefore, an understanding of users' information-seeking behavior is required.
The information-seeking behavior of academic scholars has been studied for decades, and many models have been generated. Among them are Ellis’s model [8] and
Kuhlthau’s model [16]. These two models are based on empirical research and have
been tested in subsequent studies. Ellis’s model includes six generic features coded
from E1 through E6, as shown in Fig. 4. As of 2002, there were more than 150 papers
citing Ellis's information-seeking behavior model of social scientists [20]. Most of
the information-seeking behavior features in Ellis’s model are now being supported
by capabilities available in Web browsers. Kuhlthau’s model complements that of
Ellis by attaching to stages of the information-seeking process the associated feelings,
thoughts and actions, and the appropriate information tasks. The stages of Kuhlthau’s
model are coded from K1 through K6 as shown in Fig. 4. Kuhlthau’s model is more
general than that of Ellis in drawing attention to the feelings associated with the various stages and activities. It also has been applied to support learning from DLs [19].
3 DL Success Model
We further connect Gonçalves’ DL quality model and the information life cycle
model [5] with Ellis’ and Kuhlthau’s information-seeking behavior models as shown
in Fig. 4. The outer arrows in Fig. 4 indicate the life cycle stage (active, semi-active,
and inactive) for a given type of information. The innermost portion of the cycle has
four major phases of information use or process: information creation, distribution,
seeking, and utilization. Each major phase is connected to a number of activities.
Gonçalves stated that his work took a very system-oriented view of the quality
problem and partially neglected its usage dimension. Our goal is to define the success
of a DL from an end-user perspective; hence we focus on the ‘seeking’ and ‘utilization’
phases. Behaviors occurring in the ‘seeking’ and ‘utilization’ phases are elaborated in Fig. 4 by Ellis's and Kuhlthau's models. Each dimension of quality is associated with a corresponding set of activities. Quality dimensions associated with the
seeking and utilization phases are related to constructs of the DL success model.
Our proposed DL success model consists of four interrelated and interdependent
constructs based on the previously discussed theoretical methods. The general proposition of our model is that DL satisfaction and the intention to (re)use a DL are
dependent on four constructs: information quality, system quality, performance expectancy, and social influence (see Fig. 5). Arrows in Fig. 5 indicate that a construct
R. Shen et al.
is affected by each construct that points to it. IQ and SQ can be found in the IS success literature, while performance expectancy and social influence can be found in the
IT adoption literature. Since our model incorporates TAM, it is a predictive model,
i.e., it can be used to predict intention to (re)use. We think determinants of success are
goal and user specific. Hence, a measurement instrument of “overall success” based
on arbitrary selection of items from the four constructs is likely to be problematic.
Individual measures from the four constructs should therefore be combined systematically to create a comprehensive measurement instrument.
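A systematic combination of individual measures could look like the following sketch. The four constructs match Fig. 5, but the items, ratings, and equal weighting are purely illustrative and not proposed by the paper:

```python
def dl_success_score(ratings, weights):
    """Combine per-construct item ratings (e.g. mean Likert scores) into a
    single weighted score. Constructs and weights are illustrative only."""
    total_w = sum(weights.values())
    return sum(weights[c] * sum(items) / len(items)
               for c, items in ratings.items()) / total_w

ratings = {           # hypothetical mean item ratings per construct
    "information_quality": [6.1, 5.8, 6.0],
    "system_quality": [5.5, 5.9],
    "performance_expectancy": [6.2],
    "social_influence": [4.8],
}
weights = {c: 1.0 for c in ratings}   # equal weights, for the sketch only
print(round(dl_success_score(ratings, weights), 2))  # → 5.67
```

In practice the weights would have to be derived empirically per goal and user group, which is exactly why an arbitrary selection of items is argued to be problematic.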
Fig. 4. Connection of DL quality model with information life cycle and information-seeking behavior models
1. Information Quality (IQ)
Information in DLs can be classified from two different perspectives, the DL developers’ view and the DL patrons’ (end users’) view. Five main concepts related to
DL information within the 5S framework are: repository, collection, metadata catalog,
digital object, and metadata specification (see Fig. 6). A DL repository involves a set
of collections, each of which is a set of digital objects. Examples of digital objects
are electronic theses (or dissertations) and records of artifacts (such as bones, seeds,
and figurines) excavated from an archaeological site. Each digital object is assigned
associated metadata specification(s), which compose the metadata catalog.
While the dimensions of quality for each of the five concepts are defined in [11]
and listed in the left part of Fig. 7, they do not fully differentiate end users from DL
developers. We group the five concepts into three categories and develop six items
(factors) to measure the quality for each of the three categories for end users, as
shown in the right part of Fig. 7. The dashed arrows illustrate that parts of the quality
dimensions discussed in [11] are associated with the six items measuring DL IQ.
Fig. 5. DL success model (integrating Figs. 1–3)
Fig. 6. Concepts related to DL information
a) Digital object and metadata specification:
Accuracy and completeness are defined in [11] as quality dimensions for metadata
specifications; however, they are absent from the list of quality dimensions for a digital object.
This suggests two other quality measures for digital object and metadata specification:
adequacy and reliability. Adequacy indicates the degree of sufficiency and completeness.
Reliability indicates the degree of accuracy, credibility, and consistency.
Relevance is concerned with such issues as relevancy, pertinence, and the applicability of the information. Pertinence and relevance for digital objects are measured
with Boolean values (0 or 1) in [11]. They are a subjective judgment by users in a
particular context. We use relevance to measure the quality of both digital object and
metadata specification. Significance of a digital object defined in [11] reflects relevance to user needs or particular user requirements. Therefore, significance can be
partially mapped to relevance. Similarity metrics defined in [11] reflect the relatedness among digital objects. If one of the digital objects is a user’s information need,
then similarity is associated with the relevance item (factor).
Timeliness is concerned with the currency of the information. Understandability
encompasses variables such as being clear in meaning and easy to understand.
Preservability as an important digital object quality property needs to be identified
by DL developers; however, it may not be visible to DL patrons. The accessibility of a
digital object is managed by DL services, so it is used to measure DL services instead of
information. Therefore, preservability and accessibility are not included in the six items
for DL IQ that are shown in Fig. 7.
Fig. 7. DL information quality (IQ) measurement
b) Metadata catalog and collection
Adequacy is used to measure the degree of sufficiency and completeness of DL
metadata catalogs and collections.
c) Repository
Scope evaluates the extent and range of the repository. This addresses the breadth
of information and the number of different subjects. According to [11], a repository is
complete if it contains all collections it should have. Therefore, completeness defined
in [11] is associated with scope.
2. System Quality (SQ)
Dimensions of quality for DL services are classified as internal (e.g., top three entries) or external (e.g., bottom three entries) in [11], as shown in the dashed box in
Fig. 8. We focus on the external view, concerned with the use and perceived value of
these services from the end users’ point of view. They relate to DL system quality
(SQ) and performance expectancy (discussed in Section 3.3) as indicated by the three
dashed arrows in Fig. 8. We develop four items to measure DL SQ.
Prior research subscales for accessibility include system responsiveness and loading time. The accessibility of a DL refers not only to its speed of access and availability but also to that of its information (e.g., digital object and metadata accessibility).
Efficiency defined in [11] is measured in terms of speed; it is associated with service
accessibility. A DL needs to be reliable, which means that it is operationally stable.
Ease of use is concerned with how simple it is for users to (learn to) use DLs. Joy
of use is about the degree of user pleasure. These two items are affected by the user
interface through navigation and screen design as indicated by the two solid arrows
What Is a Successful Digital Library?
Fig. 8. DL system quality (SQ) measurement
shown in Fig. 8. Navigation is concerned with evaluating the links to needed information that are provided on the various pages of a DL website. Screen design is the way
information is presented on the screen. It affects both ease of use and joy of use. Having an organized and well-designed screen aids users in locating relevant information
more easily, while an attractive user interface helps increase joy of use. Although we
have a common idea that aesthetic objects should be symmetric, balanced, or well-proportioned, there is no general instruction set prescribing how to create aesthetic
interfaces [12].
3. Performance Expectancy (PE)
Performance expectancy (see Fig. 5) is defined as the degree to which users believe
that a specific DL will help them gain advantage in accomplishing their desired goal.
In [25], it consists of five constructs: perceived usefulness, extrinsic motivation, job-fit, relative advantage, and outcome expectations.
4. Social Influence (SI)
Social influence (see Fig. 5) is concerned with a user’s perception that other important people favor a particular DL. Many studies have been done in the marketing
domain on the role of social influence. Accordingly, it seems appropriate to consider
social influence on DL usage. As reported in [24], DL visibility is considered as an
important factor that may lead to greater user acceptance of DLs. Potential users may
not be aware of the benefits of using the DL, or even its existence. Increasing DL
visibility can help users perceive the DL as more useful, although it will not increase
the functionality of a DL.
5 Case Study
As part of the requirements analysis for an archaeological DL, ETANA-DL [23], email
interviews with 5 prestigious archaeologists, and face-to-face workplace interviews with
11 archaeologists (including 3 of the 5 interviewed by email) were conducted. Subsequent formative evaluation studies were carried out to improve system design. In this
section, we associate the four constructs of the model discussed in the previous section
with the activities occurring in the seeking and utilization phases (see the innermost portion of the cycle in Fig. 4) by analyzing the results of the interviews and the formative
usability studies. These results are shown in Table 1 and may help distinguish issues that
are generic across domains, from those that are domain specific.
Table 1. DL success constructs associated with seeking and utilization phases
(columns: DL success construct, seeking phase, utilization phase; the rows cover social influence, information quality (adequacy, scope), system quality (ease of use, joy of use), accessibility in both phases, and collection presentation)
1. Seeking phase
• E1/K1
The “starting” activity in Ellis’ model (the ‘initiation’ stage in Kuhlthau’s model) usually occurs at the beginning of information seeking. It may help one ‘recognize’ a need for information. Users’ information needs may be initiated by a specific active task or condition, or by requirements identified passively.
Social influence, for example regarding DL visibility, is associated with this stage.
Within the archaeological domain, awareness of DLs is poor. Methods to increase DL
visibility include:
1) Publicize the existence of a DL: One archaeologist said that “… the turning
point for the DL will be when someone has demonstrated in a print publication how
ETANA-DL helped in their research …”. Some recommended more international collaboration, e.g., suggesting that ETANA-DL consider collaborating with
JADIS (Jordanian Archaeological Data Information System) to increase its visibility.
Since JADIS is one of the main Jordanian cultural resource management systems, connecting ETANA-DL with JADIS could allow basic survey and overall information on
Jordanian archaeology to be combined with ETANA-DL’s more in-depth coverage.
2) Provide a DL alert service (e.g., press alerts): Archaeologists may want alerts
when new artifacts from others appear on their subjects of interest.
• (E2-E6)/(K2-K3)
These five activities in Ellis’s model (‘chaining’, ‘browsing’, ‘differentiating’, ‘monitoring’, and ‘extracting’) occur in the ‘selection’ and ‘exploration’ stages in
Kuhlthau’s model. In the ‘selection’ stage, a general area for investigation is identified
(located). The appropriate task at this point is to fix the general topic of exploration.
Exploration has many cognitive requirements similar to browsing and search tasks.
IQ, SQ, and PE are associated with these stages. Regarding IQ, adequacy (degree of
sufficiency and completeness) of DL collections and metadata catalogs and scope of DL
repository should be considered. Some archaeologists pointed out: “Ideally, the system
would include as many types of data as possible, from text summaries to photos, maps,
and other visuals.”
Regarding SQ and PE, the interface plays a major role in influencing usefulness,
ease of use, and joy of use. The quality of the DL interface makes a significant
contribution to a usable DL, and interface problems often are cited by non-users as a
major reason for not using electronic information retrieval systems [9]. As a virtual
intermediary between users and a DL, the interface is the door through which users
access a DL. The interface characteristics (screen design and navigation) that affect
DL usability include those commonly found in most web GUIs, as well as the ones
specific to archaeological DLs.
1) Screen design: The way that information is arranged on the screen can influence
the users’ interaction with DLs beyond the effect of the information content. Some
archaeologists suggested that “… the interface needs to be more visually stimulating
… should allow to browse visual stacks of the digital library…”. Another issue to be
considered for screen design is the wording for labeling. In the archaeological domain, an example could be the terminology for periodization schemas. There are
different periodization schemas based on political, historical, or cultural events. The
archaeologists found it difficult to use a single “standard” periodization schema.
2) Navigation: The navigation should enable archaeologists to explore a DL without having to keep an auxiliary memory aid like a yellow pad at hand.
2. Utilization phase
Information management and utilization was not identified as a category in Ellis’s
study of social scientists. On the other hand, the last three stages in Kuhlthau’s model
involve organizing information into a coherent structure.
• K4
The formulation stage is identified as conceptually the most important step in the
process [16]. Users focus on a more specific area within the topic and make sense of
(or interpret) information in the light of their own needs. A guiding idea or theme
emerges which is used to construct a story or narrative, or to test a hypothesis. This
formulation also will guide the users in selecting appropriate information.
Research has considered the process of interpreting documents (e.g., reading and
annotating them) rather than simply locating them [3]. Within the archaeological
domain, archaeologists formulate a personal perspective or sense of meaning from the
encountered information. However, they usually conduct interpretation offline. Access to primary data and data analysis services provided by DLs enables archaeologists to make interpretations online, if they change their work habits. Alternatively, exporting results to files or into special formats, such as spreadsheet formats, may help support subsequent offline management, processing, visualization, and reporting.
Some sample factors affecting formulation are as follows.
1) Information accuracy: Formulation is associated with verifying the accuracy
of the information found. Archaeologists need reputable (trusted) information or information analysis to support interpretation.
2) Information accessibility: It defines how much effort (time) is required to
find (locate) the information needed. In the archaeological domain, primary data usually is available to researchers outside a project (site) only after substantial delay.
Some archaeologists said that “… ETANA-DL would be a very efficient way to disseminate and share our research, and in turn, we could utilize the work of others as
much as possible.”
• K5
In the collection stage, information is gathered to support the chosen focus. Information accessibility is very important as discussed above.
• K6
During this final stage, presentation, ideas, focus, and collected resources are organized for publishing and sharing. Some archaeologists suggested making arrangements with the publishers of obscure journals to include their publications in ETANA-DL. They also found it useful for ETANA-DL to provide a discussion forum to share their interpretations of annotated items.
6 Conclusions
The goals and objectives of a DL differ depending on the DL type, resulting in varying ideas of satisfaction as well as success. Determining success across DLs from the perspective of users is therefore goal and context specific. The work presented in this paper lays the foundation for defining the success of DLs from the view of DL end users. Our work assumes a multi-theoretical perspective and synthesizes many related research areas in terms of theory and empirical work. Our case study illustrates and further explicates the approach, which we have shown to be helpful for a DL supporting Near Eastern archaeology. We will empirically validate the proposed model further when we apply it to various domain-specific DLs in the future.
Acknowledgements. This work was funded in part by the National Science Foundation (ITR-0325579). We thank Marshall Breeding, Linda Cantara, Douglas Clark,
Joanne Eustis, James W. Flanagan, and all the interviewees for their support. We also
thank all our colleagues in the Digital Library Research Laboratory at Virginia Tech.
References
1. Adams, A. and Blandford, A. Digital libraries' support for the user's 'information journey'.
In Proc. JCDL 2005, June 7-11, 2005, Denver, 160-169.
2. Agosti, M., Nunzio, G.M.D. and Ferro, N. Evaluation of a digital library system. In Proc.
of the DELOS WP7 Workshop on the Evaluation of Digital Libraries, Padova, Italy, October 4-5, 2004, 73–76.
3. Bishop, A.P. Making Digital Libraries Go: Comparing Use Across Genres. ACM DL
1999: 94-103.
4. Blandford, A. and Buchanan, G. Usability of digital libraries: a source of creative tensions
with technical developments. In IEEE-CS Technical Committee on Digital Libraries' online newsletter, Vol. 1, No. 1
5. Borgman, C.L. Social aspects of digital libraries. In DL’96: Proceedings of the 1st ACM
International Conference on Digital Libraries, D-Lib Working Session,
http://is.gseis.ucla.edu/research/dl/UCLA_DL_Report.html, 1996.
6. Champeny, L., Borgman, C.L., Leazer, G.H., Gilliland-Swetland, A.J., Millwood, K.A.,
D'Avolio, L., Finley, J.R., Smart, L.J., Mautone, P.D., Mayer, R.E. and Johnson, R.A. Developing a digital learning environment: an evaluation of design and implementation processes. In Proc. JCDL 2004: 37-46.
7. DeLone, W.H. and McLean, E.R. Information systems success: The quest for the dependent variable. Information Systems Research, 3 (1). 60-95, 1992.
8. Ellis, D. and Haugan, M. Modeling the information seeking patterns of engineers and research scientists in an industrial environment. Journal of Documentation, 53(4): 384-403, 1997.
9. Fox, E.A., Hix, D., Nowell, L.T., Brueni, D.J., Wake, W.C., Heath, L.S. and Rao, D. Users, User Interfaces, and Objects: Envision, a Digital Library. JASIS 44(8): 480-491
10. Fuhr, N., Hansen, P., Mabe, M., Micsik, A. and Sølvberg, I. Digital libraries: A generic
classification and evaluation scheme. Springer Lecture Notes in Computer Science,
2163:187–199, 2001.
11. Gonçalves, M.A. Stream, Structure, Space, Scenarios, and Societies (5S): A Formal Digital Library Framework and Its Applications, Ph.D. Dissertation, Virginia Tech,
http://scholar.lib.vt.edu/theses/available/etd-12052004-135923, 2004.
12. Grün, C., Gerken, J., Jetter, H.-C., König, W. and Reiterer, H. MedioVis - A User-Centred
Library Metadata Browser. In Proc. ECDL 2005: 174-185.
13. Hartson, H.R., Shivakumar, P. and Pérez-Quiñones, M.A. Usability inspection of digital
libraries: a case study. Int. J. on Digital Libraries 4(2): 108-123 (2004).
14. Hill, L.L., Carver, L., Larsgaard, M., Dolin, R., Smith, T.R., Frew, J. and Rae, M.-A. Alexandria digital library: user evaluation studies and system design. JASIS 51(3): 246-259
15. Kengeri, R., Seals, C.D., Harley, H.D., Reddy, H.P. and Fox, E.A. Usability Study of Digital Libraries: ACM, IEEE-CS, NCSTRL, NDLTD. Int. J. on Digital Libraries 2(2-3): 157-169 (1999).
16. Kuhlthau, C.C. Learning in digital libraries: an information search process approach. Library Trends 45(4): 708-724, 1997.
17. Larsen, R.L. and Wactlar, H.D. Knowledge Lost in Information: Report of the NSF Workshop on Research Directions for Digital Libraries, June 15-17, 2003, Chatham, MA, National Science Foundation Award No. IIS-0331314. http://www.sis.pitt.edu/~dlwkshop/.
18. Marchionini, G. Evaluating Digital Libraries: A Longitudinal and Multifaceted View preprint from Library Trends, 49(2): 304-333, 2000.
19. Marshall, B., Zhang, Y., Chen, H., Lally, A.M., Shen, R., Fox, E.A. and Cassel, L.N. Convergence of Knowledge Management and E-Learning: The GetSmart Experience. In Proceedings JCDL2003, Houston, 135-146, 2003.
20. Meho, L.I. and Tibbo, H.R. Modeling the information-seeking behavior of social scientists: Ellis's study revisited. JASIST 54(6): 570-587, 2003.
21. Saracevic, T. Digital library evaluation: Toward evolution of concepts. Library Trends,
49(2): 350–369, 2000.
22. Seddon, P.B. A respecification and extension of the DeLone and McLean model of IS success. Information Systems Research, 8 (3): 240-253, 1997.
23. Shen, R., Gonçalves, M.A., Fan, W. and Fox, E.A. Requirements Gathering and Modeling
of Domain-Specific Digital Libraries with the 5S Framework: An Archaeological Case
Study with ETANA. In Proceedings ECDL2005, Vienna, Sept. 18-23.
24. Thong, J.Y.L., Hong, W. and Tam, K.Y. What leads to acceptance of digital libraries?
Commun. ACM 47(11): 78-83 (2004).
25. Venkatesh, V., Morris, M., Davis, G. and Davis, F. User acceptance of information technology: Toward a unified view. MIS Quarterly, 27 (3): 425-478, 2003.
Evaluation of Metadata Standards in the
Context of Digital Audio-Visual Libraries
Robbie De Sutter, Stijn Notebaert, and Rik Van de Walle
Ghent University - IBBT, ELIS, Multimedia Lab,
Sint-Pietersnieuwstraat 41, B-9000 Ghent, Belgium
[email protected]
Abstract. Digital file-based libraries for the audio-visual material of
television broadcasters and production houses are becoming desirable.
These libraries not only address the problem of loss of content due to
tape deterioration, but also improve disclosure of the content. However,
switching to a digital file-based library involves many new concerns and
problems for content providers. This paper will discuss one of them,
namely the metadata. Metadata is additional information that is required
in order to be able to search, retrieve, and play out the stored content.
Different standards for metadata are currently available, each having its
own field of application and characteristics. In this paper, we introduce
an objective framework that one can use in order to select the appropriate
metadata standard for its particular type of application. This framework
is applied to four well-known metadata standards, namely Dublin Core,
MPEG-7, P/Meta, and SMEF.
1 Introduction
Digital file-based libraries with audio-visual material originating from television
broadcasters and television production houses are still uncommon. Most of them
store and archive their audio-visual material on tape. Unfortunately, these tapes
deteriorate over time, resulting in content loss. Broadcasters recognize this problem and plan to switch to a tapeless archive. This switch is feasible as the price
to store audio-visual material as digital files on hard disks is acceptable. Furthermore, a tapeless file-based archive improves the disclosure of the material.
The switch to a tapeless archive encounters similar problems as those the
librarians encountered during the digitalization of their libraries, such as the
need for particular metadata — i.e., additional information about the material.
In [1] it is observed that “the metadata necessary for successful management and
use of digital objects is both more extensive than and different from the metadata
used for managing collections of physical material.” This statement holds true for
audio-visual material. For example, the appropriate technical metadata must be
available to be able to play out the material. Furthermore, it is also the metadata
that will ensure that the material in the archive can be searched and retrieved.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 220–231, 2006.
© Springer-Verlag Berlin Heidelberg 2006
The Society of Motion Pictures and Television Engineers (SMPTE) uses the
following definition:
content = essence + metadata
Here, the essence is the audio-visual material. The definition states that without
metadata, there is no content: content cannot be found or used without metadata; hence the essence is unusable. A second definition extends the previous one:
asset = content + right to use it
This definition states that content is only valuable (i.e., an asset) if the content
owner has the right to utilize it.
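The two SMPTE-style definitions above can be sketched as simple data structures. This is only an illustration: the class and field names (Content, Asset, is_usable, and so on) are our own assumptions, not part of any SMPTE specification.

```python
from dataclasses import dataclass, field

@dataclass
class Content:
    """content = essence + metadata"""
    essence: bytes                                 # the raw audio-visual material
    metadata: dict = field(default_factory=dict)   # descriptive, technical, ... fields

    def is_usable(self) -> bool:
        # Without metadata the essence cannot be found or played out.
        return bool(self.metadata)

@dataclass
class Asset:
    """asset = content + right to use it"""
    content: Content
    rights: list = field(default_factory=list)     # e.g. licences held by the owner

    def is_valuable(self) -> bool:
        # Content is only an asset if the owner may actually use it.
        return self.content.is_usable() and bool(self.rights)

clip = Content(essence=b"raw frames", metadata={"title": "Evening News"})
assert clip.is_usable()
assert Asset(content=clip, rights=["broadcast licence"]).is_valuable()
assert not Asset(content=clip).is_valuable()   # no rights, so no asset
```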
It is clear that, in order to have a fully functional and usable digital audiovisual library, choosing the best-suited metadata standard is key to success. Indeed, more and more metadata standards whose main purpose is to annotate and manage audio-visual material are becoming available. However, as these standards are intended for different fields of application, selecting the “best” standard depends on the intended use.
In this paper, we define different criteria that can be used to evaluate and
compare metadata standards. As such, these criteria allow one to make a well-considered choice. Furthermore, we apply the criteria to four well-known metadata standards, namely Dublin Core [2], the Multimedia Content Description
Interface (also known as MPEG-7) [3, 4], P/Meta [5], and the Standard Media
Exchange Framework (SMEF) [6].
The remainder of the paper is organized as follows. In section 2, we give an
overview of the related work. Next, in section 3 we define the different evaluation
and comparison criteria for metadata standards. Subsequently in section 4, these
criteria are applied to the four metadata standards. Finally, section 5 concludes
this paper.
2 Related Work
The expertise built up by librarians when creating digital libraries is of
great value for any other digitization effort, including the digitization of the
audio-visual archives of television broadcasters and similar organizations. The purpose
of a digital library – as seen by the librarians – is described in [7] as “electronic
libraries in which large numbers of geographically distributed users can access the
contents of large and diverse repositories of electronic objects – networked text,
images, maps, sounds, videos, catalogues of merchandize, scientific, business and
government data sets – they also include hypertext, hypermedia and multimedia.”
This statement emphasizes that the library community mainly focuses on the
disclosure and the exchange of digital objects. This resulted in the creation of
the Metadata Encoding & Transmission Standard (METS) by the Library of
(More information on the Society of Motion Pictures and Television Engineers can be found at http://www.smpte.org.)
Congress [8]. METS provides a format for encoding the metadata used for the management and the exchange of digital objects stored within the library, extending the techniques developed by the Making of America II (MOA-II) project [9]. However, these standards do not normatively fix the structural, administrative, and technical metadata itself. Furthermore, they refer only to techniques available in the pre-digital library community for descriptive metadata, such as Machine-Readable Cataloging (MARC) records [10] or the Encoded Archival Description (EAD) [11]. These pre-digital documentation techniques are inadequate to fully document digital works. Better-suited standards for digital libraries are increasingly being investigated and used, such as the Dublin Core Metadata Element Set, the General International Standard Archival Description [12] and the Multimedia Content Description Interface (also known
as MPEG-7) [13]. For an overview of these standards, we refer the reader to [14].
Most efforts and investigations in the digital library community concern ways to exchange digital objects between repositories and to ensure their interoperability. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a major effort to address technical interoperability among distributed archives [15, 16, 17]. This initiative establishes the Simple Dublin Core Metadata Element Set, as defined by the Dublin Core Metadata Initiative, as the baseline for metadata interoperability.
Unfortunately, the approach used for creating and maintaining digital libraries of books and images – based on METS and the results of the MOA-II project – “lacks adequate provisions for encoding of descriptive metadata, only supports technical metadata for a narrow range of text- and image-based resources, provides no support for audio, video, and other time dependent media, and provides only very minimal internal and external linking facilities” [1]. This implies that using these technologies alone to create audio-visual digital libraries is insufficient. These concerns are addressed by new metadata standards whose main purpose is to annotate and manage audio-visual material, such as P/Meta and SMEF. The remainder of the paper focuses on a framework for comparing metadata standards intended for the annotation of audio-visual libraries.
3 Selection Criteria
This section describes criteria that one can use to select the metadata standard best suited for the application in mind. These criteria are composed in such a way that all aspects, ranging from content organization to the different types of metadata, are taken into account, independent of any restriction imposed by a particular media asset management system.
Criterion 1: Internal vs. Exchange Metadata Model
For this first criterion, it is important to identify the parties involved in exchanging audio-visual material during its typical life cycle. The European Broadcasting Union (EBU) identifies in [18] the consumers and three trading entities: the content creator, the content distributor, and the archive. EBU
has investigated the different relationships between these four players and has
presented the entities and the relationships in the EBU P/Meta Business-to-Business Dataflow Model (see Fig. 1). This model is independent of any metadata model and is applicable to most broadcasters.
Fig. 1. EBU P/Meta Business-to-Business Dataflow Model [18]
On the one hand, particular metadata models are specifically developed for
managing the metadata in the interior of a system. These metadata models are
further referred to as internal metadata models. Usually, these metadata models
are represented as Entity Relationship Diagrams (ERDs) which describe the architecture of the database that stores the metadata of the audio-visual material.
On the other hand, other metadata models are used to describe the way the
information is to be transmitted from source to destination. Here, the metadata
models are called exchange metadata models. These models are used to exchange
information about the audio-visual material and are specifically intended for the
transmission of metadata between different systems. Here, exchange must be
seen as broad as possible, namely between any combination of content creator,
content distributor, archive, and consumers.
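The distinction can be sketched in a few lines: an internal model lives as a database schema inside one system, while an exchange model serializes the same information into a message that travels between systems. The table layout and the use of JSON here are illustrative assumptions only; real exchange models such as P/Meta define their own (XML-based) formats.

```python
import json
import sqlite3

# Internal metadata model: an entity-relationship style table inside one
# system's own database (table and column names are illustrative).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT, codec TEXT)")
db.execute("INSERT INTO item (title, codec) VALUES (?, ?)", ("Evening News", "MPEG-2"))

# Exchange metadata model: the same information flattened into a message that
# can travel between content creator, distributor, archive, and consumer.
row = db.execute("SELECT id, title, codec FROM item").fetchone()
message = json.dumps({"id": row[0], "title": row[1], "codec": row[2]})

assert json.loads(message)["title"] == "Evening News"
```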
Criterion 2: Flat vs. Hierarchical Metadata Model
The structural organization of the description of the essence is a second criterion.
In general, the broadcaster decides how detailed the metadata needs to be. Two
extreme visions can be identified. On the one hand the essence is considered as
an elementary and indivisible unit, resulting in a coarse description, and on the other hand the essence is divided into small sub-pieces, each annotated separately, resulting in a finely detailed description.
If the essence is considered as an elementary and indivisible unit, the broadcaster can associate this elementary unit with, for example, a program. The
metadata describes the essence (here this is that program) as a whole and does
not describe the individual parts therein. This model is mostly referred to as a
flat metadata model.
Sub-parts of the essence can be annotated with much more detail. The additional metadata belongs to the individual parts and permits the users of the
archive to perform more detailed searches on the content. For example, a program can be split up into several editorial objects, corresponding to, for example,
the individual scenes. Every editorial object can be annotated with additional
descriptive metadata, so it is possible to search on the editorial object itself. In
turn, editorial objects can be broken down in different media objects. These media objects could be, for example, the audio components, the video components,
the subtitles, and so on. This also works the other way around: a group of programs that belong together can be collected in a program group. This program group is annotated with information common to all programs within it, for example the name of the program. Hence, it is not necessary to repeat the same information for every program; instead, each program inherits information from its program group. The underlying idea is that information has to be added to the
objects at the right location. This concept is referred to as a hierarchical metadata model. A four-layered architecture as discussed above is visualized in Fig. 2.
Fig. 2. Content organization
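The four-layer organization of Fig. 2 can be sketched as a tree in which each node inherits the metadata of its parent; the class and field names below are illustrative assumptions, not taken from any of the standards discussed.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One layer in the content hierarchy (name and fields are illustrative)."""
    name: str
    own_metadata: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

    def metadata(self) -> dict:
        # A child inherits its parent's metadata and may extend or override it,
        # so shared information (e.g. the program name) is stored only once.
        inherited = self.parent.metadata() if self.parent else {}
        return {**inherited, **self.own_metadata}

# program group > program > editorial object > media object
group = Node("program group", {"program_name": "Evening News"})
prog = Node("program", {"broadcast_date": "2006-09-17"}, parent=group)
scene = Node("editorial object", {"description": "weather report"}, parent=prog)
video = Node("media object", {"codec": "MPEG-2"}, parent=scene)

# The media object sees the metadata of every layer above it.
assert video.metadata()["program_name"] == "Evening News"
```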
The broadcaster will not always want to use a hierarchical metadata model, although it has great benefits for faster and more efficient search and retrieval operations. Indeed, the most important reason for a broadcaster to restrict the metadata (and thus the decomposition of the audio-visual asset) is to limit the increase in cost, which is proportional to the amount of metadata that needs to be collected. It is clear that, as the metadata about an audio-visual object grows, the marginal profit of the additional metadata decreases, but the cost to generate this additional metadata increases disproportionately. At a certain point, it becomes impossible to add metadata without incurring unjustified costs. In other words, the broadcaster has to make a trade-off between the cost and the comprehensiveness of the metadata.
Criterion 3: Supported Types of Metadata
Metadata describes the essence. The requirements of the users determine the
needed types of metadata. There are two rules that must be observed, as explained in the introduction of this paper: 1) essence is unusable without metadata, and 2) content is valueless without rights information. Hereafter, different types of
metadata for the preservation of audio-visual material are discussed.
Identification metadata. The identification metadata primarily contains the information needed to uniquely identify the essence. This can be done by human-interpretable fields, like a title or an index, or by machine-understandable identifiers, like a Unique Material Identifier (UMID) or a Uniform Resource Identifier (URI). Besides the identification metadata related to the essence, other identifying information is necessary to locate related documents that are potentially
stored in another system.
Description & classification metadata. The descriptive metadata must describe what the essence expresses. This could be done by providing a list of
keywords that try to place the essence in a particular context. In some cases, the keywords are selected from an organized dictionary of terms, i.e., a thesaurus. Besides the thesaurus, which serves purely descriptive purposes, other classification schemes can be used. Indeed, the content can be categorized
in different predefined classes in accordance with the genre, the audience, and
so on. A very well-known classification system is the Escort 2.4 system [19] from
the EBU that groups the essence in conceptual, administrative, productional,
scheduling, transmission, viewing and financial ways.
Another type of descriptive metadata comprises the description of the essence
as a short text. This type of descriptive metadata is well known and is therefore used extensively in practically every archiving system. Unfortunately,
these fields are error-prone (e.g., spelling mistakes) and should be used carefully.
Technical metadata. The technical metadata describes the technological characteristics of the related essence. The minimal required technical metadata must
specify the audio and video codecs that can be used for the decoding of the
audio-visual material. With this minimal information, the user has the possibility to play out the essence. Hence, the technical metadata enables the essence to
become usable which is one of the key requirements in order to create content.
Security & rights metadata. The security metadata handles all aspects from secure transmission (i.e., the encryption method) to access rights. The latter turns the content into an asset. The access rights metadata can be split up into information about the rights holder and information about contracts. The rights holder is the organization that owns the rights to the audio-visual material. The contracts related to the publication of the content, as well as the contracts of the people involved in the creation of the essence, are also considered rights metadata.
Publication metadata. The last type of metadata describes the publication(s) of the essence. Every publication records the date of publication, the time and duration of the publication, the channel of publication, and so on. That way, broadcasters have an idea of the frequency of publication and the popularity of the essence. Furthermore, this information is important for clearing broadcasting rights and handling payments.
R. De Sutter, S. Notebaert, and R. Van de Walle
Criterion 4: Syntax and Semantics
Some standards define only syntax, others only semantics, and some define both.
The syntax defines how the metadata must be represented. One of
the most important questions about the syntax is the choice between a textual
and a binary representation. The textual representation has the advantage that
the metadata is human readable, but at the same time it is very verbose. The
binary representation is dense, but it has the disadvantage that it can only be
handled by machines.
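This trade-off can be made concrete with a small sketch. The metadata fields and the binary packing layout below are illustrative assumptions, not taken from any particular standard:

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical technical metadata for one essence item.
duration_s = 1800          # programme duration in seconds
width, height = 720, 576   # picture dimensions
fps = 25                   # frames per second

# Textual (XML) representation: human readable, but verbose.
root = ET.Element("technicalMetadata")
ET.SubElement(root, "duration").text = str(duration_s)
ET.SubElement(root, "width").text = str(width)
ET.SubElement(root, "height").text = str(height)
ET.SubElement(root, "framesPerSecond").text = str(fps)
xml_bytes = ET.tostring(root)

# Binary representation: dense, but only machine readable.
# "<IHHB": little-endian uint32 duration, two uint16 dimensions, uint8 fps.
bin_bytes = struct.pack("<IHHB", duration_s, width, height, fps)

print(len(xml_bytes), len(bin_bytes))  # the XML form is many times larger
```

The same information costs 9 bytes in the packed form and well over a hundred in the XML form, which is the verbosity the criterion refers to.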
In the case of a plain text notation, the Extensible Markup Language (XML) is
mostly used. If so, the metadata standard provides, besides the standard itself,
an XML Schema that punctiliously determines the syntax of the metadata. Using
the XML Schema makes it possible to check the correctness (i.e., validity) of the
metadata. This characteristic enables interoperability.
The semantics of the metadata standard determine the meaning of the metadata elements. Without any semantic description, one is free to assume the
denotation of the different metadata elements, presumably resulting in different
interpretations among users. Only if the description of the metadata elements is closed (i.e., every metadata element is semantically described) must all users
agree on the meaning of the metadata elements, which improves interoperability.
Evaluation of Metadata Standards
In this section, we apply the evaluation criteria of Section 3 to four well-known
metadata standards, namely Dublin Core, MPEG-7, P/Meta, and SMEF. Table
1 gives an overview of the evaluation criteria for the four metadata standards.
Dublin Core
The Dublin Core Metadata Initiative (DCMI)2 is an open consortium engaged in
the development of interoperable and online metadata standards that support
a broad range of purposes and business models. The DCMI defined in 1999
the Simple Dublin Core Metadata Element Set consisting of 15 elements. In a
second phase, the model was extended with three additional elements and a
series of refinements, resulting in the Qualified Dublin Core Metadata Element
Set (DCMES) specification [2].
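As an illustration, a Simple Dublin Core description is a flat set of element-value pairs. The sketch below lists the 15 elements of the Simple set; the record content itself is invented:

```python
# The 15 elements of the Simple Dublin Core Metadata Element Set.
SIMPLE_DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# A flat Dublin Core description of a fictitious news item.
record = {
    "title": "Evening news, main edition",
    "creator": "Example Broadcasting Corporation",
    "date": "2006-09-17",
    "type": "MovingImage",
    "format": "video/mpeg",
    "language": "en",
}

# Only element names from the Simple set may appear in such a record.
assert all(element in SIMPLE_DC_ELEMENTS for element in record)
print(f"{len(record)} of {len(SIMPLE_DC_ELEMENTS)} elements filled in")
```

Note that the record has no nesting at all, which is precisely the flatness criticised under criterion 2.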
The goal of the DCMES specification is to exchange resource descriptions
aiming at cross-domain applications (criterion 1). While both the Simple and
the Qualified specifications are very straightforward, they suffer from two very
important shortcomings. On the one hand, there are no provisions for describing
hierarchically structured audio-visual content – however, this can be circumvented by making implicit references to other parts – hence DCMES is a flat
metadata model (criterion 2). On the other hand, the number of available metadata elements is too limited for thoroughly annotating audio-visual resources in
digital libraries (more information on DCMI can be found at http://dublincore.org). In particular, due to the generic character of DCMES, the metadata for describing the technical, rights, security, and publication information is
very confined (criterion 3).
With regard to criterion 4, the semantics are concisely described, yet the
user retains considerable freedom of interpretation. The DCMI provides different ways of syntactically describing the metadata: there are guidelines for
incorporating Dublin Core in XML and guidelines for using Dublin Core in combination with the Resource Description Framework.
MPEG-7: Multimedia Content Description Interface
The International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC) have created the International Standard
15938, formally named Multimedia Content Description Interface, but better
known as the MPEG-7 standard, which provides a rich set of tools for thoroughly describing multimedia content [3, 4].
The MPEG-7 standard is developed for, among other things, the exchange
of metadata describing audio-visual content. It has been designed to support a
broad range of applications, without targeting a specific application. As such, it
is an exchange model (criterion 1).
The MPEG-7 standard normatively defines the syntax, by using an XML
Schema, and the semantics, via normative text, of all metadata elements (criterion 4). The elements are structured as descriptors and description schemes: a
descriptor is defined for the representation of a particular feature of the audio-visual content; a description scheme is an ordered structure of both descriptors
and other description schemes. This system is used to create a hierarchical model
(criterion 2). For example, an audio-visual material can be described by its temporal decomposition and by its media source decomposition. The latter is divided
into descriptions about the audio segment and the video segment, which is in
turn decomposed into shots, key frames, and objects.
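The decomposition described above can be sketched as a nested structure. The class names below only mimic MPEG-7 concepts; they are not the normative descriptors and description schemes of ISO/IEC 15938:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative model of the hierarchy described above: a video segment is
# temporally decomposed into shots, each carrying its own key frames.

@dataclass
class KeyFrame:
    timecode: str

@dataclass
class Shot:
    start: str
    end: str
    key_frames: List[KeyFrame] = field(default_factory=list)

@dataclass
class VideoSegment:
    title: str
    shots: List[Shot] = field(default_factory=list)

video = VideoSegment(
    title="Interview",
    shots=[
        Shot("00:00:00", "00:00:12", [KeyFrame("00:00:03")]),
        Shot("00:00:12", "00:00:40", [KeyFrame("00:00:15"), KeyFrame("00:00:30")]),
    ],
)
print(sum(len(s.key_frames) for s in video.shots))  # 3 key frames in total
```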
The supported types of metadata (criterion 3) are mostly focused on the description, technical, and, to a lesser degree, identification metadata. Almost no
attention was paid to publication and rights & security metadata elements; however, ISO/IEC addresses these concerns in different parts of the MPEG-21 standard.
Part three and part four of the MPEG-7 standard deal with the technical metadata for video and audio content, respectively. In part five of the MPEG-7
standard, sometimes referred to as Multimedia Description Schemes (MDS), the
descriptors and the description schemes for the description and classification of
audio-visual material are defined. More information about MDS is given in [20],
and an overview of the different functional areas is visualized in Fig. 3.
The guidelines for the notation of Dublin Core in XML format can be found at http://dublincore.org/documents/dcmes-xml. More information on using Dublin Core in combination with the Resource Description Framework is also available from the DCMI.
Fig. 3. Overview of Multimedia Description Schemes [20]
P/Meta

The P/Meta standard is developed by the EBU as a metadata vocabulary for
program exchange in the professional broadcast industry [5]. Hence, it is not
intended as an internal representation but as an exchange format for program-related information in a business-to-business use case (criterion 1).
The P/Meta standard [21] presents a five-layered hierarchical model (criterion 2): the brand, the program group, the program, the program item, and
the media object. A brand collects all the program groups with a recognizable
collective identity, e.g. information about the broadcasting TV-station. Every
program group is composed of individual programs, which consist of individual
program items. Finally every program item may be split up in media objects.
This hierarchy is comparable with the one illustrated in Fig. 2.
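The five layers can be sketched as a nested data model; the class and field names are our own simplification, not the normative P/Meta sets and attributes:

```python
from dataclasses import dataclass, field
from typing import List

# Sketch of the five-layered P/Meta hierarchy:
# brand -> program group -> program -> program item -> media object.

@dataclass
class MediaObject:
    identifier: str

@dataclass
class ProgramItem:
    title: str
    media_objects: List[MediaObject] = field(default_factory=list)

@dataclass
class Program:
    title: str
    items: List[ProgramItem] = field(default_factory=list)

@dataclass
class ProgramGroup:
    title: str
    programs: List[Program] = field(default_factory=list)

@dataclass
class Brand:
    name: str  # e.g. the broadcasting TV station
    groups: List[ProgramGroup] = field(default_factory=list)

brand = Brand("Example TV", [
    ProgramGroup("Morning block", [
        Program("Breakfast show", [
            ProgramItem("Weather", [MediaObject("tape-0042")]),
        ]),
    ]),
])
print(brand.groups[0].programs[0].items[0].media_objects[0].identifier)
```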
To obtain this hierarchical structure, the standard defines a number of sets
and attributes. A P/Meta set groups P/Meta attributes and other P/Meta sets
in such a way that all relevant metadata is collected for describing the considered
object. As an example, consider the description of a program group with its corresponding programs. Every
program group and every program is annotated with identification (numbers and
titles), classification (according to the Escort 2.4 system [19]), and description
metadata. Besides these three elementary types, the description of the individual
programs is complemented with four additional types, namely transmission or
publication metadata, metadata concerning editorial objects and media objects,
technical metadata (audio and video specification, compression schemes, and so
on), and rights metadata (contract clauses, rights list, and copyright holders).
These are also the supported types of metadata (criterion 3).
P/Meta defines all sets and attributes, resulting in a metadata standard where
every term is determined unambiguously. The syntax is defined by an XML
Schema (criterion 4).
Standard Media Exchange Framework
SMEF has been developed by the Media Data Group of BBC Technology, now
Siemens SBS, on behalf of the British Broadcasting Corporation (BBC). Through
a close collaboration with a wide variety of BBC projects, a profound understanding of the broadcaster’s audio-visual media information requirements has been derived. Although the model is developed for use within the BBC, the definitions are
organization independent and should be usable by any other broadcaster.
SMEF provides a rich set of data definitions for the range of information
involved in the production, development, use, and management of media assets
[6]. Its purpose is to ensure that different systems store this information in a
uniform way. Therefore, the SMEF standard defines an entity-relationship diagram (ERD), which provides a
framework for storing the metadata in the system. Hence, this is an internal
metadata model (criterion 1).
The SMEF metadata model records all information that becomes available
during the whole production cycle, from a program concept over media and editorial objects to the actual publication. An editorial object (i.e., the program) can
be split up in different media objects, making this a hierarchical metadata model
(criterion 2). Each media object can be annotated extensively with descriptive
and technical metadata. The entities Editorial Object and Media Object can be
linked with two other entities, namely the Usage Restriction entity (describing
the restrictions on the use) and the Copyright Reporting entity (describing copyright details on the source material used). Hence, SMEF pays much attention to
the rights metadata (criterion 3).
With regards to criterion 4, the SMEF standard defines the semantics of all
entities, attributes, and relationships. The definition of syntactical rules covers
the way the metadata is represented in the internal system.
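As a rough illustration, the entities mentioned above can be sketched as a relational schema. This is a drastic simplification of the actual SMEF data model, and the table and column names are our own:

```python
import sqlite3

# Minimal relational sketch of the SMEF entities discussed above; the
# real SMEF model defines far more entities, attributes and relations.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE editorial_object (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE media_object (
        id INTEGER PRIMARY KEY,
        editorial_object_id INTEGER REFERENCES editorial_object(id),
        codec TEXT
    );
    CREATE TABLE usage_restriction (
        id INTEGER PRIMARY KEY,
        media_object_id INTEGER REFERENCES media_object(id),
        restriction TEXT
    );
    CREATE TABLE copyright_reporting (
        id INTEGER PRIMARY KEY,
        media_object_id INTEGER REFERENCES media_object(id),
        rights_holder TEXT
    );
""")
con.execute("INSERT INTO editorial_object VALUES (1, 'Documentary')")
con.execute("INSERT INTO media_object VALUES (1, 1, 'MPEG-2')")
con.execute("INSERT INTO copyright_reporting VALUES (1, 1, 'Example Archive')")

# Follow the links from an editorial object down to its rights holder.
row = con.execute("""
    SELECT e.title, c.rights_holder
    FROM editorial_object e
    JOIN media_object m ON m.editorial_object_id = e.id
    JOIN copyright_reporting c ON c.media_object_id = m.id
""").fetchone()
print(row)  # ('Documentary', 'Example Archive')
```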
Table 1. Overview of the Evaluation of the Metadata Standards

              Dublin Core       MPEG-7            P/Meta            SMEF
criterion 1   exchange          exchange          exchange          internal
criterion 2   flat              hierarchical      hierarchical      hierarchical
criterion 3   description       description       all types (incl.  all types (incl.
              and technical     and technical     rights metadata)  rights metadata)
criterion 4   XML & RDF (a),    XML Schema,       XML Schema,       internal syntax,
              open semantics    closed semantics  closed semantics  open semantics

(a) DCMES can be mapped to XML and RDF.
Conclusions

In this paper, we discussed the need for digital file-based libraries for the audio-visual materials of television broadcasters and production houses. It is essential that these audio-visual materials be described by additional information,
i.e. metadata, so that the materials in the libraries can be made accessible. It is the
metadata that elevates the materials from essence, through content, to assets.
Different metadata models exist to do this, whereby each model is suitable for
a specific type of application. Within the paper, we introduced different selection
criteria that one can use to compare and select an appropriate metadata model
for the intended application. The four selection criteria are 1) internal versus
exchange metadata model, 2) flat versus hierarchical metadata model, 3) the
supported types of metadata, and 4) the syntax and semantics of the model. To
conclude this paper, we briefly introduced four well-known metadata standards,
namely Dublin Core, MPEG-7, P/Meta, and SMEF, and applied the evaluation criteria to them.
Acknowledgements

The research activities that have been described in this paper were funded
by Ghent University, the Interdisciplinary Institute for Broadband Technology
(IBBT), the Institute for the Promotion of Innovation by Science and Technology
in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders),
the Belgian Federal Science Policy Office (BFSPO), and the European Union.
References

1. McDonough, J., Proffitt, M., Smith, M.: Structural, technical, and administrative
metadata standards. A discussion document. Technical report, Digital Library Federation (2000) Available at http://www.diglib.org/standards/stamdframe.htm.
2. Dublin Core Metadata Initiative: Dublin core metadata element set, version
1.1: Reference description. Technical report (2004) Available at http://www.
3. Martínez, J.M., Koenen, R., Pereira, F.: MPEG-7: The Generic Multimedia Content Description Standard, Part 1. IEEE MultiMedia 9 (2002) 78–87
4. Martínez, J.M.: MPEG-7: Overview of MPEG-7 Description Tools, Part 2. IEEE
MultiMedia 9 (2002) 83–93
5. Hopper, R.: Metadata exchange standards. Technical Report Technical Report
No. 284, European Broadcasting Union (2000) Available at http://www.ebu.ch/
en/technical/trev/trev 284-hopper.pdf.
6. BBC Technology: SMEF data model version 1.10. (Technical report) Available at
7. Sreenivasulu, V.: The Role of a Digital Librarian in the Management of Digital
Information Systems (DIS). Aslib Proceedings 18 (2000) 12–20
8. Digital Library Federation: METS: Metadata encoding and transmission standard.
Technical report (2005) Available at http://www.loc.gov/standards/mets.
9. Digital Library Federation: The making of America II. Technical report (2005)
Available at http://sunsite.berkeley.edu/MOA2.
10. Library of Congress: Understanding MARC Authority Records. Cataloging Distribution Service (2003)
11. The Society of American Archivists: Encoded Archival Description: Tag Library.
Society of American Archivists (2002)
12. International Council on Archives: ISAD(G): General International Standard
Archival Description, Second edition. (1999)
13. Manjunath, B.S., Salembier, P., Sikora, T.: Introduction to MPEG-7: Multimedia
Content Description Language. John Wiley & Sons (2002)
14. Bekaert, J., Van De Ville, D., Strauven, I., De Kooning, E., Van de Walle, R.:
Metadata-based Access to Multimedia Architectural and Historical Archive Collections: a Review. Aslib Proceedings 54 (2002) 362–371
15. Yu, S., Chen, H., Chang, H.: Building an Open Archive Union Catalog for Digital
Archives. The Electronic Library 23 (2005) 410–418
16. Van de Sompel, H., Lagoze, C.: The Santa Fe Convention of the Open Archives
Initiative. D-Lib Magazine 6 (2000)
17. Lagoze, C., Van de Sompel, H.: The making of the Open Archives Initiative Protocol for Metadata Harvesting. Library Hi Tech 21 (2003) 118–128
18. Hopper, R.: Metadata exchange scheme, v1.0. Technical Report Technical Report
No. 290, European Broadcasting Union (2002) Available at http://www.ebu.ch/
trev 290-hopper.pdf.
19. European Broadcasting Union: Escort: EBU System of Classification of RTV
Programmes. Technical report (1995) Available at http://www.ebu.ch/en/
20. Salembier, P., Smith, J.R.: MPEG-7 Multimedia Description Schemes. IEEE
Transactions on Circuits, Systems and Video Technology 11 (2001) 748–759
21. European Broadcasting Union: P/Meta Metadata Exchange Scheme v1.1. Technical Report Tech. 3295 (2005) Available at http://www.ebu.ch/en/technical/
metadata/specifications/notes on tech3295.php.
On the Problem of Identifying the Quality of
Geographic Metadata
Rafael Tolosana-Calasanz, José A. Álvarez-Robles, Javier Lacasta,
Javier Nogueras-Iso, Pedro R. Muro-Medrano, and F. Javier Zarazaga-Soria
Computer Science and Systems Engineering Department,
University of Zaragoza
María de Luna 1, 50018 Zaragoza, Spain
{rafaelt, jantonio, jlacasta, jnog, prmuro, javy}@unizar.es
Abstract. Geographic metadata quality is one of the most important
aspects of the performance of Geographic Digital Libraries. After reviewing previous attempts outside the geographic domain, this paper
presents early results from a series of experiments for the development
of a quantitative method for quality assessment. The methodology is developed through two phases. Firstly, a list of geographic quality criteria
is compiled from several experts of the area. Secondly, a statistical analysis (by developing a Principal Component Analysis) of a selection of
geographic metadata record sets is performed in order to discover the
features which correlate with good geographic metadata.
Introduction

Geographic Digital Libraries typically use geospatial metadata in order to provide surrogate representations of geographic resources and they represent the
most powerful technique currently available for describing and locating geographic objects. As research and development make progress in the geographic
area and metadata repositories grow in size (several geospatial repository
projects are currently operating, whilst others are receiving, or plan to receive,
geographic metadata in the near future), new requirements arise and system
performance must necessarily improve. In this sense, the issues surrounding the
creation of good quality metadata for Geographic Digital Libraries have surprisingly received little attention. Besides, regarding computer systems, there is a
popular acronym, GIGO (Garbage In, Garbage Out), which means that if the
input data is wrong, the output data will unavoidably be inaccurate or wrong.
This work has been partially supported by the Spanish Ministry of Education and
Science through the project TIC2003-09365-C02-01 from the National Plan for Scientific Research, Development and Technology Innovation. J. Lacasta’s work has
been partially supported by a grant (ref. B139/2003) from the Aragón’s Government. Special thanks should be given to A. Sánchez, M. A. Manso, C. Fernández, P.
Pachól, S. Fontano, A. Amaro and S. Muñoz.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 232–243, 2006.
© Springer-Verlag Berlin Heidelberg 2006
In other words, low quality information leads to bad system performance. Consequently, Geographic Digital Libraries need good quality metadata records in
order to produce good results. The influence of poor quality metadata on the
performance of Digital Libraries has been already studied from the perspective of other domains of knowledge: Barton [1] warned that “. . . these problems
manifest themselves in various ways, including poor recall, poor precision, inconsistency of search results, ambiguities and so on. . . ”. Regarding the geographic
domain, not only must the attention be focused on those problems, but also on
the new ones that may appear with the specific aspects of geospatial information:
geographic coordinates, place names and so on.
Nevertheless, in order to tackle the problem, the requirements surrounding
good quality metadata and, speaking more generally, the idea of quality have
to be analysed first. Quality is a matter of human judgement; thus, many
complex human factors have a great influence on it. Additionally, it should be
taken into consideration that these factors might vary widely among individuals
or, what complicates things further, some individuals may modify their judgements
over time. However, the “notion” of quality is so simple, immediate
and direct that it might be recognised less often by logical argument than by
direct perception and observation. Mainly for these reasons, much of the
scientific research agrees that the definition of metadata quality is not free of difficulties. Nonetheless, according to [2], a metadata record of good quality is defined
as “a record that is useful in a number of different contexts, both with respect
to the search strategies and terms that can be used to locate it”. Another definition [3], even simpler, might be “fitness for purpose”. Following this
rationale, it seems that geographic metadata may be fit to their purpose, if they
describe geographic data well and those descriptions are useful for their users.
The objective of this paper is to propose a quantitative method for quality
assessment of metadata in geographic digital libraries. The method is developed
through two phases, involving human experts in geographic information systems.
Firstly, a list of geographic quality criteria, structural and semantic, is compiled
from the experts. Then, derived from this criteria list, a group of metrics is
proposed. Secondly, a statistical analysis of a selection of geographic metadata
record sets is performed in order to discover the features which correlate significantly with good geographic metadata.
The remainder of this paper is organised as follows. Next section discusses
other work related to this paper. In section 3, some geographic quality criteria are
obtained from an opinion poll conducted among several experts and some geographic
metadata metrics are proposed. In section 4, the statistical analysis is described
and tested. Finally, the conclusions are given.
Related Work
Initial efforts in metadata development have been primarily invested in structure
rather than in content, that is, in the design and in the implementation of geographic standards. Consequently, appropriate standards such as CSDGM [4] and
ISO19115 [5] were developed and currently represent an excellent base for metadata creation and system interoperability. However, not only does metadata quality depend on these standards, but also on the creation process. Thus, generally
speaking, two main approaches can be found in the research of metadata quality.
On the one hand, some studies are more concerned with the content of the
metadata fields and the process involved in the creation of the metadata. In [1] it
is stated that once a metadata standard has been implemented within a system,
the specified fields must be filled out with real data about real resources and
this process brings its own problems. The following assumptions underlying the
metadata creation process in the learning objects and the e-Prints communities
are also challenged there:
– in the context of the culture of the Internet, mediation by controlling authorities is detrimental and undesirable
– rigorous metadata creation is too time-consuming and costly, a barrier in an area where the supposed benefits
include savings in time, effort and cost.
– only authors and/or users of resources have the necessary knowledge or expertise to create metadata that will be meaningful to their colleagues
– given a standard metadata structure, metadata content can be generated or
resolved by machine.
Guy [3] suggests a number of quality assurance procedures that people setting
up an e-Print archive can use to improve the quality of their metadata. The process is developed in the conviction that the metadata creation process is crucial
to the establishment of a successful archive. Another interesting document is the
report elaborated by the Academic ADL Co-Lab [2], which sets up the first step
towards community creation and building in the learning repositories community. The paper is a guide to the various issues challenging learning repository
projects: issues of quality, both content and metadata (creating quality content
and metadata, guidelines to ensure access to quality educational content, quality
and consistency of metadata, tools and workflow).
On the other hand, there exists another block of strategies whose research
is mainly concerned with identifying and computing metrics for quality indicators. Then, resources are classified into different quality bands in accordance
with those indicators. The study carried out by Armento [6] predicts quality
rated Web documents (around popular entertainment topics) by using some preexisting relevance ranking algorithms. Armento states that the results, though
promising, should be tested more extensively and with more quantity of data
in other knowledge domains. Other experiments carried out by Custard and
Summer [7] identify and compute metrics for sixteen quality indicators (indicators that were obtained from an extensive and previous literature review
and meta-analysis) and employ machine-learning techniques in order to classify educational resources into different quality bands based on these indicators.
Additionally, previous experiments were developed to determine whether these
indicators could be actually used for the classification. Hughes [8] describes the
motivation, design and implementation of an architecture to aid metadata quality assessment in the Open Language Archives Community (OLAC). It is worth
highlighting that these quality indicators used in order to support quality judgements are based on the adherence to best practice guidelines for the use of
the Dublin Core [9] elements and codes. Finally, another interesting work [10]
computes some metrics for quality indicators and studies the relation between
metadata quality and the quality of services.
Identifying Geographic Metadata Quality Criteria
At an early stage of our work, we considered studying the criteria by which
the quality of geographic metadata records can be analysed. We carried out an
initial experiment which consisted in asking several experts about the features,
the elements or even the requirements for geographic metadata records that can
determine their quality. As an outcome of this study, a compilation of geographic
metadata quality criteria was obtained (see Fig. 1).
Two main tendencies can be observed in the compiled list. One tendency is
more concerned with the structure of the metadata records and tries to determine
to what extent the metadata records accomplish the standard. For instance,
in the ISO 19115 standard, there exist certain recommendations regarding the
format of certain data types such as dates, integers and so on. Additionally, in
the same standard, there is a subset of elements known as the “ISO19115 Core
metadata for geographic datasets” (hereafter called ISO Core), each element of which is recommended to be
filled in. The most important elements, such as the title, the
abstract and the spatial reference system, among others, are included there.
In the same sense, several experts were expecting to find specific information
elements which were useful for their daily work and which were outside that
core. Other considerations pointed out that the greater the number of filled
elements, the higher the quality for the metadata record.
The other kind of tendency is related to semantic issues on the metadata elements. It is worth mentioning the considerations that the experts made on important free text elements such as the title and the abstract. Some experts stated
that every title should answer, at least, the questions where, when, what and
whom about the data; and that the abstract should describe, in a slightly broader
way, the information which appears on the title, though they also thought that
other issues can also be summarised there. Controlled elements such as the subject were found important as well, since they contribute to sort out subsets of
topic-related records. The use of standardised thesauri as the tool for filling
in the subject element was suggested as preferable to controlled lists. The rest
of the semantic criteria focus their attention on more general aspects such as
the coherence between the element and the information which it contains, the
avoidance of duplicated information, the avoidance of contradictory information,
the importance of precise information, the importance of homogeneity in the information among the metadata record set, and in a similar sense, the need for
entity naming uniformity throughout the metadata record set and, finally, natural language semantic issues such as ambiguity which the experts recommend
to minimise.
Fig. 1. Compilation of criteria for the assessment of geographic metadata quality
Nonetheless, some other interesting criteria taxonomies can be proposed:
– according to the information type contained in the elements, the criteria
may be sorted out into spatial (if they are related to spatial element types),
textual (if related to textual element types) or temporal (if they deal with
temporal element types).
– assuming that geographic metadata records do not usually appear in an
isolated way, but form geographic thematic catalogues whose topics are diverse, from environmental aspects, to geographic images and cartography
maps, there may exist quality criteria related to individual quality aspects,
global quality aspects and both of them. In fact, it seems obvious that the
quality of the individual records affects the perception on the repository.
For instance, let us consider a metadata record set in which a high percentage of the records does not present an important, desired characteristic (e.g.,
an accurate title field, a correct topic-keyword classification, and so on). Although some records fulfil the requirements, the overall
impression of the set is likely to be one of bad quality, a circumstance that is reinforced because faulty records appear more frequently. Consequently, quality
criteria which measure individual quality aspects, global quality aspects and
both of them have to be taken into account.
Additionally, when studying the initial classification of the criteria (structural
criteria and semantic criteria) more carefully, it can be stated that the semantic
criteria merely determine the constituents of metadata without any regard to the
quantity of each ingredient: they consider qualitative aspects of the metadata.
On the contrary, the structural criteria give evidence of aspects which involve
the measurement of quantity or amount which can be computed automatically.
In each engineering discipline, counting and measuring play an important
role, because when it is feasible to measure the things that are being studied
Table 1. Proposal of geographic metadata metrics

Metric ID               Metric description
purpose                 Data purpose filled in
–                       Percentage of the ISO Core filled in
alternateTitle          Number of words in the alternate title
numberOfFilledElements  Number of filled in elements
dataAccessConstraints   Data access constraints filled in
–                       Distribution format filled in
–                       Spatial reference system filled in
abstract                Number of words in the abstract
dataUpdateFrequency     Date and update frequency of the data filled in
title                   Number of words in the title
–                       Information about the data responsible filled in
quality                 Information about the data quality report filled in
–                       Information about the lineage of the data filled in
–                       Information about the metadata creator filled in
and to express them in numbers, something is known about them. In addition,
experiments provide an important element in proving theories; without
measuring, experiments would be useless as an aid to natural scientists and engineers. After these considerations on the significance of measurement, it should be
noted that there are important difficulties when measuring geographic metadata
quality and, what is more, the engineering good practice of observing, counting
and measuring regarding geographic metadata quality has so far been neglected.
Undoubtedly, those quantitative criteria compiled (the structural criteria from
Fig. 1) represent a good starting point in order to obtain metrics for geographic
metadata quality. In Table 1, a list of 14 metrics for assessing quantitative aspects of geographic metadata quality is proposed. Some of the proposed metrics
merely determine whether certain elements appear on the records (i.e. purpose,
dataAccessConstraints or quality), others count the number of words per element
(i.e. title, alternateTitle or abstract) and others try to determine the percentage
of elements in the ISO Core that are filled in.
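The two kinds of metric can be sketched over a metadata record held as a dictionary; the record content and field names are illustrative, not the actual ISO 19115 element names:

```python
# Hypothetical metadata record with free-text and presence-style elements.
record = {
    "title": "Topographic map of the Ebro basin",
    "abstract": "Topographic map covering the Ebro river basin, "
                "produced in 2004 at scale 1:50000.",
    "purpose": "",                    # empty: not filled in
    "dataAccessConstraints": "none",
}

def words(record, element):
    """Number of words in a free-text element (0 if missing or empty)."""
    return len(record.get(element, "").split())

def filled_in(record, element):
    """1 if the element is present and non-empty, 0 otherwise."""
    return int(bool(record.get(element, "").strip()))

metrics = {
    "title": words(record, "title"),
    "abstract": words(record, "abstract"),
    "purpose": filled_in(record, "purpose"),
    "dataAccessConstraints": filled_in(record, "dataAccessConstraints"),
}
print(metrics)
```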
Analysing Geographic Metadata Quality Criteria
With the aim of understanding the notion of geographic metadata quality, we
decided to carry out another experiment intended to discover the quantitative features which correlate significantly with good geographic metadata.
Basically, the experiment consists of the following steps:
– select a sample of geographic metadata record sets
– ask the experts to assess the quality of the record sets with a numerical score
– compute the proposed metrics for the selected record sets
– analyse the correlation between the metrics and the assessments coming from
the experts.
Table 2. The average value of the assessment per metadata record set
Because of the aforementioned reasons, the experiment was focused on the
quality of the set rather than on individuals. Thus, 30 geographic metadata
record sets of diverse cardinality were selected in order to carry out this experiment. They were compiled from different institutions: the Spanish National Geographical Institute, the French National Geographical Institute, several Spanish
regional governments, some European institutions (such as the Joint Research
Center) and the US Geological Survey. Their topics were Spanish, French and
European cartography, Spanish and French hydrology, European LANDSAT images and orthoimages and geologic maps from the USA. The metadata record
sets were all conforming to ISO 19115 with the exception of those from the US
which were in CSDGM and were translated into the ISO 19115 standard by using
the crosswalk described in [11]. Several experts from relevant public European
organisations were asked to collaborate. Moreover, the professional backgrounds of the
experts were rather heterogeneous: geographic, librarian and technological.
The precise instructions given for the assessment were to assign a number from
1 (the lowest quality) to 10 (the highest quality) for each of the thirty metadata
record sets and to write down an optional description for each of the assessments
and a mandatory overall list of the assessment criteria. A form was provided to
facilitate the noting down of those three elements. Two human-readable
formats for the records were provided, one in HTML and another one in XML.
A browser was recommended to visualise the records in the first case and the
metadata edition tool CatMDEdit [12] in the second one. It is important to note,
however, that neither evaluation criteria nor assessment recommendations were
given to them. Nevertheless, as geographic metadata represent the description of
a particular geographic dataset and the dataset itself was not provided, the assessment
was somewhat constrained.
Once the results were compiled, the first necessary step for this statistical
analysis was to obtain a unique assessment value per metadata record set. The
assessments of the experts, however, differed slightly. The variation depended on
the nature of the criteria chosen, since some of the experts were more concerned
with structural aspects and others with semantic ones. An arithmetic average
On the Problem of Identifying the Quality of Geographic Metadata
Table 3. The numeric values of the metrics computed
on the assessments was calculated in order to have a unique number per record
set (see Table 2, note that again the values range from 1, the lowest quality, to
10, the highest quality).
The 14 metrics were computed for each of the 30 sets. The process consisted
in computing the metrics for each of the records and then computing the average
of those values to obtain the metric for the record set (see Table 3).
One way of studying the correlation between the metrics and the metadata record
set quality might be to determine the main source of variation in the metrics. This study was carried out by performing a Principal Component Analysis
(PCA) [13]. The PCA is a mathematical procedure that transforms a number
of variables into a smaller number of uncorrelated variables known as principal
components (PCs). The first principal component (PC1) accounts for as much
of the variability in the information as possible, and each succeeding component
accounts for as much of the remaining variability as possible. The aim of this
procedure is to reduce the dimensionality of data and to identify new meaningful
variables. The relationship between the metadata quality values, coming from
the assessments of the experts, and the principal component scores, obtained
from the metrics, was studied through correlation analysis.
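The PCA-plus-correlation procedure described above can be sketched with NumPy. The data below are random stand-ins for the paper's 30 record sets and metrics, and the eigen-decomposition PCA is one standard way to realise the procedure, not necessarily the tooling the authors used.

```python
import numpy as np

# Hypothetical data: rows = record sets, columns = metric values.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))       # stand-in for the 30 sets x 14 metrics
quality = rng.normal(size=30)      # stand-in for the averaged expert assessments

# Standardize, then PCA via eigen-decomposition of the covariance matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]           # components by descending variance
pc_scores = Xs @ eigvecs[:, order]          # principal component scores
explained = eigvals[order] / eigvals.sum()  # fraction of variance per component

# Correlate the PC1 scores with the expert quality values (Pearson R).
r = np.corrcoef(pc_scores[:, 0], quality)[0, 1]
```

In the paper's setting, `explained[0]` would correspond to the reported 32.2% and `r` to the reported correlation of -0.85.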
Only the first component extracted from the PCA, which explained 32.2% of the
observed variance (eigenvalue = 4.5), was significantly correlated with the metadata quality values (assessments). This correlation was strong and negative (R = -0.85), as Fig. 2 shows.
Fig. 2. The relationship between the quality values and the PC1
Table 4. The PCA factor loadings
[Table 4 data garbled in extraction; among the recoverable rows are numberOfFilledElements (-0.794*, -0.221) and dataAccessConstraints (-0.395, 0.169); the remaining metric names were lost.]
The factor loadings of the PCA (Table 4) reflect that this
component (PC1) was significantly correlated with the metadata metrics: coreFilledPercentage, numberOfFilledElements, distributionFormat, referenceSystem,
responsablesData, lineage and cataloguersData. The numerical values represent
the correlation degree between the metrics and the PCs, and the symbol * represents that there exists a significant correlation (p < 0.001).
Fig. 3. Distribution of the different metadata record sets in relation to the PC1 and PC2
Thus, it can be concluded
that these metadata metrics could be used as indicators of geographic metadata
quality. If the value of the metrics increases, the quality of the record set increases
as well. Nevertheless, the rest of the metrics were not significantly correlated and,
consequently, it cannot be statistically determined whether they have an influence on
the quality.
The first two components obtained through the PCA (PC1 and PC2) were
used to represent the record sets in two dimensions (see Fig. 3). Metadata record
sets were sorted into three groups according to their quality value
(high quality, >7; medium quality, 5-7; and low quality, <5). The highest
quality group appears associated to low values of PC1 and the lowest quality
group with high values of this component.
According to Fig. 3, it is important to note that:
– high quality metadata record sets appear quite close to one another and far
away from poor quality metadata record sets
– high quality metadata record sets and some medium quality metadata record
sets appear near each other, which may suggest that the significantly correlated
metrics do not determine quality completely, and that other indicators, such as
those with a semantic dimension, also play an important role.
It can be stated that within this metadata set sample, the quality of the sets
can be predicted by computing the correlated metrics: high values of the
metrics imply medium-high quality, and low values imply low quality.
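This prediction rule can be sketched as a simple threshold rule over PC1 scores. The thresholds below are hypothetical; the only grounded element is the sign of the relation, since the reported correlation is negative (R = -0.85), so low PC1 scores correspond to high quality.

```python
def predict_quality(pc1_score, low_threshold=-1.0, high_threshold=1.0):
    """Map a PC1 score to a coarse quality label.

    Because the observed correlation with quality is negative, LOW PC1
    scores correspond to HIGH quality. Thresholds are illustrative only.
    """
    if pc1_score <= low_threshold:
        return "high"
    if pc1_score >= high_threshold:
        return "low"
    return "medium"
```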
This work has presented early results from a series of experiments on identifying the quality of geographic metadata. The paper has proposed a quantitative
method for quality assessment. The method is developed in two phases. Firstly, a
list of geographic quality criteria was compiled from an opinion poll conducted among
several experts in the area. The criteria were primarily classified into structural
and semantic, though some other taxonomies were also described. The structural criteria reflect aspects involving the measurement of
quantity or amount, which can be computed automatically. Derived from those
criteria, a list of 14 geographic metadata metrics was proposed. Secondly, a statistical analysis was carried out on a selection of 30 geographic metadata record
sets. The experiment, by developing a Principal Component analysis, studied
the relationship between the 14 metrics, which were computed for each record
set, and the assessments made by some experts. As a result, it was observed
that some metrics could be used as indicators of geographic metadata quality
and, within the selected 30 record sets, the geographic metadata quality could
be predicted by computing those metrics: high values of the metrics imply
medium-high quality, and low values imply low quality.
As further work and in order to validate these results and to generalise them,
the experiments should be carried out with an extended metadata corpus. Additionally, it would be interesting to investigate whether metadata quality metrics
can be applied to the development of more efficient information retrieval ranking algorithms. It is expected that quality metrics can play an important role in
computing the relevance of the resource described.
References
1. Barton, J., Currier, S., Hey, J.: Building Quality Assurance into Metadata Creation: an Analysis based on the Learning Objects and e-Prints Communities of
Practice. In: Proceedings of the 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice-Metadata Research and Applications. (2003)
ISBN 0-9745303-0-1.
2. Holden, C.: From Local Challenges to a Global Community: Learning Repositories
and the Global Learning Repositories Summit. The Academic ADL Co-Lab (2003)
Version 1.0.
3. Guy, M., Powell, A., Day, M.: Improving the Quality of Metadata in Eprint
Archives. Ariadne Magazine (38) (2004) http://www.ariadne.ac.uk/.
4. Federal Geographic Data Committee (FGDC): Content Standard for Digital
Geospatial Metadata, version 2.0. Document FGDC-STD-001-1998. Technical report (1998)
5. International Organization for Standardization (ISO): Geographic information Metadata. ISO 19115:2003 (2003)
6. Amento, B., Terveen, L., Hill, W.: Does "authority" mean quality? Predicting expert quality ratings of Web documents. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, Athens, Greece (2000) 296–303 ISBN 1-58113-226-.
On the Problem of Identifying the Quality of Geographic Metadata
7. Custard, M., Sumner, T.: Using Machine Learning to Support Quality Judgements. D-Lib Magazine 11(10) (2005) ISSN 1082-9873.
8. Hughes, B.: Metadata Quality Evaluation: Experience from the Open Language
Archives Community. In: Proceedings of the 7th International Conference on Asian
Digital Libraries (ICADL 2004). Number 3334, Lecture Notes in Computer Science. Springer-Verlag (2004) 320–329 ISBN 3-540-24030-6.
9. International Organization for Standardization (ISO): Information and documentation - The Dublin Core metadata element set. ISO 15836:2003 (2003)
10. Zhang, B., Gonçalves, M., Fox, E.: An OAI-Based Filtering Service for CITIDEL
from NDLTD. In: Proceedings of the 6th International Conference on Asian Digital Libraries (ICADL 2003). Number 2911, Lecture Notes in Computer Science.
Springer-Verlag (2003) 590–601 ISBN 3-540-20608-6.
11. Nogueras-Iso, J., Zarazaga-Soria, F.J., Lacasta, J., Béjar, R., Muro-Medrano, P.R.:
Metadata Standard Interoperability: Application in the Geographic Information
Domain. Computers, Environment and Urban Systems 28(6) (2004) 611–634
12. Nogueras-Iso, J., Zarazaga-Soria, F.J., Muro-Medrano, P.R.: Geographic Information Metadata for Spatial Data Infrastructures - Resources, Interoperability and
Information Retrieval. Springer Verlag (2005) ISBN 3-540-24464-6.
13. Jolliffe, I.T.: Principal Component Analysis. 2nd edn. Springer Series in Statistics.
Springer Verlag (2002)
Quality Control of Metadata: A Case with UNIMARC
Hugo Manguinhas and José Borbinha
INESC-ID – Instituto de Engenharia de Sistemas e Computadores,
Apartado 13069, 1000-029 Lisboa, Portugal
[email protected], [email protected]
Abstract. UNIMARC is a family of bibliographic metadata schemas with formats
for descriptive information, classification, authorities and holdings. This paper describes the automation of the quality control processes required to monitor
and enforce the quality of UNIMARC records. This is accomplished by means of
format schemas expressed in XML. This paper also describes the tools that take
advantage of this technology to support the quality control processes, as well as
their current applications in services at the National Library of Portugal.
1 Introduction
Descriptive metadata still plays a fundamental role in resource description services.
Therefore, the quality of that metadata is also a key issue for the effectiveness of
those services.
This paper addresses the problem of quality control of UNIMARC bibliographic
records. This work was developed at the National Library of Portugal (BN), but the
problem is addressed from a generic perspective, which means that the solutions
presented here can be reused by any organisation dealing with the creation and processing of
UNIMARC records. UNIMARC is a family of bibliographic metadata schemas with formats for descriptive information, classification, authorities and holdings. The UNIMARC adoption in Portugal followed the adoption in 1985 of UNIMARC as the international
format for record exchange between national bibliographic agencies. Since then all
BN cataloguing processes use UNIMARC as base format.
Until 2004, the validation and correction of records contained in the Portuguese union
catalogue were performed by skilled professionals using cataloguing applications.
Quality control processes are time-consuming tasks that require a lot of attention,
experience and expertise. BN requires quality control processes for two major activities: record acquisition from cooperating libraries (libraries that signed a cooperation
protocol with BN to maintain a single catalogue) into the Portuguese union catalogue
(PORBASE), and union catalogue maintenance.
Existing validation tools were embedded in legacy systems and could not be extended to accommodate format evolution. Most of them were proprietary software systems that required continuous updates to current versions. Even the software
systems that enabled format updates were unable to perform full UNIMARC compliance validation. On the other hand, we required validation tools, apart from our cataloguing systems, to satisfy a number of emerging quality control processes.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 244 – 255, 2006.
© Springer-Verlag Berlin Heidelberg 2006
To solve this problem, the National Library of Portugal (BN) started an activity to
formalise a schema for the UNIMARC family of formats [1], ultimately supporting all
the four formats (Bibliographic, Holdings, Authority and Classification) and their
versions. Following that, a set of applications, named MANGAS (Manipulation and Management of Descriptive Metadata), was developed for the purpose of
supporting automatic quality control processes, using that schema.
This paper continues by introducing the reader to the UNIMARC context and the problem of the quality control of UNIMARC records; it then explains all the tools developed to aid these processes and their applications in existing services at BN;
finally, we describe the results already achieved and the future work to be done.
2 The UNIMARC Format
The primary purpose of UNIMARC is to facilitate the international exchange of bibliographic data in machine-readable form between national bibliographic agencies.
UNIMARC belongs to a family of other MARC formats like MARC21 [2].
UNIMARC is intended to be a carrier format for exchange purposes. It does not stipulate the form, content, or record structure of the data within individual information
systems. UNIMARC provides recommendations only on the form and content of
data when it is to be exchanged.
Like other MARC formats, UNIMARC is a specific implementation of ISO 2709,
an international standard that specifies the structure of records containing bibliographic data. It specifies that every bibliographic record prepared for exchange, conforming to the standard, must consist of a record label (a 24 character data element),
followed by an undefined number of fields. A field is identified by a tag, a numeric
three-character code, and can be classified as a control or data field. Control fields
contain a well-defined set of character data; data fields, on the other hand, may optionally contain one or two indicators (single alphanumeric characters adding information about field content, relationships between the corresponding field and others, or
about necessary data manipulation procedures) and are subdivided into subfields. A subfield is identified by a one-character alphanumeric symbol and can contain character
data. Figure 1 shows the UNIMARC format class diagram.
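The record structure just described (24-character label, directory, tagged fields, subfield delimiters) can be sketched with a minimal parser. This is a simplified reading of ISO 2709, assuming fixed 12-character directory entries (3-character tag, 4-character length, 5-character start) and the standard delimiter characters; it is not a full or validating implementation.

```python
FT, RT, SF = "\x1e", "\x1d", "\x1f"  # field, record and subfield delimiters

def parse_iso2709(record: str):
    """Minimal sketch of ISO 2709 parsing: a 24-character record label,
    a directory of (tag, length, start) entries, then the data fields."""
    leader = record[:24]
    dir_end = record.index(FT)           # the directory ends at the first FT
    directory = record[24:dir_end]
    base = dir_end + 1                   # data fields start after that FT
    fields = []
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3]
        length = int(directory[i + 3:i + 7])
        start = int(directory[i + 7:i + 12])
        data = record[base + start:base + start + length].rstrip(FT)
        fields.append((tag, data))
    return leader, fields

# A tiny hand-built example: control field 001 plus field 200 with subfield "a".
leader = "0" * 24                        # real leaders carry record metadata
directory = "001" + "0006" + "00000" + "200" + "0008" + "00006"
sample = leader + directory + FT + "12345" + FT + SF + "aTitle" + FT + RT
ldr, fields = parse_iso2709(sample)
```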
The scope of UNIMARC is to specify the content designators (tags, indicators and
subfield codes) to be assigned to bibliographic records in international machinereadable form and to specify the logical and physical format of the records. It covers
monographs, serials, cartographic materials, music, sound recordings, graphics, projected and video materials, rare books and electronic resources.
The UNIMARC format was first published in 1977 under the title "UNIMARC: Universal MARC Format". It was recommended by the IFLA Working Group on
Content Designators set up by the IFLA Section on Cataloguing and the IFLA Section
on Information Technology. It contained specifications for book and printed serial
material and provisional fields for various non-book materials such as music, motion
pictures, etc. Following editions added data fields required for cartographic materials,
sound recordings, visual projections, video recordings, motion pictures, graphics,
printed music, and microforms, and updated also several fields relating to serials and
Like most other formats, UNIMARC has evolved to accommodate new cataloguing
practices related to existing or new bibliographic materials. These format changes involve defining additional fields, indicators, subfields and coded values where needed.
These efforts, promoted by the Universal Control and International MARC Program,
established an important normative support for the UNIMARC. In the 1985 IFLA Conference, UNIMARC was definitively adopted as the international format for record
exchange between national bibliographic agencies and recommended as a model for
further national MARC based formats in countries lacking an official format.
[Fig. 1 diagram residue: class attributes such as tag: int, ind1: char, ind2: char, code: char, text: string]
Fig. 1. UNIMARC Format Class Diagram
3 The UNIMARC Schema
UNIMARC is a format with a very complex structure. Besides the common syntactic rules for elements, attributes and values, it also defines semantic relations between them. These relations may even define the interpretation made for a given
element or attribute. UNIMARC also requires grouping information into subsets of
rules (aggregations of rules) that represent blocks of fields. This requirement is not essential for record validation but is important for defining element
semantic coupling.
Currently available schema languages like XML Schema [3] and RELAX-NG [4] allow
the definition of syntactic rules based on elements, attributes and values, but lack
semantic rules for defining relations between them. Schematron [5] is the only
schema language that allows defining these semantic relations.
On the other hand, existing schema languages tend to evolve over time or fall out of use, and we needed a stable format. We also required a schema language that was close
to the concepts involved, even if it required, for validation purposes, conversion to
another language (if possible).
We decided not to use any of the existing schema languages, but to develop our
own schema language for the purpose of describing the UNIMARC formats, gathering all syntactic and semantic rules, each uniquely identified by a URN [6]. Figure 2
shows the corresponding class diagram for the UNIMARC format.
[Fig. 2 diagram residue: schema rule attributes such as urn: URN, tag: int, mandatory: boolean, repeatable: boolean, code: char, start: int, end: int, length: int, pattern, format]
Fig. 2. Class Diagram for the Schema Language of the UNIMARC Format
Each UNIMARC format schema (Bibliographic, Authority, Holdings and Classification), and the respective versions, has its own format schema file with the corresponding format rules. Any format schema file can inherit the structural information
of another UNIMARC format schema, making it possible to represent the format
evolution by simply adding and replacing rules (overloading) in the newer versions.
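The inheritance-by-overloading mechanism described above can be sketched as a merge of rule sets: a newer version starts from the base schema's rules, replaces the inherited rules it overloads, and adds new ones. The rule structures and URNs below are hypothetical illustrations, not the actual schema files.

```python
def resolve_schema(base_rules, overrides):
    """Resolve a derived format schema: inherit the base rules, then let the
    newer version add rules and replace (overload) inherited ones."""
    resolved = dict(base_rules)
    resolved.update(overrides)
    return resolved

# Hypothetical base version of a Bibliographic format schema.
unimarc_b_v1 = {
    "urn:unimarc:b:200": {"mandatory": True,  "repeatable": False},
    "urn:unimarc:b:700": {"mandatory": False, "repeatable": False},
}

# A newer version overloads one inherited rule and adds a new field rule.
unimarc_b_v2 = resolve_schema(unimarc_b_v1, {
    "urn:unimarc:b:700": {"mandatory": False, "repeatable": True},  # overloaded
    "urn:unimarc:b:856": {"mandatory": False, "repeatable": True},  # added
})
```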
4 Quality Control of UNIMARC Records
Quality control consists of a workflow of processes required for monitoring and
enforcing quality over incoming and existing records. Figure 3 shows a possible quality
control workflow containing validation, filtering, reporting and correction processes.
4.1 Validation
Validation is the first step and the basis for our quality control procedures. Besides supporting the other quality control processes, it brings further advantages: end users
can be instantly notified of cataloguing mistakes, their performance can be measured (cataloguing procedures), common errors can be detected and prevented, etc.
Fig. 3. Activity Diagram showing a possible quality control workflow
Schema validation is the process of checking vocabularies (well-formedness) and
enforcing rules (constraints) embodied in schemas over metadata documents. The
output of schema validation is a collection of identified errors (if any) in the source
document. For our purpose we used schema validation to check UNIMARC records
according to the UNIMARC format.
Among all the functional requirements common for schema validation we required
a special feature, the generation of structured output errors that could enable more
advanced quality control processes. These processes would use this error information
to add more value. Among these are the generation of record error reports, record
filtering based on record quality, or even record error correction.
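A validator producing such structured output errors might look like the following sketch. This is not MANGAS itself: the rule set is a tiny hypothetical subset, and the error codes are invented for illustration; the point is that each error is a structured object that downstream filtering, reporting and correction can consume.

```python
# Hypothetical subset of format rules: field tag -> constraints.
RULES = {
    "001": {"mandatory": True,  "repeatable": False},
    "200": {"mandatory": True,  "repeatable": False},
    "700": {"mandatory": False, "repeatable": False},
}

def validate(record):
    """record: list of (tag, value) pairs. Returns structured errors."""
    errors, seen = [], {}
    for tag, _ in record:
        seen[tag] = seen.get(tag, 0) + 1
        if tag not in RULES:
            errors.append({"tag": tag, "error": "unknown-field"})
    for tag, rule in RULES.items():
        if rule["mandatory"] and tag not in seen:
            errors.append({"tag": tag, "error": "missing-mandatory-field"})
        if not rule["repeatable"] and seen.get(tag, 0) > 1:
            errors.append({"tag": tag, "error": "not-repeatable"})
    return errors
```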
At the moment, for performance reasons, we are using our own schema validation
tool. This enables us to validate a record without having to translate it to a generic format schema whose constraints are not optimized for record validation.
Nevertheless, any schema validation tool can be used for our purpose, as long as it
supports error-handling listener functionality for the generation of structured output errors. On the other hand, any format schema can also be used, provided it is
compliant with the UNIMARC format rules. A possible format schema would be
Schematron-based, given its ability to define syntactic and semantic relations.
4.2 Filtering
Filtering consists of selecting records according to a given concern. In a quality
control workflow it consists of distinguishing records according to the errors they contain and their characteristics.
Records can be divided according to the type of cataloguing skills required to perform a certain task, or according to the level of concern they imply. Records requiring the
special attention of skilled professionals can be set apart from the rest. Among
these are records with complex errors, with errors affecting crucial information, etc.
On the other hand, records with less complex errors can be solved with the help
of common professionals and/or automatic procedures.
Filters can even select which records do not have the minimal required quality to
be processed.
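Such filtering can be sketched as routing each record on the severity of its errors. The severity scale, error names and thresholds below are hypothetical; the three destinations mirror the text: rejection, skilled professionals, and common professionals or automatic procedures.

```python
# Hypothetical severity per error type (higher = more serious).
SEVERITY = {"missing-mandatory-field": 3, "not-repeatable": 2, "unknown-field": 1}

def route(record_errors, skilled_threshold=3, reject_threshold=5):
    """Route a record based on its worst identified error."""
    worst = max((SEVERITY.get(e, 0) for e in record_errors), default=0)
    if worst >= reject_threshold:
        return "reject"            # below the minimal required quality
    if worst >= skilled_threshold:
        return "skilled"           # needs a skilled professional
    return "automatic"             # common professionals / automatic repair
```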
Fig. 4. Excerpt from a Record Error Index Report
4.3 Reporting
Reporting consists of gathering all available information and organizing it (filtering,
classifying and sorting) according to a given concern. In quality control processes the
information is focused on record content and record quality information.
These reports are the source for more elaborate statistical reports required for
a number of other activities, such as monitoring, evaluating process accuracy, bookkeeping, or even identifying problematic catalogue sources. They can also be used to help
existing and future quality control processes. On a day-to-day basis, these reports are
important for managing the correction effort.
At the moment we build three kinds of reports: a detailed report showing all identified error occurrences in a single report, an index report merging error types and displaying the corresponding record indexes (see Figure 4), and a summary report that
merges all error types and shows only the statistical information. All these reports are
built in machine- or human-readable formats.
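The index and summary reports can be sketched as a merge of error types across records. The data structures are hypothetical: each record index maps to its list of error types, the index report groups record indexes per error type, and the summary keeps only the counts.

```python
from collections import defaultdict

def build_reports(records_errors):
    """records_errors: {record_index: [error type, ...]}.
    Returns (index report, summary report)."""
    index = defaultdict(list)
    for rec_idx, errs in records_errors.items():
        for err in errs:
            index[err].append(rec_idx)          # merge per error type
    summary = {err: len(idx) for err, idx in index.items()}
    return dict(index), summary
```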
4.4 Correction
The correction is a very delicate process, requiring a high level of experience and trust
in the developed system. To satisfy this requirement we decided to split the process
into three distinct activities. This way we separate our concern into small problems
each with an increasing level of responsibility. This increases the level of control over
the processes (enabling monitoring) and consequently the level of trust in the overall
process. The process is cyclic and finishes when no correctable error is detected.
Figure 5 shows the corresponding workflow for this process.
H. Manguinhas and J. Borbinha
The process can be fully automatic or require human interaction and approval.
Each activity works as a module. The interfaces for input and output are well defined
in order to enable the improvement of each activity without influencing the others.
As already mentioned, there are three distinct and complementary activities:
– a correction analysis activity, responsible for selecting the possible corrections
that fit the given scenario (record and corresponding errors);
– a correction decision activity, responsible for deciding which correction is most
appropriate to the scenario; and
– a correction performing activity, which applies the selected corrections to the record.
In most cases, where an error is well known and does not involve unpredictable
information, this process goes smoothly.
The correction analysis activity is responsible for selecting a number of acceptable
error corrections (hints), within a knowledge base, which fit the corresponding scenario. This knowledge base is maintained by professionals and enriched with new
knowledge every time a new type of error is identified and the corresponding solution
(if possible) is built. This activity only selects possible solutions; no reasoning is
done over which is the right solution to apply. The activity receives as input a record
and the identified errors, and produces as output a correction script.
The correction script is composed of a set of correction hints for each identified error; there can be more than one possible correction (hint) for a given error. Correction hints are composed of actions that can be applied to a record in a given scenario
in order to repair it. They are classified by a certainty degree that ensures a level of
trust. These actions can be of two different types (add or remove) and are defined by an
XPath pointer to the data source element. Add actions are also accompanied by XSLT
construct data elements.
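A correction action of this shape can be sketched over an XML record as follows. This is an illustrative sketch, not the MANGAS implementation: the action structure is hypothetical, and the stdlib ElementTree path syntax stands in for full XPath and XSLT construction.

```python
import xml.etree.ElementTree as ET

# A toy record with a duplicated non-repeatable field 700.
record = ET.fromstring(
    "<record><datafield tag='700'/><datafield tag='700'/>"
    "<datafield tag='200'/></record>"
)

def apply_action(root, action):
    """Apply one correction action: remove an element addressed by a path,
    or add a constructed element."""
    if action["type"] == "remove":
        for el in root.findall(action["path"]):
            root.remove(el)
            break                      # remove a single offending occurrence
    elif action["type"] == "add":
        root.append(ET.fromstring(action["construct"]))

apply_action(record, {"type": "remove", "path": "datafield[@tag='700']"})
apply_action(record, {"type": "add", "construct": "<datafield tag='001'/>"})
```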
The correction decision activity is responsible for choosing the best correction
for a given scenario. This can be done automatically with the help of a decision
support system (DSS) or manually with human intervention. The DSS could
be as simple as choosing the first available correction hint with the highest certainty
degree, or as complex as inferring or reasoning based on past experience (with the help
of artificial intelligence techniques such as neural networks, predicate logic and so on).
At the moment we choose the simpler one.
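The simple strategy just mentioned can be sketched as picking the first hint with the highest certainty degree; the hint structure is hypothetical.

```python
def decide(hints):
    """hints: list of dicts with a 'certainty' degree in [0, 1].
    Returns the first hint with the highest certainty, or None."""
    if not hints:
        return None
    best = max(h["certainty"] for h in hints)
    return next(h for h in hints if h["certainty"] == best)
```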
It is not the responsibility of the correction decision activity to predict further errors (errors emerging from applying the correction), but only to ensure that the ones already
detected are solved.
New human or machine based decisions made in this activity can be used to enrich the knowledge base used in the first activity.
Finally, the correction performing activity receives the correction script, compiles
the script into XSLT transformations and applies them to the record in order to repair it.
After the record transformation, new unpredicted errors may occur, emerging from
relations between the changed information and the existing information. To solve this problem the workflow must be cyclic, ending when no repairable error exists.
This workflow becomes more effective every time it is run: as new errors are detected, analysed and corrected, the process becomes more accurate.
Fig. 5. Activity Diagram showing the correction process
5 Tools
A set of applications, named MANGAS, was developed to aid quality control
processes like validation, reporting, filtering and correction. These applications were
developed essentially based on the end-user profile: common users, professionals or
systems can use specific applications to perform their work. Nevertheless, all these
tools use the same core infrastructure to provide the available functionality.
5.1 MANGAS Diag
MANGAS Diag is a standalone tool developed to produce reports that gather the
collected information. These reports can contain validation information or other record related information.
It is suited for common users who are not familiar with the UNIMARC format and
only require a way to build knowledge about their personal catalogues. This application is available as a standalone tool (see figure 6) or as a web service (located at
5.2 MANGAS Workstation
The MANGAS Workstation is a tool for skilled professionals who are familiar with
the UNIMARC format and require more complex functionality for their work. It is
suited for professionals responsible for quality control procedures that require the
ability to detect errors and act upon them.
Among the available functionality are the generation of custom reports (validation
and content reports), different record views, record editing, record manipulation, record
Fig. 6. MANGAS Diag standalone tool
transformation, record error correction, search and replace functionality and printing
capabilities. This application is only available as a standalone tool (see figure 7).
Fig. 7. MANGAS Workstation stand alone tool
5.3 MANGAS Batch
The purpose of MANGAS Batch is to enable third party applications to use the
MANGAS functionality to produce additional value. Currently this application is
available as a standalone tool that can be called from a prompt in the local operating
system, or be embedded as a library in a Java based application.
6 Services
For record acquisition, BN has created an elaborate workflow composed of quality
control, duplicate entry checking and merging processes. This workflow is supported by
an infrastructure developed in-house, named IRIS. This infrastructure works automatically but
allows users to intervene whenever they decide to or when their help is needed. IRIS uses
the MANGAS tools for validating, filtering, reporting on and correcting records (see figure 8).
In order to minimize the BN acquisition effort, a number of services are provided to
help partner libraries fulfil this task. All MANGAS related tools used for quality
control processes are available for download for local usage, and a web service is
available for prior validation (the MANGAS Diag web service).
Fig. 8. BN Services System Model
For catalogue maintenance, BN has developed a tool called QualiCat (web user interface only available on the intranet) for the purpose of validating (using MANGAS Batch
embedded in the system, see figure 8) all records changed during the day and reporting to the implied users in order to perform the necessary corrections. This tool runs
every day at a scheduled time, usually after hours, to minimize interference with
working processes.
7 Results
All this technology has been applied to the BN quality control processes with excellent
results. During the year, BN receives a number of batches of records from other libraries
to integrate into PORBASE; they are submitted through IRIS (see figure 9).
H. Manguinhas and J. Borbinha
Fig. 9. Report from 2005 record acquisition statistics
QualiCat runs every day at a scheduled time, performing validation over all records
that have been changed in PORBASE. PORBASE has an average of 492.58 changed records
every day; 60.49% of those contain errors, with an average of 1.67 errors per record,
totalling 822.63 errors per day.
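As a quick sanity check (our arithmetic, not the paper's), the quoted daily total is consistent with the per-record average applied to all changed records:

```python
# Reconstruct the daily error total from the reported averages.
changed_per_day = 492.58   # average records changed per day in PORBASE
errors_per_record = 1.67   # average errors per changed record
daily_errors = changed_per_day * errors_per_record
# 492.58 * 1.67 ≈ 822.61, in line with the reported 822.63; the small
# gap is rounding in the published per-record average.
```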
8 Work in Progress and Future Work
Concerning the quality control processes, some aspects require further attention. For
validation, we need to develop converters from our schema language into existing
generic schema languages (such as Schematron) in order to enable third-party systems
working with standard schemas to use this technology. For reporting, new customizable
reports must be developed to satisfy different emerging information requirements. For
filtering, we need to develop a way to customize filters to apply in different
situations. Finally, for correction, we need to improve the correction decision
activity with a more advanced decision support system, as well as to register new
types of errors and new solutions.
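A converter of the kind mentioned for validation could, for a simple mandatory-field rule, emit ISO Schematron along these lines. The in-house rule representation and the MARCXML-style `record`/`datafield` context are assumptions for illustration, not MANGAS's actual schema language:

```python
# Sketch: turn a simple "field X is mandatory" rule into an ISO
# Schematron pattern. The source rule format is invented.
SCHEMATRON_NS = "http://purl.oclc.org/dsdl/schematron"

def mandatory_field_rule(tag):
    """Emit a Schematron pattern asserting that a MARCXML-style
    datafield with the given tag is present in every record."""
    return (
        f'<sch:pattern xmlns:sch="{SCHEMATRON_NS}">\n'
        f'  <sch:rule context="record">\n'
        f'    <sch:assert test="datafield[@tag=\'{tag}\']">'
        f'Mandatory field {tag} is missing</sch:assert>\n'
        f'  </sch:rule>\n'
        f'</sch:pattern>'
    )

# UNIMARC field 200 (title and statement of responsibility) is mandatory
print(mandatory_field_rule("200"))
```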
At a more practical level, we need to reevaluate the developed tools in order to
identify functionality that needs to be improved and acquire new requirements for
development of new functionality. Other services and systems concerning quality
control processes must be reevaluated in order to take advantage of this technology.
References
1. IFLA – International Federation of Library Associations and Institutions
2. LOC – MARC Standards, MARC in XML, September 2004 (http://www.loc.gov/
3. XML Schema (http://www.w3.org/XML/Schema)
4. RELAX NG (http://relaxng.org)
5. Schematron (http://www.schematron.com)
6. Moats, R.: URN Syntax. RFC 2141, May 1997
Large-Scale Impact of Digital Library Services: Findings
from a Major Evaluation of SCRAN
Gobinda Chowdhury, David McMenemy, and Alan Poulter
Department of Computer and Information Sciences, University of Strathclyde,
Glasgow G1 1XH, UK
{gobinda.chowdhury, david.mcmenemy,
Abstract. This paper reports on an evaluation carried out on behalf of the Scottish Library and Information Council (SLIC) of a Scottish Executive initiative
to fund a year's use of a major commercial digital library service called SCRAN
throughout public libraries in Scotland. The methodology used to investigate
value-for-money aspects, the content and nature of the service, users and usage patterns, the effects of intermediaries (staff in public libraries), the training of
those intermediaries, and the project rollout is described. Conclusions are presented
about SCRAN usage and user and public library staff reactions.
1 Introduction
Even after a decade of intensive research and development activity, evaluation of
large-scale digital library application and use still remains problematic. The ultimate
goal of a digital library evaluation is to study how digital libraries are impacting on,
and hopefully transforming, information seeking and use, research, education, learning and indeed the very lives of users. Several online bibliographies on digital library
evaluation are now available (see for example, DELOS WP7 [1]; Neuhaus [2];
Giersch, Butcher and Reeves [3]; and Zhang [4]). Regular international workshops
on digital library evaluation take place under the DELOS programme, and evaluation
is a regular topic at all other digital library conferences. Several evaluation guidelines
and methods have been proposed in the course of evaluation projects like ADEPT [5],
DELOS [6], eValued [7], JUBILEE [8], etc. Projects like eValued and HyLife [9]
have developed toolkits and guidelines for evaluation of digital libraries. Many other
researchers and institutions have also produced guidelines and toolkits for digital
library evaluation. See for example: Reeves, Apedoe, and Woo [10]; Nicholson [11];
Borgman [12]; Blandford [13]; Blandford and Buchanan [14]; Blandford et al, [15];
Choudhury, Hobs and Lorie [16]; Chowdhury [17]; Borgman and Larsen [18]; Jeng
[19] and Saracevic [20, 21, 22].
This paper reports on a recently completed large-scale evaluation of a major commercial digital library service called SCRAN (http://www.scran.ac.uk). This evaluation
is unique for a number of reasons. First, it is an evaluation study of a large, nationwide,
commercial digital library service, which was funded by the Scottish Executive to provide a specific range of services for all Scottish public libraries for one year, with the
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 256 – 266, 2006.
© Springer-Verlag Berlin Heidelberg 2006
total cost of the project amounting to £123,900. Second, the outcome of the evaluation
would determine whether Scottish Executive funding continued, thus it was necessary to
ascertain the success or failure of the initial funding in value for money terms. Third, the
evaluation was large-scale in that there are 557 public libraries in Scotland which attract
over 31 million visits per annum. We would argue that the funding of access to a commercial digital library service by a national government for all citizens is hitherto a
unique event in the development of very large-scale digital library services and needed
to be evaluated extremely carefully, bearing in mind the complex social, economic and
political aspects of the project. The evaluation, however, could not follow
well-trodden, tried-and-tested routes, for example by looking in detail at features like
usability of individual pages in controlled conditions using a selected group of volunteers acting as users. It had to survey a large and diverse clientele of public library users
to whom a large-scale digital library service was but one of a competing portfolio of
services. Users could not be expected to recognize the uniqueness of the digital library
service nor would its novelty alone give it any extra weight in their opinions. Public
library staff, although being library professionals with an understanding of digital library services, would see it simply as yet another new service they had to support and
deliver and would not give it any special treatment, apart from marketing it in the standard way as a new service. Finally, funders would not be looking for the meeting of
research aims or the achievement of good design, but rather at visible take-up and usage by
the public vis-à-vis existing services and the reaction expressed by professional public
library staff involved in its delivery.
A specific methodology was developed that addressed a number of issues including value for money aspects, content and nature of the service, users and usage patterns, the effects of intermediaries (staff in public libraries), the training of those
intermediaries and project rollout. The paper briefly discusses the nature of the
SCRAN service followed by the detailed methods used in the evaluation; major findings of the evaluation are then discussed with some critical comments that may be
useful for the future design and management of large-scale digital libraries.
2 Background to SCRAN
SCRAN began in 1996. Its name came from an abbreviation of its initial purpose
(Scottish Cultural Resources Access Network) but was also a reference to the Scottish
word ‘scran’, which meant ‘food’ and ‘gather together’, very appropriate for a digital
cultural portal. Resources were acquired through different stages of growth. The first
batch came from Millennium funding in conjunction with the National Museums
Service/National Library of Scotland. The actual digitisation of resources was outsourced. The second batch of resources came from NOF (New Opportunities
Fund) funding for Resources for Learning in Scotland (http://www.rls.org.uk/). Other
organisations provided resources, which SCRAN digitised and mounted and stored
for fast access. SCRAN is essentially a federated database of resources from a variety
of sources, some of which are commercial organizations, for example The Scotsman
and Herald newspapers.
Over its history, SCRAN has accumulated a unique set of skills in digitisation and
digital preservation. All of SCRAN’s resources have copyright clearance for general
use but with specific privileges for subscribers. SCRAN is currently working with the
British Museum and the Scottish Motor Museum to acquire more resources.
Individual resource records are in Dublin Core format. Place names are provided
by contributing institutions and can be variable as different institutions use different
rules. SCRAN have tagged about 170,000 records in the past year with Ordnance
Survey [the UK’s national grid location system] co-ordinates. Geographic search
allows linkages between areas and their sub-areas. There is no generic vocabulary or
taxonomy for the vast range of subjects in SCRAN, and contributing institutions themselves have no agreed system, which can hinder efficient searching of the resource. SCRAN are working with the Royal Commission on the
Ancient and Historical Monuments of Scotland (RCAHMS) and the National Museums of Scotland on a joint thesaurus for Scottish cultural institutions. SCRAN employ the UK
Learning Object Metadata (LOM) with Pathfinder packs and they have a full hierarchy of curriculum terms for the English and Scottish curricula. SCRAN have three
staff working full time on metadata – two checking, correcting and adding to records,
and a data officer managing quality and carrying out global updates. SCRAN’s three
educational officers look after LOM information.
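The area/sub-area linkage can be illustrated with Ordnance Survey grid references, whose structure (two grid letters, then easting digits followed by northing digits) makes containment checkable. The function and the sample records below are our sketch, not SCRAN's implementation:

```python
def contains(area_ref, point_ref):
    """True if the OS grid square `area_ref` contains `point_ref`.
    A reference is two grid letters plus an even number of digits,
    easting digits first, then northing (e.g. 'NT' = 100 km square,
    'NT27' = 10 km, 'NT2573' = 1 km). Containment must be checked
    per half: a plain string-prefix test would be wrong."""
    def split(ref):
        letters, digits = ref[:2], ref[2:]
        half = len(digits) // 2
        return letters, digits[:half], digits[half:]
    al, ae, an = split(area_ref)
    pl, pe, pn = split(point_ref)
    return al == pl and pe.startswith(ae) and pn.startswith(an)

# Hypothetical records tagged with 1 km grid squares
records = [{"place": "Edinburgh Castle", "grid": "NT2573"},
           {"place": "Mitchell Library, Glasgow", "grid": "NS5865"}]
in_nt = [r["place"] for r in records if contains("NT", r["grid"])]
```

Note the per-half check: `NT2573` lies inside the 10 km square `NT27` (easting 2x, northing 7x), not inside `NT25`, so simple string-prefix matching would give wrong answers.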
At the time of the evaluation, SCRAN offered an extensive range of materials consisting of over 1.3 million records, with over 300,000 multimedia resources, to
schools, libraries and higher education institutions. Although SCRAN has created
many ‘Pathfinder’ packs of resources by topic, SCRAN’s interface has been extended
over time to allow users to develop a range of resource applications for themselves by
means of personalization or customisation. Such user-created information is stored on
SCRAN’s servers so it will work anywhere and not just on a local machine. ‘My
Stuff’ offers a basic level of personalisation, like bookmarking. ‘Albums’ are more
sophisticated, allowing user editing features (e.g. the addition of captions).
The Scottish Executive funding for access to SCRAN had several agreed objectives, viz.:
• To provide licensed access to SCRAN for all Scottish local authority libraries
• To provide user names and passwords to all participating libraries, and an authentication system including IP authentication where required
• To deliver a programme of training for information professionals in developing
their own use of the resources and in assembling learning objects
• To provide multi-user rights to SCRAN ‘Albums’, CD-ROMs and resources to
all libraries
• To provide ‘Albums’ functionality with captioning and local output to a personal
mini-website for use by public library staff to create their own ‘Collections’ for users
• To provide unrestricted 24/7 access, free at the point of use, to multimedia
• To handle IPR management of all resources
Project management was provided by SCRAN, in conjunction with representatives
from public libraries and from the Scottish Library and Information Council (SLIC),
which is an independent advisory body to the Scottish Executive on library matters.
3 Evaluation Objectives, Methods and Tools
The main objective of the evaluation of SCRAN was to assess the value for money of
the year-long public library licence. The outcome could be to recommend continued access at the same (or a higher or lower) cost, to devolve responsibility for funding to library authorities, or to recommend an alternative to SCRAN.
In order to find answers to these questions the following multi-stage methodology
was adopted involving the following tasks:
1. A detailed and critical study of the SCRAN website
2. Visits to SCRAN headquarters to interview key personnel and to study useful
documentation
3. Extensive analysis of web logs and other usage statistics supplied by SCRAN
4. A survey of selected public library staff to understand how the service is used
by the end-users with the perceived benefits, level of difficulties, and various issues
5. A survey of end-users to understand the usage patterns and level of satisfaction
6. An analysis of the case study materials promoted by SCRAN as examples of
best practice
7. Analysis of minutes from the Steering Group and the Project Group, and relevant
documentation from SLIC.
Each stage of the methodology aimed to find specific information about SCRAN
that would answer specific questions relating to the evaluation of the service:
How much was SCRAN used? What factors affected usage?
What did users think of SCRAN?
What did public library staff think of SCRAN?
3.1 Factors Affecting Usage of SCRAN
In theory virtually anyone can be a SCRAN user – school children doing homework,
students at all levels, community groups in public libraries and any individual. SCRAN
has local resources for everywhere in Scotland; and these resources can have personal
resonance for individuals, a service SCRAN labels quite succinctly as ‘reminiscence’.
Originally SCRAN was a unique service, with no competitors. However, this is no
longer the case: there is now a plethora of alternative channels for obtaining the
information available through SCRAN. For example, public library services maintain local
gateways giving alternative free access to Scottish digital resources. The Resources for
Learning in Scotland (RLS) project used the UK’s New Opportunities Fund (NOF)
funding to draw together contributors from across the public sector with the intention of
the digital assets being freely available. Material held on RLS is a combination of
SCRAN and RLS data, but while text-based information can be accessed freely, access
to the full image requires SCRAN subscription. Other Scottish projects such as AmBaile (http://www.ambaile.org.uk/en/highlights.jsp), Springburn Museum (http://
gdl.cdlr.strath.ac.uk/springburn/), and Virtual Mitchell (http://www.mitchelllibrary.org/
vm/) provide full access to all images and not just thumbnails. For general educational
resources not related to Scotland, websites like the BBC’s Learning Homepage
(http://www.bbc.co.uk/learning/) provide stiff competition.
Transaction log data maintained by SCRAN for the months of January to May
2005 was made available. Over the five month period, the average number of sessions
(defined as at least one access in a half-hour period) per branch on SCRAN for all
Scottish public library authorities was 15. This equates to an average of 3 sessions per
month for each branch in Scotland over the period. There were occasional peaks but
these were found to correspond with periods of staff training on SCRAN. Thus
SCRAN usage generally was very low.
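Under the session definition quoted above (at least one access in a half-hour period), session counting can be sketched as bucketing log timestamps into half-hour windows. The exact bucketing SCRAN used is not documented, so the wall-clock interpretation below is an assumption:

```python
from datetime import datetime

def count_sessions(timestamps):
    """Count sessions, reading 'a session is a half-hour period with at
    least one access' as distinct half-hour wall-clock buckets (an
    assumption; SCRAN's exact bucketing is not documented)."""
    buckets = set()
    for ts in timestamps:
        buckets.add((ts.date(), ts.hour, ts.minute // 30))
    return len(buckets)

log = [datetime(2005, 3, 1, 10, 5),   # same half hour as the next entry
       datetime(2005, 3, 1, 10, 20),
       datetime(2005, 3, 1, 10, 40),  # next half hour
       datetime(2005, 3, 2, 15, 0)]
sessions = count_sessions(log)  # 3 distinct half-hour buckets
```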
One of the main objectives of the project funding was 24/7 access to SCRAN.
However, the nature of library opening hours varies considerably across Scotland,
meaning that 24/7 access may in fact equate to only a handful of hours of access per
day for many members of the public. This should have been raised when negotiations
on the funding of SCRAN were taking place and should have been a consideration
from the point of view of pricing.
Low in-branch usage could potentially have been offset by high at-home usage.
The ATHENS Access Management system (http://www.athens.ac.uk/) was SCRAN’s
preferred access model, whereby unique IP addresses were recognised and tied to
authorised users. Because of the licensing requirements on SCRAN from contributors,
each user must be identifiable so that should a resource be discovered being used
illegally, SCRAN can tell the user to desist. A number of SCRAN’s commercial and
non-commercial contributors regularly trawl Google to see if their resources are being
used illegally and let SCRAN know of any illegal uses they find. Whilst this is important for contributing commercial organisations like The Scotsman and Herald
newspapers, it is not that important for public sector bodies like museums who are
trying to increase access to their digital content.
However, the cost of this type of authentication approach outside academia had made
it prohibitive for local authorities to implement. Remote access
to SCRAN (i.e. by a public library user from home) would be possible with a different
type of authentication system. As an example, access for public library members to
other databases such as NewsUK and Encyclopaedia Britannica has been set up, allowing library card holders to access the databases 24/7 from their home computers
using only their library card number. This is true universal access and allows members of the public to access library services even when the building is closed.
Even within public libraries, the differing usage of IP addresses in different library
authorities posed problems for accessing SCRAN, as while some used fixed IPs, some
did not use them at all (North Ayrshire, Argyle and Bute plus parts of Highlands, are
examples). A subsequent problem was that several IT departments within councils
changed the IP addresses of the computers in their authority, causing authentication
issues beyond the control of SCRAN. Access in public libraries was therefore mainly
by menu and password authentication. Choosing the default authority level rather than a
particular public library hid access from that library and served to obfuscate
usage logging.
The original focus of SCRAN was, and continues to be, schools, and there
is certainly an argument for suggesting that its interface displays an age profile bias
towards children. Some of the terminology used could be confusing to adults who
have not undertaken training, and there may be issues for the casual adult browser
who is drawn to the service via marketing material only to be faced with terminology
such as: “Homework”, “My Stuff”, “Lucky Dip”, “Monkeying Around”, “Fun and
Games” and “Sticky Pics”.
Each of these features in its own right is creative and greatly enhances the user experience of the site. However their use in a database aimed at a wider market than
schools does need to be rethought. A more intuitive homepage for public libraries
could have been developed, aimed at the wider range of ages and interests that this
client market represents. Certainly, doubts about SCRAN’s interface were borne out
through the user questionnaire: 41% of users had difficulty in finding material on
SCRAN using the simple search.
3.2 Public Library User Perceptions of SCRAN
A questionnaire survey was conducted with the users of SCRAN services in public
libraries throughout Scotland. The main objective of the user survey was to ascertain
public library users’ views on the service, problems encountered, and the users’ overall reactions to the service. A total of 351 responses to the user survey were received.
The public library user survey indicated that 51% of respondents had never used the
SCRAN service. This was not because of a lack of interest in computer-based services
as such: 71% of respondents said they would use online services and only 8% said
they would not. The remainder would use them but would prefer printed materials.
There was no obvious bias against online services by facets like age or gender. Those
who used the service were interested in many types of material available via SCRAN:
materials that are unique to their locality, their country, or their family were the most
popular choices.
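For concreteness, the headline percentages translate into approximate respondent counts as follows (our arithmetic on the reported figures):

```python
# Back-of-the-envelope counts from the survey percentages.
total = 351                            # responses to the user survey
never_used = round(0.51 * total)       # ~179 respondents had never used SCRAN
would_use_online = round(0.71 * total) # ~249 would use online services
would_not = round(0.08 * total)        # ~28 would not
# the remainder would use them but prefer printed materials
prefer_printed = total - would_use_online - would_not  # 74, about 21%
```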
Awareness of the SCRAN service within the library was high, even though fewer than
50% of respondents had actually used it. Comments received on using SCRAN included the following:
• “I find retrieval of results most problematic on SCRAN, there seems to be no
consistency in what terms, names or subjects are used for indexing and retrieval”
• “In the past I have noted inaccuracies of information stored”
• “Sometimes filtering of results could be better. I tend to get lots of irrelevant
material along with my search results”
• “I used SCRAN for the first time today and found it very easy to use and full of
interesting information”
Some of the comments from users suggest that retrieval of results is an issue for
many, and this reinforces the need for a richer metadata scheme.
In order to gauge value for money and willingness to pay, a question was asked
that requested users to give a cost per session they would be willing to pay to access a
service providing the types of material available on SCRAN. Over 58% of respondents indicated they felt such a service should be free, with a further 15% not wishing
to put a figure on it. This suggests that public libraries would struggle if they wished
to recoup from their users some of the outlay of a SCRAN subscription.
3.3 Public Library Staff Perceptions of SCRAN
Another feature that the usage log revealed was a discrepancy between different library authorities. A few (Fife, Borders, Aberdeen) appeared to be heavier users than
the other authorities. It was felt that these differences in usage patterns among the
various authorities may have been caused by several factors including effectiveness of
staff training and staff attitudes towards new digital library services in general and
SCRAN in particular, making some staff more committed to using SCRAN. The web-based questionnaire survey was designed to find answers to these questions.
The survey was conducted via the Internet; a total of 419 responses were received.
Interestingly, a high proportion of responses came from the ‘committed’ group of
library authorities. The responses on initial training were very positive. It was noted
that the most popular internal method of marketing was word of mouth, making the cascading of training to as many staff as possible an absolutely crucial issue for success. A
variety of user marketing methods were noted, but none seemed to be predominant.
A number of respondents mentioned that in their experience an aging population
might not be computer literate but showed a liking for reminiscence services. There
was however a general awareness that SCRAN usage was very low, and lower in
some authorities than others. Fife was known to be a high user but then as commented
by the respondents “Fife always was keen on online services”.
Finally, respondents were asked to indicate how much of an effect losing access to
SCRAN would have on the library service. While broadly receptive to SCRAN, the
largest group of respondents (37%) felt that the effect of losing SCRAN would be
limited, although a high percentage felt the effect would be reasonable (29%), with a
smaller number thinking it would be significant (21%).
Richer information about staff attitudes towards the service, problems encountered while using the service on behalf of users, etc., was ascertained through a
series of interviews among library staff. User interviews were undertaken with a range
of authorities, both from the group identified by usage statistics, and staff survey
responses, as ‘committed’ users and those not in this group. The intention was to try
to elucidate how staff viewed the effectiveness of training, the utility of new services
delivered and value for money of the project. Altogether 17 individuals from five
authorities were interviewed. Most were experienced library staff, with lengths of
services ranging from 15 years up to 40; 11 were in professional grade, 6 paraprofessional. Their areas of responsibility ranged from managing one or more libraries, to managing a specific facet of service (e.g. ICT, specifically People’s Network
services, children’s services, or local history) or being in customer-facing roles. All
the staff had received an initial round of training and then a second round focusing on
hands-on use and creating applications. All used links on local portals to promote
SCRAN. A general issue was that local computer technical support was often overstretched. One group commented that just getting bookmarks changed and icons
placed on screens was extremely difficult, as the rights to do these tasks were maintained centrally.
All had engaged with the ECDL (European Computer Driving Licence, http://www.
ecdl.co.uk/) and felt that they had the requisite IT skills for the job; although they
recognised that they were continually being stretched. They also admitted to being
stretched generally, because of shrinking staff numbers and an unchanging set of core
tasks which were being added to by new tasks – “Staff are being hit by new initiative
after new initiative, with no time to bed one down before the next arrives.” However
all interviewees appeared well motivated and keen to do the best they could for their
users. All the respondents were engaged in providing local digitised services in the areas of Scottish history, local history and family history. All agreed that
genealogy and reminiscence especially were popular services. Most were using local
portals to point to web resources or locally-mounted CD-ROMs.
Digitisation for local history collections was being attempted by some but costs and
other difficulties meant that it was sometimes easier to ask users to go to a central
library to consult originals. The drawback to this approach, as mentioned by some
professionals, was that “some materials would sit in vaults forever”. It was remarked
that some popular sites (e.g. Statistical Accounts of Scotland online; http://
stat-acc-scot.edina.ac.uk/stat-acc-scot/stat-acc-scot.asp) were moving to ‘for pay’ access
which meant that users could not be directed to them any more. The Statistical Accounts of Scotland online is in fact still a free service; it is the value-added elements that are moving to a subscription model.
One issue with promotion that was raised suggested that SCRAN’s name gave no
indication of what it was. Its name was also easy to confuse with those of other services, e.g. SCAN, the Scottish Archives Network. No one reported problems in using
SCRAN and most praised the suite of tools which enabled customisation to be done.
Most interviewees made only light use of SCRAN. The biggest driver of usage was
SCRAN’s newsletters which prompted a check of SCRAN for new features or materials. Some staff wanted access to SCRAN from home as there they would have had
time to explore.
The staff interviews indicated that staff felt stretched and, while appreciative of the SCRAN service, were often not in a position to promote it. One interviewee also stated that she felt the service was only now beginning to be used by
more staff, as they were finding time to pass on the skills. A selection of the comments received from staff is summarised below:
• “Easy access and detailed information make this an invaluable tool for public use”
• “Excellent service that will grow in usefulness”
• “Money could have been better spent on subscriptions of our choice”
• “If SCRAN is allowed more time to develop (i.e. amass more material) its resources, it will become an increasingly useful tool for public library online services”
• “Not many people have used it. I think that it is a good site but with so many
other sites on the Internet it is easy to find the images you're looking for elsewhere”
• “I think advertising of this tool is woefully inadequate, and it's not available on
enough of our PCs”
4 Conclusion
When SCRAN began, it had a clear focus as an online archive of Scottish cultural
materials. Now SCRAN offers a much wider range of services, and is downplaying its
Scottish focus. Rather than being the sole provider in a focused market, SCRAN is
trying to push into other markets. While SCRAN’s major strength as a service is still
in its Scottishness and its collection of Scottish material, by not concentrating on this
SCRAN did not impact on the public in Scotland as a strong brand associated with
Scottish culture. For marketing purposes in Scottish public libraries it would seem
better to have used SCRAN's old full title, Scottish Cultural Resources Access Network,
rather than the more gnomic ‘SCRAN'. Marketing could have concentrated on this
message; posters and rolling screen saver demos showing SCRAN resources for a
locality, tailored for each public library in that locality, would have much more effectively revealed the depth of SCRAN's Scottish resource base. Behind the marketing
should have been a range of new services that would engage users (for example picture 'tours' of a locality as it looked in the past, opportunities for individuals to contribute their personal resources to their public library, etc). Public libraries have been
accused recently in the UK of not developing their image beyond being mere lenders
of books, and the success of a new online service based around reminiscence would
have been a great triumph. It is clear from comments quoted above that SCRAN has
been the source of many moments of deep satisfaction for public library users and
staff who found its material of local and personal relevance.
That there is value in SCRAN is fully supported by anecdotal evidence but that
value is highly personal and transitory and not embedded as an expected feature of
public library services. There was also a generally supported wish for a publicly
funded archive of freely available digital resources commemorating and celebrating
Scottish culture. This creates tension between SCRAN as a commercial entity and the
publicly funded library service which supplies it with free content only to be charged
later to access that same content. The irony is that SCRAN was formed with Millennium Commission funds initially, and has navigated into being a commercial subscription service, while maintaining some funding from public sources for specific
projects from time to time, like the Scottish Executive funding making possible the
initiative evaluated here. While there is nothing wrong per se in commercialising
successful digital library projects, the commercial rationale ought not to conflict with
the public interest, in this case for free public access to materials that are clearly
owned by the public. The most negative comment made by public library staff was
that “SCRAN is a product whose time has gone”. As a counter-example, the British
Library’s website was cited as a free site which offered much the same facilities as SCRAN.
The issue of transferring ownership of a library’s own materials was of particular
concern to public library staff. Without a SCRAN subscription, a library authority,
and the public in the local communities it serves, could not view their own contributions to the SCRAN site. This means, in essence, that public library staff in that
library authority would have to hand a list of the material they had provided, but
members of the public served by that library authority would be blocked from accessing more than mere thumbnails of material that in theory belongs to them through
their authority’s ownership of the material. This would happen in non-subscribing
library authorities throughout Scotland. The ethos behind the Creative Commons
(http://creativecommons.org/worldwide/scotland/) licensing based on Scottish law
encourages the sharing of digital resources with the owner retaining IPR but allowing
pre-agreed use of the resource. A distributed environment incorporating the Creative
Commons license for Scotland would offer an opportunity to access digital material
that was owned in the public domain.
There is a much bigger question of what that distributed environment would look
like. What needs to be addressed is exactly how the Scottish digital heritage will be
developed and accessed, whether that heritage should be held in a centralised commercial database or decentralised in a managed set of collections held by the public
sector bodies that accumulate that heritage. We believe that provision of a national
database of cultural materials could easily be provided by public bodies in Scotland if
provided with appropriate funding. What is necessary is to ensure that rather than
training for a specific service such as SCRAN, staff members in cultural institutions
are trained to create and manage their own digital materials under a national umbrella.
This would negate the need for the nation’s cultural institutions to be reliant on commercial providers for delivering their digital materials, and instead allow the public to
access their heritage free of charge.
Acknowledgements
The evaluation project of SCRAN was funded by the Scottish Library and Information Council. We are grateful to SLIC staff members, especially to Ms. Elaine Fulton,
and SCRAN staff members for their cooperation and support in this project.
We would like to express our gratitude to all the users and LIS professionals who
took part in the user survey. The full evaluation report is available at:
A Logging Scheme for Comparative Digital
Library Evaluation
Claus-Peter Klas¹, Hanne Albrechtsen², Norbert Fuhr¹, Preben Hansen⁵,
Sarantos Kapidakis⁴, Laszlo Kovacs³, Sascha Kriewel¹, Andras Micsik³,
Christos Papatheodorou⁴, Giannis Tsakonas⁴, and Elin Jacob⁶
¹ University of Duisburg-Essen, Duisburg, Germany
² Institute of Knowledge Sharing, Copenhagen, Denmark
³ MTA SZTAKI, Budapest, Hungary
⁴ Ionian University, Kerkyra, Greece
⁵ Swedish Institute of Computer Science, Kista, Sweden
⁶ Indiana University Bloomington, Bloomington, USA
Abstract. Evaluation of digital libraries assesses their effectiveness, quality and overall impact. To facilitate the comparison of different evaluations and to support the re-use of evaluation data, we propose a new logging schema. This schema will allow for logging and sharing of a wide array of data about users, systems and their interactions. We discuss the multi-level logging framework presented in [19] and describe how the community can add to and gain from using the framework. The main focus of this paper is the logging of events within digital libraries on a generalised, conceptual level, as well as the services based on it. These services will allow diverse digital libraries to store their log data in a common repository using a common format. In addition, they provide means for the analysis and comparison of search histories.
Introduction
Evaluation of digital libraries (DLs) aims at assessing their effectiveness, quality and overall impact. Analysis of transaction logs is one evaluation method
that has provided DL stakeholders with substantial input for making managerial decisions and establishing priorities, as well as indicating the need for system
enhancements. However, the quantitative nature of this method is often criticised
for its inability to provide in-depth information about user interactions with the
DL being evaluated. The results of logging studies are often localised and not
easily interpretable outside the DL being investigated. The problem of generalisability is compounded by the absence of a standardised logging scheme that
could map across the various logging formats being used. The development of
such a scheme would facilitate comparisons across DL evaluation activities and
provide the means for highlighting critical events in user behaviour and system performance.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 267–278, 2006.
© Springer-Verlag Berlin Heidelberg 2006
The logging scheme proposed in this paper is framed within the DL evaluation
activities of the DELOS Network of Excellence. In our whitepaper on DL evaluation [9], we have identified the long-term goal of building a community for DL
evaluation. Under the umbrella of an experimental framework that will serve
as a theoretical and practical platform for the evaluation of DLs, the proposed
logging scheme will allow for meaningful interpretation and comparison of DL
transactions. By using this scheme, researchers will be able to extract re-usable
data from the results of previous DL evaluations and to identify useful benchmarks, allowing for more efficient and effective design of evaluation studies.
In order to support the re-use of all possible evaluation data, we intend that
the proposed scheme will account for all manner of data that can be collected
from the user, the system and the user-system interaction. To this end, we are
proposing a novel, multi-level logging framework that will provide complete coverage of the different aspects of DL usage. The main focus of this paper is the
level of conceptual, generalised user actions, for which we describe the logging
scheme in some detail. Based on this specification, we present tools that can help
DL stakeholders to analyse the logging data according to their specific interests.
The remainder of this paper is structured as follows: A brief survey of related
work on DL logging is given in Section 2. The levels of the proposed logging
scheme are presented in Section 3, while Section 4 addresses the events that
comprise the conceptual aspects of user actions. Section 5 presents an application
of the logging scheme in the Daffodil DL. Using researchers as an example, we
discuss how the various DL stakeholders can gain from a common logging scheme
in Section 6. In conclusion, Section 7 summarises the arguments presented in this
paper and outlines future work with logging schemes.
Related Work
In DL research, there is a tendency to identify patterns of information searching
behaviour through the use of system features such as Boolean operators, use of
ranking and sorting mechanisms, identification of the type and nature of queries
submitted, and analysis of the distribution of these features across users [17, 24, 18].
Transaction logs also provide a useful resource for the remote evaluation of
web-based information systems due to their ability to record every action that
occurs during the user’s interaction with a digital resource [6]. They are generally used to gather quantitative data and to optimise the generation of statistical
indicators. Logs are widely employed in the DL domain because of the consistency they provide with respect to the conditions of data collection; but logging
should be understood as only one aspect of the overall experimental framework,
as suggested by [4]. For websites and for bibliographic systems such as OPACs,
logs provide dependable and coherent information about the usage, traffic and
performance of a given system. However, because logs can only partially reveal
both the behaviour of the user and her level of satisfaction [7], richer information
is frequently derived by applying other methods of analysis such as sequential
pattern analysis and web surveys.
Gonçalves et al. proposed an XML log standard for digital library log analysis [12, 11]. They concentrate on high-level events that are generated by
user actions, and describe events for searching, browsing, updating and storing
actions. However, as pointed out by Cooper in [5], for a comprehensive transaction log analysis different sources of log data have to be taken into account. We
distinguish between different aspects and levels of logging, as described in Section 3. Each of these levels is supported by a standard XML schema that defines
the events of interest according to the special needs of the different stakeholders
interested in the logging data.
In addition, we extend the original set of events that have been proposed to
more comprehensively support the logging of actions allowed by modern DL services. For the analysis of logged events, we also differentiate between the various
stakeholders of a DL system, including, for example, system owners, librarians, end-users, developers and researchers. Each stakeholder group requires a particular view on the logging data in order to address its specific information needs.
To support these views on transaction log data and to combine and analyse data
from the different aspects of logging, a number of tools are being developed.
There are other efforts to standardise aspects of transaction logs or to provide
uniform classifications for their analysis. Yuan and Meadow [27] provide a codification of, among other aspects, the variables of participants in user studies.
Levels of Logging
When using transaction logs for evaluation, the main participants under survey
are the user and the system, as well as the content that is being searched, read,
manipulated, or created. The interaction between the system and the user can
be examined and captured at various levels of abstraction.
1. System parameters
2. User-system interaction
– UI events (keystrokes, mouse movement, etc.)
– Interface action events (data entry, menu selections, etc.)
– Service events (use of specific DL services)
– Conceptual events (generic actions)
3. User behaviour
On a very low level, researchers or developers might be interested in getting
information about key presses, mouse clicks and movements, and similar concrete interactive events. These can be grouped together into more meaningful
interactive events with specific graphical tools or interface elements, several of
which in turn might comprise an interaction with a digital library service like
a citation search service. For all three of these levels of abstraction, a common logging schema will help in comparisons, e.g. of the usability of specific interfaces or services between different digital libraries.
An additional level of abstraction, which will be called the conceptual level, can be seen as an abstraction away from the concrete services of specific digital library systems towards generalised actions that users can take when using a DL system. Several generic services that could be captured at this level have been proposed in [12] and [11]. Herein we suggest a modified and expanded version of this selection.
By separating the log data into user data, system data and different abstraction levels of interaction – all connected by timestamps and identifiers linking a specific user with each event – the transaction logs can be used for following a user’s actions chronologically along a specific tier of the model. On the other hand, a comprehensive analysis of what exactly happens across the various tiers during a search action of the user is also possible.
Of course a digital library system might only capture data on a low level of abstraction to minimise the overhead incurred by the logging, and use a mapping of UI events to more abstract events for later analysis.
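Such a post-hoc mapping from low-level UI events to more abstract event types can be sketched as follows; the event names and the mapping table are purely illustrative and not part of the proposed schema:

```python
# Hypothetical mapping of low-level UI events to conceptual event types.
# The keys and the conceptual names below are illustrative only.
UI_TO_CONCEPTUAL = {
    "text_entry:search_field": "Search",
    "button_click:search_go": "Search",
    "link_click:result_item": "Navigate",
    "double_click:result_item": "Inspect",
    "scroll:result_list": "Browse",
    "menu_select:sort_by_year": "Display",
}

def abstract_events(ui_events):
    """Collapse a stream of UI events into conceptual events, dropping
    consecutive duplicates (e.g. several scroll events that belong to a
    single Browse action)."""
    out = []
    for ev in ui_events:
        concept = UI_TO_CONCEPTUAL.get(ev)
        if concept and (not out or out[-1] != concept):
            out.append(concept)
    return out
```

For example, a text entry followed by a click on the search button, two scroll events and a double click would be abstracted to the sequence Search, Browse, Inspect.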
The system tier of the logging model describes the changes of the system over
time, as represented by various parameters of the hardware and software involved
in providing the digital library services. These parameters can be captured directly by the backend, usually without involvement of the client software used
to access the digital library.
Aspects of the system fall into two groups: static parameters like operating system, available memory, bandwidth or computing power; and dynamic parameters that change over time, usually in reaction to user actions, like server load, number of connected users, network traffic and ping times.
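The static/dynamic split at the system tier can be sketched as below; the field names are illustrative, and a real backend would also record server load, network traffic and ping times:

```python
import platform
import time

def snapshot_system_parameters(connected_users=0):
    """Collect a minimal set of system-tier parameters: static fields
    are fixed per host, dynamic fields change with each snapshot.
    (Illustrative sketch, not the actual schema.)"""
    return {
        "static": {
            "os": platform.system(),
            "python": platform.python_version(),
        },
        "dynamic": {
            "timestamp": time.time(),
            "connected_users": connected_users,
        },
    }
```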
User Behaviour
At the other end of the spectrum is the tier representing the user and her behaviour, her changing and evolving cognitive model, and her interactions with
the environment outside the digital library system. While system behaviour is
usually easiest to capture, user behaviour can hardly be captured through transaction logs. Other methods are common within user studies, e.g. video capture, think-aloud studies, search diaries, interviews or questionnaires.
As with systems, aspects describing the user can be grouped into static and
dynamic parameters. Static parameters like age, first language, professional or search background, and social and organisational environment usually do not change over the course of one session. Dynamic parameters, on the other hand, might include frustration levels, or the user’s progression through the stages of her information task.
User-System Interaction
The logging of the interaction of user and system in distributed systems is complicated by the fact that much of the interaction occurs at the client. Low-level
interactive events therefore need to be captured at the site of the user and be
transmitted to the server, which introduces difficulties for browser-based user clients. In these cases only higher-level interactions with corresponding events on the server side of the digital library might be logged and analysed later [14].
UI Level. On the lowest level of user-system or human computer interaction
(HCI), events consist of single input events like keystrokes, mouse clicks or
mouse movements. These are of interest for usage and usability studies of
systems, and have been analysed and studied in HCI research for a long
time [15]. This level corresponds to the keystroke level of the GOMS model
[16] and can be combined with low-level events captured by other means
(e.g. eye movements).
Interface Level. It is common practice in HCI research [13, 14] to group low-level events into higher-level abstractions for specifying more general models of the user-system interaction. A file selection that incorporates several mouse movements, clicks and textual input is combined into a single, abstract interface action. On this level, e.g., the number and kind of interface actions necessary to complete a specific task can be compared between several digital library systems.
Service Level. In a further step of abstraction the service level combines
several interface actions into more meaningful services provided by digital library systems. Most DL systems offer a number of different services
that support searching and other tasks of the information seeking process,
e.g. metadata or fulltext search, search for citations, annotation of documents, services for organising personal collections of DL objects, or for supporting the reviewing of documents. Depending on the tools and options offered by a specific system, different combinations of interface actions may be used to invoke the same service.
Conceptual Level. While the first three levels of user-system interaction
combine actions from the level below into a larger, higher-level action, the
conceptual level represents an abstraction away from the concrete implementation of specific services, and tries to define generalised types of events.
These generalised events or conceptual events represent the various actions
that a user might pursue in a digital library system. While this list is probably not comprehensive, care has been taken to describe these actions in a
general way that can be applied to most of today’s DL systems.
Events on the Conceptual Level
On the conceptual level, we have identified several general event types that
support comparative evaluation across DLs. These events are partially in line
with the events proposed in [12, 11]. They identified some generic events –
search, browse, update, store – and some higher concepts – annotate, filtering, recommending, rating, reviewing – which they call transactions. Our focus
on the conceptual level represents the centrality of these events for log analysis
and interpretation, because they indicate critical aspects of the user’s interaction with the DL system and supply valuable data for rich interpretation of user
behaviour. As has been highlighted in other DL logging studies [22], current approaches are often inadequate for capturing complex or abstract actions by the
user and are therefore unable to elicit meaningful conclusions.
By logging data about general event types at the concept level, we provide a
basis for comparative evaluation across DLs.
The event types and event properties that we have identified are neither fixed
nor a comprehensive model of user-system interaction and should be viewed as
recommendations that can also serve as discussion points. Each event consists
of its own set of properties modelled in XML as sub-elements and attributes.
Properties that are common for all events are:
– a unique session-id
– start-stop timestamps
– possible errors during the event
– a unique event-id
– a service name
– a cancellation indicator
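As a sketch, an event carrying these common properties might be built as follows; the element and attribute names are illustrative, since the published XML schema may name them differently:

```python
import time
import uuid
import xml.etree.ElementTree as ET

def make_event(service, event_type, session_id, cancelled=False, error=None):
    """Build the skeleton of a logged event with the properties common
    to all event types (illustrative names, not the actual schema)."""
    ev = ET.Element("event", {
        "type": event_type,
        "event-id": str(uuid.uuid4()),
        "session-id": session_id,
        "service": service,
        "cancelled": str(cancelled).lower(),
    })
    ET.SubElement(ev, "start").text = f"{time.time():.3f}"
    ET.SubElement(ev, "stop")  # filled in when the event completes
    if error is not None:
        ET.SubElement(ev, "error").text = error
    return ev

ev = make_event("metadata-search", "Search", session_id="s-42")
xml_string = ET.tostring(ev, encoding="unicode")
```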
In addition, each event as described herein also has an event-specific set of properties, which are summarised in Table 1. If a study requires the collection of additional data about specific events, we suggest extending the standard event definitions through reference to an XML namespace that defines the new properties. For example, the standardised search event describes the search condition as a list of terms. However, many DLs allow for more complex query formulations such as Boolean queries, which could be stored in a specific field defined in the extended namespace as a sibling of the list-of-terms input element. Although comparison across DLs will only work on the list-of-terms property defined in the original namespace, researchers who require an extended view of logging events are guaranteed that no information will be lost in the process.
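The namespace-based extension mechanism can be illustrated as follows; the namespace URIs and element names are placeholders, not those of the actual schema:

```python
import xml.etree.ElementTree as ET

# Placeholder namespaces for the core schema and a hypothetical extension.
CORE = "http://example.org/dl-log/core"
EXT = "http://example.org/dl-log/boolean-ext"

# A search event with the standard list-of-terms input element ...
search = ET.Element(f"{{{CORE}}}search")
inp = ET.SubElement(search, f"{{{CORE}}}list-of-terms")
for term in ["digital", "library", "evaluation"]:
    ET.SubElement(inp, f"{{{CORE}}}term").text = term

# ... and the extended Boolean query as a sibling in its own namespace.
bq = ET.SubElement(search, f"{{{EXT}}}boolean-query")
bq.text = "digital AND (library OR libraries) AND evaluation"

# A consumer that only knows the core namespace still sees the terms:
terms = [t.text for t in
         search.findall(f"{{{CORE}}}list-of-terms/{{{CORE}}}term")]
```

A core-only analysis tool simply ignores the extension element, while an extended consumer can recover the full query.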
Search events represent any user action that involves formulating a query or filter condition that is to be processed by a DL service against a collection.
The collection can be the entire document space, already be pre-filtered or
even be the result of a previous query. The system response consists of the
subset (e.g. in the form of a ranked list) of objects from the initial collection
that satisfy the given condition.
Navigate events represent actions that consist of selecting a specific item from
a set of items or following a link to its target. This conceptual event includes
the use of hyperlinks to navigate within a set of hypertext documents, but
also navigating through a representation of a social network (e.g. of coauthors).
Inspect events capture the user actions of accessing a detailed view of a single
object, like the metadata or fulltext of a result document. Similarly, looking up a definition of a term in an encyclopedia or dictionary, or semantic
information from a thesaurus is seen as an inspection of this term.
Display events describe specific visualisations of DL objects. While the actual content of the presented information does not necessarily change, the change in view on this information is represented by a display event. This conceptual event encompasses actions ranging from a simple resorting of a list of objects or the presentation of a set of terms in the form of a tag cloud, up to complex visual representations of abstract information. In the interest of comparability, the use of a standardised taxonomy or classification of visualisation techniques is proposed, e.g. based on Shneiderman’s task by data type taxonomy [25] or the classification of visualisation techniques in [3].

Table 1. Subelements and properties of events

Search: query or filter condition, collection to be searched, system response
Navigate: link to be followed, current collection, system response
Inspect: object to be inspected, system response
Display: collection to be (re-)displayed, visual transformation or visualisation to be applied, sort criteria (optional)
Browse: collection to be browsed, method and dimensions of browsing, direction and distance that the view point is moved
Store: set of DL objects to be stored, target location, method of storage
Annotate: document (optionally part of) to be annotated, type and content of the annotation (may be one or more other DL objects)
Author: the new document, optionally identifier of changed document
Help: help request (optional), type and content of system suggestion
Communicate: type, content and recipients of message
Browse events describe user actions that involve changing the view point on
a set of DL objects without changing the visualisation or navigating to a
different set of items. Typically these actions will involve scrolling in one
or more dimensions, using sliders to zoom in or out of a visualisation, or
“thumbing” through a document. If the original set of documents has been
split into several chunks, browsing might also describe moving from one
chunk of the document set to the next or previous one.
Store events are actions of the user or the system that create a permanent or temporary copy of a digital library object. This might be a digital copy
to a clipboard for temporary storage during a search session or to a more
permanent location either within the DL system or outside (e.g. on an optical
medium or in a web storage). Storing a digital library object can also mean
converting from digital to physical form (printing a document), or exporting
to a special citation format.
Annotate events cover any user action that adds additional information to an
existing DL object, which may be user-specific, shared among a group of
collaborators, or visible system wide. The general annotation event includes
marking entire documents or specific parts, adding ratings, tags [10] or textual comments like reviews or summaries to an object, or linking two or more
DL objects.
Author events describe the creation of a new DL object or direct editing (not
annotating) of an existing one. Authoring a document can include writing
a review or another type of textual annotation, or can be part of creating a
completely new document if supported by the digital library system.
Help events from the user’s point of view can be of a passive nature, where the
system provides unprompted suggestions to the user, or of a more interactive
nature. In the latter case the user explicitly requests help, suggestions or
recommendations about a specific or general topic, and the system generates
a response to that. Unprompted help can take the form of recommendations
about content, users or specific actions, or provide explanation about functions or system activity.
Communicate events capture events that occur during the communication between two or more users of the digital library system. The communication
can be textual or include other media. This general event includes the direct
sharing of a DL object with another user and sending messages to other
users by using instant messaging, message boards or e-mail components of
the system. More technically sophisticated systems might allow for sending
voice or video messages between users as well. Digital library services for collaboration will typically also include means of communication for managing
the collaboration.
Some of the events contain the actual digital library objects, either directly or in the form of an object identifier such as a DOI, URL or URI. If such a unique identification is not possible, the main metadata fields (or all of them) should be provided to distinguish the objects.
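This fallback rule for referencing DL objects in events can be captured in a small, hypothetical helper:

```python
def object_reference(identifier=None, metadata=None):
    """Return the reference to a DL object for inclusion in an event:
    a unique identifier (DOI, URL, URI) when available, otherwise the
    distinguishing metadata fields. (Illustrative helper, not part of
    the proposed schema.)"""
    if identifier:
        return {"id": identifier}
    if metadata:
        return {"metadata": metadata}
    raise ValueError("either an identifier or metadata is required")
```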
Logging in Daffodil
Daffodil [23, 21, 8] is a virtual DL targeted at strategic support of users
during the information search process. For searching, exploring and managing DL
objects Daffodil provides information seeking patterns that can be customised
by the user for searching over a federation of heterogeneous digital libraries.
Searching with Daffodil makes a broad range of information sources easily
accessible and enables quick access to a rich information space.
Daffodil is an application consisting of a graphical user interface and back-end
services written in Java. This makes logging either UI events within the client or back-end events triggered by user actions or by the system a much simpler process than for web-based DL systems. The logging service itself, depicted in Figure 1, is simple: the Daffodil user interface client and each Daffodil back-end service can send an event to the event logging service, which then stores the event.
Fig. 1. Logging Service Model

Currently, Daffodil handles over 40 different events. The main groups of events are search, navigate and browse events; result events are generated by each of the system services (e.g., the thesaurus, the journal and conference browser, or the main search tool). The personal library supports store events as well as events involving annotation or authoring of objects.
Through the Daffodil service for log schema conversion, we are able to
provide more than 100 MB of log data in the format of the proposed XML
schema. The data will be anonymised and soon be made accessible for comparative analysis.
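A conversion service of this kind might, in the simplest case, map a line-based native log entry onto the common XML event format; the native field layout below is invented for illustration and is not Daffodil's actual format:

```python
import xml.etree.ElementTree as ET

def convert_line(line):
    """Convert one line of a hypothetical native log format
    ('session;start;stop;service;event-type') into the common XML
    event format. Field layout and element names are illustrative."""
    session, start, stop, service, event_type = line.strip().split(";")
    ev = ET.Element("event", {
        "type": event_type,
        "session-id": session,
        "service": service,
    })
    ET.SubElement(ev, "start").text = start
    ET.SubElement(ev, "stop").text = stop
    return ev

ev = convert_line("s-7;1158490800;1158490805;thesaurus;Inspect")
```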
Analysis of Log Events
In order to analyse the logged events, we have assumed that different stakeholders
need different views of the logging data; thus, a variety of analysis tools is required. We have identified the following DL stakeholders: system owners, content providers, system administrators, librarians, developers, scientific researchers, and end-users of a DL.
A number of tools for facilitating analysis of log data in the new scheme have
already been implemented. The example statistics shown in Figures 2 and 3 were
produced with the help of these tools.
Fig. 2. Inspected Objects
Fig. 3. Usage per hour
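The kind of aggregation behind a usage-per-hour statistic such as Figure 3 can be sketched as follows (a sketch only; the implemented tools may aggregate differently):

```python
from collections import Counter
from datetime import datetime, timezone

def usage_per_hour(timestamps):
    """Count logged events per hour of day (0-23) from event
    timestamps given as epoch seconds."""
    hours = (datetime.fromtimestamp(ts, tz=timezone.utc).hour
             for ts in timestamps)
    return Counter(hours)

# Four events: one at hour 0, two at hour 1, one at hour 2 (UTC).
counts = usage_per_hour([0, 3600, 3700, 7200])
```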
Having an experimental framework is a special boon for researchers in the
field of digital libraries. With Daffodil and its use of the standardised logging
scheme, a baseline system and a powerful toolbox for evaluation work can be
provided. For research on higher concepts, the researchers do not have to develop
or implement a complete setting, but can re-use the framework and build on it.
At the system level, the efficiency of algorithms or the appropriateness of DL
architectures can be evaluated and compared. HCI research on the usability of
digital library interfaces as well can benefit from a framework that provides
standardised, comparable logging data.
The major research focus of the Daffodil project has always been to place
the user in the center of digital library and information retrieval research. Daffodil is based on the information search model by Marcia Bates [2, 1] which
classifies search activities of users as moves, tactics, stratagems and strategies.
The proposed logging scheme and its conceptual level are a natural extension of this original aim, as they provide a method to analyse the sequences of moves and
tactics. As Wildemuth stated in [26]:
While significant work has examined the individual moves that searchers
make (Bates, 1979, 1987; Fidel, 1985), it is equally important to examine the sequences of moves made by searchers in order to understand
the cognitive processes they use in formulating and reformulating their searches.
With regard to this goal, the logging scheme allows for studying the usage
of the diverse services; furthermore, the whole search process/sequence can be
analysed, from the initial formulation of a search query to its conclusion with
the storage of DL objects for further use.
In addition to capturing and understanding users’ search activities through
analysis of logging data, recommendations can be made for re-use of search
patterns. Kriewel [20] suggests that search paths discovered in analysis of logging
data can be used as a basis for suggesting potential search steps to other users.
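Extracting per-session sequences of conceptual events, the unit of analysis for sequences of moves and tactics, can be sketched as follows (record layout is illustrative):

```python
def session_sequences(events):
    """Group (session_id, timestamp, event_type) records into the
    chronologically ordered sequence of conceptual events per session."""
    sessions = {}
    for session_id, ts, event_type in sorted(events,
                                             key=lambda e: (e[0], e[1])):
        sessions.setdefault(session_id, []).append(event_type)
    return sessions

log = [("s1", 2, "Inspect"), ("s1", 1, "Search"),
       ("s2", 1, "Search"), ("s1", 3, "Store")]
seqs = session_sequences(log)
```

Session s1 then yields the sequence Search, Inspect, Store: a complete search process from query formulation to storage of a DL object.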
Of course, a vertical view across the levels can also be examined.
Summary and Outlook
In this paper, we have presented the first efforts to develop a standardised experimental framework for digital library evaluation. If the various DL stakeholders
can form a community and agree upon models, dimensions, definitions and common sense criteria for the evaluation of DL systems, the process of evaluation will
gain impetus among DL researchers. As an experimental platform for evaluation
both within and across DL systems, application of the Daffodil framework can
substantially advance this research area, since it currently provides functions and
services that are based on a solid theoretical foundation and well-known models.
The proposed logging scheme is a first step intended to encourage evaluation of
individual systems as well as comparisons across systems.
Most evaluation techniques require a great deal of preparation and effort
and are thus not easily replicated. In the case of online DL systems, this means
that the results of an evaluation often reflect a past snapshot of the system.
It is necessary to find ways for continuous, cost effective and (more or less)
automated evaluation of digital libraries. The suggested logging scheme is a first
step in this direction. Our group aims at establishing a community forum for
evaluators in order to promote the propagation of various tools and approaches,
and the exchange of experience.
As part of the effort to encourage a community forum for researchers interested in DL evaluation, we have published documentation for the logging scheme
on the forum website². The log analysis tools will follow soon, as they are under preparation. As a further step, large amounts of anonymised logging data
from two DL services will be made available. Such primary data will help the
community to improve the application of transaction logging and to compare
and experiment with sample data. In the long term, we envision that this effort
will evolve into a primary data repository, providing help for evaluators who
want to find similar scenarios together with logging data. In other fields of research, such primary data repositories are already well established and play an
important role in the conduct of research. (In fact, the provision of primary data
frequently counts as a publication, creating further incentives for this kind of
work.) An infrastructure of independent evaluator services, primary data repositories, logging tools and on-line questionnaires may provide computer-based
support for some of the costly and time-consuming tasks in evaluation, helping
the community gain sustainability. We therefore ask the community to form, to
discuss and agree upon a schema, and to contribute data and services in order to
take a step forward in DL evaluation.
References
[1] M. J. Bates. Idea tactics. JASIS, 30(5):280–289, 1979.
[2] M. J. Bates. Information search tactics. JASIS, 30(4):205–214, 1979.
[3] E. Bertini, T. Catarci, S. Kimani, and L. D. Bello. Visualization in digital libraries.
In From Integrated Publication and Information Systems to Virtual Information
and Knowledge Environments, pages 183–196, 2005.
[4] J. Bertot, C. McClure, W. Moen, and J. Rubin. Web usage statistics: measurement
issues and analytical techniques. Government Information Quarterly, 14(4):373–
395, 1997.
[5] M. D. Cooper. Design considerations in instrumenting and monitoring web-based
information retrieval systems. JASIS, 49(10):903–919, 1998.
[6] M. D. Cooper. Usage patterns of a web-based library catalog. JASIST, 52(2):137–
148, 2001.
[7] D. T. Covey. Usage and usability assessment: Library practices and concerns.
Technical report, Digital Library Federation, 2002.
[8] N. Fuhr, C.-P. Klas, A. Schaefer, and P. Mutschke. Daffodil: An integrated desktop
for supporting high-level search activities in federated digital libraries. In Research
and Advanced Technology for Digital Libraries. 6th European Conference, ECDL
2002, pages 597–612. Springer, 2002.
[9] N. Fuhr, G. Tsakonas, T. Aalberg, M. Agosti, P. Hansen, S. Kapidakis, C.-P. Klas,
L. Kovács, M. Landoni, A. Micsik, C. Papatheodorou, C. Peters, and I. Sølvberg.
Evaluation of digital libraries, 2006.
http://www.is.informatik.uni-duisburg.de/wiki/index.php/JPA 2 - WP7
[10] S. A. Golder and B. A. Huberman. The structure of collaborative tagging systems.
Technical report, Information Dynamics Labs, HP Labs, 2005.
[11] M. A. Goncalves, E. A. Fox, L. Cassel, A. Krowne, U. Ravindranathan, G. Panchanathan, and F. Jagodzinski. Standards, mark-up, and metadata: The XML log
standard for digital libraries: analysis, evolution, and deployment. In Proceedings
of the third ACM/IEEE-CS joint conference on Digital libraries, 2003.
[12] M. A. Goncalves, R. Shen, E. A. Fox, M. F. Ali, and M. Luo. An XML log standard
and tool for digital library logging analysis. In Agosti, Maristella (ed.) et al., Proc.
of 6th ECDL, pages 129–143, 2002.
[13] M. Y. Ivory and M. A. Hearst. The state of the art in automating usability
evaluation of user interfaces. ACM Computing Surveys, 33(4):470–516, 2001.
[14] J. Helms, D. Neale, P. Isenhour, and J. M. Carroll. Data logging: Higher-level capturing and multi-level abstracting of user activities. In Proceedings of the 40th
annual meeting of the Human Factors and Ergonomics Society, 2000.
[15] D. M. Hilbert and D. F. Redmiles. Extracting usability information from user
interface events. ACM Computing Surveys, 32(4):384–421, 2000.
[16] B. E. John and D. E. Kieras. Using GOMS for user interface design and evaluation:
Which technique? ACM Trans. Comput.-Hum. Interact., 3, 1996.
[17] S. Jones, S. J. Cunningham, R. McNab, and S. Boddie. A transaction log analysis
of a digital library. International Journal on Digital Libraries, 3(2):152–169, 2000.
[18] H. Ke, R. Kwakkelaar, T. Tai, and L. Chen. Exploring behavior of e-journal users
in science and technology: Transaction log analysis of Elsevier's ScienceDirect
OnSite in Taiwan. Library & Information Science Research, 24(3):265–291, 2002.
[19] C.-P. Klas, H. Albrechtsen, N. Fuhr, P. Hansen, E. Jacob, S. Kapidakis, L. Kovács,
S. Kriewel, A. Micsik, C. Papatheodorou, and G. Tsakonas. An Experimental
Framework for Comparative Digital Library Evaluation: The Logging Scheme. In
JCDL, Short Paper, 2006.
[20] S. Kriewel. Finding and using strategies for search situations in digital libraries.
Bulletin of the IEEE Technical Committee on Digital Libraries (to appear), 2005.
[21] S. Kriewel, C.-P. Klas, A. Schaefer, and N. Fuhr. Daffodil - strategic support
for user-oriented access to heterogeneous digital libraries. D-Lib Magazine, 10(6),
June 2004.
[22] B. Pan. Capturing users' behavior in the National Science Digital Library (NSDL).
Technical report, NSDL, 2003.
[23] A. Schaefer, M. Jordan, C.-P. Klas, and N. Fuhr. Active support for query formulation in virtual digital libraries: A case study with DAFFODIL. In Proc. of
7th ECDL, 2005.
[24] M. Sfakakis and S. Kapidakis. User Behavior Tendencies on Data Collections
in a Digital Library, volume 2458 of Lecture Notes In Computer Science, pages
550–559. Springer-Verlag, Berlin; Heidelberg, 2002.
[25] B. Shneiderman. The eyes have it: A task by data type taxonomy for information
visualizations. Technical Report CS-TR-3665, University of Maryland, Department of Computer Science, July 1996.
[26] B. M. Wildemuth. The effects of domain knowledge on search tactic formulation.
JASIST, 55(3):246–258, 2004.
[27] W. Yuan and C. T. Meadow. A study of the use of variables in information
retrieval user studies. JASIS, 50(2):140–150, 1999.
Evaluation of Relevance and Knowledge
Augmentation in Discussion Search
Ingo Frommholz and Norbert Fuhr
University of Duisburg-Essen
Duisburg, Germany
[email protected],
[email protected]
Abstract. Annotation-based discussions are an important concept for
today’s digital libraries and those of the future, containing additional
information to and about the content managed in the digital library. To
gain access to this valuable information, discussion search is concerned
with retrieving relevant annotations and comments w.r.t. a given query,
making it an important means to satisfy users’ information needs. Discussion search methods can make use of a variety of context information
given by the structure of discussion threads. In this paper, we present
and evaluate discussion search approaches which exploit quotations in
different roles as highlight and context quotations, applying two different strategies, knowledge and relevance augmentation. Evaluation shows
the suitability of these augmentation strategies for the task at hand; especially knowledge augmentation using both highlight and context quotations boosts retrieval effectiveness w.r.t. the given baseline.
Introduction
Annotation-based discussions have been identified as an important concept for
future digital libraries, supporting collaboration between users [3]. With annotations, a user can comment on the material at hand and others’ annotations. As
an example for an existing system, the COLLATE prototype uses nested public annotations as a building block for collaborative discussion in a community
of scientists, with the purpose of interpreting the digital material at hand [13].
Other examples are web-based newswire systems like ZDNet News1 which allow
users to annotate published articles and other users’ comments. In each of these
systems, users can change their role from a passive reader to an active content
provider. Stored discussion threads can be a helpful source for satisfying users’
information needs: On the one hand, annotations can be exploited as auxiliary
objects for document search, and on the other hand they are retrieval targets
themselves in discussion search. It becomes clear that discussion search is an important means for uncovering valuable knowledge in information systems such
as digital libraries.
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 279–290, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Fig. 1. A discussion thread
In this paper we present our discussion search approaches based on strategies
called knowledge and relevance augmentation, respectively. The methods and results reported here continue the work and preliminary evaluation introduced in
[4, 5]. In the next section we briefly present the test collection and our view on
emails as annotations. We then introduce possible discussion search approaches
in a probabilistic, logic-oriented framework. Subsequently we describe our experiments and discuss their results. We conclude after presenting some related work.
The Annotation View on Emails
In order to evaluate our discussion search approaches discussed below, we had to
find a suitable test collection. Due to the lack of a “real” digital library testbed
containing annotation threads, we participated in last year’s TREC Enterprise
Track² in the discussion search task, where relevant emails had to be found
[4]. The collection consists of 174,307 emails from several W3C discussion lists.
Figure 1 shows an example excerpt of a discussion thread. Email replies usually
consist of two different parts; the quotations, which are passages from the original
text, and the new part containing the actual comments (annotations) of the
email author. Quotations are usually prefixed by quotation characters like ‘>’;
combinations of them determine the quotation depth. Quotations are thus the
document fragments a comment belongs to. As an example, in m2 the comment
“Huh?...established for its use” belongs to the fragment “BTW...go bankrupt”
of m1. Applying the distinction between new parts and quotations as well as the
thread structure extracted from email headers, we can transform email discussion
threads into annotation threads with fragments (determined by quotations) as
annotation targets. Due to the fact that whole new parts of emails were the
primary target of the discussion search task, we applied one simplification. All
quotations and all new parts of an email were merged, so that each email now
consisted of one (merged) new part and at most one (merged) quotation part.
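The quotation-based splitting just described can be sketched as follows (an illustrative Python reconstruction, not the authors' actual preprocessing code; the example email body is hypothetical):

```python
def split_email(body):
    """Split a raw email body into (new_part, quotation_part).

    Lines prefixed by one or more '>' characters are quotations; following
    the simplification in the paper, all quoted lines are merged into one
    quotation part and all remaining lines into one new part.
    """
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Strip the quotation prefix (e.g. '>>> ') before merging.
            quoted_lines.append(stripped.lstrip("> ").rstrip())
        else:
            new_lines.append(line.rstrip())
    return " ".join(new_lines).strip(), " ".join(quoted_lines).strip()

body = "> The original fragment being commented on\nAnd here is the new comment"
new_part, quotation = split_email(body)
```

The quotation depth (number of leading '>' characters) is discarded here, since the merged representation only distinguishes new parts from quotations.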
Discussion Search Approaches
We implement our retrieval functions in predicate logic, in particular probabilistic Datalog (pDatalog). We will briefly introduce pDatalog before discussing our
retrieval approaches.
Probabilistic Datalog
pDatalog [7] is a probabilistic variant of predicate logic. Similar to Prolog, its syntax consists of variables, constants, predicates and Horn clauses. Capital letters
denote variables. Probabilities can be assigned to facts. Consider the following
example program:
0.7 about(d1,"databases"). 0.5 about(d1,"retrieval").
retrieve(D) :- about(D,"databases").
retrieve(D) :- about(D,"retrieval").
Probabilistic facts model extensional knowledge. The about predicate says that
d1 is about ‘databases’ with 0.7 probability and about ‘retrieval’ with 0.5 probability. Rules model intensional knowledge, from which new probabilistic facts are
derived. The rule retrieve means that a document should be retrieved when it is
about ‘databases’ or ‘retrieval’. With the given facts and rules, pDatalog would
now calculate the retrieval status values of a document d w.r.t. the retrieval
function retrieve as a combination of probabilistic evidence. In particular, if
e1 , . . . , en are joint independent events, pDatalog computes
$$P(e_1 \wedge \ldots \wedge e_n) = P(e_1) \cdot \ldots \cdot P(e_n) \qquad (1)$$

$$P(e_1 \vee \ldots \vee e_n) = \sum_{i=1}^{n} (-1)^{i-1} \sum_{1 \le j_1 < \ldots < j_i \le n} P\left(e_{j_1} \wedge \ldots \wedge e_{j_i}\right) \qquad (2)$$
For our example document d1, pDatalog would calculate
P (retrieve(d1)) = P (about(d1,"databases") ∨ about(d1,"retrieval"))
= 0.7 + 0.5 − 0.7 · 0.5 = 0.85
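The combination of independent probabilistic events can be sketched as follows (an illustrative Python rendering of Equations 1 and 2, not part of any actual pDatalog engine):

```python
from itertools import combinations

def p_and(probs):
    """Equation 1: the conjunction of independent events is the product."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def p_or(probs):
    """Equation 2: the disjunction via the inclusion-exclusion formula."""
    total = 0.0
    for i in range(1, len(probs) + 1):
        for subset in combinations(probs, i):
            total += (-1) ** (i - 1) * p_and(subset)
    return total

# The example document d1: P = 0.7 for "databases", 0.5 for "retrieval".
rsv_d1 = p_or([0.7, 0.5])  # 0.7 + 0.5 - 0.7 * 0.5 = 0.85
```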
Simple Content-Based Approach
In this baseline approach, we do not apply any context at all for discussion search.
Each document only contains new parts of an email, stripping all quotations. The
approach can be expressed with the following datalog rules:
wqterm(T) :- qterm(T) & termspace(T).
about(T,D) :- term(T,D).
retrieve(D) :- wqterm(T) & about(T,D).
The qterm predicate contains the query terms (after stemming and stopword
elimination). termspace contains the termspace, here regarding only the new
part of an email as a document. termspace thus contains all terms appearing in
new parts of emails. For each term t in the termspace, we set P(termspace(t)) =
P(t), which is interpreted as an intuitive measure of the probability that t is
informative. P(t) can be estimated based on the inverse document frequency
of t, idf (t) = − log (df (t)/numdoc), with df (t) as the number of documents in
which t appears and numdoc as the number of documents in the collection, and
$$P(t) \approx \frac{idf(t)}{maxidf} \qquad (3)$$

with maxidf being the maximum inverse document frequency. The wqterm
rule states that we weight a query term t according to P (t). term relates
terms to the documents they appear in. For each term t in document d,
P (term(t,d)) = P (t|d), the probability that we observe term t given document
d. P (t|d) is estimated as
$$P(t|d) \approx \frac{tf(t,d)}{avgtf(d) + tf(t,d)} \qquad (4)$$

where tf(t,d) is the frequency of term t in document d and avgtf(d) is the
average term frequency of d, calculated as $avgtf(d) = \sum_{t \in d_T} tf(t,d)/|d_T|$ with
$d_T$ being the document representation of d (i.e. the bag of words of d). We
say that a document is about a term if the term appears in the document;
this is modeled with the about rule. The retrieve rule is our actual retrieval
function. A document should be retrieved if it contains at least one query term.
The retrieval status value of d is determined by P (retrieve(d)) which in turn
depends on query and document-term weights, as described above. The result
list presented to the user ranks documents according to descending retrieval
status values.
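Putting the wqterm, about and retrieve rules together with Equations 3 and 4, the baseline ranking can be sketched as follows (an illustrative Python sketch over a hypothetical toy collection; the document contents and names are ours, not from the paper):

```python
import math
from collections import Counter

# Toy collection: each "document" is the new part of an email (bag of words).
docs = {
    "m1": ["annotation", "lawsuit", "web"],
    "m2": ["annotation", "comment"],
    "m3": ["web", "server", "server"],
}

numdoc = len(docs)
termspace = {t for words in docs.values() for t in words}

def idf(t):
    """idf(t) = -log(df(t)/numdoc)."""
    df = sum(1 for words in docs.values() if t in words)
    return -math.log(df / numdoc)

maxidf = max(idf(t) for t in termspace)

def p_t(t):
    """Query term weight P(t) ~ idf(t)/maxidf (Equation 3)."""
    return idf(t) / maxidf

def p_t_given_d(t, words):
    """Document term weight P(t|d) ~ tf/(avgtf + tf) (Equation 4)."""
    tf = Counter(words)
    avgtf = sum(tf.values()) / len(tf)
    return tf[t] / (avgtf + tf[t])

def retrieve(query, docs):
    """retrieve(D) :- wqterm(T) & about(T,D); the per-term conjunctions
    are disjoined as independent events (Equation 2)."""
    scores = {}
    for d, words in docs.items():
        p_not = 1.0
        for t in query:
            if t in words:
                p_not *= 1.0 - p_t(t) * p_t_given_d(t, words)
        scores[d] = 1.0 - p_not
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = retrieve(["annotation", "lawsuit"], docs)
```

Documents containing no query term receive a retrieval status value of zero and land at the bottom of the ranking.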
Context Quotations
In the last subsection we were only considering new parts of email messages for
retrieval. However, in an email discussion thread, we have the information about
the targets that a comment addresses, given by the quotations. Quotations are
an important source for determining what a new part is about, as can be seen
in message m3 in our example in Figure 1. If we only consider the new part
of the message, as we do in the approach described above, the system could
not infer that this part is actually about “annotation lawsuits”. m3 would not
be retrieved for such a query, although it would be relevant. Quotations thus
establish an important context for the new parts of messages; quotations are
referred to as context quotations when regarded as this kind of context
for new parts. We will now introduce our idea of exploiting context quotations
for discussion search, beginning with the obvious choice, merging quotations and
new parts by not distinguishing between them in email messages.
Merging Quotations and New Parts. This simple approach sees whole
emails as a document (instead of only new parts as in Subsection 3.2). Any
further parsing of email messages to distinguish between quotations and new
parts is not required here; all terms in a message are regarded as belonging to
the corresponding document. In an annotation scenario, this is similar to the
case where all annotation targets are merged with their respective annotation
to form a new document. We apply exactly the same predicates and rules as
those discussed in Subsection 3.2, except that the estimations of P (t) and P (t|d),
respectively, are now based on the view of a document being a full email message
(resulting in different values for term and document frequencies).
Knowledge Augmentation. While in the last approach context quotations
were merged with new parts, the approaches discussed next regard context quotations as separate, virtual documents. Thus, from a message m, two new documents are created: d_m, containing the new part of m, and quot_m, containing
m's quotations. quot_m is regarded as "virtual" since it is not to be retrieved, but
serves as an auxiliary document to determine the relevance of d_m. Furthermore,
each virtual document does not contribute to the document frequency of a term.
If a term t appears in both quot_m and d_m, only its appearance in d_m is counted
and used in Equation 3.
We introduce a new predicate quotedterm(t,d) which says that the
term t appears in the quotation quot belonging to document d. We set
P(quotedterm(t,d)) = P(t|quot), where the latter probability is estimated with
Equation 4. We apply a knowledge augmentation approach by extending our
about rule to
about(T,D) :- term(T,D).
about(T,D) :- acc("quotation") & quotedterm(T,D).
where acc("quotation") describes the event that a quotation is actually accessed when reading the unquoted part. P (acc("quotation")) is thus the probability that a quotation is considered. By extending the about rule like this, we
augment our knowledge of what a new part is about with the knowledge of what
the quotation is about. In this extended context, new terms are introduced which
appear in quotations only, and the probability that a document is about a term
is raised according to Equations 1 and 2 if we also observe this term in the quotation. The analogy to the real world is that if a user reads the new part first
and then the corresponding quotation, she augments her knowledge of what the
new part is about. The wqterm and the retrieve rules are the same as before.
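The extended about rule can be sketched as follows (an illustrative Python reconstruction; term_p and quoted_p are hypothetical lookup tables holding P(t|d) and P(t|quot_d), and the function name is ours):

```python
def about(t, d, term_p, quoted_p, p_acc):
    """Probability of the extended about rule for term t and document d.

    about(T,D) :- term(T,D).
    about(T,D) :- acc("quotation") & quotedterm(T,D).

    The two derivations are disjoined as independent events (Equation 2);
    p_acc is P(acc("quotation")).
    """
    p1 = term_p.get((t, d), 0.0)             # P(t|d) from the new part
    p2 = p_acc * quoted_p.get((t, d), 0.0)   # P(acc) * P(t|quot_d)
    return p1 + p2 - p1 * p2

# A term occurring only in the quotation still contributes,
# scaled by the access probability: 0.6 * 0.5 = 0.3.
p = about("lawsuit", "m3", {}, {("lawsuit", "m3"): 0.5}, 0.6)
```

Terms observed in both the new part and the quotation get a boosted probability, exactly the "raised according to Equations 1 and 2" behaviour described above.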
Relevance Augmentation. Another augmentation strategy we are going to
evaluate is relevance augmentation. Here, we augment the knowledge that a new
part is relevant with the knowledge that its corresponding quotation part is
relevant. The idea is that we infer to a certain degree the relevance of a new part
with the relevance of its quotation part. This context-based relevance decision
is performed by the system in two steps. First, the relevance of documents and
context quotations w.r.t. the query is determined:
rel(D) :- wqterm(T) & about(T,D).
quot_rel(D) :- wqterm(T) & quotedterm(T,D).
In the second step, this knowledge is combined, taking into account the probability that we actually access the quotation:
retrieve(D) :- rel(D).
retrieve(D) :- acc("quotation") & quot_rel(D).
(wqterm and about are the same as in Section 3.2).
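The two-step relevance augmentation can be sketched as follows (illustrative Python; the names rel and retrieve_ra and the probability tables are ours, not from the paper):

```python
def rel(query, about_p, d):
    """rel(D) :- wqterm(T) & about(T,D): per-term conjunctions are
    disjoined as independent events; query maps terms to weights P(t),
    about_p maps (term, doc) pairs to P(t|d)."""
    p_not = 1.0
    for t, w in query.items():
        p_not *= 1.0 - w * about_p.get((t, d), 0.0)
    return 1.0 - p_not

def retrieve_ra(query, about_p, quoted_p, p_acc, d):
    """Relevance augmentation:
    retrieve(D) :- rel(D).
    retrieve(D) :- acc("quotation") & quot_rel(D).
    The two derivations are combined as a disjunction (Equation 2)."""
    r = rel(query, about_p, d)             # relevance of the new part
    aq = p_acc * rel(query, quoted_p, d)   # scaled relevance of the quotation
    return r + aq - r * aq
```

Note the structural difference to knowledge augmentation: here whole relevance scores are combined, whereas knowledge augmentation combines individual term probabilities before scoring.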
Highlight Quotations
When a user annotates a (part of a) document, it is assumed that she found it
interesting enough to react to it. This means the annotation target is implicitly
highlighted and considered important by the annotation author, reaching a kind
of n-way consensus [9] of the significance of this part if n persons used it as
annotation target. Examining the quotations and the quotation levels of emails,
we can identify such highlighted parts of previous messages. A highlight quotation of a message m in another message m′ is the part of m which is quoted by
m′, where m′ is a (direct or indirect) successor of m in the discussion thread.
Consider the following simple example with 3 messages:
m1: line1.1
    line1.2
    line1.3
m2: > line1.2
    > line1.3
    line2.1
m3: >> line1.3
    > line2.1
m1 consists of 3 lines (line1.1 - line1.3). m2 quotes two of these lines, line1.2
and line1.3. m3 quotes a line from m1 (line1.3) and from m2 (line2.1). The
quotation in m2 containing line1.2 and line1.3 is a highlight quotation of m1.
Our claim is that line1.2 and line1.3 are important due to the fact that they
are quoted; line1.3 seems to be even more important since it is quoted in m3
as well. For an email message, we create a highlight quotation virtual document
from each quotation containing a fragment of this email message. In our example
we would create two highlight quotation virtual documents for m1: high_m1-m2
consists of line1.2 and line1.3 (the part of m1 quoted in m2), and high_m1-m3
contains line1.3 (the part of m1 quoted in m3). For m2, one virtual document
is created (high_m2-m3 containing line2.1). We use highlight quotation virtual
documents as a context for retrieval by performing knowledge and relevance
augmentation again.
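The construction of highlight quotation virtual documents can be sketched as follows (an illustrative Python reconstruction of the procedure, using the three-message example above; real email parsing would of course be more involved):

```python
def highlight_virtual_docs(thread):
    """Build highlight quotation virtual documents from a thread.

    thread maps a message id to (parent id or None, list of raw lines).
    A line with quotation depth k (k leading '>' characters) in message m'
    quotes the ancestor k levels above m', contributing to the virtual
    document high_<ancestor>-<m'>.
    """
    def ancestor(mid, k):
        for _ in range(k):
            mid = thread[mid][0]
        return mid

    vdocs = {}
    for mid, (_, lines) in thread.items():
        for line in lines:
            depth = len(line) - len(line.lstrip(">"))
            if depth == 0:
                continue  # a new (unquoted) line, not a quotation
            key = f"high_{ancestor(mid, depth)}-{mid}"
            vdocs.setdefault(key, []).append(line.lstrip("> ").rstrip())
    return vdocs

thread = {
    "m1": (None, ["line1.1", "line1.2", "line1.3"]),
    "m2": ("m1", ["> line1.2", "> line1.3", "line2.1"]),
    "m3": ("m2", [">> line1.3", "> line2.1"]),
}
vdocs = highlight_virtual_docs(thread)
# vdocs["high_m1-m2"] == ["line1.2", "line1.3"], vdocs["high_m1-m3"] == ["line1.3"]
```

This reproduces the three virtual documents of the example: high_m1-m2, high_m1-m3 and high_m2-m3.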
Knowledge Augmentation. To add highlight quotations, we introduce a new
predicate highlightterm(t,d,high) where t is a term, d a document and high
the highlight quotation where t appears. We set P(highlightterm(t,d,high)) =
P(t|high), again estimated with Equation 4. Knowledge augmentation is applied
by extending the about rule:
about(T,D) :- term(T,D).
about(T,D) :- acc("highlight") & highlightterm(T,D,H).
P (acc("highlight")) is the probability that we actually consider (access)
highlight quotations. A short note on the evaluation of highlightterm(T,D,H)
follows. In the second about rule, the variable H is free. For a possible
valuation D=d and T=t to determine about(t,d), pDatalog substitutes
highlightterm(t,d,H) with a disjunction containing all possible values H can
take. In our example above, let the term ‘developers’ appear in line1.3. Now,
with T="developers" and D=m1, highlightterm("developers",m1,H) is substituted with
highlightterm("developers",m1,high_m1-m2) ∨
highlightterm("developers",m1,high_m1-m3). The probability of this
disjunction is calculated and multiplied with P (acc("highlight")) to gain a
probability for the second about rule. wqterm and retrieve are the same as in
Section 3.2 here.
Relevance Augmentation. Relevance augmentation with highlight quotations is quite straightforward. Again, we need two steps:
rel(D) :- wqterm(T) & about(T,D).
high_rel(D,H) :- wqterm(T) & highlightterm(T,D,H).
determines the relevance of documents and highlight quotations, and
retrieve(D) :- rel(D).
retrieve(D) :- acc("highlight") & high_rel(D,H).
combines this evidence in the actual retrieval rules. wqterm and about are the
same as in Section 3.2.
We also conducted experiments where we combined the evidence gained from
highlight and context quotations. For knowledge augmentation, we combined
the corresponding about rules introduced in Sections 3.3 and 3.4 with wqterm
and retrieve identical as in Section 3.2. For relevance augmentation, we combined the rel, high rel, quot rel and retrieve rules in Sections 3.3 and 3.4,
respectively, with wqterm and about as before in Section 3.2.
Fig. 2. Knowledge and relevance augmentation: in (a), knowledge augmentation, the term weights of t1 and t2 (w1, w2 and w3) are propagated from the subcontexts quot_a and high_a to the supercontext a(quot_a, high_a); in (b), relevance augmentation, local retrieval status values are combined as RSV(a,q) = rsv(a,q) ⊕ acc · rsv(quot_a,q) ⊕ acc · rsv(high_a,q)
Non-probabilistic Formulation
The knowledge and relevance augmentation strategies are not bound to a probabilistic, logic-based formulation like the one we presented above with pDatalog.
Consider the example in Fig. 2. Here we can see an example annotation a with
a corresponding highlight quotation document high_a and a context quotation
document quot_a. acc models the probability that high_a or quot_a, respectively,
are accessed from a. With knowledge augmentation, the term weights (w1 and
w2 for t1 and w3 for t2) are propagated to the supercontext a(quot_a, high_a) according to the access probability. The operator ⊕ combines the weights from the
subcontexts in the supercontext; ⊕ can be a simple sum operator, or, as is the
case with pDatalog, formulated with the inclusion-exclusion formula in Equation 2. The calculated new term weights for t1 and t2 are then used to compute
the final retrieval status value RSV(a, q) of a w.r.t. the query q. When applying
relevance augmentation, we first calculate local retrieval status values rsv(a, q),
rsv(quot_a, q) and rsv(high_a, q), respectively, for the subcontexts; these values
are again combined in the supercontext a(quot_a, high_a) with the ⊕ operator in
order to compute the final retrieval status value RSV(a, q).
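With ⊕ instantiated as the probabilistic sum over independent events (equivalent to the inclusion-exclusion formula in Equation 2), the final combination for relevance augmentation can be sketched as follows (illustrative Python; osum and final_rsv are our names):

```python
def osum(*probs):
    """The ⊕ operator as the probabilistic sum of independent events:
    P(e1 ∨ ... ∨ en) = 1 - Π(1 - P(ei)), equivalent to Equation 2."""
    p_not = 1.0
    for p in probs:
        p_not *= 1.0 - p
    return 1.0 - p_not

def final_rsv(rsv_a, rsv_quot, rsv_high, acc):
    """RSV(a,q) = rsv(a,q) ⊕ acc·rsv(quot_a,q) ⊕ acc·rsv(high_a,q)."""
    return osum(rsv_a, acc * rsv_quot, acc * rsv_high)
```

Setting acc = 0 switches the augmentation off and recovers the baseline score of the annotation alone.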
Experiments and Results
The main goal of our experiments was to answer the question: can relevance
or knowledge augmentation increase retrieval effectiveness, and which strategy
should be preferred? Whereas for knowledge augmentation the first question
has already been answered [4], we have as yet not conducted any experiments
for relevance augmentation. We also provide the results of further runs for
knowledge augmentation, applying different values for P (acc("quotation"))
and P (acc("highlight")), respectively. For both probabilities, we used global
values ranging from 0.1 to 1.0, in steps of 0.1³. For our experiments, we used the
W3C email lists described in Section 2 with 59 distinct queries. Topics and relevance judgements were given by the participants of the TREC 2005 Enterprise
track. All runs were performed using HySpirit⁴, a pDatalog implementation.
Table 1 briefly describes the experiments and their settings.
Table 1. Description of experiments

Experiment                                                     Parameters
The baseline, only new parts                                   –
Merged quotations and new parts                                –
Knowledge augmentation with context quotations                 P(acc("quotation")) = x
Relevance augmentation with context quotations                 P(acc("quotation")) = x
Knowledge augmentation with highlight quotations               P(acc("highlight")) = x
Relevance augmentation with highlight quotations               P(acc("highlight")) = x
Knowledge augmentation with highlight and context quotations   P(acc("highlight")) = x, P(acc("quotation")) = y
Relevance augmentation with highlight and context quotations   P(acc("highlight")) = x, P(acc("quotation")) = y
Some selected results of our experiments are presented in Table 2, where
we show the mean average precision and the precision at selected numbers of
documents retrieved. The latter values are important user-oriented ones: users
tend to browse through the first 20 or even 30 top-ranked documents in a result
list, but usually do not go deeper in the ranking. The other runs not presented
here did not gain better results or considerably new insights. From the results
we can see that both relevance and knowledge augmentation improve retrieval
effectiveness: there are slight improvements with highlight quotations, and larger
improvements with context quotations. To our surprise, the experiment with
merged context quotations and new parts yields worse results than the baseline.
The combination of highlight and context quotations further improves retrieval
effectiveness. So we see that creating separate virtual documents from highlight
and context quotations and linking them with a certain access probability to their
corresponding document seems to be worth the effort. Regarding knowledge vs.
relevance augmentation, the results clearly show that knowledge augmentation is
to be preferred over relevance augmentation. In the case of context quotations,
knowledge augmentation can possibly handle the vocabulary problem better
(when query terms do not appear in the new part, but in the quotation), but
the exact reasons are not yet clear and subject to further investigation. Figure 3
shows the interpolated recall-precision averages of selected runs.
³ We bear in mind that this is only a preliminary solution; more advanced ones might
take evidence from the thread structure or from users' preferences to estimate
the access probability.
Table 2. Mean average precision and precision at 5, 10, 20 and 30 documents retrieved
for some selected runs. Best results are printed in bold.
Related Work
The studies performed by the Marshall group (see, e.g., [9, 10]) contain many
results and conclusions relevant for designers of annotation systems, which have
a strong impact on our work.
Fig. 3. Interpolated recall-precision graph of selected runs
The studies reported in Shipman et al. [12] focus on
the identification of high-value annotations in order to find useful passages in a
text. Agosti et al. examine annotations from a syntactic, semantic and pragmatic
view [2].
There are several approaches for annotation-based information retrieval and
discussion search. A relevance feedback approach where only highlighted terms
instead of whole documents are considered is reported to be successful [8]. [1]
reports on an approach where evidence coming from documents and the elements
in the annotation hypertext is combined using a data fusion approach. Xi et
al. evaluate a feature-based approach for discussion search in [15]; their results
show an increase in retrieval effectiveness when using the thread context. The
proceedings of the TREC 2005 conference contain many other evaluations of
discussion search approaches [14]. The idea of knowledge augmentation has its
roots in structured document retrieval and is discussed thoroughly in [11].
Conclusion
In this paper we presented some approaches for discussion search and their evaluation, using quotations in a special role as context and highlight quotations,
respectively. Based on probabilistic datalog, we applied two strategies, knowledge
and relevance augmentation. The results indicate that a knowledge augmentation
strategy combining highlight and context quotations is preferable. Knowledge
augmentation has another benefit: it is query independent to a certain degree,
meaning that P (about(t,d)) may be calculated offline as a post-indexing step,
whereas the relevance augmentation strategy can only be applied during query
processing. Based on the promising results gained so far, we proposed a probabilistic, object-oriented logical framework for annotation-based retrieval called
POLAR in [5].
Future work will concentrate on further evaluation and discussion of our augmentation strategies using context and highlight quotations. As a third source
of evidence, the content of annotations made to another annotation could also
be used for augmentation, as discussed for relevance augmentation in [6].
References
1. Maristella Agosti and Nicola Ferro. Annotations as context for searching documents. In Fabio Crestani and Ian Ruthven, editors, Information Context: Nature,
Impact, and Role: 5th International Conference on Conceptions of Library and
Information Sciences, CoLIS 2005, volume 3507 of Lecture Notes in Computer
Science, pages 155–170, Heidelberg et al., June 2005. Springer.
2. Maristella Agosti, Nicola Ferro, Ingo Frommholz, and Ulrich Thiel. Annotations
in digital libraries and collaboratories – facets, models and usage. In Rachel Heery
and Liz Lyon, editors, Research and Advanced Technology for Digital Libraries.
Proc. European Conference on Digital Libraries (ECDL 2004), Lecture Notes in
Computer Science, pages 244–255, Heidelberg et al., 2004. Springer.
3. Alberto Del Bimbo, Stefan Gradmann, and Yannis Ioannidis. Future research directions – 3rd DELOS brainstorming workshop report, DELOS Network of Excellence,
July 2004.
4. Ingo Frommholz. Applying the annotation view on messages for discussion search.
In Voorhees and Buckland [14].
5. Ingo Frommholz and Norbert Fuhr. Probabilistic, object-oriented logics for
annotation-based retrieval in digital libraries. In Proceedings of JCDL 2006, 2006.
In print.
6. Ingo Frommholz, Ulrich Thiel, and Thomas Kamps. Annotation-based document
retrieval with four-valued probabilistic datalog. In Thomas Roelleke and Arjen P.
de Vries, editors, Proceedings of the first SIGIR Workshop on the Integration of
Information Retrieval and Databases (WIRD’04), pages 31–38, Sheffield, UK, 2004.
7. Norbert Fuhr. Probabilistic Datalog: Implementing logical information retrieval for
advanced applications. Journal of the American Society for Information Science,
51(2):95–110, 2000.
8. Gene Golovchinsky, Morgan N. Price, and Bill N. Schilit. From reading to retrieval:
freeform ink annotations as queries. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, pages 19–25, New
York, 1999. ACM.
9. Catherine C. Marshall. Toward an ecology of hypertext annotation. In Proceedings
of the ninth ACM conference on Hypertext and hypermedia: links, objects, time and
space – structure in hypermedia systems, pages 40–49, 1998.
10. C. C. Marshall and A. J. Brush. Exploring the relationship between personal and
public annotations. In JCDL '04: Proceedings of the 4th ACM/IEEE-CS Joint
Conference on Digital Libraries, pages 349–357, New York, NY, USA, 2004. ACM.
11. Thomas Rölleke. POOL: Probabilistic Object-Oriented Logical Representation and
Retrieval of Complex Objects. PhD thesis, University of Dortmund, Germany, 1998.
12. Frank Shipman, Morgan Price, Catherine C. Marshall, and Gene Golovchinsky. Identifying useful passages in documents based on annotation patterns. In
Panos Constantopoulos and Ingeborg T. Sølvberg, editors, Research and Advanced
Technology for Digital Libraries. Proc. European Conference on Digital Libraries
(ECDL 2003), Lecture Notes in Computer Science, pages 101–112, Heidelberg et
al., 2003. Springer.
13. Ulrich Thiel, Holger Brocks, Ingo Frommholz, Andrea Dirsch-Weigand, Jürgen
Keiper, Adelheit Stein, and Erich Neuhold. COLLATE - a collaboratory supporting
research on historic European films. International Journal on Digital Libraries
(IJDL), 4(1):8–12, 2004.
14. E. M. Voorhees and Lori P. Buckland, editors. The Fourteenth Text REtrieval
Conference (TREC 2005), Gaithersburg, MD, USA, 2005. NIST.
15. Wensi Xi, Jesper Lind, and Eric Brill. Learning effective ranking functions for newsgroup search. In Kalervo Järvelin, James Allen, Peter Bruza, and Mark Sanderson,
editors, Proceedings of the 27th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 394–401, New York,
2004. ACM.
Designing a User Interface for Interactive
Retrieval of Structured Documents — Lessons
Learned from the INEX Interactive Track
Saadia Malik1 , Claus-Peter Klas1 , Norbert Fuhr1 ,
Birger Larsen2 , and Anastasios Tombros3
University of Duisburg-Essen, Duisburg, Germany
{malik, klas, fuhr}@is.informatik.uni-duisburg.de
Royal School of Library & Information Science, Copenhagen, Denmark
[email protected]
Queen Mary, University of London, United Kingdom
[email protected]
Abstract. The interactive track of the Initiative for the Evaluation of XML retrieval (INEX) aims at collecting empirical data about user interaction behaviour and at building methods and algorithms for supporting interactive retrieval in digital library systems containing structured documents. In this paper we discuss and compare the usability aspects of the web-based user interface used in 2004 with the application-based user interface implemented with the Daffodil framework in 2005. The results include a validation of the element retrieval approach, a successful implementation of the berrypicking model, and the observation that additional clues for facilitating interactive retrieval (e.g. table of contents, indication of entry points, related terms) are appreciated by users.
1 Introduction
Many of today’s DL systems still treat documents as atomic units, providing little
support for searching or navigating along the logical structure of documents.
With the steadily increasing use of the eXtensible Markup Language (XML), we
have a widely adopted standard format for structured documents. Thus, there
is now an opportunity for providing better support for structured documents
in digital libraries (DLs). Besides supporting navigation, the logical structure of
XML has the potential to assist the DL systems in providing more specific results
to users by pointing to document elements rather than to whole documents.
Since 2002, the Initiative for the Evaluation of XML Retrieval (INEX) has
organised annual evaluation campaigns for researchers in this field. However,
little research has been carried out to study user behaviour and to investigate
methods supporting interaction in the context of retrieval systems that take
advantage of the additional features offered by XML documents.
This work was funded by the DELOS Network of Excellence on Digital Libraries
(EU 6. FP IST, G038-507618).
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 291–302, 2006.
© Springer-Verlag Berlin Heidelberg 2006
In order to address these issues, an interactive track (iTrack) was added to
INEX in 2004. In this paper, we report on the usability issues addressed in the
interactive XML retrieval systems that formed the baseline in these tracks in
2004 and 2005 (hereafter called iTrack 04 and iTrack 05). We show how the
findings from the first year led to the development of an improved system in
2005, and we report on the user reactions to both systems.
In iTrack 04, the main goal was to study user behaviour with an XML retrieval system and to validate the element retrieval approach. For this, the user
interface design was kept simple in order to give a clear picture of element retrieval systems. During iTrack 04 many usability issues arose, and these led to
formulating the main hypotheses for iTrack 05. In addition, more elaborate design principles and the berrypicking paradigm [1] were followed for the iTrack
05 interface design.
This paper is structured as follows: Section 2 gives a brief overview of related
work. Section 3 describes the evaluation methodology, the user interface and
findings of iTrack 04. The description of the iTrack 05 user interface follows in
section 4, including the necessary adaptations derived from iTrack 04, the evaluation and findings. The last section presents a comparison of both evaluations
and an outlook.
2 Related Work
Classical information retrieval (IR) research has focused on a system-oriented
view and taken a simplified view of user behaviour: the user submits a query
and then looks through the ranked items one by one. Thus the goal of the system
is to rank relevant items at the top of the list. A broader perspective has been
taken in interactive IR research, as represented by the TREC interactive tracks
[2]. Quite surprisingly, results of these evaluations showed that differences in
system performance identified in laboratory experiments are hard to recreate in
interactive retrieval. As described in [3], this result is due to users being able
to easily identify relevant entries in a list of documents. Thus, cognitive factors
should be considered, as well as richer interaction functions, that can enhance
user interaction with the system.
Whereas the standard IR model assumes that the user’s information need
does not change throughout the search process, empirical studies (e.g. [4]) have
shown that interactive retrieval consists of a sequence of related queries targeting different aspects of an ever-changing information need. To cope with this problem, Bates proposed the berrypicking model of information seeking, which assumes that the user's need changes while looking at the retrieved documents, thus leading into new, unanticipated directions [1]. During the search, users collect relevant items retrieved by different queries.
So far, there has been little work on interactive XML retrieval. Finesilver and
Reid describe the setup of a small collection from Shakespeare’s plays in XML,
followed by a study of end user interaction with the collection [5]. Two interfaces
were used: one highlighting the best entry points and the other highlighting the
relevant objects.
Some recent efforts have been made within the INEX interactive track [6, 7].
In addition to the baseline systems which are the topic of this paper, Kamps et
al. tested a web-based interface that used a hierarchical result presentation with summarisation and visualisation [8], and van Zwol, Spruit and Baas worked with
graphical XML query formulation and different result presentation techniques
also in a web-based interface [9]. Besides these systems, various techniques for
visualisation of structured documents have been proposed in [10] and [11, 7].
3 iTrack 04
3.1 Evaluation Methodology
Document Corpus. The document corpus used was the 500 MB corpus of 12,107 articles from the IEEE Computer Society's journals, covering the years 1995–2002 [12].
Topics. We used content only (CO) topics that refer to document contents. In
order to make the tasks comprehensible by other people besides the topic author,
it was required to add why and in what context the information need had arisen.
Thus the INEX topics are in effect simulated work task situations as developed
by Borlund [13]. Four of the 2004 CO topics were used in the study.
Participating sites. The minimum requirement for sites to participate in iTrack 04 was to provide runs using 8 searchers on the baseline version of the web-based XML retrieval system provided. 10 sites participated in this experiment, with 88 users altogether.
Experimental protocol & data collection. Each searcher worked on one task
from each task category. The task was chosen by the searcher and the order of
task categories was permuted. The goal for each searcher was to locate sufficient
information towards completing a task, in a maximum timeframe of 30 minutes
per task.
Searchers had to fill in questionnaires at various points in the study: before
the start of the experiment, before each task, after each task, and at the end
of the experiment. An informal interview and debriefing of the subjects concluded the experiment. The collected data comprised questionnaires completed
by searchers, the logs of searcher interaction with the system, the notes experimenters kept during the sessions and the informal feedback provided by searchers
at the end of the sessions.
3.2 User Interface
The user interface in iTrack 04 was a browser-based frontend connecting to the
HyREX retrieval engine [14, 15].
In response to a user query, the system presented a ranked list of XML elements including title and author of the document in which the element occurred.
Fig. 1. iTrack 04: Query form and resultlist
In addition, a retrieval score expressing the similarity of the element to the query was shown, together with the path to the element in the form of a result path expression (see Figure 1). The searcher could scroll through the resultlist and access element details by clicking on the result path, which would open a new window displaying the element.
Fig. 2. iTrack 04: Detail view of an element
The detailed element view is depicted in Figure 2. The content of the selected
element was presented on the right hand side. The left hand part of the view
showed the table of contents (TOC) of the whole document. Searchers could
access other elements within the same document, either by clicking on entries in
the TOC, or by using the Next and Previous buttons (top of right hand part). A
relevance assessment for each viewed element could be given on two dimensions
of relevance: how useful and how specific the element was in relation to the
task. These dimensions corresponded to the relevance dimensions of the main
ad-hoc track of INEX in an attempt to ensure comparability of the results of
the two tracks. Each dimension had three grades of relevance, and the ten possible combinations of these dimensions could be selected from a drop-down list as shown in Figure 2.
3.3 Findings
The main findings based on the log and questionnaires are reported in [16]. Here,
only the findings related to the usability of the baseline system are discussed.
We analysed the questionnaire and interview data to investigate these issues.
Most questionnaire questions were answered on a 5-point scale, which we have
analysed statistically.
The overall opinion of the participants about the baseline system was recorded
in the final questionnaire which they filled after the completion of both tasks.
Users were asked to rate the different features of the system on the scale of 1
to 5, where 1 stood for ’Not at all’, 3 ’Somewhat’ and 5 for ’Extremely’. The
results are summarised in Table 3.
In addition to these ratings, users were asked to comment on the different
aspects of the system after the completion of each task and after the completion
of the experiment. Example questions were:
– In what ways (if any) did you find the system interface useful in the task?
– In what ways (if any) did you find the system interface not useful in the task?
– What did you like about the search system? What did you dislike about the system?
– Do you have any general comments?
An analysis of the most frequent comments is presented in the following paragraphs. Table 1 summarises the positive and Table 2 the negative results.
Element overlap. One of the critical issues of element retrieval is the possible
retrieval of overlapping result elements, i.e. components from the same document where one includes the other (due to the hierarchic structure of XML
documents). Typically these elements are shown at non-adjacent ranks in the
hit list. In our case, the HyREX retrieval engine did not take care of overlapping
elements and thus searchers frequently ended up accessing elements of the same
document at different points in time and at different result ranks.
Data from both the system logs and the questionnaires showed that searchers
found the presence of overlapping elements distracting. On recognising that they had already accessed the same document through a different retrieved element, searchers would typically return to the resultlist and access another element instead of browsing again within a document visited before. 31 users commented
negatively on the element overlap.
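Overlap of this kind can be detected from the result path expressions themselves: one element contains another if its path is a prefix of the other's. The following is a minimal sketch, not part of HyREX; the path syntax and the keep-highest-ranked policy are illustrative assumptions:

```python
def remove_overlaps(result_paths):
    """Keep only the highest-ranked element from each set of
    overlapping results. Paths are XPath-like strings such as
    '/article[1]/sec[2]/p[3]'; two elements overlap if the path
    components of one are a prefix of the other's (ancestor or
    descendant). Input is assumed sorted by descending rank."""
    kept = []
    for path in result_paths:
        comps = path.strip("/").split("/")
        overlaps = False
        for kpath in kept:
            kcomps = kpath.strip("/").split("/")
            n = min(len(comps), len(kcomps))
            if comps[:n] == kcomps[:n]:  # one is an ancestor of the other
                overlaps = True
                break
        if not overlaps:
            kept.append(path)
    return kept
```

For example, from the ranked list `['/article[1]/sec[2]', '/article[1]/sec[2]/p[1]', '/article[2]']`, only the first and third entries survive, since the paragraph is contained in the already-kept section.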
Table 1. Positive responses on system usefulness (iTrack 04, 88 searchers): table of contents (66), keyword highlighting (36), good results, simple querying.

Table 2. Negative responses on system usefulness (iTrack 04, 88 searchers): overlapping elements (31), insufficient summaries (30), no distinction between visited & unvisited elements (24), limited query language (22), poor results, limited collection.

Document structure provides context. The presence of the logical structure of the documents alongside the contents of the accessed elements was a feature that searchers commented positively on. The table of contents of each
document (see Figure 2) seemed to provide sufficient context to searchers in
order to decide on the usefulness of the document. 66 users found the TOC of
the whole article very useful because it provided easy browsing, navigation, less
scrolling or gave a quick overview of which elements might be relevant and which
might not be.
Element summaries. The resultlist presentation in the iTrack 04 system did
not include any element summarisation. Only the title and authors of the document were displayed in addition to the result path expression of the element and
its similarity to the query. As a consequence, searchers had few clues available to
decide on the usefulness of retrieved elements at this point. 30 users commented
on these insufficient clues.
Keyword highlighting. Within the detail presentation of an element, all query
terms were highlighted. This feature was very much appreciated, and several
users suggested providing this feature not only at the resultlist level, but also at the table of contents level. 36 users gave positive comments on this feature.
Distinction between visited and unvisited elements. There was no distinction between visited and unvisited elements at the resultlist and detail levels.
Thus, a number of times users visited the same elements/documents more than
once. 24 users commented negatively on this.
Limited query language. The system did not support sophisticated queries: there was no possibility to use phrases or Boolean operators, or to set preferences for individual terms. 22 users found this an obstacle.
General issues. Some more general issues were also commented on. Users stated that the multiple windows of the web interface were somewhat confusing, and that the "Result path" shown in the resultlist was mostly meaningless; with its square brackets, it had a very technical appearance.
iTrack 04 was the first attempt to set up an interactive track for XML retrieval, and there was very little knowledge to build upon when designing the iTrack 04 interface. In contrast, the design of the iTrack 05 interface was based on the experiences from the previous year. In designing the interface, we aimed at overcoming the main weaknesses of the 2004 interface.
4 iTrack 05
4.1 Evaluation Methodology
The evaluation methodology used in iTrack 05 was similar to the one used in
iTrack 04. An extended version of the INEX IEEE document collection was used
(now comprising 16,819 documents).
This time six topics were selected from the INEX 2005 ad-hoc topics, and
modified into simulated work tasks. In addition, searchers were asked to supply
two examples of their own information needs. Depending on the coverage in the
collection, one of these tasks was selected by the experimenter for the experiment.
In total, each searcher performed three tasks. With a total of 11 participating
organisations, 76 searchers performed 228 tasks in iTrack 05.
4.2 Desktop-Based System
For iTrack 05, the Daffodil framework was used and extended to support XML retrieval functionality. Daffodil is a front-end to federated, heterogeneous digital libraries. It aims at providing strategic support (see [17])
during the information search process and already supports interactive retrieval
through integrated high-level search and browse services.
The Daffodil framework consists of two parts, the graphical user interface
client and the agent-based backend services (see [18, 19]). The user interface
client, implemented in Java, is based on a tool metaphor, where each service is represented by a tool and the tools are integrated with each other.
The interface for iTrack 05 was designed by taking into account the findings of iTrack 04, the berrypicking model described in section 2, and iconic visualisation techniques for better recall and immediate recognition.
Additions to the Architecture. The base system had to be extended for
INEX in order to deal with the highly structured XML data. These extensions
affected both the user interface and the corresponding backend services, e.g.
connecting the XML search engine.
Query formulation. The problem of limited query language expressiveness was
resolved by allowing Boolean queries, in combination with proactive query formulation support [20]. The latter feature recognises syntactic errors and spelling
mistakes, and marks these. Besides full-text search, the system now also allowed
for searching on metadata fields such as author, title, and year.
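Proactive checks of this kind can be as simple as scanning the query string for structural mistakes before submission. The sketch below is an illustration only, not the actual Daffodil support described in [20]; the particular checks shown are assumptions:

```python
def check_query(query):
    """Return a list of problems found in a Boolean query string.
    Flags unbalanced parentheses and dangling AND/OR operators."""
    problems = []
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched ')'")
                depth = 0
    if depth > 0:
        problems.append("unmatched '('")
    # Tokenise loosely to catch queries that start/end with an operator.
    tokens = query.replace("(", " ").replace(")", " ").split()
    if tokens and tokens[0].upper() in ("AND", "OR"):
        problems.append("query starts with an operator")
    if tokens and tokens[-1].upper() in ("AND", "OR"):
        problems.append("query ends with an operator")
    return problems
```

A user interface would mark the offending parts rather than reject the query, letting the user correct the formulation before it is submitted.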
For further support during query formulation we added a Daffodil service
for suggesting related query terms (based on statistical analysis of a different
corpus). While the user specifies her query, a list of possible alternative terms is presented to her. This service follows the berrypicking model because the
newly discovered related terms can change the search direction of the user. For
easy query reformulation, the drag&drop feature of Daffodil could be used to
add new query terms from documents or the related term list.
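A term-suggestion service of this kind can be based on simple co-occurrence statistics. The following sketch is illustrative only (the statistics actually used by the Daffodil service are not detailed here); it ranks candidate terms by how often they co-occur with a query term in the same document:

```python
from collections import Counter

def related_terms(query_term, documents, top_n=5):
    """Suggest terms that frequently co-occur with query_term.
    documents: iterable of token lists, one per document.
    Returns the top_n co-occurring terms, most frequent first."""
    cooc = Counter()
    for tokens in documents:
        terms = set(tokens)
        if query_term in terms:
            cooc.update(terms - {query_term})
    return [term for term, _ in cooc.most_common(top_n)]
```

As the text notes, such statistically derived suggestions occasionally have no obvious semantic relationship to the query, which is exactly the behaviour some searchers criticised.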
Resultlist presentation. In order to resolve the issues of overlapping elements
and element summarisation identified in iTrack 04, results in the resultlist were
now grouped document-wise and hits within documents were presented as possible entry points within the hierarchical document structure. The document
metadata information is shown as the top-level element, as depicted in Figure 3.
In addition, whenever some element within a document is retrieved, the title
of that element is presented as a document entry point, depicted as a clickable
folder icon. This change reflected user preference for the TOC view, where titles
of elements are displayed.
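Such a document-wise grouping can be sketched as follows. This is an illustrative reconstruction, not Daffodil code; the result-tuple shape and the ordering policy are assumptions:

```python
from collections import OrderedDict

def group_by_document(results):
    """Group a ranked list of (doc_id, element_path, score) hits
    document-wise. Each document keeps the score of its best hit
    (results are assumed sorted by descending score) and collects
    its element hits as entry points."""
    groups = OrderedDict()  # insertion order = order of each document's best hit
    for doc_id, path, score in results:
        entry = groups.setdefault(doc_id, {"score": score, "entry_points": []})
        entry["entry_points"].append(path)
    return groups
```

The resultlist can then render one block per document, with the metadata as the top-level line and each entry point as a clickable item below it.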
We also took into account the comments about the retrieval score and the
result path expression from iTrack 04. The retrieval score of each retrieved element was now shown in pictorial (as opposed to numerical) form, and result path
expressions of elements were removed from the resultlist. The whole resultlist
entry was made clickable.
The comments on the distinction between visited and unvisited elements were addressed by using an iconic visualisation technique: an eye icon is shown with any resultlist entry that has been visited before. In the berrypicking analogy, this marks the paths the user has already walked, so that she picks only unknown berries and avoids looking twice at the same information. We also adopted query keyword highlighting at the resultlist level, since searchers appreciated this feature at the detail view level.
Detail view. The main layout of the detail level was kept the same as in iTrack
04, as seen in Figure 4. Some additions were made for supporting document
browsing. First, the entry points from the resultlist level are now also highlighted
in the detail view. Second, elements already visited are indicated with an iconised
eye in the table of contents.
Many participants in iTrack 04 felt that the two-dimensional relevance scale
used in these experiments was too complex [21]. For this reason, we moved to a
simple 3-point scale, measuring only the usefulness of an element in relation to the
searcher’s perception of the task: 2 (Relevant), 1 (Partially Relevant), and 0 (Not
Relevant). This three grade relevance scale was visualised as shown in Figure 4
(top left hand). The same icons were added to the viewed element when a relevance
value was assigned by the user. Here again one more aspect of the berrypicking
model analogy was implemented successfully: the user puts the 'good' berries into her basket, and can also see which berries she has picked before.
4.3 Findings
The analysis was made along the same lines as for iTrack 04. The overall opinion of the participants about the system was recorded in the final questionnaire
Fig. 3. iTrack 05: Query form and resultlist
Table 3. Overall opinion about the system on the scale of 1 (Not at all) to 5 (Extremely) in iTrack 04 (88 searchers) & iTrack 05 (76 searchers)

System Features                                                      iTrack 04    iTrack 05
                                                                     μ     σ²     μ     σ²
How easy was it to learn to use the system?                          4.17  0.6    3.40  0.9
How easy was it to use the system?                                   3.95  0.7    3.96  0.9
How well did you understand how to use the system?                   3.94  0.5    3.84  0.9
How well did the system support you in this task?                    –     –      3.13  0.9
How relevant to the task was the information presented to you?       –     –      2.97  1.13
Did you in general find the presentation in the resultlist useful?   –     –      3.35  0.8
Did you find the table of contents in the detail view useful?        –     –      3.72  1.0
that they filled after the completion of all tasks. New questions enquiring about
the distinct aspects of the system used in 2005 were added. The results are
summarised in Table 3. As can be seen, users were in general positive about both systems, and the major difference between the two years was the better learnability of the 2004 system. In addition, there were many informal comments in
response to the questions mentioned in section 3.3. We analyse the data in the
following paragraphs.
Fig. 4. iTrack 05: Detail view
Resultlist presentation. Presentation of results in a hierarchy was generally found useful: 43 users commented positively on it, whereas 3 users found the information presented insufficient for deciding about relevance or irrelevance. 2 users commented on the inconsistency of the result presentation; this situation occurred when a whole article was retrieved as a hit, with no further elements within the article. 3 users disliked scrolling at the resultlist level.
Table of contents and query term highlighting. As in iTrack 04, the TOC was found to be extremely useful, and 32 users commented positively on it. Query
term highlighting in the resultlist and the detail view were also appreciated (22
positive comments).
Related terms. The new functionality of suggesting related query terms was
found highly helpful: 29 users found this function useful in their performance of
search tasks. There were some cases when the suggested terms either retrieved
no documents, or there was no obvious semantic relationship to the query terms.
These situations led to negative remarks by 11 searchers.
Awareness in the detail view. The document entry points shown in the
resultlist were also displayed in the detail view; 14 users commented positively on this. Icons indicating visited elements and their relevance assessments were also shown in the TOC: 3 users found this useful. 15 users also wanted
to have the relevance assessment information in the resultlist.
Retrieval quality. Although the underlying retrieval engine had shown good
retrieval results in previous INEX rounds, it produced poor answers for some
queries, and 25 users commented negatively on this. A possible reason could be the limited material on the chosen topic of search.
Other issues. 4 users remarked positively on the usefulness of the interface, and 3 liked the query form. The response time of the system was perceived as too high; 35 users commented negatively on it.
Overall, user responses show that the main weaknesses of the iTrack 04 interface have been resolved. In addition, the new features supporting the berrypicking paradigm were appreciated by the users.
5 Conclusion and Outlook
In this article we presented the lessons learned from INEX iTrack 04 and iTrack 05. The analysis of iTrack 04 showed several negative responses to the web-based interface used. The main issues were the overlapping elements presented in a linear resultlist, insufficient summaries to indicate the relevance of an item, the lack of distinction between visited and unvisited items, and a limited query language. Some positive comments were also made: e.g., the document structure (TOC) provided sufficient context and was a quick way of locating the interesting information, and keyword highlighting was found to be helpful in 'catching' information parts that may be relevant to the query terms.
These findings were used to shift to an application-based interface. The analysis of iTrack 05 showed that presenting overlapping elements in a hierarchy can provide sufficient summarisation and context for deciding on relevance or irrelevance. The second major improvement was the addition of design elements based on the berrypicking model [1], which received substantial appreciation. These design elements included keyword highlighting, iconic visualisation and the provision of related terms.
The most problematic issue with the iTrack 05 system was the responsiveness
of the system. This was due to the underlying search engine and inefficiencies
within the Daffodil message flow. These issues will be worked on for iTrack 06.
Overall, the evaluations showed that the interface design adaptations based on the 2004 findings were perceived as an improvement. The shift to an application-based framework proved to be the right step, as we gained more flexibility in features than a web-based framework offers. In iTrack 06 a major focus will be efficiency: the underlying search engine will be replaced and the integration with the Daffodil framework tightened to lower response times.
References
1. Bates, M.J.: The design of browsing and berrypicking techniques for the online
search interface. Online Review 13 (1989) 407–424
2. Voorhees, E., Harman, D.: Overview of the eighth Text REtrieval Conference
(TREC-8). In: The Eighth Text REtrieval Conference (TREC-8). NIST, Gaithersburg, MD, USA (2000) 1–24
3. Turpin, A.H., Hersh, W.: Why batch and user evaluations do not give the same
results. In: Proc. of SIGIR, ACM Press (2001) 225–231
4. O’Day, V.L., Jeffries, R.: Orienting in an information landscape: How information
seekers get from here to there. In: Proc. of the INTERCHI ’93, IOS Press (1993)
5. Finesilver, K., Reid, J.: User behaviour in the context of structured documents.
In: Proc. of ECIR. (2003) 104–119
6. Larsen, B., Malik, S., Tombros, A.: The interactive track at INEX 2005. In: Advances in XML Information Retrieval and Evaluation. Lecture Notes in Computer Science, vol. 3977, Springer (2006) 398–410
7. Tombros, A., Larsen, B., Malik, S.: The interactive track at INEX 2004. In: Advances in XML Information Retrieval. Lecture Notes in Computer Science, vol. 3493, Springer (2005) 410–423
8. Kamps, J., de Rijke, M., Sigurbjörnsson, B.: The University of Amsterdam at INEX 2005. In: Advances in XML Information Retrieval and Evaluation. Lecture Notes in Computer Science, vol. 3977, Springer (2006)
9. van Zwol, R., Spruit, S., Baas, J.: B3 [email protected] track: User interface design issues. In: INEX 2005 Workshop Pre-Proceedings
10. Crestani, F., Vegas, J., de la Fuente, P.: A graphical user interface for the retrieval of hierarchically structured documents. Information Processing and Management 40 (2004) 269–289
11. Großjohann, K., Fuhr, N., Effing, D., Kriewel, S.: Query formulation and result
visualization for XML retrieval. In: Proceedings ACM SIGIR 2002 Workshop on
XML and Information Retrieval. (2002)
12. Gövert, N., Kazai, G.: Overview of the INitiative for the Evaluation of XML
retrieval. In: Proc. of INEX workshop. (2003) 1–17
13. Borlund, P.: Evaluation of interactive information retrieval systems. PhD dissertation (2000), 276 pp.
14. Fuhr, N., Gövert, N., Großjohann, K.: HyREX: Hyper-media retrieval engine for
XML. In: Proceedings of the 25th Annual International Conference on Research
and Development in Information Retrieval. (2002) 449 Demonstration.
15. Gövert, N., Fuhr, N., Abolhassani, M., Großjohann, K.: Content-oriented XML
retrieval with HyREX. In: Proc. of INEX workshop. (2003) 26–32
16. Tombros, A., Malik, S., Larsen, B.: Report on the INEX 2004 interactive track.
SIGIR Forum 39 (2005)
17. Klas, C.P., Fuhr, N., Schaefer, A.: Evaluating strategic support for information
access in the DAFFODIL system. In: Proc. of 8th ECDL. (2004)
18. Fuhr, N., Klas, C.P., Schaefer, A., Mutschke, P.: Daffodil: An integrated desktop
for supporting high-level search activities in federated digital libraries. In: Proc.
of 6th ECDL, Springer (2002) 597–612
19. Fuhr, N., Gövert, N., Klas, C.P.: An agent-based architecture for supporting high-level search activities in federated digital libraries. In: Proc. of ICADL, Taejon,
Korea, KAIST (2000) 247–254
20. Schaefer, A., Jordan, M., Klas, C.P., Fuhr, N.: Active support for query formulation in virtual digital libraries: A case study with DAFFODIL. In: Proc. of 7th ECDL (2003)
21. Pehcevski, J., Thom, J.A., Vercoustre, A.: Users and assessors in the context of INEX: Are relevance dimensions relevant? In: Proc. of INEX Workshop on Element Retrieval Methodology (2005)
“I Keep Collecting”: College Students Build and Utilize
Collections in Spite of Breakdowns
Eunyee Koh and Andruid Kerne
Interface Ecology Lab, Center for Study of Digital Libraries, Computer Science Department,
Texas A&M University, College Station, TX 77843, USA
{eunyee, andruid}@cs.tamu.edu
Abstract. As people become more and more involved with digital information,
they grow proportionally involved in situated practices of collecting. They put
together large sets of information elements. However, their attention to those information elements is limited. They use whatever means are at hand in order to
form representations of their collections. They need to keep track of the elements in these collections, so they can use them later. We conducted a study
with 20 college students. A major concern for the students during collection
building was collection management and utilization, particularly as the size and
number of their collections grow. They experienced breakdowns in these processes, yet continued to engage in collecting. They developed strategies such as
informal metadata schemas and hierarchical organization to try to cope with
their problems. We consider the practices observed, and their implications for
the development of tools to support digital collection building and utilization.
Collection representations that support cognition, collaboration, and semantic
schemas are prescribed.
1 Introduction
Dick is a graduate student in industrial engineering. As he is a research assistant, his
work involves writing research papers. He regularly searches for and collects relevant
prior work from the internet and digital libraries. He collects articles and URLs on his
own computer. He utilizes this collection regularly. Jane is a visualization lab student.
She collects many images and pictures for class work such as animation, and also for
fun. Some of these are photographs she has taken; some come from the internet. She
is also a student worker in the university newspaper. She collects images to support
this activity, as well. These examples illustrate the contexts in which students are
making collections, and provide a sense of the scope of collections and collecting
activities addressed by this paper. We define collecting as people’s practices of putting together archives of information elements, such as hyperlinks, documents, images, audio, and video, with the intention of creating and supporting meaningful,
engaging, and useful experiences.
Due to the popularity of digital media devices and the abundance of information on the
web, a broad cross-section of society becomes more and more exposed to large numbers of digital documents and media elements. People are confronted with the
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 303 – 314, 2006.
© Springer-Verlag Berlin Heidelberg 2006
problem of how to keep track of significant elements within the stream of this experience. They begin collecting, and again due to the preponderance of meaningful digital
information and media, the collections become larger and larger. This trend is further
propelled by the increasing availability and capacity of inexpensive digital storage.
However, a wealth of information creates a poverty of attention [19]. The disparity
between the growing amount of information and media that people are collecting in
practice, and their fixed amount of attention, is leading to breakdowns in their collecting experiences. According to Winograd and Flores, breakdowns occur when there is
a discrepancy between our expectations and actions, and the world [22]. Breakdowns
can serve as an opportunity for learning, because they identify important parts of tasks
and activities, and can provoke the articulation of new user needs and design requirements. The present research investigates breakdowns in collecting practices.
The study was conducted with college students, who tend to be fast movers in the face of ongoing technological transformation. 81% of them go online; many can scarcely imagine a world in which people were not “always on,” constantly connected to the net [14]. The Pew Internet and American Life Project, reporting on 2054 students from 27 colleges and universities,
says that nearly 73% of college students use the internet more than the library, while
only 9% said they use the brick and mortar libraries more than the internet for information searching [15]. College students typify the category of power creators, which
Pew has identified as an important constituency of internet users [16]. Power creators
are twice as likely to engage in content creating activities as other internet users [16].
The intention of this research is to develop understanding of current practices
and resulting breakdowns in building and utilizing digital collections. We have
investigated the practices of college students by interviewing them, and observing
the collections that they build. We also gathered quantitative data about collection
building and utilization practices. From this understanding, we will infer implications for the design of new tools to support these processes. This paper begins with
a review of related work. Next, we describe the study and its participants. The subsequent section presents data and analysis. We conclude by discussing current collection practices and tools, and inferring design implications for future research and development.
2 Related Work
Prior studies have investigated the usage of tools for building and utilizing collections
in specific media, such as email [3], bookmarks [1][12], and files [4][5]. Some studies
have offered classifications of user behavior with various collection tools. Malone
identified two fundamental strategies in office management: filing and piling [13],
focusing on the organization activities. Whittaker and Sidner [21] observed three
email management strategies: frequent filer, spring cleaner, and no filer. Balter [3]
extended this classification by dividing the no-filer class into folder-less cleaner and
folder-less spring-cleaner, depending on whether items are deleted from the inbox on
a daily basis. Abrams et al. [1] described four bookmark management strategies: nofiler, creation-time filer, end-of-session filer, and sporadic filer. Barreau and Nardi [5]
looked at the types of information managed by users, identifying three types based on lifetime and use: ephemeral, working, and archived. They noted the relative importance of ephemeral and working items, retrieved by location-based browsing, over archived items and the use of search. However, as the information age matures, it seems that the importance of archiving grows.
While each of the previously mentioned works addresses utilization of a single
collection medium, Jones et al. conducted a study that traverses collecting practices
involving e-mail, images, document addresses (URLs), and documents [12]. They
investigated various methods people use in their workplace to organize information
for re-use. They found that people differ in their collection building practices according to their job position and their relationship to the information. Their study is
similar to the present research in its addressing of multiple collection media, as well
as in the number of experimental subjects, and the social proximity of the subjects
to the researchers. Boardman et al. [7] also collected cross-tool data relating to file,
email and web bookmark usage. They found that individuals employ a rich variety
of strategies both within and across collection tools, and discuss synergies and differences between tools to guide the design of tool integration. The data underscored the challenge of collection tool design: future design work must account for the variation in strategies by providing the flexibility to manage different types of information in distinct ways. They observed that people
usually browse rather than search to find relevant elements in their collections. In
addition, they found that the slow-changing nature of hierarchical representations
may benefit users by promoting familiarity with the personal information environment. Such familiarity, in turn, supports location-based finding for which users
expressed a clearer preference.
The present research focuses on human experiences of collecting and the role of
collections across a broad range of meaning-making activities and digital media.
Some prior work has addressed particular media, such as web pages or email. Some
has focused on well-defined scenarios regarding information filing, finding, and
management. This study investigates processes of collection building and utilization
across media and tools through open questions about participants’ situated practices, in order to discover how they engage in collecting throughout their everyday
activities. We use a hybrid data collection approach, in which qualitative data from
open questions is augmented by quantitative data about collection building and utilization.
3 Study Description
To investigate power users’ collection building and utilizing practices, we performed
a study consisting of interviews of 20 college students. The study brought together
narrative accounts, interview questionnaires, and examples of their digital collections
in order to investigate how they currently build and utilize collections as part of everyday life. Students were informed that they were participating in a study, and that
their responses would be recorded, and anonymously recounted in a research paper.
Participants were distributed by gender and academic concentration. Ten students
were male and the other ten were female. There were eight undergraduate students
and twelve graduate students. Students’ majors were diverse, including computer
science, visualization, aerospace engineering, statistics, landscape design, industrial
engineering, and history. The interviews were conducted with participants at their
offices or homes, so they could show artifacts from their personal computers.
The interviews were semi-structured and open-ended. We did not limit the dialogue to our pre-formulated questions. We also did not place any limits on the media type or representational forms of the collections we investigated. Rather, we
considered any type of personal collection. We spent 60-90 minutes with each participant to explore the kinds of collections they made, their processes of using and
organizing the collections, the collection tools they used, and their overall experiences of collecting.
While conducting the study, the interviewer was guided by an agenda of relevant
research questions:
− To what extent do you think intentionally about your needs for collecting digital
information prior to actually doing so?
− What activities are involved in your collection building processes?
− How do you feel about spending time on collection-making processes?
− How many elements are in your collections?
− Which tool(s) or mechanism(s) do you use to build collections?
− How often do you make / refer to / organize collections?
− What types of inconveniences and breakdowns do you encounter during building
and utilizing digital collections?
− What are your strategies for coping with breakdowns in your experiences of building and utilizing collections?
− What are your suggestions for future collection tools?
We recorded and screen-copied examples of collections participants built, and
took notes during interviews. After each interview, participants filled out a survey.
4 Results
We analyzed the study data in terms of the distribution of activities, significance,
type, and quantity of information elements involved, as well as the kinds of mechanisms people used for building and utilizing collections. We also investigated their
frequency of involvement in collecting. Quantitative and qualitative data and its
analysis will show participants’ collection building and utilizing practices and breakdowns.
4.1 Collection Building and Utilizing
We looked at collection building and utilizing practices in terms of the stance participants brought into the process of collecting, the patterns and expectations that
occurred in these processes, and the ways in which users perceived success and failure.
Intention and Need
Participants were asked whether they thought about the need for collecting prior to
engaging in processes of seeking digital information. All participants expressed
awareness of a personal ongoing deliberate intention and need to be involved in collection building and utilizing practices.
Activities and Significance
The participants reported collecting digital media materials that support a range of
personal and work-related activities. The personal media included photographs taken
by themselves and friends, as well as popular media elements such as music, movie
star pictures, and art images. As the subjects were students, their work is learning and
research, so the materials here included class notes and research papers. Students
whose majors are related to design collect many image files as part of their school
work. From this data, we see that the participants’ collecting activities are conducted
in relationship to the span of significant activities in their lives.
Frequency and Time Period
One hundred percent of participants report that they build and utilize collections regularly. Of these, more than half utilize collections more than one hour per week. In
more detail, 18% of participants said that they spend more than one hour per day on
collection building; 10% spend one hour per day engaged in the collection process;
27% said that they spend more than one hour per week and less than one hour a day;
while 27% spend one hour a week; and 18% of participants spend one hour per
month. However, participants do not have a specific time frame scheduled for collection building and utilizing. It is something they do spontaneously, as part of a range of
tasks and activities (P3: “I build and utilize collections regularly, and I engage in this
process during spare time and while I am taking rest.”).
Worthwhile or Useless
Participants were asked how they feel about spending time on collection building
and utilization. 46% of participants said that they experience the process as meaningful and worthwhile. 18% of participants answered that they find it somewhat
meaningful. 9% of participants answered that their experience is neither worthwhile
nor useless. 27% of participants said that they experience collecting as rather useless. Those participants who answered rather useless said that they nonetheless
continue to engage in the collection building process; they experience it as necessary and meaningful initially, but after a while, their engagement seems to be performed in vain. They said that a collection is not worthwhile if they do not utilize it
well, and they seldom utilize most parts of their collections because of the huge
volume of collected information.
Collection Types
All participants said that they build image, music, and/or movie collections. The
sources of the images are from digital cameras, camera phones, and the internet.
Twenty-two percent of participants have 50-100 images in their collection; another 22% keep 100-500 images; while 56% keep more than 5000 images in their collections. Participants said they mostly obtain music from music downloading services or their
friends’ collections. Thirty-three percent of participants keep 50-100 music files, 33%
keep 100-500 files, and 34% keep more than 500 music files in their collections.
Movie files are obtained through similar means, such as downloading services or
creation with a video camera. Twenty-two percent of participants keep 10-50 movie
files; another 22% keep 50-100 movie files; another 22% keep 100-500 movie files;
while 34% keep more than 500 movie files in their collections.
Participants also collect documents such as Word files and PDF files. 56% of participants keep 100-500 documents; 44% keep more than 500 documents in their collections. They also collect web documents in the form of hyperlinks (URLs). 11% of
participants keep 1-10 URLs, 33% keep 10-50 URLs, 45% keep 50-100, and 11%
keep 100-500 URLs in their collections. Compared to the other media collections,
participants keep fewer URLs, because web documents are easier to search for.
Collection Mechanism
In terms of what is stored, there are three ways to build digital collections: (1) save
the files themselves; (2) extract some parts from files and save only those parts; (3)
save the location of files. Participants use whatever tools and structures are at hand to
build their collections; for example, files, folders, bookmarks, and e-mail.
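The three mechanisms can be sketched as a simple data model. The class and field names below are our own illustration of the categorization, not part of any tool the participants used:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Mechanism(Enum):
    """The three ways of storing a collected element."""
    WHOLE_FILE = auto()   # (1) save the file itself
    EXTRACT = auto()      # (2) save only an excerpt of the file
    REFERENCE = auto()    # (3) save the file's location (e.g., a URL)

@dataclass
class CollectedElement:
    mechanism: Mechanism
    payload: str  # a file path, an extracted excerpt, or a URL, per mechanism

# A hypothetical mixed collection using all three mechanisms.
collection = [
    CollectedElement(Mechanism.WHOLE_FILE, "/home/jane/images/photo01.jpg"),
    CollectedElement(Mechanism.EXTRACT, "key paragraph copied into notes.txt"),
    CollectedElement(Mechanism.REFERENCE, "http://www.example.com/article"),
]
```

The within-file collections described below correspond to the EXTRACT case, while bookmarks and emailed URLs correspond to REFERENCE.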
All participants said that they make file folders for file collections. There are also
within-file collections, in which small elements of information from diverse sources
are gathered into a single file. Participants said that they used Excel, Word, Photoshop, and Notepad to build this type of within-file collection. They used drawn lines,
tables and newline characters (vertical whitespace) to spatially distinguish elements in
a within-file collection. When participants save URLs of web pages, they usually use
bookmarks, but they also use e-mail, so that those URLs can be utilized from the
other computers (P9: “I am not using bookmarks at all. Instead I keep URLs in my
email because I use three computers; my office computer, my home computer, and my
laptop, so I can look at important URLs from any of my computers.”).
Levels of Engagement with Collections
We observe that in general, people collect information and media with the intention of
later referring to the collected elements for use. Sometimes, they actually get to this
process of referring. Further, sometimes, with collections that are important, they take
steps to organize the form of the collection. Referring and organizing are aspects of
collection utilization.
While participants accessed the internet daily, their activities of selecting elements
to add to their collections, referring to the collections, and organizing them occurred
less frequently (See Figure 1 Left). The frequency of these activities can be categorized in three tiers. All of the subjects accessed the internet daily. At the same time,
43% of them engaged in collection building and referring on a daily basis, while 36%
did so on a weekly basis, and the remaining 21% engaged in such activities monthly.
The difference between internet access frequency and collection building/referring
frequency was statistically significant [F(2,26)=3.67, p<.01]. While distribution of the
participants’ collection building frequency and collection referring frequency were the
same, these distributions are independent and do not necessarily refer to the same
participant. The third tier of engagement with collections is to organize them; 36% of
the subjects did this weekly, 57% did it monthly, and the last 7% reported they never
did it at all. The last group corresponds, for example, to Abrams et al.’s “no-filers” [1]. The
frequency of engaging in collection building/referring was again greater than that of
collection organizing in a statistically significant manner [F(2,26)=3.45, p<0.002].
This shows that people refer to their collections as much as they build the collections,
but they rarely organize their collections.
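Significance tests of this kind are standard one-way ANOVAs. As a minimal sketch of the computation, the plain-Python function below derives the F statistic from per-group scores; the daily/weekly/monthly frequency scores are made up for illustration (the study's raw data are not reproduced here), so the resulting value is not the paper's.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of scores."""
    all_scores = [x for g in groups for x in g]
    n, k = len(all_scores), len(groups)
    grand_mean = sum(all_scores) / n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical frequency scores for three tiers of engagement.
daily, weekly, monthly = [7, 6, 7, 5], [4, 3, 4, 5], [1, 2, 1, 2]
f_stat, dfb, dfw = one_way_anova([daily, weekly, monthly])
```

With three groups, df_between is 2, matching the form of the reported F(2,26) statistics.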
Fig. 1. Left - participants’ internet access and collection building/referring/organizing frequency; Right - rate at which participants’ collections are unutilized and abandoned
Collection Sharing
The study data shows that participants share their collections with other people, and
also across several computers. 85% of participants said that they have their own blogs
or personal web sites and publish some of their collections to share with others. These
published collections may in turn function as source materials for others’ collection
building processes.
As mentioned above, one participant (P9) keeps URLs in email in order to access
them from different computers. All participants said that they use several computers
in different places. Participants use portable devices to carry their digital media materials or store them in network accessible spaces in order to share among different
computers as well as with others.
Breakdowns in Collection Practice
We investigated discrepancies between participants’ expectations, and their experiences in practices of collecting. Our goal in identifying these breakdowns is to articulate user needs and design requirements. The most common breakdowns that
participants experienced during the present study arose during their practices of
referring, organizing, and finding things in their collections (P15: “I initially made
URL collections using bookmarks without any folder structure and renaming. Later,
I had trouble finding a specific URL in it, so I deleted all my bookmarks and made
folders with renaming. After this experience, I became more cautious about adding
and renaming URLs to the collection.”). They said that they initially didn’t have
trouble finding elements in collections they built, but as time elapsed after collection
building, it became more difficult to remember what is in the collections, and where.
Recall, constrained by limited human attention, becomes a problem (P12: “I had really
important data in my collections, but I cannot find it! Could you make a program for
me?”). As the set of collections they own grows larger, it becomes difficult to remember all of them. Even though they sometimes don’t have any clue of where the
elements are, they said that they start browsing their collections first rather than
searching. When they don’t find the elements in the expected location, they use a
search tool (P13: “I seldom organize my collection very well, so I went through all
folders one by one sequentially trying to find a certain file. Sometimes, I forgot what
I saved, so I searched the web instead of the collections, and saved the same thing
again.”). However, they may not even remember what to search for.
As mentioned above, 27% of participants said that collection building is somewhat
useless because most parts of their collections are not utilized, and thus abandoned.
Participants were asked what percent of their digital collections remain unutilized. At
least 40% of the participants’ collections are abandoned (See Figure 1 Right); 27% of
participants said that 90% of their collections are abandoned; another 27% of participants indicate that 80% of their collections are abandoned; for 20% of participants
70% of collections are abandoned; 14% of participants have a 60% abandonment rate;
6% of participants have 50% abandoned collections; another 6% have 40% abandoned collections. Nonetheless, participants continue to engage in collecting (P4:
“Even though I am not using most of my collections and I sometimes think what I’ve
built is useless, I keep building collections.”).
The participants initially build their collections with the intention of using them
later. However, most collected material is not utilized because of trouble remembering and finding what has been collected. They lack effective means for referring to
their collections. Collections are abandoned not because the information and media
they contain are useless, but because of breakdowns in utilization practice.
Reasons for Collection Building
Participants were asked why they still build collections even though they do not utilize most parts of them. Like P14 (“Wow, I realize that I am not using most parts of
my collections, around 90%”), they are often unaware that they are not utilizing most
of what they collect. However, all participants still build collections from some sense
that they will need the collected information elements later (P6: “I want to save time
on searching when I need a document in the future. That is my main reason for continuing to build collections.”). They collect media files to enjoy and also to share with
others. Participants collect information that seems meaningful, useful and needed.
They collect media that seems fun, unique, and consonant with their personal tastes.
They make collections not for the definite promise of later utility, but from some
intuitive sense of meaning and value.
4.2 Using Semantics to Represent Collections
Through the study, we observed that participants create semantic structures to organize their collections using any available affordances. They build their own structures
for meaningfully representing their collections for usage later.
Developing Informal Metadata Schemas
All participants said that they make hierarchical directory structures to organize and
manage their collections. They make folders based on contents, dates, semantic identifiers related to tasks or activities, or other categories that are somehow significant to
them. Participants said that folder structures are created and changed because collections are added and deleted continuously.
Participants said that they rename files and file folders using metadata such as date,
location, title, or author in order to help find them later. Renaming is important for
search also. They seek to remember which words they used to rename files, in order
to reuse them later when they browse and search their collections. Several participants
mentioned strategies other than renaming for keeping track of collected material. For
example, they create index files inside of folders so that they can know what they
contain (P6: “Inside file folders, I make a ‘readme’ file to look at it later. This will
help me to remember what the collection is about. In the individual file, I rename the
file, and in addition to that, I put an explanation about the content in the first line.”).
We identify participants’ practices such as renaming elements and creating hierarchical folder structures for representing important and large collections as the
development of informal metadata schemas. They found ways to develop informal
metadata schemas even in the absence of tools that support extensible field creation.
They used the single accessible field afforded by existing tools, the file or link name, to store the metadata. This practice was mostly spontaneous, occurring
without an ontological plan. It was conducted informally and incrementally, as a
series of situated actions [20]. This is an example of incremental formalism [18].
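The practice of packing metadata into the one available field can be sketched as follows. The underscore-delimited date_location_title convention is our own illustrative choice, not a scheme any participant reported verbatim:

```python
def encode_filename(date, location, title, ext):
    """Pack informal metadata into a filename, the single field existing tools offer."""
    return f"{date}_{location}_{title}.{ext}"

def decode_filename(name):
    """Recover the metadata fields from an encoded filename."""
    stem, _, ext = name.rpartition(".")
    date, location, title = stem.split("_", 2)
    return {"date": date, "location": location, "title": title, "ext": ext}

# Round trip: encode a file's metadata into its name, then parse it back out.
name = encode_filename("2006-03-14", "alicante", "ecdl-notes", "txt")
meta = decode_filename(name)
```

The fragility of such schemes (a stray underscore breaks the parse, and nothing enforces the convention) is precisely why the discussion below calls for extensible metadata systems.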
4.3 Suggestions
Participants were asked what new functionalities would be helpful in tools for collection building and utilization. Categories were not specified. Participants could mention whatever was on their minds. Participants’ suggestions addressed areas such as
collection utilization statistics display, filing assistance, and collection privacy support. They wanted help in renaming their collection materials in order to make the
structure consistent, to make it easier to find materials later. They also asked for cues
such as a ‘visited count,’ which shows how many times the owner read the file, in
their collection representations and search and browsing environments to support
finding specific materials. They liked the way desktop search is moving to assist collection utilization, however, they wanted their private files to be processed differently
(P13: “I have a big paper collection, but it is hard to find the paper I need when I
need it using search tools supported in Windows. I tried Google desktop search, and
it is pretty good, but one time I was a little embarrassed because a file that I wanted
to keep private was retrieved as a search result when I was with my friend”).
5 Discussion
Study participants invest substantial personal effort and resources into processes of
building and utilizing collections. Their persistence in collecting in spite of breakdowns conveys the sense that they need to keep collecting to support a range of activities that span personal and work-related parts of their lives. In this section we examine
participants’ engagement with collections and the needs they express, and extrapolate
from these, while considering human cognitive facilities and emerging technological
capabilities. The result is to derive implications and ideas for designers of systems
that support collection building and utilization.
The data shows that participants’ breakdowns were centered in processes of collection utilization. They had trouble finding specific elements in their collections, and
even though they built collections of elements that were useful, most of them are not
utilized in the relevant context because of limited human attention and memory. They
forget what to look for and where. Abandoned collections consume disk space, and
more importantly, human attention during browsing, which is people’s first choice for
how to refer to collections.
We propose prescriptions to address breakdowns discovered in this study. Since
the discovered breakdowns generally involve limitations of human understanding of
collections, the prescriptions involve making better use of individual cognitive resources, sharing collections, and the definition of collection semantics. The first prescription addresses breakdowns that involve forgetting what has been collected, by
using representations for collections that better cue human memory. The next proposed solution is based on ambient displays that use peripheral attention and changes
over time for individual and collaborative interaction with collection visualizations.
Other user needs that result from analysis of the breakdowns involve distributed tools
for collection sharing, and the automatic generation of metadata schemas.
We can take steps to help people keep track of their collected information, by making
better utilization of human memory capabilities. It is a well-accepted principle of
cognitive science that in the working memory system, the visuospatial buffer, which
stores mental images, and the rehearsal loop used for text are complementary subsystems [2]. Thus, dual coding strategies that represent the elements stored in a collection
with images as well as text will improve memory utilization [2][8], and contribute to
helping people find elements while browsing. Thus, we can provide users with tools
that support them in developing and generating visual index representations of their
collections, which integrate images and text. These representations will be easier to
remember, promote recognition, and facilitate the formation of mental models [10].
Since collection representations function as visual communication, either from a user
to her/himself or between users, visual design principles must be applied during processes of collection organization.
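A minimal dual-coding index might pair each element's text label with an image cue. The HTML-generating sketch below is our own illustration of the idea, not a tool from the study; the element list and thumbnail paths are hypothetical:

```python
from html import escape

def visual_index(elements):
    """Render a collection as an HTML index pairing an image cue with a text label,
    addressing both the visuospatial and verbal working-memory subsystems."""
    rows = []
    for thumb, label in elements:
        rows.append(
            f'<div class="item"><img src="{escape(thumb)}" alt="{escape(label)}">'
            f"<span>{escape(label)}</span></div>"
        )
    return '<div class="collection-index">' + "".join(rows) + "</div>"

page = visual_index([
    ("thumbs/paper01.png", "INEX 2004 interactive track report"),
    ("thumbs/photo07.png", "Lab trip, spring 2005"),
])
```

Rendering both modalities side by side is the point: either cue alone may fail to trigger recall, but together they support recognition while browsing.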
Developing representations during collection-building and explicit organization activities is one solution. But people don't always have sufficient attention to work on representing their collections. Another prescription is to develop peripheral ambient visualizations that gradually display elements from collections over time. Ambient visualizations use time as a dimension in collection visualization. They can represent personal and group collections, engaging human attention without requiring it. Ambient visualizations can be deployed on a dedicated display, or as a screensaver. The set of collections that get visualized can be specified explicitly by users, and/or by an agent that uses clues, such as recency of access. For example, a large display in a collaborative environment such as a research lab or departmental work area can visualize collected materials that represent information relevant to current projects and research. This method can jog memories and promote serendipity, facilitating individual and collaborative utilization of meaningful, useful and important elements in collections. Affordances that enable privacy will be required.
Additionally, we have seen that sharing with others is an important motivation for
peoples’ collecting practices. People utilize and collect information on multiple computers and devices in different locations. This can cause access problems, when the
person is in one place, and the needed information is somewhere else. One initiative
that addresses this is del.icio.us, which supports URL collection sharing [17].
del.icio.us enables users to tag URLs while collecting. It shows the metadata that
“I Keep Collecting”: College Students Build and Utilize Collections
others have used, and enables social browsing through these relationships. We believe
this is a start for sharing collections and their semantics. New collection tools need to
consider people’s social and distributed collection-sharing intentions and enable
collecting actual objects as well as references, while considering accessibility and
privacy. Deeper semantic structures than single tags will also add value. These
functionalities need to be integrated with editing, saving, browsing, and searching in
order to best use limited human attention.
Users who are organizing collections by building informal metadata schemas need
more powerful semantic structures. Easy-to-use, extensible metadata systems will
address this need. New collection tools need to use human attention effectively by
supporting people’s processes of semantic schema development in context, using
content analysis, text pattern recognition, and image processing techniques. They can
apply and extend collaborative filtering techniques for making suggestions about
which metadata tags fit what is being saved [17][9]. Feature-based clustering and
content analysis techniques can be applied to facilitate the semantic organization of
collections by grouping similar information elements and building referential links.
Users need to be able to override as well as accept the resulting suggestions. As part
of this process, agents can track mutually relevant information elements scattered
across the computer and the network, and inform the user about related information
elements in diverse collection substructures using similarity measures.
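As a minimal sketch of the kind of collaborative tag suggestion described above (an illustration only, not any cited system's algorithm; the data and function names are hypothetical), candidate tags can be ranked by how often they co-occur with the tags already applied to an item being saved:

```python
from collections import Counter
from itertools import combinations

# Hypothetical history of tag sets that other users applied when saving items.
history = [
    {"python", "tutorial", "programming"},
    {"python", "reference", "programming"},
    {"recipe", "cooking"},
    {"python", "tutorial"},
]

# Count how often each ordered pair of tags co-occurs across saved items.
cooccur = Counter()
for tags in history:
    for a, b in combinations(sorted(tags), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def suggest_tags(current_tags, k=3):
    """Rank candidate tags by total co-occurrence with the user's current tags."""
    scores = Counter()
    for t in current_tags:
        for (a, b), n in cooccur.items():
            if a == t and b not in current_tags:
                scores[b] += n
    return [tag for tag, _ in scores.most_common(k)]
```

A real system would combine such co-occurrence scores with content analysis of the saved object, as the text suggests, rather than relying on tag history alone.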
6 Conclusion
Our study participants display tenacity in their involvement in processes of collecting.
They explicitly express the intention and need to be involved in ongoing practices of
collecting. They collect digital media materials involved in a broad range of activities,
spanning personal and work relationships, which make up their everyday experiences.
Their collection artifacts directly signify, relate to, and support these activities. Thus,
collections and the process of collecting, itself, play important roles in how people
create meaning in their lives.
Participants engage in collection building and utilizing activities regularly, even
though it is not mandatory, and even though problems arise in the user experience.
They keep collecting in spite of breakdowns. Better representations can help support
these processes, by making better use of human attention. Tools for collecting need to
be based in a sense of supporting individual and collaborative processes of meaning
creation, while maximizing utilization of cognitive resources.
References
1. Abrams, D., Baecker, R., Chignell, M., Information archiving with bookmarks: personal
Web space construction and organization, Proc. SIGCHI, April 18-23, 1998, p.41-48.
2. Baddeley, A.D., Is working memory working?, Quarterly Journal of Exp Psych, 44A, 1-31,
3. Bälter, O., Strategies for Organizing Email, Proc. of HCI on People and Computers XII,
p.21-38, January 1997
E. Koh and A. Kerne
4. Barreau, D., Context as a factor in personal information management systems, JASIS,
46(5):327-339, June 1995.
5. Barreau, D., Nardi, B., Finding and reminding: file organization from the desktop. ACM
SIGCHI Bulletin 27, 3 (1995), 39-43.
6. Billsus, D., Hilbert, D., Maynes-Aminzade, D., Improving Proactive Information Systems,
Proc. IUI 2005, January 9, 2005, p. 159-166
7. Boardman, R., Sasse, M. A., "Stuff goes into the computer and doesn't come out": a crosstool study of personal information management, Proc. SIGCHI 2004, p. 583-590.
8. Carney, R.M., Levin, J.R., Pictorial Illustrations Still Improve Students’ Learning From
Text, Educational Psychology Review, Vol. 14, No. 1, March 2002.
9. Davis, M., King, S., Good, N., Sarvas, R., From context to content: leveraging context to
infer media metadata, ACM Multimedia 2004, pp. 188-195.
10. Glenberg, A.M., Langston, W.E., Comprehension of illustrated text: Pictures help to build
mental models, Journal of Memory & Language, 31(2):129-151, April 1992.
11. Hawkey, K., Inkpen, K. M., Privacy gradients: exploring ways to manage incidental information during co-located collaboration, Proc. CHI 2005, April, 2005, p. 1431-1434.
12. Jones, W., Dumais, S., Bruce, H., Once found, what then?: a study of "keeping" behaviors
in personal use of Web information. Proc. ASIST 2002, November 18-21, 2002, 391-402.
13. Malone, T., How do people organize their desks?: Implications for the design of office information systems, TOIS, 1(1):99-112, Jan. 1983
14. Pew Internet & American Life Project, Internet: The Mainstreaming of Online Life, 2005,
15. Pew Internet & American Life Project, The Internet Goes to College, 2002, http://www.
16. Pew Internet & American Life Project, Content Creation
17. Schachter, J., del.icio.us, http://del.icio.us
18. Shipman, F., Marshall, C., Formality Considered Harmful: Experiences, Emerging
Themes, and Directions on the Use of Formal Representations in Interactive Systems,
Proc. CSCW 1999, 333-352.
19. Simon, H., Computers, Communications and the Public Interest, Martin Greenberger, ed.,
The Johns Hopkins Press, 1971, 40-41.
20. Suchman, L., Plans and Situated Actions, New York: Cambridge University Press, 1987.
21. Whittaker, S., Sidner, C., Email overload: exploring personal information management of
email, Proc. SIGCHI, April 13-18, 1996, p.276-283.
22. Winograd, T., Flores, F., Understanding Computers and Cognition, Addison-Wesley, 1986
An Exploratory Factor Analytic Approach to Understand
Design Features for Academic Learning Environments
Shu-Shing Lee, Yin-Leng Theng, Dion Hoe-Lian Goh,
and Schubert Shou-Boon Foo
Division of Information Studies
School of Communication and Information
Nanyang Technological University
Singapore 637718
{ps7918592b, tyltheng, ashlgoh, assfoo}@ntu.edu.sg
Abstract. Subjective relevance (SR) is defined as the usefulness of documents for users' tasks. This paper enhances objective relevance and tackles its limitations by conducting a quantitative study to understand students' perceptions of features for supporting evaluations of subjective relevance of documents. Data were analyzed by factor analysis to identify groups of features that supported students' document evaluations during IR interaction stages and to provide design guidelines for an IR interface supporting students' document evaluations. Findings suggested an implied order of importance amongst groups of features for each interaction stage. The paper concludes by discussing the groups of features, their implied order of importance, and support for information seeking activities to provide design implications for IR interfaces supporting SR.
Keywords: Subjective relevance, exploratory factor analysis, interface design.
1 Introduction
Information retrieval (IR) systems are traditionally developed using the “best match”
principle, assuming that users can specify their needs in queries [3]. Such a system retrieves documents that "match closely" to the query and regards these documents as relevant. Here,
relevance is computed objectively using a similarity measure between query terms
and terms in documents without considering users’ needs and tasks [24].
This paper enhances objective relevance and addresses its limitations by taking a
quantitative, subjective relevance (SR) approach. The SR concept provides suitable
theoretical underpinnings for our approach as it focuses on a document's relevance to
users’ needs [12]. This paper builds on an initial study [15] where features supporting
users’ evaluations of subjective relevance of documents were elicited. Here, we aim
to understand university students' perceptions of the elicited features. Specifically, we use factor analysis to investigate groups of features and their implied order of importance, to provide design guidelines for IR interfaces supporting SR.
Our approach may show designers how users' perceptions of the importance of features can be elicited and how factor analysis can be used to imply an order of importance for features, so that better decisions are made in designing IR interfaces supporting
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 315 – 328, 2006.
© Springer-Verlag Berlin Heidelberg 2006
S.-S. Lee et al.
users' relevance evaluations of documents. Similarly, our work applies to digital libraries by helping designers determine design features for IR interfaces so that
users are guided to find documents based on their needs.
2 Related Work
Different approaches have attempted to enhance objective relevance by developing
user-centered IR systems. One method adopts an algorithmic approach to support
techniques like collaborative browsing and collaborative filtering in IR systems. Collaborative browsing aims to understand how users interact with other users to facilitate browsing processes and retrieve relevant documents. An example application is
Let’s Browse [16]. Collaborative filtering helps users retrieve relevant documents by
recommending documents based on users’ behaviors and behaviors of similar users.
Example applications are Fab [1] and GroupLens [23].
In the digital library domain, researchers have tried to design user-centered systems
that helped users retrieve relevant documents. One such work is the Digital Work
Environment library [18] which points users of a university digital library to relevant
documents based on their user categories and tasks. Another example uses a participatory design approach through techniques like observations and low-tech prototyping
to develop a user-centered children’s digital library called SearchKids [9].
Another research area looks at user-centered criteria and dimensions affecting relevance judgments, such as, [2] and [19]. These works may allow IR designers to provide appropriate information that helps users find documents for tasks.
3 Theoretical Framework
Our approach differs from those highlighted in Section 2. Firstly, we focus on the
location stage in the information life cycle [11], where we use SR to elicit features
supporting users’ relevance evaluations of documents. Secondly, we conducted a
quantitative study identifying users’ perceptions of elicited features. Factor analysis
was used to discover groups of features for IR interaction stages and their implied
order of importance amongst groups to provide design guidelines for IR interfaces
supporting users’ evaluations of relevance of documents for academic research.
This paper builds on our first study [15]. SR [6], information seeking in electronic
environments [17], and a model of user interaction [21] were used to provide rationale
for the first study. In that study, the SR concept was used to elicit features. SR was
defined as usefulness of an information object for users’ tasks [4]. SR also referred to
different intellectual interpretations that a user conducted to interpret if an information
object was useful [4]. The four SR types were [6]:
- Topical relevance: This relevance is achieved if the topic covered by the assessed information object corresponds to the topic in the user's information need.
- Pertinence relevance: This relevance is measured based on a relation between the user's knowledge state and retrieved information objects as interpreted by the user.
- Situational relevance: This relevance is determined based on whether the user can use retrieved information objects to address a particular task.
- Motivational relevance: This relevance is assessed based on whether the user can use retrieved information objects in ways that are accepted by the community.
The first study also investigated how stages in Marchionini’s [17] model of information seeking were mapped to phases in Norman’s [21] model of user interaction.
The mapping aimed to illustrate how users might interact with an IR system to complete tasks. Our mapping showed that Marchionini's [17] model was similar to Norman's [21] model in terms of three stages (see Figure 1). We inferred that Norman's [21] phases of task completion applied within each stage, as each stage involved completing a task such as query formulation.
In the first study, subjects completed a task using exemplary IR systems. The task asked subjects to think about what features supported their relevance evaluation of documents. Subjects brainstormed SR features for IR interfaces. Elicited features were analyzed using SR types, stages in information seeking, and phases in the model of user interaction to understand how students used features during IR interactions. Features not coded to SR types were removed. Details of this
study are found in [15].
[Figure 1 depicts three stages of users' interactions in IR systems:
Stage 1 (Search page): formulate query (before execution phase); execute query in search page (during execution phase).
Stage 2 (Results list page): review documents retrieved (after execution phase).
Stage 3 (Document record page): use details about documents to evaluate relevance (after execution phase).]
Fig. 1. Stages of Users' Interactions in IR Systems
4 A Study
Using digital libraries as examples of IR systems, we designed a survey form and
conducted a study based on SR features from the first study. In an ideal situation, various methods, such as reviewing IR systems and asking large groups of users, could be used to gather features for the survey. However, these methods could yield many features and make it difficult to decide which features to include in the survey.
The study was exploratory and aimed to gather students’ perceptions of features
elicited in the first study. Specifically, the study investigated students’ perceptions of
features as they imagined completing a task in a digital library. Data gathered were
analyzed using exploratory factor analysis (EFA) as EFA removed redundant features
and identified relationships so that groups describing most of the original data were
discovered [14; 20]. Thus, groups of features supporting students’ IR interaction
stages could be identified to provide guidelines for designing IR interfaces supporting
SR. We chose a quantitative study because a qualitative study could be expensive and time-consuming, requiring us to interview subjects and to videotape and transcribe interviews. Moreover, a qualitative study might gather rich data with many relationships, making it difficult to remove redundant features, and the data gathered might not be generalisable to larger populations.
4.1 Designing the Survey Form
The designed survey form consisted of three parts:
- Part 1 provided a brief overview of the study.
- Part 2 included a glossary of difficult terms to help participants rate SR features.
- Part 3 consisted of two sections. Section A contained a list of 50 SR feature
questions. A five-point Likert scale (very important; important; neutral; not very
important; not important) was used to rate each SR feature. Our previous work [15]
indicated that SR judgments were related to users’ tasks and IR interactions. Hence,
a task scenario and stages that students might experience were highlighted at the
start of Section A. The IR interaction stages were: S1) formulate and execute query
in the search page; S2) review documents in results list; and S3) view details in the
document record page to support evaluation of documents. Participants considered
the task and stages as they rated SR features. This approach was in line with Carroll’s [5] scenario-based design. Section B contained demographic questions.
The form was pilot-tested with 2 self-reported information seeking experts and 2
novices. Their feedback indicated that questions might be organized by IR interaction
stages. Analyses done in the first study [15] were used to re-organize questions.
4.2 Methodology
The survey form was handed out during 6 Master’s level and 8 Undergraduate level
classes. Participants rated their perceptions of importance of SR features based on a
given scenario of use. 565 responses were received of which 465 were valid. A valid
response was defined as a form that had all 50 SR feature questions answered.
Profiles of Participants
48.4% of students were males and 51.6% of students were females. Ages ranged from
18-49 years old and 65% were less than 23 years old. The high percentage of students
younger than 23 years old was because most of them were undergraduates.
Data Analysis Method
EFA was conducted separately on the questions for each of the 3 interaction stages, using Principal Components Analysis with varimax rotation and a 0.4 factor-loading cutoff. This cutoff is suitable for EFA [20].
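The extraction-and-rotation pipeline can be sketched as follows. This is an illustrative Python/NumPy reimplementation (the paper does not name its statistical software, so none of these function names come from the study): principal-components loadings are taken from the item correlation matrix, varimax-rotated, and thresholded at 0.4:

```python
import numpy as np

def pca_loadings(X, n_factors):
    """Principal-components loadings: eigenvectors of the item correlation
    matrix scaled by the square roots of their eigenvalues."""
    corr = np.corrcoef(X, rowvar=False)
    vals, vecs = np.linalg.eigh(corr)
    order = np.argsort(vals)[::-1][:n_factors]      # largest eigenvalues first
    return vecs[:, order] * np.sqrt(vals[order])

def varimax(loadings, max_iter=100, tol=1e-6):
    """Varimax rotation (Kaiser, 1958): an orthogonal rotation that
    simplifies the loading pattern without changing communalities."""
    p, k = loadings.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p))
        R = u @ vt
        crit_new = s.sum()
        if crit_new - crit < tol:
            break
        crit = crit_new
    return loadings @ R

def thresholded(loadings, cutoff=0.4):
    """Zero out loadings below the absolute cutoff (0.4 in the study)."""
    return np.where(np.abs(loadings) >= cutoff, loadings, 0.0)
```

Because varimax is an orthogonal rotation, the communality of each item (the row sum of squared loadings) is unchanged; only the pattern of loadings across factors is simplified.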
Three heuristics were used to extract the number of factors for each analysis. In the
first heuristic, factors were extracted above the “elbow” of the scree plot [14; 20]. The
second heuristic extracted as many factors as had eigenvalues greater than 1 [14; 20]. The third heuristic compared eigenvalues from a dummy (random) dataset with eigenvalues from the real dataset; factors in the real dataset whose eigenvalues were higher than those in the dummy dataset were retained [14]. These heuristics provided
a range of factors to explore to derive the most meaningful factor solution. The most
meaningful factor structure was selected using these criteria [7]: 1) the factor structure
accounted for at least 50% of the variance amongst features included in the structure;
2) each factor had at least 3 features; 3) no or few cross factor loadings; and 4) factors
must be meaningful. Reliability of each factor was checked using Cronbach’s coefficient alpha [8]. A threshold value of 0.6 was selected [22]. If a factor had an alpha
value below 0.6, items in the factor were removed and analysis was repeated.
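The second and third heuristics are mechanical enough to sketch in code. The following is an illustrative Python sketch, not the authors' code; the "dummy dataset" heuristic corresponds to what is usually called Horn's parallel analysis:

```python
import numpy as np

def eigenvalues(X):
    """Eigenvalues of the item correlation matrix, largest first."""
    vals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
    return np.sort(vals)[::-1]

def kaiser_count(X):
    """Heuristic 2: retain as many factors as have eigenvalues greater than 1."""
    return int((eigenvalues(X) > 1.0).sum())

def parallel_analysis_count(X, n_sims=100, seed=0):
    """Heuristic 3: retain factors whose real-data eigenvalues exceed the
    mean eigenvalues of random 'dummy' data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    real = eigenvalues(X)
    sim = np.zeros(p)
    for _ in range(n_sims):
        sim += eigenvalues(rng.standard_normal((n, p)))
    return int((real > sim / n_sims).sum())
```

The scree-plot "elbow" heuristic is deliberately left out of the sketch, since it is a visual judgment rather than a fixed rule.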
It is emphasized that the final factor solution for each interaction stage was decided
based on the criteria for most meaningful factor structure and we did not aim for each
factor to account for more than 50% of the variance amongst features in the solution.
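The reliability check is also simple to state precisely. A sketch of Cronbach's coefficient alpha (again illustrative, not the authors' code), assuming a respondents-by-items matrix for the features in one factor:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's coefficient alpha for an (n_respondents, n_items) array:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # per-item variances, summed
    total_var = items.sum(axis=1).var(ddof=1)     # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars / total_var)
```

A factor whose alpha fell below the study's 0.6 threshold would have items removed and the analysis repeated, as described above.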
5 Findings and Discussion
Factors for stage 1 are described in detail. Due to limited space, findings for stages 2 and 3 are shown in tables and described briefly. We discuss the implied order of importance for the factors in each stage and their implications for interface design. Findings are also discussed in terms of how features support the information seeking activities stated in Ellis' [10] behavioral model of information seeking.
5.1 Findings for Stage 1 (Search Page)
We started with a comprehensive set of 17 SR features for the search page. EFA reduced it to 14 features and loaded them onto 3 factors. The factors accounted for
54.543% of the total variance (that is, the dispersion of data) in the 14 features. The
features were coded to pertinence relevance in the first study [15], thus, factor names
attempted to reflect this fact. SR features here were coded to pertinence relevance
because success of determining pertinence relevance depends, to a certain extent, on
the ability of users to formulate queries. In turn, users’ ability to formulate queries is
dependent on their knowledge of a topic or perceptions of information need [6].
Table 1 shows factor loadings for stage 1. Factors are labeled S1_F1 to S1_F3 to indicate that they supported stage 1, with their respective factor numbers in this stage. Tables 2 and 3 are constructed similarly. Factors for stage 1 are described in detail:
- Factor S1_F1: Search Options for Query Formulation and Pertinence Relevance
Features in Factor S1_F1 (see Table 1, column S1_F1) indicated search options that guided students in formulating queries, especially those who could not articulate their needs. The alpha value for this factor was 0.852.
Table 1. Factor Loadings of SR Features for Stage 1
SR features
1. Search in journal title field
2. Search in abstract field
3. Search in author field
4. Search in document full text
5. Provide search tutorials and examples
6. Provide advanced search mode
7. Provide basic search mode
8. Provide “clear query” button
9. Provide search history
10. Basic search considers query as a phrase if no Boolean operators are specified
11. Method of entering and executing queries should be simple like search engines
12. Provide search entry boxes
13. Search in keywords field
14. Search in title field
[Numeric factor-loading values for columns S1_F1–S1_F3 not reproduced]
- Factor S1_F2: Additional Features for Query Formulation and Pertinence Relevance
Factor S1_F2 described additional features that supported query formulation in the search page. Example features were: provide basic and advanced search modes (see Table 1, column S1_F2 for all features). This factor's alpha value was 0.644.
- Factor S1_F3: Basic Features for Query Formulation and Pertinence Relevance
This factor included basic features that let students specify their queries, such as providing search entry boxes. Search options here supported query formulation for students who knew their information need, such as keywords describing contents and titles of documents (see Table 1, column S1_F3 for features). The alpha value was 0.669.
5.2 Discussion for Stage 1 (Search Page)
Principles of EFA indicate that the first factor extracted accounts for the highest percentage of total variance in all variables analyzed, and each subsequent factor accounts for as much of the remaining variance as possible not accounted for by the preceding factors [14]. Thus, the order in which factors were extracted and the percentage of total variance each explained were used to imply the order of importance for factors in each stage [13]. This rationale for implying order of importance was used to discuss findings for all stages.
- Most Important SR Features for Stage 1
Factor S1_F1 contained the most important SR features for stage 1, as it accounted for the highest amount of total variance in the 14 features analyzed for this stage (34.142%). This factor indicated different search options for the search page (see Table 1). Thus, students might have found search options most important because they showed the types of information that could be searched. Search options in Factor S1_F1 differed from those in Factor S1_F3 (see Table 1, rows 13-14) because search options in Factor S1_F1 were more comprehensive and allowed students to search for documents by different means, such as by author, abstract, or full text, whereas search options in Factor S1_F3 seemed to support query formulation for students who knew the titles and keywords of the documents they needed.
- Second Most Important SR Features for Stage 1
Features in Factor S1_F2 (see Table 1) were the second most important SR features, as the factor ranked second in percentage of total variance in the 14 features analyzed (11.293%). Thus, it was inferred that, besides search options, students also wanted other features to support query formulation. For example, if different search modes were designed, students could select a search mode depending on their needs.
- Third Most Important SR Features for Stage 1
Features in Factor S1_F3 (see Table 1) ranked third for the amount of total variance in the 14 features analyzed in stage 1 (9.109%). One reason could be that students felt the feature, "provide search entry boxes", was redundant, since search pages should have text boxes for users to enter queries. Factor S1_F3 was similar to Factor S1_F1 in that search options were available in both factors. However, search options in Factor S1_F3 might not be as important as those in Factor S1_F1, as students might not know the keywords or titles of relevant works. Thus, search options in Factor S1_F1 would provide more access points for students to search for documents.
Analyses of SR features for stage 1 yielded three factors ranked in implied order of
importance. Hence, depending on students’ needs and design resources, different
groups of SR features might be designed in the search page. For example, if resources
were limited, then the most important SR features in Factor S1_F1 could be designed.
However, if comprehensive support for query formulation was needed then all three
factors of SR features could be designed to provide basic and advanced search pages.
Features highlighted in factors for stage 1 seemed to support the information seeking activities of starting, browsing and monitoring. Features here might support starting as students could have initial references recommended by their teachers and they
might formulate queries to find out if these documents were available in the system.
Alternatively, students could already have a clear understanding of their need and
were actively browsing (that is, semi-directed / semi-structured searching) to look for
relevant documents or they could search the system to monitor developments within
interested areas. Figure 2 shows the designed search page with the most important SR features. The search option with the highest factor loading was placed at the top and the one with the lowest factor loading at the bottom.
[Figure 2 annotations: fielded search options from Factor S1_F1; instructions on how to use the search page.]
Fig. 2. Search Page with Most Important SR Features
5.3 Findings for Stage 2 (Results List Page)
A comprehensive list of 21 SR features for stage 2 was reduced to 5 factors. The factors accounted for 52.567% of the total variance in all 21 features. Factors are labeled S2_F1 to S2_F5; factor loadings and alpha values are described in Table 2.
Factor S2_F1 was labeled “point students to documents supporting topical, situational
and motivational relevance” as features (see Table 2, rows 1-5) were coded to these SR
types and indicated different ways of pointing students to other documents. Features in
Factor S2_F2 (see Table 2, rows 6-10) could help students find suitable contents and
document types for their needs. Moreover, features were coded to topical, situational and
motivational relevance in the first study [15]. Hence, this factor was named “features for
evaluating contents for topical, situational and motivational relevance”. Features for
Factor S2_F3 (see Table 2, rows 11-13) were coded to topical and situational relevance in
the first study [15] so this factor was named “alternate ways of presenting results list to
support topical and situational relevance”. Factor S2_F4 was labeled “extra information
to evaluate documents for topical, situational and motivational relevance” as features
(see Table 2, rows 14-17) were coded to topical, situational and motivational relevance in
the first study [15]. These features provided additional information about retrieved
documents and its source to facilitate document evaluations. Features for Factor S2_F5
(see Table 2, rows 18-21) included those that were commonly available in results list and
they were coded to topical relevance in the first study [15]. Hence, this factor was named
“common features available in results list page to support topical relevance”.
5.4 Discussion for Stage 2 (Results List Page)
- Most Important SR Features for Stage 2
Factor S2_F1 (see Table 2) was inferred to contain the most important SR features for stage 2, as it had the highest percentage of total variance in all features analyzed (26.060%). The survey form asked students to rate features with the assumption that the results list included a list of retrieved documents. Hence, it was inferred that features in Factor S2_F1 could be built on top of retrieved documents in the results list page.
- Second Most Important SR Features for Stage 2
Features in Factor S2_F2 (see Table 2) focused on allowing students to evaluate appropriate contents and document types for their needs. This factor was inferred as second most important because it ranked second in terms of total variance in all features analyzed for stage 2 (7.911%).
- Third Most Important SR Features for Stage 2
Factor S2_F3 (see Table 2) focused on providing novel ways of presenting the results list and explanations of how documents were ranked. Features here might indicate that students were willing to try new ways of presenting documents in the results list to determine if these methods were effective. Features in this factor were inferred as third most important because its percentage of total variance in all features analyzed ranked third amongst the factors extracted for stage 2 (6.912%).
- Fourth Most Important SR Features for Stage 2
Factor S2_F4 focused on features that provided additional information to help students evaluate documents for their needs. Thus, if students could not get sufficient information, they might turn to features in Factor S2_F4 to get more information to support their document evaluations. Features here were implied as the fourth most important for stage 2, as its percentage of total variance in all features analyzed (5.916%) ranked fourth amongst the five factors for this stage.
- Fifth Most Important SR Features for Stage 2
Features in Factor S2_F5 (see Table 2) were inferred as fifth most important for this stage, as its percentage of total variance in all features analyzed ranked fifth (5.767%). A possible reason is that students considered these features to be commonly available in results lists, and hence redundant, as they matched students' expectations.
Table 2. Factor Loadings of SR Features for Stage 2
SR features
[Numeric factor-loading values not reproduced]
Factor S2_F1: Point students to documents supporting topical, situational and motivational relevance (Alpha value: 0.738)
1. Recommend related documents and topics based on query
2. Recommend related documents for each document retrieved
3. Provide details of other people the author had worked with
4. Recommend documents based on what others have looked at
5. Recommend related documents based on user’s profile and
searching behavior
Factor S2_F2: Features for evaluating contents for topical, situational and motivational relevance (Alpha value: 0.697)
6. Provide an abstract for each document retrieved in results list
7. Allow users to preview abstract before downloading full text
8. Highlight search terms for each document in results list
9. Provide an option so users can choose to display a paragraph
or a few lines in which search terms appear in full text
10. Categorize documents retrieved based on types of
documents like journals, conference proceedings, etc.
Factor S2_F3: Alternate ways of presenting results list to support topical and situational relevance (Alpha value: 0.643)
11. Rank documents in results list in terms of how many times it
has been used by others
12. Provide explanation of how documents are ranked
13. Present results list in pictorial format
Factor S2_F4: Extra information to evaluate documents for topical, situational and motivational relevance (Alpha value: 0.614)
14. Provide link that shows general information about
document’s source
15. Provide link to document source’s table of contents
16. Provide subject categories for each document retrieved
17. Provide selected references cited for each document
Factor S2_F5: Common features available in results list page to support topical relevance (Alpha value: 0.617)
18. Rank retrieved documents in results list in order of relevance
19. Display results list
20. Rank and provide relevance percentage for documents
retrieved in results list
21. Allow searching within documents retrieved in results list
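The alpha value reported with each factor is Cronbach’s alpha [8], an internal-consistency estimate computed from the ratings of the items that load on that factor. As a minimal sketch of the computation (the ratings below are invented for illustration and are not the study’s data):

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a (respondents x items) matrix of ratings."""
    k = item_scores.shape[1]                          # number of items in the factor
    item_vars = item_scores.var(axis=0, ddof=1)       # per-item variances
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented 5-point ratings from six students on the five items of Factor S2_F1.
ratings = np.array([
    [4, 5, 3, 4, 4],
    [3, 4, 3, 3, 4],
    [5, 5, 4, 4, 5],
    [2, 3, 2, 3, 2],
    [4, 4, 4, 5, 4],
    [3, 3, 2, 3, 3],
], dtype=float)
print(round(cronbach_alpha(ratings), 3))
```

Values above roughly 0.6–0.7, like those reported for the factors in Table 2, are conventionally taken to indicate acceptable internal consistency.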
Each factor seemed to include features exclusive to it, except for an overlap between Factors S2_F3 and S2_F5, which occurred because features in both factors related to ranking of retrieved documents. There were, however, slight differences: the feature in Factor S2_F3 (see Table 2, row 11) focused on ranking retrieved documents by frequency of use, whereas the features in Factor S2_F5 (see Table 2, rows 18 and 20) focused on ranking documents in order of relevance and relevance percentage.
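The contrast between the two ranking features can be made concrete. In the sketch below (the records and field names are invented for illustration), the Factor S2_F3 feature orders the same result set by frequency of use, while the Factor S2_F5 feature orders it by relevance score:

```python
# Invented result records; the fields are illustrative, not the study's system.
results = [
    {"title": "Doc A", "relevance": 0.91, "times_used": 12},
    {"title": "Doc B", "relevance": 0.78, "times_used": 57},
    {"title": "Doc C", "relevance": 0.85, "times_used": 3},
]

# Factor S2_F3, row 11: rank by how many times documents have been used by others.
by_use = sorted(results, key=lambda d: d["times_used"], reverse=True)

# Factor S2_F5, row 18: rank in order of relevance.
by_relevance = sorted(results, key=lambda d: d["relevance"], reverse=True)

print([d["title"] for d in by_use])        # most-used documents first
print([d["title"] for d in by_relevance])  # most relevant documents first
```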
An order of importance was implied amongst factors for stage 2. Thus, features in
different factors could be implemented as groups. Students might activate clusters and
incrementally add features to the interface as pop-up boxes and pull-down menus.
Features highlighted in factors for stage 2 seemed to support the information seeking activities of chaining and differentiating. Students might perform backward chaining by following references cited in documents to gain access to other documents.
Backward chaining might be supported by the feature, “provide selected references
cited for each document”. Forward chaining was also supported by features in factors
for stage 2, which involved providing links to other possibly relevant documents through recommendation methods such as users’ profiles and related topics.
S.-S. Lee et al.
Most features in factors for stage 2 aimed to provide information to help students differentiate whether a retrieved document was worth evaluating in more detail in the document record page. Examples of such features were: provide an abstract, and categorize documents based on document type. Figure 3 illustrates the designed results list page
incorporating most important features for stage 2 (Factor S2_F1). Features were built
on top of a ranked list of retrieved documents.
[Figure callouts from Factor S2_F1: recommendations of documents and topics related to the query; recommendations of documents based on what others have looked at; and details of other people the author has worked with.]
Fig. 3. Results List Page with Most Important SR Features
5.5 Findings for Stage 3 (Document Record Page)
Twelve comprehensive features loaded onto three factors. Factor loadings, factor
names and alpha values for stage 3 are shown in Table 3. The factors accounted for
58.959% of the total variance in the 12 features analyzed.
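The percentages of total variance used throughout to rank factors derive from the eigenvalues of the item correlation matrix: with standardized items, total variance equals the number of items, and each component’s eigenvalue is its share of it. A rough sketch of that arithmetic with invented ratings (not the study’s data):

```python
import numpy as np

rng = np.random.default_rng(42)
# Invented (students x items) ratings standing in for the 12 stage-3 items.
ratings = rng.integers(1, 6, size=(40, 12)).astype(float)

corr = np.corrcoef(ratings, rowvar=False)          # 12 x 12 item correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first

# The eigenvalues sum to the number of items, so each one's share of that
# sum is its "percentage of total variance" in the features analyzed.
pct = 100 * eigvals / eigvals.sum()
print([round(p, 2) for p in pct[:3]])              # first three components' shares
```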
Factor S3_F1 was named “seek others’ help to evaluate documents for pertinence
and motivational relevance” as features identified (see Table 3, rows 1-4) were coded
to pertinence and motivational relevance in the first study [15]. Features here seemed
to allow students to discuss relevance with authors and other users. Features in Factor
S3_F2 (see Table 3, rows 5-9) were coded to situational relevance in our first study
[15] and facilitated management of full text. Thus, this factor was labeled “features
that support access and management of full text for situational relevance”. Factor
S3_F3 (see Table 3, rows 10-12) provided full text and highlighted search terms so
students could evaluate relevance of highlighted text in relation to contents.
5.6 Discussion for Stage 3 (Document Record Page)
• Most Important SR Features for Stage 3
Features in Factor S3_F1 were inferred as the most important features as its percentage of total variance in all features analyzed was the highest (33.822%). Students
rated features based on an understanding that the document record page provided
detailed information, such as title, author, and publisher. Hence, it was inferred that
students were keen to discuss with others to find relevant documents and features here
could be built on top of detailed information in document record page.
Table 3. Factor Loadings of SR Features for Stage 3
SR features
Factor loadings
Factor S3_F1: Seek others’ help to evaluate documents for pertinence and motivational relevance (Alpha value: 0.795)
1. Provide asynchronous collaborative features
2. Provide synchronous collaborative features
3. Provide author’s contact details
4. Allow users to ask experts to evaluate documents retrieved
Factor S3_F2: Features that support access and management of full text for situational relevance (Alpha value: 0.761)
5. Allow full text to be saved using its title as the default file name
6. Allow full text to be saved in a compressed version
7. Print full text without “highlighted / bolded” search terms
8. Provide “reader” software in the document record page
9. Specify on which pages in the full text search terms appear and provide a link to the page
Factor S3_F3: Highlight portions in full text and point users to other documents for situational relevance (Alpha value: 0.657)
10. Highlight search terms in full text
11. Provide links to full text of documents cited in the current document
12. Allow users to download full text in PDF format
• Second Most Important SR Features for Stage 3
Factor S3_F2 focused on providing features that facilitated access and management of
full texts. Hence, it was inferred that students wanted easy access and management of
full texts so that they would extract relevant content for tasks. Features here were
deduced as the second most important features as its percentage of total variance in all
features analyzed (15.233%) was ranked second amongst factors for stage 3.
• Third Most Important SR Features for Stage 3
Features in Factor S3_F3 were specified as the third most important as their percentage of total variance in all features analyzed (9.904%) was ranked third amongst factors for this stage. Reasons could be: 1) students wanted to read the full text to extract information; and 2) students might find the full text of cited documents to be relevant.
[Figure callouts from Factor S3_F1: provide author’s contact details; allow users to ask experts to evaluate the retrieved document; and asynchronous and synchronous collaborative features.]
Fig. 4. Document Record Page with Most Important SR Features
The three factors extracted for stage 3 seemed to indicate that three important
groups of features could be designed. Features in these groups seemed unique and
there were no overlaps. Thus, depending on design requirements, different groups of important features could be designed. Features indicated in factors for stage 3
seemed to support the information seeking activities of differentiating and extracting. This was because the document record page provided detailed information so
that students could differentiate if the retrieved document was useful. Moreover, the
document record page also provided access to full text so that students could extract relevant information.
Figure 4 shows the designed document record page with the most important SR features. As students rated features based on an understanding that the document record page provided detailed information about the document, like title, author, and publisher, features in Factor S3_F1 were built on top of such information.
6 Conclusion and On-Going Work
Our approach differs from approaches addressing collaborative browsing and filtering, user-centered design approaches and user-defined criteria for relevance judgments highlighted in Section 2. Firstly, our approach used SR as a theoretical basis to
elicit features supporting document evaluations. We also used stages of IR interaction
to understand how students might use features to complete tasks in IR systems. Secondly, we investigated students’ perceptions for elicited features using EFA. The
contributions of our work are:
• EFA extracted groups of SR features to support each stage of students’ IR interactions. Although all groups of features were important to form the factor solutions to
support students’ document evaluations during IR interactions, there seemed to be
an implied order of importance amongst groups. Thus, depending on requirements,
different groups of features could be designed in IR interfaces.
• The groupings seemed to indicate clusters of SR features that could be implemented collectively. Students might activate different clusters and features could be added to
the interface in the form of pop-up boxes and pull-down menus.
Findings presented are preliminary and have limitations. The study gathered students’ perceptions of the importance of SR features without the students actually using the system.
Students might have different understandings of SR features and this could be problematic when students did not have prior experience using such features. Hence, future work may focus on verifying and evaluating our findings in a qualitative study
where users could comment on importance of SR features in actual context of use.
Findings presented are exploratory and applied specifically to students who participated in the study. Future work might use EFA to discover groups of SR features
supporting IR interactions for other students in different task scenarios so that insights
could be gathered on the needs of larger student populations for IR interfaces supporting SR. The translation of factors into interface design is another area for future work.
References
1. Balabanovic, M. and Shoham, Y. (1997). Fab: Content-based, collaborative recommendation. Communications of the ACM, 40 (3), 66-72.
2. Barry, C. L. (1994). User-defined relevance criteria: an exploratory study. Journal of the
American Society for Information Science, 45 (3), 149-159.
3. Belkin, N. J., Oddy, R. N., and Brooks, H. (1982). ASK for information retrieval: Part I. Background and theory. The Journal of Documentation, 38 (2), 61-71.
4. Borlund, P. and Ingwersen, P. (1998). Measures for relative relevance and ranked half-life:
Performance indicators for interactive IR. Proceedings of the 21st Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, ACM
Press, 324-331.
5. Carroll, J. M. (2000). Making use: Scenario-based design of human-computer interactions. Cambridge, MA, USA: The MIT Press.
6. Cosijn, E., and Ingwersen, P. (2000). Dimensions of relevance. Information Processing and Management, 36, 533-550.
7. Costello, A. B. and Osborne, J. W. (2005). Best practices in exploratory factor analysis: four
recommendations for getting the most from your analysis. Practical Assessment, Research &
Evaluation: A Peer-reviewed Electronic Journal, 10 (7), http://pareonline.net/pdf/v10n7.pdf.
8. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika
16, 297-334.
9. Druin, A. et al. (2001). Designing a digital library for young children: An intergenerational
partnership. Proceedings of the 1st ACM/IEEE-CS Joint Conference on Digital Libraries,
ACM Press, 398-401.
10. Ellis, D. (1989). A behavioural approach to information retrieval system design. Journal of
Documentation, 45 (3), 171-212.
11. Fischer, G., Henninger, S., and Redmiles, D. (1991). Cognitive tools for locating and comprehending software objects for reuse. Proceedings of the 13th International Conference on Software Engineering, IEEE Computer Society Press, 318-328.
12. Ingwersen, P. and Borlund, P. (1996). Information transfer viewed as interactive cognitive
processes. In Ingwersen, P. and Pors, N. O. (Eds.). Information Science: Integration in
Perspective. Royal School of Librarianship, Denmark, 219-232.
13. Kim, Jae-On and Mueller, C. W. (1978). Factor analysis: statistical methods and practical
issues. California: Sage Publications, Inc.
14. Lattin, J., Carroll, J. D., and Green, P. E. (2003). Analyzing multivariate data. Nelson,
Canada: Brooks/Cole.
15. Lee, S. S., Theng, Y. L., Goh, D. H. L., and Foo, S. S. B. (2005). Subjective relevance: implications on interface design for information retrieval systems. In Fox, E., Neuhold, E. J., Pimrumpai, P., and Wuwongse, V. (Eds.), The 8th International Conference on Asian Digital Libraries, ICADL 2005. Digital libraries: implementing strategies and sharing experiences (pp. 424-434). Berlin, Germany: Springer-Verlag.
16. Lieberman, H. (1995). Letizia: An agent that assists web browsing. Proceedings of the International Joint Conference on Artificial Intelligence, 924-929.
17. Marchionini, G. (1995). Information seeking in electronic environments. Cambridge, UK:
Cambridge University Press.
18. Meyyapan, N., Chowdhury, G. G. and Foo, S. (2001). Use of a digital work environment
prototype to create a user-centered university library. Journal of Information Science, 27
(4), 249-264.
19. Mizzaro, S. (1998). How many relevances in information retrieval? Interacting with Computers, 10, 303-320.
20. Netemeyer, R. G., Bearden, W. O., and Sharma, S. (2003). Scaling procedures: issues and
applications. California, USA: Sage Publications.
21. Norman, D. A. (1998). The psychology of everyday things. New York: Basic Books.
22. Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
23. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of Netnews. Proceedings of the ACM Conference on Computer Supported Cooperative Work, ACM Press, 175-186.
24. Tang, R. and Solomon, P. (1998). Toward an understanding of the dynamics of relevance judgment: An analysis of one person’s search behavior. Information Processing and Management, 34, 237-256.
Representing Contextualized Information in the NSDL
Carl Lagoze1, Dean Krafft1, Tim Cornwell1, Dean Eckstrom1,
Susan Jesuroga2, and Chris Wilper1
1 Computing and Information Science, Cornell University, Ithaca, NY 14850 USA
{lagoze, dean, cornwell, eckstrom, cwilper}@cs.cornell.edu
2 UCAR-NSDL, PO Box 3000, Boulder, CO 80307 USA
[email protected]
Abstract. The NSDL (National Science Digital Library) is funded by the National Science Foundation to advance science and math education. The initial
product was a metadata-based digital library providing search and access to distributed resources. Our recent work recognizes the importance of context – relations, metadata, annotations – for the pedagogical value of a digital library.
This new architecture uses Fedora, a tool for representing complex content,
data, metadata, web-based services, and semantic relationships, as the basis of
an information network overlay (INO). The INO provides an extensible knowledge base for an expanding suite of digital library services.
1 Introduction
Libraries, traditional and digital, are by nature information rich environments - the
organization, selection, and preservation of information are their raison d’être. In
pursuit of this purpose, libraries have focused on two areas: building a collection of
all the resources that meet the library’s selection criteria, and building a catalog of
metadata that facilitates organization and discovery of those resources.
This is the approach that the NSDL (National Science Digital Library) Project took
over its first three years of existence, when it focused mainly on the location and
development of resources appropriate for Science, Technology, Engineering, and
Mathematics education, and the creation of quality metadata about those resources.
This focus was reflected in the technical infrastructure that harvested metadata from
distributed providers, processed and stored that metadata, and made it available to
digital library services such as search and preservation.
The value of an excellent collection of resources as a basis for library quality is
undeniable. And, even after years of advances in automatic indexing, metadata remains important for a class of resources and applications. However, our three years
of effort in the NSDL have revealed that collection building and metadata aggregation
are necessary but not sufficient activities for building an information-rich digital library. In particular, our experience has led to two conclusions. First, the technical
and organizational infrastructure to support harvesting, aggregation, and refinement of
metadata is surprisingly human-intensive and expensive [15]. Second, in a world of
increasingly powerful and ubiquitous search engines, digital libraries must distinguish
themselves by providing more than simple search and access [16]. This is particularly
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 329 – 340, 2006.
© Springer-Verlag Berlin Heidelberg 2006
C. Lagoze et al.
true in educationally-focused digital libraries where research shows the importance of
interaction with information rather than simple intake.
Based on these conclusions, we have redirected our efforts over the past year towards building a technical infrastructure that supports a more refined definition of
information richness. This definition includes, of course, collection size and integrity,
and it accommodates the relevance of structured metadata. But it adds the notion of
building information context around digital library resources. Our goal is to create a
knowledge environment that supports aggregation of multiple types of structured and
unstructured information related to library resources, the instantiation of multiple
relationships among digital library resources, and participation of users in the creation
of this context. We are creating an infrastructure that captures the wisdom of users
[32], adding information from their usage patterns and collective experience to the
formal resources and structured metadata we already collect.
Our technical infrastructure is based on the notion of an information network overlay [16] – a directed, typed graph that combines local and distributed information
resources, web services, and their semantic relationships. We have implemented this
infrastructure using Fedora [17], an architecture for representing complex objects and
their relationships.
In this paper we describe the motivations for this architecture, present the information model that underlies it, and provide results from our year of implementation. We
note for the reader that this is still a work in progress. The results we provide in this
paper relate to the implementation and scaling issues in creating a rich information
model. As our work progresses, we will report in future papers on the effectiveness
of this architecture from the perspective of the user and evaluate whether it really
does enable a richer and more useful digital library.
The organization of this paper is as follows. Section 2 describes related work and situates this work in the context of other digital library efforts. Section 3 summarizes the importance of information contextualization for educational digital libraries. Section 4 provides a brief background on the NSDL and establishes the application context in which this work occurs. Section 5 describes the information model of the information network overlay. Section 6 provides the results of our implementation experience. Finally, Section 7 concludes the paper.
2 Related Work
The work described in this paper builds on a number of earlier and ongoing research
and implementation projects that investigate the role of user annotations in information environments, the importance of inter-resource relationships, and the integration
of web services with digital content. We believe that our work is distinguished from
these other projects in two ways. First, it combines traditional digital library notions
of resources and structured metadata with service-oriented architecture and semantic
web technology, thereby representing the rich relationships among a variety of structured, unstructured, and semi-structured information. Second, it implements this rich
information environment at relatively large scale (millions of resources), exercising a
number of state-of-the-art technologies beyond their previous deployments.
Perhaps the most closely related work is the body of research on information annotation. Catherine Marshall has written extensively on this subject [20] in the digital
library and hypertext context. A number of systems have been developed that
implement annotation in digital libraries. For example, Roscheisen, Mogensen, and
Winograd created a system called ComMentor [31] that allowed sharing of unstructured
comments about on-line resources. The multi-valent document work at Berkeley
provides the interface and infrastructure for arbitrary markup and annotation of digital
documents, and storage and sharing of that markup [34]. The semantic web community has also examined annotation, with the Annotea project [13] being the most notable example.
The importance of annotation capabilities for education and scholarly digital libraries has been noted by many researchers including Wolfe [35]. The ScholOnto project
[24] created a system for the publication and discussion/annotation of scholarly
papers, arguing for the importance of informal information along-side established
resources. Constantopoulos et al. [8] examine the semantics of annotations in the SCHOLNET project, an EU-funded project to build a rich digital library environment
supporting scholarship. Within the NSDL effort, there have been a number of projects that support annotations, most notably DLESE (Digital Library for Earth System
Education) [1].
Annotations and their association with primary resources are one class of the variety of relationships that can be established among digital content. Ever since Vannevar Bush envisioned hypertext [5], researchers have been examining tools for interlinking information. Faaborg and Lagoze [11] examined the notion of semantic
browsing whereby users could establish personalized and sharable semantic relationships among existing web pages. Huynh, et al. [12] have recently done similar work
in the Simile project.
There is also related work on resource linking specifically for pedagogic purposes
within the educational research community. Unmil, et al. [33] describe Walden’s
Paths, a project that allows teachers to establish meta-structure over the web graph for creation of lesson plans and other learning materials. Recker et al. have created another system, Instructional Architect [28], that similarly allows integration of
on-line resources by teachers into educational units.
Finally, an important component of the work described here is the integration of
content and web services. In many ways our digital library “philosophy” resembles that of Web 2.0 [25]. Key components of this are the collection
and integration of unique data, the participation of users in that data collection and
formulation process, and the availability of the data environment as a web service that
can be leveraged by value-add providers. Chad and Miller [6] extend Web 2.0 to
something they call Library 2.0. We hope that our work demonstrates many of the
principles they describe, notably the notion that Library 2.0 encourages a “culture of
participation” and provides the interface to its accumulated information for “mash-ups” that exploit library information in innovative ways.
3 The Need for Context and Reuse
Research shows that education-focused digital libraries (and digital libraries in general) need to support the full life cycle of information [19]. Reeves wrote “The real
power of media and technology to improve education may only be realized when
students actively use them as cognitive tools rather than simply perceive and interact
with them as tutors or repositories of information.” [30].
One requirement that appears frequently in the learning technology literature is the
reuse of resources for the creation of new learning objects. This involves integrating
and relating existing resources into a new learning context. A learning context has
many dimensions including social and cultural factors; the learner’s educational system; and the learner’s abilities, preferences and prior knowledge [21].
Most digital libraries, including the NSDL, currently rely on forms of metadata to
describe learning objects and enable discovery. Metadata standards abstract properties
of learning objects, and abstraction can lead to instances where learning context is
ignored or reduced to single dimensions [26]. Metadata is often focused on the technical aspects of description and cataloging, not on capturing the actual context of
instructional use. Recker and Wiley write “a learning object is part of a complex web
of social relations and values regarding learning and practice. We thus question
whether such contextual and fluid notions can be represented and bundled up within
one, unchanging metadata record.” [29]
McCalla also argues that there is no way of guaranteeing that metadata captures the
breadth and depth of content domains. He writes that, ideally, learning objects need to
reflect “appropriateness” to address the differences between learners’ needs [22]. In
addition, questions remain as to whether these logical representations (e.g. metadata
and vocabularies) created primarily for use by computer systems will make the most
intuitive sense for learners [7].
Several approaches have been suggested to help supply the rich context for learning object creation and reuse. These include capturing opinions about learning objects and descriptions of how they are used [26]; recording the community of users
from which the learning object is derived [29]; collecting teacher-created linkages
to state education standards [28]; tracking and using student-generated search keywords [2]; and providing access to comments or reviews by other faculty and
students [23].
We see that in order to provide an educationally-focused digital library, the information infrastructure must support flexible integration of information, ranging
from highly structured metadata to unstructured comments and observations. It
needs to be dynamic, expanding not only in the manner that a standard library collection expands, but also based on the collective experience and input of its users.
4 A Suite of Contextualized NSDL Services
We are creating the infrastructure to meet notions of information richness outlined in
the previous section. This work follows more than three years of work by the NSDL
Core Infrastructure (CI) team, and has been described in a number of other papers
[14, 15]. Stated very briefly, this earlier work used OAI-PMH to populate a metadata
repository (MR). This metadata was indexed by a CI-managed search service, which
was accessible by users through a central portal at http://nsdl.org.
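To illustrate the harvesting step, an OAI-PMH ListRecords response can be parsed roughly as below. This is a sketch only: the repository, identifier, and record content are invented, and a real harvester must also handle resumption tokens, deleted records, and protocol errors.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def parse_list_records(xml_text):
    """Extract identifier, datestamp, and title from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        header = rec.find(OAI + "header")
        title = rec.find(f".//{DC}title")
        records.append({
            "identifier": header.findtext(OAI + "identifier"),
            "datestamp": header.findtext(OAI + "datestamp"),
            "title": title.text if title is not None else None,
        })
    return records

# A minimal, hand-written response such as a metadata provider might return.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2006-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical STEM resource</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(parse_list_records(SAMPLE))
```

Records extracted this way would then be stored, indexed by the search service, and surfaced through the portal.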
Our goal is to move beyond the search and access capabilities provided by the MR.
The creation of the NSDL Data Repository (NDR), built on the architecture described
in the next section, provides a platform for a number of exciting new NSDL applications focused directly on increasing user participation in the library. In addition to
creating specific new capabilities for NSDL users, these applications all create context around resources that aids in discovery, selection and use. Four specific applications that exploit the infrastructure described in this paper are currently in various
phases of development, testing, and deployment.
Expert Voices (EV) is a collaborative blogging system that fully integrates the resources and capabilities of the NDR. It allows subject matter experts to create real-time entries on critical STEM issues, while weaving into their presentation direct
references to NSDL resources. These blog entries automatically become both resources in the NSDL library and annotations on all the referenced resources. EV supports Question/Answer discussions, resource recommendations and annotations, the
provision of structured metadata about existing resources, and establishing relationships among existing resources in the NSDL, as well as between blog entries and those resources.
On Ramp is a system for the distributed creation, editing, and dissemination of
content from multiple users and groups in a variety of formats. Disseminations
range from publications like the NSDL Annual Report to educational workshop
materials to online presentations like the Homepage Highlights exhibit at
NSDL.org's homepage. Resources created and released in On Ramp can become
NDR content resources, and NDR resources and other content can be directly incorporated into On Ramp publications, creating new context and relationships
within the NDR.
Instructional Architect, described by Recker [27], “… enables users (primarily
teachers) to discover, select, and design instruction (e.g., lesson plans, study aids,
homework) using online learning resources.” Currently, IA supports searching the
NSDL for resources and incorporating direct references to those resources into an IA
project. The IA team is currently working with the NDR group to support both publication of IA projects as new NSDL resources and the direct capture in the NDR of the web of relationships created by an IA project.
The Content Alignment Tool (CAT), currently in development by a team led by
Anne Diekema and Elizabeth Liddy of Syracuse University, uses machine learning
techniques to support the alignment of NSDL resources to state and national educational standards [10]. Initially (2Q2006), users will be able to use the tool to suggest
appropriate educational standards for any resource they are viewing. Later versions of
the system will allow experts and other users to provide feedback, incorporated into
the NDR, on the appropriateness of the assignments. This tool, and the overall incorporation of educational standards relationships into the NDR, will allow NSDL users
to search and browse the NSDL by "standards", starting either from a standard or
from any relevant resource.
5 Design and Information Model
To provide the foundation for this rich array of user-visible services, we have implemented the NSDL Data Repository (NDR). The NDR implements all features of the
pre-existing MR such as metadata harvesting, storage, and dissemination. However,
it moves from the restrictive metadata-centric focus of the MR to a resource-centric
model, which allows representation of rich relationships and context among digital
library resources.
The NDR implements a data abstraction that we call an information network overlay (INO). Like other overlay networks [3] the INO instantiates a layer over another
network, in this case the web graph.
Specifically, an INO is a directed graph. Nodes are identified via URIs and are
packages of multiple streams of data. This data stream composition corresponds to
compound object formats such as METS [18] and DIDL [4], allowing the creation of
compound digital objects with multiple representations. The component data streams
may be contained data or they may be surrogates (via URLs) to web-accessible content. This allows nodes to aggregate local and distributed content, for example the
reuse of multiple primary resources into new learning objects. Web services may be
associated with information units and their components, allowing service-mediated
disseminations of the data aggregated in a digital object. This advances the reuse
paradigm beyond simple aggregation, allowing, for example, a set of resources written in English to be refactored into a Spanish learning object through mediation by a
translation service. Edges represent ontologically-typed relationships among the
digital objects. The relationship ontology is extensible in the manner of OWL-based
ontologies [9]. This allows the NDR to represent the variety of application-based
relations described earlier such as collection membership, aggregation via reuse into a
learning object, and correlation with one or more state standards. Nodes (digital objects) are polymorphic: they can have multiple types in the data model, where typing
means the set of operations that can be performed on the digital object. In the digital
library environment, this flexibility overcomes well-known dilemmas such as the
data/metadata distinction, which conflicts with the reality that an individual object can
be viewable through both of these type lenses.
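The INO data model described above can be sketched in a few lines. The following Python sketch is purely illustrative (the class and field names are our own, not part of the NDR implementation): nodes are URI-identified packages of datastreams, each of which may contain data or point to it by URL; edges are typed relationships; and a node's set of types is open-ended, reflecting the polymorphism just described.

```python
# Illustrative sketch of the INO data model; names are hypothetical,
# not the actual NDR implementation.
from dataclasses import dataclass, field

@dataclass
class Datastream:
    label: str
    content: bytes = b""   # contained data, or
    location: str = ""     # a URL surrogate for web-accessible content

@dataclass
class Node:
    uri: str
    types: set = field(default_factory=set)      # polymorphic typing
    streams: dict = field(default_factory=dict)  # label -> Datastream

class INO:
    """A directed graph of nodes with ontologically-typed edges."""
    def __init__(self):
        self.nodes = {}
        self.edges = []  # (source_uri, relationship, target_uri)

    def add_node(self, node):
        self.nodes[node.uri] = node

    def relate(self, src, rel, dst):
        self.edges.append((src, rel, dst))

    def related(self, src, rel):
        """Targets of all 'rel'-typed edges leaving 'src'."""
        return [d for s, r, d in self.edges if s == src and r == rel]
```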
The NDR is implemented within a Fedora repository. A complete description of
Fedora is out of scope for this paper; the reader is directed to the up-to-date explanation in [17]. Each node in the INO corresponds to a Fedora digital object. Fedora provides all the functionality necessary for the INO, including compound objects,
aggregation of local and distributed content, web service linkages, and expression of
semantic relationships. Fedora is implemented as a web service and includes fine-grained access control and a persistent storage layer.
Length constraints on this paper prohibit a full description of the information modeling in the NDR and the use of Fedora to accomplish this modeling. This modeling
includes the design of Fedora digital objects to provide the different NDR object types
– resources, agents, metadata, aggregations, and the like – and the relationships
among these types for common use cases such as resource and metadata branding and
resource annotation.
Representing Contextualized Information in the NSDL
Fig. 1. Modeling an aggregation
The example shown in Fig. 1 demonstrates how the NDR represents aggregation. Examples of aggregations include conventional collection/item membership, but also aggregations with other semantics, such as membership of individual
resources in a compound learning object or alignment of a set of resources with a state
educational standard. Each node corresponds to a Fedora digital object, with the key
at the left showing the type of the object. The labels on the arcs document the type of
the relationship. As shown, “memberOf” arcs relate resources to one or more aggregations. Aggregations can have arbitrary semantics, with the semantics documented
by the resource that is the object of the “representedBy” arc. For example, this resource may be a surrogate for a collection, or may represent a specific state standard.
Lastly, the person or organization responsible for the aggregation is represented by
the agent that is the source of the “aggregatorFor” arc.
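The pattern of Fig. 1 can be written down as a handful of relationship triples. The URIs below are invented for illustration; only the relationship names ("memberOf", "representedBy", "aggregatorFor") come from the model described above.

```python
# The Fig. 1 aggregation pattern as (subject, predicate, object) triples.
# All URIs are illustrative, not actual NDR identifiers.
triples = [
    ("ndr:resource1",    "memberOf",      "ndr:aggregation1"),
    ("ndr:resource2",    "memberOf",      "ndr:aggregation1"),
    ("ndr:aggregation1", "representedBy", "ndr:collectionSurrogate"),
    ("ndr:agent1",       "aggregatorFor", "ndr:aggregation1"),
]

def members_of(aggregation):
    """All resources that are members of the given aggregation."""
    return [s for s, p, o in triples
            if p == "memberOf" and o == aggregation]
```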
6 Results from Implementation of the NSDL Data Repository
Over the past year we have been designing, implementing, and loading data into the
NDR. The major implementation task was the creation and coding of an NDR-specific API for manipulating information objects in the NDR data model: specific “types” of digital objects such as resources, metadata, agents, and the like, and
the required relationships among them. Note that this API is distinct from the SOAP
and REST API in Fedora that provides access to low-level digital object operations.
The NDR API consists of a set of higher level operations such as addResource, addMetadata, and setAggregationMembership. Each of these higher level operations is a
composition of low-level Fedora primitive operations. For example, the logical NDR
operation addResource, which adds a new resource to the NDR, translates to a set of
low-level Fedora operations including creating the digital object that corresponds to
the resource, configuring its datastreams so that they match our model for the resource “type”, and establishing relationships from that resource to its collection digital object and to the metadata digital objects that describe it.
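The composition can be sketched as follows. Everything here is a hypothetical stand-in: the `Repo` class is a minimal in-memory substitute for the low-level repository primitives, and the method names do not correspond to the actual Fedora or NDR API.

```python
# Sketch: a high-level operation (addResource) composed from low-level
# repository primitives. All names are illustrative, not the real APIs.
import itertools

class Repo:
    """Minimal in-memory stand-in for low-level repository operations."""
    def __init__(self):
        self.objects = {}
        self.relationships = []
        self._ids = itertools.count(1)

    def create_object(self):
        uri = f"ndr:obj{next(self._ids)}"
        self.objects[uri] = {"datastreams": {}, "type": None}
        return uri

    def add_datastream(self, uri, label, location):
        self.objects[uri]["datastreams"][label] = location

    def set_type(self, uri, t):
        self.objects[uri]["type"] = t

    def add_relationship(self, src, rel, dst):
        self.relationships.append((src, rel, dst))

def add_resource(repo, url, collection_uri, metadata_uris):
    """One logical NDR operation built from several primitives."""
    uri = repo.create_object()
    repo.add_datastream(uri, "CONTENT", location=url)  # surrogate for content
    repo.set_type(uri, "Resource")                     # match the "type" model
    repo.add_relationship(uri, "memberOf", collection_uri)
    for md in metadata_uris:
        repo.add_relationship(md, "metadataFor", uri)
    return uri
```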
We have implemented in Java an API layer that mediates all interaction with the
NDR, by calling on the constituent set of low-level Fedora operations. In addition to
providing a relatively easy-to-use interface for services accessing the NDR, the API
performs the vital task of ensuring that constraints of the data model are enforced.
For example, the data model mandates that no metadata digital object should exist
that does not have one (and only one) “metadataFor” relationship to a resource digital object.
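A constraint check of this kind might look like the sketch below (function and predicate names are illustrative, not the NDR API).

```python
# Illustrative check of the data-model constraint described above: every
# metadata object must have exactly one "metadataFor" relationship.
def check_metadata_constraint(relationships, metadata_uris):
    """Return the metadata URIs violating the one-and-only-one rule."""
    violations = []
    for md in metadata_uris:
        targets = [o for s, p, o in relationships
                   if s == md and p == "metadataFor"]
        if len(targets) != 1:
            violations.append(md)
    return violations
```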
We have used this API to bootstrap the production NDR with data from the pre-existing MR, thereby duplicating existing functionality in the new infrastructure. At
the time of writing of this paper, this process is complete. The platform for our NDR
production environment is a Dell 6850 server with dual 3 GHz Xeon processors, 32 GB
of 400 MHz memory, and 517 GB of SCSI RAID disk with 80 MB/s sustained
performance. This server is running 64-bit Linux, for reasons outlined later. We
note that the 2006 cost for this production server is about 22K USD.
The NDR has over 2.1 million digital objects – 882,000 of them matching metadata from the MR, 1.2 million of them representing NSDL resources, and several
hundred representing other information objects (agents, services, and so on) in the NDR
data model. The representation of the relationships among these objects (those defined by the NDR data model and those internal to the Fedora digital object representation) produces over 165 million RDF triples in the triple-store. We have found that
ingest into the NDR takes about 0.7 seconds per object, making data load for this rich
information environment a non-trivial task.
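A back-of-envelope calculation makes the point: at the measured rate, a sequential load of the full object count takes on the order of weeks.

```python
# Load time implied by the figures above: 2.1 million objects at about
# 0.7 seconds each, ingested sequentially.
objects = 2_100_000
seconds_per_object = 0.7
days = objects * seconds_per_object / 86_400  # 86,400 seconds in a day
print(round(days, 1))  # → 17.0, i.e. roughly 17 days of continuous ingest
```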
This bootstrapping process has been a learning process in scaling up semantically rich information environments. In order to understand the results, it is necessary to
distinguish three components: core Fedora, the triple-store it uses to represent and
query inter-object relationships, and the Proai component that supports OAI-PMH.
Core Fedora is a web service application built on top of a collection of file-system
resident XML documents (one file for each digital object) and a relational database
that caches fragments and transformations of those documents for performance.
These XML documents are relatively small and stable, and at present we are using
about 21 GB of disk space to store these files across 39,000 directories. We have
not experienced any scaling problems, nor do we foresee any with this core architecture. In fact, as we expected from our knowledge of the Fedora implementation, basic
digital object access is not really dependent on the size of the Fedora repository. For
example, our tests on dissemination performance show that requests for metadata
formats that are stored literally in the NDR take about 69 ms. Requests for formats
that are crosswalked from stored formats using an XSLT transform service take about
480 ms.
The more challenging aspect of our data loading and implementation work has involved the triple-store. Relationships among Fedora digital objects, and therefore
among nodes in the NDR graph, are stored persistently as RDF/XML in a datastream
in the digital object and are indexed as RDF triples in a triple-store, which provides
query access to the relationship graph. In the case of the NDR, this provides query
functionality such as “return all resources related to a state standard, a specific collection, or in an OAI set”.
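To illustrate why such queries involve joins over the relationship graph, here is a plain-Python sketch over an in-memory triple list. The predicates and URIs are invented for illustration; the NDR itself issues such queries against the triple-store rather than in application code.

```python
# Sketch of a two-join selection of the kind behind an OAI-PMH
# ListRecords request: resources in a given set whose metadata is
# available in a given format. Predicates and URIs are illustrative.
triples = [
    ("ndr:res1", "memberOf",    "ndr:setA"),
    ("ndr:res2", "memberOf",    "ndr:setB"),
    ("ndr:md1",  "metadataFor", "ndr:res1"),
    ("ndr:md1",  "hasFormat",   "oai_dc"),
    ("ndr:md2",  "metadataFor", "ndr:res2"),
    ("ndr:md2",  "hasFormat",   "nsdl_dc"),
]

def subjects(pred, obj):
    """Subjects of all triples with the given predicate and object."""
    return {s for s, p, o in triples if p == pred and o == obj}

def list_records(oai_set, fmt):
    """Join 1: resources in the set. Join 2: metadata of the format."""
    in_set = subjects("memberOf", oai_set)
    described = {o for s, p, o in triples
                 if p == "metadataFor" and s in subjects("hasFormat", fmt)}
    return sorted(in_set & described)
```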
Triple-store technology is relatively immature. Scaling it up to accomplish our initial data load has been especially challenging. As part of our implementation of the
Fedora relationship architecture (known as the resource index), we experimented with
scaling and performance of a number of triple-store implementations. Our extensive
tests comparing Sesame, Jena, and Kowari are available online. One particular
target of our testing was the performance of complex queries that involve multiple
graph node joins – these are the types of queries we issue to perform OAI-PMH List
Records operations that select according to metadata format, set, and date range. We
found that Jena would not scale beyond a few tens of thousands of triples, with complex
query times approaching 20 minutes over 0.5 million triples. Sesame can be configured either in native storage mode or on top of MySQL. We found that
Sesame-MySQL, like Jena, was unable to return large result sets, producing an out-of-memory error because it accumulates the entire result set in memory. Our remaining
tests comparing Sesame native to Kowari showed that for a database of several million triples Kowari was faster by a factor of 2 for simple queries, and by a factor of
over 9000 for complex queries.
Although the Kowari implementation proved capable of high performance and scalability under controlled tests, we encountered a number of hurdles along the path of
our data load. The apparent reality is that neither Kowari nor any other triple-store
has been pushed to this scale. Such scale revealed unpleasant and previously undiscovered bugs, such as a memory leak that took months of effort to verify and find.
Furthermore, we have found that the hardware requirements to run a large-scale semantic web application are non-trivial. Kowari uses memory-mapped indexes, which
are both disk and memory-intensive. Presently the Kowari-based resource index
requires over 54 GB of virtual memory, which is significantly larger than the 4 GB
addressable by standard 32-bit processors and operating systems (thus the configuration of our production server described earlier).
In order to understand our results on semantic queries to the NDR resource index
(storing 165 million triples), it makes sense to divide these queries into two classes.
The first class of queries is relatively simple, such as those issued by a user application seeking all resources correlated with a state standard or another accessing all
members of a collection. We have found that query performance in this case is on the
order of 25ms for the simplest examples (no transitive joins over the graph) to about
250 ms for examples with 2-3 joins. The second class comprises the queries that populate the NDR OAI server, Proai, which is part of the Fedora service framework.
Proai is an advanced OAI server that supports any metadata format available through
the Fedora repository via direct datastream transcription or service-mediated dissemination. It operates over a MySQL database that is populated via resource-index queries to Fedora (in batch after an initial load and incrementally over the lifespan of the
Fedora repository). The resource-index queries to populate Proai are quite complex
with semantics such as “list all Fedora disseminations representing OAI-records of a
certain format, and get their associated properties and set membership information”.
Such a query takes about one hour when issued in batch over the fully loaded repository, and the combination of queries to pre-load the Proai database after the batch
NDR load takes about 1-2 days. We note, however, that this load is only performed
once on initial load of the NDR and that incremental updates, as information is added
to the NDR, are much quicker.
Proai performance is quite impressive. Throughput on an OAI-PMH ListRecords
request is about 900 records per second, and we have been able to harvest all Dublin
Core records from the NDR (to populate our search indexes) in about 3 hours.
Our results provide hardware guidelines for large Fedora implementations that use
the resource index. We have found that they greatly benefit from a machine with large
real memory, high-speed disks, and high-performance disk controllers. The dual Xeon
processors are an excellent match for Fedora, allowing uniform partitioning of core Fedora, NDR API, Proai, and MySQL processing among the
four hyper-threaded CPU cores available. CPU clock rate is a minor performance factor
compared with the overall memory and I/O performance of the chassis. As of this writing, machines with more than 32 GB of memory remain rare. Within 18 months we
anticipate that machines with 64 GB will become commonly available.
7 Conclusions
We have described in this paper our initial work in implementing an advanced infrastructure to support an information-rich NSDL. This infrastructure supports the integration and reuse of local and distributed content, the integration of that content with
web services, and the contextualization of that content within a semantic graph.
The work described in this paper has advanced the state-of-the-art in two areas.
First, it involves the innovative use of Fedora to represent an information network
overlay. This data structure combines local and distributed content management,
service-oriented architecture, and semantic web technologies. At a time when digital
libraries need to move beyond the search and access paradigm, the INO supports
contextualized and participatory information environments. Second, this work pushes
the envelope on scaling issues related to semantic web technologies. Although RDF
and the semantic web have existed for over 8 years, large-scale implementations still
need to be demonstrated. Our experience with scaling the NDR is instructive to a
number of other projects looking to build on top of semantic web technologies.
The results in this paper demonstrate only the basic functionality of the NDR. The
basic operations, however, are the building blocks for the applications described in
Section 4. In future papers, we will describe our experience with these applications
and the ability of the NDR to support them in a highly scaled manner.
Acknowledgments
We thank the entire NSDL CI team for their contributions to this work. The authors
acknowledge the contributions of the entire Fedora team, especially Sandy Payette.
The work described here is based upon work supported by the National Science
Foundation under Grants No. 0227648, 0227656, and 0227888. Support for Fedora is
provided by the Andrew W. Mellon Foundation.
References
1. Annotation Metadata Overview, http://www.dlese.org/Metadata/annotation/
2. Abbas, J., Norris, C. and Soloway, E., Middle School Children's Use of the ARTEMIS
Digital Library. in ACM/IEEE Joint Conference on Digital Libraries (JCDL '02), (Portland, OR, 2002), ACM Press, 98-105.
3. Andersen, D.G., Balakrishnan, H. and Kaashoek, M.F., Resilient Overlay Networks. in
18th ACM SOSP, (Banff, Canada, 2001).
4. Bekaert, J., Hochstenbach, P. and Van de Sompel, H. Using MPEG-21 DIDL to Represent
Complex Digital Objects in the Los Alamos National Laboratory Digital Library. D-Lib
Magazine, 9 (11).
5. Bush, V.F. As We May Think Atlantic Monthly, 1945.
6. Chad, K. and Miller, P., Do Libraries Matter? The rise of Library 2.0, http://www.talis.com/
7. Collis, B. and Strijker, A. Technology and Human Issues in Reusing Learning. Journal of
Interactive Media in Education, 4 (Special Issue on the Educational Semantic Web).
8. Constantopoulos, P., Doerr, M., Theodoridou, M., et al. On Information Organization in
Annotation Systems. in Grieser, G. and Tanaka, Y. eds. Intuitive Human Interface 2004,
LNAI3359, Springer-Verlag, Berlin, 2004, 189-200.
9. Dean, M., Connolly, D., van Harmelen, F., et al. OWL Web Ontology Language 1.0 Reference. W3C Working Draft, 29 July 2002.
10. Diekema, A. and Chen, J., Experimenting with the Automatic Assignment of Educational
Standards to Digital Library Content. in Joint Conference of Digital Libraries (JCDL),
(Denver, 2005).
11. Faaborg, A. and Lagoze, C. Semantic Browsing. in Lecture Notes in Computer Science,
Springer-Verlag, Trondheim, Norway, 2003, 70-81.
12. Huynh, D., Mazzocchi, S. and Karger, D., Piggy Bank: Experience the Semantic Web Inside Your Web Browser. in International Semantic Web Conference (ISWC), (2005).
13. Kahan, J., Koivunen, M.-R., Prud'Hommeaux, E., et al., Annotea: An Open RDF Infrastructure for Shared Web Annotations. in WWW10, (Hong Kong, 2001).
14. Lagoze, C., Arms, W., Gan, S., et al., Core Services in the Architecture of the National
Digital Library for Science Education (NSDL). in Joint Conference on Digital Libraries,
(Portland, Oregon, 2002), ACM/IEEE.
15. Lagoze, C., Krafft, D., Cornwell, T., et al., Metadata aggregation and "automated digital
libraries": A retrospective on the NSDL experience. in Joint Conference on Digital Libraries, (Chapel Hill, NC, 2006), ACM.
16. Lagoze, C., Krafft, D.B., Payette, S., et al. What Is a Digital Library Anyway? Beyond
Search and Access in the NSDL. D-Lib Magazine, 11 (11).
17. Lagoze, C., Payette, S., Shin, E., et al. Fedora: An Architecture for Complex Objects and
their Relationships. International Journal of Digital Libraries, December 2005.
18. Library of Congress, METS: An Overview & Tutorial, http://www.loc.gov/standards/
19. Marshall, B., Zhang, Y., Chen, H., et al., Convergence of Knowledge Management and E-Learning: the GetSmart Experience. in ACM/IEEE Joint Conference on Digital Libraries
(JCDL '03), (Houston, TX, 2003), ACM Press, 135-146.
20. Marshall, C.C., Annotation: from paper books to the digital library. in Digital Libraries
'97, (1997), ACM Press.
21. Martin, K. Learning in Context Issues of Teaching and Learning, 1998.
22. McCalla, G. The Ecological Approach to the Design of E-Learning Environments: Purpose-based Capture and Use of the Information about Learners. Journal of Interactive Media in Education, 7 (Special Issue on the Educational Semantic Web).
23. McMartin, F. and Terada, Y., Digital Library Services for Authors of Learning Materials.
in ACM/IEEE Joint Conference on Digital Libraries (JCDL '02), (Portland, OR, 2002),
ACM Press, 117-118.
24. Motta, E., Shum, S.B. and Domingue, J. ScholOnto: an ontology-based digital library
server for research documents and discourse. International Journal on Digital Libraries,
3 (3).
25. O'Reilly, T., What is Web 2.0: Design Patterns and Business Models for the Next
Generation of Software, http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/
26. Parrish, P. The Trouble with Learning Objects. Educational Technology Research and Development, 52 (1). 49-67.
27. Recker, M., Dorward, J., Dawson, D., et al. Teaching, Designing, and Sharing: A Context
for Learning Objects. Interdisciplinary Journal of Knowledge and Learning Objects, 1.
28. Recker, M., Dorward, J. and Nelson, L.M. Discovery and Use of Online Learning Resources: Case Study Findings. Educational Technology and Society, 7 (2). 93-104.
29. Recker, M. and Walker, A. Collaboratively filtering learning objects. in Wiley, D.A. ed.
Designing Instruction with Learning Objects, 2000.
30. Reeves, T.C., The Impact of Media and Technology in Schools: A Research Report prepared for The Bertelsmann Foundation, http://www.athensacademy.org/instruct/
31. Roscheisen, M., Mogensen, C. and Winograd, T. Shared Web Annotations as a Platform
for Third-Party Value-Added, Information Providers: Architecture, Protocols, and Usage
Examples. Technical Report, CS-TR-97-1582,
32. Surowiecki, J. The wisdom of crowds : why the many are smarter than the few and how
collective wisdom shapes business, economies, societies, and nations. Doubleday, New
York, 2004.
33. Unmil, P.K., Francisco-Revilla, L., Furuta, R.K., et al., Evolution of the Walden's Paths
Authoring Tools. in Webnet 2000, (San Antonio, TX, 2000).
34. Wilensky, R. Digital library resources as a basis for collaborative work. Journal of the
American Society for Information Science, 51 (3). 228-245.
35. Wolfe, J.L., Effects of Annotations on Student Readers and Writers. in Fifth ACM International Conference on Digital Libraries, (San Antonio, TX, 2000).
Towards a Digital Library for Language Learning
Shaoqun Wu and Ian H. Witten
Department of Computer Science
University of Waikato
Hamilton, New Zealand
{shaoqun, ihw}@cs.waikato.ac.nz
Abstract. Digital libraries have untapped potential for supporting language
teaching and learning. Although the Internet at large is widely used for language
education, it has critical disadvantages that can be overcome in a more controlled
environment. This article describes a language learning digital library, and a new
metadata set that characterizes linguistic features commonly taught in class as
well as textual attributes used for selection of suitable exercise material. On the
system is built a set of eight learning activities that together offer a classroom
and self-study environment with a rich variety of interactive exercises, which are
automatically generated from digital library content. The system has been evaluated by usability experts, language teachers, and students.
1 Introduction
The rise of computer-assisted language learning on the Internet has brought a new
dimension to language classes. The Web offers learners a wealth of language material and gives students opportunities to learn in different ways. They can study by
reading newspaper articles, listening to audio and viewing video clips; undertake
online learning exercises; or join courses. Media such as email, chat and blogs
enable them to communicate with other learners and with speakers of the target
language all over the world. When preparing lessons, teachers benefit from the
panoply of resources that the web provides. Automated tools can be used to build
practice exercises and design lessons. Teachers construct language learning tasks
based on the Internet because the language is real and the topics are contemporary,
which motivates learners.
Despite all these advantages, the Internet has many drawbacks for language study.
Although it offers innumerable language resources, learners and teachers alike face
the challenge of discovering usable material. Search engines return an overwhelming
amount of dross in response to any query, and locating suitable sources demands skill
and judgment. When learners study on their own, it is hard for them to locate material
that matches their language ability. Finally, students may accidentally encounter material with grossly unsuitable content.
Digital libraries, like traditional ones, can play a crucial role in education.
Marchionini [1] identifies many advantages in using them for teaching and learning.
As well as providing a safe and reliable educational environment, they have special
advantages for language classes. Digital libraries are a great source of material that
J. Gonzalo et al. (Eds.): ECDL 2006, LNCS 4172, pp. 341 – 352, 2006.
© Springer-Verlag Berlin Heidelberg 2006
teachers can turn into meaningful language exercises. They offer vast quantities of
authentic text. Learners experience language in realistic and genuine contexts, which
prepares them for what they will encounter in the real world. Searching and browsing
facilities can be tailored to the special needs of language learners. Teachers can integrate digital libraries into classes to help students locate appropriate material, giving
them the tools to study independently. Interpersonal communication media can be
incorporated to create a socially engaging learning environment.
This project has built a language learning digital library called LLDL based on the
Greenstone digital library software [2]. The goal is to explore the potential of digital
libraries in this field by addressing issues intrinsic to language learning. We developed a language learning metadata set (LLM) that characterizes linguistic features
commonly taught in class. By using it in searching and browsing, teachers and learners can locate appropriate material.
Eight learning activities are implemented that utilize LLDL’s search and retrieval
facilities. Together they offer a classroom and self-study environment with a rich
variety of interactive exercises. Four features distinguish them from existing systems:
– They are student-centered.
– They provide a communicative learning environment.
– They provide a multilingual interface.
– Exercises are automatically generated from digital library content.
While the present implementation of LLDL is for learning English, it is designed to
provide a multilingual interface. English and Chinese versions exist; new languages
can easily be added. We close the paper with some remarks on extending the interface
and the language taught to other European languages.
2 DLs in Language Learning
Digital libraries can serve many roles in language education. First, they provide linguistic resources. In the classroom, text, pictures, models, audio, and video are used as material for teaching. Edge [3] summarizes three kinds of language resource: published,
authentic, and teacher-produced; digital libraries allow teachers to build collections
of each kind. Culturally situated learning helps students interpret the target language and
master skills in communication and behavior within the target culture [4]. Teachers can
build collections that introduce the target culture's people, history, environment, art, literature, and music.
The material can be presented in diverse media—text, images, audio, video, and maps.
Students can experience the culture without leaving the classroom.
Second, digital libraries can bring teachers and learners together. Forums, discussion boards, electronic journals and chat programs can be incorporated to create a
community where teachers share their thoughts, tips and lesson plans; learners meet
their peers and exchange ideas; and teachers organize collaborative task-based, content-based projects. This community is especially meaningful for language learning
because it embeds learners in an authentic social environment, and also integrates the
various skills of learning and use [5]. As Vygotsky [6] points out, true learning involves socialization, where students internalize language by collaborating on common
activities and sharing the means of communicating information.
Third, digital libraries can provide students with activities, references and tools.
Language activities include courses, practice exercises, and instructional programs. In
traditional libraries students find reference works: dictionaries, thesauri, grammar
tutorials, books of synonyms, antonyms and collocations, and so on. Equivalent resources in digital libraries can be used as the basis of stimulating educational games.
3 Language Learning Metadata
Metadata is a key component of any digital library. It is used to organize resources
and locate material by searching and browsing. Metadata schemas developed specifically for education and training over the past few years have recently been formally
standardized [7]. The two most prominent are LOM (Learning Object Metadata) and
SCORM (Sharable Content Object Reference Model). LOM aims to specify the syntax and semantics of the attributes required to describe a learning object. It groups
features into categories: general, life-cycle, meta-metadata, educational, technical,
rights, annotation, and classification. SCORM aims to create flexible training options
by ensuring that content is reusable, interoperable, durable, and accessible regardless
of the delivery and management systems used. While LOM defines metadata for a
learning object, SCORM references a set of technical specifications and guidelines
designed to meet the needs of its developers, the US Department of Defense.
Neither of these standards proved particularly useful for our purpose. The aim of
metadata is to help users find things. Although digital libraries make it easy to locate
documents based on title, author, or content, they do not make it easy to find material
for language lessons—such as texts written for a certain level of reading ability, or
sentences that use the present perfect tense. To identify these, users would have to sift
through countless examples, most of which do not satisfy the search criteria.
The LLM metadata set is designed to help teachers and students locate material
for particular learning activities. It has two levels: documents and sentences. All
values are intended to be capable of being extracted automatically from full text.
Some LLM metadata are extracted with the help of tools from the OpenNLP package, which provides the underlying framework for linguistic analysis of the documents by tagging all words with their part of speech and identifying units such as
prepositional phrases.
3.1 Document Metadata
Readability metadata can help both teachers and students locate material at an appropriate level. We have adopted two widely used measures recommended by practicing
teachers: Flesch Reading Ease and the Flesch-Kincaid Grade Level [8]. The former is
normally used to assess adult materials, and calculates an index between 0 and 100
from the average number of words per sentence and the average number of syllables
per word. The latter is widely used for upper elementary and secondary material and
scores text on a US grade-school scale ranging from 1 to 12.
LLM incorporates both these scores as separate pieces of metadata, and in addition
computes LOM Difficulty metadata by quantizing the Grade Level into five bands.
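Both measures are simple functions of word, sentence, and syllable counts. The sketch below gives the standard formulas; the five-band quantization boundaries are our own illustration, since the paper does not specify them, and syllable counts are assumed to be supplied (automatic syllable counting is itself heuristic).

```python
# Standard Flesch formulas; the LOM Difficulty bands are illustrative.
def flesch_reading_ease(words, sentences, syllables):
    """Index roughly between 0 and 100; higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """US grade-school level, roughly 1 to 12."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def lom_difficulty(grade, bands=("very easy", "easy", "medium",
                                 "difficult", "very difficult")):
    """Quantize a 1-12 grade level into five bands (boundaries assumed)."""
    index = min(int(max(grade - 1, 0) * 5 / 12), 4)
    return bands[index]
```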
3.2 Sentence Metadata
Readability metadata is associated both with the document as a whole and with individual sentences. Three further types of metadata are associated with sentences: sentence metadata, syntactic metadata, and usage metadata.
LLDL splits every document into individual sentences using a simple heuristic involving terminating punctuation, the case of initial words, common abbreviations, and
HTML tags. Whereas sentences used as examples in the classroom or language teaching books have been carefully targeted, prefabricated, and honed into clean and polished examples, sentences extracted automatically from authentic text are often untidy
and incomplete; some have inordinately complex structures.
LLM addresses this by defining the following metadata for each sentence:
– Processed version
– Tagged version
– State: clean or dirty
– Type: simple or complex.
The first two are variants of the original extracted sentence, which usually contains
HTML mark-up. The Processed version contains plain text: mark-up has been
stripped. The Tagged version has been annotated with linguistic tags that reflect the
syntactic category of each word. Part-of-speech metadata is used by the language
learning digital library to generate exercises, as described in Section 5.
Some extracted sentences are messy. State metadata is used to indicate whether a
sentence is clean, comprising alphabetic characters and punctuation only, or dirty,
including other extraneous characters. The Type of a sentence is simple if it has just
one clause and complex otherwise, where a clause is a group of words containing a
main verb. Teachers normally use simple sentences to explain grammar rules.
The extraction process first detects sentence boundaries and strips HTML, yielding Processed sentence metadata. If sentences contain any characters other than
alphabetic ones, space, and punctuation, their State metadata is Dirty. Clean sentences are analyzed by the OpenNLP tagger and chunker to yield Tagged sentence
metadata. These contain syntactic tags that reflect the categories of individual
words and reveal the sentence structure, facilitating the extraction of language
metadata. Simple and complex sentences are differentiated by the number of verb
phrases (VP) they contain.
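A minimal sketch of the State and Type classification follows. The exact punctuation set and the chunk-tag format are assumptions: OpenNLP's chunker output differs in detail, and here a tagged sentence is represented with bracketed chunks such as "[VP fell/VBD ]".

```python
# Sketch of State (clean/dirty) and Type (simple/complex) classification.
# The allowed character set and the "[VP" chunk marker are assumptions.
import re

def state(sentence):
    """Clean if only alphabetic characters, spaces, and punctuation."""
    ok = re.fullmatch(r"""[A-Za-z\s.,;:!?'"()-]*""", sentence)
    return "clean" if ok else "dirty"

def sentence_type(tagged):
    """Simple if the chunked sentence contains exactly one verb phrase."""
    return "simple" if tagged.count("[VP") == 1 else "complex"
```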
3.3 Syntactic Metadata
English grammar is relatively simple because it has fixed rules. On the other hand, the
number of rules is large and there are many exceptions. Based on recommendations
from language teachers, we identified nine syntactic metadata elements that can be
extracted automatically by natural language processing tools. While these do not
cover all aspects of English grammar, they form the basis of a useful digital library.
The syntactic metadata elements are Adjective, Preposition, Possessive pronoun and determiner, Modal, Tense, Voice, Coordinating conjunction, Subordinate conjunction and that-clause, and Wh-clause. For each one a regular expression is defined:
for example, \w+/JJ is the expression for Adjective metadata: it indicates a string that contains one or more word characters (\w+) followed by /JJ, the syntactic tag for
adjective. Tense and Voice metadata are also extracted using tagged sentences. They
comprise both the tense or voice and the verbs or verb groups that are so marked.
The extraction process for the remaining syntactic metadata is similar. Understanding the grammatical implications of the tags is the key to successful extraction.
Preposition metadata is extracted by searching for prepositional phrases, tagged PP.
Subordinate conjunction and that-clause metadata are extracted by seeking subordinating clauses tagged as SBAR. Wh-clauses are not indicated by a clause-level tag,
and must be identified by combining phrase tags and wh-word tags.
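The tag-matching step can be illustrated with a short sketch. The `word/TAG` format and the Penn Treebank tag JJ are as described above; the function name is ours.

```python
import re

def extract_adjectives(tagged):
    """Return the words carrying the /JJ (adjective) tag in a tagged sentence."""
    return re.findall(r"(\w+)/JJ\b", tagged)
```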
3.4 Usage Metadata
LLM contains a single usage metadata element: Collocation. This is a group of two or
more words that are commonly found together or in close proximity. For example,
native speakers usually prefer the collocation heavy rain to the non-collocation big
rain, or totally convinced to absolutely convinced. Lewis [9] points out that native
speakers carry hundreds of thousands, possibly millions, of collocations in their heads
ready to draw upon in order to produce fluent, accurate and meaningful language, and
this presents great challenges to language learners.
We define collocations in terms of nine two- and three-word syntactic patterns such as
adjective+noun, adverb+adjective, and phrasal verbs in the form verb+preposition—
for example, make up and take off. They are identified by looking for particular tags
and matching them with the nine syntactic collocation patterns. Following common
practice [10] we use the t-statistic to rank potential collocations. This uses the number
of occurrences of words individually and in combination, and the total number of
tokens in the corpus. Its accuracy depends on the size of the corpus: good collocations
that occur just once do not receive high scores.
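The t-statistic itself can be sketched as follows, using the standard formulation in which the bigram count expected under independence is subtracted from the observed count. The counts in the test below are illustrative, not drawn from our corpus.

```python
import math

def t_score(c_bigram, c_w1, c_w2, n_tokens):
    """t = (observed - expected) / sqrt(observed), where the expected
    bigram count under independence is C(w1) * C(w2) / N."""
    expected = c_w1 * c_w2 / n_tokens
    return (c_bigram - expected) / math.sqrt(c_bigram)
```

A bigram seen only once can never score much above 1, which is why rare but genuine collocations fall below typical significance thresholds.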
4 Searching the Digital Library
LLM metadata captures linguistic aspects of the documents in a digital library. It
allows users to search and browse language learning materials. This section demonstrates the use of the extracted metadata in LLDL. In this project, we have built five
demonstration collections for use in the activities described in the next section:
Documents from the UN FAO Better farming series
Children’s short stories from East of the web
News articles from the BBC World Service
Sample articles from Password, a magazine for new English speakers
Collection of plant and animal images downloaded from the Internet.
The first collection includes practical articles intentionally written in a simple style,
but not targeted at children. The second contains material specifically for children.
The third and fourth are made from material that is intended to be particularly suitable
for second language learners. These four collections exhibit a wide variety of styles
and difficulty levels.
S. Wu and I.H. Witten
LLDL uses standard Greenstone facilities [2] to present options for browsing and
searching on entry to the library. When users browse, they can select Titles, Difficulty,
and other metadata elements. Clicking Titles presents an alphabetical list of titles of
the documents in the collection, broken down into alphabetic ranges; the full text of
the documents is available by clicking beside the appropriate title. Difficulty also
applies to documents, and allows the reader to browse titles in each of the five difficulty levels mentioned above.
The other browsing options refer to individual sentences: they are Tense, Preposition, Clause, Difficulty (which differs from the document-level Difficulty above because it refers to individual sentences), and Type. Sentences are the essential units in
language communication. Students study vocabulary and learn grammar in order to construct sentences. Conversely, studying good sentences helps students master word usage and
grammar rules in context. LLDL allows readers to browse for particular grammatical
constructions or identify particular parts of speech. For example, selecting Preposition shows the sentences of the collection, with the prepositions that each one contains listed in parentheses after it. The sentences are presented in alphabetic groups
according to preposition: those under the A–B section of the hierarchy contain about,
at, above, as, between, before, by, beside, … These sample sentences help students
learn the usage of particular prepositions and study what words commonly appear
before and after them—for example, above all, ask about.
Searching is more highly targeted than browsing. Users can perform an ordinary
full-text search to locate documents that treat particular topics; the search results show
the title and difficulty level of matching documents. Advanced search allows users to
specify metadata as well as content. For example, one might search for particular full-text content but confine the search to documents that are easy (in terms of difficulty
level). Alternatively, one can search for individual sentences rather than documents: those whose type is simple
(i.e., no compound sentences), or whose state is clean (i.e., no non-alphabetic characters). Users can combine these criteria in a search form to find simple and clean sentences from easy documents whose text contains specified words or phrases.
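A minimal sketch of how these criteria combine, assuming a hypothetical in-memory record format; the real system expresses the same query through Greenstone's advanced search form.

```python
def find_sentences(sentences, phrase, type_="Simple", state="Clean", difficulty="easy"):
    """Combine full-text and metadata criteria, as in the advanced search."""
    return [s for s in sentences
            if s["type"] == type_ and s["state"] == state
            and s["difficulty"] == difficulty and phrase in s["text"]]
```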
Users can also search for sentences that contain particular words. New learners
are often confused about word usage—for example, distinguishing the different implications of look, see and watch. One way to help is to provide many authentic samples that show these words in context. LLDL can retrieve sentences that include a
specified word or phrase, and are restricted by the above-mentioned sentence-level
metadata. Students can also search for sentences that exhibit any of the grammatical
constructs that are identified by metadata, for example passive voice sentences, modal
sentences or sentences in a particular tense.
5 Language Learning Activities
LLDL facilitates the creation of language learning activities. To demonstrate this we
have developed eight activities: Image Guessing, Collocation Matching, Quiz, Scrambled Sentences, Collocation Identifying, Predicting Words, Fill-in-blanks, and Scrambled Documents; unfortunately space permits a description of the first four activities
only. They share the common components login, chat, scoring and feedback.
5.1 Common Components
Learners are not required to register, but must log in by providing a user name and selecting a difficulty level (easy, medium or hard). This parameter is used to select sentences or documents for each activity, to determine which image collections are used
to generate exercises, and to group students for activities in which they work in pairs.
For these activities the system maintains a queue of users waiting at each level. When
a student logs in, the queue is checked and they are either paired up with a waiting
student at the same level, or queued to await a new opponent.
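The pairing logic can be sketched as a queue per difficulty level; the function and variable names here are ours.

```python
from collections import deque

# One waiting queue per self-selected difficulty level.
queues = {"easy": deque(), "medium": deque(), "hard": deque()}

def login(user, level):
    """Return a (partner, user) pair, or None if the user must wait."""
    q = queues[level]
    if q:
        return (q.popleft(), user)  # pair with the student waiting longest
    q.append(user)                  # no partner yet: join the queue
    return None
```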
LLDL makes a chat facility available in all activities, in order to create an environment in which students can practice communication skills by discussing with
peers, seeking help, and negotiating tasks. The chat panel resides either in the activity
interface or a window that is launched by clicking a Chat button.
Each activity contains a scoring system intended to maintain a high level of motivation by encouraging students to compete with each other informally. Students can
view the accumulated scores of all participants, sorted so that the high scorers appear
at the top. Additional statistical information is provided such as the number of identified collocations in the Collocation activity or the number of predicted words in the
Predicting Words activity. The implementation of the scoring mechanism varies from
one activity to another, depending on whether students do the exercise individually, or
collaborate in pairs, or compete in pairs.
Students are provided with feedback on whether the response is correct or incorrect, and in the latter case they are invited to try again, perhaps with a hint that leads
to the correct response. In general, feedback is given item by item, at logical content breaks, at the end of the unit or session, or when requested by the student. Students also see their accumulated scores. Some activities provide an exercise-based
summary that includes the questions, the correct answers, and the student's own answers.
Hints provide direct help without giving away the answer. They can be offered
through text, pictures, audio or video clips, or by directing students to reference articles or relevant tutorials. Some exercises give hints that use text from the digital
library. For example, the Quiz activity allows students to ask for other sentences containing the same words; Collocation Matching provides more surrounding text so that
students can study the question in context.
5.2 The Image Guessing Exercise
In Image Guessing, the system pairs students according to their self-selected difficulty
level. One plays the role of describer, while the other is the guesser. An image is chosen randomly from a digital library collection of images and shown to the describer
alone; the guesser must identify that exact image. The describer describes the picture
in words that the system automatically uses as a query, and also decides
how many of the search results the guesser will see. The guesser does not see the
description; the describer does not see the search results. The guesser and describer
can communicate using the chat facility. The activity moves to the next image when
the guesser successfully identifies the image, chooses the wrong one, or the timer
expires. The students use the search and chat facility to identify as many images as
possible in a given time. They can pass on a particular image, or switch roles.
The difficulty level is determined by the image collection, which teachers build for
their student population. They select simple images—e.g. animal images or cartoons—for lower level students, and more complex ones—e.g. landscapes—for advanced ones. For searching, image collections use metadata provided by the teacher,
which they tailor to the students’ linguistic ability. The more specifically the metadata
describes the images, the easier the game.
5.3 The Collocation Matching Exercise
Collocations are the key to language fluency and competence. Lewis [9] believes that
fluency is based on the acquisition of a large store of fixed or semi-fixed prefabricated
items. Hill [11] points out that students with good ideas often lose marks because they
don't know the four or five most important collocations of a key word that is central to
what they are writing about. Today, teachers spend more time helping students develop a large stock of collocations; less on grammar rules.
LLDL is particularly useful for learning collocations because it contains a large
amount of genuine text and provides useful search facilities. In the Collocation
matching activity, students compete in pairs to match parts of a collocation pattern.
This is a traditional gap filling exercise in which one part of a collocation is removed
and the students fill the gap with an appropriate word. For example, for
verb+preposition collocations, verbs or prepositions are deleted. Students select the
collocation type they want to practice on, and decide which component will be excised. The exercises use complete sentences retrieved from the library as question text.
Students are paired up and one is chosen to control the activity by selecting collocation types. The other one can see what is going on and negotiate using chat. Then
complete sentences are presented one by one, with the target collocation colored blue
and missing words replaced by a line. The students select the most appropriate word
from four choices before the count-down timer expires. When the exercise is complete the pair view their performance in a summary window that shows the question
text with collocations highlighted, and the students’ answers and scores.
Exercises are generated from collocation metadata. Sentences at the appropriate
difficulty level and collocation type are retrieved. The words that appear in the collocations are grouped according to their syntactic tags and used as choices for the exercise. For each sentence, four choices, including the correct one, are picked randomly.
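The final selection step might look like the following sketch, where distractors are drawn from words sharing the target's syntactic tag; the names are hypothetical.

```python
import random

def make_choices(correct, same_tag_words, k=4, rng=None):
    """Pick k choices for a gap-fill question, always including the correct word."""
    rng = rng or random.Random()
    pool = [w for w in same_tag_words if w != correct]
    choices = rng.sample(pool, k - 1) + [correct]
    rng.shuffle(choices)  # hide the correct answer's position
    return choices
```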
5.4 The Quiz Exercise
Quizzes comprising a question and a few choices from which the correct answer must
be selected are widely used language drills for learning grammar and vocabulary.
Traditionally, teachers construct quizzes and use them for practice exercises, tests or
exams. Our system offers a unique feature that makes quizzes far more motivational:
students can create their own.
The teacher begins by defining a list of topics and perhaps creating a few initial
quizzes. Students can select a topic and construct a new quiz by entering up to four
words or phrases; limiting the maximum number of questions; choosing whether or
not to stem the terms; and specifying sentence types—simple, complex or both.
Once the learner has defined a new quiz or selected an existing one, the system
presents the questions. Each has two to five possible answers. When the student selects one, the system indicates its correctness and moves to the next question. Students can get help by initiating a digital library search for sentences that contain the
correct word or words, without being told which one it is. When the quiz is finished a
summary is shown of all questions, along with the correct answer and the student’s
incorrect ones. Students then re-take the questions on which they performed poorly.
This activity uses a simple quiz-generation mechanism that constructs questions
and answers using words or phrases specified by students. For example, a question
might be What did you think ___ the film? with possible answers of, at, about, and
over. The question text comprises a single sentence retrieved from the digital library
using words or phrases specified by the student. These are excised from the question
text and used as the correct answer. Sentence retrieval employs full text search on the
sentence text and metadata. For example, to construct questions on prepositions,
teachers retrieve sentences by searching on Preposition metadata. To avoid students
having to understand the metadata structure, they are only asked to provide the words
or phrases of interest when creating new quizzes.
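The excision step can be sketched as follows; the example reproduces the preposition question above, and the function name is ours.

```python
import re

def make_question(sentence, target):
    """Blank out the target word and return (question_text, answer)."""
    question = re.sub(rf"\b{re.escape(target)}\b", "___", sentence, count=1)
    return question, target
```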
Stemming is a key parameter for quiz generation that significantly affects the number of available questions and choices. Without stemming, the question text for a
make and do quiz would be restricted to sentences that contain make or do, and students would have only two answer choices. With stemming, different forms such as
making, makes, doing and does are also provided as alternatives.
Students can use stemming to explore the variants of a word. When teaching a new
word, teachers often encourage students to check its variants in a dictionary. This
activity enables students to find variants and practice them by creating an appropriate
quiz. For example, students use a quiz to learn more about the variants of effect,
namely effects, effective, and effectively.
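As a toy illustration of how stemming groups such variants, a naive suffix stripper suffices; a real system would use a proper stemmer such as Porter's.

```python
# Suffixes checked longest first so "effectively" is not stemmed as "effectivel".
SUFFIXES = ("ively", "ive", "ing", "es", "s", "ly")

def naive_stem(word):
    """Strip one common suffix, leaving at least a three-letter stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word
```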
5.5 The Scrambled Sentence Exercise
The words of sentences are permuted and students sort them into their original order, to help study sentence structure. Students can select suitable material to practice on.
LLDL retrieves sentences from the digital library, according to selected criteria
specified by the student:
Word or phrases that must appear
Corpus that the sentences come from
Difficulty level
Sentence type (simple, complex, or both)
Number of sentences retrieved
Whether to sort in ascending or descending length order.
Once the sentences have been retrieved, they are permuted and presented one after
another. The search terms are put in their correct position, highlighted in blue. Students can view the title of the document containing the sentence, and the sentences
preceding and following it, by clicking the help icon.
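The scrambling itself can be sketched as follows: movable words are shuffled while the search terms stay in place (the names are ours).

```python
import random

def scramble(sentence, keep, rng=None):
    """Shuffle the words of a sentence, pinning the search terms in `keep`."""
    rng = rng or random.Random()
    words = sentence.split()
    movable = [i for i, w in enumerate(words) if w not in keep]
    shuffled = [words[i] for i in movable]
    rng.shuffle(shuffled)
    out = words[:]
    for i, w in zip(movable, shuffled):
        out[i] = w  # pinned positions are never overwritten
    return out
```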
In this activity, students can see what other students are doing, in order to encourage them to help each other and learn from their peers’ mistakes. Their names
are shown (the list is updated as students log in and out); clicking a name lets users watch how that student unscrambles a sentence, move by move. Students can use chat to discuss the exercise or help each other. Teachers
can also log in and observe what the students are doing, and identify and analyze
their errors.
6 Evaluation
LLDL demonstrates the roles that digital libraries can play in language study. It has
been extensively evaluated, although we have not attempted to assess effectiveness—
whether it results in efficient learning—because this paper addresses digital library
issues rather than educational ones. We have also drawn a line between evaluating the
system itself and evaluating the language material that teachers have put into it.
We conducted four kinds of evaluation: metadata extraction, usability, and activity
evaluation with both teachers and learners. We recruited three different kinds of
evaluator: usability experts, teachers, and students. The teachers also contributed to
the system throughout its development, and helped recruit language students as
evaluators. The evaluation is anecdotal rather than quantitative.
6.1 Evaluating Metadata Extraction
Extracted metadata provides the underlying framework for LLDL by facilitating
automatic exercise generation for the various language activities. However, they are
not always accurate. Sample documents were used to assess the accuracy of sentence boundary detection and of the identification of language constructions and collocations. We found several tags that had been incorrectly assigned by OpenNLP, causing errors in
both the Tagged sentence metadata and the values associated with the syntactic metadata types. Four factors affect the accuracy of collocation metadata. First, errors in
tagging produce incorrect matches against the underlying syntactic pattern. Second,
the numbers used to calculate the t-values are not exact. Third, the choice of the rejection threshold is arbitrary. Fourth, groups of words that co-occur more often than chance would predict are not necessarily good collocations.
6.2 Evaluating Usability
Evaluators examined the interface and judged its compliance with recognized usability principles. They focused on:
Explicitness: users understand how to use the system
Compatibility: operations meet expectations formed from previous experience
Consistency: similar tasks are performed in similar ways
Learnability: users can learn about the system’s capability
Feedback: actions are acknowledged and responses are meaningful.
Three rounds of usability evaluation were conducted, by usability experts, students,
and language teachers. This feedback was used to improve the interface before embarking on the next stage of evaluation.
6.3 Evaluating Activities by Language Teachers
We showed the system to teachers at an early stage, and they proposed several activities that were incorporated into the system we have described. We also made other
modifications based on their feedback, giving more search options for the scrambled
sentence exercises, excising only nouns and verbs in the Predicting words activity,
and showing students extracted collocations for the Collocation identification activity.
Later we performed a further evaluation, focusing on:
Do the activities meet the teachers’ original expectations?
What do they think of the feedback provided to students?
Which ability levels are the activities suitable for?
What do they think of the exercise material that is used?
On the whole, the teachers thought the activities exceeded their original expectations.
They especially liked the use of authentic reading material. They also liked the
feedback provided to students, particularly the summaries provided at the end of
exercises. They made many constructive and detailed comments on the individual
exercises, which were used for further improvements such as providing help and hints.
6.4 Evaluating Activities by Language Learners
Ten language learners, from 18 to 67 years old and native speakers of Arabic, Chinese,
Italian and Japanese, participated in an experiment aimed at assessing student satisfaction with the activities. They were grouped into beginner (2), intermediate (4) and
advanced (4), and paired up with like partners. In each session they tried out three
activities. They filled out a questionnaire and an